Exploration of Event Extraction Techniques in Late Medieval and Early Modern Administrative Records

Ismail Prada Ziegler
Digital Humanities, University of Bern, Switzerland
Department of History, University of Basel, Switzerland

Abstract
While a growing number of studies exploring named entity recognition in historical corpora have been published, the application of other information extraction tasks such as event extraction remains scarce. This study explores two accessible methods to facilitate the detection of events and the classification of entities into roles: rule-based systems and RNN-based machine learning techniques. We focus on a German-language corpus from the 15th to 17th centuries and on property purchases as the event type. We show that these relatively simple methods can retrieve useful information, and we discuss ideas to further enhance the results.

Keywords
information extraction, historical data, digital history, machine learning

1. Introduction
Among historical documents from the late medieval and early modern periods, administrative records are one of the most prevalent types of source material. These documents exhibit a high density of information and often display some degree of standardisation within collections. These traits make them ideal candidates for digital methods of information extraction and analysis. However, applying digital information extraction techniques to historical documents presents numerous challenges. Annotated historical datasets are limited both in size and in number, and variations in grammar and spelling due to the lack of standardisation pose significant obstacles. Despite these difficulties, notable advancements have been made in the field due to growing interest in digital history and digital humanities. An overview of recent studies concerning named entity recognition can be found in [3]. This paper contributes to this evolving field by presenting a case study on extracting event information from historical land registers.
In our project Economies of Space we work to digitize these registers and explore the potential of extracting information such as entities, relations and events.¹ Our goal is to create a knowledge base where the individual histories of persons, properties, and organizations can be explored, as well as to enable distant reading methods of analysis. In [6] we demonstrated that robust named entity recognition is possible for our data. In this study, we explore the potential of event extraction as a first step to investigate interactions between the found entities. We compare two methods: rule-based extraction and RNN-based machine learning. While this case study focuses on a narrow example, we hope that the findings of these experiments will benefit other teams working with similar datasets.

CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark. ismail.prada@unibe.ch (I. Prada Ziegler), ORCID 0000-0003-4229-8688. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
¹ https://dg.philhist.unibas.ch/en/bereiche/mittelalter/forschung/oekonomien-des-raums/

2. Dataset

2.1. The Historical Land Registers
The experiments were conducted with the Historical Land Registers of Basel.² This archival collection aimed to bring together excerpts from all archival documents which mention a property inside the old city of Basel. The content is a mix of legal and bookkeeping information relating to property ownership, rents, and transactions. Our project focuses on 80,000 excerpts from between 1400 and 1700, written in Early New High German.³ Almost all excerpts are kept to a single sentence, even when describing complex events. For the remainder of this paper, the term "sample" will refer to an individual document within this collection.
The documents were automatically transcribed with an average CER of 3.6%.

2.2. Entity Annotation
640 samples were annotated following the BeNASch guidelines.⁴ BeNASch applies a nested entity representation, which means that for each entity mention, a mention span (e.g., 'the house at the river', 'Hans Stuber, the tailor') is annotated as well as a head element (e.g., 'house', 'Hans Stuber'). All entity mentions that fall into one of the categories PER (persons), ORG (organizations), LOC (locations), or GPE (geo-political entities), including pronouns, are annotated.

2.3. Event Annotation
The 640 samples also feature event annotation. We define an event as a "specific occurrence involving participants", following the ACE guidelines.⁵ Only events that belong to categories which were determined in our project to be of interest to historical research are annotated. An event is characterized by two main elements: the trigger and the roles. The trigger represents a word or phrase around which the event is centered. Roles match entity annotations and describe the entity's part in the event. See Appendix A for an annotation example.

² https://dls.staatsarchiv.bs.ch/records/1016781
³ Although, as is always the case with copied documents, we must suspect that at least in some cases modifications and to some degree modernization of the text took place. Answering the question "to what degree?" is part of our research project.
⁴ https://dhbern.github.io/BeNASch/
⁵ https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf. Similar guidelines have since been adopted in BeNASch.

3. Methodology

3.1. Data and Evaluation

Table 1: Number of occurrences of each role and of trigger phrases.

  Category   Count
  seller     200
  buyer      220
  property   184
  price      129
  trigger    173

For the purposes of this study, we focus on the event type property purchase. It appears in 167 out of the 640 samples, making it comparatively frequent.
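The annotation model described above, entity mentions with a head element, and events consisting of a trigger and typed roles pointing to mentions, can be sketched as a small data structure. This is an illustrative sketch only; the class and field names are ours, not identifiers from the actual BeNASch schema.

```python
from dataclasses import dataclass, field

@dataclass
class Mention:
    text: str      # full mention span, e.g. "Hans Stuber, the tailor"
    head: str      # head element, e.g. "Hans Stuber"
    category: str  # one of PER, ORG, LOC, GPE (plus value types such as MONEY)
    start: int     # token offset where the span begins
    end: int       # token offset where the span ends

@dataclass
class Event:
    event_type: str  # e.g. "property_purchase"
    trigger: tuple   # token span of the trigger word or phrase
    roles: dict = field(default_factory=dict)  # role label -> list of Mentions

# A purchase event links role labels to entity mentions:
seller = Mention("Hans Stuber, the tailor", "Hans Stuber", "PER", 0, 4)
event = Event("property_purchase", trigger=(5, 5), roles={"seller": [seller]})
assert event.roles["seller"][0].head == "Hans Stuber"
```

The separation of span and head mirrors the nested representation: systems can attach a role either to the whole mention or only to its head.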
We define the following roles for the event property purchase: seller (PER or ORG), buyer (PER or ORG), property (LOC), and price (MONEY). Every role may appear multiple times, and only the property role must appear at least once. The total occurrences of each role are shown in Table 1. We implement 5-fold stratified cross-validation because our dataset is still extremely small, especially for machine-learning purposes. We split each fold 60/20/20% for training, validation and testing respectively. The results represent the average across the five folds. The dataset still contains all other event-annotated samples, but triggers and roles in those have been removed (we do this to evaluate whether our systems can distinguish property purchase events from other events as well).

3.2. Rule-based System

3.2.1. Trigger Detection
For each fold, we create a gazetteer of potential trigger phrases by counting the trigger-annotated phrases in that fold's training set. We exclude phrases which appear fewer than k times from the gazetteer. We then compare this gazetteer to the input samples and apply fuzzy matching, using the thefuzz Python library⁶, to mark one or multiple tokens as triggers. We allow our algorithm to detect multiple trigger phrases in a single sample. For each fold, we determine the minimum ratio for the fuzzy matching as well as the minimum frequency k by running different parameter combinations against the validation set and choosing the best result. The best parameters were either setting k to 3 and the minimum fuzz ratio to 0.8, or setting both parameters to 1. We avoid some frequent problems with additional rules:

1. To prevent misidentification of the word "Kauf" in documents titled "Kauf-Urkunde" (purchase deed), we forbid the first token in a document to match a trigger.
2. Triggers may only match words outside of entity mentions; this prevents, for example, "verkauft" in "das verkauft Haus" (the sold house) from being identified as a trigger.
3.
If one trigger follows another trigger without any entity mention in between, we remove the second one (e.g., "es verkauft und gibt zu kaufen"). While these kinds of errors have no negative impact on document classification or role detection, they make the trigger detection scores look worse than they actually are.
4. In some cases, two predicted triggers are separated by a person or organization mention. We reclassify the second trigger as a helper in that case. Helper triggers provide useful information for role detection; their use is explained in the next section.
5. Finally, we remove triggers where a MONEY or TIME annotation is found between the trigger and a LOC. These cases indicate rent purchase documents, which are very similar in language and structure to property purchase documents.

⁶ https://github.com/seatgeek/thefuzz

3.2.2. Role Detection
To detect roles, we apply a simple template system whenever a trigger is present. For property purchase documents, we identify three different kinds of structures (ignoring non-entity mentions and non-values), distinguished by the relative order of the trigger and the entity mentions: (1) seller mentions precede the trigger, followed by buyer, property and price; (2) the trigger comes first, followed by seller, buyer, property and price; (3) the property mention precedes the trigger, followed by the buyer. Templates 1 and 2 are usually found when the sale is the central event of the excerpt, while template 3 usually follows a seizure event, giving information about who bought the property after it was seized and auctioned off. Sometimes roles are missing from the text, so we only require a trigger and at least one LOC mention to apply a template. The template is selected as follows: a LOC before the trigger implies template 3; otherwise, if a PER/ORG is present before the trigger, template 1 is used; otherwise, template 2. We can match mentions to roles thanks to the restrictions on their categories, as long as their position relative to the other roles and the trigger is correct: PER and ORG can only be SELLER or BUYER, LOC can only be PROPERTY, and MONEY can only be PRICE.
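The gazetteer-plus-fuzzy-matching trigger detection from Section 3.2.1 can be sketched as follows. This is a minimal sketch under assumptions: the paper uses the thefuzz library, for which `difflib.SequenceMatcher` (scaled 0–1) serves here as a standard-library stand-in; the function names and the token-level matching are our simplification.

```python
from collections import Counter
from difflib import SequenceMatcher

def build_gazetteer(training_triggers, k=3):
    """Keep only trigger phrases seen at least k times in the training fold."""
    counts = Counter(t.lower() for t in training_triggers)
    return {phrase for phrase, n in counts.items() if n >= k}

def fuzz_ratio(a, b):
    # Stand-in for thefuzz's fuzz.ratio, on a 0-1 scale.
    return SequenceMatcher(None, a, b).ratio()

def detect_triggers(tokens, gazetteer, min_ratio=0.8):
    """Return indices of tokens that fuzzily match a gazetteer phrase."""
    hits = []
    for i, tok in enumerate(tokens):
        if i == 0:
            continue  # rule 1: the first token ("Kauf-Urkunde") may never match
        if any(fuzz_ratio(tok.lower(), g) >= min_ratio for g in gazetteer):
            hits.append(i)
    return hits

gaz = build_gazetteer(["verkauft", "verkauft", "verkauft", "gibt ze kaufen"])
print(detect_triggers(["Kauf-Urkunde", "Es", "verkoufft", "Hans"], gaz))
# → [2]: the spelling variant "verkoufft" still matches "verkauft"
```

The fuzzy threshold is what absorbs the non-standardised Early New High German spelling; with exact string matching, variants such as "verkoufft" would be missed entirely.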
One challenge is the distinction between SELLER and BUYER in template 2. To show what can already be done by simple means in this case study, we solve this by assigning the first half of all PER/ORG mentions to SELLER and the second half to BUYER (in case of an odd number of candidates, SELLER gets the additional one). If a helper trigger is present, we use it to distinguish buyer and seller.

3.3. Machine-Learning System

3.3.1. Architecture
Our approach to event extraction by machine learning is inspired by previous successes in extracting entities from pre-modern German texts [4][6]. Like entities, we can model roles and triggers as annotation spans in the text and apply a sequence tagging strategy (this is one common way to model event extraction [5]). We implement our experiment using the FlairNLP framework [1]. For the language model, we stack a forward and a backward model of contextual character embeddings [2], which we obtained by finetuning the de-model on all handwritten documents in the Historical Land Registers, including those dated later than 1700 (approx. 9.14M tokens). Character-based embeddings have demonstrated robustness against the inherent variability of pre-modern German spelling and vocabulary [4]. For the event extraction, we train a sequence tagging model with the default settings of Flair (single-layered Bi-LSTM + CRF decoder).

Figure 1: Overview of both strategies to recognize events. Dashed frames imply variants. Using the event recognition on plain texts is also possible, but prevents the rule-based approach and the shortening step.

3.3.2. Pretagging
To insert the information from the named entity annotation into the model, we add a prefix and suffix token to each entity mention: "Hans sold his house ." becomes "[B-PER] Hans [E-PER] sold [B-LOC] his house [E-LOC] ." For experiments focused only on role detection, we incorporate trigger information in the same manner ("[B-SALE] sold [E-SALE]").
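The pretagging step above can be sketched as a small preprocessing function that wraps each entity span in marker tokens before the text is fed to the sequence tagger. A minimal sketch assuming non-overlapping spans; the function and variable names are illustrative, not taken from our codebase.

```python
def pretag(tokens, entities):
    """Wrap entity spans in [B-X]/[E-X] marker tokens.

    entities: list of (start, end, label) token spans, end exclusive.
    Assumes non-overlapping spans; nested mentions would need extra care.
    """
    opens = {s: lab for s, e, lab in entities}
    closes = {e: lab for s, e, lab in entities}
    out = []
    for i, tok in enumerate(tokens):
        if i in opens:
            out.append(f"[B-{opens[i]}]")   # prefix marker before the span
        out.append(tok)
        if i + 1 in closes:
            out.append(f"[E-{closes[i + 1]}]")  # suffix marker after the span
    return out

print(pretag(["Hans", "sold", "his", "house", "."],
             [(0, 1, "PER"), (2, 4, "LOC")]))
# → ['[B-PER]', 'Hans', '[E-PER]', 'sold', '[B-LOC]', 'his', 'house', '[E-LOC]', '.']
```

Because the markers are ordinary tokens, the downstream tagger needs no architectural change; it simply learns to classify the prefix tokens, as described in Section 3.3.2.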
We conduct experiments with manually annotated tags as well as automatically predicted ones. The model producing the predicted annotations is likewise trained as a Flair SequenceTagger, using the same language model as the event recognition. A separate model is trained for each fold so that no data contamination occurs. When pretagging is applied, the role detection is not required to match the whole span of the pretagged entity; instead, it is trained to classify the prefix token (e.g., "[B-PER]") correctly. The training data is adjusted accordingly (see Appendix B for an example).

3.3.3. Variants
Shortening: To shorten our samples and possibly remove noise, we remove all tokens inside entity annotations which are not part of the head. This reduces our sample length by about a third of all tokens on average. The NER models for this variant of pretagging were trained as described in [6].
Document Filtering: Because in initial tests the system often annotated roles in documents even when it (correctly) identified no trigger, we added a rule to disregard all role annotations in documents where no trigger is present.
Note that these variants only apply to the machine learning strategy. Shortening is irrelevant to the rule-based strategy because trigger detection only happens outside of entity annotations and role detection is based only on the entity annotations' positions and classes, not their content. Document filtering doesn't apply because only documents that contain triggers are processed further.

4. Results & Discussion

4.1. Experimental setup
We experiment with four base settings for the machine learning method:

• Pretagged: training and test sets contain named entities retrieved from our ground-truth dataset.
• PredNEsTest: training set with pretags from the ground truth, but a test set with automatically predicted NE annotations. This represents the practical scenario for our project, but is highly dependent on the quality of the NER model.
• PredNEs: automatically predicted entities in both training and test set. We test this setup to see if training on noisy entity mentions improves the model's robustness when it encounters similar noise at test time.
• Plain: no pretagging.

Additionally, we test variants adding the shortening augmentation (+Shortening) and the document filter (+DocFilter). For the rule-based system, we report two setups, one using the ground-truth entity mentions and one using automatically predicted entity mentions (analogous to PredNEsTest).

4.2. Rule-based vs. Machine Learning
Table 2 shows that the machine learning systems significantly outperform our rule-based systems, regardless of whether tags are generated from ground-truth information or automatically predicted. Interestingly, the trained model without any pretagging (Plain) still performs similarly in role detection to a rule-based system working with pretagging information. Table 3 shows the results per category for the respective best models. We observe that the machine learning system is heavily slanted towards high precision. The document filter rule plays a part in this, reducing recall by approx. one percentage point, but even without the filter a significant slant towards precision remains. Depending on the use of the annotations, this may be problematic: especially when the annotations are used as a tool to find interesting data points which are then manually investigated, false positives would likely be less problematic than false negatives.
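To make the precision/recall trade-off discussed above concrete: micro-averaged scores pool true positives, false positives and false negatives over all labels before computing the ratios. A minimal sketch; the counts below are invented for illustration, not taken from our results.

```python
def micro_prf(tp, fp, fn):
    """Micro precision, recall and F1 from pooled counts over all labels."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A precision-slanted system: few false positives, more false negatives.
p, r, f = micro_prf(tp=90, fp=5, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))
# → 0.947 0.818 0.878
```

Note that for micro scores, F1 reduces to 2·tp / (2·tp + fp + fn), so the same F1 can hide very different precision/recall balances, which is why Table 3 reports the components separately.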
In a more thorough review of the errors found in the machine learning predictions (specifically Pretagged+Shortening+DocFilter) and the rule-based predictions, we observe three main points that the machine learning system is able to handle better.

Table 2: Micro F1-score with standard deviation for each task: trigger detection, role detection with given trigger, and role detection with automatically predicted trigger.

  System                            trigger          roles w/ gt trigger  roles w/ pred trigger
  Rule-based                        0.8591 ± 0.0457  0.8674 ± 0.0356      0.8374 ± 0.0442
  Pretagged                         0.8855 ± 0.0140  0.9112 ± 0.0178      0.8720 ± 0.0115
  Pretagged+DocFilter               0.8855 ± 0.0140  0.9151 ± 0.0205      0.8951 ± 0.0072
  Pretagged+Shortening              0.9028 ± 0.0261  0.9208 ± 0.0141      0.9048 ± 0.0096
  Pretagged+Shortening+DocFilter    0.9028 ± 0.0261  0.9220 ± 0.0140      0.9127 ± 0.0123
  Rule-based with PredNEs           0.8586 ± 0.0478  n/a                  0.7044 ± 0.0446
  PredNEsTest+Shortening+DocFilter  0.8799 ± 0.0256  n/a                  0.8118 ± 0.0454
  PredNEsTest+DocFilter             0.8717 ± 0.0219  n/a                  0.7708 ± 0.0443
  PredNEs+DocFilter                 0.8414 ± 0.0650  n/a                  0.7518 ± 0.0474
  Plain                             0.8224 ± 0.0709  n/a                  0.7107 ± 0.0646

Table 3: Performance metrics per label type for the rule-based system and the machine learning system (with shortening and pretagging), both using pretagging derived from ground-truth data.

             Rule-Based                       Machine Learning
             Recall   Precision  F-Score     Recall   Precision  F-Score
  Trigger    80.42%   92.37%     85.91%      89.61%   91.13%     90.28%
  Seller     93.96%   76.94%     84.37%      89.69%   95.23%     92.22%
  Buyer      81.55%   89.62%     85.37%      88.10%   93.30%     90.54%
  Property   83.89%   82.46%     82.96%      84.83%   96.91%     90.45%
  Price      79.51%   83.14%     81.18%      87.40%   97.41%     92.00%

First, in our dataset, people or organizations represented by someone else are not annotated as taking part in the event (their connection to the event is handled in the form of a relationship between them and the person representing them). E.g., "Es verkauft Hans Vöglin innamen seines bruders kinder" (Hans Vöglin sells in the name of his brother's children...)
only classifies "Hans Vöglin" as seller, but not "kinder". Our rule-based system does not contain a rule to ignore these mentions when looking for the roles seller and buyer. Writing rules for these cases isn't trivial either, as the phrasing and spelling of the words indicating these occurrences vary. The machine learning system was able to correctly ignore these mentions in the examples we investigated manually.
Second, as already anticipated in the methodology section, the rule-based system struggles with the misidentification of buyer as seller and, conversely, of seller as buyer. We observe that the machine learning system reduces the number of errors of this kind by two thirds.
Finally, we observe a remarkable difference when it comes to slightly altered phrasing in the documents. While the machine learning system still fails when confronted with completely foreign structures (such as a property purchase being discussed as a past event in the middle of a rent purchase), it handles small alterations quite well.

Figure 2: F1-score plotted in relation to the size of the training data for the machine learning system. 134-135 samples in total for training+validation (depending on fold).

4.3. Learning Curve Analysis
Figure 2 illustrates the performance of the machine learning system (+Shortening+DocFilter) compared to the rule-based system. We observe that using around 40% of the training material (approx. 54 samples) results in role annotations comparable to the rule-based system, while using 50% achieves significantly better results. As usual for machine learning systems, the gain in performance diminishes with increasing sample size.

4.4. Impact of Variants
The shortening augmentation improves the scores in all settings where it was applied. The strongest difference was observed when evaluating roles with predicted triggers (p-value = 0.0131). During error analysis, we found that the removed tokens can also result in a loss of relevant information.
Specifically, clauses where a husband is named in conjunction with his wife, e.g. "Es verkaufen Hans, seine Frau Anna ..." (Hans and his wife Anna sell...), which would get shortened to "Es verkaufen Hans, Anna ...", followed by the names of the sellers, would sometimes result in the misidentification of the wife as a buyer, while the un-shortened model would classify these instances correctly. We thus consider the shortening strategy a success, with further research required on more fine-grained variants, e.g. only shortening mentions when certain conditions are met.
The document filter rule worked well and improved the results across the board. When shortening is not applied at the same time, we observe a significant improvement (p-value = 0.0316); otherwise, we still observe a positive trend (p-value = 0.0518).

4.5. Practical usability
Our final evaluation of these systems in a practical use-case scenario is mixed. On the one hand, they produce annotations that may well be used to create larger quantities of data in which general trends can be observed. One possible analysis could combine the role annotations with the nested entity annotation to observe economic interactions between occupational groups over time. On the other hand, the systems show a bias, larger or smaller depending on the method applied, towards only finding events whose documents fit one of the three main templates. Any conclusions drawn from the predicted event information therefore need to treat this bias with caution.

5. Conclusion
In this case study, we've shown that even with relatively simple means, we can achieve automated annotations which are usable in historical research. The scope of this case was intentionally kept small to simplify evaluation and interpretation, but future research in our project will explore how these systems perform across a broader range of event types.
Most event types that occur with sufficient frequency for machine learning are of similar structural homogeneity to the documents in this study. We therefore assume that the findings for property purchases will also be applicable to other event types. We also aim to explore how transfer learning can benefit event recognition for less frequent event types. We've shown that, with our kind of data, a machine-learning system can outperform a rule-based system by a significant margin even when only little training data is available. Writing rules may still be quicker than annotating documents, but considering that both systems rely on pretagged texts, the amount of necessary work can probably be reduced significantly if events and entities are annotated at the same time. When working with any data annotated by these methods, knowledge of the bias inherent to them is crucial. For example, the samples which do not fit the main templates might come from a very specific source; the automated system would then miss most documents from that source, distorting whatever conclusions we try to draw from the quantitative results. This study is only a first foray into event extraction in historical texts and looked at just two quick-to-implement, easily accessible methods. Future research should investigate the application of LLMs to this task, as LLMs have been shown to perform well in low-resource scenarios [7], but their applicability to historical German must be evaluated first.

References
[1] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf. "FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Minneapolis, Minnesota, 2019, pp. 54–59. doi: 10.18653/v1/N19-4010.
[2] A. Akbik, D. Blythe, and R. Vollgraf.
"Contextual String Embeddings for Sequence Labeling". In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA, 2018, pp. 1638–1649.
[3] M. Ehrmann, A. Hamdi, E. L. Pontes, M. Romanello, and A. Doucet. "Named Entity Recognition and Classification in Historical Documents: A Survey". In: ACM Comput. Surv. 56.2 (2023). doi: 10.1145/3604931.
[4] T. Hodel, I. Prada Ziegler, and C. Schneider. Pre-Modern Data: Applying Language Modeling and Named Entity Recognition on Criminal Records in the City of Bern. Presented at the Digital Humanities 2023: Collaboration as Opportunity (DH2023), Graz, Austria. 2023. doi: 10.5281/zenodo.8107616.
[5] Q. Li, J. Li, J. Sheng, S. Cui, J. Wu, Y. Hei, H. Peng, S. Guo, L. Wang, A. Beheshti, and P. S. Yu. "A Survey on Deep Learning Event Extraction: Approaches and Applications". In: IEEE Transactions on Neural Networks and Learning Systems 35.5 (2024), pp. 6301–6321. doi: 10.1109/tnnls.2022.3213168.
[6] I. Prada Ziegler. What's in an entity? Exploring Nested Named Entity Recognition in the Historical Land Register of Basel (1400-1700). Presented at the Digital Humanities Benelux 2024, Leuven, Belgium. 2024. doi: 10.5281/zenodo.11500543.
[7] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, and G. Wang. GPT-NER: Named Entity Recognition via Large Language Models. arXiv preprint 2304.10428. 2023. arXiv: 2304.10428 [cs.CL].

A. Event Annotation Example
Gend ze kaufen Heinrich Trech von Lauezhut der Kremer u Margareth Lang Walcherin sin ewirtin , Blesin Winsperg dem schnider u . Margarethen siner ewirtin , daz Hus u . Hofstatt genant zer Thannen , so gelegen als man von dem Vischmergt heruf zem Sunfegen gat [...] , ist erb von dem gotshus Lienh denen jährl darab gand 3 lb 21 lot pfeffer ze wysung um 150 fl .

Approx.
English translation: Give to buy Heinrich Trech of Lauezhut the trader and Margareth Lang Walcherin his wife, Blesin Winsperg the tailor and Margareth his wife, the property called zer Thannen, which lies as you go up from the fish market to zum Sunfegen [...], is owned by the church St. Lienhart, to which 3 lb 21 lot of pepper are paid yearly, for 150 fl.

B. Ground Truth Example With Pretagged Text (BIO Format)

  Token      Role
  Gibt       O
  ze         O
  kaufen     B-Trigger
  [B-PER]    B-Seller
  Heinrich   O
  Trech      O
  [E-PER]    O
  [B-LOC]    B-Property
  daz        O
  Hus        O
  zer        O
  Tannen     O
  [E-LOC]    O
  um         O
  [B-MONEY]  B-Price
  150        O
  fl.        O
  [E-MONEY]  O