REVERINO: REgesta generation VERsus latIN summarizatiOn

Giovanni Puccetti1,*, Laura Righi2, Ilaria Sabbatini3 and Andrea Esuli1

1 Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo", via G. Moruzzi 1, 56124, Pisa PI, Italy
2 Università degli Studi di Modena e Reggio Emilia - Dipartimento di Educazione e Scienze Umane, viale Timavo, 93 - 42121, Reggio Emilia RE, Italy
3 Università degli Studi di Palermo - Dipartimento Culture e Società, via delle Scienze, 15 - 90128, Palermo PA, Italy

Abstract
In this work we introduce the REVERINO dataset, a collection of 4533 pairs of Latin regesta with their respective full-text medieval pontifical documents, extracted from two collections: Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268) and Les Registres de Gregoire IX (1227/41). We describe the pipeline used to extract the text from the images of the printed pages and we perform a high-level analysis of the corpus. After developing REVERINO, we use it as a benchmark to test the ability of Large Language Models (LLMs) to generate the regestum of a given Latin text. We test 3 LLMs among the best performing ones (GPT-4o, Llama 3.1 70b and Llama 3.1 405b) and find that GPT-4o is the best at generating text in Latin. Interestingly, we also find that for Llama models it can be beneficial to first generate a text in English and then translate it into Latin to write better regesta.

Keywords
Regesta, Latin Text Summarization, Large Language Models, Digital Humanities

IRCDL 2025: 21st Conference on Information and Research science Connecting to Digital and Library science, February 20-21 2025, Udine, Italy
* Corresponding author.
† Contributions: Puccetti and Esuli set up eScriptorium, wrote the code for automatic regesta generation and wrote Section 3.2. Righi conducted the qualitative analysis of the results and wrote Section 1 and Section 3.3.
Sabbatini took care of the full data pipeline and wrote Section 2. All the authors wrote the abstract and conclusions, and revised the final version of the paper.
giovanni.puccetti@isti.cnr.it (G. Puccetti); laura.righi@unimore.it (L. Righi); sabbatini@fscire.it (I. Sabbatini); andrea.esuli@isti.cnr.it (A. Esuli)
https://gpucce.github.io (G. Puccetti); https://esuli.it/ (A. Esuli)
ORCID: 0000-0003-1866-5951 (G. Puccetti); 0000-0002-5770-0978 (L. Righi); 0009-0001-2936-6562 (I. Sabbatini); 0000-0002-5725-4322 (A. Esuli)

1. Introduction

ITSERR [1] (Italian Strengthening of the ESFRI RI RESILIENCE) is an interdisciplinary and distributed Research Infrastructure for Religious Studies. In the context of this project, we introduce REVERINO, a novel dataset of regesta paired with the medieval Latin texts they summarize and with their apparatus. The dataset is designed to recreate the methodology of regesta generation and, specifically, to support the creation of an Artificial Intelligence-based tool for summarizing medieval documents, with a particular focus on pontifical documents.

The decision to employ the system of regesta for organizing, indexing, and summarizing medieval texts through generative AI stems from the integration of various scholarly needs, which we explore and test in depth. To create a new automated organizational process tailored to historical documents, we have chosen to focus on automatic summarization, drawing on an established and scientifically validated methodology, namely the creation of regesta, a practice refined by humanists and scholars since the 19th century.

Scholars studying medieval charters often need to explore specific topics, historical figures, or places within vast corpora of sources, sometimes employing a comparative or longue durée approach. These corpora remain widely dispersed and are preserved across various libraries and archives that are geographically distant and differently organized, making them difficult to access. This is particularly true for the extensive documentation produced by royal and papal chanceries. Starting from this observation, we decided to work on the creation of a dataset specifically designed for the development of a tool for the summarization of documents produced by medieval pontiffs (c. 1200 to 1350).

1.1. Regesta

A regestum, the Latin word for list, enumeration, specification (https://glosbe.com/la/en/regesta), is a summary of a document made for the use of stakeholders and scholars, making the document content readily available without the need to consult it in its entirety. Each regestum contains some essential information, namely: 1) the name of the author (i.e. the Pope); 2) the name of the recipient; 3) an abstract of the content (with the object and the operative verb); 4) the date (calculated from the year of pontificate); and 5) the place of production of the document. A regestum always has a reference full-text document, of which it is the "summary", and both come together with an apparatus, a formal text indicating the collection and the manuscript where the regestum is found.

While the three components (regestum, full text and apparatus) are conceptually close, collecting them together can be challenging. Indeed, there are three main issues when trying to retrieve both a regestum and its full text from a collection, or when creating a new one: a) the regestum and the corresponding document are often not collected in the same volume (as in the Potthast collection [2]); b) these modern printed collections are not easily accessible and readable; c) the publication of new editions or the update of existing regesta collections is extremely time consuming. For these reasons, many regesta collections have been created in the past, especially of medieval documents produced by royal and papal chanceries, but few of these have been updated or created from scratch in recent years.

1.2. Text summarization

In the Natural Language Processing (NLP) literature, automatic text summarization is the task of rewriting the content of a text passage into a shorter form while retaining the relevant information, without involving a human writer in the process. Regesta fit well in this framework, since they are summaries of longer texts meant to expose specific information in an easy-to-consult form. Therefore, REVERINO is well suited to be both an easy-to-inspect dataset of Latin regesta and a training and testing benchmark for Latin text summarization.
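In data terms, each REVERINO entry couples the five fields listed in Section 1.1 with the regestum text, the full text it summarizes, and the apparatus. A minimal sketch of such a record in Python (the field names are our own illustration, not an official schema shipped with the dataset):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RegestumRecord:
    author: str            # 1) the Pope issuing the document
    recipient: str         # 2) the addressee
    summary: str           # 3) abstract of the content (object + operative verb)
    date: str              # 4) date, calculated from the year of pontificate
    place: Optional[str]   # 5) place of production, when reported
    regestum: str          # the full regestum text
    full_text: str         # the medieval Latin document it summarizes
    apparatus: str         # collection/manuscript reference
    collection: str        # source collection, e.g. "MGH" or "Auvray"
```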
2. The REVERINO Dataset

There is an extensive number of printed collections of regesta in Latin, edited by several scholars during the 19th and 20th centuries; however, only a few of them are digitally available, generally as sets of high-quality digital images, and only very rarely in a full-text, machine-readable version (often produced with outdated or low-quality OCR). One of the most relevant fields in which regesta have been produced is the corpus of the pontifical acts and letters issued by the popes and the papal chancery during the Middle Ages. This corpus guarantees the presence of large collections of regesta and extended texts that were already edited and published in printed versions during the 19th and 20th centuries, such as Jaffé and Potthast's Regesta Pontificum Romanorum, the collection published by the Bibliothèque des Écoles françaises d'Athènes et de Rome (BEF), and the editorial series of the Monumenta Germaniae Historica (MGH).

2.1. Data Selection

Regesta collections are only available as images of full pages; this poses a first obstacle to the creation of a large-scale corpus of these documents. Each manuscript has a different pagination, layout, font, image quality, format, etc., and thus requires a custom pipeline for the extraction of the text into a machine-readable format. Nevertheless, manuscripts from one collection undergo a similar digitization procedure and can therefore be processed together through a single pipeline to extract the content of all the documents. Given the need for a custom approach for each corpus, we choose to limit our collection to two printed sources:

1. MGH: Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268) [3]
2. Auvray: Les Registres de Gregoire IX (1227/41) [4]

While conceptually similar, these two collections are formally different: MGH is set in a single-column format while Auvray uses two columns, and the first has the apparatus visually separated from the original document, while in the second it is part of the regestum. The two collections thus show several smaller visual differences linked to the layout and the font used. Finally, from a qualitative perspective, they collect and summarize the documents related to two different medieval popes, Gregory IX and Honorius III. In particular, Auvray collects only the documents related to Pope Gregory IX, while MGH collects the documents issued by both Gregory IX and Honorius III. These collections were chosen as the starting corpus because they allow for the collection of different types of regesta. Indeed, although the creation of a regestum is based on specific rules shared in the research domain, different scholars inevitably produce different regesta from the same document. It is therefore important to consider regesta produced by different scholars in different periods.

2.2. Data Curation

The pipeline leading from a collection of images of printed pages to the REVERINO corpus is composed of 4 steps: Annotation, Training, Extraction and Post-processing.

Annotation. We manually annotate a selected set of pages from each collection of regesta to use as a training dataset; this is done on a local instance of the eScriptorium platform [5], an example of whose interface can be seen in Figure 1. Our pipeline involves both the segmentation of the written parts of each image and OCR. Annotating data for the latter is too time demanding and existing models are effective enough; therefore, we limit ourselves to annotating pages to train a segmentation model and rely on available OCR models. The models in eScriptorium ingest annotations with two kinds of information: 1. areas isolating the parts of a page that contain text, and 2. lines identifying the text of a line and its position on the page. Thus, each page is annotated in two steps: first the relevant areas are circled and then each line is colored.

Figure 1: Example of the eScriptorium interface.
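The two kinds of information listed above correspond roughly to the regions and baselines of the segmentation formats consumed by eScriptorium and its underlying kraken engine. A minimal sketch of what one annotated page encodes (region labels, coordinates and the transcription are invented here for illustration; eScriptorium's real exports are ALTO/PAGE XML with full polygon boundaries):

```python
# One annotated page, as ingested by the segmentation trainer:
# step 1 marks the text-bearing areas, step 2 marks each line.
page_annotation = {
    "regions": [  # areas isolating the parts of the page that contain text
        {"type": "text_block",
         "boundary": [(120, 80), (980, 80), (980, 640), (120, 640)]},
    ],
    "lines": [    # the text of each line and its position on the page
        {"baseline": [(130, 110), (970, 112)],
         "text": "Gregorius episcopus servus servorum Dei ..."},
        {"baseline": [(130, 150), (970, 153)],
         "text": "..."},
    ],
}
```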
Training. To adapt models to the outline of a manuscript, we start from a working segmentation model provided with eScriptorium, CATMuS Print Large [6]. This model works sufficiently well on MGH and we can use it as is. In contrast, the Auvray collection has a two-column format, and we need to train the model on the dataset collected in the Annotation step. We go back and forth between Training and Annotation to fix the limitations of each trained model, reaching a total of 91 annotated pages. This process led to high-quality results in segmenting the outline of the Auvray manuscript and extracting text.

Extraction. Once the model has been trained, we use it to process all the pages in each collection, obtaining the text lines the model identifies along with their position on the page, which gives us a clean, continuous stream of text spanning the full document.

Post-processing. The last step consists in applying a series of heuristics, based on the content of each extracted text line and its position on the page, to separate each regestum from the longer text it summarizes and from the apparatus.

2.3. Data Statistics

From a quantitative perspective, MGH is composed of a total of 2283 regesta and full text pairs, and Auvray of 3983. However, for Auvray, several of the extracted full texts are only short passages, often quotations or incipits that do not contain the information needed to generate a regestum, and we therefore drop them. After this cleaning, 2250 regesta are left in Auvray.

While MGH and Auvray are similar (they are both collections of regesta), they show several differences: they collect documents written by different popes, they are edited by different scholars, and, on top of this, they differ as printed publications due to their layout, as MGH is composed of single-column pages while Auvray is composed of two-column pages. Also, the quality of the second dataset is generally lower, due to minor OCR errors: a few characters and numbers are wrongly transcribed by our custom model. Therefore we keep two separate splits of the dataset. To provide a qualitative understanding of the difference, we use t-SNE [7] after encoding the regesta using LaBERTa, a Latin adaptation of BERT [8].

Figure 2: t-SNE plot showing samples from the two manuscripts, MGH and Auvray.

Figure 2 shows the t-SNE plot of the two datasets: while they are well separated, there is a non-negligible overlap between the two, hinting that in the future they can be used together to train a language model able to automatically write the regestum of a Latin document. In the future, to prevent this "bipartite" distribution of the samples in REVERINO, we will add regesta from different manuscripts to contribute a more broadly distributed dataset.
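The projection in Figure 2 can be reproduced along these lines; a minimal sketch assuming the publicly released LaBERTa checkpoint on the Hugging Face Hub and mean pooling over token embeddings (the paper does not specify the exact model id or pooling strategy):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.manifold import TSNE

tokenizer = AutoTokenizer.from_pretrained("bowphs/LaBerta")  # assumed model id
model = AutoModel.from_pretrained("bowphs/LaBerta")

def embed(texts):
    """Mean-pooled LaBERTa embeddings for a list of regesta."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state       # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)        # zero out padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# mgh_regesta and auvray_regesta: lists of regesta strings (loading not shown)
X = embed(mgh_regesta + auvray_regesta)
coords = TSNE(n_components=2, random_state=0).fit_transform(X)  # 2-D points as in Figure 2
```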
3. Text Summarization in Latin

Text summarization is a long-standing task in Natural Language Processing [9], which in the past was tackled through dedicated approaches often based on the retrieval of similar passages. Since LLMs have shown the ability to generate free-form text [10], they are currently the best performing systems for summarizing texts. An example of a widely used benchmark is the XSUM dataset [11], composed of BBC news articles paired with single-sentence summaries; the task consists in generating the summary given the full article.

To evaluate text summarization, the most used metrics are based on text overlap; the most widespread is Rouge [12]: given an integer n, Rouge measures the number of overlapping n-grams between the generated and the reference text. We focus on Rouge-1, Rouge-2 and Rouge-L. The first two measure, respectively, the number of overlapping words (1-grams) and overlapping word pairs (2-grams) between the reference and the generated text. The third, Rouge-L, measures the longest common subsequence of words between the reference and the generated text. An alternative metric, Bleu [13], is also based on quantifying n-gram overlap, but it is precision-oriented and combines n-grams of multiple orders with a brevity penalty.

3.1. Experimental Setup

To understand how well LLMs can summarize text in Latin, we measure the performance of three powerful LLMs: Llama 3.1 70b, Llama 3.1 405b and GPT-4o. The first two are openly available language models released by Meta [14], while the third is a closed-source model from OpenAI [15]. We test these models in two settings: in the first, format, the model is asked to generate the regestum directly from the full text it refers to; in the second, backtranslate, when presented with the full text, the model is asked to first write a "regestum" in English and then translate it into Latin.

Each setting, format and backtranslate, is identified by the prompt we provide the LLM to make it generate the regesta. Table 1 shows the prompt used for each setting, as well as a Shared Prompt, appended to the setting-specific one, where we request the model to include at least the key elements of a regestum mentioned in Section 1.1: the author, the recipient, the summary, the date and the place.

Table 1: Prompts used to make the LLMs generate a regestum given a full text.

Format:        Given the following text in Latin please write in Latin a «regesto» for it, containing:
Backtranslate: Given the following text in Latin please first translate it to English and then write in Latin a «regesto» for it, containing:
Shared Prompt: 1. The name of the author (i.e. the Pope); 2. The name of the recipient; 3. An abstract of the content (with the object and the operative verb); 4. The date (calculated from the year of pontificate); 5. The place. TEXT: ...

Finally, to facilitate the model during generation, we add to the prompt two example full texts with their respective regesta. We let the models generate up to 8048 tokens and we use greedy decoding, i.e. we always pick the most likely word and avoid any form of sampling during inference, since the regestum is meant to be a short and detailed summary; we will ablate different sampling techniques in future work.
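As an illustration of this setup, a minimal sketch of the format setting with greedy decoding, using the Hugging Face transformers API as a stand-in for the paper's inference stack (the checkpoint name, chat templating and variable names are our assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# "format" prompt from Table 1 plus the shared instructions; in the paper,
# two few-shot (full text, regestum) example pairs are also prepended.
prompt = (
    "Given the following text in Latin please write in Latin a «regesto» for it, "
    "containing: 1. The name of the author (i.e. the Pope); 2. The name of the "
    "recipient; 3. An abstract of the content (with the object and the operative "
    "verb); 4. The date (calculated from the year of pontificate); 5. The place.\n"
    "TEXT: " + full_text  # full_text: one Latin document from REVERINO (not shown)
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=8048, do_sample=False)  # greedy
regestum = tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)
```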
Table 2: Summarization scores for GPT-4o, Llama 3.1 70b and Llama 3.1 405b; the highest result for each metric is achieved by GPT-4o with the format prompt on MGH.

model                       dataset  experiment     n samples  rouge1  rouge2  rougeL  bleu
llama-3.1-70b-instruct-hf   mgh      format         2213       0.13    0.05    0.11    0.03
                            mgh      backtranslate  2213       0.23    0.09    0.20    0.06
                            auvray   format         2054       0.12    0.05    0.10    0.03
                            auvray   backtranslate  2054       0.17    0.07    0.14    0.04
llama-3.1-405b-instruct-hf  mgh      format         2213       0.21    0.09    0.19    0.06
                            mgh      backtranslate  2213       0.21    0.09    0.19    0.06
                            auvray   format         2054       0.13    0.06    0.11    0.03
                            auvray   backtranslate  2054       0.15    0.07    0.13    0.04
gpt-4o                      mgh      format         2213       0.39    0.18    0.34    0.16
                            mgh      backtranslate  2207       0.34    0.16    0.30    0.05
                            auvray   format         2052       0.28    0.14    0.24    0.12
                            auvray   backtranslate  2051       0.25    0.12    0.21    0.06

To evaluate model performance we use both a quantitative and a qualitative analysis. The quantitative analysis is based on Rouge and Bleu, measuring the similarity between the synthetic regesta generated by an LLM and the original ones from the REVERINO dataset summarizing the same text; the qualitative analysis is based on inspecting in detail a subset of the machine-generated regesta to understand which of the 5 key properties they lack.

3.2. Quantitative Results

Table 2 shows the Rouge and Bleu scores achieved by the three models we test: Llama 3.1 70b, Llama 3.1 405b and GPT-4o. The first finding is that no model can generate regesta proficiently: none of those we tested achieves a Rouge-1 above 0.40 or a Bleu above 0.16. We can also see that GPT-4o strongly outperforms both Llama models. The highest Rouge-1 is 0.39, achieved by GPT-4o on MGH; GPT-4o also has the highest Rouge-1 on Auvray, although at a significantly lower value, 0.28, which we attribute to the lower quality of the Auvray dataset. The wide gap between Rouge-1 and Rouge-2 shows how LLMs generally output texts that share the general context (higher 1-gram overlap) but find it harder to generate genuinely similar texts (lower 2-gram overlap).

The two versions of Llama, Llama 3.1 70b and Llama 3.1 405b, show a small performance gap, indicating that it is not useful to use the larger and more costly Llama 3.1 405b. Comparing format and backtranslate, the first is the best setting for GPT-4o, while the opposite is true for the Llama models, which show higher performance when asked to translate into English before writing in Latin. We have performed a limited prompt tuning that resulted in the choice of the format and backtranslate settings, and we will further explore this aspect in future work. Finally, we notice how in rare cases (between 2 and 6 for MGH and between 2 and 3 for Auvray) the guardrails preventing GPT-4o from answering questions involving violence make it refuse to generate a regestum, hence the lower values in the n samples column, while the Llama models do not incur this issue.
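The scores in Table 2 can be computed with standard packages; a minimal sketch assuming the rouge_score and sacrebleu libraries (the paper does not name its metric implementations):

```python
from rouge_score import rouge_scorer  # pip install rouge-score
import sacrebleu                      # pip install sacrebleu

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

def evaluate(references, generations):
    """Average Rouge F1 scores and corpus-level Bleu over paired texts."""
    totals = {k: 0.0 for k in ("rouge1", "rouge2", "rougeL")}
    for ref, gen in zip(references, generations):
        scores = scorer.score(ref, gen)  # dict of Score(precision, recall, fmeasure)
        for k in totals:
            totals[k] += scores[k].fmeasure
    rouge = {k: v / len(references) for k, v in totals.items()}
    bleu = sacrebleu.corpus_bleu(generations, [references]).score / 100  # 0-1 scale
    return rouge, bleu
```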
3.3. Qualitative Results

For a more in-depth analysis of the model abilities seen in the quantitative results, we identify a sample of 20 full text and regesta pairs (10 from MGH and 10 from Auvray) to conduct a qualitative analysis of the results. The regesta are manually checked by a domain expert in the different versions produced by GPT-4o, Llama 3.1 70b and Llama 3.1 405b, for a total of 120 artificially created regesta reviewed and compared with their original versions. Through this manual inspection it is possible to identify the reasons for the results presented in Table 2 and possibly take corrective actions in the future.

In agreement with Table 2, the qualitative analysis also shows that GPT-4o performs better than both Llama models. This mainly concerns the generation of Latin text and thus the summarization of the document content in the regestum form. Indeed, the Llama models encounter more problems in text generation, as shown by the fact that their best results are obtained when the summary is created in English and then translated into Latin (i.e., backtranslate). More broadly, it can also be observed that the systems perform better in the case of MGH but, as already mentioned, this can be traced back to how the dataset is constructed.

Finally, the qualitative analysis reveals the most critical failures in automatic summarization (i.e., automatic regesta creation). One of the problems identified concerns the recognition of the documents' author, namely the Pope. Indeed, in the case of MGH, which collects documents from multiple popes, the systems show difficulties in recognizing the correct author: GPT-4o correctly recognizes the Pope in 11 of the 20 cases taken into account. Another critical element concerns dating, which in these medieval texts is based on the year of pontificate and not on the modern dating system. Although Llama 3.1 70b, Llama 3.1 405b and GPT-4o recognize and identify the dating system used in the extended text of the medieval document, and show that they have the tools to accomplish the conversion, all the systems show difficulty in providing a correct dating (either because they do not offer one or because they miscalculate it). Out of the 20 manually inspected records generated by GPT-4o, only 3 correctly report the date of the document. The result improves in the case of the recognition of the document recipient (often reported in the first line of the extended text), which in the same sample is recognized correctly by GPT-4o in 15 out of 20 cases. Finally, it should be noted that in a few cases, since our prompt requests the data topica (the place) to be extracted, the systems correctly extract it even when the original regestum does not report this information. This leads to a lower score in the table, but to a qualitatively better result in regesta generation.

4. Conclusions

In this work we have developed REVERINO, a dataset of 4533 pairs of regesta with their respective full texts (and apparatus). The texts in this dataset come from two collections of regesta, Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268) (MGH) and Les Registres de Gregoire IX (1227/41) (Auvray); to collect the dataset we have followed a pipeline composed of 4 steps: annotation, training, extraction and post-processing. Despite containing more than 4000 samples, REVERINO is too small to be used as a training set for a language model that automatically generates regesta; however, it can be used as a benchmark to test the ability of existing LLMs to do summarization in Latin and thus to develop better tools and methodologies in the future.

We have tested 3 LLMs among the best performing ones; our general finding is that these models cannot be used as-is to summarize texts in Latin. More precisely, we find that GPT-4o is the best and that models from the Llama family are less able to generate text in Latin. Interestingly, for both Llama 3.1 70b and Llama 3.1 405b we find that initially translating to English is an effective technique to generate better regesta.

We also want to underline the limitations of our work: the samples in our dataset are automatically extracted, and therefore a share of them contain transcription errors and imperfections.
However, we use the dataset only as a benchmark, and it is still too small to serve as a training dataset for a text-summarization model. Despite these limitations, we hope that REVERINO will foster future work on the development of Language Models proficient in Latin, and we will continue improving it by extending it to more than 10k samples and by using it to train a custom Language Model specifically tailored to the generation of regesta in Latin.

Acknowledgments

This work was supported by the project "Italian Strengthening of ESFRI RI RESILIENCE" (ITSERR) funded by the European Union under the NextGenerationEU funding scheme (CUP: B53C22001770006).

References

[1] ITSERR (Italian Strengthening of the ESFRI RI RESILIENCE), 2024. URL: https://itserr.it.
[2] A. Potthast, Regesta Pontificum Romanorum, Rudolf de Decker, 1874.
[3] G. H. Pertz, K. Rodenberg, Epistolae saeculi XIII e regestis pontificum Romanorum selectae (1216-1268), volume 1-3, Weidmann, 1894.
[4] L. Auvray, Les Registres de Gregoire IX (1227/41), volume 1-3, Bibliothèque des Écoles françaises d'Athènes et de Rome, 1890-1918.
[5] B. Kiessling, R. Tissot, P. Stokes, D. Stoekl Ben Ezra, eScriptorium: An open source platform for historical document analysis, 2019, pp. 19-19. doi:10.1109/ICDARW.2019.10032.
[6] S. Gabay, T. Clérice, CATMuS Print [Large], 2024. URL: https://doi.org/10.5281/zenodo.10592716. doi:10.5281/zenodo.10592716.
[7] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579-2605. URL: http://jmlr.org/papers/v9/vandermaaten08a.html.
[8] F. Riemenschneider, A. Frank, Exploring large language models for classical philology, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23), Association for Computational Linguistics, Toronto, Canada, 2023. URL: https://arxiv.org/abs/2305.13698.
[9] M. Gambhir, V. Gupta, Recent automatic text summarization techniques: a survey, Artificial Intelligence Review 47 (2017) 1-66.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877-1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[11] S. Narayan, S. B. Cohen, M. Lapata, Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1797-1807. URL: https://aclanthology.org/D18-1206. doi:10.18653/v1/D18-1206.
[12] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013.
[13] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[14] Meta Llama Team, The Llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
[15] OpenAI, GPT-4 technical report, CoRR abs/2303.08774 (2023). URL: https://doi.org/10.48550/arXiv.2303.08774. doi:10.48550/ARXIV.2303.08774. arXiv:2303.08774.