Benchmarking BERT-based Models for Latin: A Case Study on Biblical References in Ancient Christian Literature

Davide Caffagni1,†, Federico Cocchi1,2,†, Anna Mambelli3,4, Fabio Tutrone5, Marco Zanella6,7, Marcella Cornia3,* and Rita Cucchiara1

1 Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy
2 Department of Informatics, University of Pisa, Pisa, Italy
3 Department of Education and Humanities, University of Modena and Reggio Emilia, Reggio Emilia, Italy
4 Fondazione per le scienze religiose (FSCIRE), Bologna, Italy
5 Department of Cultures and Societies, University of Palermo, Palermo, Italy
6 Department of Mathematics, University of Padua, Padua, Italy
7 Department of History and Cultures, University of Bologna, Bologna, Italy

IRCDL 2025: 21st Conference on Information and Research Science Connecting to Digital and Library Science, February 20–21, 2025, Udine, Italy
* Corresponding author.
† These authors contributed equally.
Contacts: davide.caffagni@unimore.it (D. Caffagni); federico.cocchi@unimore.it (F. Cocchi); anna.mambelli@unimore.it (A. Mambelli); fabio.tutrone@unipa.it (F. Tutrone); marco.zanella@unipd.it (M. Zanella); marcella.cornia@unimore.it (M. Cornia); rita.cucchiara@unimore.it (R. Cucchiara)
ORCID: 0009-0002-3279-6480 (D. Caffagni); 0009-0005-1396-9114 (F. Cocchi); 0000-0001-5538-5882 (A. Mambelli); 0000-0002-7063-7782 (F. Tutrone); 0009-0000-1208-6743 (M. Zanella); 0000-0001-9640-9385 (M. Cornia); 0000-0002-2239-283X (R. Cucchiara)

Abstract
Transformer-based language models like BERT have revolutionized Natural Language Processing (NLP) research, but their application to historical languages remains underexplored. This paper investigates the adaptation of BERT-based embedding models for Latin, a language central to the study of the sacred texts of Christianity. Focusing on Jerome’s Vulgate, pre-Vulgate Latin translations of the Bible, and patristic commentaries such as Augustine’s De Genesi ad litteram, we address the challenges posed by Latin’s complex syntax, specialized vocabulary, and historical variations at the orthographic, morphological, and semantic levels. In particular, we propose fine-tuning existing BERT-based embedding models on annotated Latin corpora, using self-generated hard negatives to improve performance in detecting biblical references in early Christian literature in Latin. Experimental results demonstrate the ability of BERT-based models to identify citations of and allusions to the Bible(s) in ancient Christian commentaries while highlighting the complexities and challenges of this field. By integrating NLP techniques with humanistic expertise, this work provides a case study on intertextual analysis in Latin patristic works. It underscores the transformative potential of interdisciplinary approaches, advancing computational tools for sacred text studies and bridging the gap between philology and computational analysis.

Keywords
Sentence Similarity Search, Sentence Embeddings, Ancient Languages

1. Introduction

The advent of Transformer-based language models [1] such as BERT [2, 3, 4, 5] has revolutionized the field of Natural Language Processing (NLP), offering unprecedented capabilities in tasks ranging from text classification to semantic similarity analysis [6, 7, 8, 9] and demonstrating their adaptability to other modalities beyond text [10]. By leveraging self-attention mechanisms and large-scale pre-training, these models capture fine-grained contextual relationships previously unattainable with traditional machine learning. While highly effective for modern languages [11, 12, 13, 14], their application to historical languages remains underexplored [15]. Historical languages pose unique challenges, including scarce high-quality annotated datasets, variations in orthography or morphology, and the need to deal with diachronic linguistic changes that can make finding semantic patterns very difficult [16, 17, 18]. Despite these challenges, understanding historical languages like Latin holds significant promise, not only for enriching NLP methodologies but also for advancing research in fields such as historical linguistics, philology, historical-religious studies, and exegesis.

This work investigates the adaptation of BERT-based models [19, 20, 21] for Latin, a pivotal language in the study of the sacred texts of Christianity and their receptions. Latin’s central role in the Christian exegetical tradition, along with its rich corpus of sacred and hermeneutical texts, provides an ideal context for developing NLP models for historical languages. Analyzing ancient Latin biblical texts – Jerome’s Vulgate and the pre-Vulgate Latin translations of the Greek Bible (Vetus Latina) – is crucial to understanding the context and history of their formation, as well as the reception history of the Hebrew and Greek Bibles, with their various forms of exegesis and rewriting. Indeed, the issue with authoritative sources lies in their intrinsic textual plurality, which is itself a sign of exegetical plurality. At the same time, sacred texts, as historical objects, may also be reconstructed from their tradition, closely connecting biblical texts to later Christian works that comment on, quote, rework, and allude to them. In particular, Latin patristic commentaries, such as Augustine’s De Genesi ad litteram, encapsulate intricate intertextual relationships with the biblical texts. These textual corpora pose distinct challenges for NLP due to their complex syntax, specialized vocabulary, and historical variations at the orthographic, morphological, and semantic levels. Further complicating this analysis, biblical references in patristic texts are frequently oblique, involving rephrasings, paraphrases, or allusions rather than quotations.

To address these challenges, this paper explores the potential of BERT-based models trained on Latin textual corpora to improve the identification and analysis of biblical references in Latin patristic commentaries. Our approach includes fine-tuning the models using corresponding passages1 from the Vulgate and pre-Vulgate Latin translations of the Bible, leveraging the natural variations (i.e., variant readings) between these biblical versions as a rich source of data for refining the embedding space. We report results on annotated biblical references from ancient Christian Latin commentaries, demonstrating the effectiveness of this methodology. During fine-tuning, we further enhance the model performance by employing self-generated hard negatives, derived from the embedding model itself, to refine its ability to discern subtle distinctions in intertextual relationships. This process supports the development of computational tools capable of detecting both “explicit” citations and “implicit” allusions in Latin texts with a high degree of accuracy.

1 With a slight abuse of notation, we will use the terms “passage” and “verse” interchangeably, referring to a piece of the biblical text identified by a book, a chapter, and a verse number (e.g., Gen. 3.1).
The contributions of this study are threefold. First, it outlines the methodological integration of humanistic expertise and NLP techniques, particularly the fine-tuning of BERT for sacred texts in Latin. Second, it presents a case study on the identification of biblical references in Latin patristic commentaries, demonstrating the practical applications of these models. Third, it highlights the potential of interdisciplinary approaches to transform the study of sacred texts and their receptions, bridging computational analysis and traditional philology. By advancing the application of Transformer-based models to Latin, this paper contributes to both the technical and scholarly dimensions of biblical text studies. In doing so, it underscores the transformative possibilities of interdisciplinary research at the intersection of computer science and the humanities, fostering new insights into the textual, intellectual, and exegetical heritages of religious communities.

2. Intertextual References in Ancient Christian Commentaries: A Case Study on Biblical Corpora

2.1. Annotating Biblical References

The analysis of biblical references within ancient Christian commentaries relies on manually curated datasets from Latin biblical and patristic texts in their critical reference editions. In particular, the commentary chosen for this case study is Augustine’s De Genesi ad litteram libri duodecim2, a pivotal work in the Christian exegetical tradition. This commentary, completed in the early 5th century, provides Augustine’s detailed hermeneutical reflections on the Book of Genesis, which inspire and give way to the definition of broader theological motifs. It also explicitly and implicitly interacts with multiple versions of the Latin Bible, that is, Jerome’s Vulgate and pre-Vulgate translations. Given Augustine’s intellectual prominence and central role in shaping Christian hermeneutics, his works provide an ideal case for studying biblical references in ancient Christian literature.

2 The edition used in this study is that of J. Zycha [22] (i.e., the most recent critical edition to date), downloaded from the Corpus Corporum database available at https://mlat.uzh.ch/ and manually revised before annotation.

As biblical textual corpora, we employ two (at least partially) different versions of the Latin Bible: the Vulgate (W_VULG) and the so-called Vetus Latina (S_VL). The W_VULG, a critical edition by R. Weber and R. Gryson [23]3, is the standard scholarly edition of Jerome’s Vulgate. In contrast, the Vetus Latina, an older and fragmentary collection of Latin translations reconstructed mostly by indirect tradition, is accessible as a whole through the 18th-century edition of the Benedictine monk P. Sabatier [24]4 (S_VL). Compared to the Vulgate, Sabatier’s edition presents challenges due to its lack of digital integration.

3 Available in digitized form from the Deutsche Bibelgesellschaft at https://www.die-bibel.de/en/bible/VUL/.
4 Available at the following links: https://archive.org/details/bibliorumsacroru01saba/page/n7/mode/2up, https://archive.org/details/bibliorumsacroru02saba/page/n7/mode/2up, https://archive.org/details/Sabatier3.

Table 1
Distribution of annotated references across similarity score ranges for the two biblical corpora, W_VULG (Vulgate) and S_VL (Vetus Latina). The total number of biblical passages in each corpus is also provided.

                                          # References
Corpus    # Passages    0.0-0.25    0.25-0.5    0.5-0.75    0.75-1.0    All
W_VULG    35,057        51          50          46          45          192
S_VL      20,791        44          23          20          83          170
Annotating Augustine’s commentary involves identifying textual parallels to passages in the Bible, determining whether references are exact quotations, paraphrases, or thematic allusions, and systematically tagging them using the INCEpTION annotation platform [25]5. This platform facilitates the encoding of detailed information about each reference, including its source (i.e., W_VULG or S_VL), its location (i.e., book, chapter, verse), and a similarity score quantifying the degree of lexical overlap between the annotated passage of the commentary and corresponding biblical verses. The similarity score ranges from 0 to 1, where 0 indicates no lexical overlap and 1 denotes an exact lexical match.

5 https://inception-project.github.io/

2.2. Benchmark Characteristics

The resulting dataset comprises 192 annotated references to the W_VULG Bible and 170 to the S_VL Bible, classified into four similarity categories based on their lexical overlap scores: 0.0-0.25, 0.25-0.5, 0.5-0.75, and 0.75-1.0. These similarity ranges capture the spectrum of intertextual relations, from loose thematic connections to verbatim citations. Table 1 details the distribution of references across these similarity ranges. Notably, references to W_VULG are distributed relatively evenly, while references to S_VL skew toward high similarity scores, with 83 instances scoring between 0.75 and 1.0. It is also important to note the differing overall sizes of the two biblical corpora. The W_VULG contains 35,057 passages (each corresponding to a biblical verse), whereas the S_VL only comprises 20,791 passages, due to the unavailability of some original books in digital format. A simplified example of what an annotated reference looks like is sketched below.
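For illustration only, one annotated reference could be represented by a record like the following; the field names and values are hypothetical and simply mirror the information described in Sec. 2.1 (source corpus, location, similarity score), not the actual INCEpTION export format.

```python
from dataclasses import dataclass

@dataclass
class BiblicalReference:
    """Hypothetical record for a single annotated reference (fields mirror Sec. 2.1)."""
    commentary_span: str    # annotated passage from Augustine's De Genesi ad litteram
    source_corpus: str      # "W_VULG" or "S_VL"
    book: str               # e.g., "Gen."
    chapter: int
    verse: int
    similarity: float       # lexical overlap score in [0, 1]

# Illustrative example: a near-verbatim reference to Gen. 3.1 in the S_VL corpus.
example = BiblicalReference(
    commentary_span="serpens autem erat prudentissimus omnium bestiarum",
    source_corpus="S_VL",
    book="Gen.",
    chapter=3,
    verse=1,
    similarity=0.9,   # hypothetical score in the 0.75-1.0 range
)
```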
3. Mapping Intertextuality via BERT-based Models for Latin

Our goal is to identify intertextual references between patristic commentaries and biblical passages. For this task, we focus on Augustine’s De Genesi ad litteram as the query text, influenced by the Latin Bible as a key source, and examine references to the W_VULG and S_VL Latin translations of the Bible, as detailed in Sec. 2. We frame this problem as an information retrieval task: given a query, the objective is to retrieve the most relevant documents from a collection. In our settings, a query 𝑞 is a passage from Augustine’s commentary and documents 𝑑 are verses from the two considered versions of the Bible. Each query is associated with a positive document 𝑑*, corresponding to an intertextual reference between the commentary and the Bible(s).

Figure 1: Overview of fine-tuning and inference pipelines.

In practice, 𝑞 may be a literal citation of the biblical verse 𝑑*, or it may just allude to 𝑑*. The former type of relationship is typically easier to identify by measuring the text overlap between a query and a document. Conversely, allusions to the Bible(s) are hard to detect, as they require complex semantic analysis, a task that is not trivial even for human experts.

3.1. Retrieving Bible Passages from Commentary Sentences

We propose to leverage Transformer-based language models [1], such as BERT [2], to effectively capture the complex intertextual references between patristic commentaries and biblical passages. To this end, let 𝑓𝜃 be a BERT-like pre-trained model. Before processing a query sentence or a document with 𝑓𝜃, the input is first tokenized. Each token is assigned a unique integer ID, which acts as an index to select the corresponding embedding in the input embedding matrix of 𝑓𝜃. This sequence of token embeddings is then passed through a stack of twelve Transformer layers, each comprising two main components: the attention operator, which relates each token to all other tokens in the sequence, and a feed-forward network that processes each token independently. The result is a sequence of output embeddings from the final Transformer layer, one for each token in the input. To obtain a single feature vector (i.e., embedding) representing the entire input sequence, we experiment with two aggregation strategies:

• CLS Token Embedding. BERT-like models prepend a special classification token (i.e., CLS) to the input sequence. The output embedding corresponding to the CLS is often regarded as a condensed and global representation of the entire input sequence.

• Token Averaging. An alternative strategy involves aggregating information from all tokens in the sequence to create a more comprehensive representation. This is achieved by taking the average of the embeddings of all tokens, except the CLS, in the input. Unlike the CLS token, which focuses on providing a global summary, token averaging distributes equal importance to each token, potentially capturing finer-grained information about the input sequence.

These embeddings, representing the query and the document, are mathematically expressed as follows:

$$\mathbf{q} = f_\theta(q) \in \mathbb{R}^m, \qquad \mathbf{d} = f_\theta(d) \in \mathbb{R}^m. \tag{1}$$

At this point, we measure the relevance of d with respect to q by calculating the cosine similarity between the two vectors:

$$s(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q}\,\mathbf{d}^\top}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{d}\rVert}, \tag{2}$$

where ‖·‖ indicates the Euclidean norm. Ideally, the relevance score between a query and its positive document should be maximized. Conversely, the similarity score with respect to any negative document – defined as any document other than the positive one – should be minimized. A minimal sketch of this embedding-and-scoring pipeline is given below.
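To make the pipeline concrete, the following is a minimal sketch of sentence embedding and cosine-similarity scoring with a BERT-like Latin checkpoint, using the Hugging Face transformers library. The checkpoint name is taken from [20]; the helper function and the example sentences (drawn from the verses shown in Figure 1) are illustrative, not the exact implementation used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint from [20]; Latin BERT [19] or LaBERTa [21] could be swapped in analogously.
MODEL_ID = "pstroe/roberta-base-latin-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts, aggregation="mean"):
    """Return one fixed-size embedding per sentence (m = 768), L2-normalized."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state            # (batch, seq_len, 768)
    if aggregation == "cls":
        emb = out[:, 0]                                   # CLS / <s> token embedding
    else:
        mask = batch["attention_mask"].unsqueeze(-1).float()
        mask[:, 0] = 0.0                                  # token averaging, excluding the CLS token
        emb = (out * mask).sum(1) / mask.sum(1).clamp(min=1.0)
    return torch.nn.functional.normalize(emb, dim=-1)    # unit norm: dot product = cosine (Eq. 2)

# Toy inference example with verses quoted in Figure 1 (Gen. 3.1 and Gen. 1.5).
q = embed(["serpens autem erat prudentissimus omnium bestiarum"])          # query passage
d = embed(["sed et serpens erat callidior cunctis animantibus terrae",     # W_VULG Gen. 3.1
           "appellavitque lucem diem et tenebras noctem"])                 # W_VULG Gen. 1.5
print(q @ d.T)   # relevance scores s(q, d); ideally the parallel verse ranks first
```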
3.2. Fine-tuning with Self-Hard Negative Mining

While the model 𝑓𝜃 is pre-trained on general language modeling tasks, it has not been specifically trained for the task of text retrieval. To adapt 𝑓𝜃 for this purpose, we fine-tune it using contrastive learning, a method that has proven effective for retrieval [26, 27] and other multimodal tasks [28, 29, 30]. In detail, given a batch of query-positive document pairs (𝑞, 𝑑*) ∈ B, we embed queries and documents with 𝑓𝜃, and then we compute the InfoNCE loss function [31]:

$$\mathcal{L} = -\sum_{(q,\, d^*) \in \mathcal{B}} \log \frac{\exp\big(s(\mathbf{q}, \mathbf{d}^*)\big)}{\exp\big(s(\mathbf{q}, \mathbf{d}^*)\big) + \sum_{d \in \mathcal{N}} \exp\big(s(\mathbf{q}, \mathbf{d})\big)} \tag{3}$$

By minimizing Eq. 3, we encourage 𝑓𝜃 to map a query and its positive document (𝑞, 𝑑*) to two points on the unit sphere that are close to each other. Conversely, negative documents unrelated to 𝑞, represented by N in Eq. 3, are pushed away from the embedding representation of 𝑞.

Overcoming the Lack of Training Data. A key challenge in training 𝑓𝜃 is the limited availability of commentary queries paired with their corresponding biblical passages (cf. Table 1). To mitigate this issue, we draw inspiration from self-supervised contrastive learning [32, 33] and propose a surrogate task for training. Specifically, we sample a verse from the W_VULG Bible as a query 𝑞, and pair it with the corresponding verse from the S_VL version as the positive document 𝑑* (or vice versa). At each training step, we sample 𝑁 negative documents for each query. In addition, we treat the positive and negative documents from other queries within the same batch as further negatives.

Additional Hard Negative Samples. The previously described procedure, commonly employed in contrastive learning [34, 35, 36], enhances model sensitivity to the distinctions between related and unrelated documents by exposing it to a larger number of negative samples. The quality of these negatives is crucial: documents that are similar to the query in the embedding space but not semantically related are referred to as hard negatives. These hard negatives are known to improve the robustness of models trained with contrastive loss functions [37, 38, 39, 40] like InfoNCE. In this work, we propose an effective strategy for mining hard negatives during training. First, we generate document embeddings by processing verses from the W_VULG version of the Bible with the pre-trained model 𝑓𝜃. Then, for each positive document 𝑑* associated with a query 𝑞, we retrieve the top-𝑘 most similar documents and use them as hard negatives for 𝑞. Fine-tuning the model using hard negative documents coming from the BERT model itself, as opposed to randomly sampling documents, makes the loss function in Eq. 3 more challenging to minimize, ultimately leading to improved performance.

4. Experimental Results

4.1. Experimental Setup

Considered BERT-based Embedding Models. In this study, we model 𝑓𝜃 with three language models sharing the architecture of the BERT model [2], namely Latin RoBERTa [20], Latin BERT [19], and LaBERTa [21]. All considered models have been pre-trained with the masked language modeling objective [2, 3]: the model is asked to predict the missing words that are randomly masked in the input sentence. The main difference between the three models is the Latin corpus chosen for pre-training. Latin RoBERTa [20] was trained on 390M tokens extracted from the Latin portion of CC-100 [41]. Latin BERT [19] used 642M tokens from a variety of sources spanning the Classical era to the 21st century. Lastly, LaBERTa [21] was trained on Corpus Corporum6 for a total of 167M tokens.

6 https://mlat.uzh.ch

Training Details. All models produce embeddings of size 𝑚 equal to 768. We fine-tune them with the loss function detailed in Eq. 3, using identical hyperparameters and settings. Specifically, we train with the Adam [42] optimizer, a learning rate fixed to 1 × 10⁻⁶, a batch size of 32 queries, and we sample 7 negative documents for each query. Training typically requires 6 hours on a single NVIDIA A40 GPU. A simplified sketch of one training step, combining self-mined hard negatives with the InfoNCE objective, is shown below.
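The sketch below is a hedged illustration of the procedure in Sec. 3.2 under the hyperparameters of Sec. 4.1, not the exact training code: the helper names (encode, mine_hard_negatives, training_step), the checkpoint (again the one from [20]), and the way negatives are batched are assumptions made for readability.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("pstroe/roberta-base-latin-cased")
model = AutoModel.from_pretrained("pstroe/roberta-base-latin-cased")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6)   # settings from Sec. 4.1

def encode(texts):
    """Token-averaged, L2-normalized embeddings (Sec. 3.1), with gradients enabled."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    mask[:, 0] = 0.0                                         # exclude the CLS token from the average
    emb = (out * mask).sum(1) / mask.sum(1).clamp(min=1.0)
    return F.normalize(emb, dim=-1)

def mine_hard_negatives(corpus_embs, pos_indices, top_k=7):
    """Self hard-negative mining: for each positive verse, take the top-k most similar
    other verses of the corpus according to pre-computed (frozen) embeddings."""
    sims = corpus_embs[pos_indices] @ corpus_embs.T          # cosine similarities
    sims[torch.arange(len(pos_indices)), pos_indices] = -1.0 # never return the positive itself
    return sims.topk(top_k, dim=-1).indices                  # (batch_size, top_k) verse indices

def training_step(queries, positives, negatives):
    """One InfoNCE step (Eq. 3). queries: W_VULG verses; positives: the parallel S_VL
    verses (or vice versa); negatives: flat list of sampled and hard-negative verses."""
    q, d_pos, d_neg = encode(queries), encode(positives), encode(negatives)
    pos_scores = (q * d_pos).sum(-1, keepdim=True)           # s(q, d*), shape (B, 1)
    neg_scores = q @ d_neg.T                                  # scores against explicit negatives
    diag = torch.eye(len(queries), dtype=torch.bool)
    in_batch = (q @ d_pos.T).masked_fill(diag, float("-inf"))  # other queries' positives as negatives
    logits = torch.cat([pos_scores, neg_scores, in_batch], dim=-1)
    labels = torch.zeros(len(queries), dtype=torch.long)     # the positive is always column 0
    loss = F.cross_entropy(logits, labels)                   # batch-mean version of Eq. 3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, the `negatives` passed to `training_step` would combine the randomly sampled verses with those returned by `mine_hard_negatives`; how the two are mixed per query is an implementation detail left open here.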
Table 2
Performance comparison of existing BERT-based models for Latin on the W_VULG and S_VL corpora, using either the CLS token or the mean of all tokens in the sentence to compute similarities. All results are reported without fine-tuning the embedding model.

                                          Corpus: W_VULG                    Corpus: S_VL
Model                 Aggregation         R@1   R@2   R@3   R@5   R@10      R@1   R@2   R@3   R@5   R@10
Latin RoBERTa [20]    CLS Token           13.0  13.0  13.5  13.5  14.1      4.7   6.5   8.2   8.2   10.0
Latin RoBERTa [20]    Token Averaging     18.2  20.3  23.4  27.6  30.2      15.9  17.6  18.2  20.6  23.5
Latin BERT [19]       CLS Token           18.2  24.5  26.6  28.1  29.7      18.8  24.1  28.2  32.9  34.1
Latin BERT [19]       Token Averaging     33.3  38.0  41.7  44.8  47.9      35.3  39.4  42.9  45.3  48.8
LaBERTa [21]          CLS Token           31.3  39.1  44.3  47.4  55.7      29.4  34.7  40.0  45.3  50.6
LaBERTa [21]          Token Averaging     34.4  40.6  43.8  47.9  52.6      33.5  37.6  40.0  43.5  47.6

Table 3
Performance comparison of Latin BERT [19] and LaBERTa [21] with different fine-tuning strategies on the W_VULG and S_VL corpora, including results with and without hard negatives.

                                          Corpus: W_VULG                    Corpus: S_VL
Model                 Fine-tuning         R@1   R@2   R@3   R@5   R@10      R@1   R@2   R@3   R@5   R@10
Latin BERT [19]       -                   33.3  38.0  41.7  44.8  47.9      35.3  39.4  42.9  45.3  48.8
Latin BERT [19]       w/o Hard Neg.       38.5  46.9  52.1  55.2  58.9      35.9  42.3  47.1  49.4  54.1
Latin BERT [19]       w/ Hard Neg.        47.4  51.6  54.2  55.2  59.9      38.8  42.9  45.3  50.0  55.3
LaBERTa [21]          -                   34.4  40.6  43.8  47.9  52.6      33.5  37.6  40.0  43.5  47.6
LaBERTa [21]          w/o Hard Neg.       41.1  50.0  54.2  59.4  64.6      37.1  45.3  48.8  57.6  62.4
LaBERTa [21]          w/ Hard Neg.        43.2  50.5  52.1  56.3  63.5      41.8  45.9  48.2  55.3  61.2

4.2. Evaluating BERT-based Embedding Models for Latin

Impact of Token Aggregation Strategies. Table 2 provides an in-depth comparison of the three pre-trained BERT-based models for Latin considered in this study, evaluated on the W_VULG and S_VL corpora. These evaluations assess their ability to retrieve the correct biblical passage corresponding to a query without any task-specific fine-tuning. Performance is measured using Recall at top-𝑘 (R@𝑘) for 𝑘 ∈ {1, 2, 3, 5, 10}; a minimal sketch of this evaluation protocol is given below. As described in Sec. 3.1, the analysis explores two distinct strategies for aggregating token embeddings into fixed-size representations: the CLS token and token averaging. As can be seen, token averaging consistently demonstrates its utility by capturing finer-grained information distributed across all tokens in a sequence, leading to substantial performance improvements for almost all models on both W_VULG and S_VL. Among the three evaluated models, Latin BERT and LaBERTa are the most effective configurations across both corpora, achieving the highest recall scores in most scenarios and surpassing the performance of Latin RoBERTa by a consistent margin. Therefore, in the rest of the paper, we focus on the Latin BERT and LaBERTa models and report fine-tuning results using token averaging as the aggregation strategy.

Effect of Fine-tuning and Self-Hard Negative Mining. Table 3 presents a performance comparison of the Latin BERT and LaBERTa models with different fine-tuning strategies. The results clearly demonstrate that fine-tuning significantly enhances retrieval performance, and the addition of hard negatives further boosts effectiveness across all settings, particularly for R@1, which is critical for precise retrieval tasks. Without fine-tuning, both Latin BERT and LaBERTa show moderate performance, with R@1 values below 35% for both corpora. Fine-tuning without hard negatives consistently improves the retrieval accuracy. For instance, Latin BERT improves from an R@1 of 33.3% to 38.5% on W_VULG, while LaBERTa increases from 34.4% to 41.1%. Similar trends are observed on S_VL, with notable gains across other recall metrics as well. This highlights the importance of adapting pre-trained models to the specific task of retrieving intertextual references.
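For reference, this is a minimal sketch of how R@𝑘 can be computed in this setup, assuming pre-computed, L2-normalized embeddings for the annotated commentary passages and for all verses of a corpus; the tensor names are illustrative.

```python
import torch

def recall_at_k(query_embs, corpus_embs, gold_indices, ks=(1, 2, 3, 5, 10)):
    """query_embs: (Q, 768) annotated commentary passages; corpus_embs: (N, 768) Bible verses;
    gold_indices: (Q,) index of the referenced verse for each query. Embeddings are unit-norm."""
    sims = query_embs @ corpus_embs.T                    # (Q, N) cosine similarity matrix
    ranked = sims.argsort(dim=-1, descending=True)       # verse indices sorted by relevance
    hits = ranked == gold_indices.unsqueeze(-1)          # True where the gold verse appears
    return {k: 100.0 * hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```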
The inclusion of hard negatives during fine-tuning further enhances performance across all metrics, confirming the effectiveness of this strategy. Latin BERT achieves the highest gains, with R@1 reaching 47.4% on W_VULG and 38.8% on S_VL. LaBERTa also benefits significantly, improving R@1 to 43.2% on W_VULG and 41.8% on S_VL. These results underline the role of hard negatives in refining the ability of the models to distinguish between closely related and unrelated documents.

Table 4
Performance comparison of BERT-based models, with and without fine-tuning, across various subsets of the W_VULG and S_VL corpora based on annotated similarity scores.

                                       All               0.0-0.25          0.25-0.5          0.5-0.75          0.75-1.0
Model                 Fine-tuning      R@1   R@5   R@10  R@1   R@5   R@10  R@1   R@5   R@10  R@1   R@5   R@10  R@1   R@5   R@10

Corpus: W_VULG
Latin RoBERTa [20]    ✗                18.2  27.6  30.2  0.0   0.0   0.0   8.0   18.0  18.0  34.7  50.0  54.3  33.3  46.7  53.3
Latin BERT [19]       ✗                33.3  44.8  47.9  1.9   1.9   5.9   20.0  34.0  38.0  60.8  69.6  73.9  55.5  80.0  80.0
LaBERTa [21]          ✗                34.4  47.9  52.6  0.0   5.8   9.8   20.0  42.0  48.0  63.0  71.7  78.3  60.0  77.7  80.0
Latin BERT [19]       ✓                47.4  55.2  59.9  5.9   15.7  25.5  40.0  52.0  52.0  73.9  78.3  82.6  75.6  80.0  84.4
LaBERTa [21]          ✓                43.2  56.3  63.5  15.7  31.4  41.2  26.0  46.0  50.0  69.6  73.9  82.6  66.7  77.8  84.4

Corpus: S_VL
Latin RoBERTa [20]    ✗                15.9  20.6  23.5  0.0   0.0   0.0   0.0   4.3   4.3   5.0   10.0  15.0  31.3  38.6  42.2
Latin BERT [19]       ✗                35.3  45.3  48.8  0.0   2.3   2.3   4.3   8.7   13.0  40.0  45.0  50.0  61.4  78.3  83.1
LaBERTa [21]          ✗                33.5  43.5  47.6  0.0   0.0   4.5   8.7   26.1  30.4  45.0  50.0  55.0  55.4  69.9  73.5
Latin BERT [19]       ✓                38.8  50.0  55.3  0.0   6.8   15.9  8.7   17.4  21.7  40.0  45.0  55.0  67.5  83.1  85.5
LaBERTa [21]          ✓                41.8  55.3  61.2  9.1   27.3  40.9  13.0  34.8  34.8  40.0  55.0  55.0  67.5  75.9  80.7

Analyzing Performance at Higher Reference Difficulty Levels. Table 4 reports the performance of models with and without fine-tuning at varying levels of difficulty, quantified as the similarity between a query and its referred biblical passage. The lowest similarity range (i.e., 0.0-0.25) corresponds to the hardest queries, with low text overlap with respect to the biblical passage. In this range, models struggle to identify corresponding passages, with recall scores close to zero when not fine-tuned. These results underscore the challenge of detecting loosely referred passages. However, fine-tuning significantly improves the models, particularly LaBERTa, which achieves a recall of 15.7% on the W_VULG corpus and 9.1% on S_VL. In the mid-similarity ranges (i.e., 0.25-0.5 and 0.5-0.75), performance sees a substantial boost, with fine-tuned versions of Latin BERT and LaBERTa achieving notably higher recall scores. For instance, in the 0.5-0.75 range, LaBERTa reaches 69.6% on W_VULG and 40.0% on S_VL. In the highest similarity range (i.e., 0.75-1.0), models perform the best, with fine-tuned versions of Latin BERT and LaBERTa achieving R@1 scores close to or above 70% for both corpora. This analysis suggests that while models excel at identifying exact or near-exact matches, their performance decreases significantly as the references become less direct, though fine-tuning helps mitigate this challenge.

5. Conclusion

In this paper, we demonstrated the effectiveness of BERT-based models in capturing intertextual references within Latin texts, with a particular focus on patristic commentaries. By employing a fine-tuning strategy that incorporates hard-negative mining, we achieved significant performance improvements across both the W_VULG and S_VL corpora.
The experimental results showcase the ability of models fine-tuned with the proposed strategy to handle references with varying degrees of lexical overlap, including implicit allusions that present particular challenges. These results underscore the potential of Transformer-based approaches for Latin NLP tasks and provide a solid foundation for future research in historical text analysis, bridging computational methods with philological expertise.

Acknowledgments

We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources. This work was supported by the PNRR project Italian Strengthening of Esfri RI Resilience (ITSERR) funded by the European Union – NextGenerationEU (CUP B53C22001770006).

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017.
[2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2018.
[3] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv preprint arXiv:1907.11692 (2019).
[4] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: Advances in Neural Information Processing Systems, 2019.
[5] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, arXiv preprint arXiv:1909.11942 (2019).
[6] R. Nogueira, K. Cho, Passage Re-ranking with BERT, arXiv preprint arXiv:1901.04085 (2019).
[7] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019.
[8] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating Text Generation with BERT, in: Proceedings of the International Conference on Learning Representations, 2020.
[9] A. Rogers, O. Kovaleva, A. Rumshisky, A Primer in BERTology: What We Know About How BERT Works, Transactions of the Association for Computational Linguistics 8 (2021) 842–866.
[10] D. Caffagni, F. Cocchi, L. Barsellotti, N. Moratelli, S. Sarto, L. Baraldi, M. Cornia, R. Cucchiara, The Revolution of Multimodal Large Language Models: A Survey, in: Findings of the Annual Meeting of the Association for Computational Linguistics, 2024.
[11] T. Pires, E. Schlinger, D. Garrette, How Multilingual is Multilingual BERT?, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.
[12] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, S. Pyysalo, Multilingual is not enough: BERT for Finnish, arXiv preprint arXiv:1912.07076 (2019).
[13] M. Polignano, P. Basile, M. De Gemmis, G. Semeraro, V. Basile, et al., AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets, in: Proceedings of the Italian Conference on Computational Linguistics, 2019.
[14] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de La Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language model, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020.
[15] T. Sommerschield, Y. Assael, J. Pavlopoulos, V. Stefanak, A. Senior, C. Dyer, J. Bodel, J. Prag, I. Androutsopoulos, N. de Freitas, Machine learning for ancient languages: A survey, Computational Linguistics 49 (2023) 703–747.
[16] E. Manjavacas, L. Fonteyn, Adapting vs. Pre-Training Language Models for Historical Languages, Journal of Data Mining & Digital Humanities (2022).
[17] A. Palmero Aprosio, S. Menini, S. Tonelli, BERToldo, the Historical BERT for Italian, in: Proceedings of the Workshop on Language Technologies for Historical and Ancient Languages, 2022.
[18] B. Hutchinson, Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing, in: Findings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2024.
[19] D. Bamman, P. J. Burns, Latin BERT: A Contextual Language Model for Classical Philology, arXiv preprint arXiv:2009.10053 (2020).
[20] P. B. Ströbel, RoBERTa Base Latin Cased v1, 2022. URL: https://huggingface.co/pstroe/roberta-base-latin-cased.
[21] F. Riemenschneider, A. Frank, Exploring Large Language Models for Classical Philology, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2023.
[22] J. Zycha (ed.), Sancti Aureli Augustini: De Genesi ad litteram libri duodecim eiusdem libri capitula. De Genesi ad litteram imperfectus liber. Locutionum in Heptateuchum libri septem, Pragae-Vindobonae-Lipsiae, Tempsky-Freytag, 1894.
[23] R. Weber, R. Gryson (eds.), Biblia Sacra iuxta Vulgatam Versionem, Stuttgart, Deutsche Bibelgesellschaft, 2007 (5th ed.; 1st ed. by R. Weber, 1969).
[24] P. Sabatier (ed.), Bibliorum Sacrorum latinae versiones antiquae seu Vetus Italica (3 vols.), Reims, Reginaldus Florentain, 1743–1751.
[25] J.-C. Klie, M. Bugert, B. Boullosa, R. E. De Castilho, I. Gurevych, The INCEpTION Platform: Machine-Assisted and Knowledge-Oriented Interactive Annotation, in: Proceedings of System Demonstrations of the International Conference on Computational Linguistics, 2018.
[26] M. Cornia, M. Stefanini, L. Baraldi, M. Corsini, R. Cucchiara, Explaining Digital Humanities by Aligning Images and Textual Descriptions, Pattern Recognition Letters 129 (2020) 166–172.
[27] N. Messina, M. Stefanini, M. Cornia, L. Baraldi, F. Falchi, G. Amato, R. Cucchiara, ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, in: Proceedings of the International Conference on Content-based Multimedia Indexing, 2022.
[28] S. Sarto, M. Barraco, M. Cornia, L. Baraldi, R. Cucchiara, Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
[29] S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, R. Cucchiara, Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models, in: Proceedings of the European Conference on Computer Vision, 2024.
[30] N. Moratelli, D. Caffagni, M. Cornia, L. Baraldi, R. Cucchiara, Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization, in: Proceedings of the British Machine Vision Conference, 2024.
[31] A. Oord, Y. Li, O. Vinyals, Representation Learning with Contrastive Predictive Coding, arXiv preprint arXiv:1807.03748 (2018).
[32] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, E. Grave, Unsupervised Dense Information Retrieval with Contrastive Learning, Transactions on Machine Learning Research (2022).
[33] A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M. Han, J. Tworek, Q. Yuan, N. Tezak, J. W. Kim, C. Hallacy, et al., Text and Code Embeddings by Contrastive Pre-Training, arXiv preprint arXiv:2201.10005 (2022).
[34] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A Simple Framework for Contrastive Learning of Visual Representations, in: Proceedings of the International Conference on Machine Learning, 2020.
[35] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan, Supervised Contrastive Learning, in: Advances in Neural Information Processing Systems, 2020.
[36] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning Transferable Visual Models From Natural Language Supervision, in: Proceedings of the International Conference on Machine Learning, 2021.
[37] F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, in: Proceedings of the British Machine Vision Conference, 2018.
[38] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, D. Larlus, Hard Negative Mixing for Contrastive Learning, in: Advances in Neural Information Processing Systems, 2020.
[39] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021.
[40] L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, Aligning Text and Document Illustrations: Towards Visually Explainable Digital Humanities, in: Proceedings of the International Conference on Pattern Recognition, 2018.
[41] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2020.
[42] D. P. Kingma, J. L. Ba, Adam: A Method for Stochastic Optimization, in: Proceedings of the International Conference on Learning Representations, 2015.