<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Meaning of Beatus: Disambiguating Latin with Contemporary AI Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eleonora Ghizzota</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>via Edoardo Orabona 4, 70125, Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The objective of this work is to assess the performance of Large Language Models (LLMs) on the task of Word Sense Disambiguation (WSD) for Latin. We evaluate state-of-the-art LLMs, including GPT-4o-mini and LLaMA variants, in both zero-shot and fine-tuned settings, using a dataset derived from the SemEval-2020 Latin Lexical Semantic Change task. Our study aims to determine whether instruction tuning and task-specific fine-tuning can significantly improve the models' ability to disambiguate Latin word senses. Results show that while LLMs demonstrate a non-trivial baseline ability in zero-shot settings, fine-tuning, particularly instruction-based fine-tuning, provides improvements in accuracy and F1 scores. These findings highlight the potential of LLMs when applied to low-resourced historical languages.</p>
      </abstract>
      <kwd-group>
<kwd>Lexical Semantics</kwd>
        <kwd>Word Sense Disambiguation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Latin</kwd>
        <kwd>Low-resource languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivations</title>
<p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy.
* Corresponding author.
† These authors contributed equally.
e.ghizzota@phd.uniba.it (E. Ghizzota); pierpaolo.basile@uniba.it (P. Basile); lucia.siciliani@uniba.it (L. Siciliani); giovanni.semeraro@uniba.it (G. Semeraro)</p>
      <p>ORCID: 0000-0002-0751-3891 (E. Ghizzota); 0000-0002-0545-1105 (P. Basile); 0000-0001-7116-9338 (L. Siciliani); 0000-0002-9421-8566 (G. Semeraro)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[...] revolutionised the landscape by providing vast corpora and extensive knowledge graphs extracted from online sources, thereby amplifying the capabilities of both supervised and knowledge-based methods. The introduction of transformer-based architectures [22] marked a significant turning point. These models use dense vector representations to capture semantic meaning in context, resulting in further advancements in disambiguation techniques. A significant development in this domain is the rise of Large Language Models (LLMs), which are built upon the Transformer architecture and trained on extensive text corpora. LLMs exhibit proficiency in a myriad of tasks in zero-shot or few-shot contexts, ruling out the necessity of task-specific training data. This implies an inherent capacity for semantic understanding within these models. Nonetheless, LLMs can also be fine-tuned on particular tasks by utilising tailored training data, enhancing their performance in specific applications.</p>
      <p>Considering these premises, the intent of this work is to assess how state-of-the-art LLMs perform on under-represented languages like Latin through the lens of a long-standing NLP task like WSD. In particular, our investigation has two objectives. First, we want to test the models' out-of-the-box ability to disambiguate Latin senses in a zero-shot setting. In this way, we aim to first establish how well the models' inherent multilingual knowledge performs in accurate sense prediction. Next, we also perform task-specific fine-tuning, which enables us to adapt both standard and instruction versions of LLMs. The aim is to gauge the gain obtained with this additional training step.</p>
      <p>The paper is structured as follows: Section 2 provides an overview of works related to solving the WSD task with LLMs; Section 3 introduces the corpus of choice for this study, while Section 4 illustrates the methodology. Section 5 describes the experimental setting and discusses the results and the limitations of the proposed strategy, while Section 6 summarises the takeaway messages of this paper and suggests some future works.</p>
      <sec id="sec-1-0">
        <title>2. Related Work</title>
        <sec id="sec-1-0-1">
          <title>2.1. Latin Word Sense Disambiguation</title>
          <p>Currently, solving the WSD task for Latin using language models remains an unexplored strategy, with very few works investigating this line of research in recent years. The idea of using WSD for measuring the ability of language models to deal with Latin is supported by the work proposed by [17], in which Latin BERT is tested on the sense disambiguation task.</p>
          <p>Latin BERT is a contextual language model tailored for Latin, trained on a corpus of 642.7 million words drawn from diverse sources ranging from the Classical period to the 21st century (the model is available at https://www.github.com/dbamman/latin-bert). It achieves state-of-the-art performance in Latin part-of-speech tagging across all Universal Dependency datasets. To capture the full range of linguistic variation, the model was trained on multiple corpora, including the Corpus Thomisticum, the Internet Archive, the Latin Library, Patrologia Latina, Perseus, and the Latin Wikipedia. Latin BERT uses Latin-specific sentence and word tokenizers from the Classical Language Toolkit, resulting in a vocabulary of 32,895 subword units. To assess Latin BERT performance on the WSD task, the authors reformulated it into a binary classification task and created an ad hoc dataset of Latin sense examples extracted from the Lewis and Short Latin Dictionary [23]. In order to be selected, headwords must have at least two distinct senses – typographically denoted by “I.” and “II.” – supported by at least 10 sentences each, each longer than five words. For the task, only the two major senses of a headword were retained; the final dataset consists of 8,354 examples for 201 dictionary headwords. For each headword, an instance of Latin BERT was fine-tuned on 80% of the examples. The number of training instances per headword ranges from 16 (8 per sense) to 192 (96 per sense); 59% of headwords have 24 or fewer training examples. Latin BERT achieves 75.4% accuracy, compared to the 67.3% of a bidirectional LSTM with static word embeddings. These results show that, even with few training examples, Latin BERT was able to disambiguate senses.</p>
          <p>A few years later, [24] fine-tuned Latin BERT on a portion of the sense representations in the Thesaurus Linguae Latinae (TLL). The TLL is the first comprehensive dictionary of ancient Latin usage up to 600 AD, offering a comprehensive, documented overview of every Latin word’s history, including meanings and constructions, etymology, inflexion peculiarities, spelling, and prosody, as well as comments from ancient sources on the word itself. The ongoing TLL project began in 1894 and has been regularly updated since; currently, it contains lemmata from a to resurgēsco, and it is estimated to contain approximately 56,000 entries. Inspired by the WSD dataset created by Bamman and Burns for Latin BERT, the authors requested data for the same lemmata from the TLL, obtaining 25,227 quotes for 40 lemmata. The new dataset leads to a performance gain, with the Mean Macro F1 increasing from .695 to .794.</p>
          <p>Although both [17] and [24] achieved promising results, Latin is still an under-represented language for which very few annotated resources are available when compared to English. [25] proposes a language pivoting framework for Latin. Language pivoting, borrowed from Machine Translation [26], consists of propagating annotations from high-resource languages to lower-resource ones. Starting from the 40 lemmata manually annotated for <xref ref-type="bibr" rid="ref9">SemEval-2020</xref> [13], the authors extract an aligned Latin-English dataset in which these lemmata occur. To this day, the dataset of SemEval-2020 Task 1 is the only benchmark for Latin manually annotated by Latin experts. These lemmata were then mapped to Latin WordNet (http://latinwordnet.exeter.ac.uk/) and Princeton WordNet [27], allowing for annotation propagation from English to Latin. The final result is a dataset of 3,886 annotated sentences for training and experimentation.</p>
        </sec>
        <sec id="sec-1-1">
          <title>2.2. LLMs and Word Sense Disambiguation</title>
          <p>Over the years, LLMs have consistently demonstrated their ability to perform various tasks in a zero- or few-shot setting with minimal or no specific training data, suggesting an intrinsic capability of LLMs to grasp the semantics behind language [28, 29]. [30] demonstrates that BERT-like models are capable of effectively differentiating between various word senses, even when only a few examples are available for each. Their analysis further reveals that although language models can perform nearly perfectly on coarse-grained noun disambiguation in ideal settings where training data and resources are abundant, such conditions are rare in practical scenarios, presenting ongoing challenges. Along the lines of BERT-like approaches, [31] examines multiple WSD methods, including those that use language models to extract contextual embeddings as input features and as a foundation for training supervised models on sense-annotated data. [32] assesses language models’ WSD capabilities through three behavioural experiments designed to evaluate children’s ability to disambiguate word senses. The study offers a compelling comparison between how children understand semantics and how it is encoded in transformer-based models. The authors identify a bias in the models toward the most frequent sense and observe a negative correlation between the size of the training data and model performance.</p>
          <p>[33] evaluated the WSD accuracy of LLMs on eight datasets via a multiple-choice question format, and [34] extended the analysis by gauging LLM performance on single-choice questions and examining how different model sizes affect disambiguation accuracy. Similarly, [35] creates a benchmark specific to the Italian language with the aim of evaluating LLMs’ abilities in selecting the correct meaning of a word and in generating the definition of a word in a sentence. Finally, [36] analyses the WSD capabilities of only open LLMs, experimenting with different parameter configurations on several languages: English, Spanish, French, Italian and German. The authors extend the existing XL-WSD benchmark [37] to include two additional subtasks: (i) given a word occurrence within a sentence, the LLM must generate the appropriate definition; and (ii) given a word occurrence and a list of predefined meanings, the LLM must identify the correct one. Moreover, they use the training data of XL-WSD to fine-tune an open LLM based on LLaMA 3.1-8B. The results indicate that while LLMs perform well in zero-shot settings, they still fall short of surpassing current state-of-the-art methods. Larger models achieve the strongest results, whereas medium-sized models tend to underperform. Notably, however, a fine-tuned model with a medium parameter size outperforms all others, including existing state-of-the-art approaches.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3. Dataset</title>
        <sec id="sec-1-2-1">
          <title>3.1. Resource</title>
          <p>The dataset of choice is the Latin annotated dataset for the Unsupervised Lexical Semantic Change Detection (LSCD) shared task of <xref ref-type="bibr" rid="ref9">SemEval-2020</xref> [13]. This dataset is a fragment of LatinISE [5], a 13-million-word diachronic, annotated Latin corpus (available at https://lindat.mf.cuni.cz/repository/xmlui/handle/11234/1-2506). The primary source of LatinISE is the Latin portion of the IntraText digital library (http://www.intratext.com). To semi-automatically annotate this corpus, NLP tools that were state-of-the-art in 2013 – PROIEL (https://www.hf.uio.no/ifikk/english/research/projects/proiel/), QuickLatin (http://www.quicklatin.com/), and TreeTagger (https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) – were used. Hence, LatinISE provides morphological annotations like part-of-speech tags and lemma for each word.</p>
          <p>Back in 2020, for the SemEval-2020 Unsupervised Lexical Semantic Change task, two time-specific sub-corpora were extracted from LatinISE [13, 6]: the first covers the period from the 2nd century BC to 0 (1.7M tokens), the second the period from 0 to the 21st century AD (9.4M tokens).</p>
          <p>As concerns target words, they are either (i) words that changed their meaning(s) between the two periods; or (ii) stable words that did not change their meaning during that time. The choice of the set of lexemes for the annotation was based on an initial process of lexical selection and pre-annotation, carried out by a team member [6]. A list of target words comprising those whose meaning has been attested to have changed between the pre-Christian and Christian era [38, 39, 40, 23] was selected. The pre-annotation trial verified whether the corpus showed evidence of both the late antiquity senses and the previous senses, and whether the late antiquity senses appeared in the later texts only and the classical senses in the earlier texts, although they may also have occurred in later texts. Conversely, stable words were chosen since they are not known for having undergone lexical semantic change associated with the period of late antiquity. The final list comprises 40 target words, of which 23 are stable, while 17 have undergone changes in meaning in relation to Christianity.</p>
          <p>For each target word, its primary sense definitions were taken from the Latin portion of the Logeion Online Dictionary (https://logeion.uchicago.edu/), which includes Lewis and Short’s Latin-English Lexicon [23], Lewis’s Elementary Latin Dictionary [41], and Du Fresne Du Cange’s Glossarium mediae et infimae latinitatis [42]. Depending on the cases, the sense inventory was simplified, or the definitions were shortened, while maintaining the principal distinction between senses. Finally, for each target word 60 sample passages were extracted, 30 from each sub-corpus, for a total of 2,398 passages.</p>
          <p>The lack of native Latin speakers adds a further layer of complexity to the sense annotation process. Ten annotators with a high-level knowledge of Latin were recruited, ranging from undergraduate students to senior researchers. Annotators – only one per target word – scored the relatedness between a usage and a sense definition according to the Diachronic Usage Relatedness (DURel) framework [43], specially designed for lexical semantic change annotations. The DURel framework consists of a 4-point scale for quantifying the relatedness of a word usage and a sense, or score 0 if the annotator cannot decide:</p>
          <p>• 0 - Cannot decide
• 1 - Unrelated
• 2 - Distantly related
• 3 - Closely related
• 4 - Identical</p>
          <p>Table 1 shows an example of the usage annotation for the target word beatus. The senses presented to the annotators were: (a) “blessed”, (b) “rich”, (c) “fortunate”, (d) “happy” and (e) “rewarded”. Let’s focus on the sense “blessed”, which only emerged later with the advent of Christianity. Notice how it scores 1 for the first usage, dated 46 BC, while it scores 4 for the second usage, dated circa 1100 AD.</p>
          <p>The target word virtus was chosen for calculating the inter-annotator agreement between four annotators: the average pairwise agreement computed as the Spearman correlation coefficient was 0.69, comparable with inter-annotator agreement for modern languages, e.g., English 0.69, Swedish 0.57 and German 0.59 [43]. See [6] for the detailed process behind the creation and annotation of the dataset.</p>
        </sec>
        <sec id="sec-1-3">
          <title>3.2. Data preparation</title>
          <p>The DURel annotation statistics are summarised in Table 2. We take full advantage of the annotations in the dataset by creating a separate prompt for each of the judgments assigned to each of the proposed senses for a single sentence. For example, if the annotator marked virtute as “4 - Identical” for the sense “manliness, courage, virtue, strength” and “1 - Unrelated” for the sense “virtue, personified as a deity”, two separate prompts are created, each structured as shown in Listings 1 and 2.</p>
          <p>Listing 1: Prompt generated by each sense annotation for the regression task.
Instruction: Given the target word ‘‘virtute’’
and the sentence in input where the word
is enclosed by the [TARGET] tag, and the
following meaning ‘‘virtue, personified as
a deity’’, assign a score between 0 and
4. The score meaning is the following:
0: Cannot decide
1: Unrelated
2: Distantly Related
3: Closely Related
4: Identical
Answer just with the score.
Input: &lt;left context&gt; [TARGET] virtute [TARGET] &lt;right context&gt;</p>
          <p>This process yields a total of 8,989 prompts for the regression task.</p>
          <p>As for the binary classification task, the DURel 1-to-4 scale was binary encoded as follows:
• pairs of sense and sentence with scores equal to or above 3 were labelled as yes;
• pairs of sense and sentence with scores equal to or below 2 were labelled as no.</p>
          <p>Listing 2: Prompt generated by each sense annotation for the binary classification task.
Instruction: Given the target word ‘‘virtute’’
and the sentence in input where the word
is enclosed by the [TARGET] tag, and the
following meaning ‘‘virtue, personified as
a deity’’, assign a label "yes" or "no".
The label meaning is the following:
"yes": The sense for the target word occurrence is correct
"no": The sense for the target word occurrence is not correct
Answer just with the label.
Input: &lt;left context&gt; [TARGET] virtute [TARGET] &lt;right context&gt;</p>
          <p>Pairs of sense and sentence were split in a stratified manner, based on the scores assigned to each sense. This stratification process, 70% training and 30% testing, outputs a training set of 6,299 sentences and a testing set of 2,690. Due to the absence of annotations, sentences of the lemma oportet were excluded from the dataset.</p>
          <p>Pairs of sense and sentence with score 0 were not considered in this experiment; thus, with respect to the score distribution in Table 2, the training set for the binary classification task consists of 6,255 instances instead of 6,299, and the testing set has 2,675 examples instead of 2,690, yielding a total of 8,930 prompts. This binary encoded test set comprises 956 instances of class yes and 1,719 of class no, resulting in a very imbalanced dataset in which class yes represents only 35.73% of the examples.</p>
          <p>The idea behind this work is to leverage this dataset to build a benchmark for the evaluation of LLMs in disambiguating Latin words, as described in the following section.</p>
        </sec>
      </sec>
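The score-to-label encoding described in Section 3.2 can be sketched in a few lines. This is a minimal illustration, and the `(sense, sentence, score)` tuple layout is an assumption for the example, not the format of the released dataset:

```python
# Minimal sketch of the DURel score -> yes/no encoding of Section 3.2.
# Assumption: each annotation is a (sense, sentence, score) tuple; the
# released dataset may use a different record layout.
def encode_binary(annotations):
    """Scores >= 3 -> "yes"; scores 1-2 -> "no"; score 0 (cannot decide) dropped."""
    encoded = []
    for sense, sentence, score in annotations:
        if score == 0:
            continue  # "Cannot decide" pairs are excluded from the binary task
        encoded.append((sense, sentence, "yes" if score >= 3 else "no"))
    return encoded

# Illustrative annotations, echoing the beatus example of Table 1
example = [
    ("blessed", "beatus usage, ca. 1100 AD", 4),  # Identical     -> yes
    ("blessed", "beatus usage, 46 BC", 1),        # Unrelated     -> no
    ("rich", "beatus usage, 46 BC", 0),           # Cannot decide -> dropped
]
```

Applying `encode_binary(example)` keeps two of the three pairs, mirroring how the paper's training and testing sets shrink from 6,299/2,690 to 6,255/2,675 once score-0 pairs are removed.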
    </sec>
    <sec id="sec-2">
      <title>4. Methodology</title>
      <p>As stated in the introduction, one of the aims of this paper is to assess whether fine-tuning LLMs can improve their performance on an under-represented language, compared to a zero-shot setting. To do so, we exploit the prompt dataset created from LatinISE, described in Section 3. Tables 3 and 4 introduce the LLMs of choice and summarise their characteristics.</p>
      <p>• LLaMA-3 instruction-tuned. We use publicly available checkpoints of Meta’s LLaMA 3.3-70B and 3.1-8B variants with instruction tuning, accessed via the TogetherAI API (https://www.together.ai/) and the Unsloth API (https://unsloth.ai/), respectively;
• GPT-4o-mini. Accessed via the Microsoft Azure API, this model is used without any task-specific training. Prompting is designed to simulate realistic WSD instructions.</p>
      <p>For zero-shot WSD, we directly use the prompt test set, unseen during fine-tuning (see Section 4.2). After a preliminary prompt engineering step, we use the prompt in Listing 3, which is the same as the one used for fine-tuning.</p>
      <sec id="sec-2-1">
        <title>4.2. Fine-tuning</title>
        <p>Using the training split of the dataset, we fine-tune the open-weight LLaMA-3.1-8B model. Given the computational constraints associated with full fine-tuning of large models, we adopt a parameter-efficient fine-tuning (PEFT) approach based on Low-Rank Adaptation (LoRA). LoRA [44] introduces trainable, low-rank matrices into each transformer layer to adapt the model to a downstream task. Instead of updating all model parameters, LoRA freezes the pre-trained weights and injects a low-rank decomposition into the linear projections of the self-attention and/or feed-forward layers. This strategy significantly reduces the number of trainable parameters and memory usage, allowing efficient fine-tuning even on consumer-grade GPUs. We use the implementation provided by the Unsloth library, which enables us to reduce the required memory and accelerate the training process. During the training, we format the instruction data using the prompt reported in Listing 3 by relying on the chat template specific to the LLaMA models.</p>
        <p>Listing 3: Prompt used for the fine-tuning.
System: &lt;Instruction&gt;
User: &lt;Input&gt;
Assistant: &lt;Output&gt;</p>
        <p>During training, we use the following LoRA parameters: rank = 32, alpha = 64, a learning rate of 2e-4, and a batch size of 32. We train all models for five epochs on the whole training dataset. The training was performed using a single NVIDIA RTX A6000 GPU with 48GB of memory.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Evaluation</title>
      <p>As mentioned in Section 1, our study has two objectives. First, we want to test the models’ ability to disambiguate Latin senses in a zero-shot setting. In this way, we aim to first establish how well the models’ inherent multilingual knowledge performs in accurate sense prediction. Next, we perform task-specific fine-tuning, which enables us to adapt both standard and instruction versions of LLMs. The objective is to quantify the gain obtained through this additional training step.</p>
      <p>It is worth noticing that the dataset of choice was initially devised for the Unsupervised LSCD task [13], not for WSD; therefore, comparing the results of the shared task with the results of this work is not feasible. GPT-4o-mini and LLaMA-3.3-70B-instruct-turbo act as zero-shot baselines for this experiment, to assess the capabilities of models not specially devised or fine-tuned for the Latin WSD task.</p>
      <p>It is crucial to note that the dataset is highly imbalanced: many instances are annotated with 1, since each word occurrence is generally assigned a single meaning, and consequently all other meanings receive the lowest score. All the metrics are therefore computed with the dataset imbalance in mind. Balanced Accuracy is defined as the average recall obtained on each class. Weighted Precision, Recall and F1 compute the metric for each label and take the average weighted by support. Finally, Macro F1 and Micro F1 are variants of F1: the former does not take label imbalance into account, but computes the metric for each label and takes the unweighted mean; the latter computes the metric globally by counting the total true positives, false negatives and false positives. Details about the DURel annotation statistics are reported in Table 2.</p>
      <p>We release the following resources, available on GitHub: i) the source code; ii) the instruction fine-tuning and testing data; iii) links to the fine-tuned models on HuggingFace and the outputs of all evaluated models.</p>
    </sec>
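The LoRA update described in Section 4.2 can be made concrete with a dependency-free toy example. The matrices below are illustrative stand-ins (in the experiments the adapters are handled by the Unsloth/PEFT tooling); only the scaling rule `alpha / r` and the frozen-plus-low-rank structure follow the standard LoRA formulation:

```python
# Toy illustration of a LoRA forward pass: y = (W + (alpha/r) * B @ A) @ x.
# W stays frozen; only the small matrices A (r x d) and B (d x r) are trained.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_forward(W, A, B, x, alpha=64, r=32):
    scale = alpha / r                       # LoRA scaling factor alpha / rank
    BA = matmul(B, A)                       # low-rank update, same shape as W
    W_eff = [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return matmul(W_eff, [[v] for v in x])  # column-vector output

# Toy frozen layer and a rank-1 adapter (hypothetical values)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]
B_zero = [[0.0], [0.0]]   # standard LoRA init: B = 0, so the adapter starts inert
```

With `B` initialised to zero, the adapted layer behaves exactly like the frozen one, which is why LoRA training can start from the pre-trained model's behaviour and only gradually learn the task-specific correction.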
    <sec id="sec-4">
      <title>5.1. Regression task</title>
      <p>Table 5 illustrates the results of the WSD task. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) show that the fine-tuned model is better at predicting the annotation score. To give a complete overview of the results, we also provide classification metrics. Although GPT-4o-mini shows a higher precision, LLaMA-3.1-8B-instruct-ft outperforms every other model. It is interesting to note the large difference in performance between LLaMA-3.3-70B-instruct-turbo and LLaMA-3.1-8B-instruct-ft. These results prove that fine-tuning a medium-sized LLM on a single GPU can overcome a model of the same family with about nine times the number of parameters.</p>
      <p>To better understand the behaviour of each model, we report the confusion matrix of each model in Appendix B. The matrices of GPT-4o-mini (Figure 1) and LLaMA-3.3-70B (Figure 2) show that the models often confuse label 1 with other labels. It is interesting to note that GPT-4o-mini confuses label 1 with label 4 in 508 cases. This behaviour is even more evident in LLaMA-3.1-8B-instruct (Figure 3), where 913 instances labelled as 1 are confused with label 3 and 579 with label 4.</p>
      <p>The fine-tuned model LLaMA-3.1-8B-instruct-ft (Figure 4) is the best at recognising label 1. This behaviour is expected, since the model tends to overfit on the more frequent class.</p>
    </sec>
    <sec id="sec-4a">
      <title>5.2. Binary Classification task</title>
      <p>Results of the WSD task framed as a binary classification task are reported in Table 6, as well as the confusion matrix of each model in Appendix B. Our proposed fine-tuned model LLaMA-3.1-8B-instruct-ft shows a strong performance boost with respect to LLaMA-3.1-8B-instruct and LLaMA-3.3-70B-instruct-turbo. On the other hand, GPT-4o-mini performance is in line with LLaMA-3.1-8B-instruct-ft, and even surpasses it in Precision and Accuracy. In general, our LLaMA-3.1-8B-instruct-ft outperforms the baseline models. Figure 8 shows that LLaMA-3.1-8B-instruct-ft performs best on class no, while GPT-4o-mini predicts class yes better.</p>
    </sec>
    <sec id="sec-4b">
      <title>6. Conclusions and Future Works</title>
      <p>This study explores the ability of Large Language Models (LLMs) to address Word Sense Disambiguation (WSD) in Latin, a historically rich yet computationally low-resourced language. The first contribution of our work is the release of a dataset for evaluating the WSD abilities of LLMs in Latin. This dataset is created by leveraging an existing manually annotated dataset. Then, using the new dataset and through both zero-shot and fine-tuned evaluations, we observed that while general-purpose LLMs exhibit a promising baseline ability to handle Latin WSD, significant improvements are achieved through task-specific fine-tuning. The fine-tuned LLaMA-3.1-8B-instruct model outperformed larger and more resource-intensive models in accuracy and F1 scores, underscoring the impact of targeted instruction tuning, even on medium-sized architectures. Nevertheless, challenges remain. The dataset’s inherent class imbalance, with a predominance of “unrelated” sense labels, likely influenced the models’ predictions and underscores the need for more balanced and semantically diverse training data.</p>
      <p>Future work will focus on three main directions: i) expanding the annotated dataset to include more lemmata and a broader variety of senses; ii) evaluating model performance on additional semantic tasks, such as definition generation and contextual paraphrasing in Latin; iii) exploring multilingual and cross-lingual transfer learning strategies, leveraging annotations from related Romance languages to further boost Latin model capabilities.</p>
    </sec>
    <sec id="sec-4c">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</p>
    </sec>
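The metrics used throughout Section 5 (Balanced Accuracy as average per-class recall, Macro F1 as the unweighted mean of per-label F1, Micro F1 from global counts) can be reproduced with a few lines of pure Python. This is an illustrative sketch, not the evaluation code released with the paper:

```python
# Illustrative implementations of the metrics defined in Section 5.
from collections import Counter

def per_class_counts(gold, pred):
    """Per-label true positives, false positives, false negatives."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    return labels, tp, fp, fn

def balanced_accuracy(gold, pred):
    """Average recall over the classes that occur in the gold labels."""
    labels, tp, fp, fn = per_class_counts(gold, pred)
    recalls = [tp[l] / (tp[l] + fn[l]) for l in labels if tp[l] + fn[l] > 0]
    return sum(recalls) / len(recalls)

def macro_micro_f1(gold, pred):
    labels, tp, fp, fn = per_class_counts(gold, pred)
    f1s = []
    for l in labels:
        p = tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0
        r = tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    macro = sum(f1s) / len(f1s)  # unweighted mean: ignores label imbalance
    # Micro F1 counts tp/fp/fn globally across all labels
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    return macro, micro
```

On an imbalanced toy split such as `gold = ["no", "no", "no", "yes"]` the macro score penalises errors on the rare `yes` class much more than the micro score does, which is exactly why the paper reports both.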
    <sec id="sec-5">
      <title>A. Translation</title>
<p>Cicero’s Tusculanae Disputationes
la: [...] Dico enim constanter grauiter sapienter
fortiter. Haec etiam in eculeum coiciuntur, quo uita
non adspirat beata. - Quid igitur? solane beata
uita, quaeso, relinquitur extra ostium limenque
carceris, cum constantia grauitas fortitudo
sapientia reliquaeque uirtutes rapiantur ad tortorem
nullumque recusent nec supplicium nec dolorem?
[...]
en: For I say constantly, gravely, wisely, and strongly.</p>
      <p>These things are also cast into the rack, to which
life does not aspire for happiness. - What then?
Is a blessed life alone, I pray you, left outside the
door and threshold of the prison, when constancy,
gravity, fortitude, wisdom and the other virtues
are snatched away to the torturer and refuse
neither punishment nor pain?
Robertus Grosseteste’s De libero arbitrio
la: [...] Ex quo fit, ut de nihilo creauerit omnia.”
Eadem itaque ratione solus facit omnia, nulla
adiutus natura. Horum autem obiectorum solutio
haberi potest ut uidetur ex uerbis beati Bernardi
sic dicentis: “Ipsa gratia Liberum arbitrium
excitat, cum seminat cogitatum. Sanat, cum mutat
affectum; roborat, ut perducat ad actum; seruat,
ne sentiat defectum.” [...]
en: From which it comes about that He created all
things out of nothing.” Therefore, by the same
reasoning, He alone creates all things, without
any help from nature. But the solution to these
objections can be found, as can be seen from the
words of Blessed Bernard, who says thus: “Grace
itself awakens Free will when it sows thought. It
heals when it changes affection; it strengthens,
so that it may lead to action; it preserves, so that
it may not feel a deficiency.”</p>
    </sec>
    <sec id="sec-6">
      <title>B. Confusion Matrices</title>
      <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly to paraphrase and reword, improve writing style, and check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>Linguistic resources and NLP tools for Latin</article-title>, in: LDK (Posters), <year>2019</year>, pp. <fpage>6</fpage>-<lpage>11</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><given-names>M.</given-names> <surname>Straka</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Straková</surname></string-name>, <article-title>Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe</article-title>, in: <source>Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>, <year>2017</year>, pp. <fpage>88</fpage>-<lpage>99</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><given-names>M.</given-names> <surname>Passarotti</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Mambrini</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Franzini</surname></string-name>, <string-name><given-names>F. M.</given-names> <surname>Cecchini</surname></string-name>, et al., <article-title>Interlinking through lemmas: the lexical collection of the LiLa knowledge base of linguistic resources for Latin</article-title>, <source>Studi e Saggi Linguistici</source> <volume>58</volume> (<year>2020</year>) <fpage>177</fpage>-<lpage>212</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><given-names>W.</given-names> <surname>Stroh</surname></string-name>, <article-title>Latein ist tot, es lebe Latein!: kleine Geschichte einer grossen Sprache</article-title>, <source>List Taschenbuch</source>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><given-names>M.</given-names> <surname>Passarotti</surname></string-name>,
          <string-name>
            <given-names>E.
            <surname>Litta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Cecchini</surname>
          </string-name>
          , M. Pellegrini, buch,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Moretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rufolo</surname>
          </string-name>
          , G. Pedonese, The lila knowl- [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leonhardt</surname>
          </string-name>
          ,
          <article-title>Latin: Story of a world language</article-title>
          , Har-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>edge base of interoperable linguistic resources for</article-title>
          vard University Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>latin. architecture and current state (</article-title>
          <year>2022</year>
          ). [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlechtweg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hengchen</surname>
          </string-name>
          , H. Du[4]
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cassotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , D. Di Pierro, bossarsky, N. Tahmasebi, Semeval-2020 task 1: Un-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferilli</surname>
          </string-name>
          ,
          <article-title>Using graph databases for historical lan- supervised lexical semantic change detection</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>guage data: Challenges and opportunities (</article-title>
          <year>2023</year>
          ). arXiv:
          <year>2007</year>
          .
          <volume>11464</volume>
          . [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilgarrif</surname>
          </string-name>
          , Tools for historical [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Cecchini</surname>
          </string-name>
          , M. Pel-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>corpus research, and a corpus of latin, New methods legrini, Overview of the EvaLatin 2020 evaluation</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>in historical corpus linguistics 1</source>
          (
          <year>2013</year>
          )
          <fpage>247</fpage>
          -
          <lpage>257</lpage>
          . campaign, in: R. Sprugnoli, M. Passarotti (Eds.), [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kondakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burman</surname>
          </string-name>
          ,
          <source>Proceedings of LT4HALA 2020 - 1st Workshop on</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>framework for latin diachronic lexical semantics, tion (ELRA), Marseille</article-title>
          , France,
          <year>2020</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>Journal of Latin Linguistics</source>
          <volume>21</volume>
          (
          <year>2022</year>
          )
          <fpage>47</fpage>
          -
          <lpage>105</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .lt4hala-
          <fpage>1</fpage>
          .16/. [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minozzi</surname>
          </string-name>
          , Latin wordnet, una rete di conoscenza [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Cecchini</surname>
          </string-name>
          , M. Fan-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>semantica per il latino e alcune ipotesi di utilizzo toli, G. Moretti, Overview of the EvaLatin 2022</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          (
          <year>2017</year>
          )
          <fpage>123</fpage>
          -
          <lpage>134</lpage>
          .
          <source>on Language Technologies for Historical and An</source>
          [8]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , P. J.
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Stewart</surname>
          </string-name>
          , T. Cook, cient Languages,
          <source>European Language Resources</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Besnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J. B.</given-names>
            <surname>Mattingly</surname>
          </string-name>
          , The Classical Lan- Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>guage Toolkit</surname>
          </string-name>
          :
          <article-title>An NLP framework for pre-modern URL: https://aclanthology</article-title>
          .org/
          <year>2022</year>
          .lt4hala-
          <fpage>1</fpage>
          .29/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          languages,
          <source>in: Proceedings of the 59th Annual</source>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Iurescia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          , Overview
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>Meeting of the Association for Computational Lin- of the evalatin 2024 evaluation campaign</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>guistics and the 11th International Joint Confer- Proceedings of the Third Workshop on Language</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Demonstrations</surname>
          </string-name>
          ,
          <article-title>Association for Computational (LT4HALA)@ LREC-COLING-</article-title>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>190</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>29</lpage>
          . URL: https:// 197.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          aclanthology.org/
          <year>2021</year>
          .acl-demo.3. doi:
          <volume>10</volume>
          .18653/ [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bamman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <article-title>Latin bert: A contextual lan-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          v1/
          <year>2021</year>
          .
          <article-title>acl-demo.3. guage model for classical philology</article-title>
          , arXiv preprint [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Straková</surname>
          </string-name>
          , Udpipe: trainable arXiv:
          <year>2009</year>
          .
          <volume>10053</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>pipeline for processing conll-u files performing tok-</article-title>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <source>Conference on Language Resources</source>
          and Evaluation arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>(LREC'16)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>4290</fpage>
          -
          <lpage>4297</lpage>
          . [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Weaver</surname>
          </string-name>
          , Translation, in: Proceedings of the
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>conference on mechanical translation</source>
          ,
          <source>1952. Science Society</source>
          , volume
          <volume>45</volume>
          ,
          <year>2023</year>
          . [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>Word sense disambiguation: A survey</article-title>
          , [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kibria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Adnan</surname>
          </string-name>
          , On functional compe-
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <article-title>ACM computing surveys (CSUR) 41 (</article-title>
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>69</lpage>
          .
          <article-title>tence of llms for linguistic disambiguation</article-title>
          , in: Pro[21]
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Church</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yarowsky</surname>
          </string-name>
          ,
          <article-title>A method ceedings of the 28th Conference on Computational</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <article-title>for disambiguating word senses in a large corpus</article-title>
          ,
          <source>Natural Language Learning</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>Computers and the Humanities</source>
          <volume>26</volume>
          (
          <year>1992</year>
          )
          <fpage>415</fpage>
          -
          <lpage>439</lpage>
          . [34]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Yae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Skelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Ranly</surname>
          </string-name>
          , P. M. LaCasse, [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <article-title>Leveraging large language models for word sense</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>tention is all you need</article-title>
          ,
          <source>Advances in neural infor- tions 37</source>
          (
          <year>2025</year>
          )
          <fpage>4093</fpage>
          -
          <lpage>4110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <source>mation processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ). [35]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Musacchio</surname>
          </string-name>
          , L. Siciliani, Ita-sense[23]
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Short</surname>
          </string-name>
          ,
          <article-title>A latin dictionary. clarendon, evaluate llms' ability for italian word sense disam-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          1879.
          <article-title>biguation: A calamita challenge</article-title>
          , in: Proceedings [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lendvai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wick</surname>
          </string-name>
          ,
          <article-title>Finetuning latin bert for word of the 10th Italian Conference on Computational</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>sense disambiguation on the thesaurus linguae lati- Linguistics (CLiC-it</article-title>
          <year>2024</year>
          ), Pisa, Italy,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          nae, in: Proceedings of the Workshop on Cognitive [36]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          , E. Musacchio, G. Semeraro,
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <source>Aspects of the Lexicon</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>41</lpage>
          .
          <article-title>Exploring the word sense disambiguation capa</article-title>
          [25]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ghinassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tedeschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marongiu</surname>
          </string-name>
          , R. Navigli,
          <article-title>bilities of large language models</article-title>
          , arXiv preprint
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          ,
          <source>Language pivoting from parallel arXiv:2503.08662</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <article-title>corpora for word sense disambiguation of historical</article-title>
          [37]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pasini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>Xl-wsd: An extra-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <article-title>the 2024 Joint International Conference on Compu- word sense disambiguation</article-title>
          , in: Proceedings of
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <article-title>ation (LREC-COLING</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>10073</fpage>
          -
          <lpage>10084</lpage>
          . ume 35,
          <year>2021</year>
          , pp.
          <fpage>13648</fpage>
          -
          <lpage>13656</lpage>
          . [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Pivot language approach for [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Clackson</surname>
          </string-name>
          ,
          <article-title>A companion to the Latin language,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <article-title>phrase-based statistical machine translation</article-title>
          , in: John Wiley &amp; Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaenen</surname>
          </string-name>
          , A. van den Bosch (Eds.), Proceedings of [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Clackson</surname>
          </string-name>
          , G. Horrocks, The Blackwell history of
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <article-title>the 45th Annual Meeting of the Association of Com- the Latin language</article-title>
          , John Wiley &amp; Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>putational Linguistics</surname>
            , Association for Computa- [40]
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Glare</surname>
          </string-name>
          , Oxford Latin Dictionary, number
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <source>tional Linguistics</source>
          , Prague, Czech Republic,
          <year>2007</year>
          , pp.
          <source>Num. 1-4 in Oxford Latin Dictionary</source>
          , Clarendon
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          856-
          <fpage>863</fpage>
          . URL: https://aclanthology.org/P07-1108/. Press,
          <year>1982</year>
          . URL: https://books.google.it/books?id= [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <source>WordNet: An electronic lexical H7HhzAEACAAJ.</source>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          database, MIT press,
          <year>1998</year>
          . [41]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lewis</surname>
          </string-name>
          <string-name>
            <surname>Charlton</surname>
          </string-name>
          ,
          <article-title>An elementary latin dictionary</article-title>
          , [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Naveed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. U.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saqib</surname>
          </string-name>
          , S. An- New York, Cincinnati, and Chicago. American Book
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>war</surname>
            , M. Usman,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Akhtar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Barnes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mian</surname>
            ,
            <given-names>A Company</given-names>
          </string-name>
          (
          <year>1890</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <article-title>comprehensive overview of large language models</article-title>
          , [42]
          <string-name>
            <given-names>C. d. F.</given-names>
            <surname>Du</surname>
          </string-name>
          <string-name>
            <surname>Cange</surname>
          </string-name>
          , Glossarium mediae et infimae
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <given-names>ACM</given-names>
            <surname>Trans. Intell</surname>
          </string-name>
          . Syst. Technol. (
          <year>2025</year>
          ).
          <source>URL: https: latinitatis: AZ</source>
          , volume
          <volume>7</volume>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Favre</surname>
          </string-name>
          ,
          <year>1886</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          //doi.org/10.1145/3744746. doi:
          <volume>10</volume>
          .1145/3744746, [43]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlechtweg</surname>
          </string-name>
          , S. Schulte im Walde, S. Eckmann,
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <string-name><given-names>D.</given-names> <surname>Schlechtweg</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Schulte im Walde</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Eckmann</surname></string-name>,
          <article-title>Diachronic usage relatedness (DURel): A framework for the annotation of lexical semantic change</article-title>,
          in:
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)</source>,
          Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>, pp.
          <fpage>169</fpage>-<lpage>174</lpage>.
          URL: https://aclanthology.org/N18-2027/. doi:10.18653/v1/N18-2027.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [29]
          <string-name><given-names>L.</given-names> <surname>Qin</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <article-title>Large language models meet NLP: A survey</article-title>,
          <source>arXiv preprint arXiv:2405.12819</source>
          (<year>2024</year>).
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [30]
          <string-name><given-names>D.</given-names> <surname>Loureiro</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Rezaee</surname></string-name>,
          <string-name><given-names>M. T.</given-names> <surname>Pilehvar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Camacho-Collados</surname></string-name>,
          <article-title>Analysis and evaluation of language models for word sense disambiguation</article-title>,
          <source>Computational Linguistics</source>
          <volume>47</volume>
          (<year>2021</year>)
          <fpage>387</fpage>-<lpage>443</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [31]
          <string-name><given-names>M.</given-names> <surname>Bevilacqua</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Pasini</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Raganato</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>,
          <article-title>Recent trends in word sense disambiguation: A survey</article-title>,
          in:
          <source>International Joint Conference on Artificial Intelligence</source>,
          Inc,
          <year>2021</year>, pp.
          <fpage>4330</fpage>-<lpage>4338</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [44]
          <string-name><given-names>E. J.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Wallis</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Allen-Zhu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>,
          <source>ICLR</source>
          <volume>1</volume>
          (<year>2022</year>)
          3.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [32]
          <string-name><given-names>F.</given-names> <surname>Cabiddu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Nikolaus</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Fourtassi</surname></string-name>,
          <article-title>Comparing …</article-title>,
          in:
          <source>Proceedings of the Annual Meeting of the Cognitive Science Society</source>.
        </mixed-citation>
      </ref>
      <!-- Figure 4: LLaMA-3.1-8B-instruct-ft confusion matrix (regression task). -->
    </ref-list>
  </back>
</article>