<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards the Semi-Automated Population of the Ancient Greek WordNet</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Beatrice Marchesi</string-name>
          <email>beatrice.marchesi03@universitadipavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annachiara Clementelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Maurizio Mammarella</string-name>
          <email>andreamaurizio.mammarella01@universitadipavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Zampetta</string-name>
          <email>silvia.zampetta01@universitadipavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erica Biagetti</string-name>
          <email>erica.biagetti@unipv.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Brigada Villa</string-name>
          <email>luca.brigadavilla@unipv.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Virginia Mastellari</string-name>
          <email>virginia.mastellari@unipv.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Ginevra</string-name>
          <email>riccardo.ginevra@unicatt.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Roberta Combei</string-name>
          <email>claudia.roberta.combei@uniroma2.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Zanchi</string-name>
          <email>chiara.zanchi@unipv.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
        </aff>
        <aff id="aff1">
          <label>1</label>
        </aff>
        <aff id="aff2">
          <label>2</label>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper explores the employment of LLMs, specifically of Mistral-NeMo, in the semi-automatic population of the Ancient Greek WordNet synsets. Several approaches are investigated: zero-shot, few-shot, and fine-tuning. The results are compared against an English baseline. The zero-shot approach yields the highest accuracy, while fine-tuning leads to the highest number of potential synonyms. Our analysis also reveals that polysemy and PoS play a role in the model’s performance, as the highest scores are registered for polysemous words and for verbs and nouns. The results are encouraging for the application of such approaches in a human-in-the-loop scenario, since human validation still proves crucial in ensuring the accuracy of the results.</p>
      </abstract>
      <kwd-group>
        <kwd>Lexical semantics</kwd>
        <kwd>synonym generation</kwd>
        <kwd>LLMs</kwd>
        <kwd>Ancient Greek</kwd>
        <kwd>WordNet</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>In this paper, we explore the application of Large Language Models (LLMs) for populating the synsets of the Ancient Greek WordNet (AGWN) and assessing the extent to which these models can support such a task.</p>
      <p>WordNet is a lexical resource that organizes word meanings by groups of quasi-synonymous words connected to each other in a network structure ([<xref ref-type="bibr" rid="ref1">1</xref>]). The building blocks of WordNets are synsets, that is, groups of cognitive synonyms, each associated with a short definition and an ID-number ([<xref ref-type="bibr" rid="ref1">1</xref>]). WordNets are designed to represent both synonymy and polysemy, via assignment to the same synset or to multiple synsets, respectively. For example, the Ancient Greek nouns riphéggeia, augasmós, bolḗ, kiéllē<sup>1</sup> all belong to the same synset.</p>
      <p>The first WordNet was developed for English at Princeton University by George Miller and Christiane Fellbaum ([<xref ref-type="bibr" rid="ref2">2</xref>], [<xref ref-type="bibr" rid="ref3">3</xref>], [<xref ref-type="bibr" rid="ref4">4</xref>]). Originally developed within a project in psycholinguistics, it gradually evolved into a tool for computational lexical semantics. The development of such semantic networks was subsequently extended to languages beyond English, beginning with modern languages (e.g., [<xref ref-type="bibr" rid="ref5">5</xref>]) and later including ancient ones as well, such as Latin, Ancient Greek, Sanskrit, and Old English.</p>
      <p>The AGWN was developed in 2014 as the result of an international collaboration between the Institute of Computational Linguistics “Antonio Zampolli” (Pisa), the Perseus Project, the Open Philology Project, and the Alpheios Project.</p>
      <p><sup>1</sup>Note that in the experiment both the inputs and the outputs of the model were written in the Greek alphabet. In this paper, however, all Ancient Greek lemmas are transliterated and provided with translations supplied by the LSJ lexicon [<xref ref-type="bibr" rid="ref11">11</xref>].</p>
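<p>The synset-based organization just described can be sketched as a minimal data structure; the identifiers, glosses, and lemma spellings below are illustrative placeholders, not actual AGWN data.</p>

```python
# Minimal sketch of WordNet-style synsets (IDs, glosses, and lemmas are
# illustrative placeholders, not actual AGWN entries): a synset groups
# quasi-synonymous lemmas under a short definition and an ID.
synsets = {
    "n#001": {"gloss": "ray of light", "lemmas": {"augasmos", "bole"}},
    "n#002": {"gloss": "act of throwing", "lemmas": {"bole"}},
}

def synsets_of(lemma):
    """IDs of every synset the lemma belongs to."""
    return sorted(sid for sid, syn in synsets.items() if lemma in syn["lemmas"])

def is_polysemous(lemma):
    """A lemma assigned to more than one synset is polysemous."""
    return len(synsets_of(lemma)) > 1
```

<p>Under this representation, synonymy is shared membership in a synset, while a lemma listed in several synsets is polysemous.</p>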
      <sec id="sec-2-2">
        <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics</p>
        <p><sup>2</sup>Synsets do not group together only ‘absolute synonyms’, i.e., words that are interchangeable in all possible contexts, but also words that are similar in meaning limited to certain contexts ([<xref ref-type="bibr" rid="ref2">2</xref>]: 241, [<xref ref-type="bibr" rid="ref12">12</xref>]).</p>
        <p>Drawing from a previous collaboration with the University of Pavia ([<xref ref-type="bibr" rid="ref13">13</xref>]), the first version of the AGWN was initially constructed using digitized Greek-English lexica from the Perseus Project, linking the Greek word of each extracted bilingual pair to every synset in the Princeton WordNet ([<xref ref-type="bibr" rid="ref3">3</xref>]) in which the English member of the pair appeared. This method, known as the expand method ([<xref ref-type="bibr" rid="ref5">5</xref>]), has been commonly adopted in the development of several modern WordNets ([<xref ref-type="bibr" rid="ref14">14</xref>]), largely due to the extensive richness and detail of the Princeton WordNet. However, it presents challenges typical of using English as a pivot language, as well as difficulties specific to mapping concepts across culturally and historically distant traditions. In the case of the AGWN, synsets were also aligned with the Italian section of the MultiWordNet ([<xref ref-type="bibr" rid="ref15">15</xref>]), ItalWordNet ([<xref ref-type="bibr" rid="ref16">16</xref>]), and with the Latin WordNet ([<xref ref-type="bibr" rid="ref6">6</xref>]). A subset of synsets was used to evaluate the automatic extraction process, and erroneous alignments were removed by filtering out anachronistic domains. This version of the AGWN included approximately 35,000 lemmas, roughly 28% of the estimated 120,000 lemmas in the entire Ancient Greek lexicon. Coverage was significantly higher for the Homeric lexicon (69%), owing to the incorporation of Autenrieth’s Homeric Dictionary in the construction of the resource (see [<xref ref-type="bibr" rid="ref7">7</xref>] for details).</p>
        <p>The work on the AGWN continues in the framework of the PRIN project Linked WordNets for Ancient Indo-European Languages, whose aim is to harmonize three WordNets for Ancient Greek, Latin, and Sanskrit, and expand their coverage in terms of the number of annotated words and populated synsets ([<xref ref-type="bibr" rid="ref9">9</xref>], [<xref ref-type="bibr" rid="ref17">17</xref>]).</p>
        <p>While various methods have been proposed for the automatic population of synsets, their outputs typically still require substantial manual validation. For instance, word embeddings have been employed to identify lexical relations absent from existing WordNets for Ancient Greek ([<xref ref-type="bibr" rid="ref18">18</xref>]), Sanskrit ([<xref ref-type="bibr" rid="ref19">19</xref>]), and Latin ([<xref ref-type="bibr" rid="ref20">20</xref>]; see [<xref ref-type="bibr" rid="ref21">21</xref>] for an overview). Given that fully manual synset population is highly time-consuming, a further aim was later added to the project Linked WordNets for Ancient Indo-European Languages: the training and testing of LLMs for the automatic population of synsets of ancient languages. These models are intended to be integrated into the current annotation platform to suggest potential synonyms to annotators, who will then manually validate the LLM generations.</p>
        <p>The first experiment with LLMs, conducted on Latin ([<xref ref-type="bibr" rid="ref21">21</xref>]), aimed to compare zero-shot, few-shot, and fine-tuning approaches against an English baseline. Quantitative analysis showed marked improvements from zero-shot to fine-tuning approaches, with the latter outperforming the English baseline. Qualitative evaluation revealed stronger performance with verbs and with lemmas belonging to relatively well-populated synsets. While the results were encouraging, they highlighted the need for better performance across various parts of speech and degrees of polysemy. These goals are pursued in the present paper, which extends the experiment to Ancient Greek.</p>
        <p>The paper is organized as follows. In Section 2 we describe our data and methodology, discussing the creation of the dataset (2.1), the zero-shot approach (2.2), the few-shot approach (2.3), and the fine-tuning process performed using the LoRA technique (2.4). In Section 3 we report the results of the experiment, which are discussed from both a quantitative (3.1) and a qualitative (3.2) perspective. Section 4 concludes the paper.</p>
      </sec>
      <sec>
        <title>2. Data and Methodologies</title>
        <p>The experiment<sup>3</sup> followed three distinct methodological phases, namely zero-shot prompting, few-shot prompting, and fine-tuning. This progression was introduced to evaluate the effectiveness of different approaches for the given task and determine the advantages and disadvantages of each strategy. Furthermore, an English baseline was established to validate the results of this study, in order to explore the model’s responsiveness to this specific task and to examine how cross-linguistic differences might influence its performance.</p>
        <p>The pretrained model used in all stages of the experiment is Mistral-NeMo<sup>4</sup>, a multilingual open-source model selected because of its balance between performance and efficiency, which makes it optimal for fine-tuning.</p>
        <p><sup>3</sup>The datasets, code, and data used for this experiment are provided in the repository at https://github.com/unipv-larl/llms-ag.</p>
        <p><sup>4</sup>https://mistral.ai/news/mistral-nemo/, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407</p>
        <sec>
          <title>2.1. Datasets</title>
          <p>The testing data used in the experiment consists of two datasets, one made up of (chiefly) monosemous lemmas and the other of polysemous lemmas. This distinction follows the work of [<xref ref-type="bibr" rid="ref21">21</xref>], in which the distinction between the two datasets was based on the number of lemmas associated with the synsets: the so-called polysemous dataset was formed by well-populated synsets, each containing 15 mainly polysemous lemmas, while the so-called monosemous dataset was made up of less populated synsets containing at least two monosemous lemmas. However, in this work the datasets were manually crafted, since the annotated data in the AGWN are too scarce to allow for the same approach: lemmas possessing just one meaning according to the LSJ lexicon ([<xref ref-type="bibr" rid="ref11">11</xref>]) were collected in the monosemous dataset, while lemmas associated with multiple meanings constitute the polysemous dataset. Each of the datasets is composed of 40 lemmas, equally divided among the four PoS types included in WordNets (10 verbs, 10 nouns, 10 adjectives, and 10 adverbs).</p>
          <p>To validate the results against a benchmark, an English baseline (EB) dataset was created. Considering that the English baseline serves as a benchmark to highlight differences in performance between a high-resource modern language such as English and Ancient Greek, a substantial gap between the results for the two target languages is to be expected. The English baseline dataset maintains the distinction between “monosemous” and “polysemous” sets, and its characteristics are the same as those of the test dataset. Thus, the included lemmas have roughly the same meanings as the Ancient Greek words, since they consist of translations and are balanced for PoS. During the translation of the Ancient Greek dataset into English, particular care was taken to preserve the distinctions between the datasets. Lemmas from the monosemy dataset were translated using roughly monosemous English words, while those from the polysemy dataset were rendered with mainly polysemous equivalents.</p>
          <p>The fine-tuning dataset was created by extracting data from back-translation dictionaries, based on the assumption that such dictionaries provide, for any given entry in a modern language, a list of Ancient Greek words that can be used in context to translate that entry, that is, contextual synonyms. An example of a back-translation dictionary entry is offered below:</p>
          <p>• Accusation (subs.): P. katēgoría, hē, katēgórēma, tó, P. and V. aitía, hē, aitíama, tó, énklēma, tó, V. epíklēma, tó ([22]).</p>
          <p>Through a series of processing and cleaning operations, a dictionary of Ancient Greek synonym sets was extracted from the English-Greek Dictionary ([22]) and the Deutsch-Griechisches Wörterbuch ([23]), merging the results obtained from each dictionary to avoid overlap.</p>
          <p>It is important to note that the digital versions of these back-translation dictionaries were obtained through OCR (Optical Character Recognition), which, while generally accurate for modern languages written in the Latin script, yields sub-optimal results for Ancient Greek, often producing incorrectly digitized data and, consequently, inexact outputs. To address this problem, a series of cleaning operations was performed, from encoding normalization to checking the lemmas against the entries of the Brill Dictionary ([24]) to exclude incorrect or non-existent words.</p>
          <p>Such cleaning procedures ensure that the assembled dictionary only contains existing Ancient Greek words in their lemmatized form and that each set of synonyms exclusively features lemmas pertaining to the same PoS. An example of the synonym sets resulting from the data collection procedure is presented below:</p>
          <p>• phrikṓdēs (awe-inspiring): ouránios (heavenly), theîos (divine), deinós (wondrous).</p>
          <p>The resulting dataset in JSONL format was made up of 5,458 sets of synonyms with a mean number of 16 synonyms each (minimum 1, maximum 315 for the lemma peribállō (throw around)), thus divided across PoS: 2,946 nouns (54%), 1,372 verbs (25%), 955 adjectives (18%), and 185 adverbs (3%)<sup>5</sup>.</p>
          <p><sup>5</sup>The data collected for fine-tuning will be imported in the AGWN, to help with the automatic population of the resource.</p>
          <p>The aim of the experiment with the Latin WordNet ([<xref ref-type="bibr" rid="ref21">21</xref>]) was to explore the outcomes and benefits of automating WordNet annotation by fine-tuning a model with data extracted from the WordNet itself. The assumption was that training a model on data of the same type and with the same structure as the desired output might lead to improved results, creating a virtuous feedback loop in which WordNet data are directly used to generate new data for WordNet population. Although the AGWN does not contain sufficient annotated data to provide a suitable training dataset and to support the exact same approach as [<xref ref-type="bibr" rid="ref21">21</xref>], this work is based on the same assumption, since the data collected for fine-tuning shares the same structure and properties as the data in the WordNet, as previously discussed.</p>
        </sec>
        <sec>
          <title>2.2. Zero-Shot Approach</title>
          <p>The first approach of the experiment is zero-shot (ZS) learning. This strategy tests the generalization potential and performance of models in tasks for which they were not specifically trained, since “no demonstrations are allowed, and the model is only given a natural language instruction describing the task” ([25]: 7). Indeed, models pre-trained on various and general datasets are usually able to generalize across new tasks, thus saving the resources needed to create labeled data for additional training or demonstrations ([26]).</p>
          <p>Compared to other approaches, zero-shot learning presents several drawbacks, including difficulty with complex tasks and lower accuracy, as outputs may lack precision or contextual relevance. Moreover, it is highly sensitive to prompt framing, which plays a crucial role in this setting ([27]).</p>
          <p>As the first stage of the experiment, the zero-shot strategy was applied to both the Ancient Greek dataset and the English baseline. The prompts were tailored to each language and followed the best practices of prompt engineering, such as assigning a persona, specifying the desired output format, and organizing assertions as a bullet list ([28]; [29]). For the complete prompts, see A.1 and A.2.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Few-Shot Approach</title>
        <p>In the few-shot (FS) setting, some examples demonstrating the expected output, its format, and style are given to the model to enhance performance, helping it understand the reasoning required for the new task ([25]). This approach has been proven to generally outperform zero- and one-shot learning ([25]; [30]), especially in structured and complex tasks, such as synonym generation. Compared to fine-tuning, this method proves cost-effective because the weights of the model are left unchanged, sparing a computationally intensive process, and only a small set of labeled items is needed, which is convenient in cases of scarcity of data ([27]: 24). However, this strategy is strongly dependent on careful prompt engineering and on suitable and verified examples. Therefore, particular attention is needed when designing the prompts ([31]: 3). As for prompt engineering best practices, performance has been proven to increase the more similar the examples are to the testing data. The choice of examples also seems to have a great effect on the output ([27]: 16).</p>
        <p>To test this approach on the Ancient Greek dataset, an ad-hoc prompt was created by maintaining the basic structure of the zero-shot prompt and adding a set of eight examples featuring the same structure as the desired output. The examples are equally divided into roughly monosemous and polysemous word sets and are balanced for PoS, so that for each of the four PoS two lemmas are provided, that is, one monosemous, the other one polysemous. The examples added to the few-shot prompt are listed in A.3.</p>
        <sec id="sec-2-3-1">
          <title>2.4. Fine-Tuning with LoRA</title>
          <p>A recent trend with demonstrated advantages is to adapt large-scale pre-trained language models to specific downstream tasks. Indeed, a first stage of generative pre-training leads to gaining greater world and language knowledge and, consequently, to improved performance. Then, the following fine-tuning (FT) on domain-specific labeled data updates the pre-trained parameters with a new training cycle to adapt the model to the task at hand. This combination of unsupervised pre-training and supervised fine-tuning results in a semi-supervised approach able to construct a universal representation, which can be applied to a wide array of tasks ([32]: 2).</p>
          <p>Although fine-tuning greatly enhances model performance, it is very resource-intensive. Some strategies were explored to mitigate this issue, such as LoRA (Low-Rank Adaptation), which is a PEFT (Parameter-Efficient Fine-Tuning) method that makes fine-tuning more parameter- and compute-efficient by freezing the pre-trained model’s parameters and adapting only a subset of weight matrices. This method proves to be highly efficient compared to traditional fine-tuning, especially with regard to memory and storage ([33]: 5), meeting and sometimes surpassing the baselines, without increases in inference times ([33]).</p>
          <p>The final step of the experiment involved fine-tuning a task-specific model. This was achieved by fine-tuning the quantized Mistral-NeMo model, which was loaded in 8-bit format to optimize computational efficiency, using the previously described fine-tuning dataset on a GPU node of an HPC cluster. LoRA was used to optimize fine-tuning, setting the low-rank matrix dimension to 8 and the scale factor lora_alpha to 16, with a dropout of 10%. The dataset was split into training (80%) and validation (20%), and the training was set for five epochs with a learning rate of 1e-4. An early stopping mechanism with a patience of one epoch was established to avoid overfitting, and a parameter was set to save the model with the lowest value of validation loss, which corresponded to the output of the fourth epoch. The metrics calculated during fine-tuning over the five epochs of training are presented in Table 1.</p>
          <p>Table 1: Fine-tuning metrics over the five epochs of training. For each metric, the best value is highlighted in bold type.</p>
          <p>Epoch: 1, 2, 3, 4, 5. Training loss: 1.2943, 1.4099, 1.1478, 1.2232, 1.1855. Validation loss: 1.4814, 1.4366, 1.4137, 1.4087, 1.4100. Training mean token accuracy: 0.6587, 0.6262, 0.6597, 0.6720, 0.7206.</p>
          <p>The overall loss trend is descending, even if gradually, both in training and in validation, and the accuracy values are increasing. Overall, the metrics show that the training was conducted successfully and without overfitting.</p>
        </sec>
        <sec>
          <title>3. Results and Discussion</title>
          <p>The validation of the results took place in two steps. The first step was to automatically lemmatize each word using greCy ([34]), so that even inflected forms generated by the model are traced back to the corresponding lemma. Notably, this pre-processing step is pointless in the case of hallucinations or incorrect forms (for a more detailed discussion, see 3.2.1 and 3.2.2). It is worth pointing out that the lemmatization, while correct in most cases, was not always impeccable (e.g., theoí (gods, masculine nominative plural) &gt; theoí (FS)).</p>
          <p>After lemmatization, three human annotators<sup>6</sup> validated the results, determining for each generated item whether it constituted a potential synonym of the input word.</p>
        </sec>
      </sec>
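<p>The epoch-selection rule described in Section 2.4 (save the checkpoint with the lowest validation loss; stop once the loss has failed to improve for more epochs than the patience allows) can be sketched in plain Python over the validation losses reported in Table 1; this is an illustrative reimplementation of the logic, not the actual training code.</p>

```python
# Sketch of the checkpoint-selection rule described in Section 2.4:
# keep the epoch with the lowest validation loss and stop early once
# the loss has failed to improve for more than `patience` epochs.
# The loss values are the ones reported in Table 1.
def select_best_epoch(val_losses, patience=1):
    best_epoch, best_loss = 1, val_losses[0]
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses[1:], start=2):
        if loss >= best_loss:
            bad_epochs += 1          # no improvement this epoch
            if bad_epochs > patience:
                break                # early stopping triggers
        else:
            best_epoch, best_loss = epoch, loss
            bad_epochs = 0
    return best_epoch, best_loss

val_losses = [1.4814, 1.4366, 1.4137, 1.4087, 1.4100]
best_epoch, best_loss = select_best_epoch(val_losses)
# best_epoch is 4 (validation loss 1.4087), the checkpoint kept
# in the experiment.
```

<p>Epoch 4 yields the lowest validation loss (1.4087), matching the checkpoint that was saved during fine-tuning.</p>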
      <sec id="sec-2-4">
        <p><sup>6</sup>The three annotators are all students of the MA program in Linguistics at the University of Pavia with a BA Degree in Classics.</p>
        <p>In cases of disagreement between the annotators, the matter was resolved through discussion until an agreement was reached. The inter-annotator agreement, measured with Fleiss’ Kappa ([35]), reached a value of 0.71 on the Ancient Greek data and 0.66 on the English data, both of which fall under the label of good to substantial agreement. For the purposes of this work, the concept of synonymy is interpreted in a shallow and contextual sense, consistent with the framework upon which the WordNet architecture is based (see footnote 2). Thus, words whose meaning is similar enough that they might be assigned to the same synset are considered potential synonyms, as in 1.</p>
        <p>1. anankázō (force, compel): kratéō (rule, hold sway).</p>
        <p>As for the similarity metric, cosine similarity was computed using pre-trained Word2vec embeddings based on a skip-gram model for both English<sup>7</sup> and Ancient Greek<sup>8</sup>. In a task such as synonym generation, this metric is useful in determining if the output might be a valid synonym of the target word based on semantics and distribution. However, one limitation is represented by out-of-vocabulary (OOV) terms, meaning that in some cases, for both English and Ancient Greek, the metric fails to capture the actual similarity between the generated output and the input lemma, as one or both of the two words are not contained in the embedding dictionary, such as in 2.a and 2.b:</p>
        <p>2.a gourmand: epicure. Similarity: 0.</p>
        <p>2.b katasparássō (tear in pieces): katagnúō (break in pieces). Similarity: 0.</p>
        <p>The results are analyzed both from a quantitative and a qualitative perspective, and the analysis is carried out by comparing the different approaches employed, which are benchmarked against the English baseline. Regarding the quantitative data discussed in Section 3.1, the performance of each of the approaches is evaluated through the metrics of accuracy, similarity, number of generated outputs, and potential synonyms.</p>
      </sec>
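<p>The similarity computation described above, including the zero returned for OOV pairs as in examples 2.a and 2.b, can be sketched as follows; the toy vectors are hypothetical placeholders, not the actual Word2vec embeddings.</p>

```python
import math

# Sketch of the similarity metric described above: cosine similarity
# between embedding vectors, with 0 returned when either word is out
# of vocabulary (OOV), as in examples 2.a and 2.b. The vectors below
# are toy values, not the real Word2vec embeddings.
embeddings = {
    "rule": [0.9, 0.1, 0.3],
    "force": [0.8, 0.2, 0.4],
}

def cosine_similarity(word_a, word_b, vectors):
    if word_a not in vectors or word_b not in vectors:
        return 0.0  # OOV: similarity cannot be computed
    a, b = vectors[word_a], vectors[word_b]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

<p>With real embeddings, the OOV branch is what produces the "Similarity: 0" values reported for pairs like gourmand/epicure.</p>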
      <sec id="sec-2-5">
        <title>While the issue of OOVs afects both English and An</title>
        <p>cient Greek, the latter is more severely impacted by this
problem due to the more limited size of the embedding
dictionary, thus the similarity values for Ancient Greek
tend to be underestimated compared to the English
baseline.</p>
        <p>As shown in Table 2, the two datasets of the English
baseline score the highest values in accuracy, similarity,
3.1. Quantitative Analysis total, and mean of potential synonyms. The results
highThe results of the quantitative analysis are shown in Ta- light that the model reaches a high performance in the
ble 2, which displays the values of the metrics for each task at hand, even in a zero-shot setting without
taskof the approaches, both providing the overall scores and specific demonstrations or training. This result indicates
distinguishing between the polysemous and the monose- that the generalization potential of the model is quite
mous datasets. high for a high-resource language such as English.
As for the zero-shot approach, the first step of the
Table 2 experiment shows a much lower performance compared
Metrics comparison (acc: accuracy, sim: similarity, n_gen: to the English baseline, across all metrics. Considering
number of generated outputs, p_syn: number of potential that pre-trained models have much less data available for
synonyms). For each row, the best scores, excluding those of Ancient Greek compared to modern languages such as
the EB, are highlighted in bold type to facilitate comparison English, the drop in performance and in the number of
across approaches for Ancient Greek synonym generation. generations is to be expected.</p>
        <p>
          acc sim n_gen p_syn Considering now the few-shot approach, the results
EB 90% .377 167 151 show an unexpected drop in performance compared to
l the zero-shot strategy. Indeed, the instructions given in
lra ZS 30% .261 116 34 the prompt apparently do not help the model, but rather
ev FS 5% .099 169 9 afect the outputs negatively. However, it is important to
OFT 11% .077 403 43 point out that the number of generated outputs increases
y EB 98% .407 85 83 compared to the zero-shot approach, reaching the same
value as the English baseline. Finally, the results of the fine-tuned model register an overall increase in performance compared to the few-shot approach. Compared to zero-shot learning, this approach scores lower accuracy and similarity, but registers a higher number of validated potential synonyms. This is because the number of generated outputs increases greatly, surpassing even the English baseline, which makes accuracy drop, since only a portion of the outputs are potential synonyms. While the zero-shot approach is more accurate in its generations, fine-tuning leads to a greater number of generated synonyms and, in turn, of validated potential synonyms. This trade-off might prove advantageous for automating population with a human-in-the-loop approach, since on average a higher number of potential synonyms is generated and the human annotator can efficiently discard inappropriate generations, as the average number of outputs for each input word is moderate (around 5).</p>
        <p>ZS 40% .296 63 24; FS 7% .066 61 4; FT 13% .113 288 38; 83% .347 68; 19% .226 10 (table fragment).</p>
        <p>Our findings show that the results of the English baseline greatly outperform those of the other approaches across all metrics but the number of generations, which is highest for the fine-tuned model. Considering the progression of the approaches adopted in the experiment, one can note that the accuracy and similarity scores drop at every stage of the experiment, contrary to the expectations discussed in Sections 2.2-2.4 and to the results of [<xref ref-type="bibr" rid="ref21">21</xref>]. On the other hand, the number of generated outputs steadily increases with each stage of the experiment. The differences in performance across the stages of this experiment, when compared to the results with Latin reported by Santoro et al., are likely due to the difference in performance between the polysemous and monosemous datasets, which, contrary to the results of [<xref ref-type="bibr" rid="ref21">21</xref>], applies not only to Ancient Greek, but also to English. A possible explanation for this phenomenon is that polysemous words tend to be more frequent than monosemous words ([39]). As the frequency of a word in pre-training data impacts the LLM’s ability to learn its representation ([40]), more frequent words can be linked to higher performance levels, as they are encountered in a wider variety of contexts during model pre-training. Moreover, in a task such as synonym generation, language models likely perform better with polysemous than with monosemous words, as the former encode richer semantic information, resulting in a higher probability of generating suitable outputs: the model is provided with a broader semantic basis from which to draw suitable candidates.</p>
        <p>3.2. Qualitative Analysis</p>
        <p>Examples of generations across approaches, divided into the monosemous and polysemous datasets, are shown in Table 3.
        </p>
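        <p>The accuracy/yield trade-off discussed above can be made concrete with a small bookkeeping sketch (our own illustration, not the authors’ code): accuracy here is the share of generated outputs that end up validated, and annotator workload is the average number of outputs per input word. All counts below are invented.</p>

```python
def evaluate(records):
    """records: {input_lemma: (generated_outputs, validated_synonyms)}"""
    total_generated = sum(len(gen) for gen, _ in records.values())
    total_validated = sum(len(val) for _, val in records.values())
    return {
        "generated": total_generated,
        "validated": total_validated,
        # share of outputs that turn out to be potential synonyms
        "accuracy": total_validated / total_generated if total_generated else 0.0,
        # average annotator workload per input word
        "avg_outputs": total_generated / len(records) if records else 0.0,
    }

# Invented toy records: 5 outputs per word, a minority validated.
records = {
    "krisis":  (["a", "b", "c", "d", "e"], ["a", "b"]),
    "plethos": (["f", "g", "h", "i", "j"], ["f"]),
}
metrics = evaluate(records)
```

        <p>With a fixed, moderate number of outputs per word, a lower accuracy still yields more validated synonyms overall, which is the trade-off exploited in the human-in-the-loop scenario.</p>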
      </sec>
      <sec id="sec-2-6">
        <title>Taking a closer look at incorrectly generated outputs</title>
        <p>4. homôs (similarly): hómoios (similar) (FS).</p>
        <p>Another type of task misalignment that was frequently observed in Santoro et al. [<xref ref-type="bibr" rid="ref21">21</xref>] was the generation of multi-word expressions, despite instructions in the prompt explicitly prohibiting it. Notably, such instances are extremely rare in our results, with just a few occurrences (e.g. met’hautoû (afterwards) (ZS)).</p>
        <p>Several typologies of orthographic errors and inconsistencies were observed. Across approaches, some outputs were written using multiple alphabets: alongside Greek characters, characters from other scripts appeared, such as Latin, Cyrillic, and Arabic (e.g. dapánawm, blētē ́rioны). Interestingly, these types of errors are less frequent in the zero-shot setting than in the other approaches.</p>
        <p>A second typology of orthographic errors is closely tied to the internal conventions of Ancient Greek. Across all three training settings, lemmas were generated lacking either the accent (7.a) or the initial breathing mark (7.b). In other cases, the lemmas were generated with an incorrect accent (7.c).</p>
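        <p>A check for outputs that mix Greek with other scripts can be sketched with the standard library alone (the function names and examples are ours): the script of each letter is recoverable from its Unicode character name.</p>

```python
import unicodedata

def scripts_used(word):
    """Unicode script keywords of the letters in `word`,
    e.g. 'GREEK SMALL LETTER ALPHA' -> 'GREEK'."""
    return {unicodedata.name(ch).split()[0] for ch in word if ch.isalpha()}

def is_pure_greek(word):
    """True when every letter of `word` belongs to the Greek script."""
    return scripts_used(word) <= {"GREEK"}

# A generation mixing Greek with Latin or Cyrillic letters is flagged:
is_pure_greek("λογος")   # all Greek letters
is_pure_greek("λογoς")   # contains a LATIN small 'o'
```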
      </sec>
      <sec id="sec-2-7">
        <title>Across all three approaches, the generations include cases of hallucinations</title>
        <p>‘Hallucination’ refers to ‘generated content that is nonsensical or unfaithful to the provided source content’ ([41]). Previous literature has observed that hallucinations are amplified by the scarcity of data when dealing with low-resource languages ([42], [43]). Hallucinations are far more frequent in the FS and FT approaches than in ZS. In some cases, the hallucinations share features with the input words, such as the root (see 5.a) or the prefix (5.b); in other cases, no such formal relationship seems to exist (5.c).</p>
      </sec>
      <sec id="sec-2-8">
        <title>7.a krísis (dispute): kindunos (vs kíndunos) (danger) (FT)</title>
        <p>7.b hellēnikós (Greek): ellēnēios (vs hellēnēios) (Greek) (FS).</p>
        <p>7.c kritē ́s (judge): brabeûs (vs brabeús) (arbiter) (FS).</p>
      </sec>
      <sec id="sec-2-9">
        <title>Notably, such incorrect generations are much less frequent in the zero-shot setting</title>
        <p>One may hypothesize that these errors are related to the fact that Modern Greek lacks the initial breathing mark and the iota subscript, and retains a single accent type. A similar type of orthographic inconsistency, affecting only two generations, is the use of the iota adscript instead of the iota subscript. For the target word kléptēs (thief), the few-shot and fine-tuning outputs are, respectively, lēistē ́s (robber) and lēïstē ́s. While such instances are linguistically and philologically correct, they were not validated as potential synonyms, since they are not compatible with the AGWN graphic standard regarding the iota subscript.</p>
        <p>5.a plêthos (multitude): poluplēstía (ZS).</p>
        <p>5.b diakrínō (distinguish): dialúeimi, diēkribállēn (FS).</p>
        <p>5.c eupetôs (easily): tlēmatikós (FT).</p>
        <sec id="sec-2-9-1">
          <title>3.2.1. Non-Ancient Greek Generations</title>
          <p>Notably, some of the outputs are generated in languages other than Ancient Greek, namely English and Modern Greek, even though the prompt specifically instructs the model to avoid this behavior (see A.1 and A.2). The inability of LLMs to consistently generate text in a user’s desired language is widely known in NLP and is referred to as language confusion ([44]). Examples of language confusion in the model’s generations are presented in 6.a and 6.b.</p>
          <p>6.a arktikós (northern): psēlóten/flutter/tall (FT).</p>
          <p>6.b éris (strife): antagōnismós (competition) (ZS).</p>
          <p>Notably, Mistral models have been found to exhibit high degrees of language confusion ([44]), so the presence of languages other than Ancient Greek in the model’s output is not surprising. The problem of English generations also impacted the results of Santoro et al., even though such instances are quite rare in our study. On the contrary, the outputs in Modern Greek are much more numerous, which could depend on an interference effect of the target language’s script: the model likely tends to produce outputs in a higher-resource modern language sharing the same script, as with Latin and English on the one hand, and Ancient Greek and Modern Greek on the other.</p>
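        <p>The polytonic conventions at issue here can be checked mechanically. A sketch (ours, standard library only) that tests for a breathing mark and for the iota subscript after canonical Unicode decomposition:</p>

```python
import unicodedata

BREATHINGS = {"\u0313", "\u0314"}   # combining smooth (psili) / rough (dasia) breathing
IOTA_SUBSCRIPT = "\u0345"           # combining ypogegrammeni (iota subscript)

def combining_marks(word):
    """All combining marks in `word` after canonical (NFD) decomposition."""
    return [ch for ch in unicodedata.normalize("NFD", word)
            if unicodedata.combining(ch)]

def has_breathing(word):
    return any(m in BREATHINGS for m in combining_marks(word))

def has_iota_subscript(word):
    return IOTA_SUBSCRIPT in combining_marks(word)

# ἀνήρ carries a smooth breathing; a bare spelling like ανηρ would be flagged.
```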
        </sec>
      </sec>
      <sec id="sec-2-10">
        <title>3.2.3. Potential Synonyms</title>
        <p>Considering now the generations that were validated as potential synonyms, some interesting observations emerged from the results. One phenomenon that was observed is the generation of rare lemmas or lexical items dating to the Postclassical stages of Ancient Greek (e.g., the Roman or Byzantine period, [45]: 3-6).</p>
        <p>For example, as a synonym for kritē ́s (judge) the model generates lutē ́r (arbitrator), a rare lemma that occurs only 6 times in the Thesaurus Linguae Graecae (TLG; accessed July 2025). Only three of these instances are found in Classical texts, while the remaining occurrences come from texts belonging to the Imperial and Byzantine periods. Furthermore, the meaning ‘arbitrator’ associated with lutē ́r is rare, as it is attested for only one of its occurrences (A.Th.940), while it usually means ‘deliverer’. An example of a generation consisting of a Postclassical lemma is boreinós (northern), generated as a synonym for arktikós (northern): it is attested 7 times in the TLG, all in Imperial Greek and later, and eventually gives rise to the Modern Greek term vorinós. While unexpected, these phenomena do not impact the potential for the automatic population of the AGWN proposed in this work, since the AGWN collects lemmas independently of their frequency or the language stage in which they are attested.</p>
        <p>Focusing now on the difference in performance depending on the PoS of the input lemma, Table 4 shows for each approach the number of generations and the number of validated synonyms across PoS, both divided by dataset and overall.</p>
        <p>Table 4: Model performance across PoS (Tot: generations for PoS; Syn: potential synonyms for PoS). For each cell, the highest value is presented in bold type to facilitate comparison.</p>
        <p>A first possible explanation for these PoS trends lies in the distribution of the training data used for fine-tuning, in which nouns and verbs constituted the majority classes, making up, respectively, 54% and 25% of the dataset (see Section 2.1), possibly resulting in a bias of the fine-tuned model. Furthermore, another possible explanation is connected to the difference in performance between the (roughly) polysemous and monosemous datasets already discussed in Section 3.1: independently of the PoS of the input word, the performance of the model is better for polysemous input words across all approaches but FS. Indeed, verbs are generally considered more polysemous than other PoS, as their meanings are thought to be more flexible, thus encoding richer semantics ([46], [47]). Nouns also exhibit a high degree of polysemy ([48]). Since, as already discussed, polysemous words tend also to be more frequent, the increase in performance for these PoS may be linked both to a higher frequency in the training data and to their greater polysemy, which provides a broader semantic basis for the generation task at hand.</p>
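        <p>The per-PoS bookkeeping behind Table 4 amounts to counting generations and validations per part of speech. A sketch with invented per-item data (only the noun 32/215 and verb 32/201 totals echo figures quoted in the text):</p>

```python
from collections import defaultdict

def pos_table(generations):
    """generations: iterable of (pos, was_validated) pairs."""
    table = defaultdict(lambda: {"tot": 0, "syn": 0})
    for pos, validated in generations:
        table[pos]["tot"] += 1          # every generation counts toward Tot
        table[pos]["syn"] += int(validated)  # only validated ones toward Syn
    return dict(table)

# Invented item-level data reproducing the quoted noun/verb totals.
data = ([("noun", True)] * 32 + [("noun", False)] * 183
        + [("verb", True)] * 32 + [("verb", False)] * 169)
table = pos_table(data)
# table["noun"] == {"tot": 215, "syn": 32}
# table["verb"] == {"tot": 201, "syn": 32}
```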
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusions</title>
        <p>This work has explored the potential of LLMs in the semi-automatic population of the AGWN, evaluating and
comparing multiple approaches. The first approach tested
was zero-shot, which, despite the lack of examples,
generated numerous potential synonyms and achieved
considerable accuracy and similarity scores, given the task
at hand. Contrary to expectations, the few-shot setting
marked a decline in results across all evaluation metrics,
except the number of generations. Finally, fine-tuning
outperformed the few-shot setting, but scored lower
accuracy and similarity values compared to zero-shot
prompting. However, this approach scored the highest number
of generated outputs and potential synonyms.</p>
      <p>
        Notably, the PoS for which the model generated the highest number of outputs is nouns (215), followed by verbs (201). However, these overall results are highly influenced by the FT data, which are very abundant and have a great impact on the total. If we consider the ZS and FS approaches alone, the PoS with the most numerous outputs is adjectives (ZS: 36; FS: 54). The PoS with the lowest number of generations is adverbs, a trend that is quite stable across approaches, independently of the dataset considered. Concerning the number of validated synonyms across PoS, the highest number of potential synonyms is generated for nouns (32/215) and verbs (32/201), even though this general trend does not apply to the ZS approach, in which adjectives score the highest number of potential synonyms. Overall, adverbs score the lowest number of potential synonyms (5/116). The reason for this difference in generation trends across PoS may lie in the distribution of the training data used for fine-tuning.</p>
        <p>The divergence between our results and the outcomes of Santoro et al.’s analysis [<xref ref-type="bibr" rid="ref21">21</xref>] is likely due to the more recent language model employed, which shows enhanced zero-shot performance, and to the different target language, as the variation in available data and writing system between Greek and Latin can significantly impact the results. Our analysis shows that, for the task at hand, the zero-shot approach represents a promising starting point for partially automating the population of the AGWN, without needing the resources necessary for fine-tuning a model. Zero-shot generations reach good accuracy and similarity scores, and in the majority of cases outputs are correctly spelled and lemmatized. On the other hand, while fine-tuning results in lower precision, it leads to a greater number of generations and potential synonyms. This approach, while not as accurate as zero-shot, might prove suitable in a human-in-the-loop scenario.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>The fine-tuning of the model presented in this work was</title>
        <p>carried out on the High Performance Computing
DataCenter at IUSS, co-funded by Regione Lombardia through
the funding programme established by Regional Decree
No. 3776 of November 3, 2020. The authors wish to
express their sincere gratitude to Cristiano Chesi for
granting access to the HPC cluster.</p>
        <p>Research for this study was funded through the European Union Funding Program – NextGenerationEU.</p>
        <p>S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, in: ICLR 2022 - 10th International Conference on Learning Representations, 2022.</p>
        <p>[34] J. Myerston, J. López, grecy: Ancient greek spacy models for natural language processing in python, 2023.</p>
        <p>[35] J. L. Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin 76 (1971). doi:10.1037/h0031619.</p>
        <p>[36] S. Stopponi, N. Pedrazzini, S. Peels-Matthey, B. McGillivray, M. Nissim, Natural language processing for ancient greek, Diachronica 41 (2024) 414–435. URL: https://www.jbe-platform.com/content/journals/10.1075/dia.23013.sto. doi:10.1075/dia.23013.sto.</p>
        <p>[37] H. Nguyen, K. Mahajan, V. Yadav, J. Salazar, P. S. Yu, M. Hashemi, R. Maheshwary, Prompting with phonemes: Enhancing llms’ multilinguality for non-latin script languages, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, 2025, pp. 11975–11994. URL: https://aclanthology.org/2025.naacl-long.599/. doi:10.18653/v1/2025.naacl-long.599.</p>
        <p>[38] O. Shliazhko, A. Fenogenova, M. Tikhonova, A. Kozlova, V. Mikhailov, T. Shavrina, mgpt: Few-shot learners go multilingual, Transactions of the Association for Computational Linguistics 12 (2024). doi:10.1162/tacl_a_00633.</p>
        <p>[39] G. K. Zipf, The meaning-frequency relationship of words, Journal of General Psychology 33 (1945). doi:10.1080/00221309.1945.10544509.</p>
        <p>[40] T. Fu, R. Ferrando, J. Conde, C. Arriaga-Prieto, P. Reviriego, Why do large language models (llms) struggle to count letters?, 2024. doi:10.48550/arXiv.2412.18626.</p>
        <p>[41] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.</p>
        <p>[42] N. M. Guerreiro, D. M. Alves, J. Waldendorf, B. Haddow, A. Birch, P. Colombo, A. F. Martins, Hallucinations in large multilingual translation models, Transactions of the Association for Computational Linguistics 11 (2023). doi:10.1162/tacl_a_00615.</p>
        <p>[43] M. Abdelrahman, Hallucination in low-resource languages: Amplified risks and mitigation strategies for multilingual llms, Journal of Applied Big Data Analytics, Decision-Making, and Predictive Modelling Systems 8 (2024) 17–24. URL: https://polarpublications.com/index.php/JABADP/article/view/2024-12-10.</p>
        <p>[44] K. Marchisio, W.-Y. Ko, A. Bérard, T. Dehaze, S. Ruder, Understanding and mitigating language confusion in llms, 2024. doi:10.48550/arXiv.2406.20052.</p>
        <p>[45] G. D. Bartolo, D. Kölligan, Postclassical Greek: Problems and Perspectives, De Gruyter, 2024.</p>
        <p>[46] C. Fellbaum, English Verbs as a Semantic Net, International Journal of Lexicography 3 (1990) 278–301. URL: http://dx.doi.org/10.1093/ijl/3.4.278. doi:10.1093/ijl/3.4.278.</p>
        <p>[47] D. Gentner, I. M. France, The verb mutability effect: Studies of the combinatorial semantics of nouns and verbs, 2013. doi:10.1016/B978-0-08-051013-2.50018-5.</p>
        <p>[48] A. A. Freihat, F. Giunchiglia, B. Dutta, A taxonomic classification of wordnet polysemy types, in: Proceedings of the 8th Global WordNet Conference, GWC 2016, 2016.</p>
        <p>Online Resources</p>
        <p>• Thesaurus Linguae Graecae® Digital Library. Ed. Maria C. Pantelia. University of California, Irvine (accessed May 31 2025).</p>
        <p>A. Prompts Used in the Experiment</p>
        <p>This appendix contains the full prompts used in the experiment for both Ancient Greek and English.</p>
        <p>A.1. Ancient Greek Prompt</p>
        <p>zs_prompt = f"""You are a powerful AI assistant trained in semantics and Classics.
You are an Ancient Greek native speaker. The only language you speak is Ancient Greek.
Your task is to provide a bullet list of Ancient Greek synonyms for a user-chosen word.
Your response must contain the generated synonyms as comma-separated values.
Observe the following instructions very closely: [INST]
- Generate only Ancient Greek synonyms.
- Provide single-word expressions ONLY.
- Do NOT generate long phrases.
- Make sure to provide numerous synonyms for each lemma.
-- ABSOLUTELY AVOID including any additional explanations or comments in your output.
- VERY IMPORTANT: DO NOT translate the words.
- VERY IMPORTANT: Use ANCIENT GREEK exclusively.
- VERY IMPORTANT: Generate ANCIENT GREEK lemmas in the original script with accurate diacritics (accents, breathing marks, and vowel quantity for long vowels indicated by macrons or other notations).
- VERY IMPORTANT: Make sure the outputs are spelled correctly.
- IMPORTANT: Do NOT generate any word in Modern Greek.
- IMPORTANT: Generate words with the same part of speech as the input word, for example if the input word is a verb you must generate only verbs as synonyms.
-- For NOUNS generate only the NOMINATIVE CASE, as shown in the examples below.
-- For VERBS generate only the FIRST-PERSON SINGULAR of the INDICATIVE.
-- List each Ancient Greek word separately with proper formatting.
"""</p>
        <p>A.3. Examples for the Few-Shot Prompt</p>
        <p>word: ’nouthetē ́seis’
synonyms: [’paramuthía’, ’protropē ́’, ’parakéleusis’, ’parórmēsis’, ’paroksusmós’, ’peithō ́’, ’pístis’, ’kéntron’, ’múōps’, ’paraínesis’]</p>
        <p>word: ’atimázō’
synonyms: [’kataiskhúnō’, ’aischúnō’, ’atimóō’, ’atimáō’]</p>
        <p>word: ’theosebē ́s’
synonyms: [’deisidaímōn’, ’eusebēē ́s’, ’eúphēmos’, ’pistós’]</p>
        <p>word: ’autoû’
synonyms: [’entaûtha’, ’entháde’, ’autóthi’, ’éntha’, ’ekeî’]</p>
        <p>word: ’trophē ́’
synonyms: [’deîpnon’, ’edōdē ́’, ’sîtos’, ’édesma’]</p>
        <sec id="sec-4-1-1">
          <title>A.2. English Prompt</title>
          <p>en_prompt = f"""You are a powerful AI assistant trained in semantics. You are an English native speaker. Your task is to provide a bullet list of English synonyms for a user-chosen word.</p>
          <p>Your response must contain the generated synonyms as comma-separated values.</p>
          <p>Observe the following instructions very closely: [INST]
- Generate only English synonyms.
- Provide single-word expressions ONLY.
- Do NOT generate long phrases.
- Make sure to provide numerous synonyms for each lemma.
-- ABSOLUTELY AVOID including any additional explanations or comments in your output.
- VERY IMPORTANT: Make sure the outputs are spelled correctly.
- IMPORTANT: Generate words with the same part of speech as the input word, for example if the input word is a verb you must generate only verbs as synonyms.
-- List each English word separately with proper formatting.
"""</p>
          <p>word: ’iskhurós’
synonyms: [’drastē ́rios’, ’karterós’, ’energē ́s’, ’rhōmaléos’, ’krataíos’, ’óbrimos’, ’sthenarós’, ’kraterós’]</p>
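          <p>Few-shot examples like those in A.3 can be appended to the base instruction mechanically. A sketch of one way to do it (the joining format is our assumption, not the authors’ exact code):</p>

```python
def build_few_shot_prompt(base_prompt, examples):
    """examples: list of (word, synonyms) pairs, as in Appendix A.3."""
    blocks = [base_prompt.rstrip()]
    for word, synonyms in examples:
        listed = ", ".join(f"’{s}’" for s in synonyms)
        blocks.append(f"word: ’{word}’\nsynonyms: [{listed}]")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    "Provide Ancient Greek synonyms as comma-separated values.",
    [("atimázō", ["kataiskhúnō", "aischúnō", "atimóō", "atimáō"])],
)
```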
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <article-title>Wordnet and wordnets</article-title>
          ,
          <source>in: Encyclopedia of Language and Linguistics</source>
          ,
          <string-name>
            <surname>Second Edition</surname>
          </string-name>
          , Elsevier,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beckwith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Introduction to wordnet: An on-line lexical database</article-title>
          ,
          <source>International journal of lexicography 3</source>
          (
          <year>1990</year>
          )
          <fpage>235</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <string-name>
            <surname>WordNet:</surname>
          </string-name>
          <article-title>An electronic lexical database</article-title>
          , GMA, MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <article-title>Wordnet then and now</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>41</volume>
          (
          <year>2007</year>
          )
          <fpage>209</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vossen</surname>
          </string-name>
          , Introduction to eurowordnet, Computers and the
          <string-name>
            <surname>Humanities</surname>
          </string-name>
          (
          <year>1998</year>
          )
          <fpage>73</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minozzi</surname>
          </string-name>
          ,
          <article-title>The latin wordnet project</article-title>
          ,
          <source>Latin Linguistics Today. Akten des 15</source>
          .
          <string-name>
            <surname>Internationalem</surname>
          </string-name>
Kolloquiums
          <source>zur Lateinischen Linguistik</source>
          (
          <year>2009</year>
          )
          <fpage>707</fpage>
          -
          <lpage>716</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bizzoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Boschetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Del Gratta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Diakof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Monachini</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Crane,</surname>
          </string-name>
          <article-title>The making of ancient greek wordnet</article-title>
          ,
          <source>Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          (
          <year>2014</year>
          )
          <fpage>1140</fpage>
          -
          <lpage>1147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hellwig</surname>
          </string-name>
          ,
          <article-title>The making of ancient greek wordnet</article-title>
          ,
          <source>Proceedings of the 12th Inter- national Conference on Computational Semantics (IWCS) 137</source>
          (
          <year>2017</year>
          )
          <fpage>3934</fpage>
          -
          <lpage>3941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Biagetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zanchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Short</surname>
          </string-name>
          ,
          <article-title>Toward the creation of WordNets for ancient Indo-European languages</article-title>
          , in: P. Vossen,
          <string-name>
            <surname>C.</surname>
          </string-name>
          Fellbaum (Eds.),
          <source>Proceedings of the 11th Global Wordnet Conference, Global Wordnet Association</source>
          , University of South Africa (UNISA),
          <year>2021</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>266</lpage>
          . URL: https:// aclanthology.org/
          <year>2021</year>
          .gwc-
          <volume>1</volume>
          .30/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] F. Khan, F. J. Minaya Gómez, R. Cruz González, H. Diakof, J. E. Diaz Vera, J. P. McCrae, C. O’Loughlin, W. M. Short, S. Stolk, Towards the construction of a wordnet for old english, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 2022, pp. 3934–3941. T. Facchinetti, R. Ginevra, C. Zanchi, Exploring latin wordnet synset annotation with llms, Global WordNet Conference 2025 (2025).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] H. G. Liddell, R. Scott, H. S. Jones, R. McKenzie, A Greek-English Lexicon, 9th ed., revised and augmented throughout, Clarendon Press, Oxford, 1996. [22] S. C. Woodhouse, English-Greek Dictionary, George Routledge &amp; Sons, Limited, 1910. [23] V. C. F. Rost, Deutsch-griechisches Wörterbuch, Vandenhöck und Ruprecht, 1829.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] M. L. Murphy, Lexical meaning, Cambridge University Press, 2010. [24] F. Montanari, The Brill Dictionary of Ancient Greek, Brill, 2015.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] E. Sausa, Toward an ancient greek wordnet, paper presented at the Workshop on WordNet and SketchEngine, Pavia, March 2012. [25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, et al., Language models are few-shot learners, 2020.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] B. Sagot, D. Fišer, Extending wordnets by learning from multiple resources, in: LTC’11: 5th Language and Technology Conference, 2011.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          , C. Girardi, MultiWordNet: E. Sigler,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          , J. Clark,
          <article-title>developing an aligned multilingual database</article-title>
          , in: C.
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I. Sutskever</given-names>
          </string-name>
          , First International Conference on Global WordNet ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learn2002</article-title>
          . ers,
          <source>Advances in Neural Information Processing</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roventini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alonge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bertagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzo- Systems</surname>
          </string-name>
          (
          <year>2020</year>
          ). lari, J. Cancila,
          <string-name>
            <given-names>C.</given-names>
            <surname>Girardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Marinelli</surname>
          </string-name>
          , [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Speranza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zampolli</surname>
          </string-name>
          , Italwordnet:
          <article-title>Building a I. Sutskever, Language models are unsupervised large semantic database for the automatic treatment multitask learners | enhanced reader, OpenAI Blog of the italian language</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>1</volume>
          (
          <year>2019</year>
          ). in Pisa, Special
          <string-name>
            <surname>Issue</surname>
          </string-name>
          (
          <year>2003</year>
          )
          <fpage>745</fpage>
          -
          <lpage>791</lpage>
          . [27]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , H. Hayashi,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Biagetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giuliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zampetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luraghi</surname>
          </string-name>
          , G. Neubig,
          <article-title>Pre-train, prompt, and predict: A C. Zanchi, Combining neo-structuralist and cogni- systematic survey of prompting methods in nattive approaches to semantics to build wordnets for ural language processing</article-title>
          , ACM Comput.
          <article-title>Surv. ancient languages: Challenges and perspectives</article-title>
          , in:
          <volume>55</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3560815. M.
          <string-name>
            <surname>Zock</surname>
            , E. Chersoni,
            <given-names>Y.-Y.</given-names>
          </string-name>
          <string-name>
            <surname>Hsu</surname>
            , S. de Deyne (Eds.), doi:10.1145/3560815. Proceedings of the Workshop on Cognitive As- [28]
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>McDonell</surname>
          </string-name>
          ,
          <article-title>Prompt programming pects of the Lexicon @ LREC-COLING 2024, ELRA for large language models: Beyond the few-shot and ICCL, Torino</article-title>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>161</lpage>
          . URL: paradigm,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2102. https://aclanthology.org/
          <year>2024</year>
          .cogalex-
          <volume>1</volume>
          .18/. 07350. arXiv:
          <volume>2102</volume>
          .
          <fpage>07350</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rutten</surname>
          </string-name>
          , E. Lefever, Pilot study for [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <article-title>Habert language modelling and morphological analy- jishirzi, Reframing instructional prompts to gptk's sis for ancient and medieval greek</article-title>
          ,
          <source>in: Proceedings language</source>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2109. of the 5th Joint SIGHUM Workshop on Compu-
          <volume>07830</volume>
          . arXiv:
          <volume>2109</volume>
          .07830.
          <article-title>tational Linguistics for Cultural Heritage</article-title>
          , Social [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , Sciences, Humanities and Literature, Association B.
          <string-name>
            <surname>Lester</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Finetuned for Computational Linguistics, Punta Cana, Do- language models are zero-shot learners</article-title>
          ,
          <source>in: ICLR minican Republic (online)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>135</lpage>
          . URL:
          <fpage>2022</fpage>
          - 10th International Conference on Learning https://aclanthology.org/
          <year>2021</year>
          .latechclfl-
          <volume>1</volume>
          .
          <fpage>15</fpage>
          .
          <string-name>
            <surname>Representations</surname>
          </string-name>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Sandhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Adideva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Komal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Modani</surname>
          </string-name>
          , [31]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A practical survey on zero-shot prompt</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Muthiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <article-title>Evaluating design for in-context learning, in: Proneural word embeddings for sanskrit</article-title>
          , https://arxiv.
          <source>ceedings of the Conference Recent Advances org/pdf/2104</source>
          .00270.pdf,
          <year>2021</year>
          . Accessed: [Insert ac- in
          <source>Natural Language Processing - Large Lancess date here]. guage Models for Natural Language Process-</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jussen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Geelhaar</surname>
          </string-name>
          , W. Trautmann, ings, RANLP, INCOMA Ltd., Shoumen, BulD. Sacha,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schwandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glądalski</surname>
          </string-name>
          , D. Lücke, garia,
          <year>2023</year>
          , pp.
          <fpage>641</fpage>
          -
          <lpage>647</lpage>
          . URL: http://dx.doi.org/ R. Gleim,
          <source>The frankfurt latin lexicon: From mor- 10</source>
          .26615/
          <fpage>978</fpage>
          -954-452-092-2_
          <fpage>069</fpage>
          . doi:
          <volume>10</volume>
          .26615/ phological expansion and word embeddings to 978-954-452-092-2_
          <fpage>069</fpage>
          . semiographs,
          <source>Studi e Saggi Linguistici</source>
          <volume>58</volume>
          (
          <year>2020</year>
          ) [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <article-title>Improving language understanding by 121-155</article-title>
          . doi:
          <volume>10</volume>
          .4454/ssl.v58i1.265.
          <string-name>
            <surname>generative</surname>
          </string-name>
          pre-training, Homology, Homotopy and
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Marchesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zampetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Tredici</surname>
          </string-name>
          , Applications
          <volume>9</volume>
          (
          <year>2018</year>
          ). E.
          <string-name>
            <surname>Biagetti</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Litta</surname>
            ,
            <given-names>C. R.</given-names>
          </string-name>
          <string-name>
            <surname>Combei</surname>
            , S. Rocchi, [33]
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Wallis</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Allen-Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>