<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Mapping Meaning in Latin with Large Language Models: A Multi-Task Evaluation of Preverbed Motion Verbs and Spatial Relation Detection in LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Farina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Ballatore</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara McGillivray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>King's College London</institution>
          ,
          <addr-line>Strand Campus, Strand, WC2R 2LS, London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
<p>This paper evaluates the capabilities of Large Language Models (LLMs) on three interrelated linguistic tasks in Latin: preverbed motion verb identification, spatial relation (SR) classification, and SR type disambiguation. We evaluate GPT-4, Llama, and Mistral under zero-shot and few-shot settings, using a manually annotated dataset of Latin sentences drawn from different authors, text types, and historical periods (3rd century BCE - 2nd century CE) as our gold standard. Results show that GPT-4 consistently outperforms open-weight models, particularly in zero-shot scenarios, likely due to its substantial pretraining exposure to Latin. However, even GPT-4 struggles with syntactic disambiguation, especially in linking proper nouns to their governing verbs. SR classification performance is skewed by dataset imbalance, and SR type disambiguation errors often stem from over-reliance on salience over syntax. Qualitative analysis reveals common patterns of overgeneration and uncertainty across tasks. Our findings underscore the potential of LLMs for historical language processing while highlighting persistent challenges related to ambiguity, entity linking, and syntactic reasoning. This study represents the first evaluation of SR recognition in historical languages and lays the groundwork for future domain-adapted fine-tuning approaches in Computational Humanities.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Latin</kwd>
        <kwd>motion verbs</kwd>
        <kwd>spatial relation classification</kwd>
        <kwd>SR type disambiguation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>set of continuous locations crossed by  while moving
from the Source to the Goal) [3]. This usually happens
The central aim of this study is to evaluate the ability of both in literal and non-literal contexts [4].
Large Language Models (LLMs) to analyse spatiality in This paper explores to what extent LLMs can handle
Latin texts, with a focus on motion events and their syn- such constructions in Latin, taken as an example of a
histactic and semantic environments. In Latin, motion verbs, torical and morphologically complex language. We focus
i.e., verbs denoting movement (cf. class 51 in [1]), often on preverbed motion verbs as an area that demands the
combine with preverbs — prefixes that attach onto verbal integration of lexical, syntactic, and spatial information.
bases to express (among other things) nuanced spatial To evaluate the models’ performance, we design three
meanings (cf. Section 4.2). For example, the Latin mo- linguistic tasks targeting diferent layers of interpretation
tion verb eo ‘go’ can be prefixed with diferent preverbs, relevant to motion events (Section 3). Preverbs often
prowhich deeply modify its semantics (e.g., the preverbs ex- vide crucial cues to argument structure and directionality
‘out of’ and in- ‘into’ generate exeo ‘exit’ and ineo ‘enter’). (e.g., abeo ‘go away’ vs. adeo ‘go toward/to’), which may
This preverbal modification is crucial for encoding spatial pose significant challenges for automatic disambiguation
relations (SRs) in Latin, as directionality and argument with LLMs. This allows us to assess the extent to which
structure are frequently expressed jointly by the verbal LLMs are able to perform linguistic annotation on
chalroot and its preverb. Motion events [2] involve an entity lenging verbal constructions such as preverbed motion
 moving from a Source (the starting point of motion) to verbs, which are structurally more complex than their
a Goal (the ending point of motion), and along a Path (the non-preverbed counterparts.</p>
<p>NLP tasks [5]. Among the most influential recent developments in Latin NLP is the introduction of contextualised language models. LatinBERT [6], a contextualised model trained on a substantial corpus comprising 642.7 million words spanning from Classical Antiquity to the contemporary period, has been shown to perform well in tasks such as lemmatisation, part-of-speech (POS) tagging, and syntactic parsing. LatinBERT has also shown promise in word sense disambiguation [7, 8] and named entity recognition [9].</p>
      <p>Generative LLMs have demonstrated impressive performance across several NLP tasks [10, 11]. However, their success relies on vast amounts of data [12, 13], which is not typically achieved by most historical corpora. The potential of LLMs for Latin is beginning to be systematically evaluated. Volk et al. [14] showed that GPT-4-based machine translation substantially outperforms previous approaches when tested on 16th-century correspondence written in Latin and Early New High German. In addition to translation, they also evaluated GPT-4 for paragraph-level summarisation of Latin texts, with its output compared against human-generated summaries.</p>
      <p>Parallel to these developments, efforts have been made to extract SRs from text, not only in computational linguistics, but also in information retrieval and geospatial analytics. Early approaches relied on rule-based methods and regular expressions, which have since evolved into more flexible ML methods. SR labelling can be characterised as an ML classification task to identify combinations of trajectors (e.g., “ball”), indicators (“on”), and landmarks (“the ground”) [15]. More recent work leverages deep learning for this task, including convolutional neural networks for relation extraction [16].</p>
      <p>A related task consists of detecting toponyms in text, usually as part of Named Entity Recognition (NER). A further step associates toponyms with spatial extensions, such as georeferenced points or polygons, to facilitate data integration and analysis — this process is known as geoparsing, geocoding, toponym resolution, or georeferencing. The integration of SR detection with NER has also been explored, estimating the spatial extent of expressions such as “North Milan” and “10 km from the French border” [17]. Recently, LLMs have begun to be evaluated for their effectiveness in NER for place detection and geoparsing. Initial research shows how GPT-based models can achieve high accuracy in multiple domains, including geography [18].</p>
      <p>Toponyms exhibit strong temporal variation and require dedicated semantic resources to connect place names to appropriate spatial scopes. The World Historical Gazetteer (WHG)1 gathers records from multiple sources to identify place names across temporal contexts, such as Byzantium, Constantinople, and Istanbul, using a linked data approach.2 Historical geoparsers must balance precision with historical sensitivity and domain-specific training [19]. For Latin, NER faces more challenges than for English, including orthographic and diachronic variation, as well as limited and sparse training data [20, 21]. To date, the majority of research and tools focus on contemporary languages, and no Latin evaluation exists for the extraction of SRs and geoparsing. While the studies briefly reviewed in this section mark important progress in both Latin NLP and SR extraction, systematic evaluation of LLMs on spatial language understanding in Latin remains largely unexplored. Building on this foundation, our study investigates whether LLMs can interpret spatial constructions in Latin with a level of accuracy that approximates human linguistic analysis.</p>
      <sec id="sec-1a">
        <title>3. Research Questions and Evaluation Tasks</title>
        <p>We examine whether LLMs can identify and interpret spatial constructions in Latin in ways that approximate human linguistic judgment. Specifically, we investigate three tasks that collectively test the models’ capacity to perform SR extraction and identification in Latin. This study is guided by the following research questions:
RQ1: To what extent can LLMs accurately identify preverbed motion verbs in Latin sentences?
RQ2: To what extent can LLMs detect place expressions that co-occur with preverbed motion verbs — regardless of their syntactic form — and classify them as indicating the Source, Goal, or Path?
RQ3: To what extent can LLMs correctly perform SR type disambiguation in Latin, especially in cases where the distinction between common nouns, proper nouns (toponyms), and adverbs is ambiguous?</p>
        <p>These questions target key linguistic phenomena involved in spatial language understanding and test the applicability of LLMs to historical languages. Motion verbs are highly relevant for tasks involving spatial semantics and argument structure, particularly in Latin, where directional meaning is often distributed across both the verb and its preverb. Secondly, motion verbs frequently occur with locative or directional expressions (e.g., accusative or ablative prepositional phrases), providing rich ground for testing whether models can correctly associate verbs with SRs. Finally, the variability in motion verb semantics (e.g., goal-directed vs. manner-of-motion)</p>
      </sec>
      <sec id="sec-1-1">
        <p>allows us to probe whether models distinguish different types of motion events. Preverbs play a central role in encoding directionality and spatial modification in Latin motion constructions. The distinction between proper and common nouns (Roma ‘Rome’ vs. domus ‘house’) is important from a cultural perspective to map how motion verbs relate to the geographical imaginary of the Roman world. Technically, it also provides more detail about the ability of LLMs to detect and interpret spatial references.</p>
        <p>To operationalise our research questions, we define three corresponding annotation tasks:
1. Motion Verb Identification (RQ1): Determine whether a given Latin sentence contains a preverbed motion verb.
2. SR Detection and Classification (RQ2): Identify the presence of place expressions that co-occur with preverbed motion verbs and classify their semantic role in the motion event as Source, Path, or Goal, regardless of syntactic realization.
3. SR Type Disambiguation (RQ3): Perform SR type disambiguation with particular attention to expressions relevant to motion contexts, including disambiguation between common nouns, proper nouns (toponyms), and adverbs.</p>
        <p>phases of Latin’s development, across Early, Classical, and Late Latin [27]. Genre was a key consideration in corpus design. To avoid the so-called “God’s truth fallacy” [23] — the mistaken assumption that a single text type or genre can represent the full linguistic reality of a historical period — we included a range of genres that reflect different stylistic and communicative registers. The corpus contains texts from a wide range of genres: historiography, poetry, theatre, philosophy, novel, oratory.4 This selection allows us to investigate genre-conditioned variation while also providing a broader basis for generalisations about Latin syntax. Texts were sourced primarily from the Perseus Digital Library5 [29], except for Ennius’ Annales, accessed via PHI Latin Texts6 [30]. Prose is more represented (61.7%) than poetry (38.3%), reflecting both textual availability and our aim to balance stylistic registers. Comedy and satire, often considered closer to spoken Latin, were included despite their underrepresentation in standard corpora. Inscriptions and epistolography were excluded due to limited data on preverbs. Text selection also accounted for varying author productivity, with prolific authors like Cicero and Seneca represented by more than one text, while preserving balance across genres.</p>
        <p>1 https://whgazetteer.org. Last accessed: 26 July 2025.</p>
        <sec id="sec-1-1-1">
          <title>4.2. Selecting Motion Verbs and Preverbs</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>4. Corpus, Annotation, Dataset</title>
        <sec id="sec-1-2-1">
          <title>4.1. The Usual Dilemma: Choosing a Representative Corpus for Latin</title>
          <p>Given the fragmentary nature of the surviving material and the uneven transmission of texts across time, genre, and register, a fully representative corpus of Latin, as for historical languages in general, is ultimately unattainable [22]. Nevertheless, the Latin corpus used in this study is constructed specifically to address the limitations of existing resources and to meet the needs of historical corpus linguistics [23, 24]. Standard annotated corpora, such as the Latin Dependency Treebank (LDT) [25, 26], offer valuable syntactically annotated material but are limited in scope and uneven in their coverage. Many important authors — such as Plautus, Seneca, and Petronius — are entirely absent, and key texts like Caesar’s De bello Gallico and Virgil’s Aeneid are only partially included. To support quantitative and diachronic analysis, we constructed a custom corpus that is sensitive to linguistic diversity across time and genres. The corpus includes 16 Latin texts by 13 authors, and 265,707 tokens in total.3 The corpus texts span from the 3rd century BCE to the 2nd century CE. This temporal range captures the major</p>
        </sec>
        <sec id="sec-1-2-2">
          <p>The study requires a representative sample of motion verbs exhibiting diverse syntactic behaviour and frequently co-occurring with place expressions in Latin texts. We select eight verbal bases denoting different motion domains, and 16 preverbs. This results in a combinatorial space of 128 verb–preverb combinations (though not all are attested). The selection is based on the PREMOVE dataset (cf. Section 4.3), which provides gold-standard annotations for these verbs and preverbs, ensuring both linguistic coverage and empirical grounding.</p>
          <p>The verbal bases are: eo ‘go’, venio ‘go, come’ (both referring to generic motion), fugio ‘flee’, gradior ‘walk’, curro ‘run’, volo ‘fly’, no ‘swim’ (manner-of-motion verbs denoting specific types of movement along different media: ground, sky, water), and navigo ‘sail’ (motion by water via vehicle). These bases are selected to ensure coverage of different spatial event types and to test model performance across varying lexical, morphological, and syntactic profiles. Apart from the comitative preverb cum- ‘together’, denoting accompaniment, all preverbs possess an inherent spatial meaning. They can be categorised into four classes, based on the SR they inherently focus on:
• Source-preverbs: ab- ‘away, away from’, de-</p>
        </sec>
      </sec>
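<p>The combinatorial space described above can be made concrete with a short sketch (our own illustration: only six of the sixteen preverbs are named in this section, so the remainder are hypothetical placeholders):</p>

```python
# Sketch of the verb-preverb combinatorial space: 8 verbal bases x 16 preverbs
# yields 128 candidate pairs, not all of which are attested in Latin.
# Preverbs beyond those named in the text are placeholders for illustration.
from itertools import product

bases = ["eo", "venio", "fugio", "gradior", "curro", "volo", "no", "navigo"]
named_preverbs = ["ab-", "ad-", "ex-", "in-", "de-", "cum-"]
placeholder_preverbs = [f"pv{i}-" for i in range(10)]  # hypothetical fillers
preverbs = named_preverbs + placeholder_preverbs       # 16 in total

combinations = [(pv, b) for pv, b in product(preverbs, bases)]
assert len(combinations) == 128  # the full (partly unattested) space
```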
      <sec id="sec-1-3">
        <p>3 Since punctuation is not present in the original Latin texts, punctuation marks are excluded from the token count.</p>
        <p>4 Labels from [28].</p>
        <p>5 https://www.perseus.tufts.edu. Last accessed: 26 July 2025.</p>
        <p>6 https://latin.packhum.org. Last accessed: 26 July 2025.</p>
        <sec id="sec-1-3-1">
          <title>4.3. Gold Standard</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <p>To create the gold standard for evaluation, we manually annotate occurrences of motion verb constructions in the Latin corpus described above. The annotation is carried out using the INCEpTION platform [31, 32, 33, 34], whose user-friendly interface and extensible architecture prove essential for this study. All annotations are carried out by a single expert annotator (the first author). To verify task clarity, we conducted an Inter-Annotator Agreement (IAA) test on a random sample of 10 sentences, independently annotated by two additional historical linguists. The test yielded perfect agreement (IAA = 1.0), confirming that the task is sufficiently clear and unambiguous to justify relying on a single expert annotator for the full dataset. The annotation follows the guidelines described in [35]. Each sentence containing a preverbed motion verb is analysed to determine the presence of SRs, following a multi-layered annotation scheme (cf. Section 3):
1. Motion Verb Identification (Task 1): Identify whether the sentence contains a target motion verb.
2. SR Detection and Classification (Task 2): If a motion verb is present, determine whether it co-occurs with a SR. When a SR is present, classify its type as Source, Goal, or Path. Prepositions, case morphology, and preverb semantics are used to guide this decision, making the task unambiguous (e.g., ex urbe ‘from the city’ = Source; in urbem ‘to the city’ = Goal; per urbem ‘through the city’ = Path).
3. SR Type Disambiguation (Task 3): Annotate the SR type of spatial expressions, i.e. distinguish between proper nouns (e.g., Roma ‘Rome’), common nouns (e.g., domus ‘house’), and adverbs (e.g., hinc ‘from here’).</p>
      </sec>
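<p>As a rough illustration of the multi-layered scheme above, a single annotated sentence might be represented as follows (a hypothetical structure of our own, not the actual PREMOVE serialisation format):</p>

```python
# Hypothetical record for one annotated sentence, mirroring the three layers:
# verb identification, SR detection/classification, SR type disambiguation.
annotation = {
    "sentence": "ex urbe exeo",
    "motion_verb": {"form": "exeo", "lemma": "exeo", "preverb": "ex-"},
    "spatial_relations": [
        {"relation": "Source", "token": "urbe", "sr_type": "common noun"},
    ],
}

def has_relation(ann, relation):
    """True if the annotated sentence encodes the given spatial relation."""
    return any(r["relation"] == relation for r in ann["spatial_relations"])
```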
      <sec id="sec-1-5">
        <p>These annotations form part of the PREMOVE dataset [36], which also contains additional annotation layers, as it is developed within the context of a broader research project [37].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Experimental Setup</title>
      <sec id="sec-2-1">
        <title>5.1. Dataset and Models</title>
        <sec id="sec-2-1-1">
          <p>Dataset. The experiments are conducted on the dataset described in Section 4.1, which consists of 1,483 Latin sentences. Since our focus is on spatial semantics, we filter out sentences that lack SR annotations. The resulting dataset used for experimentation comprises 649 sentences (cf. Section 4.1).</p>
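<p>A minimal sketch of how such a filtered evaluation set can be balanced per spatial relation, as done for the subsets described below (our own illustration; function names and record layout are assumptions, not the released code):</p>

```python
# Illustrative sketch: build one balanced subset per SR by pairing sentences
# where the relation is present (positives) with an equal number of sentences
# where it is absent (negatives).
import random

def balanced_subset(sentences, relation, seed=0):
    """Balance positives (relation present) and negatives (absent) for one SR."""
    pos = [s for s in sentences if relation in s["relations"]]
    neg = [s for s in sentences if relation not in s["relations"]]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, n) + rng.sample(neg, n)

sentences = [
    {"text": "ex urbe exeo", "relations": {"Source"}},
    {"text": "in urbem ineo", "relations": {"Goal"}},
    {"text": "per urbem transeo", "relations": {"Path"}},
    {"text": "Romam adeo", "relations": {"Goal"}},
]
goal_subset = balanced_subset(sentences, "Goal")  # 2 positives + 2 negatives
```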
          <p>SRs are unevenly distributed across the data: Goal relations appear in 68.4% of the occurrences, while Source and Path occur in only 19.6% and 12.0%, respectively. This is in line with the Goal-over-Source principle, according to which languages express the Goal more frequently because it plays a more central role in the conceptualisation of motion events, making the event appear complete and cognitively salient [38]. Moreover, Goal-oriented motion is often perceived as more intentional and purposeful, while Source expressions suggest less human agency [39, 40]. To mitigate this imbalance and ensure a fairer evaluation of model behaviour across relation types, we also construct three distinct, balanced subsets of the dataset (cf. Sections 5.2, 6.1). Each subset isolates a single SR and balances positive and negative examples for that relation. The resulting subset sizes are as follows:
• Goal subset: 394 sentences
• Source subset: 256 sentences
• Path subset: 150 sentences</p>
          <p>The total number of sentences across the subsets exceeds the total number of sentences in the dataset (649), as individual sentences can encode more than one type of SR.</p>
          <p>The experiments are implemented in Python 3.9.13, using the PyTorch and Hugging Face Transformers libraries. To run the Mistral and Llama models, we use an A100 GPU (purchased) and a T4 GPU via Google Colab. Our code is freely available on GitHub7.</p>
          <p>5.2. Prompt Engineering</p>
          <p>Task 1. Task 1 consists in identifying all inflected forms of a given Latin verb in one or more input sentences. The core prompt includes the verb lemma, a linguistic framing, and clear task constraints. Importantly, the input to the models consists of individual sentences rather than full passages. These are extracted directly from PREMOVE (cf. 4.3), in order to isolate sentence-level syntactic and semantic behaviour and to reduce computational cost during inference. The prompt is given below:</p>
          <p>This is a task of Latin linguistics. Given
the following Latin sentences, identify all the
forms of the verb ‘{verb}’ across all sentences.</p>
          <p>Note that verbs may occur more than once and in more than one sentence, so PROVIDE ALL THE FORMS YOU DETECT.</p>
          <p>This task is designed to evaluate models’ ability to identify all inflected forms of a given Latin verb, not to test their recognition of motion semantics per se. While the target lemmas are motion verbs, they are explicitly provided in the prompt to ensure clarity and task focus. This approach also avoids ambiguity in cases where multiple motion verbs may occur in the same sentence, some of which fall outside the scope of annotation. Testing the models’ ability to detect motion verbs without guidance would indeed be a valuable direction for future work, but lies beyond the controlled objectives of this task.</p>
          <p>Models. We choose two open-weight LLMs (Mistral and Llama) and one proprietary model (OpenAI’s GPT) to compare performance across different architectures and accessibility levels. Open-weight models are LLMs whose trained parameters (weights) are publicly released, allowing researchers and developers to run, fine-tune, and deploy them independently. In contrast, proprietary models like GPT are closed-source and accessible only via API or controlled platforms. We use Mistral-7B-Instruct-v0.1, Meta’s Llama-3.2-3B-Instruct, and OpenAI’s GPT-4. We did not perform any fine-tuning on the open-weight models; we used the pre-trained versions as provided on Hugging Face, without further adaptation or training. The prompts are described in Section 5.2. We evaluate model performance under zero-shot, one-shot, and five-shot conditions. In the zero-shot setting, the model is given only the task instruction, without any examples. In the one-shot and five-shot settings, we add one or five manually annotated examples from our corpus (Section 5.1), respectively, to the prompt. These examples are selected at random and aim to reflect typical structures found in the corpus. This design allows us to test how much model performance improves with limited supervision. We intentionally selected models that were not specifically fine-tuned for Latin, to ensure a fair comparison across general-purpose architectures; our aim is to evaluate how LLMs trained primarily on large multilingual or general corpora perform out of the box on Latin. All experiments are performed locally, with a machine comprising 8 CPU cores and 8 GB of RAM.</p>
          <p>Task 2. The base prompt includes a task explanation and binary labels for each SR. A representative zero-shot version is shown below:</p>
          <p>This is a task of Latin linguistics. Given the following Latin sentence, identify all the forms of the verb ‘{verb}’. Then, additionally answer: Does the sentence contain a source expression? True or False; Does the sentence contain a goal expression? True or False; Does the sentence contain a path expression? True or False</p>
          <p>Task 3. This task consists of classifying a spatial token linked to a motion verb as either an adverb, a common noun, or a proper noun. Initial prompts list classification labels and provide a target token. As early outputs show</p>
          <p>7 https://github.com/farina-andrea/latin-spatial-relations-llms. Last accessed: 26 July 2025.</p>
          <p>This is a task of Latin linguistics.</p>
          <p>Given the Latin sentence below, and focusing
specifically on the verb ‘{verb}’, identify
the noun or adverb in the sentence governed by
‘{verb}’ and expressing the spatial relation
‘{relation type}’ (Source, Goal, or Path).</p>
          <p>Classify this token as one of the following:
- An adverb (e.g., ‘hinc’)
- A common noun referring to a place (e.g.,
‘domus’, ‘forum’)
- A proper noun referring to a place name (e.g.,
‘Roma’, ‘Carthago’).</p>
          <p>Sentence: ‘{sentence}’
Answer with exactly two lines, no extra text:
Token: &lt;token&gt;
adverb | common noun | proper noun</p>
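<p>To make the shot settings concrete, the following sketch shows how zero- and few-shot prompts of the kind above could be assembled from annotated examples (our own illustration, not the code released on GitHub; the function names and the example-formatting convention are assumptions):</p>

```python
# Illustrative few-shot prompt builder for Task 2: prepend n randomly chosen
# annotated examples to the base instruction (0 = zero-shot, 1 = one-shot,
# 5 = five-shot). Example formatting is a hypothetical convention of ours.
import random

BASE_INSTRUCTION = (
    "This is a task of Latin linguistics. Given the following Latin sentence, "
    "identify all the forms of the verb '{verb}'. Then, additionally answer: "
    "Does the sentence contain a source expression? True or False; "
    "Does the sentence contain a goal expression? True or False; "
    "Does the sentence contain a path expression? True or False"
)

def build_prompt(verb, sentence, examples, n_shots=0, seed=0):
    """Return the full prompt string for the given shot setting."""
    parts = [BASE_INSTRUCTION.format(verb=verb)]
    if n_shots:
        rng = random.Random(seed)
        for ex in rng.sample(examples, n_shots):
            parts.append(
                f"Sentence: {ex['sentence']}\n"
                f"Source: {ex['source']}; Goal: {ex['goal']}; Path: {ex['path']}"
            )
    parts.append(f"Sentence: {sentence}")
    return "\n\n".join(parts)

examples = [
    {"sentence": "ex urbe exeo", "source": True, "goal": False, "path": False},
    {"sentence": "in urbem ineo", "source": False, "goal": True, "path": False},
    {"sentence": "per urbem transeo", "source": False, "goal": False, "path": True},
]
zero_shot = build_prompt("eo", "Romam adeo", examples, n_shots=0)
one_shot = build_prompt("eo", "Romam adeo", examples, n_shots=1)
```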
        </sec>
        <sec id="sec-2-1-2">
          <title>6. Results</title>
          <sec id="sec-2-1-2-1">
            <title>6.1. Quantitative Evaluation</title>
            <p>Task 1. The results of Task 1 are given in Table 2.</p>
            <table-wrap id="tab2">
              <label>Table 2</label>
              <caption><p>Task 1. Model performances across different shot settings on all 649 sentences. Highest scores per shot setting are highlighted in bold.</p></caption>
              <table>
                <thead>
                  <tr><th>Model</th><th>Setting</th><th>Precision</th><th>Recall</th><th>F1-score</th></tr>
                </thead>
                <tbody>
                  <tr><td>Mistral-7B</td><td>Zero-shot</td><td>0.09</td><td>0.23</td><td>0.13</td></tr>
                  <tr><td>Mistral-7B</td><td>One-shot</td><td>0.08</td><td>0.19</td><td>0.11</td></tr>
                  <tr><td>Mistral-7B</td><td>Five-shot</td><td>0.04</td><td>0.10</td><td>0.06</td></tr>
                  <tr><td>Llama-3.2B</td><td>Zero-shot</td><td>0.33</td><td>0.12</td><td>0.05</td></tr>
                  <tr><td>Llama-3.2B</td><td>One-shot</td><td>0.03</td><td>0.10</td><td>0.05</td></tr>
                  <tr><td>Llama-3.2B</td><td>Five-shot</td><td>0.01</td><td>0.06</td><td>0.02</td></tr>
                  <tr><td>GPT-4</td><td>Zero-shot</td><td>0.95</td><td>0.98</td><td>0.96</td></tr>
                  <tr><td>GPT-4</td><td>One-shot</td><td>0.91</td><td>0.98</td><td>0.94</td></tr>
                  <tr><td>GPT-4</td><td>Five-shot</td><td>0.85</td><td>0.97</td><td>0.91</td></tr>
                </tbody>
              </table>
            </table-wrap>
            <p>GPT-4 strongly outperforms both Llama-3.2-3B-Instruct and Mistral-7B-Instruct on all 649 sentences. Its precision, recall, and F1-scores remain consistently high across all prompt settings, indicating robust zero- and few-shot generalisation. The open-weight models perform poorly and also degrade in performance as shots increase, suggesting that additional examples may introduce noise rather than aid in disambiguation.</p>
            <p>Literal motion. We evaluate Task 2 on a subset annotated exclusively for literal motion verbs, focusing on physical movement and excluding figurative uses. This dataset includes Source, Goal, and Path, but is unbalanced across SRs. Mistral, Llama, and GPT are tested under zero-, one-, and six-shot settings, with the latter including one positive and one negative example per relation.</p>
            <p>As shown in Table 4, Llama’s and Mistral’s performances remain identical and unreliable, marked by low precision and F1-scores, particularly for Path, which is never correctly identified. While slight improvements can be seen for Source under six-shot prompting (F1 = 0.67 for Mistral), overall performance remains inconsistent and largely unchanged compared to the mixed dataset (cf. Table 3). For this reason, both models were excluded from further experiments on Task 2 and from the entirety of Task 3, which builds upon the SR classification performed in Task 2.</p>
            <p>GPT-4 performs considerably better. The Goal relation continues to be the most robust, reaching an F1-score of 0.83 in the six-shot setting. Performance for Source and Path, however, remains more variable and consistently lower, with best F1-scores of 0.61 and 0.54 respectively. This suggests that even in literal motion contexts, Source and Path relations are harder to detect reliably — possibly because Goal is more commonly and overtly expressed in motion events, giving the model stronger and more consistent lexical or structural cues to rely on.</p>
          </sec>
        </sec>
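<p>The precision, recall, and F1 scores reported in Table 2 can in principle be computed from per-sentence sets of gold and predicted verb forms; a minimal micro-averaged sketch follows (our own illustration, not the authors’ evaluation script):</p>

```python
# Micro-averaged precision/recall/F1 over sets of predicted verb forms,
# matching the metrics reported for Task 1. Our own illustration.
def prf1(gold_sets, pred_sets):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold.intersection(pred))   # correctly detected forms
        fp += len(pred.difference(gold))     # overgenerated forms
        fn += len(gold.difference(pred))     # missed forms
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the model overgenerates one extra form in the second sentence,
# the overgeneration pattern described in the qualitative evaluation.
gold = [{"exeo"}, {"transierat"}]
pred = [{"exeo"}, {"transierat", "traduxisse"}]
p, r, f = prf1(gold, pred)
```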
        <sec id="sec-2-1-3">
          <p>Task 2. Results for Task 2 on all 649 sentences are shown in Table 3. Performance varies significantly between GPT on the one hand, and Mistral and Llama on</p>
        </sec>
        <sec id="sec-2-1-4">
          <p>Controlled SRs. To check whether the imbalance between Goal, Source, and Path is contributing to GPT-4’s lower performance, we test the model on three separate subsets of the data. The task was split into three separate sub-tasks, each focused on a sin-</p>
          <p>[Tables 3 and 4: Task 2 Precision, Recall, and F1-score for Mistral-7B, Llama-3.2B, and GPT-4 across shot settings; cell values were lost in extraction.]</p>
        </sec>
        <sec id="sec-2-1-5">
          <p>The results on the split dataset show more stable performance across relations (Table 5). For Source, the best F1 is 0.77 with one-shot prompting; for Goal, recall remains high (0.95) with moderate precision (0.57); and for Path, the best F1 (0.79) is achieved with two-shot prompting.</p>
        </sec>
        <sec id="sec-2-1-6">
          <p>Task 3. Table 6 summarises the performance of GPT-4 in classifying parts of speech in sentences related to motion. We exclude the other two models because of their poor performance on the previous two tasks, on which Task 3 relies (cf. 6.1). Zero- and one-shot prompting achieve the highest F1 score for common nouns, followed by adverbs. For proper nouns, recall is high, while precision is low. This discrepancy between high recall and low precision for proper nouns suggests that while GPT-4 reliably detects their presence, it often overpredicts and misattributes them within the sentence structure (cf. 6.2).</p>
          <p>gle SR, with corresponding dataset subsets (cf. 5.1). We restrict this analysis to GPT-4, as it seems to be the only model to produce SR predictions that are not effectively random (cf. 6.1 above).</p>
          <p>[Tables 5 and 6: per-relation (Source, Goal, Path) and per-SR-type (adverb, common noun, proper noun) Precision, Recall, and F1-score; cell values were lost in extraction.]</p>
        </sec>
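<p>The constrained two-line answer format requested in the Task 3 prompt (Section 5.2) also admits a simple parser; the following sketch (our own assumption, not the released code) rejects hedged or malformed replies:</p>

```python
# Illustrative parser for the constrained two-line Task 3 answer format:
# a "Token: ..." line followed by one of the three part-of-speech labels.
VALID_LABELS = {"adverb", "common noun", "proper noun"}

def parse_answer(raw):
    """Return (token, label), or None if the reply deviates from the format."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if len(lines) != 2 or not lines[0].lower().startswith("token:"):
        return None
    token = lines[0].split(":", 1)[1].strip()
    label = lines[1].lower()
    if label not in VALID_LABELS:
        return None  # e.g. hedged or overgenerated output
    return token, label

ok = parse_answer("Token: Roma\nproper noun")
bad = parse_answer("Token: Roma\nmaybe a noun?")
```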
        <sec id="sec-2-1-7">
          <title>Task 3. The SR type disambiguation task (GPT-4 only)</title>
          <p>Table 6: Task 3 (GPT-4). SR type disambiguation: adverbs, common nouns, proper nouns, under zero-shot and one-shot prompting. The one-shot (*) is given on a proper noun instance. Highest F1-score per shot setting is highlighted in bold.</p>
          <p>6.2. Qualitative Evaluation</p>
          <p>Task 1. Mistral and Llama show high confusion for verb identification, with an overgeneration of predictions that do not include the correct value. They often include forms that are morphologically or semantically related to the correct one (e.g., conveniens instead of conveniunt, subeo instead of subit), though in some cases the forms are entirely unrelated (e.g., advena, adgredior, excolui instead of aggressus). A qualitative inspection of the (few) mismatches for GPT-4 reveals that the model occasionally produces multiple verb forms within its output for a single sentence. Examples include cases such as transierat, traduxisse and evolo, evigila, where multiple words are listed. In these cases, the words are not different inflected forms of the same lemma, but rather distinct verbs or nouns. Nonetheless, the correct verb form is always present among these outputs (evolo, transierat), indicating that these are instances of overgeneration or model uncertainty. This behaviour persists despite prompt-engineering efforts to constrain the output format, suggesting a tendency of the model to hedge its predictions in ambiguous cases. Interestingly, increasing the number of shots does not improve performance, suggesting that additional examples for verb identification may introduce noise or ambiguity rather than reinforcing the model’s task-specific behaviour [41].</p>
          <p>Table 6 displays different levels of the models’ accuracy across parts of speech. While common nouns are identified with high confidence and accuracy, proper nouns pose some challenges, as reflected in lower precision and F1 scores. This finding reinforces the need to treat them separately. Even after prompt engineering (which yielded a slight performance improvement), a consistent pattern of error persists: whenever a proper noun appears in the sentence but is not governed by the target motion verb, the model still annotates it as the relevant argument. Although this is technically a correct identification of a proper noun, it is incorrect in the context of the task. For instance, in the sentence:</p>
          <p>Nam, ut scis optime, secundum quaestum Macedoniam profectus, [...] per transitum spectaculum obiturus, in quadam avia et lacunosa convalli a vastissimis latronibus obsessus atque omnibus privatus tandem evado</p>
          <p>‘So, as you well know, I had set out for Macedonia to earn a living. On the way, planning to take in some sights, I was ambushed in a remote and marshy valley by a band of enormous robbers. Stripped of everything, I finally managed to escape.’ (Apul. Met. 1.7)</p>
          <p>the model correctly identifies Macedoniam as a proper noun but incorrectly links it to the motion verb obeo (in the form obiturus), instead of recognising that it belongs to a different motion verb (profectus, from proficiscor), which is not among the verbs considered for annotation. This may suggest that, in the context of proper nouns, the model relies heavily on their salience and tends to overlook verb-governance constraints. In other words, the model appears to prioritise SR type recognition and semantic prominence over syntactic dependencies when proper nouns are involved. In other cases, the model occasionally misclassifies common nouns as proper nouns.</p>
          <p>Examples include words like fines ‘borders’ or urbs ‘city’, which are common nouns but are mistakenly labelled as proper nouns.</p>
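The lenient treatment implied above for Task 1, where an overgenerated output still counts as correct if the gold verb form appears among the listed candidates, can be sketched as follows. This is an illustrative sketch only: the score function and the example pairs are hypothetical, not the paper's actual evaluation code.

```python
def score(gold, predicted, lenient=False):
    """Return 1 if the prediction counts as correct, else 0.

    `predicted` may contain several comma-separated forms when the
    model overgenerates; lenient scoring accepts the output as long
    as the gold form appears among the candidates, while strict
    scoring requires a single exact match.
    """
    candidates = [c.strip() for c in predicted.split(",")]
    if lenient:
        return int(gold in candidates)
    return int(candidates == [gold])


# Hypothetical outputs echoing the overgeneration pattern discussed above.
pairs = [("transierat", "transierat, traduxisse"),
         ("evolo", "evolo, evigila")]
strict_hits = sum(score(g, p) for g, p in pairs)
lenient_hits = sum(score(g, p, lenient=True) for g, p in pairs)
print(strict_hits, lenient_hits)  # strict: 0, lenient: 2
```

Under strict scoring both outputs fail, whereas lenient scoring credits them because the gold form is always present among the candidates, which is exactly the distinction between genuine errors and overgeneration.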
        </sec>
        <sec id="sec-2-1-8">
          <title>Task 2.</title>
          <p>Mistral’s and Llama’s predictions show that the models randomly assign a positive or negative value to a specific SR. For Goal, F1 is high, as Goal is mostly present in the examples, due to the Goal-over-Source principle [38]. GPT-4’s performance differs depending on the relation type and prompt format. For Goal, performance drops drastically under the one-shot and three-shot settings with an unbalanced dataset. In these cases, the prompt examples possibly do not include a representative positive instance of Goal, causing a steep drop in its recognition. Balancing the dataset improves consistency across SRs, but qualitative errors remain. For instance, the model often confuses Source and Path when the contextual cues are subtle or ambiguous. On the subset limited to literal motion verbs, the model demonstrates relatively strong recognition of Goal, but struggles more with Source and Path.</p>
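The effect of the Goal-heavy class distribution on per-relation scores can be illustrated with a small sketch (the label counts below are hypothetical, not the paper's data): even a degenerate classifier that always answers "Goal" obtains a high Goal F1 while scoring zero on Source and Path, which is why balanced subsets yield more interpretable results.

```python
def prf(gold, pred, label):
    """Per-class precision, recall, and F1 for one SR label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced sample: Goal dominates, as in the unbalanced dataset.
gold = ["Goal"] * 8 + ["Source"] + ["Path"]
# A degenerate model that always predicts the majority class.
pred = ["Goal"] * 10

for label in ("Goal", "Source", "Path"):
    print(label, prf(gold, pred, label))
# Goal reaches F1 ≈ 0.89 with no real discrimination; Source and Path get 0.
```

A macro-average over the three relations (here about 0.30) exposes the failure that the Goal-only F1 conceals, which motivates reporting per-relation scores on balanced subsets.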
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>7. Discussion and Conclusion</title>
      <p>This study evaluates LLMs across three interconnected tasks in Latin linguistic analysis: motion verb identification, SR classification, and SR type disambiguation. Our results are encouraging, but they also highlight the significant differences in performance between models — particularly the stark contrast between GPT-4 and open-weight models such as Llama and Mistral.</p>
      <p>GPT-4 achieves high performance across all tasks, already in zero-shot settings. This is likely due to the substantial presence of Latin data in its pretraining corpus. While the precise contents of GPT-4’s training data remain undisclosed, estimates based on GPT-3 suggest at least 339 million Latin tokens were included [42], and GPT-4 was trained on significantly more data. This makes it plausible that GPT-4 has had substantial exposure to Latin, unlike models such as Llama and Mistral, which likely lack such training data and perform accordingly worse — often failing completely in zero-shot settings.</p>
      <p>For preverbed motion verb identification, GPT-4 achieves strong performance, particularly under zero-shot settings [41]. SR classification exposes challenges due to data imbalance, with Goal relations dominating the dataset. Creating balanced subsets helps obtain more reliable and interpretable results. SR type disambiguation proves the most difficult task, with the model frequently misclassifying proper nouns and failing to correctly link them to the relevant motion verbs. This highlights a gap in the way the models can use contextual reasoning to disambiguate entities. This may be mitigated by expanding the length of the input text so as to offer more context to the models. Error analysis suggests that the model’s dependence on lexical familiarity and world knowledge, which may not perfectly align with classical contexts, limits its accuracy.</p>
      <p>These findings demonstrate that while LLMs show promising semantic understanding in Latin, syntactic and contextual challenges persist. Balancing datasets and employing few-shot prompting improve performance, but do not fully resolve issues related to ambiguity and entity linking.</p>
      <p>Future work should focus on domain-specific fine-tuning with classical corpora, possibly integrating external knowledge sources to enhance disambiguation and semantic grounding. This combined approach can better support the complex linguistic features of Latin and ultimately advance computational tools for classical language research. In parallel, similar experiments should be conducted on other languages to assess how especially open-weight models handle spatial relations in languages for which they have broader coverage. Such comparisons can clarify whether the poor performance observed in Latin stems from language-specific limitations or from more general architectural and training differences. Additionally, future studies could isolate prose texts to control for syntactic regularity, as poetic language often introduces greater structural variability and long-distance dependencies that may challenge model performance.</p>
      <p>Our study — the first on LLMs’ SR recognition in historical languages — clarifies their performance and limits in this area. It lays the groundwork for more specialised computational methods in Computational Humanities and Historical Linguistics, with potential applications to other historical languages where preverbs are vastly employed, such as Ancient Greek [43].</p>
      <p>Author contributions</p>
      <p>AF was responsible for conceptualisation, methodology, formal analysis, software implementation (including all code used for analysis), and manual annotation of the dataset; he wrote the original draft for Sections 1, 3-7, and edited the final manuscript. AB and BMcG contributed to the conceptualisation and methodology of the project, drafted Section 2, and participated in review, editing, and supervision of the research.</p>
      <p>References</p>
      <p>[1] B. Levin, English Verb Classes and Alternations: A Preliminary Investigation, Chicago: The University of Chicago Press, 1993.
[2] L. Talmy, Toward a Cognitive Semantics. Vol. 1: Concept Structuring Systems, Cambridge (MA): MIT Press, 2000.
[3] G. Lakoff, Women, Fire and Dangerous Things: What Categories Reveal about the Mind, Chicago: The University of Chicago Press, 1987.
[4] G. Lakoff, M. Johnson, Metaphors We Live By, Chicago: The University of Chicago Press, 1980.
[5] R. Sprugnoli, F. Iurescia, M. Passarotti, Overview of the EvaLatin 2024 evaluation campaign, in: Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), Language Resources and Evaluation Conference (LREC 2024), 2024, pp. 190–197.
[6] D. Bamman, P. J. Burns, Latin BERT: A contextual language model for classical philology, arXiv preprint arXiv:2009.10053 (2020). URL: https://arxiv.org/abs/2009.10053.
[7] P. Lendvai, C. Wick, Finetuning Latin BERT for word sense disambiguation on the Thesaurus Linguae Latinae, in: Proceedings of the Workshop on Cognitive Aspects of the Lexicon, Association for Computational Linguistics, Taipei, Taiwan, 2022, pp. 37–41.
[8] I. Ghinassi, S. Tedeschi, P. Marongiu, R. Navigli, B. McGillivray, Language pivoting from parallel corpora for word sense disambiguation of historical languages: A case study on Latin, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 10073–10084.
[9] M. Beersmans, E. de Graaf, T. V. de Cruys, M. Fantoli, Training and evaluation of named entity recognition models for classical Latin, in: A. Anderson, S. Gordin, S. Klein, B. Li, Y. Liu, M. C. Passarotti (Eds.), Proceedings of the Ancient Language Processing Workshop (ALP 2023) associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023), 2023.
[10] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, X. Xie, A survey on evaluation of large language models, ACM Trans. Intell. Syst. Technol. 15 (2024). URL: https://doi.org/10.1145/3641289. doi:10.1145/3641289.
[11] Q. Xue, Unlocking the potential: A comprehensive exploration of large language models in natural language processing, Applied and Computational Engineering 57 (2024) 247–252. URL: https://doi.org/10.54254/2755-2721/57/20241341. doi:10.54254/2755-2721/57/20241341.
[12] Z. Wang, W. Zhong, Y. Wang, Q. Zhu, F. Mi, B. Wang, L. Shang, X. Jiang, Q. Liu, Data management for training large language models: A survey, 2024. URL: https://arxiv.org/abs/2312.01700. arXiv:2312.01700.
[13] I. Vieira, W. Allred, S. Lankford, S. Castilho, A. Way, How much data is enough data? Fine-tuning large language models for in-house translation: Performance evaluation across multiple dataset sizes, in: R. Knowles, A. Eriguchi, S. Goel (Eds.), Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Association for Machine Translation in the Americas, Chicago, USA, 2024, pp. 236–249. URL: https://aclanthology.org/2024.amta-research.20/.
[14] M. Volk, D. P. Fischer, L. Fischer, P. Scheurer, P. B. Ströbel, LLM-based machine translation and summarization for Latin, in: Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, ELRA and ICCL, Torino, Italia, 2024, pp. 122–128.
[15] P. Kordjamshidi, M. Van Otterlo, M.-F. Moens, Spatial role labeling: Towards extraction of spatial relations from natural language, ACM Transactions on Speech and Language Processing (TSLP) 8 (2011) 1–36.
[16] Q. Qiu, Z. Xie, K. Ma, Z. Chen, L. Tao, Spatially oriented convolutional neural network for spatial relation extraction from natural language texts, Transactions in GIS 26 (2022) 839–866.
[17] M. A. Syed, E. Arsevska, M. Roche, M. Teisseire, Geospatre: extraction and geocoding of spatial relation entities in textual documents, Cartography and Geographic Information Science 52 (2025) 221–236.
[18] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, G. Wang, GPT-NER: Named Entity Recognition via Large Language Models, arXiv preprint arXiv:2304.10428 (2023).
[19] J. Kenyon, J. W. Karl, B. Godfrey, Evaluation of placename geoparsers, Journal of Map &amp; Geography Libraries 19 (2023) 185–197.
[20] A. Erdmann, C. Brown, B. Joseph, M. Janse, P. Ajaka, M. Elsner, M.-C. de Marneffe, Challenges and solutions for Latin named entity recognition, in: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), 2016, pp. 85–93.
[21] M. Beersmans, E. de Graaf, T. Van de Cruys, M. Fantoli, Training and evaluation of named entity recognition models for classical Latin, in: Proceedings of the Ancient Language Processing Workshop, 2023, pp. 1–12.
[22] T. McEnery, A. Wilson, Corpus Linguistics: An Introduction. Second edition, Edinburgh: Edinburgh University Press, 2001.
[23] M. Rissanen, Three problems connected with the use of diachronic corpora, ICAME Journal 13 (1989) 16–19.
[24] G. B. Jenset, B. McGillivray, Quantitative Historical Linguistics: A Corpus Framework, Oxford University Press, Oxford, 2017.
[25] D. Bamman, G. Crane, The Latin Dependency Treebank in a cultural heritage digital library, Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), Prague (Czech Republic) (2007) 33–40.
[26] D. Bamman, G. Crane, The Ancient Greek and Latin Dependency Treebanks, in: Language Technology for Cultural Heritage, Springer, Berlin/Heidelberg, 2011, pp. 79–98.
[27] P. Cuzzolin, G. V. M. Haverling, Syntax, sociolinguistics, and literary genres, in: P. Baldi, P. Cuzzolin (Eds.), New Perspectives on Historical Latin Syntax, 2009, pp. 16–63.
[28] E. Biagetti, C. Zanchi, W. M. Short, Toward the creation of WordNets for ancient Indo-European languages, in: Proceedings of the 11th Global Wordnet Conference, University of South Africa (UNISA), volume 13, 2021, pp. 258–266.
[29] G. Crane, Building a Digital Library: The Perseus Project as a Case Study in the Humanities, in: DL ’96: Proceedings of the First ACM International Conference on Digital Libraries, 1996, pp. 3–10.
[30] P. H. Institute, Classical Latin Texts: A resource prepared by the Packard Humanities Institute (PHI), 2015.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref31">
        <mixed-citation>[31] J.-C. Klie, INCEpTION: Interactive Machine-[...], Bertinoro, Italy, 2018.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] B. Boullosa, R. E. de Castilho, N. Kumar, J.-C. Klie, [...], Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2018) 127–132.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] R. E. de Castilho, J.-C. Klie, N. Kumar, B. Boullosa, [...] (DI4R) 2018, 9–11 October 2018, Lisbon, Portugal (2018a) 1. URL: https://inception-project.github.io/publications/DI4R-2018.pdf.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] R. E. de Castilho, J.-C. Klie, N. Kumar, B. Boullosa, [...], Proceedings of the 14th eScience IEEE International [...] (2018b) 1. URL: https://inception-project.github.io/publications/ESCIENCE-2018.pdf.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] A. Farina, Guidelines for a linguistic annotation of preverbed verbs of motion, Figshare (2024). URL: https://doi.org/10.18742/25055573.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] A. Farina, PREMOVE - a diachronic dataset of [...] Motion Verbs, Oxford Text Archive (2025). URL: http://hdl.handle.net/20.500.14106/2579.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] A. Farina, The differences in Ancient Greek and [...], Research and Innovation (ref. number: 2749398), 2022–2026.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] Y. Ikegami, ‘Source’ vs. ‘Goal’: A case of linguistic [...], in: Concepts of Case, Narr, Tübingen, 1987, pp. 122–146.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] F. Ungerer, H.-J. Schmidt, An Introduction to Cognitive Linguistics, London: Longman, 1996.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] R. Dirven, M. Verspoor, Cognitive Exploration of Language and Linguistics, Amsterdam/Philadelphia: John Benjamins, 2004.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] B. McGillivray, A. Farina, Are large language models [...], [...]tics 2025, 9–13 June, Udine (Italy) (2025).</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[42] P. J. Burns, Research recap: How much Latin does ChatGPT know (2023). URL: https://isaw.nyu.edu/library/blog/research-recap-how-much-latin-does-chatgpt-know.</mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>[43] A. Farina, Aquamotion Verbs in Ancient Greek: A [...], Pavia: MA Thesis, 2021.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>