<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Mapping Meaning in Latin with Large Language Models: A Multi-Task Evaluation of Preverbed Motion Verbs and Spatial Relation Detection in LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Farina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Ballatore</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Barbara McGillivray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>King's College London</institution>
          ,
          <addr-line>Strand Campus, Strand, WC2R 2LS, London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
<p>This paper evaluates the capabilities of Large Language Models (LLMs) on three interrelated linguistic tasks in Latin: preverbed motion verb identification, spatial relation (SR) classification, and SR type disambiguation. We evaluate GPT-4, Llama, and Mistral under zero-shot and few-shot settings, using a manually annotated dataset of Latin sentences drawn from different authors, text types, and historical periods (3rd century BCE - 2nd century CE) as our gold standard. Results show that GPT-4 consistently outperforms open-weight models, particularly in zero-shot scenarios, likely due to its substantial pretraining exposure to Latin. However, even GPT-4 struggles with syntactic disambiguation, especially in linking proper nouns to their governing verbs. SR classification performance is skewed by dataset imbalance, and SR type disambiguation errors often stem from over-reliance on salience over syntax. Qualitative analysis reveals common patterns of overgeneration and uncertainty across tasks. Our findings underscore the potential of LLMs for historical language processing while highlighting persistent challenges related to ambiguity, entity linking, and syntactic reasoning. This study represents the first evaluation of SR recognition in historical languages and lays the groundwork for future domain-adapted fine-tuning approaches in Computational Humanities.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Latin</kwd>
        <kwd>motion verbs</kwd>
        <kwd>spatial relation classification</kwd>
        <kwd>SR type disambiguation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>set of continuous locations crossed by  while moving
from the Source to the Goal) [3]. This usually happens
The central aim of this study is to evaluate the ability of both in literal and non-literal contexts [4].
Large Language Models (LLMs) to analyse spatiality in This paper explores to what extent LLMs can handle
Latin texts, with a focus on motion events and their syn- such constructions in Latin, taken as an example of a
histactic and semantic environments. In Latin, motion verbs, torical and morphologically complex language. We focus
i.e., verbs denoting movement (cf. class 51 in [1]), often on preverbed motion verbs as an area that demands the
combine with preverbs — prefixes that attach onto verbal integration of lexical, syntactic, and spatial information.
bases to express (among other things) nuanced spatial To evaluate the models’ performance, we design three
meanings (cf. Section 4.2). For example, the Latin mo- linguistic tasks targeting diferent layers of interpretation
tion verb eo ‘go’ can be prefixed with diferent preverbs, relevant to motion events (Section 3). Preverbs often
prowhich deeply modify its semantics (e.g., the preverbs ex- vide crucial cues to argument structure and directionality
‘out of’ and in- ‘into’ generate exeo ‘exit’ and ineo ‘enter’). (e.g., abeo ‘go away’ vs. adeo ‘go toward/to’), which may
This preverbal modification is crucial for encoding spatial pose significant challenges for automatic disambiguation
relations (SRs) in Latin, as directionality and argument with LLMs. This allows us to assess the extent to which
structure are frequently expressed jointly by the verbal LLMs are able to perform linguistic annotation on
chalroot and its preverb. Motion events [2] involve an entity lenging verbal constructions such as preverbed motion
 moving from a Source (the starting point of motion) to verbs, which are structurally more complex than their
a Goal (the ending point of motion), and along a Path (the non-preverbed counterparts.</p>
<p>NLP tasks [5]. Among the most influential recent developments in Latin NLP is the introduction of contextualised language models. LatinBERT [6], a contextualised model trained on a substantial corpus comprising 642.7 million words spanning from Classical Antiquity to the contemporary period, has been shown to perform well in tasks such as lemmatisation, part-of-speech (POS) tagging, and syntactic parsing. LatinBERT has also shown promise in word sense disambiguation [7, 8] and named entity recognition [9].</p>
      <p>Generative LLMs have demonstrated impressive performance across several NLP tasks [10, 11]. However, their success relies on vast amounts of data [12, 13], which is not typically achieved by most historical corpora. The potential of LLMs for Latin is beginning to be systematically evaluated. Volk et al. [14] showed that GPT-4-based machine translation substantially outperforms previous approaches when tested on 16th-century correspondence written in Latin and Early New High German. In addition to translation, they also evaluated GPT-4 for paragraph-level summarisation of Latin texts, with its output compared against human-generated summaries.</p>
      <p>Parallel to these developments, efforts have been made to extract SRs from text, not only in computational linguistics, but also in information retrieval and geospatial analytics. Early approaches relied on rule-based methods and regular expressions, which have since evolved into more flexible ML methods. SR labelling can be characterised as an ML classification task to identify combinations of trajectors (e.g., “ball”), indicators (“on”), and landmarks (“the ground”) [15]. More recent work leverages deep learning for this task, including convolutional neural networks for relation extraction [16].</p>
      <p>A related task consists of detecting toponyms in text, usually as part of Named Entity Recognition (NER). A further step associates toponyms with spatial extensions, such as georeferenced points or polygons, to facilitate data integration and analysis — this process is known as geoparsing, geocoding, toponym resolution, or georeferencing. The integration of SR detection with NER has also been explored, estimating the spatial extent of expressions such as “North Milan” and “10 km from the French border” [17]. Recently, LLMs have begun to be evaluated for their effectiveness in NER for place detection and geoparsing. Initial research shows how GPT-based models can achieve high accuracy in multiple domains, including geography [18].</p>
      <p>Toponyms exhibit strong temporal variation and require dedicated semantic resources to connect place names to appropriate spatial scopes. The World Historical Gazetteer (WHG)1 gathers records from multiple sources to identify place names across temporal contexts, such as Byzantium, Constantinople, and Istanbul, using a linked data approach.2 Historical geoparsers must balance precision with historical sensitivity and domain-specific training [19]. For Latin, NER faces more challenges than for English, including orthographic and diachronic variation, as well as limited and sparse training data [20, 21]. To date, the majority of research and tools focus on contemporary languages, and no Latin evaluation exists for the extraction of SRs and geoparsing. While the studies briefly reviewed in this section mark important progress in both Latin NLP and SR extraction, systematic evaluation of LLMs on spatial language understanding in Latin remains largely unexplored. Building on this foundation, our study investigates whether LLMs can interpret spatial constructions in Latin with a level of accuracy that approximates human linguistic analysis.</p>
      <sec id="sec-1a">
        <title>3. Research Questions and Evaluation Tasks</title>
        <p>We examine whether LLMs can identify and interpret spatial constructions in Latin in ways that approximate human linguistic judgment. Specifically, we investigate three tasks that collectively test the models’ capacity to perform SR extraction and identification in Latin. This study is guided by the following research questions:
RQ1: To what extent can LLMs accurately identify preverbed motion verbs in Latin sentences?
RQ2: To what extent can LLMs detect place expressions that co-occur with preverbed motion verbs — regardless of their syntactic form — and classify them as indicating the Source, Goal, or Path?
RQ3: To what extent can LLMs correctly perform SR type disambiguation in Latin, especially in cases where the distinction between common nouns, proper nouns (toponyms), and adverbs is ambiguous?</p>
        <p>These questions target key linguistic phenomena involved in spatial language understanding and test the applicability of LLMs to historical languages. Motion verbs are highly relevant for tasks involving spatial semantics and argument structure, particularly in Latin, where directional meaning is often distributed across both the verb and its preverb. Secondly, motion verbs frequently occur with locative or directional expressions (e.g., accusative or ablative prepositional phrases), providing rich ground for testing whether models can correctly associate verbs with SRs. Finally, the variability in motion verb semantics (e.g., goal-directed vs. manner-of-motion)</p>
      </sec>
      <sec id="sec-1-1">
        <p>allows us to probe whether models distinguish different types of motion events. Preverbs play a central role in encoding directionality and spatial modification in Latin motion constructions. The distinction between proper and common nouns (Roma ‘Rome’ vs. domus ‘house’) is important from a cultural perspective to map how motion verbs relate to the geographical imaginary of the Roman world. Technically, it also provides more detail about the ability of LLMs to detect and interpret spatial references.</p>
        <p>To operationalise our research questions, we define three corresponding annotation tasks:
1. Motion Verb Identification (RQ1): Determine whether a given Latin sentence contains a preverbed motion verb.
2. SR Detection and Classification (RQ2): Identify the presence of place expressions that co-occur with preverbed motion verbs and classify their semantic role in the motion event as Source, Path, or Goal, regardless of syntactic realization.
3. SR Type Disambiguation (RQ3): Perform SR type disambiguation with particular attention to expressions relevant to motion contexts, including disambiguation between common nouns, proper nouns (toponyms), and adverbs.</p>
        <p>phases of Latin’s development, across Early, Classical, and Late Latin [27]. Genre was a key consideration in corpus design. To avoid the so-called “God’s truth fallacy” [23] — the mistaken assumption that a single text type or genre can represent the full linguistic reality of a historical period — we included a range of genres that reflect different stylistic and communicative registers. The corpus contains texts from a wide range of genres: historiography, poetry, theatre, philosophy, novel, oratory.4 This selection allows us to investigate genre-conditioned variation while also providing a broader basis for generalisations about Latin syntax. Texts were sourced primarily from the Perseus Digital Library5 [29], except for Ennius’ Annales, accessed via PHI Latin Texts6 [30]. Prose is more represented (61.7%) than poetry (38.3%), reflecting both textual availability and our aim to balance stylistic registers. Comedy and satire, often considered closer to spoken Latin, were included despite their underrepresentation in standard corpora. Inscriptions and epistolography were excluded due to limited data on preverbs. Text selection also accounted for varying author productivity, with prolific authors like Cicero and Seneca represented by more than one text, while preserving balance across genres.</p>
        <p>1 https://whgazetteer.org. Last accessed: 26 July 2025.</p>
        <sec id="sec-1-1-1">
          <title>4.2. Selecting Motion Verbs and Preverbs</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>4. Corpus, Annotation, Dataset</title>
        <sec id="sec-1-2-1">
          <title>4.1. The Usual Dilemma: Choosing a Representative Corpus for Latin</title>
          <p>Given the fragmentary nature of the surviving material and the uneven transmission of texts across time, genre, and register, a fully representative corpus of Latin, as for historical languages in general, is ultimately unattainable [22]. Nevertheless, the Latin corpus used in this study is constructed specifically to address the limitations of existing resources and to meet the needs of historical corpus linguistics [23, 24]. Standard annotated corpora, such as the Latin Dependency Treebank (LDT) [25, 26], offer valuable syntactically annotated material but are limited in scope and uneven in their coverage. Many important authors — such as Plautus, Seneca, and Petronius — are entirely absent, and key texts like Caesar’s De bello Gallico and Virgil’s Aeneid are only partially included. To support quantitative and diachronic analysis, we constructed a custom corpus that is sensitive to linguistic diversity across time and genres. The corpus includes 16 Latin texts by 13 authors, and 265,707 tokens in total.3 The corpus texts span from the 3rd century BCE to the 2nd century CE. This temporal range captures the major</p>
        </sec>
        <sec id="sec-1-2-2">
          <p>The study requires a representative sample of motion verbs exhibiting diverse syntactic behaviour and frequently co-occurring with place expressions in Latin texts. We select eight verbal bases denoting different motion domains, and 16 preverbs. This results in a combinatorial space of 128 verb–preverb combinations (though not all are attested). The selection is based on the PREMOVE dataset (cf. Section 4.3), which provides gold-standard annotations for these verbs and preverbs, ensuring both linguistic coverage and empirical grounding.</p>
          <p>The verbal bases are: eo ‘go’, venio ‘go, come’ (both referring to generic motion), fugio ‘flee’, gradior ‘walk’, curro ‘run’, volo ‘fly’, no ‘swim’ (manner-of-motion verbs denoting specific types of movement along different media: ground, sky, water), and navigo ‘sail’ (motion by water via vehicle). These bases are selected to ensure coverage of different spatial event types and to test model performance across varying lexical, morphological, and syntactic profiles. Apart from the comitative preverb cum- ‘together’, denoting accompaniment, all preverbs possess an inherent spatial meaning. They can be categorised into four classes, based on the SR they inherently focus on:
• Source-preverbs: ab- ‘away, away from’, de-</p>
        </sec>
      </sec>
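<p>The combinatorial space described above can be made concrete with a short sketch (our own illustration: only six of the sixteen preverbs are named in this section, so the remainder are hypothetical placeholders):</p>

```python
# Sketch of the verb-preverb combinatorial space: 8 verbal bases x 16 preverbs
# yields 128 candidate pairs, not all of which are attested in Latin.
# Preverbs beyond those named in the text are placeholders for illustration.
from itertools import product

bases = ["eo", "venio", "fugio", "gradior", "curro", "volo", "no", "navigo"]
named_preverbs = ["ab-", "ad-", "ex-", "in-", "de-", "cum-"]
placeholder_preverbs = [f"pv{i}-" for i in range(10)]  # hypothetical fillers
preverbs = named_preverbs + placeholder_preverbs       # 16 in total

combinations = [(pv, b) for pv, b in product(preverbs, bases)]
assert len(combinations) == 128  # the full (partly unattested) space
```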
      <sec id="sec-1-3">
        <p>3 Since punctuation is not present in the original Latin texts, punctuation marks are excluded from the token count.</p>
        <p>4 Labels from [28].</p>
        <p>5 https://www.perseus.tufts.edu. Last accessed: 26 July 2025.</p>
        <p>6 https://latin.packhum.org. Last accessed: 26 July 2025.</p>
        <sec id="sec-1-3-1">
          <title>4.3. Gold Standard</title>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <p>To create the gold standard for evaluation, we manually annotate occurrences of motion verb constructions in the Latin corpus described above. The annotation is carried out using the INCEpTION platform [31, 32, 33, 34], whose user-friendly interface and extensible architecture prove essential for this study. All annotations are carried out by a single expert annotator (the first author). To verify task clarity, we conducted an Inter-Annotator Agreement (IAA) test on a random sample of 10 sentences, independently annotated by two additional historical linguists. The test yielded perfect agreement (IAA = 1.0), confirming that the task is sufficiently clear and unambiguous to justify relying on a single expert annotator for the full dataset. The annotation follows the guidelines described in [35]. Each sentence containing a preverbed motion verb is analysed to determine the presence of SRs, following a multi-layered annotation scheme (cf. Section 3):
1. Motion Verb Identification (Task 1): Identify whether the sentence contains a target motion verb.
2. SR Detection and Classification (Task 2): If a motion verb is present, determine whether it co-occurs with a SR. When a SR is present, classify its type as Source, Goal, or Path. Prepositions, case morphology, and preverb semantics are used to guide this decision, making the task unambiguous (e.g., ex urbe ‘from the city’ = Source; in urbem ‘to the city’ = Goal; per urbem ‘through the city’ = Path).
3. SR Type Disambiguation (Task 3): Annotate the SR type of spatial expressions, i.e. distinguish between proper nouns (e.g., Roma ‘Rome’), common nouns (e.g., domus ‘house’), and adverbs (e.g., hinc ‘from here’).</p>
      </sec>
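<p>As a rough illustration of the multi-layered scheme above, a single annotated sentence might be represented as follows (a hypothetical structure of our own, not the actual PREMOVE serialisation format):</p>

```python
# Hypothetical record for one annotated sentence, mirroring the three layers:
# verb identification, SR detection/classification, SR type disambiguation.
annotation = {
    "sentence": "ex urbe exeo",
    "motion_verb": {"form": "exeo", "lemma": "exeo", "preverb": "ex-"},
    "spatial_relations": [
        {"relation": "Source", "token": "urbe", "sr_type": "common noun"},
    ],
}

def has_relation(ann, relation):
    """True if the annotated sentence encodes the given spatial relation."""
    return any(r["relation"] == relation for r in ann["spatial_relations"])
```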
      <sec id="sec-1-5">
        <p>These annotations form part of the PREMOVE dataset [36], which also contains additional annotation layers, as it is developed within the context of a broader research project [37].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Experimental Setup</title>
      <sec id="sec-2-1">
        <title>5.1. Dataset and Models</title>
        <sec id="sec-2-1-1">
          <p>Dataset. The experiments are conducted on the dataset described in Section 4.1, which consists of 1,483 Latin sentences. Since our focus is on spatial semantics, we filter out sentences that lack SR annotations. The resulting dataset used for experimentation comprises 649 sentences (cf. Section 4.1).</p>
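<p>A minimal sketch of how such a filtered evaluation set can be balanced per spatial relation, as done for the subsets described below (our own illustration; function names and record layout are assumptions, not the released code):</p>

```python
# Illustrative sketch: build one balanced subset per SR by pairing sentences
# where the relation is present (positives) with an equal number of sentences
# where it is absent (negatives).
import random

def balanced_subset(sentences, relation, seed=0):
    """Balance positives (relation present) and negatives (absent) for one SR."""
    pos = [s for s in sentences if relation in s["relations"]]
    neg = [s for s in sentences if relation not in s["relations"]]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, n) + rng.sample(neg, n)

sentences = [
    {"text": "ex urbe exeo", "relations": {"Source"}},
    {"text": "in urbem ineo", "relations": {"Goal"}},
    {"text": "per urbem transeo", "relations": {"Path"}},
    {"text": "Romam adeo", "relations": {"Goal"}},
]
goal_subset = balanced_subset(sentences, "Goal")  # 2 positives + 2 negatives
```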
          <p>SRs are unevenly distributed across the data: Goal relations appear in 68.4% of the occurrences, while Source and Path occur in only 19.6% and 12.0%, respectively. This is in line with the Goal-over-Source principle, according to which languages express the Goal more frequently because it plays a more central role in the conceptualisation of motion events, making the event appear complete and cognitively salient [38]. Moreover, Goal-oriented motion is often perceived as more intentional and purposeful, while Source expressions suggest less human agency [39, 40]. To mitigate this imbalance and ensure a fairer evaluation of model behaviour across relation types, we also construct three distinct, balanced subsets of the dataset (cf. Sections 5.2, 6.1). Each subset isolates a single SR and balances positive and negative examples for that relation. The resulting subset sizes are as follows:
• Goal subset: 394 sentences
• Source subset: 256 sentences
• Path subset: 150 sentences</p>
          <p>The total number of sentences across the subsets exceeds the total number of sentences in the dataset (649), as individual sentences can encode more than one type of SR.</p>
          <p>The experiments are implemented in Python 3.9.13, using the PyTorch and Hugging Face Transformers libraries. To run the Mistral and Llama models, we use an A100 GPU (purchased) and a T4 GPU via Google Colab. Our code is freely available on GitHub7.</p>
          <p>5.2. Prompt Engineering</p>
          <p>Task 1. Task 1 consists in identifying all inflected forms of a given Latin verb in one or more input sentences. The core prompt includes the verb lemma, a linguistic framing, and clear task constraints. Importantly, the input to the models consists of individual sentences rather than full passages. These are extracted directly from PREMOVE (cf. 4.3), in order to isolate sentence-level syntactic and semantic behaviour and to reduce computational cost during inference. The prompt is given below:</p>
          <p>This is a task of Latin linguistics. Given
the following Latin sentences, identify all the
forms of the verb ‘{verb}’ across all sentences.</p>
          <p>Note that verbs may occur more than once and in more than one sentence, so PROVIDE ALL THE FORMS YOU DETECT.</p>
          <p>This task is designed to evaluate models’ ability to identify all inflected forms of a given Latin verb, not to test their recognition of motion semantics per se. While the target lemmas are motion verbs, they are explicitly provided in the prompt to ensure clarity and task focus. This approach also avoids ambiguity in cases where multiple motion verbs may occur in the same sentence, some of which fall outside the scope of annotation. Testing the models’ ability to detect motion verbs without guidance would indeed be a valuable direction for future work, but lies beyond the controlled objectives of this task.</p>
          <p>Models. We choose two open-weight LLMs (Mistral and Llama) and one proprietary model (OpenAI’s GPT) to compare performance across different architectures and accessibility levels. Open-weight models are LLMs whose trained parameters (weights) are publicly released, allowing researchers and developers to run, fine-tune, and deploy them independently. In contrast, proprietary models like GPT are closed-source and accessible only via API or controlled platforms. We use Mistral-7B-Instruct-v0.1, Meta’s Llama-3.2-3B-Instruct, and OpenAI’s GPT-4. We did not perform any fine-tuning on the open-weight models; we used the pre-trained versions as provided on Hugging Face, without further adaptation or training. The prompts are described in Section 5.2. We evaluate model performance under zero-shot, one-shot, and five-shot conditions. In the zero-shot setting, the model is given only the task instruction, without any examples. In the one-shot and five-shot settings, we add one or five manually annotated examples from our corpus (Section 5.1), respectively, to the prompt. These examples are selected at random and aim to reflect typical structures found in the corpus. This design allows us to test how much model performance improves with limited supervision. We intentionally selected models that were not specifically fine-tuned for Latin, to ensure a fair comparison across general-purpose architectures; our aim is to evaluate how LLMs trained primarily on large multilingual or general corpora perform out of the box on Latin. All experiments are performed locally, with a machine comprising 8 CPU cores and 8 GB of RAM.</p>
          <p>Task 2. The base prompt includes a task explanation and binary labels for each SR. A representative zero-shot version is shown below:</p>
          <p>This is a task of Latin linguistics. Given the following Latin sentence, identify all the forms of the verb ‘{verb}’. Then, additionally answer: Does the sentence contain a source expression? True or False; Does the sentence contain a goal expression? True or False; Does the sentence contain a path expression? True or False</p>
          <p>Task 3. This task consists of classifying a spatial token linked to a motion verb as either an adverb, a common noun, or a proper noun. Initial prompts list classification labels and provide a target token. As early outputs show</p>
          <p>7 https://github.com/farina-andrea/latin-spatial-relations-llms. Last accessed: 26 July 2025.</p>
          <p>This is a task of Latin linguistics.</p>
          <p>Given the Latin sentence below, and focusing
specifically on the verb ‘{verb}’, identify
the noun or adverb in the sentence governed by
‘{verb}’ and expressing the spatial relation
‘{relation type}’ (Source, Goal, or Path).</p>
          <p>Classify this token as one of the following:
- An adverb (e.g., ‘hinc’)
- A common noun referring to a place (e.g.,
‘domus’, ‘forum’)
- A proper noun referring to a place name (e.g.,
‘Roma’, ‘Carthago’).</p>
          <p>Sentence: ‘{sentence}’
Answer with exactly two lines, no extra text:
Token: &lt;token&gt;
adverb | common noun | proper noun</p>
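<p>To make the shot settings concrete, the following sketch shows how zero- and few-shot prompts of the kind above could be assembled from annotated examples (our own illustration, not the code released on GitHub; the function names and the example-formatting convention are assumptions):</p>

```python
# Illustrative few-shot prompt builder for Task 2: prepend n randomly chosen
# annotated examples to the base instruction (0 = zero-shot, 1 = one-shot,
# 5 = five-shot). Example formatting is a hypothetical convention of ours.
import random

BASE_INSTRUCTION = (
    "This is a task of Latin linguistics. Given the following Latin sentence, "
    "identify all the forms of the verb '{verb}'. Then, additionally answer: "
    "Does the sentence contain a source expression? True or False; "
    "Does the sentence contain a goal expression? True or False; "
    "Does the sentence contain a path expression? True or False"
)

def build_prompt(verb, sentence, examples, n_shots=0, seed=0):
    """Return the full prompt string for the given shot setting."""
    parts = [BASE_INSTRUCTION.format(verb=verb)]
    if n_shots:
        rng = random.Random(seed)
        for ex in rng.sample(examples, n_shots):
            parts.append(
                f"Sentence: {ex['sentence']}\n"
                f"Source: {ex['source']}; Goal: {ex['goal']}; Path: {ex['path']}"
            )
    parts.append(f"Sentence: {sentence}")
    return "\n\n".join(parts)

examples = [
    {"sentence": "ex urbe exeo", "source": True, "goal": False, "path": False},
    {"sentence": "in urbem ineo", "source": False, "goal": True, "path": False},
    {"sentence": "per urbem transeo", "source": False, "goal": False, "path": True},
]
zero_shot = build_prompt("eo", "Romam adeo", examples, n_shots=0)
one_shot = build_prompt("eo", "Romam adeo", examples, n_shots=1)
```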
        </sec>
        <sec id="sec-2-1-2">
          <title>6. Results</title>
          <sec id="sec-2-1-2-1">
            <title>6.1. Quantitative Evaluation</title>
            <p>Task 1. The results of Task 1 are given in Table 2.</p>
            <table-wrap id="tab2">
              <label>Table 2</label>
              <caption><p>Task 1. Model performances across different shot settings on all 649 sentences. Highest scores per shot setting are highlighted in bold.</p></caption>
              <table>
                <thead>
                  <tr><th>Model</th><th>Setting</th><th>Precision</th><th>Recall</th><th>F1-score</th></tr>
                </thead>
                <tbody>
                  <tr><td>Mistral-7B</td><td>Zero-shot</td><td>0.09</td><td>0.23</td><td>0.13</td></tr>
                  <tr><td>Mistral-7B</td><td>One-shot</td><td>0.08</td><td>0.19</td><td>0.11</td></tr>
                  <tr><td>Mistral-7B</td><td>Five-shot</td><td>0.04</td><td>0.10</td><td>0.06</td></tr>
                  <tr><td>Llama-3.2B</td><td>Zero-shot</td><td>0.33</td><td>0.12</td><td>0.05</td></tr>
                  <tr><td>Llama-3.2B</td><td>One-shot</td><td>0.03</td><td>0.10</td><td>0.05</td></tr>
                  <tr><td>Llama-3.2B</td><td>Five-shot</td><td>0.01</td><td>0.06</td><td>0.02</td></tr>
                  <tr><td>GPT-4</td><td>Zero-shot</td><td>0.95</td><td>0.98</td><td>0.96</td></tr>
                  <tr><td>GPT-4</td><td>One-shot</td><td>0.91</td><td>0.98</td><td>0.94</td></tr>
                  <tr><td>GPT-4</td><td>Five-shot</td><td>0.85</td><td>0.97</td><td>0.91</td></tr>
                </tbody>
              </table>
            </table-wrap>
            <p>GPT-4 strongly outperforms both Llama-3.2-3B-Instruct and Mistral-7B-Instruct on all 649 sentences. Its precision, recall, and F1-scores remain consistently high across all prompt settings, indicating robust zero- and few-shot generalisation. The open-weight models perform poorly and also degrade in performance as shots increase, suggesting that additional examples may introduce noise rather than aid in disambiguation.</p>
            <p>Literal motion. We evaluate Task 2 on a subset annotated exclusively for literal motion verbs, focusing on physical movement and excluding figurative uses. This dataset includes Source, Goal, and Path, but is unbalanced across SRs. Mistral, Llama, and GPT are tested under zero-, one-, and six-shot settings, with the latter including one positive and one negative example per relation.</p>
            <p>As shown in Table 4, Llama’s and Mistral’s performances remain identical and unreliable, marked by low precision and F1-scores, particularly for Path, which is never correctly identified. While slight improvements can be seen for Source under six-shot prompting (F1 = 0.67 for Mistral), overall performance remains inconsistent and largely unchanged compared to the mixed dataset (cf. Table 3). For this reason, both models were excluded from further experiments on Task 2 and from the entirety of Task 3, which builds upon the SR classification performed in Task 2.</p>
            <p>GPT-4 performs considerably better. The Goal relation continues to be the most robust, reaching an F1-score of 0.83 in the six-shot setting. Performance for Source and Path, however, remains more variable and consistently lower, with best F1-scores of 0.61 and 0.54 respectively. This suggests that even in literal motion contexts, Source and Path relations are harder to detect reliably — possibly because Goal is more commonly and overtly expressed in motion events, giving the model stronger and more consistent lexical or structural cues to rely on.</p>
          </sec>
        </sec>
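<p>The precision, recall, and F1 scores reported in Table 2 can in principle be computed from per-sentence sets of gold and predicted verb forms; a minimal micro-averaged sketch follows (our own illustration, not the authors’ evaluation script):</p>

```python
# Micro-averaged precision/recall/F1 over sets of predicted verb forms,
# matching the metrics reported for Task 1. Our own illustration.
def prf1(gold_sets, pred_sets):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold.intersection(pred))   # correctly detected forms
        fp += len(pred.difference(gold))     # overgenerated forms
        fn += len(gold.difference(pred))     # missed forms
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the model overgenerates one extra form in the second sentence,
# the overgeneration pattern described in the qualitative evaluation.
gold = [{"exeo"}, {"transierat"}]
pred = [{"exeo"}, {"transierat", "traduxisse"}]
p, r, f = prf1(gold, pred)
```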
        <sec id="sec-2-1-3">
          <p>Task 2. Results for Task 2 on all 649 sentences are shown in Table 3. Performance varies significantly between GPT on the one hand, and Mistral and Llama on</p>
        </sec>
        <sec id="sec-2-1-4">
          <p>Controlled SRs. To check whether the imbalance between Goal, Source, and Path is contributing to GPT-4’s lower performance, we test the model on three separate subsets of the data. The task was split into three separate sub-tasks, each focused on a sin-</p>
          <p>[Tables 3 and 4: Task 2 Precision, Recall, and F1-score for Mistral-7B, Llama-3.2B, and GPT-4 across shot settings; cell values were lost in extraction.]</p>
        </sec>
        <sec id="sec-2-1-5">
          <p>The results on the split dataset show more stable performance across relations (Table 5). For Source, the best F1 is 0.77 with one-shot prompting; for Goal, recall remains high (0.95) with moderate precision (0.57); and for Path, the best F1 (0.79) is achieved with two-shot prompting.</p>
        </sec>
        <sec id="sec-2-1-6">
          <p>Task 3. Table 6 summarises the performance of GPT-4 in classifying parts of speech in sentences related to motion. We exclude the other two models because of their poor performance on the previous two tasks, on which Task 3 relies (cf. 6.1). Zero- and one-shot prompting achieve the highest F1 score for common nouns, followed by adverbs. For proper nouns, recall is high, while precision is low. This discrepancy between high recall and low precision for proper nouns suggests that while GPT-4 reliably detects their presence, it often overpredicts and misattributes them within the sentence structure (cf. 6.2).</p>
          <p>gle SR, with corresponding dataset subsets (cf. 5.1). We restrict this analysis to GPT-4, as it seems to be the only model to produce SR predictions that are not effectively random (cf. 6.1 above).</p>
          <p>[Tables 5 and 6: per-relation (Source, Goal, Path) and per-SR-type (adverb, common noun, proper noun) Precision, Recall, and F1-score; cell values were lost in extraction.]</p>
        </sec>
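<p>The constrained two-line answer format requested in the Task 3 prompt (Section 5.2) also admits a simple parser; the following sketch (our own assumption, not the released code) rejects hedged or malformed replies:</p>

```python
# Illustrative parser for the constrained two-line Task 3 answer format:
# a "Token: ..." line followed by one of the three part-of-speech labels.
VALID_LABELS = {"adverb", "common noun", "proper noun"}

def parse_answer(raw):
    """Return (token, label), or None if the reply deviates from the format."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if len(lines) != 2 or not lines[0].lower().startswith("token:"):
        return None
    token = lines[0].split(":", 1)[1].strip()
    label = lines[1].lower()
    if label not in VALID_LABELS:
        return None  # e.g. hedged or overgenerated output
    return token, label

ok = parse_answer("Token: Roma\nproper noun")
bad = parse_answer("Token: Roma\nmaybe a noun?")
```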
        <sec id="sec-2-1-7">
          <title>Task 3. The SR type disambiguation task (GPT-4 only)</title>
          <p>Table 6: Task 3 (GPT-4). SR type disambiguation: adverbs, common nouns, proper nouns, under zero-shot and one-shot prompting. The one-shot (*) is given on a proper noun instance. Highest F1-score per shot setting is highlighted in bold.</p>
          <p>6.2. Qualitative Evaluation</p>
          <p>Task 1. Mistral and Llama show high confusion for verb identification, with an overgeneration of predictions that do not include the correct value. They often include forms that are morphologically or semantically related to the correct one (e.g., conveniens instead of conveniunt, subeo instead of subit), though in some cases the forms are entirely unrelated (e.g., advena, adgredior, excolui instead of aggressus). A qualitative inspection of the (few) mismatches for GPT-4 reveals that the model occasionally produces multiple verb forms within its output for a single sentence. Examples include cases such as transierat, traduxisse and evolo, evigila, where multiple words are listed. In these cases, the words are not different inflected forms of the same lemma, but rather distinct verbs or nouns. Nonetheless, the correct verb form is always present among these outputs (evolo, transierat), indicating that these are instances of overgeneration or model uncertainty. This behaviour persists despite prompt-engineering efforts to constrain the output format, suggesting a tendency of the model to hedge its predictions in ambiguous cases. Interestingly, increasing the number of shots does not improve performance, suggesting that additional examples for verb identification may introduce noise or ambiguity rather than reinforcing the model’s task-specific behaviour [41].</p>
          <p>Table 6 displays different levels of the models’ accuracy across parts of speech. While common nouns are identified with high confidence and accuracy, proper nouns pose some challenges, as reflected in lower precision and F1 scores. This finding reinforces the need to treat them separately. Even after prompt engineering (which yielded a slight performance improvement), a consistent pattern of error persists: whenever a proper noun appears in the sentence but is not governed by the target motion verb, the model still annotates it as the relevant argument. Although this is technically a correct identification of a proper noun, it is incorrect in the context of the task. For instance, in the sentence:</p>
          <p>Nam, ut scis optime, secundum quaestum Macedoniam profectus, [...] per transitum spectaculum obiturus, in quadam avia et lacunosa convalli a vastissimis latronibus obsessus atque omnibus privatus tandem evado</p>
          <p>‘So, as you well know, I had set out for Macedonia to earn a living. On the way, planning to take in some sights, I was ambushed in a remote and marshy valley by a band of enormous robbers. Stripped of everything, I finally managed to escape.’ (Apul. Met. 1.7)</p>
          <p>the model correctly identifies Macedoniam as a proper noun but incorrectly links it to the motion verb obeo (in the form obiturus), instead of recognising that it belongs to a different motion verb (profectus, from proficiscor), which is not among the verbs considered for annotation. This may suggest that, in the context of proper nouns, the model relies heavily on their salience and tends to overlook verb-governance constraints. In other words, the model appears to prioritise SR type recognition and semantic prominence over syntactic dependencies when proper nouns are involved. In other cases, the model occasionally misclassifies common nouns as proper nouns.</p>
          <p>Examples include words like fines ‘borders’ or urbs ‘city’, which are common nouns but are mistakenly labelled as proper nouns.</p>
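The lenient treatment implied above for Task 1, where an overgenerated output still counts as correct if the gold verb form appears among the listed candidates, can be sketched as follows. This is an illustrative sketch only: the score function and the example pairs are hypothetical, not the paper's actual evaluation code.

```python
def score(gold, predicted, lenient=False):
    """Return 1 if the prediction counts as correct, else 0.

    `predicted` may contain several comma-separated forms when the
    model overgenerates; lenient scoring accepts the output as long
    as the gold form appears among the candidates, while strict
    scoring requires a single exact match.
    """
    candidates = [c.strip() for c in predicted.split(",")]
    if lenient:
        return int(gold in candidates)
    return int(candidates == [gold])


# Hypothetical outputs echoing the overgeneration pattern discussed above.
pairs = [("transierat", "transierat, traduxisse"),
         ("evolo", "evolo, evigila")]
strict_hits = sum(score(g, p) for g, p in pairs)
lenient_hits = sum(score(g, p, lenient=True) for g, p in pairs)
print(strict_hits, lenient_hits)  # strict: 0, lenient: 2
```

Under strict scoring both outputs fail, whereas lenient scoring credits them because the gold form is always present among the candidates, which is exactly the distinction between genuine errors and overgeneration.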
        </sec>
        <sec id="sec-2-1-8">
          <title>Task 2.</title>
          <p>Mistral’s and Llama’s predictions show that the models randomly assign a positive or negative value to a specific SR. For Goal, F1 is high, as Goal is mostly present in the examples, due to the Goal-over-Source principle [38]. GPT-4’s performance differs depending on the relation type and prompt format. For Goal, performance drops drastically under the one-shot and three-shot settings with an unbalanced dataset. In these cases, the prompt examples possibly do not include a representative positive instance of Goal, causing a steep drop in its recognition. Balancing the dataset improves consistency across SRs, but qualitative errors remain. For instance, the model often confuses Source and Path when the contextual cues are subtle or ambiguous. On the subset limited to literal motion verbs, the model demonstrates relatively strong recognition of Goal, but struggles more with Source and Path.</p>
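The effect of the Goal-heavy class distribution on per-relation scores can be illustrated with a small sketch (the label counts below are hypothetical, not the paper's data): even a degenerate classifier that always answers "Goal" obtains a high Goal F1 while scoring zero on Source and Path, which is why balanced subsets yield more interpretable results.

```python
def prf(gold, pred, label):
    """Per-class precision, recall, and F1 for one SR label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced sample: Goal dominates, as in the unbalanced dataset.
gold = ["Goal"] * 8 + ["Source"] + ["Path"]
# A degenerate model that always predicts the majority class.
pred = ["Goal"] * 10

for label in ("Goal", "Source", "Path"):
    print(label, prf(gold, pred, label))
# Goal reaches F1 ≈ 0.89 with no real discrimination; Source and Path get 0.
```

A macro-average over the three relations (here about 0.30) exposes the failure that the Goal-only F1 conceals, which motivates reporting per-relation scores on balanced subsets.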
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>7. Discussion and Conclusion</title>
      <p>This study evaluates LLMs across three interconnected tasks in Latin linguistic analysis: motion verb identification, SR classification, and SR type disambiguation. Our results are encouraging, but they also highlight the significant differences in performance between models — particularly the stark contrast between GPT-4 and open-weight models such as Llama and Mistral.</p>
      <p>GPT-4 achieves high performance across all tasks, already in zero-shot settings. This is likely due to the substantial presence of Latin data in its pretraining corpus. While the precise contents of GPT-4’s training data remain undisclosed, estimates based on GPT-3 suggest at least 339 million Latin tokens were included [42], and GPT-4 was trained on significantly more data. This makes it plausible that GPT-4 has had substantial exposure to Latin, unlike models such as Llama and Mistral, which likely lack such training data and perform accordingly worse — often failing completely in zero-shot settings.</p>
      <p>For preverbed motion verb identification, GPT-4 achieves strong performance, particularly under zero-shot settings [41]. SR classification exposes challenges due to data imbalance, with Goal relations dominating the dataset. Creating balanced subsets helps obtain more reliable and interpretable results. SR type disambiguation proves the most difficult task, with the model frequently misclassifying proper nouns and failing to correctly link them to the relevant motion verbs. This highlights a gap in the way the models can use contextual reasoning to disambiguate entities. This may be mitigated by expanding the length of the input text so as to offer more context to the models. Error analysis suggests that the model’s dependence on lexical familiarity and world knowledge, which may not perfectly align with classical contexts, limits its accuracy.</p>
      <p>These findings demonstrate that while LLMs show promising semantic understanding in Latin, syntactic and contextual challenges persist. Balancing datasets and employing few-shot prompting improve performance, but do not fully resolve issues related to ambiguity and entity linking.</p>
      <p>Future work should focus on domain-specific fine-tuning with classical corpora, possibly integrating external knowledge sources to enhance disambiguation and semantic grounding. This combined approach can better support the complex linguistic features of Latin and ultimately advance computational tools for classical language research. In parallel, similar experiments should be conducted on other languages to assess how especially open-weight models handle spatial relations in languages for which they have broader coverage. Such comparisons can clarify whether the poor performance observed in Latin stems from language-specific limitations or from more general architectural and training differences. Additionally, future studies could isolate prose texts to control for syntactic regularity, as poetic language often introduces greater structural variability and long-distance dependencies that may challenge model performance.</p>
      <p>Our study — the first on LLMs’ SR recognition in historical languages — clarifies their performance and limits in this area. It lays the groundwork for more specialised computational methods in Computational Humanities and Historical Linguistics, with potential applications to other historical languages where preverbs are vastly employed, such as Ancient Greek [43].</p>
      <p>Author contributions</p>
      <p>AF was responsible for conceptualisation, methodology, formal analysis, software implementation (including all code used for analysis), and manual annotation of the dataset; he wrote the original draft for Sections 1, 3-7, and edited the final manuscript. AB and BMcG contributed to the conceptualisation and methodology of the project, drafted Section 2, and participated in review, editing, and supervision of the research.</p>
      <p>References</p>
      <p>[1] B. Levin, English Verb Classes and Alternations: A Preliminary Investigation, Chicago: The University of Chicago Press, 1993.
[2] L. Talmy, Toward a Cognitive Semantics. Vol. 1: Concept Structuring Systems, Cambridge (MA): MIT Press, 2000.
[3] G. Lakoff, Women, Fire and Dangerous Things: What Categories Reveal about the Mind, Chicago: The University of Chicago Press, 1987.
[4] G. Lakoff, M. Johnson, Metaphors We Live By, Chicago: The University of Chicago Press, 1980.
[5] R. Sprugnoli, F. Iurescia, M. Passarotti, Overview of the EvaLatin 2024 evaluation campaign, in: Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA), Language Resources and Evaluation Conference (LREC 2024), 2024, pp. 190–197.
[6] D. Bamman, P. J. Burns, Latin BERT: A contextual language model for classical philology, arXiv preprint arXiv:2009.10053 (2020). URL: https://arxiv.org/abs/2009.10053.
[7] P. Lendvai, C. Wick, Finetuning Latin BERT for word sense disambiguation on the Thesaurus Linguae Latinae, in: Proceedings of the Workshop on Cognitive Aspects of the Lexicon, Association for Computational Linguistics, Taipei, Taiwan, 2022, pp. 37–41.
[8] I. Ghinassi, S. Tedeschi, P. Marongiu, R. Navigli, B. McGillivray, Language pivoting from parallel corpora for word sense disambiguation of historical languages: A case study on Latin, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 10073–10084.
[9] M. Beersmans, E. de Graaf, T. V. de Cruys, M. Fantoli, Training and evaluation of named entity recognition models for classical Latin, in: A. Anderson, S. Gordin, S. Klein, B. Li, Y. Liu, M. C. Passarotti (Eds.), Proceedings of the Ancient Language Processing Workshop (ALP 2023) associated with the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023), 2023.
[10] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang, Y. Chang, P. S. Yu, Q. Yang, X. Xie, A survey on evaluation of large language models, ACM Trans. Intell. Syst. Technol. 15 (2024). URL: https://doi.org/10.1145/3641289. doi:10.1145/3641289.
[11] Q. Xue, Unlocking the potential: A comprehensive exploration of large language models in natural language processing, Applied and Computational Engineering 57 (2024) 247–252. URL: https://doi.org/10.54254/2755-2721/57/20241341. doi:10.54254/2755-2721/57/20241341.
[12] Z. Wang, W. Zhong, Y. Wang, Q. Zhu, F. Mi, B. Wang, L. Shang, X. Jiang, Q. Liu, Data management for training large language models: A survey, 2024. URL: https://arxiv.org/abs/2312.01700. arXiv:2312.01700.
[13] I. Vieira, W. Allred, S. Lankford, S. Castilho, A. Way, How much data is enough data? Fine-tuning large language models for in-house translation: Performance evaluation across multiple dataset sizes, in: R. Knowles, A. Eriguchi, S. Goel (Eds.), Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Association for Machine Translation in the Americas, Chicago, USA, 2024, pp. 236–249. URL: https://aclanthology.org/2024.amta-research.20/.
[14] M. Volk, D. P. Fischer, L. Fischer, P. Scheurer, P. B. Ströbel, LLM-based machine translation and summarization for Latin, in: Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, ELRA and ICCL, Torino, Italia, 2024, pp. 122–128.
[15] P. Kordjamshidi, M. Van Otterlo, M.-F. Moens, Spatial role labeling: Towards extraction of spatial relations from natural language, ACM Transactions on Speech and Language Processing (TSLP) 8 (2011) 1–36.
[16] Q. Qiu, Z. Xie, K. Ma, Z. Chen, L. Tao, Spatially oriented convolutional neural network for spatial relation extraction from natural language texts, Transactions in GIS 26 (2022) 839–866.
[17] M. A. Syed, E. Arsevska, M. Roche, M. Teisseire, Geospatre: extraction and geocoding of spatial relation entities in textual documents, Cartography and Geographic Information Science 52 (2025) 221–236.
[18] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, G. Wang, GPT-NER: Named Entity Recognition via Large Language Models, arXiv preprint arXiv:2304.10428 (2023).
[19] J. Kenyon, J. W. Karl, B. Godfrey, Evaluation of placename geoparsers, Journal of Map &amp; Geography Libraries 19 (2023) 185–197.
[20] A. Erdmann, C. Brown, B. Joseph, M. Janse, P. Ajaka, M. Elsner, M.-C. de Marneffe, Challenges and solutions for Latin named entity recognition, in: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), 2016, pp. 85–93.
[21] M. Beersmans, E. de Graaf, T. Van de Cruys, M. Fantoli, Training and evaluation of named entity recognition models for classical Latin, in: Proceedings of the Ancient Language Processing Workshop, 2023, pp. 1–12.
[22] T. McEnery, A. Wilson, Corpus Linguistics: An Introduction. Second edition, Edinburgh: Edinburgh University Press, 2001.
[23] M. Rissanen, Three problems connected with the use of diachronic corpora, ICAME Journal 13 (1989) 16–19.
[24] G. B. Jenset, B. McGillivray, Quantitative Historical Linguistics: A Corpus Framework, Oxford University Press, Oxford, 2017.
[25] D. Bamman, G. Crane, The Latin Dependency Treebank in a cultural heritage digital library, Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), Prague (Czech Republic) (2007) 33–40.
[26] D. Bamman, G. Crane, The Ancient Greek and Latin Dependency Treebanks, in: Language Technology for Cultural Heritage, Springer, Berlin/Heidelberg, 2011, pp. 79–98.
[27] P. Cuzzolin, G. V. M. Haverling, Syntax, sociolinguistics, and literary genres, in: P. Baldi, P. Cuzzolin (Eds.), New Perspectives on Historical Latin Syntax, 2009, pp. 16–63.
[28] E. Biagetti, C. Zanchi, W. M. Short, Toward the creation of WordNets for ancient Indo-European languages, in: Proceedings of the 11th Global Wordnet Conference, University of South Africa (UNISA), volume 13, 2021, pp. 258–266.
[29] G. Crane, Building a Digital Library: The Perseus Project as a Case Study in the Humanities, in: DL ’96: Proceedings of the First ACM International Conference on Digital Libraries, 1996, pp. 3–10.
[30] P. H. Institute, Classical Latin Texts: A resource prepared by the Packard Humanities Institute (PHI), 2015.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref31">
        <mixed-citation>[31] J.-C. Klie, INCEpTION: Interactive Machine-[...], Bertinoro, Italy, 2018.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] B. Boullosa, R. E. de Castilho, N. Kumar, J.-C. Klie, [...], Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2018) 127–132.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] R. E. de Castilho, J.-C. Klie, N. Kumar, B. Boullosa, [...] (DI4R) 2018, 9–11 October 2018, Lisbon, Portugal (2018a) 1. URL: https://inception-project.github.io/publications/DI4R-2018.pdf.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] R. E. de Castilho, J.-C. Klie, N. Kumar, B. Boullosa, [...], Proceedings of the 14th eScience IEEE International [...] (2018b) 1. URL: https://inception-project.github.io/publications/ESCIENCE-2018.pdf.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] A. Farina, Guidelines for a linguistic annotation of preverbed verbs of motion, Figshare (2024). URL: https://doi.org/10.18742/25055573.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] A. Farina, PREMOVE - a diachronic dataset of [...] Motion Verbs, Oxford Text Archive (2025). URL: http://hdl.handle.net/20.500.14106/2579.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] A. Farina, The differences in Ancient Greek and [...], Research and Innovation (ref. number: 2749398), 2022–2026.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] Y. Ikegami, ‘Source’ vs. ‘Goal’: A case of linguistic [...], in: Concepts of Case, Narr, Tübingen, 1987, pp. 122–146.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] F. Ungerer, H.-J. Schmidt, An Introduction to Cognitive Linguistics, London: Longman, 1996.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] R. Dirven, M. Verspoor, Cognitive Exploration of Language and Linguistics, Amsterdam/Philadelphia: John Benjamins, 2004.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] B. McGillivray, A. Farina, Are large language models [...], [...]tics 2025, 9–13 June, Udine (Italy) (2025).</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[42] P. J. Burns, Research recap: How much Latin does ChatGPT know (2023). URL: https://isaw.nyu.edu/library/blog/research-recap-how-much-latin-does-chatgpt-know.</mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>[43] A. Farina, Aquamotion Verbs in Ancient Greek: A [...], Pavia: MA Thesis, 2021.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>