<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Detecting Semantic Reuse in Ancient Greek Literature: A Computational Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Caterina D'Angelo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Taddei</string-name>
          <email>andrea.taddei@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <email>alessandro.lenci@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Pisa</institution>
          ,
          <addr-line>Lungarno Pacinotti 43, 56126 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper introduces the first step towards a computational method for detecting semantic textual reuse in Ancient Greek literature. While existing tools focus primarily on exact or near-lexical matching, our approach leverages the semantic capabilities of contextual LLMs, aiming to fine-tune a pretrained encoder via contrastive learning to recognize textual reuse even when expressions are paraphrased and/or morphologically altered. To build a suitable dataset, we developed an automatic pipeline that generates positive samples by extracting paraphrases for each sentence using the Ancient Greek WordNet and a custom-trained morphological re-inflection model. Negative samples, or “confounders”, are selected through topic modeling to ensure thematic relevance while preserving semantic dissimilarity. The model is evaluated through a curated case study on Homeric formulae. We retrieve the top ten most similar sentences in a corpus of Ancient Greek authors from the classical age, assessing model outputs using both standard metrics and comparison with established philological studies. The outcomes demonstrate that contrastive fine-tuning, paired with linguistically informed data augmentation, offers promising directions for identifying non-literal textual reuse in historical corpora. This work contributes a framework for philological discovery, combining deep learning with interpretive scholarship in classical studies.</p>
      </abstract>
      <kwd-group>
        <kwd>Ancient Greek</kwd>
        <kwd>intertextuality</kwd>
        <kwd>contrastive learning</kwd>
        <kwd>paraphrase generation</kwd>
        <kwd>topic modeling</kwd>
        <kwd>morphological inflection</kwd>
        <kwd>synonym extraction</kwd>
        <kwd>computational philology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Reuse and, more generally, intertextuality have
always been a distinctive lens through which literary
works can be analyzed. They have been the focus of literary
critics and philologists such as Gérard Genette
        <xref ref-type="bibr" rid="ref11">(Genette, 1982)</xref>
        , Julia Kristeva
        <xref ref-type="bibr" rid="ref13">(Kristeva, 1986)</xref>
        ,
Roland Barthes
        <xref ref-type="bibr" rid="ref1">(Barthes, 1975)</xref>
        and Michael
Riffaterre
        <xref ref-type="bibr" rid="ref18">(Riffaterre, 1978)</xref>
        , who established the
importance of intertextual allusions as well as word-for-word
quotations, with structuralist thinking going
as far as to say that «Intertextuality is […] the
mechanism specific to literary reading. It alone, in fact,
produces significance, while linear reading, common to
literary and nonliterary texts, produces only meaning»
        <xref ref-type="bibr" rid="ref11">(Genette, 1982, p. 18)</xref>
        . With the present work, our
aim is to build a computational tool that can aid in the
complex task of identifying instances of reuse in
Ancient Greek texts. We start from the definition of
intertextuality that Gérard Genette gives us, focusing on
its less literal guise: «it is the traditional practice of
quoting […] in a still less explicit and less literal guise, it
is the practice of allusion»
        <xref ref-type="bibr" rid="ref11">(Genette, 1982, p. 18)</xref>
        . In
the following paper we focus specifically on semantic
reuse by developing methods to detect semantic
connections that may indicate shared themes, motifs,
or conceptual relationships between texts. Our
approach represents a foundational step toward the
broader goal of computational intertextuality
detection, providing scholars with a tool to identify
semantically related passages that merit further
philological investigation.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>
          Existing computational tools for reuse detection in
classical languages are primarily based on lexical
similarity. Among them, the most prominent is the
Tesserae project
          <xref ref-type="bibr" rid="ref8">(Coffee, et al., 2013)</xref>
          , which
identifies parallels in Latin and Ancient Greek texts by
combining lexical overlap with phonetic and thematic
similarity, the latter through topic modeling
algorithms. Nonetheless, such thematic similarity
does not imply intentional intertextuality, which
involves the conscious use of another author’s
language or ideas.
        </p>
        <p>Another widely used tool is Diogenes (https://d.iogen.es/web/), a desktop
application that enables exact lexical searches across
a large corpus of classical texts.</p>
        <p>Another significant tool in this domain is TRACER
<xref ref-type="bibr" rid="ref5">(Büchler et al., 2014)</xref>, a flexible framework for
automatic detection of text reuse that supports
multiple similarity measures.</p>
        <p>Despite their usefulness, these systems are
focused on surface-level matches and fail to capture
semantic paraphrases or allusive reuse.</p>
        <p>
          To detect such deeper forms of intertextuality,
recent approaches have turned to distributional
semantics. A key challenge, however, is the scarcity of
annotated and homogeneous corpora in ancient
languages, which makes training large language
models (LLMs) difficult
          <xref ref-type="bibr" rid="ref14">(Moritz, Wiederhold, Pavlek,
Bizzoni, &amp; Buchler, 2016)</xref>
          .
        </p>
        <p>
          A seminal contribution in this direction is
          <xref ref-type="bibr" rid="ref6">(Burns,
Brofos, Li, Chaudhuri, &amp; Dexter, 2021)</xref>
          , who use
Word2Vec embeddings to measure the semantic
similarity between Latin bigrams. Their method
computes pairwise cosine similarities between words
and averages the results. Although effective, they
acknowledge the limitations of static embeddings and
propose that contextual embeddings (e.g.,
BERT-based models) may offer better nuance and
generalization.
        </p>
        <p>Burns et al. also frame
intertextuality as a form of anomaly detection, using
the embeddings created from the corpus of a specific
author (in this case Livy) as input for an SVM: with this
model, the goal is to predict the “Livianess” of each
work, so as to find instances in which other authors
have alluded to Livy’s works.</p>
        <p>
          Following the parallel between intertextuality and
anomaly detection, similar methods have been
explored in the context of authorship attribution. In
          <xref ref-type="bibr" rid="ref21">(Yamschikov, Tikhonov, Pantis, Schubert, &amp; Jurgen,
2022)</xref>
          the authors aim to obtain contextual
embeddings for Ancient Greek by leveraging transfer
learning. Starting from pre-trained models, they
fine-tune both a multilingual transformer and one trained
on Modern Greek, adapting them to downstream tasks
in Ancient Greek.
        </p>
        <p>While this approach demonstrates the feasibility
of adapting general-purpose models to low-resource
historical languages, it suffers from the limitations of
using a tokenizer and vocabulary not optimized for
Ancient Greek.</p>
        <p>A common obstacle encountered in our research
pertains to the shortage of digitized Ancient Greek
texts. The main source would be the Thesaurus Linguae
Graecae (https://stephanus.tlg.uci.edu/), but its policy
forbids using the data for machine learning purposes.</p>
        <p>
          Nonetheless, the work by
          <xref ref-type="bibr" rid="ref21">(Yamschikov, Tikhonov,
Pantis, Schubert, &amp; Jurgen, 2022)</xref>
          inspired our own
application of transfer learning, allowing us to make
efficient use of limited annotated data while focusing
on semantic reuse detection.
        </p>
        <p>
          A similar strategy is adopted by
          <xref ref-type="bibr" rid="ref16 ref17">(Riemenschneider
&amp; Frank, 2023)</xref>
          , who leverage pre-trained language
models to detect intertextual allusions in a
multilingual setting, analyzing sentence-level
correspondences across Ancient Greek, Latin, and
English. Although their focus lies primarily on
crosslingual reuse, their work further confirms the
potential of contextual models in identifying
nonliteral textual relationships.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Contributions</title>
        <p>This paper makes the following contributions:</p>
        <p>• We propose an automated pipeline for
generating paraphrases of Ancient
Greek sentences, combining resources
such as the Ancient Greek WordNet
with a custom-trained morphological
re-inflection model based on annotated
Ancient Greek data.</p>
        <p>• We conduct a qualitative assessment of
different contextual encoders for
Ancient Greek, based on the synonym
retrieval task described in Section 2.1.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>To fine-tune a model for semantic reuse detection in
Ancient Greek, we first selected a suitable encoder.</p>
      <p>We then constructed a contrastive dataset
consisting of 11,305 triplets, each composed of a
query sentence, a positive sample (paraphrase), and a
negative sample (confounder). The query sentences
were randomly extracted from a subcorpus of works
by Homer, Thucydides, and Herodotus, taken from the
Opera Graeca Adnotata. Positive and negative samples
were generated automatically through the paraphrase
and confounder generation pipeline described in
Sections 2.1 and 2.2.</p>
      <sec id="sec-2-1">
        <title>2.1. Model Selection</title>
        <p>Although Ancient Greek remains a low-resource
language, recent years have seen the development of
several contextual language models tailored to its
linguistic properties. For our task, the encoder must
capture contextual semantic information,
particularly the similarity between lexically and
morphologically varied expressions.</p>
        <p>To evaluate model performance in capturing
semantic relationships, we designed a synonym
retrieval task, described in detail below.</p>
        <p>The models considered include:</p>
        <p>• Logion
          <xref ref-type="bibr" rid="ref9">(Cowen-Breen, Brooks, Haubold, &amp; Graziosi, 2023)</xref>
          : a BERT-based architecture pre-trained on Modern
Greek and fine-tuned on Ancient Greek texts from
First1KGreek (https://opengreekandlatin.github.io/First1KGreek/),
the Perseus Digital Library (https://www.perseus.tufts.edu/hopper/),
and data obtained from fellow scholars. The training
corpus comprises approximately 70 million words. In its
50K version, a WordPiece tokenizer was trained on the
same corpus, resulting in a vocabulary of 50,000
subword units tailored to Ancient Greek.</p>
        <p>• GreBERTA
          <xref ref-type="bibr" rid="ref16">(Riemenschneider &amp; Frank, 2023)</xref>
          : a RoBERTa-style encoder with dynamic masking,
trained on a composite corpus including the Open Greek
and Latin Project (https://opengreekandlatin.org/, 30M
tokens), the CLARIN Greek Medieval corpus
(https://inventory.clarin.gr/corpus/890, 3.3M), the
Patrologia Graeca (https://patristica.net/graeca/, 28.5M),
and the Ancient Greek texts contained in the Internet
Archive (https://archive.org/, 123.3M). Despite its size,
the latter source contains substantial noise and
inconsistencies.</p>
        <p>• Word2Vec: a non-contextual baseline
model, included for comparison.</p>
        <p>As will be further explained below,
lemmatization was necessary for synonym extraction.
We therefore compared the two main lemmatization
libraries available for Ancient Greek: CLTK
(http://cltk.org/) and greCy
(https://github.com/jmyerston/greCy).</p>
        <p>[Table: synonym retrieval results by model and
lemmatizer. Logion 50K and Word2Vec (with either CLTK
or greCy) predict διαβαίνω (“to go up”), while Logion
BASE and GreBERTa (with either lemmatizer) predict
στείχω (“to go”); of the similarity scores, only the
value 0.34 is recoverable here.]</p>
        <p>Since the objective of our model is to detect semantic
reuse, positive samples must exemplify cases of
non-literal reuse. For this purpose, we developed an
automated pipeline for paraphrase generation
through targeted lexical substitution, following data
augmentation techniques such as those described in
          <xref ref-type="bibr" rid="ref2">(Bayer, Kaufhold, &amp; Reuter, 2022)</xref>
          .</p>
          <p>
            Specifically, we focused on substituting
semantically salient tokens—nouns, verbs, and
adjectives—with suitable synonyms. To identify
these, we combined lexical information from the
Ancient Greek WordNet
            <xref ref-type="bibr" rid="ref3">(Bizzoni, et al., 2014)</xref>
            with
semantic similarity estimates derived from contextual
embeddings.
          </p>
          <p>For each semantically relevant word in a sentence,
we queried the WordNet to retrieve its synsets (i.e.,
sets of synonyms grouped by sense). For each offset
(individual sense), we collected a candidate list of
synonyms. We then computed the cosine similarity
between the contextual embedding of the original
word and four contextual embeddings of each
synonym, obtained by extracting four different
sentence contexts in which that synonym appears.
The sentences were extracted from the corpus
Lemmatized Ancient Greek Texts
(https://github.com/gcelano/LemmatizedAncientGreekXML)
by Giuseppe Antonio Celano.</p>
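          <p>As an illustration, the following sketch shows how this scoring step can be implemented with the HuggingFace transformers library. It is a minimal reconstruction, not the authors' released code: the checkpoint path is a placeholder (substitute any Ancient Greek encoder, e.g. a Logion or GreBERTa release), and the subtoken matching is deliberately naive.</p>
          <preformat>
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint: any Ancient Greek contextual encoder.
tokenizer = AutoTokenizer.from_pretrained("path/to/ancient-greek-encoder")
model = AutoModel.from_pretrained("path/to/ancient-greek-encoder")

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Contextual embedding of `word`: mean of its subtoken vectors in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(word_ids) + 1):         # naive subtoken match
        if ids[i:i + len(word_ids)] == word_ids:
            return hidden[i:i + len(word_ids)].mean(dim=0)
    return hidden.mean(dim=0)                             # fallback: sentence mean

def score_synonym(orig_sentence, orig_word, synonym, contexts):
    """Average cosine similarity between the original word in context and the
    candidate synonym embedded in (up to) four corpus sentences."""
    target = word_embedding(orig_sentence, orig_word)
    sims = [torch.cosine_similarity(target, word_embedding(c, synonym), dim=0)
            for c in contexts[:4]]
    return float(sum(sims) / len(sims))
          </preformat>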
          <p>This method allowed us to select the most
semantically coherent synonym among candidates,
accounting for the high degree of polysemy in the
Ancient Greek vocabulary.</p>
          <sec id="sec-2-1-1">
            <title>2.1.1. Re-inflection Model</title>
            <p>As mentioned above, the synonym selection pipeline
outputs the lemma of the best synonym. However, to
generate a valid paraphrase within the Ancient Greek
sentence, it is necessary to re-inflect the selected
lemma according to the morphological features of the
word it replaces.</p>
          <p>To this end, we developed a morphological
re-inflection model, which takes as input the lemma and
a set of morphological features (e.g., case, number,
tense) and returns the inflected form.</p>
            <p>The model was trained on a corpus constructed by
merging and normalizing data from multiple resources:</p>
            <p>• SIGMORPHON 2023 – UniMorph Shared Task
(https://github.com/sigmorphon/2023InflectionST): 5,572
inflected forms annotated with morphosyntactic
features.</p>
            <p>• Perseus Project: a dataset of 1,290,544
linguistically annotated forms originally produced by
Morpheus, the parser and generator of Ancient Greek
inflected forms
              <xref ref-type="bibr" rid="ref10">(Crane, 1991)</xref>
              .</p>
            <p>• Opera Graeca Adnotata
(https://github.com/OperaGraecaAdnotata/OGA)
              <xref ref-type="bibr" rid="ref7">(Celano, 2024)</xref>
              : a morphologically annotated corpus curated by
G. A. Celano, from which we extracted 589,105 forms.</p>
            <p>After removing defective entries and applying
standard normalization procedures (e.g., Unicode
harmonization, feature unification), we trained a
sequence-to-sequence model composed of an LSTM
layer, a dropout layer, and a Bidirectional LSTM
decoder. This architecture was chosen for its balance
between simplicity and effectiveness in
character-level morphological generation tasks.</p>
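            <p>A minimal sketch of such an architecture in Keras is given below. It casts the task as aligned, padded character transduction, which is one simple reading of the described stack; the vocabulary size, sequence length, embedding and hidden dimensions, dropout rate, and placeholder data are illustrative assumptions, while the layer sequence, the Adam optimizer with learning rate 0.001, and the early-stopping regime follow the text.</p>
            <preformat>
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB = 120    # assumed inventory of characters plus feature symbols
MAX_LEN = 40   # assumed padded length shared by input and output sequences

# Input: lemma characters followed by morphological feature tags;
# output: the characters of the inflected form, predicted per time step.
model = keras.Sequential([
    layers.Embedding(VOCAB, 64, mask_zero=True),
    layers.LSTM(256, return_sequences=True),                     # LSTM layer
    layers.Dropout(0.3),                                         # dropout layer
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax")),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder integer-encoded data; the real corpus is described above.
X = np.random.randint(1, VOCAB, size=(1000, MAX_LEN))
Y = np.random.randint(1, VOCAB, size=(1000, MAX_LEN))

# Training regime as reported in Section 3.1: up to 120 epochs with
# early stopping (patience = 10) on the validation loss.
model.fit(X, Y, validation_split=0.1, epochs=120,
          callbacks=[keras.callbacks.EarlyStopping(patience=10)])
            </preformat>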
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Negative Samples</title>
        <p>To create negative samples for the contrastive
learning task, we introduced the notion of lexical
confounders: sentences that share semantically
relevant words with the target sentence but express a
different meaning. This technique yields “hard
negatives” that teach the model to disentangle lexical
similarity from semantic equivalence, and thus help it
recognize sentences that are semantically similar
despite having no lexical overlap.</p>
        <p>To automatically select these confounders, we
applied topic modeling with the goal of identifying
sentences that differ in thematic content. The
underlying assumption is that sentences on distinct
topics are unlikely to convey the same meaning, even
if they share lexically similar elements.</p>
        <p>The topic modeling process was carried out on the
Opera Graeca Adnotata corpus, leveraging
lemmatized tokens to improve generalization. We
first applied the Hierarchical Dirichlet Process
(HDP) to estimate the optimal number of latent topics
(resulting in k = 10), and then trained a Latent
Dirichlet Allocation (LDA) model accordingly. The
resulting LDA model achieved an average UMass topic
coherence score of −0.68, indicating a moderate level
of interpretability, suitable for the identification of
semantically distinct negative samples.</p>
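        <p>A compact sketch of this pipeline with the gensim library (an assumed implementation choice; the input data is a toy placeholder) is given below, together with the dominant-topic test used to qualify a confounder candidate.</p>
        <preformat>
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, HdpModel, LdaModel

# `sentences`: lemma-token lists from the lemmatized OGA corpus;
# a tiny placeholder is used here.
sentences = [["ναῦς", "πλέω", "θάλασσα"], ["μάχη", "στρατός", "νίκη"]]

dictionary = Dictionary(sentences)
bow = [dictionary.doc2bow(s) for s in sentences]

# Nonparametric HDP to gauge how many topics carry real weight;
# in the paper's setup this suggested k = 10.
hdp = HdpModel(bow, id2word=dictionary)
k = 10

lda = LdaModel(bow, id2word=dictionary, num_topics=k, passes=10)

# UMass coherence needs only the corpus itself (paper: about -0.68).
umass = CoherenceModel(model=lda, corpus=bow,
                       coherence="u_mass").get_coherence()

def dominant_topic(lemmas):
    """Index of the highest-probability LDA topic for a lemmatized sentence."""
    probs = lda.get_document_topics(dictionary.doc2bow(lemmas))
    return max(probs, key=lambda t: t[1])[0] if probs else -1

# A confounder candidate qualifies if it shares a semantically relevant
# lemma with the query sentence but has a different dominant topic.
        </preformat>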
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>In this section, we present the results obtained from
the evaluation of the two main components of our
pipeline: the re-inflection model and the contrastive
sentence encoder.</p>
      <sec id="sec-3-1">
        <title>3.1. Re-inflection Model Evaluation</title>
        <p>To generate grammatically coherent paraphrases, we
trained a sequence-to-sequence model to perform
morphological inflection from lemma + features to
surface form. The architecture consists of a
single-layer LSTM followed by dropout and a bidirectional
LSTM.</p>
        <p>The model was trained for a maximum of 120
epochs with early stopping (patience = 10), halting at
epoch 77. We used the Adam optimizer with a
learning rate of 0.001.</p>
        <p>The learning curves of accuracy and loss for the
training and validation set can be seen in Figure 1 and
Figure 2.</p>
        <p>The model reached 0.90 accuracy on both the
validation and test set. While performance on
frequent forms is consistent, rare accented forms
remain problematic. For instance, characters such as
“ΐ” and “ΰ”, which appear only 398 and 48 times
respectively in the validation set, obtained F1-scores
as low as 0.39 and 0.21. This imbalance affects the
macro average, which is significantly lower than the
weighted average, as shown in Table 3 (test set
results).</p>
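        <p>For reference, the per-character scores and the macro/weighted averages can be computed as below with scikit-learn. This is an assumed evaluation recipe, not the authors' script; it presupposes gold and predicted forms aligned character by character (padding symbols excluded), which holds under the padded transduction reading sketched in Section 2.1.1.</p>
        <preformat>
from sklearn.metrics import classification_report

gold_forms = ["λόγοι", "θεοῦ"]   # toy aligned examples
pred_forms = ["λόγοι", "θεοῦ"]

# Flatten aligned character sequences.
gold_chars = [c for form in gold_forms for c in form]
pred_chars = [c for form in pred_forms for c in form]

# Per-character precision/recall/F1 plus macro and weighted averages:
# rare accented characters drag the macro average below the weighted one.
print(classification_report(gold_chars, pred_chars, zero_division=0))
        </preformat>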
        <p>Nonetheless, performance on frequent cases is
sufficient to support the generation of realistic
paraphrastic samples.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Contrastive Model Evaluation</title>
        <p>To fine-tune the Logion 50k model, we used the
HuggingFace SentenceTransformers library,
representing each sentence with its [CLS] embedding.
The model was trained for 7 epochs, reaching its
optimal performance at epoch 6.18. We used the
AdamW optimizer with a learning rate of 5e-6 and a
weight decay of 0.01.</p>
        <p>The contrastive dataset was split into 80%
training, 10% validation, and 10% testing, with
sentence triplets shuffled prior to the split to ensure
distributional uniformity across subsets.</p>
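        <p>The setup can be reproduced along the following lines with the sentence-transformers library. This is a hedged sketch: the checkpoint path, batch size, and toy triplet are placeholders, while CLS pooling, the triplet objective, 7 epochs, and the AdamW settings (learning rate 5e-6, weight decay 0.01) follow the text.</p>
        <preformat>
from torch.utils.data import DataLoader
from sentence_transformers import (InputExample, SentenceTransformer,
                                   losses, models)

word = models.Transformer("path/to/logion-50k")        # placeholder path
pool = models.Pooling(word.get_word_embedding_dimension(),
                      pooling_mode="cls")              # [CLS] sentence embedding
model = SentenceTransformer(modules=[word, pool])

# `triplets` holds (query, paraphrase, confounder) strings; in the paper,
# 11,305 such triplets are split 80/10/10 after shuffling.
triplets = [("query sentence", "paraphrase", "confounder")]
train = [InputExample(texts=[q, p, n]) for q, p, n in triplets]
loader = DataLoader(train, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model)

model.fit(train_objectives=[(loader, loss)],
          epochs=7,
          optimizer_params={"lr": 5e-6},
          weight_decay=0.01)
        </preformat>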
        <p>Figures 3 and 4 illustrate the training and
validation curves for loss and accuracy, showing a
stable convergence pattern.</p>
        <p>The final accuracy on the test set is 0.81, marking
a notable improvement over earlier experiments. In a
preliminary run using only 5,000 triplets, the model
reached an accuracy of 0.71, highlighting its
sensitivity to the amount of training data.</p>
        <p>Due to the computational complexity of the
pipeline used to generate positive and negative
samples, we limited the dataset to ~11,000 triplets.
However, we hypothesize that a larger dataset—
enabled by scaling the paraphrase and confounder
generation—would likely lead to further performance
improvements. The model shows strong
generalization capabilities despite the relatively
limited dataset size.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Case Study: Homeric Formulae</title>
        <p>To evaluate the model’s ability to detect semantic
reuse, we selected Homeric formulas from the Odyssey
and retrieved their most similar counterparts from
the prose corpora of Herodotus and Thucydides using
cosine similarity.</p>
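        <p>The retrieval step itself reduces to a cosine-similarity search over sentence embeddings, sketched below with the fine-tuned model from the previous sketch; the two one-sentence corpora are placeholders drawn from the examples discussed in this section.</p>
        <preformat>
from sentence_transformers import util

homer_sentences = ["ἄσμενοι ἐκ θανάτοιο, φίλους ὀλέσαντες ἑταίρους"]
prose_sentences = ["κομισθεὶς ἄρα ἐς τὰς Ἀθήνας ἀπήγγελλε τὸ πάθος"]

emb_homer = model.encode(homer_sentences, convert_to_tensor=True)
emb_prose = model.encode(prose_sentences, convert_to_tensor=True)

sims = util.cos_sim(emb_homer, emb_prose)        # (n_homer, n_prose) matrix
k = min(3, len(prose_sentences))
top = sims.topk(k=k, dim=1)                      # top-k prose matches per formula

for i, formula in enumerate(homer_sentences):
    for score, j in zip(top.values[i], top.indices[i]):
        print(f"{formula} -> {prose_sentences[int(j)]} (CosSim: {float(score):.2f})")
        </preformat>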
        <p>We first performed a general comparison by
encoding all sentences from Homer, Herodotus, and
Thucydides. For each Homeric sentence, we computed
the most similar sentence from both historians. Figure
5 reports how often the most similar match came from
each author. Herodotus consistently emerged as the
“most Homeric” in style.</p>
        <p>We then zoom in on the top matches for a handful
of Homeric formulas. Table 4 reports the top-3 most
similar matches (with cosine similarity) from
Herodotus and Thucydides.</p>
        <p>In Herodotus, the top match for “ἄσμενοι ἐκ
θανάτοιο, φίλους ὀλέσαντες ἑταίρους” (“Glad to have
escaped death, having lost dear companions”) is:
κομισθεὶς ἄρα ἐς τὰς Ἀθήνας ἀπήγγελλε τὸ πάθος
(V.87) “Back in Athens, he reported the terrible news.”
(CosSim: 0.73)</p>
        <p>Though the sentences are lexically unrelated, the
narrative context aligns: both recount survival from
disaster followed by the emotional burden of
reporting it. In the Herodotean passage, the warrior
coming home is the only survivor: he, too, has “lost
dear companions”. The model appears to capture
these semantic and narrative parallels, ignoring
surface forms.</p>
        <p>
          On the other hand, the matching Thucydidean
phrase “καὶ τροπαῖον στήσαντες ἀνεχώρησαν ἐς τὸ
Ῥήγιον” (IV.25, “and having set up a trophy, they
withdrew to Rhegium”) refers to a commemorated but
marginal victory: as noted by
          <xref ref-type="bibr" rid="ref12">(Graves, 1884)</xref>
          , the use
of fixed epic-like expressions for minimal
accomplishments may reflect a form of ironic
intertextuality.
        </p>
        <p>[Table 4 residue: additional matches include “νῦν τε ὅδε ἐστί” (“And here it is now”, Sim: 0.72) and “πολέμιος οὖν ἦν” (“He was therefore an enemy”, Sim: 0.73).]</p>
        <p>Across both historians, the model demonstrates
sensitivity to semantic and narrative similarities even
in the absence of direct verbal overlap. This reinforces the
notion that the contrastive objective, paired with
linguistically informed data, enables detection of
non-literal textual reuse. Herodotus tends to reuse
Homeric motifs to elevate the narrative or align with
epic tradition, while Thucydides may repurpose
similar forms to subvert or problematize epic
conventions.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>The results of our evaluation show that the proposed
model is capable of identifying semantic similarity in
Ancient Greek texts with a significant degree of
accuracy. The performance of the contrastive model—
reaching 0.81 accuracy on the test set—suggests that
even with a relatively limited dataset, it is possible to
fine-tune contextual embeddings for a low-resource
language such as Ancient Greek.</p>
      <p>Importantly, our qualitative case study
demonstrates that the model does not rely solely on
lexical overlap, but is able to capture semantic
connections grounded in context. This capability is
particularly relevant for supporting scholarly analysis
of textual relationships, where surface variation and
thematic connections require careful interpretation.</p>
      <p>Our analysis of Herodotus' proximity to Homer in
the similarity distributions aligns with established
literary hypotheses about thematic continuity and
shared motifs between these authors. However, it is
important to note that the semantic similarities
detected by our model represent connections that
merit further philological investigation rather than
definitive instances of literary allusion. The
distinction between shared themes, common literary
topoi, and intentional intertextual references requires
expert scholarly judgment that goes beyond
computational analysis.</p>
      <p>The matches found in Thucydides, while
semantically related to Homeric passages, illustrate
this distinction clearly: while our model identifies
thematic connections, determining whether these
represent ironic reuse, coincidental similarity, or
genuine allusion requires deeper interpretive
knowledge of the historical and literary context. The
contrastive learning objective appears well-suited to
identifying such semantic connections as potential
candidates for scholarly investigation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>Our approach faces several important limitations that
should be acknowledged:</p>
      <p>• Methodological limitations: The generation
of paraphrastic and confounding samples,
while linguistically motivated, is
computationally expensive and depends on
the quality of available lexical resources. The
method relies heavily on the accuracy of
synonym lists from the Ancient Greek WordNet
and of the morphological re-inflection model.</p>
      <p>• Evaluation constraints: Our evaluation
remains primarily qualitative and
impressionistic. A more rigorous
assessment would require comparison with
known allusions identified in the scholarly
literature, which represents a significant
challenge for future work.</p>
      <p>• Scope of detection: Our model identifies
semantic similarities and thematic
connections, but cannot distinguish between
coincidental similarity, shared literary
tradition, and intentional allusion. This
distinction requires expert philological
knowledge and cultural context that
computational methods cannot currently
provide.</p>
      <p>• Dataset limitations: The relatively small
dataset limits the model's generalizability,
and further work is needed to expand
coverage across different genres, time
periods, and authors to explore cross-genre
or diachronic reuse phenomena.</p>
      <p>These limitations do not invalidate our
approach but rather define its appropriate scope:
as a tool for identifying semantically related
passages that warrant scholarly attention, rather
than as an autonomous detector of literary
allusions.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This paper presented a novel approach to the
detection of semantic reuse in Ancient Greek
literature through the use of contrastive learning and
contextual language models. We developed a pipeline
for generating paraphrastic sentence pairs and
lexically confounding negatives, enabling the
fine-tuning of an encoder model specifically trained for
Ancient Greek.</p>
      <p>Our method demonstrates the feasibility of
identifying thematic connections and semantic
relationships in ancient texts, providing a foundation
for future work in computational intertextuality
detection.</p>
      <p>While promising, this system is not meant to
replace human judgment. In many cases,
interpretation requires close reading and contextual
insight that go beyond the scope of automated
retrieval. Rather, our model should be seen as an
exploratory aid, offering novel perspectives and
candidate matches for scholarly validation.</p>
      <p>Looking ahead, our goal is to scale the dataset by
including larger portions of Herodotean,
Thucydidean, and Homeric corpora, and to refine the
model further through application to other authors
and genres. In particular, we aim to focus on specific
thematic domains such as the lexicon of the sacred.
Future work should also include more rigorous
evaluation against annotated corpora of known
literary allusions identified in scholarly literature as
well as an evaluation of the paraphrases and
confounders by scholarly experts.</p>
      <p>Ultimately, this study shows that the intersection
of artificial intelligence and philology is not only
feasible, but capable of generating innovative and
promising contributions to the study of ancient
textual reuse.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgements</title>
      <p>We thank Federico Boschetti for sharing
annotated corpora and the Ancient Greek WordNet,
Sebastian Padó for his contribution to the initial
phases of this work, and Barbara Graziosi for her kind
willingness to discuss the use of large language
models for Ancient Greek.</p>
      <p>The implementation code is available from the
corresponding author.</p>
      <p>[Appendix figure: correct synonyms are shown in green;
synonyms semantically similar to the original word in
yellow; results stemming from an incorrect lemmatization
in red; synonyms considered wrong are not underlined.]</p>
      <p>During the preparation of this work, the author(s) used
ChatGPT (OpenAI) for text translation and to improve the
writing style. After using these tools/services, the
author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Barthes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>1975</year>
          ).
          <source>Il Piacere del Testo</source>
          . Torino: Einaudi.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bayer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaufhold</surname>
            , M.-
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Reuter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>A Survey on Data Augmentation for Text Classification</article-title>
          .
          <source>ACM Computing Surveys</source>
          , vol.
          <volume>55</volume>
          ,
          <issue>Issue 7</issue>
          ,
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bizzoni</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boschetti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Gratta</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diakoff</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monachini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Crane</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The making of Ancient Greek WordNet</article-title>
          .
          <source>LREC 2014</source>
          (p.
          <fpage>1140</fpage>
          -
          <lpage>1147</lpage>
          ). Paris, France: European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Boschetti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Semantic Analysis and Thematic Annotation</article-title>
          . In M. Berti (Ed.), Digital Classical Philology:
          <article-title>Ancient Greek and Latin in the Digital Revolution</article-title>
          (pp.
          <fpage>321</fpage>
          -
          <lpage>340</lpage>
          ). Berlin, Boston: De Gruyter Saur.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Büchler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franzini</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franzini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Towards a Historical Text Re-use Detection</article-title>
          . In: Biemann, C., Mehler, A. (eds.), Text Mining.
          <source>Theory and Applications of Natural Language Processing</source>
          . Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Burns</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brofos</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dexter</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Profiling of Intertextuality in Latin Literature Using Word Embeddings</article-title>
          .
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          (p.
          <fpage>4900</fpage>
          -
          <lpage>4907</lpage>
          ). Online: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Celano</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>Opera Graeca Adnotata: Building a 34M+ Token Multilayer Corpus for Ancient Greek</article-title>
          .
          <source>arXiv abs/2404.00739</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Coffee</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koenig</surname>
            ,
            <given-names>J.-P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poornima</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forstall</surname>
            ,
            <given-names>C. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ossewaarde</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jacobson</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>The Tesserae Project: intertextual analysis of Latin poetry</article-title>
          .
          <source>Literary and Linguistics Computing</source>
          ,
          <fpage>221</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Cowen-Breen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haubold</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Graziosi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Logion: Machine Learning for Greek Philology</article-title>
          .
          <source>Proceedings of the Ancient Language Processing Workshop</source>
          (p.
          <fpage>170</fpage>
          -
          <lpage>178</lpage>
          ). Varna, Bulgaria: INCOMA Ltd.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Crane</surname>
            ,
            <given-names>G. R.</given-names>
          </string-name>
          (
          <year>1991</year>
          ).
          <article-title>Generating and Parsing Classical Greek</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          , vol.
          <volume>6</volume>
          ,
          <fpage>243</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Genette</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>1982</year>
          ). Palimpsests. Lincoln and London: University of Nebraska Press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          (
          <year>1884</year>
          ). Commentary on Thucydides. London: MacMillan &amp; Company.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Kristeva</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1986</year>
          ).
          <article-title>Word, Dialogue and Novel. The Kristeva Reader</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Moritz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiederhold</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavlek</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizzoni</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Buchler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Non-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and its Application to Bible Reuse</article-title>
          .
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          (p.
          <fpage>1849</fpage>
          -
          <lpage>1859</lpage>
          ). Austin, Texas: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Pranaydeep</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rutten</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lefever</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek</article-title>
          .
          <source>Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage</source>
          ,
          <source>Social Sciences, Humanities and Literature</source>
          , (p.
          <fpage>128</fpage>
          -
          <lpage>137</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Riemenschneider</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Exploring Large Language Models for Classical Philology</article-title>
          .
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          (p.
          <fpage>15181</fpage>
          -
          <lpage>15199</lpage>
          ). Toronto, Canada: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Riemenschneider</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Graecia capta ferum victorem cepit. Detecting Latin Allusions to Ancient Greek Literature</article-title>
          .
          <source>Proceedings of the Ancient Language Processing Workshop</source>
          (p.
          <fpage>30</fpage>
          -
          <lpage>38</lpage>
          ). Varna, Bulgaria: INCOMA Ltd.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Riffaterre</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1978</year>
          ).
          <article-title>Semiotics of poetry</article-title>
          . Bloomington and London: Indiana University Press.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Rodda</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Probert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>McGillivray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <source>Vector Space Models of Ancient Greek Word Meaning, and A Case Study on Homer. Traitement Automatique des Langues (TAL)</source>
          . (p.
          <fpage>63</fpage>
          -
          <lpage>87</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Stopponi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peels-Matthey</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nissim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2024</year>
          ).
          <article-title>AGREE: a new benchmark for the evaluation of distributional semantic models of ancient Greek. Digital Scholarship in the Humanities</article-title>
          . (p.
          <fpage>373</fpage>
          -
          <lpage>392</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Yamshchikov</surname>
            ,
            <given-names>I. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tikhonov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pantis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schubert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jost</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>BERT in Plutarch's Shadows</article-title>
          .
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          (p.
          <fpage>6071</fpage>
          -
          <lpage>6080</lpage>
          ). Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>