<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>These authors contributed equally.
carlosnusch@prebi.unlp.edu.ar (C. J. Nusch);
https://prebi-sedici.unlp.edu.ar/personal/carlos-nusch/ (C. J. Nusch)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlos Javier Nusch</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gimena del Rio Riande</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leticia Cecilia Cagnina</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcelo Luis Errecalde</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leandro Antonelli</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAETI, Facultad de Tecnología Informática, Universidad Abierta Interamericana</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CESGI, Comisión de Investigaciones Científicas de la Provincia de Buenos Aires</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>IIBICRIT, Consejo Nacional de Investigaciones Científicas y Técnicas</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>LIDIC, Facultad de Ciencias Físico Matemáticas y Naturales, Universidad Nacional de San Luis</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>LIFIA, Facultad de Informática, Universidad Nacional de La Plata</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>PREBI, SEDICI, Universidad Nacional de La Plata</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>This article extends various automatic text analysis tasks from previous works by applying natural language processing techniques to a corpus of Latin texts from the 1st century BC and 1st century AD. The motivation behind this work is to delve into and understand a historical literary trend revolving around the themes of love, spanning from antiquity through to the medieval period. The analyzed authors include Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, representing the literary movement of the neoterics, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with distinct styles, serving as control samples. Unlike previous works, various corrections were added to the preprocessing tasks, including improved word tokenization with enclitics and handling of orthographic variances. For the clustering tasks, the K-Means method and the Silhouette Score were used to determine the optimal cluster sizes. Using these optimal clusters as labels, decision trees were trained for each range of n-grams, aiming to identify features with the highest Information Gain and Information Gain Ratio. The trees were trained based on the criterion of Entropy, and calculations of Feature Importance were performed. In this study, we focused on detailing the classification results and features extracted by the decision trees, based on the best Silhouette scores obtained and the Information Gain. We examined whether the words or parts of words with classificatory potential identified in the process matched the findings from previous exploratory tasks performed using other techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>Augustan love poets</kwd>
        <kwd>Document Clustering</kwd>
        <kwd>K Means</kwd>
        <kwd>Silhouette Coefficient</kwd>
        <kwd>Decision Trees</kwd>
        <kwd>Feature Importance</kwd>
        <kwd>Information Gain Ratio</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This study<sup>1</sup> builds on a master’s thesis [27] examining C. S. Lewis’ observations [19] on the
influence of courtly love and Occitan literature on 20th-century love imagery. Similarities in
love themes, treatment of the beloved, and political and military terms were found between
Occitan and 1st-century BC Latin poetry. This thesis aims to identify textual patterns linking
ancient love themes to the Religion of Love in medieval Occitan poetry, using a comparative
approach that combines close reading with computational methods [32,
        <xref ref-type="bibr" rid="ref14">23</xref>
        ]. This article evaluates
clustering techniques for differentiating love poems from other Latin poetry, identifying key
lexical features. Previous work [29] explained the techniques, while here we focus on feature
extraction and optimal Silhouette Score values.
      </p>
      <sec id="sec-1-1">
        <title>1.1. State of the Art</title>
        <p>
          Several authors have applied clustering to ancient texts. Bracco et al. [16] used K-means to
detect literary genres in cuneiform texts, and Martins et al. [21] used k-Nearest Neighbors for
author classification. Cantaluppi and Passarotti [9] studied Seneca’s complete works, Cicero’s
orations, Jerome’s Latin New Testament, and Aquinas’ major works. Nagy [25] used
multivariate analysis and clustering to examine rhyme in twelve classical Latin poets, identifying
stylistic differences between genres and authors. In recent work, he applied UMAP and t-SNE
to show stylistic distinctions between Ovid’s Heroides and other works, and the authenticity
of the Epistula Sapphus [
          <xref ref-type="bibr" rid="ref17">26</xref>
          ]. Forstall et al. [
          <xref ref-type="bibr" rid="ref6">14, 15</xref>
          ] compared lexical and rhythmic features at
character and word n-gram levels with other 1st-century BC poets.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Problem Definition and Contributions</title>
        <p>
          The previous work aimed to explore clustering techniques to distinguish love poems from other
types of poetry and identify useful lexical characteristics for classification. The K-means
algorithm [
          <xref ref-type="bibr" rid="ref11">20</xref>
          ] was used, and the optimal number of clusters was determined with the Silhouette
Index [34], which measures group cohesion and separation. Since K-means, based on Euclidean
distance, does not provide detailed feature extraction, decision trees [31] were used to
complement this approach. This combination allowed for the indirect extraction of features, with
metrics such as Importance, Information Gain, and Information Gain Ratio [35] identifying the
most relevant features.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Research Methodology and Approach</title>
      <sec id="sec-2-1">
        <title>2.1. Analysis Corpus and Used Editions</title>
        <p>
          The corpus includes the complete works of Gaius Valerius Catullus [22], Albius Tibullus [24],
and Sextus Propertius [30], representing love poetry, as well as all books from the Aeneid by
Publius Vergilius Maro [17] and Pharsalia by Marcus Annaeus Lucanus [
          <xref ref-type="bibr" rid="ref29">39</xref>
          ] as control samples,
focused on political, historical, and martial themes. The analysis reveals differences in the
number of words and verses per poem among different authors and genres (Table 1). To address
concerns about unbalanced datasets, we used relative frequency and separated the authors
to reduce noise and bias from the larger epic texts. Two datasets were used: one with the
Augustan love poets and Vergil, and another with the Augustan love poets and Lucan.
        </p>
        <p><sup>1</sup> An appendix with key tables is included after the references. A larger dataset is available: Nusch, C. (2024).
Clustering Tasks and Decision Trees with Augustan love poets [Data set]. CHR2024, Aarhus, Denmark. Zenodo.
https://doi.org/10.5281/zenodo.12682694.</p>
        <p>To construct the analysis corpus, resources from the Perseus Project digital library [1, 10]
at Tufts University were used. The library contains 2,412 works in 3,192 editions and
translations (1,639 in Greek, 636 in Latin) and a total of 69.7 million words. The texts, curated by
specialists and shared under a CC BY-SA 3.0 (US) license, are available in XML format.
Additional resources include models for grammatical tagging and stopwords for Latin. The poems
were harvested through web scraping using R, while Python libraries were employed for text
analysis and mining.</p>
        <p>The analysis explored character n-grams (2 to 7) and word n-grams (1 to 5), using the Bag
of Words (BOW) method [36]. Three types of matrices were generated: the first based on
raw frequency (using Scikit-learn’s CountVectorizer), the second on relative frequency (with a
custom function), and the third using the TF-IDF technique [36], which highlights important
words by weighing their frequency relative to their rarity across the dataset. While
CountVectorizer simply counts word occurrences, TF-IDF reduces the impact of common words, giving
more weight to unique terms.</p>
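        <p>The difference between the raw and relative frequency representations can be illustrated with a minimal sketch. It uses only the Python standard library rather than Scikit-learn, the helper names are hypothetical, and the two toy strings are made up for illustration; it is not the custom function used in this study.</p>
        <preformat>
```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams of length n, joined by spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def raw_counts(ngrams):
    """Raw frequency: how many times each n-gram occurs in one document."""
    return Counter(ngrams)


def relative_frequencies(counts):
    """Relative frequency: each count divided by the document's total."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Toy documents (made-up strings, not poems taken from the corpus).
short_doc = "odi et amo"
long_doc = "arma uirumque cano arma arma"

short_counts = raw_counts(word_ngrams(short_doc, 1))
long_counts = raw_counts(word_ngrams(long_doc, 1))
long_rel = relative_frequencies(long_counts)
bigrams = char_ngrams("amo", 2)
```
        </preformat>
        <p>In this sketch raw counts grow with document length, whereas relative frequencies always sum to one per document, which is why the two representations can behave so differently on poems of unequal size.</p>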
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Text Preprocessing Tasks</title>
        <p>Before analysis, the text was cleaned by removing empty lines, sequences of spaces (“\n \n\n\n”),
and editorial symbols for illegible gaps (“†”). Spanish quotation marks were replaced with
English ones for tool compatibility, and punctuation was removed from character n-grams, as
it was added by editors.</p>
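        <p>The cleaning steps above could be sketched roughly as follows. This is an illustrative reconstruction, not the script used in this study: the function name is hypothetical, and it assumes that the Spanish quotation marks are the angle quotes « » and that punctuation means ASCII punctuation.</p>
        <preformat>
```python
import re
import string


def clean_text(raw, strip_punctuation=False):
    """Apply the cleaning steps described above to one poem."""
    text = raw.replace("†", "")                      # editorial gap marker
    text = text.replace("«", '"').replace("»", '"')  # assumed Spanish quotes
    if strip_punctuation:                            # for character n-grams
        text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[ \t]+", " ", text)              # collapse runs of spaces
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)  # drop empty lines


sample = "odi  et amo †\n \n\n\n«quare» id faciam, fortasse requiris"
cleaned = clean_text(sample)
cleaned_chars = clean_text(sample, strip_punctuation=True)
```
        </preformat>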
        <p>
          For stopwords<sup>2</sup>, we used the Stopwords ISO [2] package for Latin, which we preferred over
the Perseus Project version because it retains important words in elegiac poetry, such as ego,
enabling the analysis of personal pronouns, a significant feature noted in previous works [27,
          <xref ref-type="bibr" rid="ref19">28</xref>
          ].
        </p>
        <p>
          <sup>2</sup> For a more detailed discussion of the complexity and variety of stopwords in Latin and other ancient languages,
see A. Berra [4] and P. J. Burns [
          <xref ref-type="bibr" rid="ref1">7</xref>
          ].
        </p>
        <p>
          To enhance tokenization, we added two procedures from The Classical Language Toolkit
(CLTK) [
          <xref ref-type="bibr" rid="ref9">18</xref>
          ]: JVReplacer, to standardize spellings (e.g., Iulius/Julius and uir/vir), and
LatinWordTokenizer, which helped identify enclitics (e.g., -que, -ve) and prevent incorrect tokenization.
        </p>
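        <p>To make the effect of these two steps concrete, the following standard-library sketch imitates the idea only; it does not call CLTK and does not reproduce its behaviour. The function names, the "-que" output convention, and the very short exception list are all assumptions made for illustration; a real tokenizer needs a far longer exception list, and a naive splitter like this one will mis-split words such as quisque.</p>
        <preformat>
```python
JV_TABLE = str.maketrans({"j": "i", "J": "I", "v": "u", "V": "U"})

# Hypothetical, deliberately tiny exception list: words that merely end in
# "que" and must not be split. It is incomplete by design.
NOT_ENCLITIC = {"atque", "neque", "quoque", "usque"}


def normalize_jv(text):
    """Map j to i and v to u so that spelling variants coincide."""
    return text.translate(JV_TABLE)


def split_enclitic(token):
    """Split a trailing -que off a token unless it is a listed exception."""
    lowered = token.lower()
    if lowered in NOT_ENCLITIC:
        return [token]
    if lowered.endswith("que") and len(token) > 4:
        return [token[:-3], "-que"]
    return [token]


def tokenize(text):
    """Normalize spelling, split on whitespace, then split enclitics."""
    tokens = []
    for token in normalize_jv(text).split():
        tokens.extend(split_enclitic(token))
    return tokens


tokens = tokenize("arma virumque cano")
```
        </preformat>
        <p>Without the enclitic step, a form such as uirumque would be counted as a word distinct from uirum, which is the kind of incorrect tokenization the text refers to.</p>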
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation: The Clustering and Decision Trees as combined techniques</title>
      <p>
        As explained previously [29], document clustering was performed using the K-means method
and Silhouette scores to evaluate the best cluster configuration. The optimal number of clusters
(k) was determined by testing k values from 2 to 20. Tests were conducted using fixed ranges of
character n-grams (2 to 7) and word n-grams (1 to 5), with the Silhouette coefficient calculated
for each k. The aim was to find both the best k and the most effective n-gram ranges for
clustering.
      </p>
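        <p>As a reminder of what the Silhouette coefficient measures, the sketch below computes it from scratch for an already labelled set of points, following the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the point's own cluster (cohesion) and b(i) the mean distance to the nearest other cluster (separation). It is a plain-Python illustration with toy two-dimensional points, not the Scikit-learn routine, and it does not run K-means itself; the function names are hypothetical.</p>
        <preformat>
```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient for a labelled set of points."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) == 1:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:                 # singleton cluster: s(i) is defined as 0
            scores.append(0.0)
            continue
        # a(i): cohesion, mean distance to the other members of i's cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b(i): separation, mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # matches the geometry
bad_score = silhouette_score(points, [0, 1, 0, 1])   # cuts across it
```
        </preformat>
        <p>In practice one would compute this score for each candidate k and keep the k with the highest mean. Note that a point alone in its cluster contributes 0, so a reasonably high mean score can coexist with a partition that isolates a single document.</p>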
      <p>Once the data was labeled, decision trees were trained using the entropy criterion to assess
feature importance, with Information Gain (IG) and Information Gain Ratio (IGR) calculated.</p>
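        <p>The two metrics can be illustrated with a short worked sketch for a single binary split of the form "count greater than threshold". It assumes the usual definitions (Information Gain as the reduction in Shannon entropy, and Gain Ratio as Information Gain divided by the split information); the function names, feature counts, and cluster labels are invented for the example, and tree-based Feature Importance is not computed here.</p>
        <preformat>
```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def split_metrics(values, labels, threshold):
    """Information Gain and Gain Ratio for the split value > threshold."""
    below = [lab for v, lab in zip(values, labels) if not v > threshold]
    above = [lab for v, lab in zip(values, labels) if v > threshold]
    total = len(labels)
    parts = [part for part in (below, above) if part]
    # Weighted entropy of the child nodes after the split.
    children = sum(len(part) / total * entropy(part) for part in parts)
    gain = entropy(labels) - children
    # Split information: entropy of the partition sizes themselves.
    split_info = -sum(len(part) / total * math.log2(len(part) / total)
                      for part in parts)
    ratio = gain / split_info if split_info else 0.0
    return gain, ratio


# Hypothetical counts of two features in six documents with cluster labels.
labels = ["love", "love", "love", "epic", "epic", "epic"]
separating = [0, 0, 1, 7, 9, 12]      # high only in the "epic" cluster
uninformative = [0, 5, 1, 0, 6, 2]    # unrelated to the labels

gain_good, ratio_good = split_metrics(separating, labels, threshold=3)
gain_bad, ratio_bad = split_metrics(uninformative, labels, threshold=3)
```
        </preformat>
        <p>A feature that separates the clusters perfectly reaches the maximum gain for two balanced classes (1 bit), while a feature unrelated to the labels yields a gain of zero.</p>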
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary or Intermediate Results</title>
      <p>Better Silhouette Scores were achieved using the raw frequency matrix with simple stopwords
filtering (CountVectorizer with Stopwords), while the Relative Frequency and TF-IDF Matrices
showed lower scores. TF-IDF scores were close to zero, indicating poor cluster separation, and
Relative Frequency values were around 0.5 for both datasets (Tables 2 and 3). The use of relative
frequency significantly impacted the optimal number of clusters recommended by the K-means
algorithm.</p>
      <p>The new tokenization and normalization process using CLTK modules had a noticeable
impact. Regarding the most critical features for classifying clusters, results suggest that the
methodology and resources should be reevaluated. In previous work, high Silhouette scores
were observed with the frequency table, but feature importance metrics showed an uneven
distribution, with one or two attributes dominating. While TF-IDF identified more features, the
low Silhouette scores indicated poor classification. Despite better Silhouette scores, the same
issue occurred with the relative frequency matrix.</p>
      <sec id="sec-4-1">
        <title>4.1. Feature Extraction at Character N-Grams Level</title>
        <p>As shown in Figure 1, the clustering task succeeded in separating Augustan love poets from
epic poets. Tables 4 and 5 show the features extracted at this n-gram level.</p>
        <p>In the case of relative frequency, after excluding Lucan, the best Silhouette Score at the
character n-gram level was achieved with 4-grams. However, despite obtaining a relatively good
score (0.53), the resulting classification did not meet expectations (Figure 2). The algorithm
constructed two clusters, one of which contained only Carmen 94. A similar phenomenon
occurred when excluding Vergil, but with the distinction that the isolated Carmen was 112.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature Extraction at Word N-Grams Level</title>
        <p>At the word n-gram level, similar to the character n-gram level, the best classification method
was the raw frequency matrix. Although the relative frequency matrix also yielded good
Silhouette scores, it consistently produced poor classifications, isolating only Carmen 94 from
the rest. At other n-gram levels, the carmina that were separated included Carmina 14, 82, 85,
and 106. All of these are relatively short, suggesting that the difference in length among the
poems introduces internal variability in the corpus that hinders classification based on relative
frequencies. The same task, when performed using raw frequencies, yielded excellent results,
whether Lucan or Vergil was excluded from the analysis (Figure 3 and Tables 6 and 7).</p>
        <p>In the following table and figure, it can be observed that the use of relative frequency brings
forth personal pronouns, terms previously associated with love poetry in earlier studies.
However, these should be disregarded when obtained through this methodology, as the
classifications achieved with them were quite poor, as can be seen in Table 8 and Figure 4.</p>
        <p>Both datasets, whether excluding Lucan or Vergil, showed identical performance in author
classification. Character n-grams (2 to 6) and single-word n-grams effectively separated epic
authors from Augustan love poets using the raw frequency matrix, with Silhouette Scores above
0.83. Lower scores led to suboptimal classifications, where one cluster contained only a single
book (e.g., Book X or XII of the Aeneid or Book IX of Pharsalia).</p>
        <p>Assessing the relevance of specific character n-grams for classification remains challenging,
requiring a more detailed stylistic investigation. In summary, document grouping was
effective, though feature-level techniques did not always highlight typical elegiac terms. The terms
extracted via decision trees for Augustan love poets and Vergil predominantly reflected epic,
mythical, and martial language<sup>4</sup>.</p>
        <p><sup>3</sup> For more details on the extracted terms, see Appendix A and B.</p>
        <p><sup>4</sup> Please note that with the English quotation marks we have attempted to indicate the spaces before or after the
words, in cases where it corresponds to the character n-gram.</p>
        <p>4.3. Data from the Corpora of Catullus, Tibullus, Propertius, and Vergil:</p>
        <list list-type="bullet">
          <list-item><p>5-character n-grams: ‘ sub ’ (low), eucri and eucru (part of Teucri), fatus (spoke), fatur
(speaks), auras (breezes), eneas (Aeneas)</p></list-item>
          <list-item><p>6-character n-grams: teucru (Trojan), aeneas (Aeneas), ‘fatur ’ (speaks), ‘fatus ’ (spoke),
‘fatis ’ (fates), ipoten and mnipot (from omnipotens, presumably attributed to Jupiter),
clamor (shout)</p></list-item>
          <list-item><p>1-word n-grams: urbem (city), aeneas (Aeneas), teucrum (Trojan), ingentem (huge),
omnipotens (almighty), aether (ether/sky), pius (pious), iamque (and now), socius
(ally), clamore (shout), finis (end), fatis (fates), ignem (fire), auris (from auris, ear, or aurum, gold),
caelum (sky), genitor (father), hostis (enemy), terram (land), bellum (war), dux (leader),
uisus (vision).</p></list-item>
        </list>
        <p>A similar phenomenon occurs with the terms obtained from the grouping of the
Augustan love poets with Lucan, where words referring to the political causes of the Civil War
predominate, emphasizing the crimes committed by the different factions and the physical
consequences on the bodies of the Roman soldiers and citizens [38]:</p>
        <p>4.4. Data from the Corpora of Catullus, Tibullus, Propertius, and Lucan:</p>
        <list list-type="bullet">
          <list-item><p>5-character n-grams: aesar (from Caesar), aussa (cause), scera (part of viscera, viscera),
pulos (from populos, peoples), elero (part of scelero, referring to crimes), coeli (of ’ coel’,
from heaven), libye (Libya), gladi (part of gladium, sword), adhu and adhuc (from until
now)</p></list-item>
          <list-item><p>6-character n-grams: ‘osque ’ (composed most likely by the accusative plural ending of
the second declension combined with the enclitic -que), pulos and opulos (peoples), libye
(Libya), scera and iscera (from viscera, entrails), bellor and elloru (as part of bellorum, of
the wars), elerum, ‘lerum’ (from scelerum, crime), caussa and ‘ causs’ (cause), ‘ gladi’ (as
part of gladium in its different forms, the sword), ‘ coeli’ (sky).</p></list-item>
          <list-item><p>1-word n-grams: pectora (chests), populos (peoples), scelerum (crimes), bellorum (wars),
senatus (senate), ciuilia (civil), nocentes (from nocens, guilty or harmful), caussa (cause),
fatis (fates), coelo (sky), mundus (world), caeli (sky), diui (gods), libye (Libya).</p></list-item>
        </list>
        <p>In previous work, we found that the Silhouette method consistently recommended two
clusters for 2-character n-grams and three clusters for 1-word n-grams, regardless of the technique
used. Scatter plots aligned with the stylistic distribution reported by Forstall et al. [15], who
used SVM to analyze Catullus’ influence on Paul the Deacon’s poetry (Figure 5).</p>
        <p>However, in this instance, whether due to the new preprocessing corrections or the novel
comparison methods employed in separating Vergil and Lucan, the clustering tasks were not
always accurate. While some correct distributions of the authors can be detected in the scatter
plot space, the algorithm’s non-human interpretation results in an unclear cluster classification,
grouping Tibullus, Propertius, and Vergil against Catullus (Figure 6).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Learned Lessons</title>
      <p>This article highlights the need to reevaluate methodologies and resources. Positive clustering
results were obtained, especially with raw frequencies and n-grams, with Silhouette Scores
above 0.8. Preprocessing steps and CLTK modules for tokenization and normalization
significantly impacted the results, emphasizing the importance of tailored tools for ancient texts.</p>
      <p>Relative frequency and two datasets reduced noise and bias from the epic authors, aiming
to balance the text sets. However, even with more balanced data, raw frequencies provided
better clustering results than relative frequency and TF-IDF matrices. As in the previous study,
uneven feature importance and variable performance across n-gram levels and matrices suggest
further refinement is needed for consistent results.</p>
      <p>N-grams from Vergil’s and Lucan’s works show a dominance of political, historical, and
war-related terms. This suggests that, despite efforts to balance the datasets, the lexical
characteristics of epic poetry still influence classification. The identified n-grams reflect the epic
and mythical focus of these authors, contrasting with the love and personal themes of Catullus,
Tibullus, and Propertius. The variability in document length (regular in Pharsalia and Aeneid,
but variable in the Augustan love poets) affects results. Additionally, the internal variability
among the Augustan love poets’ corpora also affects classification. We could experiment by
partitioning Catullus’ work into polymetric poems, carmina maiora, and epigrams or elegiac
couplets, and run separate analyses, or intervene in the corpus catullianum by removing
non-amorous poems. However, this is complex, as thematic boundaries are not clear-cut.</p>
      <p>This exploratory analysis requires further refinement through other techniques, such as variable
ranges of character and word n-grams (only fixed ranges were used in this study), other
similarity measures such as Jaccard, Cosine, or Soft Cosine, or clustering methods like Gaussian
Mixture Models, DBSCAN, or hierarchical clustering. Future research could apply
normalization techniques such as L1 or Z-score scaling, and examine phenomena like collocations and co-occurrences,
which were not applied in this study. A close reading of clusters based on relative frequency
also offers promise.</p>
      <p>As for the representation of the documents, there is a need to explore techniques with
embeddings like those developed by Burns et al., Bamman et al., and Johnson et al. [18, 8, 3, 6].</p>
      <p>It should also be noted that the terms obtained by the Decision Tree technique are words
with classification power for that dataset, not necessarily the most typical of one type of
poetry or another, as there may be important words for both genres penalized by the metrics of
Importance, Information Gain, or TF-IDF. The unequal size of the poems also contributed to
the clarity of classification in raw counts, indirectly transferring poem length as a classification
criterion. Similarly, in decision trees, the feature split points reflected the same pattern, with
epic poem features having much higher frequencies, clearly impacting the results.</p>
      <p>
        Finally, it is important to briefly consider the implications of applying computational and
Distant Reading techniques alongside hypotheses or educated guesses from Close Reading.
Frequency counting, for instance, is used here to model documents, but humans do not speak to be
counted. Otherwise, Catullus would simply have repeated the name Lesbia, and his love would
have been understood without the effort of creating poetry. Fortunately, language is far more
abstract and complex, and computational methods are only beginning to reveal its intricacies.
This issue has resurfaced with criticisms, such as those by Noam Chomsky, against generative
models [11]. It is true that the human mind performs language tasks in a highly elegant manner
and acquires a language from far less data than that handled by Large
Language Models (LLMs). LLMs are tools developed for other tasks that did not originally seek
to emulate the human mind [
        <xref ref-type="bibr" rid="ref27">13, 37</xref>
        ]. But it is also true that one must yield to the evidence
of the successful results obtained with the use of these techniques and their undeniable
capacity to facilitate all kinds of tasks. It is essential to acknowledge both the limits and strengths
of computational tools, recognizing that Distant Reading offers a different scale of analysis,
rooted not just in methodology but in changes in how information is produced, accessed, and
analyzed in the digital age [33]. Despite criticisms [5], Digital Humanities methodologies hold
great promise for studying language-rich subjects that balance aesthetic and rhythmic elements
like refrains, alliterations, and anaphoras, presenting a unique challenge for modern analytical
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>I sincerely thank Dr. Kyle P. Johnson, Director of AI at Morgan, Lewis and Bockius LLP, and
Dr. Patrick J. Burns from the Institute for the Study of the Ancient World, NYU, for their kind
and insightful responses to my inquiries on tokenization and the use of CLTK and LatinCy
libraries. I also extend my gratitude to Professor Benjamin Nagy from the Institute of the Polish
Language, Polish Academy of Sciences (IJP PAN), Krakow, for his expert advice on correcting
verse counts based on authorized editions, which greatly enhanced the accuracy of this text
analysis.</p>
      <p>[1] [No author]. Perseus Digital Library Homepage. [No date]. url: https://www.perseus.tufts.edu/hopper/.</p>
      <p>[2] [No author]. Stopwords ISO. [No date]. url: https://github.com/stopwords-iso/stopwords-iso/blob/master/README.md.</p>
      <p>[3] D. Bamman and P. J. Burns. Latin BERT: A Contextual Language Model for Classical
Philology. 2020. doi: 10.48550/arXiv.2009.10053. url: http://arxiv.org/abs/2009.10053.</p>
      <p>[4] A. Berra. Ancient Greek and Latin Stopwords. 2024. url: https://github.com/aurelberra/stopwords.</p>
      <p>[5] T. Brennan. The Digital-Humanities Bust. 2017. url: https://www.chronicle.com/article/the-digital-humanities-bust/.</p>
      <p>[6] P. J. Burns. “Building a Text Analysis Pipeline for Classical Languages”. In: Building a
Text Analysis Pipeline for Classical Languages. De Gruyter Saur, 2019, pp. 159–176. doi:
10.1515/9783110599572-010. url: https://www.degruyter.com/document/doi/10.1515/9783110599572-010/html.</p>
      <p>[8] P. J. Burns. LatinCy: Synthetic Trained Pipelines for Latin NLP. 2023. doi: 10.48550/arXiv.2305.04365. url: http://arxiv.org/abs/2305.04365.</p>
      <p>of Digital Humanities”. In: Journal of Digital Humanities (2013), pp. 478–479. url: https://journalofdigitalhumanities.org/3-1/modelling-the-interpretation-of-literary-allusionwith-machine-learning-techniques/.</p>
      <p>2021, pp. 20–29. doi: 10.18653/v1/2021.acl-demo.3. url: https://aclanthology.org/2021.acl-demo.3.</p>
      <p>[34] P. J. Rousseeuw. “Silhouettes: A graphical aid to the interpretation and validation of
cluster analysis”. In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65.
doi: 10.1016/0377-0427(87)90125-7. url: https://www.sciencedirect.com/science/article/pii/0377042787901257.</p>
      <p>Appendix: Extra Data from the Corpora of Catullus, Tibullus, Propertius, Vergilius and
Lucanus</p>
      <p>Raw Frequency char (2, 2) n-grams. Columns: Feature, Importance; Feature, IG; Feature, IGR. Importance: un (1).</p>
      <p>IGR: iuom, ucru, temq, ubib, boru, nimb, m ef, efa, fat, ast, cios, lumq, ipot, mman, mnip,
nipo, bibu, sumq, adfa, anid; eucru, borum, ntemq, fatur, iuom, imman, ucrum, eucri, e teu,
temqu, fatus, cios, clamo, anch, efa, lamor, ocios, efat, undam, m ef.</p>
      <p>Raw Frequency char (2, 2) n-grams. Columns: Feature, Importance; Feature, IG; Feature, IGR. Importance: g (1).</p>
      <p>Raw Frequency char (3, 3) n-grams. Columns: Feature, Importance; Feature, IG; Feature, IGR. Importance: te (1).
Further features: bye, oer, dhu, ye, rct, b p, giq, gme, emt, bmo, eer, xir, ax, obo, nny, al, mto,
bif, efo, gad.</p>
      <p>Raw Frequency char (4, 4) n-grams. Columns: Feature, Importance; Feature, IG; Feature, IGR. Importance: ra n (1).</p>
      <p>Raw Frequency char (5, 5) n-grams. Columns: Feature, Importance; Feature, IG; Feature, IGR. Importance: aesar (1).</p>
      <p>Raw Frequency char (6, 6) n-grams. Columns: Feature, Importance; Feature, IG; Feature, IGR. Importance: osque (1).</p>
      <p>IGR: pectora.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Burns</surname>
          </string-name>
          . “
          <article-title>Constructing Stoplists for Historical Languages</article-title>
          ”. In:
          <source>Digital Classics Online</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>4</fpage>
          -
          <lpage>20</lpage>
          . doi: 10.11588/dco.2018.2.52124. url: https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cantaluppi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Passarotti</surname>
          </string-name>
          . “
          <article-title>Clustering the Corpus of Seneca: A Lexical-Based Approach”</article-title>
          .
          <source>In: Advances in Latent Variables: Methods, Models and Applications</source>
          . Ed. by
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpita</surname>
          </string-name>
          , E. Brentari, and
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Qannari</surname>
          </string-name>
          . Cham: Springer International Publishing,
          <year>2015</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>25</lpage>
          . doi: 10.1007/10104_2014_6. url: https://doi.org/10.1007/10104_2014_6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Cerrato</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Chavez</surname>
          </string-name>
          .
          <article-title>Perseus Classics Collection: An Overview</article-title>
          . [No date]. url: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:
          <year>1999</year>
          .
          <volume>04</volume>
          .
          <volume>00 5.3</volume>
          [11] [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chomsky</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Roberts</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Watumull</surname>
          </string-name>
          . “
          <article-title>Noam Chomsky: The False Promise of ChatGPT</article-title>
          ”. In:
          <source>The New York Times</source>
          (
          <year>2023</year>
          ). url: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <year>2019</year>
          . doi: 10.48550/arXiv.1810.04805. url: http://arxiv.org/abs/1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Forstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Jacobson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          . “
          <article-title>Evidence of intertextuality: investigating Paul the Deacon's Angustae Vitae</article-title>
          ”. In:
          <source>Literary and Linguistic Computing</source>
          <volume>26</volume>
          .
          <issue>3</issue>
          (
          <year>2011</year>
          ), pp.
          <fpage>285</fpage>
          -
          <lpage>296</lpage>
          . doi: 10.1093/llc/fqr029. url: https://doi.org/10.1093/llc/fqr029.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Forstall</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          . “
          <article-title>A Statistical Stylistic Study of Latin Elegiac Couplets</article-title>
          ”. In:
          <source>2010 Chicago Colloquium on Digital Humanities and Computer Science</source>
          .
          <year>2010</year>
          , [No pages]. url: https://www.semanticscholar.org/paper/A-Statistical-Stylistic-Study-of-Latin-Elegiac-Forstall-Scheirer/e3caac9ec4ee16baac70ed94808dca57dff48a2d.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Bracco</surname>
          </string-name>
          , Silvio Migliori, Giorgio Mencuccini, Daniela Alderuccio, and Giovanni Ponti. “
          <article-title>Data mining tools and GRID infrastructure for Assyriology text analysis (an Old-Babylonian situation studied through text analysis and data mining tools)</article-title>
          ”. In:
          <source>RAI - Rencontre Assyriologique Internationale - Private and State in the Ancient Near East</source>
          . Belgium,
          <year>2013</year>
          , [No pages].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Greenough</surname>
          </string-name>
          . The Bucolics, Æneid, and Georgics of Virgil. Boston: Ginn,
          <year>1900</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Besnier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. J. B.</given-names>
            <surname>Mattingly</surname>
          </string-name>
          . “
          <article-title>The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages</article-title>
          ”. In:
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations</source>
          . Ed. by
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Xia</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <article-title>La alegoría del amor: un estudio sobre tradición medieval</article-title>
          . 2015th ed. Madrid: Encuentro,
          <year>1936</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          . “
          <article-title>Least squares quantization in PCM</article-title>
          ”. In:
          <source>IEEE Transactions on Information Theory</source>
          <volume>28</volume>
          .
          <issue>2</issue>
          (
          <year>1982</year>
          ), pp.
          <fpage>129</fpage>
          -
          <lpage>137</lpage>
          . doi: 10.1109/tit.1982.1056489. url: https://ieeexplore.ieee.org/document/1056489.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grácio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Teixeira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Pimenta</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. G.</given-names>
            <surname>Zapata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          . “
          <article-title>Historia Augusta authorship: an approach based on Measurements of Complex Networks</article-title>
          ”. In:
          <source>Applied Network Science</source>
          <volume>6</volume>
          .
          <issue>1</issue>
          (
          <year>2021</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          . doi: 10.1007/s41109-021-00390-7. url: https://appliednetsci.springeropen.com/articles/10.1007/s41109-021-00390-7.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Merrill</surname>
          </string-name>
          .
          <article-title>Catullus; edited by Elmer Truesdell Merrill</article-title>
          . Boston: Ginn,
          <year>1893</year>
          . url: http://archive.org/details/catulluseditedby00catuuoft.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Moretti</surname>
          </string-name>
          . Distant Reading. Verso,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Sex. Propertii Elegiae</article-title>
          . Leipzig: Teubner,
          <year>1898</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nagy</surname>
          </string-name>
          . “
          <article-title>Rhyme in classical Latin poetry: Stylistic or stochastic?</article-title>
          ” In:
          <source>Digital Scholarship in the Humanities</source>
          <volume>37</volume>
          .
          <issue>4</issue>
          (
          <year>2022</year>
          ), pp.
          <fpage>1097</fpage>
          -
          <lpage>1118</lpage>
          . doi: 10.1093/llc/fqab105. url: https://doi.org/10.1093/llc/fqab105.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nagy</surname>
          </string-name>
          . “
          <article-title>Some stylometric remarks on Ovid's Heroides and the Epistula Sapphus</article-title>
          ”. In:
          <source>Digital Scholarship in the Humanities</source>
          <volume>38</volume>
          .
          <issue>3</issue>
          (
          <year>2023</year>
          ), pp.
          <fpage>1183</fpage>
          -
          <lpage>1199</lpage>
          . doi: 10.1093/llc/fqac098. url: https://doi.org/10.1093/llc/fqac098.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Nusch</surname>
          </string-name>
          . “
          <article-title>Las Edades del Amor: una propuesta para el proyecto Aetates Amoris destinado a la poesía amorosa</article-title>
          ”.
          <source>Tesis. Universidad Nacional de Educación a Distancia</source>
          , España,
          <year>2021</year>
          . doi: 10.35537/10915/125629. url: http://sedici.unlp.edu.ar/handle/10915/125629.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Nusch</surname>
          </string-name>
          . “
          <article-title>Una breve exploración de la terminología amorosa en los corpora catullianum, tibullianum y propertianum con métodos y herramientas computacionales: etiquetado gramatical, lemas, bigramas y co-apariciones</article-title>
          ”. In:
          <source>Revista de Humanidades Digitales</source>
          <volume>9</volume>
          (
          <year>2024</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          . doi: 10.5944/rhd.vol.9.2024.38680. url: https://revistas.uned.es/index.php/RHD/article/view/38680.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Nusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. del Rio</given-names>
            <surname>Riande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C. C.</given-names>
            <surname>Cagnina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Errecalde</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Antonelli</surname>
          </string-name>
          . “
          <article-title>Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets</article-title>
          ”. In:
          <source>Decisioning</source>
          . Pereira, Colombia,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Postgate</surname>
          </string-name>
          .
          <article-title>Tibulli aliorumque carminum libri tres</article-title>
          .
          <source>Oxford: Scriptorum classicorum bibliotheca Oxoniensis</source>
          ,
          <year>1915</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          . “
          <article-title>Induction of decision trees</article-title>
          ”. In:
          <source>Machine Learning</source>
          <volume>1</volume>
          .
          <issue>1</issue>
          (
          <year>1986</year>
          ), pp.
          <fpage>81</fpage>
          -
          <lpage>106</lpage>
          . doi: 10.1007/bf00116251. url: https://doi.org/10.1007/BF00116251.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramsay</surname>
          </string-name>
          .
          <article-title>Reading Machines: Toward an Algorithmic Criticism</article-title>
          . Urbana,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Pimenta</surname>
          </string-name>
          . “
          <article-title>De Narciso ao mundo-imagem: por uma urgência de uma perspectiva crítica sobre a cena informacional contemporânea</article-title>
          ”. In:
          <source>Ciência da Informação: sociedade, crítica e inovação</source>
          . Rio de Janeiro,
          <year>2022</year>
          , [No pages].
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          . “
          <article-title>A mathematical theory of communication</article-title>
          ”. In:
          <source>The Bell System Technical Journal</source>
          <volume>27</volume>
          .
          <issue>3</issue>
          (
          <year>1948</year>
          ), pp.
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          . doi: 10.1002/j.1538-7305.1948.tb01338.x. url: https://ieeexplore.ieee.org/document/6773024.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>K. Spärck</given-names>
            <surname>Jones</surname>
          </string-name>
          . “
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          ”. In:
          <source>Journal of Documentation</source>
          <volume>28</volume>
          .
          <issue>1</issue>
          (
          <year>1972</year>
          ), pp.
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          . doi: 10.1108/eb026526. url: https://doi.org/10.1108/eb026526.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          . “
          <article-title>Attention is All you Need</article-title>
          ”. In:
          <source>Advances in Neural Information Processing Systems</source>
          . Vol.
          <volume>30</volume>
          . Curran Associates, Inc.,
          <year>2017</year>
          , [No pages]. url: https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Vizzotti</surname>
          </string-name>
          . “
          <article-title>De la tragedia de Séneca a la épica de Lucano: estrategias de representación de los paradigmas filosóficos y literarios</article-title>
          ”.
          <source>Tesis. Universidad Nacional de La Plata</source>
          ,
          <year>2014</year>
          . doi: 10.35537/10915/34410. url: http://sedici.unlp.edu.ar/handle/10915/34410.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Weise</surname>
          </string-name>
          .
          <article-title>Pharsaliae Libri X. M. Annaeus Lucanus</article-title>
          . Leipzig: G. Bassus,
          <year>1935</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>