Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction

Carlos Javier Nusch, Gimena del Rio Riande, Leticia Cecilia Cagnina, Marcelo Luis Errecalde, Leandro Antonelli

0 CAETI, Facultad de Tecnología Informática, Universidad Abierta Interamericana, Argentina
1 CESGI, Comisión de Investigaciones Científicas de la Provincia de Buenos Aires, Argentina
2 IIBICRIT, Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina
3 LIDIC, Facultad de Ciencias Físico Matemáticas y Naturales, Universidad Nacional de San Luis, Argentina
4 LIFIA, Facultad de Informática, Universidad Nacional de La Plata, Argentina
5 PREBI, SEDICI, Universidad Nacional de La Plata, Argentina

2024

These authors contributed equally.
carlosnusch@prebi.unlp.edu.ar (C. J. Nusch)
https://prebi-sedici.unlp.edu.ar/personal/carlos-nusch/ (C. J. Nusch)

This article extends various automatic text analysis tasks from previous works by applying natural language processing techniques to a corpus of Latin texts from the 1st century BC and 1st century AD. The motivation behind this work is to delve into and understand a historical literary trend revolving around the themes of love, spanning from antiquity through to the medieval period. The analyzed authors include Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, representing the literary movement of the neoterics, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with distinct styles, serving as control samples. Unlike previous works, various corrections were added to the preprocessing tasks, including improved word tokenization with enclitics and handling of orthographic variances. For the clustering tasks, the K-Means method and the Silhouette Score were used to determine the optimal cluster sizes. Using these optimal clusters as labels, decision trees were trained for each range of n-grams, aiming to identify features with the highest Information Gain and Information Gain Ratio. The trees were trained based on the criterion of Entropy, and calculations of Feature Importance were performed. In this study, we focused on detailing the classification results and features extracted by the decision trees, based on the best Silhouette scores obtained and the Information Gain. We examined whether the words or parts of words with classificatory potential identified in the process matched the findings from previous exploratory tasks performed using other techniques.

Keywords: Augustan love poets, Document Clustering, K-Means, Silhouette Coefficient, Decision Trees, Feature Importance, Information Gain Ratio

1. Introduction

This study¹ builds on a master’s thesis [27] examining C. S. Lewis’ observations [19] on the influence of Courtly love and Occitan literature on 20th-century love imagery. Similarities in love themes, treatment of the beloved, and political and military terms were found between Occitan and 1st-century BC Latin poetry. This thesis aims to identify textual patterns linking ancient love themes to the Religion of Love in medieval Occitan poetry, using a comparative approach that combines close reading with computational methods [32, 23]. This article evaluates clustering techniques for differentiating love poems from other Latin poetry, identifying key lexical features. Previous work [29] explained the techniques, while here we focus on feature extraction and optimal Silhouette Score values.

1.1. State of the Art

Several authors have applied clustering to ancient texts. Bracco et al. [16] used K-means to detect literary genres in cuneiform texts, and Martins et al. [21] used k-Nearest Neighbors for author classification. Cantaluppi and Passarotti [9] studied Seneca’s complete works, Cicero’s orations, Jerome’s Latin New Testament, and Aquinas’ major works. Nagy [25] used multivariate analysis and clustering to examine rhyme in twelve classical Latin poets, identifying stylistic differences between genres and authors. In recent work, he applied UMAP and t-SNE to show stylistic distinctions between Ovid’s Heroides and other works, and the authenticity of the Epistula Sapphus [26]. Forstall et al. [14, 15] compared lexical and rhythmic features at character and word n-gram levels with other 1st-century BC poets.

1.2. Problem Definition and Contributions

The previous work aimed to explore clustering techniques to distinguish love poems from other types of poetry and identify useful lexical characteristics for classification. The K-means algorithm [20] was used, and the optimal number of clusters was determined with the Silhouette Index [34], which measures group cohesion and separation. Since K-means, based on Euclidean distance, does not provide detailed feature extraction, decision trees [31] were used to complement this approach. This combination allowed for the indirect extraction of features, with metrics such as Importance, Information Gain, and Information Gain Ratio [35] identifying the most relevant features.
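For reference, the metrics named above can be written out as follows. These are the standard textbook definitions rather than formulas quoted from this study, and the notation is an assumption introduced here: d is the distance between two documents, C_I is the cluster containing document i, S is the set of labeled documents at a tree node, S_v are the subsets produced by a split on feature A, and H is Shannon entropy.

```latex
% Silhouette coefficient of document i (cohesion a, separation b)
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \neq i} d(i, j), \qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}

% Entropy, Information Gain, and Information Gain Ratio of a split on A
H(S) = -\sum_{c} p_c \log_2 p_c, \qquad
IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v), \qquad
IGR(S, A) = \frac{IG(S, A)}{-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}
```

Values of s(i) close to 1 indicate a document that sits well inside its own cluster and far from the nearest other cluster; values near 0 indicate overlapping clusters.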

2. Research Methodology and Approach

2.1. Analysis Corpus and Used Editions

The corpus includes the complete works of Gaius Valerius Catullus [22], Albius Tibullus [24], and Sextus Propertius [30], representing love poetry, as well as all books from the Aeneid by Publius Vergilius Maro [17] and Pharsalia by Marcus Annaeus Lucanus [39] as control samples, focused on political, historical, and martial themes. The analysis reveals differences in the number of words and verses per poem among different authors and genres (Table 1). To address concerns about unbalanced datasets, we used relative frequency and separated the authors to reduce noise and bias from the larger epic texts. Two datasets were used: one with the Augustan love poets and Vergil, and another with the Augustan love poets and Lucan.

¹ An appendix with key tables is included after the references. A larger dataset is available: Nusch, C. (2024). Clustering Tasks and Decision Trees with Augustan love poets [Data set]. CHR2024, Aarhus, Denmark. Zenodo. https://doi.org/10.5281/zenodo.12682694.

To construct the analysis corpus, resources from the Perseus Project digital library [1, 10] at Tufts University were used. The library contains 2,412 works in 3,192 editions and translations (1,639 in Greek, 636 in Latin) and a total of 69.7 million words. The texts, curated by specialists and shared under a CC BY-SA 3.0 (US) license, are available in XML format. Additional resources include models for grammatical tagging and stopwords for Latin. The poems were harvested through web scraping using R, while Python libraries were employed for text analysis and mining.

The analysis explored character n-grams (2 to 7) and word n-grams (1 to 5), using the Bag of Words (BOW) method [36]. Three types of matrices were generated: the first based on raw frequency (using Scikit-learn’s CountVectorizer), the second on relative frequency (with a custom function), and the third using the TF-IDF technique [36], which highlights important words by weighing their frequency relative to their rarity across the dataset. While CountVectorizer simply counts word occurrences, TF-IDF reduces the impact of common words, giving more weight to unique terms.
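The three representations described above can be illustrated with a minimal, standard-library sketch. It is not the code used in the study: the function names, the toy documents, and the plain tf * log(N / df) weighting are assumptions made for illustration, and the custom relative-frequency function mentioned in the text may differ in detail.

```python
from collections import Counter
import math

def char_ngrams(text, n):
    """All overlapping character n-grams of `text` (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All overlapping word n-grams, joined with a single space."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def raw_counts(docs, extract):
    """Raw frequency: one Counter of n-gram counts per document."""
    return [Counter(extract(d)) for d in docs]

def relative_frequency(counts):
    """Divide each count by the document's total number of n-grams."""
    rel = []
    for c in counts:
        total = sum(c.values())
        rel.append({g: v / total for g, v in c.items()} if total else {})
    return rel

def tf_idf(counts):
    """Textbook TF-IDF, tf * log(N / df). Scikit-learn's TfidfVectorizer
    applies a smoothed, normalised variant, so its values would differ."""
    n_docs = len(counts)
    df = Counter(g for c in counts for g in c)
    return [{g: v * math.log(n_docs / df[g]) for g, v in c.items()}
            for c in counts]

# Toy documents (hypothetical, not taken from the analysis corpus).
docs = ["arma uirumque cano", "uiuamus mea lesbia atque amemus"]
counts = raw_counts(docs, lambda d: word_ngrams(d, 1))
rel = relative_frequency(counts)
weights = tf_idf(counts)
```

Relative frequencies sum to 1 within each document, which removes document length from the representation, while raw counts keep it.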

2.2. Text Preprocessing Tasks

Before analysis, the text was cleaned by removing empty lines, sequences of spaces (“\n \n\n\n”), and editorial symbols for illegible gaps (“†”). Spanish quotation marks were replaced with English ones for tool compatibility, and punctuation was removed from character n-grams, as it was added by editors.

For stopwords², we used the Stopwords ISO [2] package for Latin, which we preferred over the Perseus Project version because it retains important words in elegiac poetry, such as ego, enabling the analysis of personal pronouns, a significant feature noted in previous works [27, 28].

² For a more detailed discussion of the complexity and variety of stopwords in Latin and other ancient languages, see A. Berra [4] and P. J. Burns [7].

To enhance tokenization, we added two procedures from The Classical Language Toolkit (CLTK) [18]: JVReplacer, to standardize spellings (e.g., Iulius/Julius and uir/vir), and LatinWordTokenizer, which helped identify enclitics (e.g., -que, -ve) and prevent incorrect tokenization.
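The preprocessing steps in this section can be sketched as follows. This is a simplified stand-in written with the standard library only; it does not call CLTK and does not reproduce the behaviour of JVReplacer or LatinWordTokenizer. It assumes normalization toward i/u (consistent with spellings such as ciuilia and uisus in the extracted terms), handles only the enclitic -que, and relies on a small, hypothetical exception list. After u/v normalization, -ve cannot be told apart from ordinary endings by suffix alone, so it is left out.

```python
import re
import string

# Hypothetical, deliberately short list of words that end in "que"
# without carrying the enclitic. A real tokenizer needs a far longer one.
NOT_ENCLITIC = {"atque", "neque", "quoque", "usque", "quisque", "denique"}

JV = str.maketrans("jv", "iu")

def clean(text):
    """Drop the dagger, replace guillemets, and collapse blank runs."""
    text = text.replace("†", "")
    text = text.replace("«", '"').replace("»", '"')
    text = re.sub(r"\n\s*\n+", "\n", text)    # runs of empty lines
    return re.sub(r"[ \t]+", " ", text).strip()

def normalize_jv(word):
    """Map j -> i and v -> u in a lower-cased word, e.g. 'vir' -> 'uir'."""
    return word.translate(JV)

def split_enclitic(word):
    """Naively split a trailing -que into its own token."""
    if word not in NOT_ENCLITIC and word.endswith("que") and len(word) > 4:
        return [word[:-3], "-que"]
    return [word]

def tokenize(text):
    """Clean, lower-case, strip punctuation, normalise, split enclitics."""
    tokens = []
    for raw in clean(text).lower().split():
        word = normalize_jv(raw.strip(string.punctuation))
        if word:
            tokens.extend(split_enclitic(word))
    return tokens

tokens = tokenize("Arma virumque cano,\n\n\n Troiae † qui")
```

Emitting the enclitic as a separate "-que" token is a design choice of this sketch; it keeps the host word (uirum) countable under its own form.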

3. Evaluation: The Clustering and Decision Trees as combined techniques

As explained previously [29], document clustering was performed using the K-means method and Silhouette scores to evaluate the best cluster configuration. The optimal number of clusters (k) was determined by testing k values from 2 to 20. Tests were conducted using fixed ranges of character n-grams (2 to 7) and word n-grams (1 to 5), with the Silhouette coefficient calculated for each k. The aim was to find both the best k and the most effective n-gram ranges for clustering.
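The search for k described above can be sketched in plain Python. This is an illustration, not the study's implementation: the deterministic farthest-point initialisation, the helper names, and the two-dimensional toy points are assumptions, and only k = 2 to 4 is tried instead of 2 to 20. In scikit-learn, `KMeans` and `silhouette_score` cover the same two steps.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-point start
    (chosen here for reproducibility; an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new = [min(range(k), key=lambda j: dist(p, centers[j]))
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient, s = (b - a) / max(a, b)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)          # convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def best_k(points, k_values):
    """Return the k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores

# Toy "documents": two tight, well-separated groups in two dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
k, scores = best_k(points, range(2, 5))
```

Note the singleton convention: a cluster holding a single document contributes a silhouette of 0 for that document, yet the mean over all documents can remain high. A good overall score therefore does not rule out a partition that merely isolates one text.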

Once the data was labeled, decision trees were trained using the entropy criterion to assess feature importance, with Information Gain (IG) and Information Gain Ratio (IGR) calculated.
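To make the two split metrics concrete, the following sketch computes Information Gain and Information Gain Ratio for a single threshold split on one feature, using cluster labels as classes. The function names, the toy counts, and the thresholds are assumptions for illustration. As far as we are aware, scikit-learn's `DecisionTreeClassifier(criterion="entropy")` exposes `feature_importances_` but no built-in gain ratio, so a separate calculation of this kind would be needed for IGR.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def split_metrics(values, labels, threshold):
    """Information Gain and Gain Ratio of the split `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0, 0.0      # a split that separates nothing gains nothing
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) \
             + (len(right) / n) * entropy(right)
    gain = entropy(labels) - weighted
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain, gain / split_info

# Hypothetical raw counts of one term in six documents, with cluster labels.
counts = [0, 0, 1, 7, 9, 12]
labels = ["love", "love", "love", "epic", "epic", "epic"]

perfect = split_metrics(counts, labels, threshold=3)   # separates the classes
partial = split_metrics(counts, labels, threshold=0)   # leaves one side mixed
```

In this toy case the threshold at 3 separates the two clusters completely, while the threshold at 0 leaves one love document among the epic ones and scores lower on both metrics.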

4. Preliminary or Intermediate Results

Better Silhouette Scores were achieved using the raw frequency matrix with simple stopwords filtering (CountVectorizer with Stopwords), while the Relative Frequency and TF-IDF Matrices showed lower scores. TF-IDF scores were close to zero, indicating poor cluster separation, and Relative Frequency values were around 0.5 for both datasets (Tables 2 and 3). The use of relative frequency significantly impacted the optimal number of clusters recommended by the K-means algorithm.

The new tokenization and normalization process using CLTK modules had a noticeable impact. Regarding the most critical features for classifying clusters, results suggest that the methodology and resources should be reevaluated. In previous work, high Silhouette scores were observed with the frequency table, but feature importance metrics showed an uneven distribution, with one or two attributes dominating. While TF-IDF identified more features, the low Silhouette scores indicated poor classification. Despite better Silhouette scores, the same issue occurred with the relative frequency matrix.

4.1. Feature Extraction at Character N-Grams Level

As shown in Figure 1, the clustering task succeeded in separating Augustan love poets from epic poets. Tables 4 and 5 show the feature extraction at this n-gram level.

In the case of relative frequency, after excluding Lucan, the best Silhouette Score at the character n-gram level was achieved with 4-grams. However, despite obtaining a relatively good score (0.53), the resulting classification did not meet expectations (Figure 2). The algorithm constructed two clusters, one of which contained only Carmen 94. A similar phenomenon occurred when excluding Vergil, but with the distinction that the isolated Carmen was 112.

4.2. Feature Extraction at Word N-Grams Level

At the word n-gram level, similar to the character n-gram level, the best classification method was the raw frequency matrix. Although the relative frequency matrix also yielded good Silhouette scores, it consistently produced poor classifications, isolating only Carmen 94 from the rest. At other n-gram levels, the carmina that were separated included Carmina 14, 82, 85, and 106. All of these are relatively short, suggesting that the difference in length among the poems introduces internal variability in the corpus that hinders classification based on relative frequencies. The same task, when performed using raw frequencies, yielded excellent results, whether Lucan or Vergil was excluded from the analysis (Figure 3 and Tables 6, 7).
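The length effect described above can be seen in a small numerical example. The short text is Carmen 85 in full; the long document is synthetic, and its size and counts are invented purely for illustration.

```python
from collections import Counter

def relative_frequency(tokens):
    """Share of each token within its own document."""
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}

# Catullus 85 in full, lower-cased and without punctuation (14 tokens).
carmen_85 = ("odi et amo quare id faciam fortasse requiris "
             "nescio sed fieri sentio et excrucior").split()

# Synthetic stand-in for a long epic book: 1000 tokens, 30 of them "et".
# The filler token and both numbers are assumptions of this sketch.
long_book = ["et"] * 30 + ["filler"] * 970

raw_short, raw_long = Counter(carmen_85), Counter(long_book)
rel_short = relative_frequency(carmen_85)
rel_long = relative_frequency(long_book)

# Raw counts favour the long text, relative shares favour the short one.
print(raw_short["et"], raw_long["et"])
print(round(rel_short["et"], 3), round(rel_long["et"], 3))
```

In a very short poem every token carries a large share of the document, so a few repeated words can push it far from all other documents in relative-frequency space. That would be consistent with a single short carmen ending up alone in its own cluster.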

In the following table and figure, it can be observed that the use of relative frequency brings forth personal pronouns, terms previously associated with love poetry in earlier studies. However, these should be disregarded when obtained through this methodology, as the classifications achieved with them were quite poor, as can be seen in Table 8 and Figure 4.

Both datasets, whether excluding Lucan or Vergil, showed identical performance in author classification. Character n-grams (2 to 6) and single-word n-grams effectively separated epic authors from Augustan love poets using the raw frequency matrix, with Silhouette Scores above 0.83. Lower scores led to suboptimal classifications, where one cluster contained only a single book (e.g., Book X or XII of the Aeneid or Book IX of Pharsalia).

Assessing the relevance of specific character n-grams for classification remains challenging, requiring a more detailed stylistic investigation. In summary, document grouping was effective, though feature-level techniques did not always highlight typical elegiac terms. The terms extracted via decision trees for Augustan love poets and Vergil predominantly reflected epic, mythical, and martial language⁴.

³ For more details on the extracted terms, see Appendix A and B.

⁴ Please note that with the English quotation marks we have attempted to indicate the spaces before or after the words, in cases where it corresponds to the character n-gram.

4.3. Data from the Corpora of Catullus, Tibullus, Propertius, and Vergil:

• 5-character n-grams: ‘ sub ’ (low), eucri and eucru (part of Teucri), fatus (spoke), fatur (speaks), auras (breezes), eneas (Aeneas)
• 6-character n-grams: teucru (Trojan), aeneas (Aeneas), ‘fatur ’ (speaks), ‘fatus ’ (spoke), ‘fatis ’ (fates), ipoten and mnipot (from omnipotens, presumably attributed to Jupiter), clamor (shout)
• 1 word n-grams: urbem (city), aeneas (Aeneas), teucrum (Trojan), ingentem (huge), omnipotens (almighty), aether (ether/sky), pius (pious), iamque (and now), socius (ally), clamore (shout), finis (end), fatis (fates), ignem (fire), auris (from auris, ear or aurum, gold), caelum (sky), genitor (father), hostis (enemy), terram (land), bellum (war), dux (leader), uisus (vision).

A similar phenomenon occurs with the terms obtained from the grouping of the Augustan love poets with Lucan, where words referring to the political causes of the Civil War predominate, emphasizing the crimes committed by the different factions and the physical consequences on the bodies of the Roman soldiers and citizens [38]:

4.4. Data from the Corpora of Catullus, Tibullus, Propertius, and Lucan:

• 5-character n-grams: aesar (from Caesar), aussa (cause), scera (part of viscera, viscera), pulos (from populos, peoples), elero (part of scelero, referring to crimes), coeli (of ‘ coel’, from heaven), libye (Libya), gladi (part of gladium, sword), adhu and adhuc (from adhuc, until now)
• 6-character n-grams: ‘osque ’ (composed most likely by the accusative plural ending of the second declension combined with the enclitic -que), pulos and opulos (peoples), libye (Libya), scera and iscera (from viscera, entrails), bellor and elloru (as part of bellorum, of the wars), elerum, ‘lerum’ (from scelerum, crime), caussa and ‘ causs’ (cause), ‘ gladi’ (as part of gladium in its different forms, the sword), ‘ coeli’ (sky).
• 1 word n-grams: pectora (chests), populos (peoples), scelerum (crimes), bellorum (wars), senatus (senate), ciuilia (civil), nocentes (from nocens, guilty or harmful), caussa (cause), fatis (fates), coelo (sky), mundus (world), caeli (sky), diui (gods), libye (Libya).

In previous work, we found that the Silhouette method consistently recommended two clusters for 2-character n-grams and three clusters for 1-word n-grams, regardless of the technique used. Scatter plots aligned with the stylistic distribution reported by Forstall et al. [15], who used SVM to analyze Catullus’ influence on Paul the Deacon’s poetry (Figure 5).

However, in this instance, whether due to the new preprocessing corrections or the novel comparison methods employed in separating Vergil and Lucan, the clustering tasks were not always accurate. While some correct distributions of the authors can be detected in the scatter plot space, the algorithm’s non-human interpretation results in an unclear cluster classification, grouping Tibullus, Propertius, and Vergil against Catullus (Figure 6).

5. Conclusions and Learned Lessons

This article highlights the need to reevaluate methodologies and resources. Positive clustering results were obtained, especially with raw frequencies and n-grams, with Silhouette Scores above 0.8. Preprocessing steps and CLTK modules for tokenization and normalization significantly impacted the results, emphasizing the importance of tailored tools for ancient texts.

Relative frequency and two datasets reduced noise and bias from the epic authors, aiming to balance the text sets. However, even with more balanced data, raw frequencies provided better clustering results than relative frequency and TF-IDF matrices. As in the previous study, uneven feature importance and variable performance across n-gram levels and matrices suggest further refinement is needed for consistent results.

N-grams from Vergil’s and Lucan’s works show a dominance of political, historical, and war-related terms. This suggests that, despite efforts to balance the datasets, the lexical characteristics of epic poetry still influence classification. The identified n-grams reflect the epic and mythical focus of these authors, contrasting with the love and personal themes of Catullus, Tibullus, and Propertius. The variability in document length (regular in Pharsalia and Aeneid, but variable in the Augustan love poets) affects results. Additionally, the internal variability among the Augustan love poets’ corpora also affects classification. We could experiment by partitioning Catullus’ work into polymetric poems, carmina maiora, and epigrams or elegiac couplets, and run separate analyses, or intervene in the corpus catullianum by removing non-amorous themed poems. However, this is complex, as thematic boundaries are not clear-cut.

This exploratory analysis requires further refinement with other techniques, such as variable ranges of character and word n-grams (only fixed ranges were used in this study), other similarity measures such as Jaccard, Cosine, or Soft Cosine, and clustering methods like Gaussian Mixture Models, DBSCAN, or hierarchical clustering. Future research could apply normalization techniques such as L1 or Z-scaler, and phenomena like collocations and co-occurrences, which were not applied in this study. A close reading of clusters based on relative frequency also offers promise.
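Two of the similarity measures named above can be sketched in a few lines. They were not applied in this study; the function names and toy vectors are assumptions for illustration. Soft Cosine is omitted because it additionally requires a term-to-term similarity matrix.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two feature sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine(u, v):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) \
         * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical term counts: the same profile at two different lengths.
short_doc = {"amor": 1, "puella": 1}
long_doc = {"amor": 10, "puella": 10}
epic_doc = {"bellum": 4, "arma": 2}

same_profile = cosine(short_doc, long_doc)
no_overlap = cosine(short_doc, epic_doc)
shared_vocab = jaccard(short_doc, {"amor": 3, "bellum": 1})
```

Cosine similarity depends only on the direction of the count vectors, so two documents with the same term proportions score as identical regardless of length. This is one reason it may be worth testing on a corpus with very uneven poem lengths.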

As for the representation of the documents, there is a need to explore techniques with embeddings like those developed by Burns et al., Bamman et al., and Johnson et al. [18, 8, 3, 6].

It should also be noted that the terms obtained by the Decision Tree technique are words with classification power for that dataset, not necessarily the most typical of one type of poetry or another, as there may be important words for both genres penalized by the metrics of Importance, Information Gain, or TF-IDF. The unequal size of the poems also contributed to the clarity of classification in raw counts, indirectly transferring poem length as a classification criterion. Similarly, in decision trees, the feature split points reflected the same pattern, with epic poem features having much higher frequencies, clearly impacting the results.

Finally, it is important to briefly consider the implications of applying computational and Distant Reading techniques alongside hypotheses or educated guesses from Close Reading. Frequency counting, for instance, is used here to model documents, but humans do not speak to be counted. Otherwise, Catullus would simply have repeated the name Lesbia, and his love would have been understood without the effort of creating poetry. Fortunately, language is far more abstract and complex, and computational methods are only beginning to reveal its intricacies. This issue has resurfaced with criticisms, such as those by Noam Chomsky, against generative models [11]. It is true that the human mind performs language tasks in a highly elegant manner and acquires a language from far less data than Large Language Models (LLMs) handle. LLMs are tools developed for other tasks that did not originally seek to emulate the human mind [13, 37]. But it is also true that one must yield to the evidence of the successful results obtained with these techniques and their undeniable capacity to facilitate all kinds of tasks. It is essential to acknowledge both the limits and strengths of computational tools, recognizing that Distant Reading offers a different scale of analysis, rooted not just in methodology but in changes in how information is produced, accessed, and analyzed in the digital age [33]. Despite criticisms [5], Digital Humanities methodologies hold great promise for studying language-rich subjects that balance aesthetic and rhythmic elements like refrains, alliterations, and anaphoras, presenting a unique challenge for modern analytical

Acknowledgments

I sincerely thank Dr. Kyle P. Johnson, Director of AI at Morgan, Lewis and Bockius LLP, and Dr. Patrick J. Burns from the Institute for the Study of the Ancient World, NYU, for their kind and insightful responses to my inquiries on tokenization and the use of CLTK and LatinCy libraries. I also extend my gratitude to Professor Benjamin Nagy from the Institute of the Polish Language, Polish Academy of Sciences (IJP PAN), Krakow, for his expert advice on correcting verse counts based on authorized editions, which greatly enhanced the accuracy of this text analysis.

References

[1] [No author]. Perseus Digital Library Homepage. [No date]. url: https://www.perseus.tufts.edu/hopper/.

[2] [No author]. Stopwords ISO. [No date]. url: https://github.com/stopwords-iso/stopwords-iso/blob/master/README.md.

[3] D. Bamman and P. J. Burns. Latin BERT: A Contextual Language Model for Classical Philology. 2020. doi: 10.48550/arXiv.2009.10053. url: http://arxiv.org/abs/2009.10053.

[4] A. Berra. Ancient Greek and Latin Stopwords. 2024. url: https://github.com/aurelberra/stopwords.

[5] T. Brennan. The Digital-Humanities Bust. 2017. url: https://www.chronicle.com/article/the-digital-humanities-bust/.

[6] P. J. Burns. “Building a Text Analysis Pipeline for Classical Languages”. In: Building a Text Analysis Pipeline for Classical Languages. De Gruyter Saur, 2019, pp. 159–176. doi: 10.1515/9783110599572-010. url: https://www.degruyter.com/document/doi/10.1515/9783110599572-010/html.

[7] P. J. Burns. “Constructing Stoplists for Historical Languages”. In: Digital Classics Online (2018), pp. 4–20. doi: 10.11588/dco.2018.2.52124. url: https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124.

[8] P. J. Burns. LatinCy: Synthetic Trained Pipelines for Latin NLP. 2023. doi: 10.48550/arXiv.2305.04365. url: http://arxiv.org/abs/2305.04365.

[9] Cantaluppi and Passarotti. “Clustering the Corpus of Seneca: A Lexical-Based Approach”. In: Advances in Latent Variables: Methods, Models and Applications. Ed. by Carpita, E. Brentari, and E. M. Qannari. Cham: Springer International Publishing, 2015, pp. 13–25. doi: 10.1007/10104_2014_6. url: https://doi.org/10.1007/10104_2014_6.

[10] L. M. Cerrato and R. F. Chavez. Perseus Classics Collection: An Overview. [No date]. url: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0053.

[11] Chomsky, I. Roberts, and Watumull. “Noam Chomsky: The False Promise of ChatGPT”. In: The New York Times (2023). url: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html.

[12] ... of Digital Humanities”. In: Journal of Digital Humanities (2013), pp. 478–479. url: https://journalofdigitalhumanities.org/3-1/modelling-the-interpretation-of-literary-allusion-with-machine-learning-techniques/.

[13] Devlin, M.-W. Chang, Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. doi: 10.48550/arXiv.1810.04805. url: http://arxiv.org/abs/1810.04805.

[14] C. W. Forstall, S. L. Jacobson, and W. J. Scheirer. “Evidence of intertextuality: investigating Paul the Deacon's Angustae Vitae”. In: Literary and Linguistic Computing 26.3 (2011), pp. 285–296. doi: 10.1093/llc/fqr029. url: https://doi.org/10.1093/llc/fqr029.

[15] C. W. Forstall and W. J. Scheirer. “A Statistical Stylistic Study of Latin Elegiac Couplets”. In: 2010 Chicago Colloquium on Digital Humanities and Computer Science. 2010, [No pages]. url: https://www.semanticscholar.org/paper/A-Statistical-Stylistic-Study-of-Latin-Elegiac-Forstall-Scheirer/e3caac9ec4ee16baac70ed94808dca57dff48a2d.

[16] Giovanni Bracco, Silvio Migliori, Giorgio Mencuccini, Daniela Alderuccio, and Giovanni Ponti. “Data mining tools and GRID infrastructure for Assyriology text analysis (an Old-Babylonian situation studied through text analysis and data mining tools)”. In: RAI - Rencontre Assyriologique Internationale - Private and State in the Ancient Near East. Belgium, 2013, [No pages].

[17] J. B. Greenough. The Bucolics, AEneid, and Georgics of Virgil. Boston: Ginn, 1900.

[18] K. P. Johnson, P. J. Burns, J. Stewart, T. Cook, C. Besnier, and W. J. B. Mattingly. “The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations. Ed. by Ji, J. C. Park, and Xia. Online: Association for Computational Linguistics, 2021, pp. 20–29. doi: 10.18653/v1/2021.acl-demo.3. url: https://aclanthology.org/2021.acl-demo.3.

[19] C. S. Lewis. La alegoría del amor: un estudio sobre tradición medieval. 2015 ed. Madrid: Encuentro, 1936.

[20] Lloyd. “Least squares quantization in PCM”. In: IEEE Transactions on Information Theory 28.2 (1982), pp. 129–137. doi: 10.1109/tit.1982.1056489. url: https://ieeexplore.ieee.org/document/1056489.

[21] Martins, Grácio, Teixeira, I. Pimenta Rodrigues, J. L. G. Zapata, and Ferreira. “Historia Augusta authorship: an approach based on Measurements of Complex Networks”. In: Applied Network Science 6.1 (2021), pp. 1–23. doi: 10.1007/s41109-021-00390-7. url: https://appliednetsci.springeropen.com/articles/10.1007/s41109-021-00390-7.

[22] E. T. Merrill. Catullus; edited by Elmer Truesdell Merrill. Boston: Ginn, 1893. url: http://archive.org/details/catulluseditedby00catuuoft.

[23] Moretti. Distant Reading. Verso, 2013.

[24] L. Müller. Sex. Propertii Elegiae. Leipzig: Teubner, 1898.

[25] B. Nagy. “Rhyme in classical Latin poetry: Stylistic or stochastic?” In: Digital Scholarship in the Humanities 37.4 (2022), pp. 1097–1118. doi: 10.1093/llc/fqab105. url: https://doi.org/10.1093/llc/fqab105.

[26] B. Nagy. “Some stylometric remarks on Ovid's Heroides and the Epistula Sapphus”. In: Digital Scholarship in the Humanities 38.3 (2023), pp. 1183–1199. doi: 10.1093/llc/fqac098. url: https://doi.org/10.1093/llc/fqac098.

[27] C. J. Nusch. “Las Edades del Amor: una propuesta para el proyecto Aetates Amoris destinado a la poesía amorosa”. Tesis. Universidad Nacional de Educación a Distancia, España, 2021. doi: 10.35537/10915/125629. url: http://sedici.unlp.edu.ar/handle/10915/125629.

[28] C. J. Nusch. “Una breve exploración de la terminología amorosa en los corpora catullianum, tibullianum y propertianum con métodos y herramientas computacionales: etiquetado gramatical, lemas, bigramas y co-apariciones”. In: Revista de Humanidades Digitales 9 (2024), pp. 1–40. doi: 10.5944/rhd.vol.9.2024.38680. url: https://revistas.uned.es/index.php/RHD/article/view/38680.

[29] C. J. Nusch, G. del Rio Riande, L. C. C. Cagnina, M. L. Errecalde, and L. Antonelli. “Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets”. In: Decisioning. Pereira, Colombia, 2024.

[30] J. P. Postgate. Tibulli aliorumque carminum libri tres. Oxford: Scriptorum classicorum bibliotheca Oxoniensis, 1915.

[31] J. R. Quinlan. “Induction of decision trees”. In: Machine Learning 1.1 (1986), pp. 81–106. doi: 10.1007/bf00116251. url: https://doi.org/10.1007/BF00116251.

[32] Ramsay. Reading Machines: Toward an Algorithmic Criticism. Urbana, 2011.

[33] Ricardo Pimenta. “De Narciso ao mundo-imagem: por uma urgência de uma perspectiva crítica sobre a cena informacional contemporânea”. In: Ciência da Informação: sociedade, crítica e inovação. Rio de Janeiro, 2022, [No pages].

[34] P. J. Rousseeuw. “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis”. In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65. doi: 10.1016/0377-0427(87)90125-7. url: https://www.sciencedirect.com/science/article/pii/0377042787901257.

[35] C. E. Shannon. “A mathematical theory of communication”. In: The Bell System Technical Journal 27.3 (1948), pp. 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. url: https://ieeexplore.ieee.org/document/6773024.

[36] K. Spärck Jones. “A statistical interpretation of term specificity and its application in retrieval”. In: Journal of Documentation 28.1 (1972), pp. 11–21. doi: 10.1108/eb026526. url: https://doi.org/10.1108/eb026526.

[37] Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. “Attention is All you Need”. In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017, [No pages]. url: https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

[38] M. M. Vizzotti. “De la tragedia de Séneca a la épica de Lucano: estrategias de representación de los paradigmas filosóficos y literarios”. Tesis. Universidad Nacional de La Plata, 2014. doi: 10.35537/10915/34410. url: http://sedici.unlp.edu.ar/handle/10915/34410.

[39] C. H. Weise. Pharsaliae Libri X. M. Annaeus Lucanus. Leipzig: G. Bassus, 1935.

Appendix: Extra Data from the Corpora of Catullus, Tibullus, Propertius, Vergilius and Lucanus

Raw Frequency, char (2, 2) n-grams (columns: Feature, Importance; Feature, IG; Feature, IGR)
un 1
iuom ucru temq ubib boru nimb m ef efa fat ast cios lumq ipot mman mnip nipo bibu sumq adfa anid eucru borum ntemq fatur iuom imman ucrum eucri e teu temqu fatus cios clamo anch efa lamor ocios efat undam m ef

Raw Frequency, char (2, 2) n-grams (columns: Feature, Importance; Feature, IG; Feature, IGR)
g 1

Raw Frequency, char (3, 3) n-grams (columns: Feature, Importance; Feature, IG; Feature, IGR)
te 1
bye oer dhu ye rct b p giq gme emt bmo eer xir ax obo nny al mto bif efo gad

Raw Frequency, char (4, 4) n-grams (columns: Feature, Importance; Feature, IG; Feature, IGR)
ra n 1

Raw Frequency, char (5, 5) n-grams (columns: Feature, Importance; Feature, IG; Feature, IGR)
aesar 1

Raw Frequency, char (6, 6) n-grams (columns: Feature, Importance; Feature, IG; Feature, IGR)
osque 1
pectora