Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction

Carlos Javier Nusch¹,²,∗,†, Gimena del Rio Riande³,†, Leticia Cecilia Cagnina⁴,†, Marcelo Luis Errecalde⁴,† and Leandro Antonelli⁵,⁶,†

¹ PREBI, SEDICI, Universidad Nacional de La Plata, Argentina
² CESGI, Comisión de Investigaciones Científicas de la Provincia de Buenos Aires, Argentina
³ IIBICRIT, Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina
⁴ LIDIC, Facultad de Ciencias Físico Matemáticas y Naturales, Universidad Nacional de San Luis, Argentina
⁵ LIFIA, Facultad de Informática, Universidad Nacional de La Plata, Argentina
⁶ CAETI, Facultad de Tecnología Informática, Universidad Abierta Interamericana, Argentina

Abstract

This article extends various automatic text analysis tasks from previous works by applying natural language processing techniques to a corpus of Latin texts from the 1st century BC and 1st century AD. The motivation behind this work is to delve into and understand a historical literary trend revolving around the themes of love, spanning from antiquity through to the medieval period. The analyzed authors include Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, representing the literary movement of the neoterics, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with distinct styles, serving as control samples. Unlike previous works, various corrections were added to the preprocessing tasks, including improved word tokenization with enclitics and handling of orthographic variants. For the clustering tasks, the K-Means method and the Silhouette Score were used to determine the optimal number of clusters. Using these optimal clusters as labels, decision trees were trained for each range of n-grams, aiming to identify features with the highest Information Gain and Information Gain Ratio. The trees were trained based on the criterion of Entropy, and calculations of Feature Importance were performed. In this study, we focused on detailing the classification results and features extracted by the decision trees, based on the best Silhouette scores obtained and the Information Gain. We examined whether the words or parts of words with classificatory potential identified in the process matched the findings from previous exploratory tasks performed using other techniques.

Keywords

Augustan love poets, Document Clustering, K Means, Silhouette Coefficient, Decision Trees, Feature Importance, Information Gain Ratio

CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
∗ Corresponding author.
† These authors contributed equally.
Email: carlosnusch@prebi.unlp.edu.ar (C. J. Nusch)
URL: https://prebi-sedici.unlp.edu.ar/personal/carlos-nusch/ (C. J. Nusch)
ORCID: 0000-0003-1715-4228 (C. J. Nusch); 0000-0002-8997-5415 (G. d. R. Riande); 0000-0001-7825-2927 (L. C. Cagnina); 0000-0001-5605-8963 (M. L. Errecalde); 0000-0003-1388-0337 (L. Antonelli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

This study¹ builds on a master’s thesis [27] examining C. S. Lewis’ observations [19] on the influence of courtly love and Occitan literature on 20th-century love imagery. Similarities in love themes, treatment of the beloved, and political and military terms were found between Occitan and 1st-century BC Latin poetry. This thesis aims to identify textual patterns linking ancient love themes to the Religion of Love in medieval Occitan poetry, using a comparative approach that combines close reading with computational methods [32, 23]. This article evaluates clustering techniques for differentiating love poems from other Latin poetry, identifying key lexical features.
Previous work [29] explained the techniques, while here we focus on feature extraction and optimal Silhouette Score values.

1.1. State of the Art

Several authors have applied clustering to ancient texts. Bracco et al. [16] used K-means to detect literary genres in cuneiform texts, and Martins et al. [21] used k-Nearest Neighbors for author classification. Cantaluppi and Passarotti [9] studied Seneca’s complete works, Cicero’s orations, Jerome’s Latin New Testament, and Aquinas’ major works. Nagy [25] used multivariate analysis and clustering to examine rhyme in twelve classical Latin poets, identifying stylistic differences between genres and authors. In more recent work, he applied UMAP and t-SNE to show stylistic distinctions between Ovid’s Heroides and other works, and to assess the authenticity of the Epistula Sapphus [26]. Forstall et al. [14, 15] compared lexical and rhythmic features at character and word n-gram levels with other 1st-century BC poets.

1.2. Problem Definition and Contributions

The previous work aimed to explore clustering techniques to distinguish love poems from other types of poetry and identify useful lexical characteristics for classification. The K-means algorithm [20] was used, and the optimal number of clusters was determined with the Silhouette Index [34], which measures group cohesion and separation. Since K-means, based on Euclidean distance, does not provide detailed feature extraction, decision trees [31] were used to complement this approach. This combination allowed for the indirect extraction of features, with metrics such as Importance, Information Gain, and Information Gain Ratio [35] identifying the most relevant features.

¹ An appendix with key tables is included after the references. A larger dataset is available: Nusch, C. (2024). Clustering Tasks and Decision Trees with Augustan love poets [Data set]. CHR2024, Aarhus, Denmark. Zenodo. https://doi.org/10.5281/zenodo.12682694.

2. Research Methodology and Approach

2.1. Analysis Corpus and Used Editions

The corpus includes the complete works of Gaius Valerius Catullus [22], Albius Tibullus [30], and Sextus Propertius [24], representing love poetry, as well as all books from the Aeneid by Publius Vergilius Maro [17] and Pharsalia by Marcus Annaeus Lucanus [39] as control samples, focused on political, historical, and martial themes. The analysis reveals differences in the number of words and verses per poem among different authors and genres (Table 1). To address concerns about unbalanced datasets, we used relative frequency and separated the authors to reduce noise and bias from the larger epic texts. Two datasets were used: one with the Augustan love poets and Vergil, and another with the Augustan love poets and Lucan.

Table 1: Summary of works and word statistics from various Latin authors.

| Author (Work) | Verses | Total Words | Unique Words | Avg. Words per Poem/Canto |
| --- | --- | --- | --- | --- |
| Catullus (Merrill, 1893) | 2289 | 12912 | 5802 | 110.35 |
| Tibullus (Postgate, 1915) | 1930 | 12368 | 5201 | 334.27 |
| Propertius (Müller, 1898) | 4008 | 25450 | 9809 | 242.38 |
| Lucanus, Pharsalia (Weise, 1935) | 8061 | 51215 | 14750 | 5121.5 |
| Vergilius, Aeneid (Greenough, 1900) | 9896 | 63896 | 16616 | 5324.66 |

To construct the analysis corpus, resources from the Perseus Project digital library [1, 10] at Tufts University were used. The library contains 2,412 works in 3,192 editions and translations (1,639 in Greek, 636 in Latin) and a total of 69.7 million words. The texts, curated by specialists and shared under a CC BY-SA 3.0 (US) license, are available in XML format. Additional resources include models for grammatical tagging and stopwords for Latin.
The poems were harvested through web scraping using R, while Python libraries were employed for text analysis and mining.

The analysis explored character n-grams (2 to 7) and word n-grams (1 to 5), using the Bag of Words (BOW) method [36]. Three types of matrices were generated: the first based on raw frequency (using Scikit-learn’s CountVectorizer), the second on relative frequency (with a custom function), and the third using the TF-IDF technique [36], which highlights important words by weighting their frequency against their rarity across the dataset. While CountVectorizer simply counts word occurrences, TF-IDF reduces the impact of common words, giving more weight to distinctive terms.

2.2. Text Preprocessing Tasks

Before analysis, the text was cleaned by removing empty lines, sequences of whitespace (“\n \n\n\n”), and editorial symbols for illegible gaps (“†”). Spanish quotation marks were replaced with English ones for tool compatibility, and punctuation was removed from character n-grams, as it was added by editors. For stopwords², we used the Stopwords ISO [2] package for Latin, which we preferred over the Perseus Project version because it retains important words in elegiac poetry, such as ego, enabling the analysis of personal pronouns, a significant feature noted in previous works [27, 28].

To enhance tokenization, we added two procedures from The Classical Language Toolkit (CLTK) [18]: JVReplacer, to standardize spellings (e.g., Iulius/Julius and uir/vir), and LatinWordTokenizer, which helped identify enclitics (e.g., -que, -ve) and prevent incorrect tokenization.

² For a more detailed discussion of the complexity and variety of stopwords in Latin and other ancient languages, see A. Berra [4] and P. J. Burns [7].

3. Evaluation: The Clustering and Decision Trees as combined techniques

As explained previously [29], document clustering was performed using the K-means method and Silhouette scores to evaluate the best cluster configuration.
The optimal number of clusters (k) was determined by testing k values from 2 to 20. Tests were conducted using fixed ranges of character n-grams (2 to 7) and word n-grams (1 to 5), with the Silhouette coefficient calculated for each k. The aim was to find both the best k and the most effective n-gram ranges for clustering. Once the data was labeled, decision trees were trained using the entropy criterion to assess feature importance, with Information Gain (IG) and Information Gain Ratio (IGR) calculated.

4. Preliminary or Intermediate Results

Better Silhouette Scores were achieved using the raw frequency matrix with simple stopwords filtering (CountVectorizer with Stopwords), while the Relative Frequency and TF-IDF Matrices showed lower scores. TF-IDF scores were close to zero, indicating poor cluster separation, and Relative Frequency values were around 0.5 for both datasets (Tables 2 and 3). The use of relative frequency significantly impacted the optimal number of clusters recommended by the K-means algorithm. The new tokenization and normalization process using CLTK modules had a noticeable impact.

Table 2: Optimal clusters and corresponding Silhouette values for different ranges of n-grams (Corpora of Catullus, Tibullus, Propertius, and Vergilius).

| N-gram Type | Raw Frequency: Clusters | Raw Frequency: Score | Relative Frequency: Clusters | Relative Frequency: Score | TF-IDF: Clusters | TF-IDF: Score |
| --- | --- | --- | --- | --- | --- | --- |
| Char 2-grams | 2 | 0.94 | 7 | 0.19 | 7 | 0.16 |
| Char 3-grams | 2 | 0.93 | 2 | 0.534 | 3 | 0.07 |
| Char 4-grams | 2 | 0.904 | 2 | 0.536 | 3 | 0.031 |
| Char 5-grams | 2 | 0.85 | 5 | 0.446 | 2 | 0.008 |
| Char 6-grams | 2 | 0.801 | 5 | 0.443 | 15 | 0.008 |
| Char 7-grams | 2 | 0.76 | 5 | 0.44 | 15 | 0.004 |
| Word 1-grams | 2 | 0.802 | 2 | 0.55 | 2 | 0.013 |
| Word 2-grams | 2 | 0.718 | 2 | 0.52 | 18 | 0.001 |
| Word 3-grams | 2 | 0.719 | 2 | 0.54 | 19 | 0.007 |
| Word 4-grams | 2 | 0.72 | 2 | 0.56 | 10 | 0.007 |
| Word 5-grams | 2 | 0.721 | 2 | 0.58 | 19 | 0.007 |
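The procedure described in Sections 2.1 and 3 (three bag-of-words matrices, then K-means for each k scored with the Silhouette coefficient) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `build_matrices` and `best_k`, the row-wise normalization used for relative frequency, the K-means settings (`n_init`, `random_state`), and the toy documents are all assumptions introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import silhouette_score


def build_matrices(docs, analyzer="word", ngram_range=(1, 1), stop_words=None):
    """Return raw-count, relative-frequency and TF-IDF matrices for the documents."""
    counter = CountVectorizer(analyzer=analyzer, ngram_range=ngram_range,
                              stop_words=stop_words)
    raw = counter.fit_transform(docs).toarray().astype(float)
    # Assumption: relative frequency = each count divided by the document total.
    totals = raw.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against empty documents
    relative = raw / totals
    tfidf = TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range,
                            stop_words=stop_words).fit_transform(docs).toarray()
    return {"raw": raw, "relative": relative, "tfidf": tfidf}


def best_k(matrix, k_values=range(2, 21), seed=0):
    """Return (k, score) with the highest Silhouette score among the candidate k."""
    best = (None, -1.0)
    for k in k_values:
        if k >= matrix.shape[0]:  # the Silhouette score needs k <= n_samples - 1
            break
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(matrix)
        if len(set(labels)) < 2:  # degenerate solution, cannot be scored
            continue
        score = silhouette_score(matrix, labels)
        if score > best[1]:
            best = (k, score)
    return best


# Made-up toy documents, for illustration only (not taken from the corpus).
toy_docs = [
    "arma uirum cano troiae", "arma bellum arma uirum", "bellum arma troiae cano",
    "lesbia mea puella amor", "amor puella lesbia lesbia", "mea lesbia amor puella",
]
for name, matrix in build_matrices(toy_docs).items():
    print(name, best_k(matrix, k_values=range(2, 6)))
```

Running the same sweep over each matrix type and each n-gram range is what would populate a table shaped like Tables 2 and 3; the values printed for the toy documents have no bearing on the reported results.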
Regarding the most critical features for classifying clusters, results suggest that the methodology and resources should be reevaluated. In previous work, high Silhouette scores were observed with the frequency table, but feature importance metrics showed an uneven distribution, with one or two attributes dominating. While TF-IDF identified more features, the low Silhouette scores indicated poor classification. Despite better Silhouette scores, the same issue occurred with the relative frequency matrix.

Table 3: Optimal clusters and corresponding Silhouette values for different ranges of n-grams (Corpora of Catullus, Tibullus, Propertius, and Lucanus).

| N-gram Type | Raw Frequency: Clusters | Raw Frequency: Score | Relative Frequency: Clusters | Relative Frequency: Score | TF-IDF: Clusters | TF-IDF: Score |
| --- | --- | --- | --- | --- | --- | --- |
| Char 2-grams | 2 | 0.94 | 2 | 0.19 | 2 | 0.15 |
| Char 3-grams | 2 | 0.92 | 3 | 0.53 | 3 | 0.074 |
| Char 4-grams | 2 | 0.89 | 2 | 0.53 | 4 | 0.031 |
| Char 5-grams | 2 | 0.85 | 5 | 0.44 | 2 | 0.009 |
| Char 6-grams | 2 | 0.801 | 2 | 0.53 | 4 | 0.004 |
| Char 7-grams | 2 | 0.79 | 2 | 0.55 | 18 | 0.004 |
| Word 1-grams | 2 | 0.802 | 2 | 0.52 | 2 | 0.011 |
| Word 2-grams | 2 | 0.74 | 2 | 0.38 | 18 | 0.001 |
| Word 3-grams | 3 | 0.74 | 3 | 0.54 | 17 | 0.003 |
| Word 4-grams | 3 | 0.75 | 2 | 0.56 | 15 | 0.008 |
| Word 5-grams | 3 | 0.75 | 2 | 0.58 | 19 | 0.007 |

4.1. Feature Extraction at Character N-Grams Level

As shown in Figure 1, the clustering task succeeded in separating Augustan love poets from epic poets. Tables 4 and 5 show the features extracted at this n-gram level.

Figure 1: Scatter plot of clustering by K-Means using a raw frequency matrix of character 2-grams, indicating the clusters (left) and authors (right) with different colors.

In the case of relative frequency, after excluding Lucan, the best Silhouette Score at the character n-gram level was achieved with 4-grams. However, despite obtaining a relatively good score (0.53), the resulting classification did not meet expectations (Figure 2). The algorithm constructed two clusters, one of which contained only Carmen 94. A similar phenomenon occurred when excluding Vergil, except that the isolated poem was Carmen 112.

4.2. Feature Extraction at Word N-Grams Level

At the word n-gram level, as at the character n-gram level, the best classification method was the raw frequency matrix. Although the relative frequency matrix also yielded good Silhouette scores, it consistently produced poor classifications, isolating only Carmen 94 from the rest. At other n-gram levels, the carmina that were separated included Carmina 14, 82, 85, and 106. All of these are relatively short, suggesting that the difference in length among the poems introduces internal variability in the corpus that hinders classification based on relative frequencies. The same task, when performed using raw frequencies, yielded excellent results, whether Lucan or Vergil was excluded from the analysis (Figure 3 and Tables 6 and 7).

As Table 8 and Figure 4 show, the use of relative frequency brings forth personal pronouns, terms previously associated with love poetry in earlier studies. However, these should be disregarded when obtained through this methodology, as the classifications achieved with them were quite poor.

Both datasets, whether excluding Lucan or Vergil, showed identical performance in author classification. Character n-grams (2 to 6) and single-word n-grams effectively separated epic authors from Augustan love poets using the raw frequency matrix, with Silhouette Scores above 0.8³. Lower scores led to suboptimal classifications, where one cluster contained only a single book (e.g., Book X or XII of the Aeneid or Book IX of Pharsalia). Assessing the relevance of specific character n-grams for classification remains challenging, requiring a more detailed stylistic investigation.
In summary, document grouping was effective, though feature-level techniques did not always highlight typical elegiac terms. The terms extracted via decision trees for Augustan love poets and Vergil predominantly reflected epic, mythical, and martial language⁴.

³ For more details on the extracted terms, see Appendix A and B.
⁴ Please note that with the English quotation marks we have attempted to indicate the spaces before or after the words, in cases where it corresponds to the character n-gram.

Table 4: Most important features (character 2-grams) by Importance, Information Gain (IG), and Information Gain Ratio (IGR), using the raw frequency matrix method and Stopwords filtering (Corpora of Catullus, Tibullus, Propertius, and Vergilius).

| Feature | Importance | Feature | IG | Feature | IGR |
| --- | --- | --- | --- | --- | --- |
| un | 1 | nl | 0.1254 | dg | 0.5747 |
| | | uq | 0.1226 | aë | 0.5463 |
| | | dg | 0.1208 | oï | 0.5193 |
| | | gg | 0.1205 | ën | 0.4929 |
| | | ms | 0.1205 | ïa | 0.4929 |
| | | gm | 0.1205 | ez | 0.4664 |
| | | dh | 0.1181 | aï | 0.4664 |
| | | bt | 0.1174 | dh | 0.4516 |
| | | yd | 0.1129 | ï | 0.439 |
| | | dl | 0.1128 | eï | 0.439 |
| | | nh | 0.1128 | ër | 0.439 |
| | | ze | 0.1128 | mf | 0.4105 |
| | | bn | 0.1072 | x | 0.376 |
| | | my | 0.1068 | oö | 0.3758 |
| | | ln | 0.105 | ë | 0.3758 |
| | | rh | 0.105 | ïc | 0.3758 |
| | | aë | 0.1049 | ön | 0.3758 |
| | | df | 0.1036 | ïu | 0.3758 |
| | | yt | 0.0999 | oë | 0.3758 |
| | | yc | 0.0999 | gg | 0.3724 |

4.3. Data from the Corpora of Catullus, Tibullus, Propertius, and Vergil:

• 5-character n-grams: ‘ sub ’ (under), eucri and eucru (part of Teucri), fatus (spoke), fatur (speaks), auras (breezes), eneas (Aeneas)
• 6-character n-grams: teucru (Trojan), aeneas (Aeneas), ‘fatur ’ (speaks), ‘fatus ’ (spoke), ‘fatis ’ (fates), ipoten and mnipot (from omnipotens, presumably attributed to Jupiter), clamor (shout)
• 1-word n-grams: urbem (city), aeneas (Aeneas), teucrum (Trojan), ingentem (huge), omnipotens (almighty), aether (ether/sky), pius (pious), iamque (and now), socius (ally), clamore (shout), finis (end), fatis (fates), ignem (fire), auris (from auris, ear or aurum, gold), caelum (sky), genitor (father), hostis (enemy), terram (land), bellum (war), dux (leader), uisus (vision).
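The feature lists above derive from decision trees trained on the K-means cluster labels. A minimal sketch of that step follows, assuming scikit-learn's `DecisionTreeClassifier` with the entropy criterion; the helper name `rank_by_importance`, the toy count matrix, and the feature names are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def rank_by_importance(matrix, cluster_labels, feature_names, top_n=20, seed=0):
    """Train an entropy-based tree on cluster labels and rank the features it used."""
    tree = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    tree.fit(matrix, cluster_labels)
    importances = tree.feature_importances_
    order = np.argsort(importances)[::-1][:top_n]
    # Keep only features that actually contributed to a split.
    return [(feature_names[i], float(importances[i])) for i in order
            if importances[i] > 0]


# Toy counts (made up): rows are documents, columns are three word features.
X = np.array([[3, 1, 1],
              [4, 0, 1],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
labels = [0, 0, 1, 1]  # stand-in for K-means cluster labels
print(rank_by_importance(X, labels, ["aeneas", "mea", "et"]))
```

In this toy case only `aeneas` separates the two labels, so the tree needs a single split and that feature receives all of the importance. This may be why the Importance columns in the tables list a single feature with value 1 when the clusters are cleanly separable, while IG and IGR, computed per feature, rank many more candidates.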
A similar phenomenon occurs with the terms obtained from the grouping of the Augustan love poets with Lucan, where words referring to the political causes of the Civil War predominate, emphasizing the crimes committed by the different factions and the physical consequences on the bodies of the Roman soldiers and citizens [38].

Table 5: Most important features (character 2-grams) by Importance, Information Gain (IG), and Information Gain Ratio (IGR), using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).

| Feature | Importance | Feature | IG | Feature | IGR |
| --- | --- | --- | --- | --- | --- |
| g | 1 | ye | 0.1464 | ye | 0.5942 |
| | | gm | 0.1234 | dh | 0.4673 |
| | | dh | 0.1151 | mt | 0.4094 |
| | | ya | 0.1048 | bf | 0.3991 |
| | | by | 0.1025 | gm | 0.3975 |
| | | xq | 0.097 | sf | 0.3797 |
| | | ze | 0.0938 | dt | 0.3708 |
| | | dt | 0.0914 | fc | 0.3514 |
| | | oh | 0.0909 | fs | 0.3514 |
| | | rh | 0.0894 | pc | 0.3514 |
| | | oa | 0.0879 | nb | 0.3514 |
| | | sn | 0.087 | dg | 0.3514 |
| | | dq | 0.0864 | cp | 0.3514 |
| | | yp | 0.0864 | sn | 0.3306 |
| | | df | 0.0858 | bm | 0.3287 |
| | | gy | 0.0858 | y | 0.323 |
| | | ee | 0.0858 | mf | 0.323 |
| | | yc | 0.085 | xq | 0.3126 |
| | | sy | 0.085 | ms | 0.2979 |
| | | lm | 0.0836 | ze | 0.2883 |

4.4. Data from the Corpora of Catullus, Tibullus, Propertius, and Lucan:

• 5-character n-grams: aesar (from Caesar), aussa (from caussa, cause), scera (part of viscera, entrails), pulos (from populos, peoples), elero (part of scelero, referring to crimes), coeli (of ‘ coel’, from heaven), libye (Libya), gladi (part of gladium, sword), adhu and adhuc (from adhuc, until now)
• 6-character n-grams: ‘osque ’ (composed most likely of the accusative plural ending of the second declension combined with the enclitic -que), pulos and opulos (from populos, peoples), libye (Libya), scera and iscera (from viscera, entrails), bellor and elloru (as part of bellorum, of the wars), elerum, ‘lerum’ (from scelerum, of crimes), caussa and ‘ causs’ (cause), ‘ gladi’ (as part of gladium in its different forms, the sword), ‘ coeli’ (sky).
• 1-word n-grams: pectora (chests), populos (peoples), scelerum (crimes), bellorum (wars), senatus (senate), ciuilia (civil), nocentes (from nocens, guilty or harmful), caussa (cause), fatis (fates), coelo (sky), mundus (world), caeli (sky), diui (gods), libye (Libya).
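For readers less familiar with the IG and IGR columns, the two quantities can be illustrated with a small worked computation. This is a sketch under stated assumptions, not the authors' implementation: the source does not say how counts were discretised or which logarithm base was used, so the single-threshold split and the natural logarithm below are assumptions (the recurring value 0.6931 in the tables is close to ln 2, which may hint at natural logarithms). The function names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log(n / total) for n in Counter(labels).values())


def gain_and_ratio(values, labels, threshold):
    """Information Gain and Gain Ratio of splitting one feature at a threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:  # the split separates nothing
        return 0.0, 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain, gain / split_info


# Toy example (made up): counts of one word in four documents and their clusters.
counts = [0, 0, 5, 7]
clusters = ["love", "love", "epic", "epic"]
print(gain_and_ratio(counts, clusters, threshold=2))
```

With a perfectly separating, balanced split such as this one, the gain equals the full label entropy and the ratio reaches its maximum; a less clean split lowers both values.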
In previous work, we found that the Silhouette method consistently recommended two clusters for character 2-grams and three clusters for 1-word n-grams, regardless of the technique used. Scatter plots aligned with the stylistic distribution reported by Forstall et al. [15], who used SVM to analyze Catullus’ influence on Paul the Deacon’s poetry (Figure 5).

However, in this instance, whether due to the new preprocessing corrections or the novel comparison methods employed in separating Vergil and Lucan, the clustering tasks were not always accurate. While some correct distributions of the authors can be detected in the scatter plot space, the algorithm, which has no human interpretation to guide it, produces an unclear cluster classification, grouping Tibullus, Propertius, and Vergil against Catullus (Figure 6).

Figure 2: Scatter plot of clustering by K-Means using a relative frequency matrix of character 4-grams, indicating clusters (left) and authors (right) with different colors.

5. Conclusions and Learned Lessons

This article highlights the need to reevaluate methodologies and resources. Positive clustering results were obtained, especially with raw frequencies and n-grams, with Silhouette Scores above 0.8. Preprocessing steps and CLTK modules for tokenization and normalization significantly impacted the results, emphasizing the importance of tailored tools for ancient texts.

Relative frequency and the use of two datasets were intended to reduce noise and bias from the epic authors and to balance the text sets. However, even with more balanced data, raw frequencies provided better clustering results than relative frequency and TF-IDF matrices. As in the previous study, uneven feature importance and variable performance across n-gram levels and matrices suggest further refinement is needed for consistent results. N-grams from Vergil’s and Lucan’s works show a dominance of political, historical, and war-related terms.
This suggests that, despite efforts to balance the datasets, the lexical characteristics of epic poetry still influence classification. The identified n-grams reflect the epic and mythical focus of these authors, contrasting with the love and personal themes of Catullus, Tibullus, and Propertius.

The variability in document length (regular in Pharsalia and Aeneid, but variable in the Augustan love poets) affects results. Additionally, the internal variability among the Augustan love poets’ corpora also affects classification. We could experiment by partitioning Catullus’ work into polymetric poems, carmina maiora, and epigrams or elegiac couplets, and run separate analyses, or intervene in the corpus catullianum by removing non-amorous themed poems. However, this is complex, as thematic boundaries are not clear-cut.

Figure 3: Scatter plot of clustering by K-Means using a raw frequency matrix of 1-word n-grams, indicating clusters (left) and authors (right) with different colors.

This exploratory analysis requires further refinement through other techniques, such as variable ranges of character and word n-grams (only fixed ranges were used in this study), other similarity measures such as Jaccard, Cosine, or Soft Cosine, or clustering methods like Gaussian Mixture Models, DBSCAN, or hierarchical clustering. Future research could apply normalization techniques such as L1 normalization or z-score scaling, and examine phenomena like collocations and co-occurrences, none of which were applied in this study. A close reading of clusters based on relative frequency also offers promise.

As for the representation of the documents, there is a need to explore techniques with embeddings like those developed by Burns et al., Bamman et al., and Johnson et al. [18, 8, 3, 6].
It should also be noted that the terms obtained by the Decision Tree technique are words with classification power for that dataset, not necessarily the most typical of one type of poetry or another, as there may be important words for both genres penalized by the metrics of Importance, Information Gain, or TF-IDF. The unequal size of the poems also contributed to the clarity of classification in raw counts, indirectly introducing poem length as a classification criterion. Similarly, in decision trees, the feature split points reflected the same pattern, with epic poem features having much higher frequencies, clearly impacting the results.

Table 6: Most important features at the level of 1-word n-grams ranked by Importance, Information Gain (IG), and Information Gain Ratio (IGR), using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Vergilius). Raw Frequency, Word (1, 1) n-grams.

| Feature | Importance | Feature | IG | Feature | IGR |
| --- | --- | --- | --- | --- | --- |
| urbem | 1 | aeneas | 0.1813 | teucrum | 0.6931 |
| | | ingentem | 0.1813 | ingens | 0.6931 |
| | | ingens | 0.1813 | ingentem | 0.6931 |
| | | teucrum | 0.1813 | aeneas | 0.6931 |
| | | fatis | 0.1683 | pius | 0.6413 |
| | | omnipotens | 0.1683 | aethere | 0.6413 |
| | | late | 0.1683 | ast | 0.6413 |
| | | ignem | 0.1683 | fatur | 0.6413 |
| | | auras | 0.1683 | diuom | 0.6413 |
| | | iamque | 0.1601 | socios | 0.6413 |
| | | ea | 0.1601 | clamore | 0.6413 |
| | | genitor | 0.1601 | teucros | 0.6413 |
| | | terram | 0.1601 | visu | 0.6413 |
| | | talibus | 0.1601 | ignem | 0.6061 |
| | | equidem | 0.1601 | omnipotens | 0.6061 |
| | | diuom | 0.1571 | fatis | 0.6061 |
| | | pius | 0.1571 | late | 0.6061 |
| | | visu | 0.1571 | auras | 0.6061 |
| | | teucros | 0.1571 | teucri | 0.6056 |
| | | ast | 0.1571 | regem | 0.6056 |
| | | aeneas | 0.1571 | ast | 0.6056 |

Finally, it is important to briefly consider the implications of applying computational and Distant Reading techniques alongside hypotheses or educated guesses from Close Reading. Frequency counting, for instance, is used here to model documents, but humans do not speak to be counted. Otherwise, Catullus would simply have repeated the name Lesbia, and his love would have been understood without the effort of creating poetry.
Fortunately, language is far more abstract and complex, and computational methods are only beginning to reveal its intricacies. This issue has resurfaced with criticisms, such as those by Noam Chomsky, against generative models [11]. It is true that the human mind performs language tasks in a highly elegant manner and acquires a language from exposure to far less data than Large Language Models (LLMs) handle. LLMs are tools developed for other tasks that did not originally seek to emulate the human mind [13, 37]. But it is also true that one must yield to the evidence of the successful results obtained with these techniques and their undeniable capacity to facilitate all kinds of tasks. It is essential to acknowledge both the limits and strengths of computational tools, recognizing that Distant Reading offers a different scale of analysis, rooted not just in methodology but in changes in how information is produced, accessed, and analyzed in the digital age [33]. Despite criticisms [5], Digital Humanities methodologies hold great promise for studying language-rich subjects that balance aesthetic and rhythmic elements like refrains, alliterations, and anaphoras, presenting a unique challenge for modern analytical techniques.

Table 7: Most important features at the level of 1-word n-grams according to Importance, Information Gain (IG), and Information Gain Ratio (IGR), using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus). Raw Frequency, word (1, 1) n-grams.

| Feature | Importance | Feature | IG | Feature | IGR |
| --- | --- | --- | --- | --- | --- |
| pectora | 1 | populos | 0.1589 | populis | 0.6931 |
| | | scelerum | 0.1589 | scelerum | 0.6931 |
| | | populis | 0.1589 | bellorum | 0.6931 |
| | | mundo | 0.1589 | uiscera | 0.6931 |
| | | bellorum | 0.1589 | exit | 0.6931 |
| | | exit | 0.1589 | populos | 0.6931 |
| | | uiscera | 0.1589 | mundo | 0.6931 |
| | | senatus | 0.1464 | caussa | 0.636 |
| | | ciuilia | 0.1464 | nocentes | 0.636 |
| | | nefas | 0.1464 | ciuilibus | 0.636 |
| | | fatis | 0.1464 | coelo | 0.636 |
| | | bellum | 0.1388 | libye | 0.636 |
| | | milite | 0.1388 | robore | 0.636 |
| | | ducis | 0.1388 | superi | 0.636 |
| | | ciuilibus | 0.1345 | adhuc | 0.636 |
| | | fauces | 0.1345 | ciuile | 0.636 |
| | | robore | 0.1345 | fauces | 0.636 |
| | | adhuc | 0.1345 | coeli | 0.636 |
| | | caussa | 0.1345 | malorum | 0.636 |
| | | libye | 0.1345 | potuere | 0.5968 |

Table 8: Most important features at the level of 1-word n-grams according to Importance, Information Gain (IG), and Information Gain Ratio (IGR), using the relative frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Vergilius). Relative Frequency, word (1, 1) n-grams.

| Feature | Importance | Feature | IG | Feature | IGR |
| --- | --- | --- | --- | --- | --- |
| legit | 1 | moechatur | 0.0244 | moechatur | 0.6931 |
| | | olera | 0.0244 | olera | 0.6931 |
| | | olla | 0.0244 | olla | 0.6931 |
| | | mentula | 0.0151 | mentula | 0.114 |
| | | dicunt | 0.0132 | dicunt | 0.0689 |
| | | legit | 0.0124 | legit | 0.0542 |
| | | certe | 0.009 | certe | 0.0209 |
| | | ipsa | 0.0061 | ipsa | 0.0087 |
| | | mihi | 0.003 | mihi | 0.0031 |
| | | tibi | 0.003 | tibi | 0.003 |
| | | tu | 0.0024 | tu | 0.0024 |
| | | nunc | 0.0021 | nunc | 0.0022 |
| | | quid | 0.0019 | quid | 0.002 |
| | | ne | 0.0019 | ne | 0.002 |
| | | ego | 0.0018 | ego | 0.0019 |
| | | mea | 0.0018 | mea | 0.0019 |
| | | esse | 0.0018 | esse | 0.0019 |
| | | iam | 0.0018 | iam | 0.0018 |
| | | illa | 0.0016 | illa | 0.0017 |
| | | nam | 0.0015 | nam | 0.0017 |

Figure 4: Scatter plot of clustering by K-Means using a relative frequency matrix of 1-word n-grams, indicating clusters (left) and authors (right) with different colors.

Figure 5: Graph obtained by Forstall et al. using the One-class SVM method. Cited by Coffee et al. [12].

Figure 6: Scatter plot of clustering by K-Means using a TF-IDF matrix of 1-word n-grams, indicating clusters (left) and authors (right) with different colors.

Acknowledgments

I sincerely thank Dr. Kyle P. Johnson, Director of AI at Morgan, Lewis and Bockius LLP, and Dr. Patrick J. Burns from the Institute for the Study of the Ancient World, NYU, for their kind and insightful responses to my inquiries on tokenization and the use of CLTK and LatinCy libraries. I also extend my gratitude to Professor Benjamin Nagy from the Institute of the Polish Language, Polish Academy of Sciences (IJP PAN), Krakow, for his expert advice on correcting verse counts based on authorized editions, which greatly enhanced the accuracy of this text analysis.

References

[1] [No author]. Perseus Digital Library Homepage. [No date]. url: https://www.perseus.tufts.edu/hopper/.
[2] [No author]. Stopwords ISO. [No date]. url: https://github.com/stopwords-iso/stopwords-iso/blob/master/README.md.
[3] D. Bamman and P. J. Burns. Latin BERT: A Contextual Language Model for Classical Philology. 2020. doi: 10.48550/arXiv.2009.10053. url: http://arxiv.org/abs/2009.10053.
[4] A. Berra. Ancient Greek and Latin Stopwords. 2024. url: https://github.com/aurelberra/stopwords.
[5] T. Brennan. The Digital-Humanities Bust. 2017. url: https://www.chronicle.com/article/the-digital-humanities-bust/.
[6] P. J. Burns. “Building a Text Analysis Pipeline for Classical Languages”. In: Building a Text Analysis Pipeline for Classical Languages. De Gruyter Saur, 2019, pp. 159–176. doi: 10.1515/9783110599572-010. url: https://www.degruyter.com/document/doi/10.1515/9783110599572-010/html.
[7] P. J. Burns. “Constructing Stoplists for Historical Languages”. In: Digital Classics Online (2018), pp. 4–20. doi: 10.11588/dco.2018.2.52124. url: https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124.
[8] P. J. Burns. LatinCy: Synthetic Trained Pipelines for Latin NLP. 2023. doi: 10.48550/arXiv.2305.04365. url: http://arxiv.org/abs/2305.04365.
[9] G. Cantaluppi and M. Passarotti. “Clustering the Corpus of Seneca: A Lexical-Based Approach”. In: Advances in Latent Variables: Methods, Models and Applications. Ed. by M. Carpita, E. Brentari, and E. M. Qannari. Cham: Springer International Publishing, 2015, pp. 13–25. doi: 10.1007/10104_2014_6. url: https://doi.org/10.1007/10104_2014_6.
[10] L. M. Cerrato and R. F. Chavez. Perseus Classics Collection: An Overview. [No date]. url: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0053.
[11] N. Chomsky, I. Roberts, and J. Watumull. “Noam Chomsky: The False Promise of ChatGPT”. In: The New York Times (2023). url: https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html.
[12] N. Coffee, J. Gawley, C. Forstall, W. Scheirer, D. Scheirer, J. Corso, and B. Parks. “Modelling the Interpretation of Literary Allusion with Machine Learning Techniques”. In: Journal of Digital Humanities (2013), pp. 478–479. url: https://journalofdigitalhumanities.org/3-1/modelling-the-interpretation-of-literary-allusion-with-machine-learning-techniques/.
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. doi: 10.48550/arXiv.1810.04805. url: http://arxiv.org/abs/1810.04805.
[14] C. W. Forstall, S. L. Jacobson, and W. J. Scheirer. “Evidence of intertextuality: investigating Paul the Deacon’s Angustae Vitae”. In: Literary and Linguistic Computing 26.3 (2011), pp. 285–296. doi: 10.1093/llc/fqr029. url: https://doi.org/10.1093/llc/fqr029.
[15] C. W. Forstall and W. Scheirer. “A Statistical Stylistic Study of Latin Elegiac Couplets”. In: 2010 Chicago Colloquium on Digital Humanities and Computer Science. 2010, [No pages]. url: https://www.semanticscholar.org/paper/A-Statistical-Stylistic-Study-of-Latin-Elegiac-Forstall-Scheirer/e3caac9ec4ee16baac70ed94808dca57dff48a2d.
[16] G. Bracco, S. Migliori, G. Mencuccini, D. Alderuccio, and G. Ponti. “Data mining tools and GRID infrastructure for Assyriology text analysis (an Old-Babylonian situation studied through text analysis and data mining tools)”. In: RAI - Rencontre Assyriologique Internationale - Private and State in the Ancient Near East. Belgium, 2013, [No pages].
[17] J. B. Greenough. The Bucolics, Æneid, and Georgics of Virgil. Boston: Ginn, 1900.
[18] K. P. Johnson, P. J. Burns, J. Stewart, T. Cook, C. Besnier, and W. J. B. Mattingly. “The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations. Ed. by H. Ji, J. C. Park, and R. Xia. Online: Association for Computational Linguistics, 2021, pp. 20–29. doi: 10.18653/v1/2021.acl-demo.3. url: https://aclanthology.org/2021.acl-demo.3.
[19] C. S. Lewis. La alegoría del amor: un estudio sobre tradición medieval. 2015th ed. Madrid: Encuentro, 1936.
[20] S. Lloyd. “Least squares quantization in PCM”. In: IEEE Transactions on Information Theory 28.2 (1982), pp. 129–137. doi: 10.1109/tit.1982.1056489. url: https://ieeexplore.ieee.org/document/1056489.
[21] A. Martins, C. Grácio, C. Teixeira, I. Pimenta Rodrigues, J. L. G. Zapata, and L. Ferreira. “Historia Augusta authorship: an approach based on Measurements of Complex Networks”. In: Applied Network Science 6.1 (2021), pp. 1–23. doi: 10.1007/s41109-021-00390-7. url: https://appliednetsci.springeropen.com/articles/10.1007/s41109-021-00390-7.
[22] E. T. Merrill. Catullus; edited by Elmer Truesdell Merrill. Boston: Ginn, 1893. url: http://archive.org/details/catulluseditedby00catuuoft.
[23] F. Moretti. Distant Reading. Verso, 2013.
[24] L. Müller. Sex. Propertii Elegiae. Leipzig: Teubner, 1898.
[25] B. Nagy. “Rhyme in classical Latin poetry: Stylistic or stochastic?” In: Digital Scholarship in the Humanities 37.4 (2022), pp. 1097–1118. doi: 10.1093/llc/fqab105. url: https://doi.org/10.1093/llc/fqab105.
[26] B. Nagy. “Some stylometric remarks on Ovid’s Heroides and the Epistula Sapphus”. In: Digital Scholarship in the Humanities 38.3 (2023), pp. 1183–1199. doi: 10.1093/llc/fqac098. url: https://doi.org/10.1093/llc/fqac098.
[27] C. J. Nusch. “Las Edades del Amor: una propuesta para el proyecto Aetates Amoris destinado a la poesía amorosa”. Tesis. Universidad Nacional de Educación a Distancia, España, 2021. doi: 10.35537/10915/125629. url: http://sedici.unlp.edu.ar/handle/10915/125629.
[28] C. J. Nusch. “Una breve exploración de la terminología amorosa en los corpora catullianum, tibullianum y propertianum con métodos y herramientas computacionales: etiquetado gramatical, lemas, bigramas y co-apariciones”. In: Revista de Humanidades Digitales 9 (2024), pp. 1–40. doi: 10.5944/rhd.vol.9.2024.38680. url: https://revistas.uned.es/index.php/RHD/article/view/38680.
[29] C. J. Nusch, G. del Rio Riande, L. C. C. Cagnina, M. L. Errecalde, and L. Antonelli. “Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets”. In: Decisioning. Pereira, Colombia, 2024.
[30] J. P. Postgate. Tibulli aliorumque carminum libri tres. Oxford: Scriptorum classicorum bibliotheca Oxoniensis, 1915.
[31] J. R. Quinlan. “Induction of decision trees”. In: Machine Learning 1.1 (1986), pp. 81–106. doi: 10.1007/bf00116251. url: https://doi.org/10.1007/BF00116251.
[32] S. Ramsay. Reading Machines: Toward an Algorithmic Criticism. Urbana, 2011.
[33] R. Pimenta. “De Narciso ao mundo-imagem: por uma urgência de uma perspectiva crítica sobre a cena informacional contemporânea”. In: Ciência da Informação: sociedade, crítica e inovação. Rio de Janeiro, 2022, [No pages].
[34] P. J. Rousseeuw. “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis”. In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65. doi: 10.1016/0377-0427(87)90125-7. url: https://www.sciencedirect.com/science/article/pii/0377042787901257.
[35] C. E. Shannon. “A mathematical theory of communication”. In: The Bell System Technical Journal 27.3 (1948), pp. 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. url: https://ieeexplore.ieee.org/document/6773024.
[36] K. Spärck Jones. “A statistical interpretation of term specificity and its application in retrieval”. In: Journal of Documentation 28.1 (1972), pp. 11–21. doi: 10.1108/eb026526. url: https://doi.org/10.1108/eb026526.
[37] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. “Attention is All you Need”. In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017, [No pages]. url: https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[38] M. M. Vizzotti. “De la tragedia de Séneca a la épica de Lucano: estrategias de representación de los paradigmas filosóficos y literarios”. Tesis. Universidad Nacional de La Plata, 2014. doi: 10.35537/10915/34410. url: http://sedici.unlp.edu.ar/handle/10915/34410.
[39] C. H. Weise. Pharsaliae Libri X. M. Annaeus Lucanus. Leipzig: G. Bassus, 1935.

Appendix: Extra Data from the Corpora of Catullus, Tibullus, Propertius, Vergilius and Lucanus

Table 9: Most important features at the level of character 2-grams according to Importance, Information Gain (IG), and Information Gain Ratio (IGR), using the raw frequency matrix method and Stopwords filtering (Corpora of Catullus, Tibullus, Propertius, and Vergilius).
Raw Frequency, char (2, 2) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| un | 1 | nl | 0.1254 | dg | 0.5747 |
| | | uq | 0.1226 | aë | 0.5463 |
| | | dg | 0.1208 | oï | 0.5193 |
| | | gg | 0.1205 | ën | 0.4929 |
| | | ms | 0.1205 | ïa | 0.4929 |
| | | gm | 0.1205 | ez | 0.4664 |
| | | dh | 0.1181 | aï | 0.4664 |
| | | bt | 0.1174 | dh | 0.4516 |
| | | yd | 0.1129 | ï | 0.439 |
| | | dl | 0.1128 | eï | 0.439 |
| | | nh | 0.1128 | ër | 0.439 |
| | | ze | 0.1128 | mf | 0.4105 |
| | | bn | 0.1072 | x | 0.376 |
| | | my | 0.1068 | oö | 0.3758 |
| | | ln | 0.105 | ë | 0.3758 |
| | | rh | 0.105 | ïc | 0.3758 |
| | | aë | 0.1049 | ön | 0.3758 |
| | | df | 0.1036 | ïu | 0.3758 |
| | | yt | 0.0999 | oë | 0.3758 |
| | | yc | 0.0999 | gg | 0.3724 |

Table 10: Most important features at the level of character 3 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).

Raw Frequency, char (3, 3) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| aba | 1 | ipo | 0.1601 | ffa | 0.6056 |
| | | om | 0.1601 | dfa | 0.6056 |
| | | ffa | 0.1571 | adg | 0.5747 |
| | | dsu | 0.1536 | dhu | 0.5747 |
| | | oln | 0.1536 | dgn | 0.5747 |
| | | gii | 0.1481 | xsc | 0.5747 |
| | | ols | 0.1444 | ciq | 0.5747 |
| | | teb | 0.1433 | ybr | 0.5747 |
| | | tto | 0.1433 | axu | 0.5747 |
| | | uom | 0.139 | hyb | 0.5747 |
| | | toq | 0.139 | ols | 0.5521 |
| | | xce | 0.139 | yde | 0.5463 |
| | | dfa | 0.138 | aë | 0.5463 |
| | | rex | 0.1365 | bp | 0.5463 |
| | | bii | 0.1352 | amd | 0.5463 |
| | | teu | 0.1352 | moq | 0.5463 |
| | | nip | 0.1352 | mdu | 0.5463 |
| | | roq | 0.1316 | ipo | 0.5458 |
| | | euc | 0.1316 | om | 0.5458 |
| | | bum | 0.1316 | giq | 0.5193 |

Table 11: Most important features at the level of character 4 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).
Raw Frequency, char (4, 4) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| pri | 1 | temq | 0.1813 | iuom | 0.6931 |
| | | iuom | 0.1813 | ucru | 0.6931 |
| | | ucru | 0.1813 | temq | 0.6931 |
| | | boru | 0.1813 | ubib | 0.6931 |
| | | ubib | 0.1813 | boru | 0.6931 |
| | | sumq | 0.1683 | nimb | 0.6413 |
| | | mman | 0.1683 | m ef | 0.6413 |
| | | lumq | 0.1683 | effa | 0.6413 |
| | | mnip | 0.1683 | ffat | 0.6413 |
| | | ipot | 0.1683 | ast | 0.6413 |
| | | nipo | 0.1683 | cios | 0.6413 |
| | | bibu | 0.1683 | lumq | 0.6061 |
| | | ea | 0.1601 | ipot | 0.6061 |
| | | iisq | 0.1601 | mman | 0.6061 |
| | | cri | 0.1601 | mnip | 0.6061 |
| | | dani | 0.1601 | nipo | 0.6061 |
| | | teuc | 0.1601 | bibu | 0.6061 |
| | | scun | 0.1601 | sumq | 0.6061 |
| | | ttol | 0.1601 | adfa | 0.6056 |
| | | eucr | 0.1601 | anid | 0.6056 |

Table 12: Most important features at the level of character 5 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).

Raw Frequency, char (5, 5) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| sub | 1 | ntemq | 0.1813 | eucru | 0.6931 |
| | | eucri | 0.1813 | borum | 0.6931 |
| | | eucru | 0.1813 | ntemq | 0.6931 |
| | | fatus | 0.1813 | fatur | 0.6931 |
| | | imman | 0.1813 | iuom | 0.6931 |
| | | fatur | 0.1813 | imman | 0.6931 |
| | | borum | 0.1813 | ucrum | 0.6931 |
| | | iuom | 0.1813 | eucri | 0.6931 |
| | | temqu | 0.1813 | e teu | 0.6931 |
| | | e teu | 0.1813 | temqu | 0.6931 |
| | | ucrum | 0.1813 | fatus | 0.6931 |
| | | mnipo | 0.1683 | cios | 0.6413 |
| | | auras | 0.1683 | clamo | 0.6413 |
| | | sumqu | 0.1683 | anch | 0.6413 |
| | | fatis | 0.1683 | effa | 0.6413 |
| | | nipot | 0.1683 | lamor | 0.6413 |
| | | nt ac | 0.1683 | ocios | 0.6413 |
| | | omnip | 0.1683 | effat | 0.6413 |
| | | eneas | 0.1683 | undam | 0.6413 |
| | | ipote | 0.1683 | m eff | 0.6413 |

Table 13: Most important features at the level of character 6 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).
Raw Frequency, char (6, 6) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| tisque | 1 | teucru | 0.1813 | ngente | 0.6931 |
| | | aeneas | 0.1813 | teucri | 0.6931 |
| | | fatur | 0.1813 | aeneas | 0.6931 |
| | | ntemqu | 0.1813 | ucrum | 0.6931 |
| | | e teuc | 0.1813 | e teuc | 0.6931 |
| | | eucrum | 0.1813 | teucru | 0.6931 |
| | | ngente | 0.1813 | borum | 0.6931 |
| | | imman | 0.1813 | imman | 0.6931 |
| | | ucrum | 0.1813 | fatus | 0.6931 |
| | | borum | 0.1813 | ntemqu | 0.6931 |
| | | fatus | 0.1813 | eucrum | 0.6931 |
| | | teucri | 0.1813 | fatur | 0.6931 |
| | | temque | 0.1813 | temque | 0.6931 |
| | | fatis | 0.1683 | m regi | 0.6413 |
| | | auras | 0.1683 | a fatu | 0.6413 |
| | | eneas | 0.1683 | pius a | 0.6413 |
| | | auras | 0.1683 | a teuc | 0.6413 |
| | | ipoten | 0.1683 | uisu | 0.6413 |
| | | omnip | 0.1683 | clamor | 0.6413 |
| | | mnipot | 0.1683 | e ora | 0.6413 |

Table 14: Most important features at the level of word 1 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Vergilius).

Raw Frequency, word (1, 1) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| urbem | 1 | aeneas | 0.1813 | teucrum | 0.6931 |
| | | ingentem | 0.1813 | ingens | 0.6931 |
| | | ingens | 0.1813 | ingentem | 0.6931 |
| | | teucrum | 0.1813 | aeneas | 0.6931 |
| | | fatis | 0.1683 | pius | 0.6413 |
| | | omnipotens | 0.1683 | aethere | 0.6413 |
| | | late | 0.1683 | ast | 0.6413 |
| | | ignem | 0.1683 | fatur | 0.6413 |
| | | auras | 0.1683 | diuom | 0.6413 |
| | | iamque | 0.1601 | socios | 0.6413 |
| | | ea | 0.1601 | clamore | 0.6413 |
| | | genitor | 0.1601 | teucros | 0.6413 |
| | | terram | 0.1601 | uisu | 0.6413 |
| | | talibus | 0.1601 | ignem | 0.6061 |
| | | equidem | 0.1601 | omnipotens | 0.6061 |
| | | diuom | 0.1571 | fatis | 0.6061 |
| | | pius | 0.1571 | late | 0.6061 |
| | | uisu | 0.1571 | auras | 0.6061 |
| | | teucros | 0.1571 | teucri | 0.6056 |
| | | ast | 0.1571 | regem | 0.6056 |

Table 15: Most important features at the level of char 2 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).
Raw Frequency, char (2, 2) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| g | 1 | ye | 0.1464 | ye | 0.5942 |
| | | gm | 0.1234 | dh | 0.4673 |
| | | dh | 0.1151 | mt | 0.4094 |
| | | ya | 0.1048 | bf | 0.3991 |
| | | by | 0.1025 | gm | 0.3975 |
| | | xq | 0.097 | sf | 0.3797 |
| | | ze | 0.0938 | dt | 0.3708 |
| | | dt | 0.0914 | fc | 0.3514 |
| | | oh | 0.0909 | fs | 0.3514 |
| | | rh | 0.0894 | pc | 0.3514 |
| | | oa | 0.0879 | nb | 0.3514 |
| | | sn | 0.087 | dg | 0.3514 |
| | | dq | 0.0864 | cp | 0.3514 |
| | | yp | 0.0864 | sn | 0.3306 |
| | | df | 0.0858 | bm | 0.3287 |
| | | gy | 0.0858 | y | 0.323 |
| | | ee | 0.0858 | mf | 0.323 |
| | | yc | 0.085 | xq | 0.3126 |
| | | sy | 0.085 | ms | 0.2979 |
| | | lm | 0.0836 | ze | 0.2883 |

Table 16: Most important features at the level of char 3 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).

Raw Frequency, char (3, 3) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| te | 1 | oer | 0.1589 | bye | 0.6931 |
| | | bye | 0.1589 | oer | 0.6931 |
| | | rct | 0.1464 | dhu | 0.636 |
| | | obo | 0.1388 | ye | 0.636 |
| | | ax | 0.1388 | rct | 0.5942 |
| | | ye | 0.1345 | bp | 0.5627 |
| | | dhu | 0.1345 | giq | 0.5627 |
| | | nfu | 0.1328 | gme | 0.5341 |
| | | agm | 0.1277 | emt | 0.5309 |
| | | teb | 0.1234 | bmo | 0.5309 |
| | | rut | 0.1234 | eer | 0.5309 |
| | | gme | 0.1224 | xir | 0.5309 |
| | | toq | 0.1195 | ax | 0.5275 |
| | | xce | 0.1195 | obo | 0.5275 |
| | | xn | 0.1195 | nny | 0.5 |
| | | lsu | 0.1195 | al | 0.5 |
| | | pei | 0.116 | mto | 0.5 |
| | | axe | 0.116 | bif | 0.5 |
| | | aux | 0.116 | efo | 0.5 |
| | | saq | 0.116 | gad | 0.5 |

Table 17: Most important features at the level of char 4 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).
Raw Frequency, char (4, 4) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| ra n | 1 | ibye | 0.1589 | auss | 0.6931 |
| | | auss | 0.1589 | coel | 0.6931 |
| | | coel | 0.1589 | ibye | 0.6931 |
| | | glad | 0.1589 | glad | 0.6931 |
| | | sena | 0.1464 | oelo | 0.636 |
| | | s rh | 0.1464 | bye | 0.636 |
| | | iuil | 0.1464 | dhuc | 0.636 |
| | | leru | 0.1464 | rtib | 0.636 |
| | | efas | 0.1464 | adhu | 0.636 |
| | | moto | 0.1464 | auce | 0.636 |
| | | susq | 0.1464 | cesp | 0.5968 |
| | | fauc | 0.1464 | moes | 0.5968 |
| | | tebr | 0.1464 | xcus | 0.5968 |
| | | rcto | 0.1464 | suor | 0.5968 |
| | | lumq | 0.1464 | tors | 0.5968 |
| | | arct | 0.1464 | otue | 0.5968 |
| | | mpul | 0.1388 | adau | 0.5968 |
| | | robo | 0.1388 | gulo | 0.5968 |
| | | uile | 0.1388 | nfan | 0.5968 |
| | | obor | 0.1388 | mpag | 0.5968 |

Table 18: Most important features at the level of char 5 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).

Raw Frequency, char (5, 5) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| aesar | 1 | aussa | 0.1589 | causs | 0.6931 |
| | | lerum | 0.1589 | scera | 0.6931 |
| | | causs | 0.1589 | eleru | 0.6931 |
| | | scera | 0.1589 | pulos | 0.6931 |
| | | coeli | 0.1589 | coeli | 0.6931 |
| | | libye | 0.1589 | coel | 0.6931 |
| | | pulos | 0.1589 | lerum | 0.6931 |
| | | gladi | 0.1589 | libye | 0.6931 |
| | | eleru | 0.1589 | aussa | 0.6931 |
| | | coel | 0.1589 | gladi | 0.6931 |
| | | glad | 0.1589 | glad | 0.6931 |
| | | ellor | 0.1589 | ellor | 0.6931 |
| | | arcto | 0.1464 | oeli | 0.636 |
| | | peri | 0.1464 | oelo | 0.636 |
| | | fatis | 0.1464 | i dam | 0.636 |
| | | ciuil | 0.1464 | oties | 0.636 |
| | | tent | 0.1464 | ic fa | 0.636 |
| | | nefas | 0.1464 | obore | 0.636 |
| | | fauc | 0.1464 | adhuc | 0.636 |
| | | susqu | 0.1464 | iscri | 0.636 |

Table 19: Most important features at the level of char 6 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).
Raw Frequency, char (6, 6) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| osque | 1 | pulos | 0.1589 | elerum | 0.6931 |
| | | scera | 0.1589 | ssere | 0.6931 |
| | | lerum | 0.1589 | iscera | 0.6931 |
| | | opulos | 0.1589 | celeru | 0.6931 |
| | | bellor | 0.1589 | s phar | 0.6931 |
| | | causs | 0.1589 | gladi | 0.6931 |
| | | ssere | 0.1589 | m popu | 0.6931 |
| | | s phar | 0.1589 | pulos | 0.6931 |
| | | elerum | 0.1589 | lerum | 0.6931 |
| | | exit | 0.1589 | elloru | 0.6931 |
| | | lia be | 0.1589 | bellor | 0.6931 |
| | | caussa | 0.1589 | caussa | 0.6931 |
| | | iscera | 0.1589 | opulos | 0.6931 |
| | | libye | 0.1589 | libye | 0.6931 |
| | | gladi | 0.1589 | coeli | 0.6931 |
| | | m popu | 0.1589 | scera | 0.6931 |
| | | coeli | 0.1589 | causs | 0.6931 |
| | | celeru | 0.1589 | lia be | 0.6931 |
| | | elloru | 0.1589 | exit | 0.6931 |
| | | us for | 0.1464 | unctas | 0.636 |

Table 20: Most important features at the level of word 1 n-grams according to Importance, Information Gain, Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius, and Lucanus).

Raw Frequency, word (1, 1) n-grams

| Feature | Importance | Feature | IG | Feature | IGR |
|---|---|---|---|---|---|
| pectora | 1 | populos | 0.1589 | populis | 0.6931 |
| | | scelerum | 0.1589 | scelerum | 0.6931 |
| | | populis | 0.1589 | bellorum | 0.6931 |
| | | mundo | 0.1589 | uiscera | 0.6931 |
| | | bellorum | 0.1589 | exit | 0.6931 |
| | | exit | 0.1589 | populos | 0.6931 |
| | | uiscera | 0.1589 | mundo | 0.6931 |
| | | senatus | 0.1464 | caussa | 0.636 |
| | | ciuilia | 0.1464 | nocentes | 0.636 |
| | | nefas | 0.1464 | ciuilibus | 0.636 |
| | | fatis | 0.1464 | coelo | 0.636 |
| | | bellum | 0.1388 | libye | 0.636 |
| | | milite | 0.1388 | robore | 0.636 |
| | | ducis | 0.1388 | superi | 0.636 |
| | | ciuilibus | 0.1345 | adhuc | 0.636 |
| | | fauces | 0.1345 | ciuile | 0.636 |
| | | robore | 0.1345 | fauces | 0.636 |
| | | adhuc | 0.1345 | coeli | 0.636 |
| | | caussa | 0.1345 | malorum | 0.636 |
| | | libye | 0.1345 | potuere | 0.5968 |
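As an illustration of how rankings such as the IG and IGR columns of the appendix tables can be produced, the following is a minimal sketch and not the pipeline used in this study. It assumes that each feature is reduced to presence or absence per document, that cluster labels act as classes, and that entropy uses the natural logarithm; the function names and the toy counts are hypothetical. The Feature Importance column, which comes from a trained decision tree, is not covered here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural logarithm) of a sequence of discrete values."""
    total = len(values)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in Counter(values).values())


def information_gain(present, labels):
    """Entropy of the labels minus their weighted entropy after a present/absent split."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for flag, lab in zip(present, labels) if flag == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def gain_ratio(present, labels):
    """Information gain divided by the entropy of the split itself (0 if no split)."""
    split_info = entropy(present)
    if split_info == 0:
        return 0.0
    return information_gain(present, labels) / split_info


def rank_features(counts, labels, score):
    """Return (feature, score) pairs sorted by descending score."""
    scored = []
    for feature, column in counts.items():
        present = [c > 0 for c in column]  # assumption: binarize the raw counts
        scored.append((feature, round(score(present, labels), 4)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: four documents, two cluster labels, raw counts per feature.
labels = [0, 0, 1, 1]
counts = {
    "aeneas": [2, 1, 0, 0],
    "amor": [1, 0, 1, 1],
    "et": [3, 2, 4, 1],
}

ig_ranking = rank_features(counts, labels, information_gain)
igr_ranking = rank_features(counts, labels, gain_ratio)
print(ig_ranking)
print(igr_ranking)
```

In this toy setting a feature that occurs in every document, such as "et", yields no split and scores zero on both measures, while a feature confined to one cluster ranks first. Normalizing by the split entropy is what lets IGR order features differently from IG when the presence/absence split is very uneven.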