=Paper= {{Paper |id=Vol-3834/paper43 |storemode=property |title=Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction |pdfUrl=https://ceur-ws.org/Vol-3834/paper43.pdf |volume=Vol-3834 |authors=Carlos Javier Nusch,Gimena Del Río Riande,Leticia Cagnina,Marcelo Luis Errecalde,Leandro Antonelli |dblpUrl=https://dblp.org/rec/conf/chr/NuschRCEA24 }} ==Clustering Tasks and Decision Trees with Augustan Love Poets: Cohesion and Separation in Feature Importance Extraction== https://ceur-ws.org/Vol-3834/paper43.pdf
                                Clustering Tasks and Decision Trees with Augustan
                                Love Poets: Cohesion and Separation in Feature
                                Importance Extraction⋆
                                Carlos Javier Nusch1,2,∗,† , Gimena del Rio Riande3,† , Leticia Cecilia Cagnina4,† ,
                                Marcelo Luis Errecalde4,† and Leandro Antonelli5,6,†
                                1 PREBI, SEDICI, Universidad Nacional de La Plata, Argentina
                                2 CESGI, Comisión de Investigaciones Científicas de la Provincia de Buenos Aires, Argentina
                                3 IIBICRIT, Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina
                                4 LIDIC, Facultad de Ciencias Físico Matemáticas y Naturales, Universidad Nacional de San Luis, Argentina
                                5 LIFIA, Facultad de Informática, Universidad Nacional de La Plata, Argentina
                                6 CAETI, Facultad de Tecnología Informática, Universidad Abierta Interamericana, Argentina


                                            Abstract
                                            This article extends various automatic text analysis tasks from previous works by applying natural
                                            language processing techniques to a corpus of Latin texts from the 1st century BC and 1st century AD.
                                            The motivation behind this work is to delve into and understand a historical literary trend revolving
                                            around the themes of love, spanning from antiquity through to the medieval period. The analyzed
                                            authors include Gaius Valerius Catullus, Albius Tibullus, and Sextus Propertius, representing the literary
                                            movement of the neoterics, and Publius Vergilius Maro and Marcus Annaeus Lucanus, epic poets with
                                            distinct styles, serving as control samples. Unlike previous works, various corrections were added to the
                                            preprocessing tasks, including improved word tokenization with enclitics and handling of orthographic
                                            variances. For the clustering tasks, the K-Means method and the Silhouette Score were used to determine
                                            the optimal cluster sizes. Using these optimal clusters as labels, decision trees were trained for each
                                            range of n-grams, aiming to identify features with the highest Information Gain and Information Gain
                                            Ratio. The trees were trained based on the criterion of Entropy, and calculations of Feature Importance
                                            were performed. In this study, we focused on detailing the classification results and features extracted by
                                            the decision trees, based on the best Silhouette scores obtained and the Information Gain. We examined
                                            whether the words or parts of words with classificatory potential identified in the process matched the
                                            findings from previous exploratory tasks performed using other techniques.

                                            Keywords
                                            Augustan love poets, Document Clustering, K-Means, Silhouette Coefficient, Decision Trees, Feature
                                            Importance, Information Gain Ratio




                                CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
                                ∗ Corresponding author.
                                † These authors contributed equally.
                                Email: carlosnusch@prebi.unlp.edu.ar (C. J. Nusch)
                                URL: https://prebi-sedici.unlp.edu.ar/personal/carlos-nusch/ (C. J. Nusch)
                                ORCID: 0000-0003-1715-4228 (C. J. Nusch); 0000-0002-8997-5415 (G. d. R. Riande); 0000-0001-7825-2927 (L. C. Cagnina);
                                0000-0001-5605-8963 (M. L. Errecalde); 0000-0003-1388-0337 (L. Antonelli)
                                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction
This study¹ builds on a master’s thesis [27] examining C. S. Lewis’ observations [19] on the
influence of Courtly love and Occitan literature on 20th-century love imagery. Similarities in
love themes, treatment of the beloved, and political and military terms were found between
Occitan and 1st-century BC Latin poetry. This thesis aims to identify textual patterns linking
ancient love themes to the Religion of Love in medieval Occitan poetry, using a comparative ap-
proach that combines close reading with computational methods [32, 23]. This article evaluates
clustering techniques for differentiating love poems from other Latin poetry, identifying key
lexical features. Previous work [29] explained the techniques, while here we focus on feature
extraction and optimal Silhouette Score values.

1.1. State of the Art
Several authors have applied clustering to ancient texts. Bracco et al. [16] used K-means to
detect literary genres in cuneiform texts, and Martins et al. [21] used k-Nearest Neighbors for
author classification. Cantaluppi and Passarotti [9] studied Seneca’s complete works, Cicero’s
orations, Jerome’s Latin New Testament, and Aquinas’ major works. Nagy [25] used multi-
variate analysis and clustering to examine rhyme in twelve classical Latin poets, identifying
stylistic differences between genres and authors. In recent work, he applied UMAP and t-SNE
to show stylistic distinctions between Ovid’s Heroides and other works, and the authenticity
of the Epistula Sapphus [26]. Forstall et al. [14, 15] compared lexical and rhythmic features at
character and word n-gram levels with other 1st-century BC poets.

1.2. Problem Definition and Contributions
The previous work aimed to explore clustering techniques to distinguish love poems from other
types of poetry and identify useful lexical characteristics for classification. The K-means algo-
rithm [20] was used, and the optimal number of clusters was determined with the Silhouette
Index [34], which measures group cohesion and separation. Since K-means, based on Euclidean
distance, does not provide detailed feature extraction, decision trees [31] were used to comple-
ment this approach. This combination allowed for the indirect extraction of features, with
metrics such as Importance, Information Gain, and Information Gain Ratio [35] identifying the
most relevant features.
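
For reference, the Silhouette value of a single document i is conventionally defined as below. This is the standard textbook definition, given here as a reading aid rather than quoted from the cited source: a(i) is the mean distance from i to the other members of its own cluster (cohesion), and b(i) is the smallest mean distance from i to the members of any other cluster (separation). The Silhouette Score of a clustering is the mean of s(i) over all documents.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values near 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters.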


2. Research Methodology and Approach
2.1. Analysis Corpus and Used Editions
The corpus includes the complete works of Gaius Valerius Catullus [22], Albius Tibullus [24],
and Sextus Propertius [30], representing love poetry, as well as all books from the Aeneid by
Publius Vergilius Maro [17] and Pharsalia by Marcus Annaeus Lucanus [39] as control samples,
focused on political, historical, and martial themes. The analysis reveals differences in the
number of words and verses per poem among different authors and genres (Table 1). To address
concerns about unbalanced datasets, we used relative frequency and separated the authors
to reduce noise and bias from the larger epic texts. Two datasets were used: one with the
Augustan love poets and Vergil, and another with the Augustan love poets and Lucan.

¹ An appendix with key tables is included after the references. A larger dataset is available: Nusch, C. (2024).
  Clustering Tasks and Decision Trees with Augustan love poets [Data set]. CHR2024, Aarhus, Denmark. Zenodo.
  https://doi.org/10.5281/zenodo.12682694.

Table 1
Summary of works and word statistics from various Latin authors.
        Author                                Verses          Total             Unique           Avg. Words
        (Work)                                                Words             Words            Poem/Canto
    Catullus (Merrill, 1893)                   2289           12912               5802               110.35
    Tibullus (Müller, 1898)                    1930           12368               5201               334.27
    Propertius (Postgate, 1915)                4008           25450               9809               242.38
    Lucanus Pharsalia (Weise, 1935)            8061           51215               14750              5121.5
    Virgilius Aeneid (Greenough, 1900)         9896           63896               16616              5324.66

   To construct the analysis corpus, resources from the Perseus Project digital library [1, 10]
at Tufts University were used. The library contains 2,412 works in 3,192 editions and trans-
lations (1,639 in Greek, 636 in Latin) and a total of 69.7 million words. The texts, curated by
specialists and shared under a CC BY SA 3.0 (US) license, are available in XML format. Addi-
tional resources include models for grammatical tagging and stopwords for Latin. The poems
were harvested through web scraping using R, while Python libraries were employed for text
analysis and mining.
   The analysis explored character n-grams (2 to 7) and word n-grams (1 to 5), using the Bag
of Words (BOW) method [36]. Three types of matrices were generated: the first based on
raw frequency (using Scikit-learn’s CountVectorizer), the second on relative frequency (with a
custom function), and the third using the TF-IDF technique [36], which highlights important
words by weighing their frequency relative to their rarity across the dataset. While CountVec-
torizer simply counts word occurrences, TF-IDF reduces the impact of common words, giving
more weight to unique terms.
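
The three matrices described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the toy documents are placeholders, and `relative_frequency` is a hypothetical stand-in for the custom function mentioned in the text (here assumed to divide each row by the document's total count). Stopword filtering can be added through the `stop_words` parameter of the vectorizers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for poems; the real corpus comes from the Perseus texts.
docs = [
    "arma uirumque cano troiae qui primus ab oris",
    "odi et amo quare id faciam fortasse requiris",
    "uiuamus mea lesbia atque amemus",
]

def relative_frequency(counts):
    """Divide each row of a count matrix by that document's total count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty documents
    return counts / totals

# Word 1-grams; for character n-grams use analyzer="char" and ngram_range=(n, n).
raw = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit_transform(docs).toarray()
rel = relative_frequency(raw)
tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 1)).fit_transform(docs).toarray()
```

All three matrices share the same shape (documents by n-grams), so the same clustering code can be run on each of them.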

2.2. Text Preprocessing Tasks
Before analysis, the text was cleaned by removing empty lines, sequences of spaces (“\n \n\n\n”),
and editorial symbols for illegible gaps (“†”). Spanish quotation marks were replaced with
English ones for tool compatibility, and punctuation was removed from character n-grams, as
it was added by editors.
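
A minimal sketch of these cleaning steps, using only the Python standard library. The function names are hypothetical, and we assume here that the "Spanish quotation marks" are the angle quotes « and »; the actual scripts may differ.

```python
import re

def clean_text(text):
    """Remove editorial daggers, normalize angle quotes, and drop empty lines."""
    text = text.replace("†", "")                      # editorial marks for corrupt passages
    text = text.replace("«", '"').replace("»", '"')   # angle quotes -> straight quotes
    text = re.sub(r"[ \t]+", " ", text)               # collapse runs of spaces and tabs
    lines = [line.strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)  # keep only non-empty lines

def strip_punctuation(text):
    """Remove punctuation before building character n-grams."""
    return re.sub(r"[^\w\s]", "", text)
```

For example, `clean_text` applied to a line followed by several blank lines and a dagger returns the two text lines joined by a single newline.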
   For stopwords², we used the Stopwords ISO [2] package for Latin, which we preferred over
the Perseus Project version because it retains important words in elegiac poetry, such as ego,
enabling the analysis of personal pronouns, a significant feature noted in previous works [27,
28].

² For a more detailed discussion of the complexity and variety of stopwords in Latin and other ancient languages,
  see A. Berra [4] and P.J. Burns [7].




  To enhance tokenization, we added two procedures from The Classical Language Toolkit
(CLTK) [18]: JVReplacer, to standardize spellings (e.g., Iulius/Julius and uir/vir), and LatinWord-
Tokenizer, which helped identify enclitics (e.g., -que, -ve) and prevent incorrect tokenization.
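
To make the effect of these two steps concrete, the sketch below imitates them with the standard library only. It is a deliberately simplified stand-in, not the CLTK implementation: the function names, the two-item enclitic list, and the small exception list are all assumptions made for illustration, and the CLTK tools handle many more cases.

```python
import re

# "que" must be tested before "ue", because every word ending in "que" also ends in "ue".
ENCLITICS = ("que", "ue")
# Hypothetical, intentionally tiny list of words that only look like enclitic compounds.
EXCEPTIONS = {"atque", "neque", "quoque", "denique", "itaque", "usque", "quisque"}

def normalize_jv(text):
    """Map j -> i and v -> u (both cases), so that Julius/Iulius and vir/uir coincide."""
    return text.translate(str.maketrans("jvJV", "iuIU"))

def split_enclitic(token):
    """Return [stem, '-enclitic'] when the token ends in a listed enclitic, else [token]."""
    lowered = token.lower()
    if lowered in EXCEPTIONS:
        return [token]
    for enc in ENCLITICS:
        if lowered.endswith(enc) and len(lowered) > len(enc) + 1:
            return [token[: -len(enc)], "-" + enc]
    return [token]

def tokenize(text):
    """Normalize spelling, split on word characters, then detach enclitics."""
    tokens = re.findall(r"\w+", normalize_jv(text))
    return [piece for tok in tokens for piece in split_enclitic(tok)]
```

With this sketch, `tokenize("arma virumque cano")` yields the stem `uirum` and a separate `-que` token, so that `uirum` is counted together with its other occurrences instead of as a distinct word form.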


3. Evaluation: Clustering and Decision Trees as Combined Techniques
As explained previously [29], document clustering was performed using the K-means method
and Silhouette scores to evaluate the best cluster configuration. The optimal number of clusters
(k) was determined by testing k values from 2 to 20. Tests were conducted using fixed ranges of
character n-grams (2 to 7) and word n-grams (1 to 5), with the Silhouette coefficient calculated
for each k. The aim was to find both the best k and the most effective n-gram ranges for
clustering.
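
The search for the optimal k can be sketched with Scikit-learn as follows. This is an illustrative outline under our own assumptions (the helper name `best_k`, the `n_init` value, and the fixed random seed are not taken from the study); the matrix is any of the document-by-n-gram matrices described in Section 2.1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(matrix, k_min=2, k_max=20, seed=0):
    """Fit K-means for each k and return (k, score, labels) with the highest Silhouette Score."""
    matrix = np.asarray(matrix, dtype=float)
    best = (None, -1.0, None)
    # The Silhouette Score is only defined for 2 <= k <= n_samples - 1.
    upper = min(k_max, matrix.shape[0] - 1)
    for k in range(k_min, upper + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(matrix)
        if len(set(labels)) < 2:
            continue  # degenerate solution, no score can be computed
        score = silhouette_score(matrix, labels)
        if score > best[1]:
            best = (k, score, labels)
    return best
```

Running `best_k` once per n-gram range and per matrix type gives the kind of (clusters, score) pairs reported in Tables 2 and 3.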
   Once the data was labeled, decision trees were trained using the entropy criterion to assess
feature importance, with Information Gain (IG) and Information Gain Ratio (IGR) calculated.
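
A minimal sketch of this second step is given below. The decision tree uses Scikit-learn's entropy criterion; Information Gain and Information Gain Ratio are computed by hand. Two choices here are assumptions of the sketch and not a description of the study's code: each feature is binarized as present (count > 0) or absent before computing IG and IGR, and natural logarithms are used. The helper names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def entropy(values):
    """Shannon entropy (natural log) of a discrete array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def info_gain_and_ratio(feature, labels):
    """IG and IGR of a feature binarized as present (count > 0) vs. absent."""
    present = np.asarray(feature) > 0
    labels = np.asarray(labels)
    conditional = 0.0
    for mask in (present, ~present):
        if mask.any():
            conditional += mask.mean() * entropy(labels[mask])
    gain = entropy(labels) - conditional
    split_info = entropy(present)  # entropy of the split itself
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

def rank_features(matrix, labels, names):
    """Return (name, importance, IG, IGR) tuples sorted by decreasing IG."""
    matrix = np.asarray(matrix)
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(matrix, labels)
    rows = []
    for j, name in enumerate(names):
        gain, ratio = info_gain_and_ratio(matrix[:, j], labels)
        rows.append((name, float(tree.feature_importances_[j]), gain, ratio))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Note that when a single split separates the two clusters perfectly, the tree assigns an importance of 1 to that one feature and 0 to all others, whereas IG and IGR still rank every feature individually.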


4. Preliminary or Intermediate Results
Better Silhouette Scores were achieved using the raw frequency matrix with simple stopwords
filtering (CountVectorizer with Stopwords), while the Relative Frequency and TF-IDF matrices
showed lower scores. TF-IDF scores were close to zero, indicating poor cluster separation, and
Relative Frequency values were around 0.5 for both datasets (Tables 2 and 3). The use of relative
frequency significantly impacted the optimal number of clusters recommended by the K-means
algorithm.

Table 2
Optimal clusters and corresponding Silhouette values for different ranges of n-grams (Corpora of Cat-
ullus, Tibullus, Propertius, and Vergilius).
                             Raw Frequency       Relative Frequency         TF-IDF
           N-gram Type      Clusters Score       Clusters Score        Clusters Score
           Char 2-grams         2        0.94          7      0.19         7        0.16
           Char 3-grams         2        0.93          2      0.534        3        0.07
           Char 4-grams         2        0.904         2      0.536        3       0.031
           Char 5-grams         2        0.85          5      0.446        2       0.008
           Char 6-grams         2        0.801         5      0.443       15       0.008
           Char 7-grams         2         0.76         5      0.44        15       0.004
           Word 1-grams         2        0.802         2      0.55         2       0.013
           Word 2-grams         2        0.718         2      0.52        18       0.001
           Word 3-grams         2        0.719         2      0.54        19       0.007
           Word 4-grams         2        0.72          2      0.56        10       0.007
           Word 5-grams         2        0.721         2      0.58        19       0.007

Table 3
Optimal clusters and corresponding Silhouette values for different ranges of n-grams (Corpora of Catullus,
Tibullus, Propertius, and Lucanus).
                             Raw Frequency       Relative Frequency         TF-IDF
           N-gram Type      Clusters Score       Clusters Score        Clusters Score
           Char 2-grams         2        0.94          2      0.19         2        0.15
           Char 3-grams         2        0.92          3      0.53         3       0.074
           Char 4-grams         2         0.89         2      0.53         4       0.031
           Char 5-grams         2         0.85         5      0.44         2       0.009
           Char 6-grams         2        0.801         2      0.53         4       0.004
           Char 7-grams         2         0.79         2      0.55        18       0.004
           Word 1-grams         2        0.802         2      0.52         2       0.011
           Word 2-grams         2         0.74         2      0.38        18       0.001
           Word 3-grams         3         0.74         3      0.54        17       0.003
           Word 4-grams         3         0.75         2      0.56        15       0.008
           Word 5-grams         3         0.75         2      0.58        19       0.007

  The new tokenization and normalization process using CLTK modules had a noticeable
impact. Regarding the most critical features for classifying clusters, results suggest that the
methodology and resources should be reevaluated. In previous work, high Silhouette scores
were observed with the frequency table, but feature importance metrics showed an uneven
distribution, with one or two attributes dominating. While TF-IDF identified more features, the
low Silhouette scores indicated poor classification. Despite better Silhouette scores, the same
issue occurred with the relative frequency matrix.

4.1. Feature Extraction at Character N-Grams Level
As shown in Figure 1, the clustering task succeeded in separating Augustan love poets from
epic poets. Tables 4 and 5 show the features extracted at this n-gram level.
  In the case of relative frequency, after excluding Lucan, the best Silhouette Score at the char-
acter n-gram level was achieved with 4-grams. However, despite obtaining a relatively good
score (0.53), the resulting classification did not meet expectations (Figure 2). The algorithm
constructed two clusters, one of which contained only Carmen 94. A similar phenomenon
occurred when excluding Vergil, but with the distinction that the isolated Carmen was 112.

4.2. Feature Extraction at Word N-Grams Level
At the word n-gram level, similar to the character n-gram level, the best classification method
was the raw frequency matrix. Although the relative frequency matrix also yielded good Sil-
houette scores, it consistently produced poor classifications, isolating only Carmen 94 from
the rest. At other n-gram levels, the carmina that were separated included Carmina 14, 82, 85,
and 106. All of these are relatively short, suggesting that the difference in length among the
poems introduces internal variability in the corpus that hinders classification based on relative
frequencies. The same task, when performed using raw frequencies, yielded excellent results,
whether Lucan or Vergil was excluded from the analysis (Figure 3 and Tables 6, 7).
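
The sensitivity of relative frequencies to poem length can be illustrated with a toy calculation. The two "documents" below are invented for the example and are not taken from the corpus: a single occurrence of a word accounts for a third of a three-word poem, but for a tiny fraction of a long text.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each token to its count divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

short_poem = "odi et amo".split()                       # 3 tokens
long_text = ("arma uirum cano " * 100 + "amo").split()  # 301 tokens

short_rel = relative_frequencies(short_poem)["amo"]  # one occurrence out of 3 tokens
long_rel = relative_frequencies(long_text)["amo"]    # one occurrence out of 301 tokens
```

Very short poems therefore sit far from all other documents in relative-frequency space, which is one plausible reason why K-means isolates them in a cluster of their own.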

Figure 1: Scatter plot of clustering by K Means using a raw frequency matrix of 2 character n-grams
indicating the clusters (left) and authors (right) with different colors.

   Table 8 and Figure 4 show that the use of relative frequency brings forth personal pronouns,
terms previously associated with love poetry in earlier studies. However, these should be
disregarded when obtained through this methodology, as the classifications achieved with them
were quite poor.
   Both datasets, whether excluding Lucan or Vergil, showed identical performance in author
classification. Character n-grams (2 to 6) and single-word n-grams effectively separated epic
authors from Augustan love poets using the raw frequency matrix, with Silhouette Scores above
0.8³. Lower scores led to suboptimal classifications, where one cluster contained only a single
book (e.g., Book X or XII of the Aeneid or Book IX of Pharsalia).
   Assessing the relevance of specific character n-grams for classification remains challenging,
requiring a more detailed stylistic investigation. In summary, document grouping was effective,
though feature-level techniques did not always highlight typical elegiac terms. The terms
extracted via decision trees for Augustan love poets and Vergil predominantly reflected epic,
mythical, and martial language⁴.

³ For more details on the extracted terms, see Appendix A and B.
⁴ Please note that we use English quotation marks to indicate the spaces before or after a word, in cases
  where those spaces are part of the character n-gram.

Table 4
Most important features (character 2-grams) using the raw frequency matrix method and Stopwords
filtering (Corpora of Catullus, Tibullus, Propertius, and Vergilius).
                 Feature    Importance      Feature      IG     Feature     IGR
                    un            1            nl      0.1254      dg      0.5747
                                              uq       0.1226      aë      0.5463
                                              dg       0.1208      oï      0.5193
                                              gg       0.1205      ën      0.4929
                                              ms       0.1205      ïa      0.4929
                                              gm       0.1205      ez      0.4664
                                              dh       0.1181      aï      0.4664
                                              bt       0.1174      dh      0.4516
                                              yd       0.1129       ï       0.439
                                               dl      0.1128      eï       0.439
                                              nh       0.1128      ër       0.439
                                              ze       0.1128      mf      0.4105
                                              bn       0.1072       x       0.376
                                              my       0.1068      oö      0.3758
                                               ln       0.105       ë      0.3758
                                              rh        0.105      ïc      0.3758
                                              aë       0.1049      ön      0.3758
                                              df       0.1036      ïu      0.3758
                                              yt       0.0999      oë      0.3758
                                              yc       0.0999      gg      0.3724


4.3. Data from the Corpora of Catullus, Tibullus, Propertius, and Vergil:
    • 5-character n-grams: ‘ sub ’ (low), eucri and eucru (part of Teucri), fatus (spoke), fatur
      (speaks), auras (breezes), eneas (Aeneas)
    • 6-character n-grams: teucru (Trojan), aeneas (Aeneas), ‘fatur ’ (speaks), ‘fatus ’ (spoke),
      ‘fatis ’ (fates), ipoten and mnipot (from omnipotens, presumably attributed to Jupiter),
      clamor (shout)
    • 1-word n-grams: urbem (city), aeneas (Aeneas), teucrum (Trojan), ingentem (huge),
      omnipotens (almighty), aether (ether/sky), pius (pious), iamque (and now), socius (ally),
      clamore (shout), finis (end), fatis (fates), ignem (fire), auris (from auris, ear or aurum, gold),
      caelum (sky), genitor (father), hostis (enemy), terram (land), bellum (war), dux (leader),
      uisus (vision).

  A similar phenomenon occurs with the terms obtained from the grouping of the Augus-
tan love poets with Lucan, where words referring to the political causes of the Civil War pre-
dominate, emphasizing the crimes committed by the different factions and the physical conse-
quences on the bodies of the Roman soldiers and citizens [38]:




Table 5
Most important features (character 2-grams) using the raw frequency matrix method (Corpora of
Catullus, Tibullus, Propertius, and Lucanus).
                 Feature     Importance     Feature      IG      Feature     IGR
                     g            1            ye      0.1464      ye       0.5942
                                               gm      0.1234      dh       0.4673
                                               dh      0.1151      mt       0.4094
                                               ya      0.1048      bf       0.3991
                                               by      0.1025      gm       0.3975
                                               xq       0.097       sf      0.3797
                                               ze      0.0938      dt       0.3708
                                               dt      0.0914       fc      0.3514
                                               oh      0.0909       fs      0.3514
                                               rh      0.0894      pc       0.3514
                                               oa      0.0879      nb       0.3514
                                               sn      0.087       dg       0.3514
                                               dq      0.0864      cp       0.3514
                                               yp      0.0864      sn       0.3306
                                               df      0.0858      bm       0.3287
                                               gy      0.0858       y        0.323
                                               ee      0.0858      mf        0.323
                                               yc       0.085      xq       0.3126
                                               sy       0.085      ms       0.2979
                                               lm      0.0836      ze       0.2883


4.4. Data from the Corpora of Catullus, Tibullus, Propertius, and Lucan:
    • 5-character n-grams: aesar (from Caesar), aussa (from caussa, cause), scera (from viscera,
      entrails), pulos (from populos, peoples), elero (part of scelero, referring to crimes), coeli (of ‘ coel’,
      from heaven), libye (Libya), gladi (part of gladium, sword), adhu and adhuc (from adhuc, until
      now)
    • 6-character n-grams: ‘osque ’ (composed most likely by the accusative plural ending of
      the second declension combined with the enclitic -que), pulos and opulos (peoples), libye
      (Libya), scera and iscera (from viscera, entrails), bellor and elloru (as part of bellorum, of
      the wars), elerum, ‘lerum’ (from scelerum, crime), caussa and ‘ causs’ (cause), ‘ gladi’ (as
      part of gladium in its different forms, the sword), ‘ coeli’ (sky).
    • 1-word n-grams: pectora (chests), populos (peoples), scelerum (crimes), bellorum (wars),
      senatus (senate), ciuilia (civil), nocentes (from nocens, guilty or harmful), caussa (cause),
      fatis (fates), coelo (sky), mundus (world), caeli (sky), diui (gods), libye (Libya).
   In previous work, we found that the Silhouette method consistently recommended two clus-
ters for 2-character n-grams and three clusters for 1-word n-grams, regardless of the technique
used. Scatter plots aligned with the stylistic distribution reported by Forstall et al. [15], who
used SVM to analyze Catullus’ influence on Paul the Deacon’s poetry (Figure 5).

Figure 2: Scatter plot of clustering by K Means using a relative frequency matrix of 4 character n-grams
indicating clusters (left) and authors (right) with different colors.

   However, in this instance, whether due to the new preprocessing corrections or the novel
comparison methods employed in separating Vergil and Lucan, the clustering tasks were not
always accurate. While some correct distributions of the authors can be detected in the scatter
plot space, the algorithm’s non-human interpretation results in an unclear cluster classification,
grouping Tibullus, Propertius, and Vergil against Catullus (Figure 6).


5. Conclusions and Learned Lessons
This article highlights the need to reevaluate methodologies and resources. Positive clustering
results were obtained, especially with raw frequencies and n-grams, with Silhouette Scores
above 0.8. Preprocessing steps and CLTK modules for tokenization and normalization signifi-
cantly impacted the results, emphasizing the importance of tailored tools for ancient texts.
   Relative frequency and two datasets reduced noise and bias from the epic authors, aiming
to balance the text sets. However, even with more balanced data, raw frequencies provided
better clustering results than relative frequency and TF-IDF matrices. As in the previous study,
uneven feature importance and variable performance across n-gram levels and matrices suggest
further refinement is needed for consistent results.
   N-grams from Vergil’s and Lucan’s works show a dominance of political, historical, and
war-related terms. This suggests that, despite efforts to balance the datasets, the lexical char-
acteristics of epic poetry still influence classification. The identified n-grams reflect the epic
and mythical focus of these authors, contrasting with the love and personal themes of Catullus,
Tibullus, and Propertius. The variability in document length—regular in Pharsalia and Aeneid,
but variable in the Augustan love poets—affects results. Additionally, the internal variability
among the Augustan love poets’ corpora also affects classification. We could experiment by
partitioning Catullus’ work into polymetric poems, carmina maiora, and epigrams or elegiac
couplets, and run separate analyses, or intervene in the corpus catullianum by removing non-
amorous themed poems. However, this is complex, as thematic boundaries are not clear-cut.

Figure 3: Scatter plot of clustering by K Means using a raw frequency matrix of 1-word n-grams
indicating clusters (left) and authors (right) with different colors.

   This exploratory analysis requires further refinement through other techniques, such as variable
ranges of character and word n-grams (only fixed ranges were used in this study), other
similarity measures such as Jaccard, Cosine, or Soft Cosine, or clustering methods like Gaussian
Mixture Models, DBSCAN, or hierarchical clustering. Future research could apply normalization
techniques such as L1 or z-score scaling, and examine phenomena like collocations and
co-occurrences, which were not applied in this study. A close reading of clusters based on relative
frequency also offers promise.
   As for the representation of the documents, there is a need to explore techniques with em-
beddings like those developed by Burns et al., Bamman et al., and Johnson et al. [18, 8, 3, 6].
   It should also be noted that the terms obtained by the Decision Tree technique are words
with classification power for that dataset, not necessarily the most typical of one type of po-
etry or another, as there may be important words for both genres penalized by the metrics of
Importance, Information Gain, or TF-IDF. The unequal size of the poems also contributed to
the clarity of classification in raw counts, indirectly transferring poem length as a classification
criterion. Similarly, in decision trees, the feature split points reflected the same pattern, with
epic poem features having much higher frequencies, clearly impacting the results.

Table 6
Most important features at the level of 1-word n-grams ranked by Importance, Information Gain,
Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus, Propertius,
and Vergilius).
                              Raw Frequency Word (1, 1) n-grams
              Feature    Importance    Feature       IG       Feature           IGR
               urbem           1            aeneas      0.1813     teucrum     0.6931
                                          ingentem      0.1813      ingens     0.6931
                                             ingens     0.1813    ingentem     0.6931
                                           teucrum      0.1813      aeneas     0.6931
                                              fatis     0.1683         pius    0.6413
                                         omnipotens     0.1683      aethere    0.6413
                                               late     0.1683          ast    0.6413
                                             ignem      0.1683        fatur    0.6413
                                              auras     0.1683       diuom     0.6413
                                            iamque      0.1601       socios    0.6413
                                                ea      0.1601     clamore     0.6413
                                            genitor     0.1601      teucros    0.6413
                                             terram     0.1601         visu    0.6413
                                             talibus    0.1601       ignem     0.6061
                                           equidem      0.1601   omnipotens    0.6061
                                             diuom      0.1571        fatis    0.6061
                                               pius     0.1571         late    0.6061
                                               visu     0.1571        auras    0.6061
                                            teucros     0.1571       teucri    0.6056
                                                ast     0.1571       regem     0.6056
                                            aeneas      0.1571          ast    0.6056

   Finally, it is important to briefly consider the implications of applying computational and
Distant Reading techniques alongside hypotheses or educated guesses from Close Reading. Frequency
counting, for instance, is used here to model documents, but humans do not speak to be
counted. Otherwise, Catullus would simply have repeated the name Lesbia, and his love would
have been understood without the effort of creating poetry. Fortunately, language is far more
abstract and complex, and computational methods are only beginning to reveal its intricacies.
This issue has resurfaced with criticisms, such as those by Noam Chomsky, against generative
models [11]. It is true that the human mind performs language tasks in a highly elegant manner
and acquires a language exposed to a much smaller number of data than those handled by Large
Language Models (LLMs). LLMs are tools developed for other tasks that did not originally seek
to emulate the human mind [13, 37]. But it is also true that one must yield to the evidence
of the successful results obtained with the use of these techniques and their undeniable capac-
ity to facilitate all kinds of tasks. It’s essential to acknowledge both the limits and strengths
of computational tools, recognizing that Distant Reading offers a different scale of analysis—
rooted not just in methodology but in changes in how information is produced, accessed, and
analyzed in the digital age [33]. Despite criticisms [5], Digital Humanities methodologies hold
great promise for studying language-rich subjects that balance aesthetic and rhythmic elements
like refrains, alliterations, and anaphoras, presenting a unique challenge for modern analytical




                                                630
Table 7
Most important features at the level of 1 word n-grams according to Importance, Information Gain,
Information Gain Ratio using the raw matrix method (Corpora of Catullus, Tibullus, Propertius, and
Lucanus).
                              Raw Frequency word (1, 1) n-grams
                 Feature   Importance Feature        IG      Feature        IGR
                 pectora         1          populos    0.1589    populis    0.6931
                                           scelerum    0.1589   scelerum    0.6931
                                            populis    0.1589   bellorum    0.6931
                                            mundo      0.1589    uiscera    0.6931
                                           bellorum    0.1589       exit    0.6931
                                               exit    0.1589    populos    0.6931
                                            uiscera    0.1589    mundo      0.6931
                                            senatus    0.1464    caussa      0.636
                                             ciuilia   0.1464   nocentes     0.636
                                              nefas    0.1464   ciuilibus    0.636
                                              fatis    0.1464      coelo     0.636
                                            bellum     0.1388      libye     0.636
                                             milite    0.1388     robore     0.636
                                              ducis    0.1388     superi     0.636
                                           ciuilibus   0.1345     adhuc      0.636
                                             fauces    0.1345      ciuile    0.636
                                             robore    0.1345     fauces     0.636
                                             adhuc     0.1345      coeli     0.636
                                             caussa    0.1345   malorum      0.636
                                              libye    0.1345    potuere    0.5968


techniques.


Acknowledgments
I sincerely thank Dr. Kyle P. Johnson, Director of AI at Morgan, Lewis and Bockius LLP, and
Dr. Patrick J. Burns from the Institute for the Study of the Ancient World, NYU, for their kind
and insightful responses to my inquiries on tokenization and the use of CLTK and LatinCy
libraries. I also extend my gratitude to Professor Benjamin Nagy from the Institute of the Polish
Language, Polish Academy of Sciences (IJP PAN), Krakow, for his expert advice on correcting
verse counts based on authorized editions, which greatly enhanced the accuracy of this text
analysis.


Table 8
Most important features at the level of 1-word n-grams according to Importance, Information Gain,
and Information Gain Ratio using the relative frequency matrix method (Corpora of Catullus,
Tibullus, Propertius, and Vergilius).
                 Relative Frequency word (1, 1) n-grams
   Feature   Importance   Feature      IG       Feature      IGR
   legit     1            moechatur    0.0244   moechatur    0.6931
                          olera        0.0244   olera        0.6931
                          olla         0.0244   olla         0.6931
                          mentula      0.0151   mentula      0.114
                          dicunt       0.0132   dicunt       0.0689
                          legit        0.0124   legit        0.0542
                          certe        0.009    certe        0.0209
                          ipsa         0.0061   ipsa         0.0087
                          mihi         0.003    mihi         0.0031
                          tibi         0.003    tibi         0.003
                          tu           0.0024   tu           0.0024
                          nunc         0.0021   nunc         0.0022
                          quid         0.0019   quid         0.002
                          ne           0.0019   ne           0.002
                          ego          0.0018   ego          0.0019
                          mea          0.0018   mea          0.0019
                          esse         0.0018   esse         0.0019
                          iam          0.0018   iam          0.0018
                          illa         0.0016   illa         0.0017
                          nam          0.0015   nam          0.0017


References
 [1] [No author]. Perseus Digital Library Homepage. [No date]. url:
     https://www.perseus.tufts.edu/hopper/.
 [2] [No author]. Stopwords ISO. [No date]. url:
     https://github.com/stopwords-iso/stopwords-iso/blob/master/README.md.
 [3] D. Bamman and P. J. Burns. Latin BERT: A Contextual Language Model for Classical Philology.
     2020. doi: 10.48550/arXiv.2009.10053. url: http://arxiv.org/abs/2009.10053.
 [4] A. Berra. Ancient Greek and Latin Stopwords. 2024. url:
     https://github.com/aurelberra/stopwords.
 [5] T. Brennan. The Digital-Humanities Bust. 2017. url:
     https://www.chronicle.com/article/the-digital-humanities-bust/.
 [6] P. J. Burns. “Building a Text Analysis Pipeline for Classical Languages”. In: Building a
     Text Analysis Pipeline for Classical Languages. De Gruyter Saur, 2019, pp. 159–176. doi:
     10.1515/9783110599572-010. url:
     https://www.degruyter.com/document/doi/10.1515/9783110599572-010/html.
 [7] P. J. Burns. “Constructing Stoplists for Historical Languages”. In: Digital Classics Online
     (2018), pp. 4–20. doi: 10.11588/dco.2018.2.52124. url:
     https://journals.ub.uni-heidelberg.de/index.php/dco/article/view/52124.




Figure 4: Scatter plot of clustering by K Means using a relative frequency matrix of 1-word n-grams
indicating clusters (left) and authors (right) with different colors




Figure 5: Graph obtained by Forstall et al. using the One-class SVM method. Cited by Coffee et al.
[12].


 [8] P. J. Burns. LatinCy: Synthetic Trained Pipelines for Latin NLP. 2023. doi:
     10.48550/arXiv.2305.04365. url: http://arxiv.org/abs/2305.04365.
 [9] G. Cantaluppi and M. Passarotti. “Clustering the Corpus of Seneca: A Lexical-Based Approach”.
     In: Advances in Latent Variables: Methods, Models and Applications. Ed. by M. Carpita,
     E. Brentari, and E. M. Qannari. Cham: Springer International Publishing, 2015, pp. 13–25.
     doi: 10.1007/10104_2014_6. url: https://doi.org/10.1007/10104_2014_6.
[10]   L. M. Cerrato and R. F. Chavez. Perseus Classics Collection: An Overview. [No date]. url:
       https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0053.
[11]   N. Chomsky, I. Roberts, and J. Watumull. “Noam Chomsky: The False Promise of ChatGPT”.
       In: The New York Times (2023). url:
       https://www.nytimes.com/2023/03/08/opinion/noam-chomsky-chatgpt-ai.html.
Figure 6: Scatter plot of clustering by K Means using a TF-IDF matrix of 1-word n-grams indicating
clusters (left) and authors (right) with different colors.


[12]   N. Coffee, J. Gawley, C. Forstall, W. Scheirer, D. Scheirer, J. Corso, and B. Parks.
       “Modelling the Interpretation of Literary Allusion with Machine Learning Techniques”. In:
       Journal of Digital Humanities (2013), pp. 478–479. url:
       https://journalofdigitalhumanities.org/3-1/modelling-the-interpretation-of-literary-allusion-with-machine-learning-techniques/.
[13]   J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional
       Transformers for Language Understanding. 2019. doi: 10.48550/arXiv.1810.04805. url:
       http://arxiv.org/abs/1810.04805.
[14]   C. W. Forstall, S. L. Jacobson, and W. J. Scheirer. “Evidence of intertextuality:
       investigating Paul the Deacon’s Angustae Vitae”. In: Literary and Linguistic Computing 26.3
       (2011), pp. 285–296. doi: 10.1093/llc/fqr029. url: https://doi.org/10.1093/llc/fqr029.
[15]   C. W. Forstall and W. Scheirer. “A Statistical Stylistic Study of Latin Elegiac Couplets”. In:
       2010 Chicago Colloquium on Digital Humanities and Computer Science. 2010, [No pages]. url:
       https://www.semanticscholar.org/paper/A-Statistical-Stylistic-Study-of-Latin-Elegiac-Forstall-Scheirer/e3caac9ec4ee16baac70ed94808dca57dff48a2d.
[16]   G. Bracco, S. Migliori, G. Mencuccini, D. Alderuccio, and G. Ponti. “Data mining tools and
       GRID infrastructure for Assyriology text analysis (an Old-Babylonian situation studied
       through text analysis and data mining tools)”. In: RAI - Rencontre Assyriologique
       Internationale - Private and State in the Ancient Near East. Belgium, 2013, [No pages].
[17]   J. B. Greenough. The Bucolics, AEneid, and Georgics of Virgil. Boston: Ginn, 1900.
[18]   K. P. Johnson, P. J. Burns, J. Stewart, T. Cook, C. Besnier, and W. J. B. Mattingly. “The
       Classical Language Toolkit: An NLP Framework for Pre-Modern Languages”. In: Proceedings
       of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
       International Joint Conference on Natural Language Processing: System Demonstrations.
       Ed. by H. Ji, J. C. Park, and R. Xia. Online: Association for Computational Linguistics,
       2021, pp. 20–29. doi: 10.18653/v1/2021.acl-demo.3. url:
       https://aclanthology.org/2021.acl-demo.3.
[19]   C. S. Lewis. La alegoría del amor: un estudio sobre tradición medieval. 2015th ed. Madrid:
       Encuentro, 1936.
[20]   S. Lloyd. “Least squares quantization in PCM”. In: IEEE Transactions on Information Theory
       28.2 (1982), pp. 129–137. doi: 10.1109/tit.1982.1056489. url:
       https://ieeexplore.ieee.org/document/1056489.
[21]   A. Martins, C. Grácio, C. Teixeira, I. Pimenta Rodrigues, J. L. G. Zapata, and L. Ferreira.
       “Historia Augusta authorship: an approach based on Measurements of Complex Networks”.
       In: Applied Network Science 6.1 (2021), pp. 1–23. doi: 10.1007/s41109-021-00390-7. url:
       https://appliednetsci.springeropen.com/articles/10.1007/s41109-021-00390-7.
[22]   E. T. Merrill. Catullus; edited by Elmer Truesdell Merrill. Boston: Ginn, 1893. url:
       http://archive.org/details/catulluseditedby00catuuoft.
[23]   F. Moretti. Distant Reading. Verso, 2013.
[24]   L. Müller. Sex. Propertii Elegiae. Leipzig: Teubner, 1898.
[25]   B. Nagy. “Rhyme in classical Latin poetry: Stylistic or stochastic?” In: Digital Scholarship
       in the Humanities 37.4 (2022), pp. 1097–1118. doi: 10.1093/llc/fqab105. url:
       https://doi.org/10.1093/llc/fqab105.
[26]   B. Nagy. “Some stylometric remarks on Ovid’s Heroides and the Epistula Sapphus”. In:
       Digital Scholarship in the Humanities 38.3 (2023), pp. 1183–1199. doi: 10.1093/llc/fqac098.
       url: https://doi.org/10.1093/llc/fqac098.
[27]   C. J. Nusch. “Las Edades del Amor: una propuesta para el proyecto Aetates Amoris destinado
       a la poesía amorosa”. Thesis. Universidad Nacional de Educación a Distancia, España,
       2021. doi: 10.35537/10915/125629. url: http://sedici.unlp.edu.ar/handle/10915/125629.
[28]   C. J. Nusch. “Una breve exploración de la terminología amorosa en los corpora
       catullianum, tibullianum y propertianum con métodos y herramientas computacionales:
       etiquetado gramatical, lemas, bigramas y co-apariciones”. In: Revista de Humanidades
       Digitales 9 (2024), pp. 1–40. doi: 10.5944/rhd.vol.9.2024.38680. url:
       https://revistas.uned.es/index.php/RHD/article/view/38680.
[29]   C. J. Nusch, G. del Rio Riande, L. C. C. Cagnina, M. L. Errecalde, and L. Antonelli.
       “Initial Explorations for Document Clustering Tasks in Latin Elegiac Poets”. In: Decisioning.
       Pereira, Colombia, 2024.
[30]   J. P. Postgate. Tibulli aliorumque carminum libri tres. Oxford: Scriptorum classicorum
       bibliotheca Oxoniensis, 1915.
[31]   J. R. Quinlan. “Induction of decision trees”. In: Machine Learning 1.1 (1986), pp. 81–106.
       doi: 10.1007/bf00116251. url: https://doi.org/10.1007/BF00116251.
[32]   S. Ramsay. Reading Machines: Toward an Algorithmic Criticism. Urbana, 2011.
[33]   R. Pimenta. “De Narciso ao mundo-imagem: por uma urgência de uma perspectiva
       crítica sobre a cena informacional contemporânea”. In: Ciência da Informação: sociedade,
       crítica e inovação. Rio de Janeiro, 2022, [No pages].
[34]   P. J. Rousseeuw. “Silhouettes: A graphical aid to the interpretation and validation of
       cluster analysis”. In: Journal of Computational and Applied Mathematics 20 (1987),
       pp. 53–65. doi: 10.1016/0377-0427(87)90125-7. url:
       https://www.sciencedirect.com/science/article/pii/0377042787901257.
[35]   C. E. Shannon. “A mathematical theory of communication”. In: The Bell System Technical
       Journal 27.3 (1948), pp. 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x. url:
       https://ieeexplore.ieee.org/document/6773024.
[36]   K. Spärck Jones. “A statistical interpretation of term specificity and its application in
       retrieval”. In: Journal of Documentation 28.1 (1972), pp. 11–21. doi: 10.1108/eb026526.
       url: https://doi.org/10.1108/eb026526.
[37]   A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I.
       Polosukhin. “Attention is All you Need”. In: Advances in Neural Information Processing
       Systems. Vol. 30. Curran Associates, Inc., 2017, [No pages]. url:
       https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[38]   M. M. Vizzotti. “De la tragedia de Séneca a la épica de Lucano: estrategias de
       representación de los paradigmas filosóficos y literarios”. Thesis. Universidad Nacional de
       La Plata, 2014. doi: 10.35537/10915/34410. url: http://sedici.unlp.edu.ar/handle/10915/34410.
[39]   C. H. Weise. Pharsaliae Libri X. M. Annaeus Lucanus. Leipzig: G. Bassus, 1935.

  Appendix: Extra Data from the Corpora of Catullus, Tibullus, Propertius, Vergilius and Lucanus




Table 9
Most important features at the level of character 2 n-grams according to Importance, Information Gain,
Information Gain Ratio using the raw frequency matrix method and Stopwords filtering (Corpora of
Catullus, Tibullus, Propertius, and Vergilius).
                                Raw Frequency char (2, 2) n-grams
                  Feature    Importance Feature         IG     Feature        IGR
                     un            1             nl     0.1254      dg       0.5747
                                                uq      0.1226      aë       0.5463
                                                dg      0.1208      oï       0.5193
                                                gg      0.1205      ën       0.4929
                                                ms      0.1205      ïa       0.4929
                                                gm      0.1205      ez       0.4664
                                                dh      0.1181      aï       0.4664
                                                bt      0.1174      dh       0.4516
                                                yd      0.1129       ï        0.439
                                                 dl     0.1128      eï        0.439
                                                nh      0.1128      ër        0.439
                                                ze      0.1128      mf       0.4105
                                                bn      0.1072       x        0.376
                                                my      0.1068      oö       0.3758
                                                 ln      0.105       ë       0.3758
                                                rh       0.105      ïc       0.3758
                                                aë      0.1049      ön       0.3758
                                                df      0.1036      ïu       0.3758
                                                yt      0.0999      oë       0.3758
                                                yc      0.0999      gg       0.3724




Table 10
Most important features at the level of character 3 n-grams according to Importance, Information Gain,
and Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus,
Propertius, and Vergilius).
                                Raw Frequency char (3, 3) n-grams
                  Feature    Importance Feature         IG     Feature       IGR
                    aba            1            ipo     0.1601      ffa      0.6056
                                                om      0.1601      dfa      0.6056
                                                ffa     0.1571     adg       0.5747
                                               dsu      0.1536     dhu       0.5747
                                                oln     0.1536     dgn       0.5747
                                                 gii    0.1481      xsc      0.5747
                                                ols     0.1444      ciq      0.5747
                                                teb     0.1433     ybr       0.5747
                                                tto     0.1433     axu       0.5747
                                               uom       0.139     hyb       0.5747
                                               toq       0.139      ols      0.5521
                                                xce      0.139     yde       0.5463
                                                dfa      0.138       aë      0.5463
                                                rex     0.1365      bp       0.5463
                                                 bii    0.1352     amd       0.5463
                                                teu     0.1352     moq       0.5463
                                                nip     0.1352     mdu       0.5463
                                               roq      0.1316      ipo      0.5458
                                               euc      0.1316      om       0.5458
                                               bum      0.1316      giq      0.5193




Table 11
Most important features at the level of character 4 n-grams according to Importance, Information Gain,
and Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus,
Propertius, and Vergilius).
                                Raw Frequency char (4, 4) n-grams
                  Feature    Importance Feature         IG     Feature       IGR
                     pri           1          temq      0.1813    iuom       0.6931
                                              iuom      0.1813     ucru      0.6931
                                               ucru     0.1813    temq       0.6931
                                              boru      0.1813     ubib      0.6931
                                               ubib     0.1813    boru       0.6931
                                              sumq      0.1683    nimb       0.6413
                                              mman      0.1683     m ef      0.6413
                                              lumq      0.1683     effa      0.6413
                                              mnip      0.1683     ffat      0.6413
                                               ipot     0.1683      ast      0.6413
                                               nipo     0.1683     cios      0.6413
                                               bibu     0.1683    lumq       0.6061
                                                ea      0.1601     ipot      0.6061
                                               iisq     0.1601    mman       0.6061
                                                cri     0.1601    mnip       0.6061
                                               dani     0.1601     nipo      0.6061
                                               teuc     0.1601     bibu      0.6061
                                              scun      0.1601    sumq       0.6061
                                               ttol     0.1601     adfa      0.6056
                                               eucr     0.1601     anid      0.6056




Table 12
Most important features at the level of character 5 n-grams according to Importance, Information Gain,
and Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus,
Propertius, and Vergilius).
                                Raw Frequency char (5, 5) n-grams
                  Feature    Importance Feature         IG     Feature       IGR
                    sub            1          ntemq      0.1813    eucru     0.6931
                                                eucri    0.1813   borum      0.6931
                                               eucru     0.1813   ntemq      0.6931
                                                fatus    0.1813     fatur    0.6931
                                              imman      0.1813     iuom     0.6931
                                                fatur    0.1813   imman      0.6931
                                              borum      0.1813   ucrum      0.6931
                                                iuom     0.1813     eucri    0.6931
                                              temqu      0.1813     e teu    0.6931
                                                e teu    0.1813   temqu      0.6931
                                              ucrum      0.1813     fatus    0.6931
                                              mnipo      0.1683      cios    0.6413
                                               auras     0.1683    clamo     0.6413
                                              sumqu      0.1683     anch     0.6413
                                                 fatis   0.1683      effa    0.6413
                                               nipot     0.1683    lamor     0.6413
                                                nt ac    0.1683     ocios    0.6413
                                               omnip     0.1683     effat    0.6413
                                               eneas     0.1683   undam      0.6413
                                                ipote    0.1683     m eff    0.6413




Table 13
Most important features at the level of character 6 n-grams according to Importance, Information Gain,
and Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus,
Propertius, and Vergilius).
                                Raw Frequency char (6, 6) n-grams
                  Feature    Importance Feature         IG     Feature       IGR
                   tisque          1          teucru    0.1813    ngente     0.6931
                                             aeneas     0.1813     teucri    0.6931
                                               fatur    0.1813    aeneas     0.6931
                                             ntemqu     0.1813     ucrum     0.6931
                                              e teuc    0.1813     e teuc    0.6931
                                             eucrum     0.1813     teucru    0.6931
                                             ngente     0.1813     borum     0.6931
                                             imman      0.1813    imman      0.6931
                                              ucrum     0.1813      fatus    0.6931
                                              borum     0.1813    ntemqu     0.6931
                                               fatus    0.1813    eucrum     0.6931
                                              teucri    0.1813      fatur    0.6931
                                             temque     0.1813    temque     0.6931
                                                fatis   0.1683     m regi    0.6413
                                               auras    0.1683     a fatu    0.6413
                                               eneas    0.1683     pius a    0.6413
                                               auras    0.1683     a teuc    0.6413
                                              ipoten    0.1683       uisu    0.6413
                                              omnip     0.1683    clamor     0.6413
                                             mnipot     0.1683      e ora    0.6413




Table 14
Most important features at the level of word 1 n-grams according to Importance, Information Gain,
and Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus,
Propertius, and Vergilius).
                              Raw Frequency word (1, 1) n-grams
              Feature    Importance    Feature        IG      Feature          IGR
               urbem           1            aeneas     0.1813     teucrum     0.6931
                                          ingentem     0.1813      ingens     0.6931
                                             ingens    0.1813    ingentem     0.6931
                                           teucrum     0.1813      aeneas     0.6931
                                              fatis    0.1683         pius    0.6413
                                         omnipotens    0.1683      aethere    0.6413
                                               late    0.1683          ast    0.6413
                                             ignem     0.1683        fatur    0.6413
                                              auras    0.1683       diuom     0.6413
                                            iamque     0.1601       socios    0.6413
                                                ea     0.1601     clamore     0.6413
                                            genitor    0.1601      teucros    0.6413
                                             terram    0.1601         uisu    0.6413
                                             talibus   0.1601       ignem     0.6061
                                           equidem     0.1601   omnipotens    0.6061
                                             diuom     0.1571        fatis    0.6061
                                               pius    0.1571         late    0.6061
                                               uisu    0.1571        auras    0.6061
                                            teucros    0.1571       teucri    0.6056
                                                ast    0.1571       regem     0.6056




Table 15
Most important features at the level of char 2 n-grams according to Importance, Information Gain,
and Information Gain Ratio using the raw frequency matrix method (Corpora of Catullus, Tibullus,
Propertius, and Lucanus).
                                Raw Frequency char (2, 2) n-grams
                  Feature    Importance Feature         IG     Feature      IGR
                     g            1            ye      0.1464     ye       0.5942
                                               gm      0.1234     dh       0.4673
                                               dh      0.1151     mt       0.4094
                                               ya      0.1048     bf       0.3991
                                               by      0.1025     gm       0.3975
                                               xq       0.097      sf      0.3797
                                               ze      0.0938     dt       0.3708
                                               dt      0.0914      fc      0.3514
                                               oh      0.0909      fs      0.3514
                                               rh      0.0894     pc       0.3514
                                               oa      0.0879     nb       0.3514
                                               sn      0.087      dg       0.3514
                                               dq      0.0864     cp       0.3514
                                               yp      0.0864     sn       0.3306
                                               df      0.0858     bm       0.3287
                                               gy      0.0858      y        0.323
                                               ee      0.0858     mf        0.323
                                               yc       0.085     xq       0.3126
                                               sy       0.085     ms       0.2979
                                               lm      0.0836     ze       0.2883




Table 16
Most important features at the level of character 3-grams according to Importance, Information Gain (IG),
and Information Gain Ratio (IGR), using the raw frequency matrix method (corpora of Catullus, Tibullus,
Propertius, and Lucanus).
                                Raw Frequency char (3, 3) n-grams
                  Feature    Importance Feature         IG     Feature      IGR
                     te           1           oer      0.1589     bye      0.6931
                                              bye      0.1589     oer      0.6931
                                               rct     0.1464     dhu       0.636
                                              obo      0.1388      ye       0.636
                                                ax     0.1388      rct     0.5942
                                                ye     0.1345      bp      0.5627
                                              dhu      0.1345      giq     0.5627
                                              nfu      0.1328     gme      0.5341
                                              agm      0.1277     emt      0.5309
                                               teb     0.1234     bmo      0.5309
                                               rut     0.1234      eer     0.5309
                                              gme      0.1224      xir     0.5309
                                              toq      0.1195      ax      0.5275
                                              xce      0.1195     obo      0.5275
                                               xn      0.1195     nny        0.5
                                               lsu     0.1195       al       0.5
                                               pei      0.116     mto        0.5
                                              axe       0.116      bif       0.5
                                              aux       0.116      efo       0.5
                                              saq       0.116     gad        0.5




Table 17
Most important features at the level of character 4-grams according to Importance, Information Gain (IG),
and Information Gain Ratio (IGR), using the raw frequency matrix method (corpora of Catullus, Tibullus,
Propertius, and Lucanus).
                                Raw Frequency char (4, 4) n-grams
                  Feature    Importance Feature         IG     Feature      IGR
                    ra n          1            ibye    0.1589    auss      0.6931
                                               auss    0.1589    coel      0.6931
                                               coel    0.1589    ibye      0.6931
                                               glad    0.1589    glad      0.6931
                                              sena     0.1464    oelo       0.636
                                               s rh    0.1464     bye       0.636
                                                iuil   0.1464    dhuc       0.636
                                               leru    0.1464     rtib      0.636
                                               efas    0.1464    adhu       0.636
                                              moto     0.1464    auce       0.636
                                               susq    0.1464    cesp      0.5968
                                               fauc    0.1464    moes      0.5968
                                               tebr    0.1464    xcus      0.5968
                                               rcto    0.1464    suor      0.5968
                                              lumq     0.1464     tors     0.5968
                                               arct    0.1464    otue      0.5968
                                              mpul     0.1388    adau      0.5968
                                               robo    0.1388    gulo      0.5968
                                                uile   0.1388    nfan      0.5968
                                              obor     0.1388    mpag      0.5968




Table 18
Most important features at the level of character 5-grams according to Importance, Information Gain (IG),
and Information Gain Ratio (IGR), using the raw frequency matrix method (corpora of Catullus, Tibullus,
Propertius, and Lucanus).
                                Raw Frequency char (5, 5) n-grams
                  Feature    Importance Feature         IG     Feature      IGR
                   aesar          1          aussa     0.1589    causs     0.6931
                                             lerum     0.1589     scera    0.6931
                                             causs     0.1589     eleru    0.6931
                                              scera    0.1589    pulos     0.6931
                                              coeli    0.1589     coeli    0.6931
                                              libye    0.1589      coel    0.6931
                                             pulos     0.1589    lerum     0.6931
                                              gladi    0.1589     libye    0.6931
                                              eleru    0.1589    aussa     0.6931
                                               coel    0.1589     gladi    0.6931
                                              glad     0.1589     glad     0.6931
                                              ellor    0.1589     ellor    0.6931
                                              arcto    0.1464      oeli     0.636
                                               peri    0.1464      oelo     0.636
                                              fatis    0.1464    i dam      0.636
                                              ciuil    0.1464     oties     0.636
                                               tent    0.1464     ic fa     0.636
                                             nefas     0.1464    obore      0.636
                                              fauc     0.1464    adhuc      0.636
                                             susqu     0.1464      iscri    0.636




Table 19
Most important features at the level of character 6-grams according to Importance, Information Gain (IG),
and Information Gain Ratio (IGR), using the raw frequency matrix method (corpora of Catullus, Tibullus,
Propertius, and Lucanus).
                                Raw Frequency char (6, 6) n-grams
                  Feature    Importance Feature         IG     Feature      IGR
                   osque          1          pulos     0.1589   elerum     0.6931
                                              scera    0.1589     ssere    0.6931
                                            lerum      0.1589    iscera    0.6931
                                            opulos     0.1589   celeru     0.6931
                                             bellor    0.1589   s phar     0.6931
                                             causs     0.1589     gladi    0.6931
                                              ssere    0.1589   m popu     0.6931
                                            s phar     0.1589    pulos     0.6931
                                            elerum     0.1589    lerum     0.6931
                                               exit    0.1589    elloru    0.6931
                                             lia be    0.1589    bellor    0.6931
                                            caussa     0.1589   caussa     0.6931
                                             iscera    0.1589   opulos     0.6931
                                              libye    0.1589     libye    0.6931
                                              gladi    0.1589     coeli    0.6931
                                            m popu     0.1589     scera    0.6931
                                              coeli    0.1589    causs     0.6931
                                            celeru     0.1589    lia be    0.6931
                                             elloru    0.1589      exit    0.6931
                                             us for    0.1464   unctas      0.636




Table 20
Most important features at the level of word 1-grams according to Importance, Information Gain (IG),
and Information Gain Ratio (IGR), using the raw frequency matrix method (corpora of Catullus, Tibullus,
Propertius, and Lucanus).
                               Raw Frequency word (1, 1) n-grams
                 Feature    Importance Feature        IG      Feature        IGR
                 pectora          1         populos    0.1589    populis    0.6931
                                           scelerum    0.1589   scelerum    0.6931
                                            populis    0.1589   bellorum    0.6931
                                            mundo      0.1589    uiscera    0.6931
                                           bellorum    0.1589       exit    0.6931
                                               exit    0.1589    populos    0.6931
                                            uiscera    0.1589    mundo      0.6931
                                            senatus    0.1464    caussa      0.636
                                             ciuilia   0.1464   nocentes     0.636
                                              nefas    0.1464   ciuilibus    0.636
                                              fatis    0.1464      coelo     0.636
                                            bellum     0.1388      libye     0.636
                                             milite    0.1388     robore     0.636
                                              ducis    0.1388     superi     0.636
                                           ciuilibus   0.1345     adhuc      0.636
                                             fauces    0.1345      ciuile    0.636
                                             robore    0.1345     fauces     0.636
                                             adhuc     0.1345      coeli     0.636
                                             caussa    0.1345   malorum      0.636
                                              libye    0.1345    potuere    0.5968
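
For readers unfamiliar with the IG and IGR columns, the textbook definitions of information gain and information gain ratio can be sketched as follows. This is an illustration under stated assumptions, not the scoring code behind these tables: the function names are hypothetical, the natural logarithm is an arbitrary choice, and each feature is treated as a categorical value (for example presence or absence of an n-gram), whereas the tables do not state how raw counts were discretised. The sketch is therefore not expected to reproduce the reported figures.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log, an assumption) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in Counter(values).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def gain_ratio(feature_values, labels):
    """Information gain divided by the entropy of the feature itself (split information)."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return information_gain(feature_values, labels) / split_info

# Hypothetical toy data: four documents, two authors, one presence/absence feature.
labels = ["A", "A", "B", "B"]
present = [1, 1, 0, 0]
print(information_gain(present, labels), gain_ratio(present, labels))
```

Dividing by the split information penalises features that fragment the data into many small groups, which is one reason the IG and IGR rankings in the same table need not coincide.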



