=Paper=
{{Paper
|id=Vol-3834/paper121
|storemode=property
|title=Promises from an Inferential Approach in Classical Latin Authorship Attribution
|pdfUrl=https://ceur-ws.org/Vol-3834/paper121.pdf
|volume=Vol-3834
|authors=Giulio Tani Raffaelli
|dblpUrl=https://dblp.org/rec/conf/chr/Raffaelli24
}}
==Promises from an Inferential Approach in Classical Latin Authorship Attribution==
Giulio Tani Raffaelli
Institute of Computer Science, Czech Academy of Sciences, Czech Republic
Abstract
Applying stylometry to Authorship Attribution requires distilling the elements of an author’s style sufficient to recognise their mark in anonymous documents. Often, this is accomplished by contrasting the frequency of selected features in the authors’ works. A recent approach, CP2D, uses innovation processes to infer the author’s identity, accounting for their propensity to introduce new elements. In this paper, we apply CP2D to a corpus of Classical Latin texts to test its effectiveness in a new context and explore the additional insight it can offer the scholar. We show its effectiveness on a corpus of classical Latin texts and how, moving beyond maximum likelihood, we can visualise the stylistic relationships and gather additional information on the relationships among documents.
Keywords
authorship attribution, inference, classical Latin, visualisation
1. Introduction
Despite the development of AI tools [2], the current state of their interpretability [16] and the
need for transparent, versatile approaches to stylometry sustain the continued development
and use of tools based on supervised feature selection [5, 1]. These can be general purpose [10,
7] or specific for the Latin language [5]. In philology, where the ground truth on the authorship
is out of reach, knowing why the style of a document is close to an author’s may be more
interesting than the author’s name.
A common practice [1, 12, 11] to gain an understanding of the relationships among documents is to project them on a 2-dimensional space. This makes it possible to visualise the relative positions of documents or corpora. For example, one such common approach is Correspondence Analysis [3], which is now included in multiple R packages and in tools such as Hyperbase (see footnote 1). It shows at the same time the relationships among documents and how different elements (e.g., words or lemmas) contribute to the positioning.
Footnote 1: https://hyperbase.unice.fr/
CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
Email: tani@cs.cas.cz (G. Tani Raffaelli)
ORCID: 0000-0003-0866-5210 (G. Tani Raffaelli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
The recently proposed CP2D approach [18] applies information theory and innovation processes to propose authorship. Each author is modelled as an information source emitting tokens and characterised by the token frequency in their samples and their tendency to innovate. The representation as a Poisson-Dirichlet process makes it possible to estimate the likelihood that a given author produced the anonymous document. The attribution then follows a Maximum Likelihood approach. If the anonymous text is split into fragments, the researchers either compute the text likelihood by assuming the fragments are independent or let each fragment cast a vote (Majority Rule). The authors test the approach on literary prose in three languages and informal English texts. This approach is interesting as it is transparent and can be applied without relying on language tools (e.g., lemmatisers or large pre-trained models) whose quality can vary dramatically from language to language.
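The two ways of combining fragments can be illustrated with a minimal Python sketch. It assumes that a log-likelihood per fragment and per candidate author is already available; the array `frag_loglik` and both function names are hypothetical and are not part of the CP2D code. Under the independence assumption the fragment log-likelihoods are summed, while under Majority Rule each fragment votes for its own most likely author.

```python
import numpy as np

def attribute_max_likelihood(frag_loglik: np.ndarray) -> int:
    """Sum log-likelihoods over fragments (independence) and pick the best author."""
    return int(np.argmax(frag_loglik.sum(axis=0)))

def attribute_majority_rule(frag_loglik: np.ndarray) -> int:
    """Each fragment votes for its most likely author; ties go to the lowest index."""
    votes = np.argmax(frag_loglik, axis=1)
    counts = np.bincount(votes, minlength=frag_loglik.shape[1])
    return int(np.argmax(counts))

# Hypothetical values: rows are fragments, columns are candidate authors.
frag_loglik = np.array([
    [-700.0, -705.0],
    [-698.0, -702.0],
    [-760.0, -710.0],
])

ml_author = attribute_max_likelihood(frag_loglik)   # driven by the third fragment
mr_author = attribute_majority_rule(frag_loglik)    # two of three fragments agree
```

With these illustrative numbers the two rules disagree, because a single fragment with a very low likelihood dominates the sum but counts for only one vote.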
While this approach has proven effective, its basic formulation does not fully exploit its capabilities. Although the model has few hyperparameters, in the case of small corpora, optimising
the hyperparameters based on the best performance on known texts risks overfitting. In the
same paper, the performance on the test set for the smallest corpus is considerably lower than
on the training set, while on large corpora, it tends to be stable or increase [18, Table 1]. Also,
while using lemmatisers is not necessary, this could still help overcome data sparsity when the
corpus is small. On a different note, even in the case of dubious attribution, the likelihoods
produced by CP2D can offer further insight. The actual distribution of the likelihood values
can help assess the relative position of the document of disputed attribution. This paper has three aims: testing the application of CP2D to Classical Latin poetry, dealing with the risks of overfitting, and proposing a projection to examine the model output directly.
2. Results
The first promising result is the correct attribution of at least 34 out of the 36 documents in
the corpus when following the method used in [18]. We say “at least” because—in the method
described by its proponents—no procedure is suggested to choose among different sets of hy-
perparameters, all sharing the same micro-averaged recall on the training data. For one fourth
of the documents, the same maximum is obtained with at least 15 different hyperparameter
sets. Lacking a way to select a single one based on the training corpus increases the risk of
overfitting (selecting a parameter that is effective on the training set but not on the test).
Considering all sets of hyperparameters that offer performances comparable to the maximum (see Methods section) and choosing the most common author, the correctly attributed documents are 35. This requires accounting for a considerable fraction of all hyperparameters tested. The only document not assigned to its canonical author is the Halieutica from Ovid, whose authorship is indeed debated [8, Chap. 12]. In most cases, the first author is selected with more than 70% of the hyperparameter sets, replacing the potential instability of the simple attribution with a clear rule (see Fig. 1, panel C). Moreover, these results are comparable to the baseline imposters method [14]. The size and relative simplicity of the corpus do not allow us to claim a significant difference.
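The voting rule described above can be sketched in a few lines of Python. The sketch assumes that the author proposed by each comparable hyperparameter set is already known; the function name and the example values are hypothetical. It returns the most common author, the fraction of sets supporting that author, and the number of distinct authors proposed.

```python
from collections import Counter

def vote_over_parameter_sets(proposals):
    """Return (preferred author, fraction of sets voting for it, number of proposed authors)."""
    counts = Counter(proposals)
    author, n_votes = counts.most_common(1)[0]
    return author, n_votes / len(proposals), len(counts)

# Hypothetical outcome for one document: ten comparable hyperparameter sets.
proposals = ["Ovid"] * 8 + ["Propertius"] * 2
preferred, support, n_authors = vote_over_parameter_sets(proposals)
```

The second and third return values correspond to the kind of quantities summarised per document in Fig. 1, panels C and B.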
A second observation is that, while CP2D does not require the use of lemmatisers, we find that, in this corpus, the use of sequences of lemmas instead of words increases the number of sets of hyperparameters that offer performances comparable to the maximum up to one-fifth of all parameters tried (see Fig. 1, panel A). At the same time, simply relying on any set of parameters that gives the best attribution gives 33 correct attributions. In this case, one fourth of the documents has at least 22 best-performing hyperparameter sets. These changes, possibly due to reduced sparsity, require additional care in identifying which parameters to trust. However, for most documents with both kinds of tokens, a single author is identified as the most likely with all sets of hyperparameters (Fig. 1, panel B). Also, even when more than one author is proposed, most hyperparameter sets usually select the same one (Fig. 1, panel C). Considering lemmas, the baseline method has two misclassified documents, the Heroides and the Consolatio ad Liviam, for which Seneca is preferred.
[Figure 1 plot omitted. Three panels comparing token types (Words, Lemmas). Panel A: number of comparable parameter sets. Panel B: number of documents against the number of proposed authors (1 to 4). Panel C: fraction of attributions to the preferred author.]
Figure 1: Statistics of the attribution accounting for multiple equivalent sets of hyperparameters in CP2D. A: Distribution of the number of equivalent sets of hyperparameters changing the definition of the tokens. B: Distribution of the number of authors selected by at least one set of hyperparameters for each document. C: Stability of the attribution of each document to its preferred author while changing sets of hyperparameters. The boxes span from the first to the third quartile of the distribution, and the whiskers extend to the last point that is no more than 1.5 IQR from the nearest box edge. The bar across the boxes marks the median.
The method so far has two ways to offer better insight into the position of each document relative to the candidate authors: how often each author is selected with different sets of hyperparameters or, for each set of hyperparameters in the model, the relative likelihood of the authors or the number of fragments assigned to each one. However, these methods do not make it easy to consider several documents at once. Here, we try to overcome this issue by projecting the documents on a hyper-sphere where the relationships between texts and authors and among texts are encoded as angles.
In Fig. 2, we show the positioning of the documents and texts of uncertain attribution in our
corpus and a sample document. In every plot, we show the position of all documents of the
three closest authors.
We notice how we can have different scenarios with a shared message. The attribution of
Halieutica is either barely to Ovid or to Horace (panels A and B), but other documents from all
authors are always distant from it. The attribution can remain correct over the full range of
variation of the Macro Recall (panels C and D). The difference in recall is driven by documents
of other authors crossing attribution boundaries but does not affect stable attributions. A docu-
ment can change attribution even with the same Macro Recall on the training set (panels G and
H). However, even in panels F and G, when the anonymous document crosses the boundary towards a different author, it remains closer to the books of the actual author than to any other. This suggests that assigning the documents to the author of the nearest document could give better results if cases like panels F and G were common. However, on this corpus, there is no noticeable difference.
Table 1
Correctly attributed documents varying the definition of the tokens and the attribution procedure. For the Naïve CP2D we report the minimum number of correct attributions (see text). The baseline imposters method is applied to the 2000 most common features using the Wurzburg Delta.
Tokens | Nearest Author | Author of nearest doc. | Naïve CP2D | Imposters
Words | 35 | 35 | 34 | 35
Lemmas | 35 | 35 | 33 | 34
Figure 2 shows that the Halieutica seems far from all the authors in our corpus, while the
Heroidum Epistulae and the Consolatio ad Liviam are well integrated into the Ovidian produc-
tion. We also observe how different books from the same collection tend to be grouped. On
a similar note, we can read that while the fourth book of Propertius seems close in style to
its reported author, it may be the least typical in its author’s production. Perhaps unsurpris-
ingly [6], even removing the subdivision in verses for poetical documents, the works in prose
from Seneca form a well-isolated group, and none of the other documents is ever attributed
to Seneca with any choice of parameters. Less obvious is that using lemmas and ignoring the
Heroides, no author is proposed outside Ovid, Propertius and Tibullus for all documents in
elegiac distich.
These observations suggest that closeness between documents in this space is a good proxy
for stylistic similarity. At the same time, being closer to the bottom or one of the top corners
of the graphs indicates similarity to that author.
3. Discussion
We showed that the CP2D approach is also effective on Classical Latin texts, even on a small
and imbalanced corpus. We showed that it is possible to increase the stability of the results
by accounting for the many equivalent sets of hyperparameters and that using lemmas instead
of words expands the subspace of hyperparameters where CP2D has high accuracy. We also
showed how a suitable projection of the documents gives a meaningful representation of the
relationships among documents. This representation can offer insight into the stylistic proper-
ties of the documents. Lastly, we proposed different approaches to attribution leveraging the
distances among documents.
However, this corpus proved to be simple, and it is not possible to judge if some of the pro-
posed alternatives (use of lemmas, authorship of the nearest document) would have a positive
effect on corpora where the simple majority vote over the equivalent set of hyperparameters
is not satisfying. A future challenge will be attributing not entire books but individual poems.
This tougher challenge (some poems are only a few tens of words long) is of greater interest, as often (as in the case of the Heroidum Epistulae) the uncertainty in attribution is mainly on selected poems [17].
Table 2
Documents included in the corpus. All but one of the lemmatised documents were downloaded from Hyperbase as part of the LASLA collection [15]. The lemmatisation of the Consolatio ad Liviam was provided by Noemi Daria Zaccagnino as personal communication.
Author | Title | Short Title
Catullus | Carmina | Catul1-3
Horace | Carmen Saeculare | HorSaecu
Horace | Carmina 1-4 | HorCarm1-4
Ovid | Amores 1-3 | OviAmor1-3
Ovid | Ars Amatoria 1-3 | OviArsA1-3
Ovid | Fasti 1-6 | OvFasti1-6
Ovid | Halieutica | OviHalie
Ovid | Heroidum Epistulae | OviEpist
Ovid | In Ibin | OviIbin
Ovid | Medicamina Faciei Feminae | OviMedic
Ovid | Remedia Amoris | OviRemed
Ovid | Consolatio ad Liviam | OviConsLiv
Propertius | Elegiae 1-4 | Propert1-4
Seneca | Ad Helviam Matrem De Consolatione | SenHelvi
Seneca | Ad Marciam De Consolatione | SenMarci
Seneca | Ad Polybium De Consolatione | SenPolyb
Tibullus | Elegiae 1-3 | TibEleg1-3
4. Methods
We selected a corpus of 34 documents from six different authors writing in Classical Latin for this work. Poems in elegiac distich form the main part of the corpus (works by Ovid, Propertius and Tibullus), followed by other works in various metres (works by Horace and Catullus) and three examples of prose from Seneca (Consolationes: Ad Marciam, Ad Helviam matrem, Ad Polybium). We designed the corpus to be imbalanced (the works of Ovid comprise half of the documents) and divided into literary genres that we expect to challenge the attribution to different extents. The Consolationes in prose from Seneca might show similarities with the Ovidian text on a similar topic. Moreover, the corpus contains four documents considered entirely or partially from a different author. These are the third book of the Elegiae from Tibullus [8, Chapters 8-11] and Ovid’s Halieutica [ibid., Chapters 12-13], the Consolatio ad Liviam [ibid., Chapter 14] and Heroidum Epistulae [ibid., Chapter 15]. The lemmatised sequences are publicly available in the LASLA collection from the University of Liège [15], with the exception of the Consolatio ad Liviam; see Table 2 for the complete list of the documents included. Despite the known relevance of morphosyntactic annotations [9], for this work, we took into consideration only the lemmas. We executed the entire analysis in Python, using standard packages (numpy, scipy) and the cp2d module from [19]. The code is available at https://github.com/GiulioTani/CHR24.
We prepared the texts by removing the separation into verses. The distinction between ‘u’ and ‘v’ is already removed in the documents, and we removed the distinction between upper- and lower-case letters and all the non-alphabetic characters (i.e., punctuation). We considered the sequences of words and lemmas and of $N$-grams with $N \in [3, 6]$ for both. Note that the built-in definition of $N$-grams in CP2D is derived from [13] and allows a space to appear only at the beginning or at the end of the $N$-gram. This definition excludes words or lemmas shorter than $N - 2$ (accounting for spaces at both ends).
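The preprocessing and the N-gram definition can be illustrated with the following sketch. It is one possible reading of the description above, not the implementation in the cp2d module: the function names are hypothetical, and replacing non-alphabetic characters with a space (rather than deleting them) is an assumption.

```python
def normalise(text: str) -> str:
    """Lower-case the text, drop non-alphabetic characters, collapse whitespace."""
    letters_only = "".join(c if c.isalpha() else " " for c in text.lower())
    return " ".join(letters_only.split())

def space_edge_ngrams(text: str, n: int) -> list[str]:
    """Character n-grams in which a space may appear only in the first or last position.

    Intended for n >= 3 on text that has already been normalised.
    """
    padded = " " + text + " "
    grams = []
    for start in range(len(padded) - n + 1):
        gram = padded[start:start + n]
        if " " not in gram[1:-1]:  # no space allowed in the interior
            grams.append(gram)
    return grams

clean = normalise("In arma, virumque!")
grams4 = space_edge_ngrams(clean, 4)  # the two-letter word survives as " in "
grams5 = space_edge_ngrams(clean, 5)  # the two-letter word is shorter than N - 2 and disappears
```

In this toy example the word "in" has length 2, so it fits in a 4-gram with a space on each side but cannot appear in any 5-gram without an interior space.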
As a baseline method, we used the imposters [14] approach built into Stylo [7], using the
top 2000 character 4-grams and the Wurzburg delta.
We followed a nested leave-one-out paradigm to evaluate CP2D’s performance. We chose it
because CP2D works better when the training corpus is as large as possible, and most authors in the
corpus have only 3-5 documents. All the results are obtained by excluding one document at a
time and treating it as anonymous. Then, we optimise the model hyperparameters, maximising
the attribution in a new leave-one-out experiment. Finally, we evaluate the attribution of the
left-out document. This procedure requires each author to have at least three documents. To
this end, we split the book of Catullus into three parts containing 39 carmina each, in order of
appearance. The final corpus contains 36 documents.
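The nested leave-one-out procedure can be sketched as follows. The classifier `toy_attribute` is a deliberately trivial stand-in (it is not CP2D), and the corpus and grid are made-up values; only the structure of the two loops reflects the procedure described above.

```python
from collections import defaultdict

def toy_attribute(train, text, min_len):
    """Stand-in classifier (NOT CP2D): pick the author sharing the most distinct words."""
    words = {w for w in text.split() if len(w) >= min_len}
    overlap = defaultdict(int)
    for author, other in train:
        overlap[author] += len(words & set(other.split()))
    return max(overlap, key=overlap.get)

def loo_accuracy(corpus, params, attribute):
    """Inner loop: fraction of documents attributed correctly when each is left out in turn."""
    hits = 0
    for i, (author, text) in enumerate(corpus):
        train = corpus[:i] + corpus[i + 1:]
        hits += attribute(train, text, params) == author
    return hits / len(corpus)

def nested_loo(corpus, grid, attribute):
    """Outer loop: tune on the remaining documents, then attribute the left-out one."""
    results = []
    for i, (author, text) in enumerate(corpus):
        known = corpus[:i] + corpus[i + 1:]
        scores = [loo_accuracy(known, p, attribute) for p in grid]
        best = max(scores)
        best_sets = [p for p, s in zip(grid, scores) if s == best]
        predicted = attribute(known, text, best_sets[0])  # arbitrary pick among ties
        results.append((author, predicted, len(best_sets)))
    return results

# Hypothetical toy corpus: three documents per author, as the procedure requires.
corpus = [
    ("A", "aaa aab"), ("A", "aab aaa"), ("A", "aaa aac"),
    ("B", "bbb bba"), ("B", "bba bbb"), ("B", "bbb bbc"),
]
results = nested_loo(corpus, grid=[1, 2], attribute=toy_attribute)
```

The third element of each result counts how many hyperparameter sets tie for the best inner score, which is the ambiguity discussed in the Results section.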
The simplest approach to attribution requires searching, for every document, the set of hyperparameters that maximises the attribution on the remaining corpus. We followed the authors in [18] and used a grid search considering two normalisations of $P_0$ (constant and author dependent), five lengths of fragments (full documents, 50, 100, 150 and 300 tokens, as the shortest document contains 339 words), five token definitions (full words and four lengths of $N$-grams), two options for the attribution (Maximum Likelihood and Majority Rule) and 21 values of $\delta$ logarithmically spaced between 0.01 and 100. The left-out document is attributed using (one of) the sets of hyperparameters that give the best accuracy out of the 2100 taken into consideration. The search over the entire space of parameters for the attribution of one document (including the use of lemma and word sequences) takes about two hours on a regular laptop computer (8 × 2.4 GHz CPU, 16 GiB RAM).
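As a check on the size of the search space, the grid can be enumerated with a short sketch. The variable names and the string labels are hypothetical placeholders, not the option names used by the cp2d module; only the number of options per dimension and the range of delta come from the text.

```python
from itertools import product

import numpy as np

# Labels are illustrative placeholders for the options listed in the text.
p0_normalisations = ("constant", "author_dependent")
fragment_lengths = (None, 50, 100, 150, 300)        # None stands for full documents
token_definitions = ("full", 3, 4, 5, 6)            # full tokens or N-grams, N in [3, 6]
attribution_rules = ("maximum_likelihood", "majority_rule")
deltas = np.logspace(-2, 2, 21)                     # 21 values from 0.01 to 100

grid = list(product(p0_normalisations, fragment_lengths,
                    token_definitions, attribution_rules, deltas))
n_sets = len(grid)  # 2 * 5 * 5 * 2 * 21
```

The same grid is explored once for word sequences and once for lemma sequences.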
The first step forward is not to limit the analysis to the set of parameters that offers the best
attribution on the training set but to consider all other sets that provide comparable results.
To determine which results are comparable, we will assume that for every set of parameters,
a “true” probability of correct attribution exists. We sample this probability in a leave-one-out
experiment, but the number of correctly attributed texts can be higher or lower than expected
due to chance. To limit the effect of the class imbalance, we will consider the macro-averaged recall instead of the simple
fraction of correctly attributed books. Taking the best-performing
set of parameters as a reference, we consider all the sets for which the fraction of correctly
attributed texts is at least at the 2.5th percentile in the confidence interval of the best result,
assuming a Bernoulli distribution. This choice will allow us to distinguish cases where the attri-
bution is unanimous and where different authors compete. In this case, every set of parameters
will vote for the final attribution.
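One way to read the selection rule is sketched below. It treats the best score as the success probability of a binomial variable over the training documents and keeps every set whose score reaches the 2.5th percentile of that distribution. This is an assumption: the exact interval construction used in the paper's code may differ, and all names and example scores are hypothetical.

```python
from math import comb

def macro_recall(true_authors, predicted_authors):
    """Average over authors of the fraction of their documents attributed correctly."""
    recalls = []
    for author in set(true_authors):
        idx = [i for i, a in enumerate(true_authors) if a == author]
        recalls.append(sum(predicted_authors[i] == author for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def binomial_quantile(q, n, p):
    """Smallest k such that P(X <= k) >= q for X ~ Binomial(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        if cdf >= q:
            return k
    return n

def comparable_sets(scores, n_docs, q=0.025):
    """Names of the sets whose score is at least the q-quantile implied by the best score."""
    best = max(scores.values())
    threshold = binomial_quantile(q, n_docs, best) / n_docs
    return sorted(name for name, s in scores.items() if s >= threshold)

# Hypothetical scores of three hyperparameter sets on 35 training documents.
scores = {"set_a": 0.97, "set_b": 0.94, "set_c": 0.80}
selected = comparable_sets(scores, n_docs=35)
```

Note that with this plain binomial quantile a perfect best score admits only other perfect scores, so a different interval estimate would be needed to obtain a wider band in that case.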
While this procedure allows attribution, it does not allow comparisons between documents. For every document $t_j$, the software returns the average log-likelihood per token $\frac{1}{N} \log \mathcal{L}(A_i \mid t_j) = \frac{\log \mathcal{L}_i^j}{N}$ of every author $A_i$, with $N$ the number of tokens. These likelihoods are not directly comparable across documents. Indeed, in the leave-one-out approach, each known document of an author and the anonymous one are compared against slightly different versions of the author’s corpus. For each of the $M$ documents of $A_i$, the likelihood $\mathcal{L}(A_i \mid t_j)$ will be computed using a corpus of $M - 1$ documents. For the anonymous document, the reference corpus of $A_i$ contains all $M$ documents.
To compare documents, we will ignore this aspect for two reasons: First, the reference corpus
is meant to reflect the best available description of the author as a proxy for the author’s style.
Each of the different versions represents the author with varying approximations. Second, from
a more technical point of view, the effect on the likelihood of the changing reference corpus
decreases with the size of the corpus itself.
We will now consider minus the inverse of the output of CP2D, i.e., $x_i^j = -N / \log(\mathcal{L}_i^j)$, and treat these as Cartesian coordinates. With this transformation, the most likely author is still associated with the maximum coordinate, and each author identifies with one of the axes in space. The smallest angle $\phi_i^j$ between the document and the axes in the $n$-dimensional space, with $n$ the number of authors, identifies the most likely author. In the limit of $\mathcal{L}_i^j \to 1$ (increasing likelihood of the author), the associated coordinate $x_i^j \to \infty$ and the document moves towards the axis, $\phi_i^j \to 0$.
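The coordinate transformation is simple enough to check numerically. The sketch below assumes that the average log-likelihood per token of each author is available as a negative number; the function name and the example values are hypothetical.

```python
import numpy as np

def to_coordinates(avg_loglik):
    """Map average log-likelihoods per token (negative) to coordinates x = -1 / value."""
    avg_loglik = np.asarray(avg_loglik, dtype=float)
    if np.any(avg_loglik >= 0):
        raise ValueError("average log-likelihoods are expected to be negative")
    return -1.0 / avg_loglik

# Hypothetical output for one document and three candidate authors.
avg_loglik = np.array([-7.2, -6.9, -7.5])
x = to_coordinates(avg_loglik)

most_likely_before = int(np.argmax(avg_loglik))
most_likely_after = int(np.argmax(x))
```

Because the map is increasing on negative values, the author with the highest likelihood also has the largest coordinate, so the Maximum Likelihood attribution is unchanged.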
The same attribution results would be achieved by projecting all the points on the surface of an $n$-ball, i.e., an $(n - 1)$-sphere. Since the variability of the values $\mathcal{L}_i^j$ is limited in practice, most documents are scattered around the $n$-dimensional bisector. Thus, the distance from the origin encodes general information on the typicality of the documents. In the following, we will disregard this information and work only with the $\phi_i^j$ computed as:
$$\phi_i^j = \arctan \frac{\sqrt{\sum_{m=i+1}^{n} \left(x_m^j\right)^2}}{x_i^j} \qquad (1)$$
with $n$ the number of candidate authors and $\phi_n^j = \pi/2 - \phi_{n-1}^j$. We apply this transformation only to the likelihood values computed with the sets of hyperparameters that include Maximum Likelihood attribution, setting aside the attribution with Majority Rule. When computing attribution based on the angle between documents, we use the cosine distance of the $x_i^j$ to determine the nearest document.
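Equation (1) and the nearest-document rule can be sketched with numpy as follows. The function names and the example coordinates are hypothetical; the sketch takes the Cartesian coordinates of each document as given and assumes they are all positive.

```python
import numpy as np

def hyperspherical_angles(x):
    """Angles of Eq. (1) for one document; the last one is pi/2 minus the previous one."""
    x = np.asarray(x, dtype=float)
    n = x.size
    phi = np.empty(n)
    for i in range(n - 1):
        tail = np.sqrt(np.sum(x[i + 1:] ** 2))
        phi[i] = np.arctan2(tail, x[i])  # equals arctan(tail / x[i]) for x[i] > 0
    phi[n - 1] = np.pi / 2 - phi[n - 2]
    return phi

def nearest_document(x_anon, x_known):
    """Index of the known document with the smallest cosine distance from the anonymous one."""
    x_anon = np.asarray(x_anon, dtype=float)
    x_known = np.asarray(x_known, dtype=float)
    cos_sim = x_known @ x_anon / (np.linalg.norm(x_known, axis=1) * np.linalg.norm(x_anon))
    return int(np.argmin(1.0 - cos_sim))

# A document equally likely for three authors lies on the bisector.
phi_bisector = hyperspherical_angles([1.0, 1.0, 1.0])

# Hypothetical coordinates: one anonymous document and two known documents.
nearest = nearest_document([1.0, 1.1, 1.0], [[1.0, 1.0, 1.2], [1.0, 1.12, 1.0]])
```

On the bisector of a three-author space the first angle is arctan(sqrt(2)), about 0.955 rad, and the second is pi/4, about 0.785 rad.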
This measure misses some characteristics of a proper metric. Most notably, the angle be-
tween two documents can be zero without being the same text. If the two texts differ only in
the order of the words and in words that appear only in the individual documents (and with
the same distribution of the frequencies), every author will have the same likelihood for both
texts, which will have zero distance. The distance between texts should not be interpreted as
a measure of their textual difference, as the position in space depends on the relationship with
the authors. However, it can be viewed as a measure of the stylistic difference.
This projection allows us to visualise on 2D paper the relationship with up to three authors without dimensionality reduction (a sphere in 3D has a 2D surface). This natural representation allows visualising decision boundaries, defining regions associated with each author and corresponding to the ML attribution. Moreover, when interested in stylistic relationships and not in attribution, we can use just a single level of the leave-one-out procedure. This means looking at the documents of a group of authors when none of them is treated as anonymous. Here, each document is compared against all others. In practice, in Fig. 2, this is the case of the works of Horace, Propertius and Tibullus in panels A–F.
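For three authors, the decision regions on the sphere can be recovered from the two angles alone. The sketch below inverts the angle definition of Eq. (1) to get a unit direction and assigns the point to the author with the largest coordinate; the function name is hypothetical, and the mapping of the two angles onto the plot axes of Fig. 2 is not assumed here.

```python
import numpy as np

def region(phi_1: float, phi_2: float) -> int:
    """Index (0, 1 or 2) of the author whose region contains the point (phi_1, phi_2)."""
    direction = np.array([
        np.cos(phi_1),                   # x_1 / r
        np.sin(phi_1) * np.cos(phi_2),   # x_2 / r
        np.sin(phi_1) * np.sin(phi_2),   # x_3 / r
    ])
    return int(np.argmax(direction))

# Points slightly off the bisector (phi_1 = arctan(sqrt(2)), phi_2 = pi / 4).
first = region(0.94, np.pi / 4)   # smaller phi_1: closer to the first axis
second = region(0.96, 0.77)       # phi_2 below pi / 4: second author
third = region(0.96, 0.80)        # phi_2 above pi / 4: third author
```

The boundary between the second and third author is the line where the second angle equals pi/4, and the three regions meet at the bisector.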
Acknowledgments
The author thanks Noemi Daria Zaccagnino for providing the lemmatisation of the Consolatio
ad Liviam and other advice in the assembly of the corpus. Her contribution was essential to
the completion of this work.
References
[1] P. Agapitos and A. van Cranenburgh. A Stylometric Analysis of Seneca’s Disputed Plays. Authorship Verification of Octavia and Hercules Oetaeus. Tech. rep. 1. Darmstadt: TU Darmstadt, 2024, 31 pp. doi: 10.26083/tuprints-00027394.
[2] D. Bamman and P. J. Burns. Latin BERT: A Contextual Language Model for Classical Philology. 2020. arXiv: 2009.10053 [cs.CL].
[3] J.-P. Benzécri. L’Analyse des Correspondances. Vol. 2. 2 vols. Paris, Bruxelles, Montreal:
Dunod, 1973. 625 pp.
[4] J.-P. Benzécri. L’Analyse des Données. 2 vols. Paris, Bruxelles, Montreal: Dunod, 1973.
[5] T. J. Bolt, J. H. Flynt, P. Chaudhuri, and J. P. Dexter. “A Stylometry Toolkit for Latin Literature”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. Ed. by S. Padó and R. Huang. Hong Kong, China: Association for Computational Linguistics, 2019, pp. 205–210. doi: 10.18653/v1/D19-3035.
[6] P. Chaudhuri, T. Dasgupta, J. P. Dexter, and K. Iyer. “A small set of stylometric features
differentiates Latin prose and verse”. In: Digital Scholarship in the Humanities 34.4 (2018),
pp. 716–729. doi: 10.1093/llc/fqy070.
[7] M. Eder, J. Rybicki, and M. Kestemont. “Stylometry with R: A Package for Computational
Text Analysis”. In: The R Journal 8.1 (2016), pp. 107–121. doi: 10.32614/rj-2016-007.
[8] T. E. Franklinos and L. Fulkerson. Constructing Authors and Readers in the Appendices Vergiliana, Tibulliana, and Ouidiana. Oxford University Press, 2020. doi: 10.1093/oso/9780198864417.001.0001.
[9] R. Gorman. “Morphosyntactic Annotation in Literary Stylometry”. In: Information 15.4
(2024). doi: 10.3390/info15040211.
[10] P. Juola. “JGAAP: A system for comparative evaluation of authorship attribution”. In:
Journal of the Chicago Colloquium on Digital Humanities and Computer Science. 2009.
doi: 10.6082/m1n29v4z.
[11] F. Karsdorp, M. Kestemont, and A. Riddell. Humanities Data Analysis: Case Studies with
Python. Princeton University Press, 2021.
[12] M. Kestemont, S. Moens, and J. Deploige. “Collaborative authorship in the twelfth century: a stylometric study of Hildegard of Bingen and Guibert of Gembloux”. In: Digital Scholarship in the Humanities 30.2 (2015), pp. 199–224.
[13] M. Koppel, J. Schler, and S. Argamon. “Authorship attribution in the wild”. In: Language
Resources and Evaluation 45.1 (2011), pp. 83–94.
[14] M. Koppel and Y. Winter. “Determining if two documents are written by the same author”.
In: Journal of the Association for Information Science and Technology 65 (1 2014), pp. 178–
187. doi: 10.1002/asi.22954.
[15] D. Longree and M. Fantoli. LASLAfiles_Latin_APNformat. Version V1. 2023. doi: 10.58119/ulg/qjj0sa.
[16] B. Nagy. (Not) Understanding Latin Poetic Style with Deep Learning. 2024. doi: 10.48550/arXiv.2404.06150.
[17] B. Nagy. “Some stylometric remarks on Ovid’s Heroides and the Epistula Sapphus”. In:
Digital Scholarship in the Humanities 38 (3 2023), pp. 1183–1199. doi: 10.1093/llc/fqac098.
[18] G. T. Raffaelli, M. Lalli, and F. Tria. “Inference through innovation processes tested in the authorship attribution task”. In: Communications Physics 7 (1 2024), pp. 1–8. doi: 10.1038/s42005-024-01714-6.
[19] G. Tani Raffaelli, M. Lalli, and F. Tria. GiulioTani/InnovationProcessesInference: Accepted.
Version v1.0.0. 2024. doi: 10.5281/zenodo.12163218.
[Figure 2 plot omitted: eight scatter panels (A–H) showing the positions of the documents, labelled by short title. Legend (Author): Anonymous, Catullus, Horatius, Ovidius, Propertius, Seneca, Tibullus. Horizontal axis: angle with index 1; vertical axis: angle with index 2. Panel titles:]
A (Document: OviHalie): P0: auth.dep., F: 300, f.l., δ: 0.799 (M.R.: 1.0)
B (Document: OviHalie): P0: auth.dep., F: 150, N: 5, δ: 0.602 (M.R.: 1.0)
C (Document: OviEpist): P0: auth.dep., F: 300, N: 5, δ: 0.602 (M.R.: 1.0)
D (Document: OviEpist): P0: fixed, f.d., N: 4, δ: 0.799 (M.R.: 0.9314)
E (Document: OviConsLi): P0: auth.dep., F: 50, N: 5, δ: 0.602 (M.R.: 1.0)
F (Document: OviConsLi): P0: fixed, f.d., N: 6, δ: 0.398 (M.R.: 0.9804)
G (Document: Propert4): P0: auth.dep., F: 150, N: 6, δ: 0.204 (M.R.: 1.0)
H (Document: Propert4): P0: auth.dep., F: 150, N: 5, δ: 0.602 (M.R.: 1.0)
Figure 2: Positions in the space of three documents of debated authorship (Halieutica, Consolatio ad Liviam and Heroidum Epistulae from Ovid) and one of unstable attribution (the fourth book of the Elegiae from Propertius), using lemmas. The dotted lines correspond to equal likelihood of the authors on the two sides and mark the decision boundary when using direct attribution based on likelihood. In all panels, documents towards the bottom of the plots would be assigned to Tibullus. Documents in the top-left corner would go to Catullus (A), Horace (B) or Ovid (C–H). Documents in the top-right corner would go to Horace (A), Ovid (B) or Propertius (C–H). The title of every panel reports the set of parameters used for plotting and the macro-averaged recall (M.R.). The normalisation of $P_0$ is either fixed or author-dependent (auth.dep.), ‘F’ is the number of tokens per fragment (f.d. for full documents), ‘N’ is the size of $N$-grams (f.l. for full lemmas), and ‘$\delta$’ is the correction to $P_0$. We chose these sets of parameters for illustrative purposes among those offering results comparable to the maximum. The scale bars in the bottom left corner of each panel have a fixed size of 0.001 rad.