       Using and evaluating TRACER for an Index fontium computatus
              of the Summa contra Gentiles of Thomas Aquinas
                          Greta Franzini, Marco Passarotti
                          Università Cattolica del Sacro Cuore
                                  Maria Moritz, Marco Büchler
                                 Georg-August-Universität Göttingen

Abstract

English. This article describes a computational text reuse study on Latin texts designed to evaluate the performance of TRACER, a language-agnostic text reuse detection engine. As a case study, we use the Index Thomisticus as a gold standard to measure the performance of the tool in identifying text reuse between Thomas Aquinas’ Summa contra Gentiles and his sources.

Italiano. Questo articolo descrive un’analisi computazionale effettuata su testi latini volta a valutare le prestazioni di TRACER, uno strumento “language-agnostic” per l’identificazione automatica del riuso testuale. Il caso studio scelto a tale scopo si avvale dell’Index Thomisticus quale gold standard per verificare l’efficacia di TRACER nel recupero di citazioni delle fonti della Summa contra Gentiles di Tommaso d’Aquino.

1   Introduction

Thomas Aquinas (1225-1274) was a prolific medieval author from Italy: his 118 works, known as the Corpus Thomisticum, amount to 8,767,883 words (Portalupi, 1994, p. 583) and discuss a variety of topics, ranging from metaphysics to legal, political and moral theory (Kretzmann and Stump, 1993). The web of references to biblical, ecclesiastical and classical literature that stretches across the whole Corpus Thomisticum speaks to daunting erudition. In the late 1940s, Humanities Computing pioneer Father Roberto Busa (1913-2011) spearheaded a scholarly effort, known as the Index Thomisticus, to manually annotate reuse, both explicit (i.e., explicitly introduced by Aquinas as a quote) and implicit (i.e., reference to works without quotation), in the texts of Thomas Aquinas (Busa, 1980). Four decades later, Portalupi noted:

    Ancora più difficile sarà [. . .] il tentativo di confrontare automaticamente tutto Tommaso con tutti i testi di uno o più autori, per rintracciare in modo globale la presenza implicita di una fonte. Per fare questo occorrerebbe che si verificassero due condizioni: in primo luogo, gli autori di cui si studiano le presenze implicite in Tommaso dovrebbero essere informatizzati e interrogabili nella totalità delle loro opere; in secondo luogo, bisognerebbe disporre di un software molto potente e raffinato. (Portalupi, 1994, p. 583)¹

Today, this once visionary task is conceivable, paving the way for studies such as the present one, which poses the following research question: to what extent can historical text reuse detection (HTRD) software detect explicit and implicit text reuse in the writings of Thomas Aquinas? To this end, we test the performance of TRACER, a text reuse detection framework, for the creation of an Index fontium computatus (a computed index of text reuse). The Summa contra Gentiles (ScG) was chosen as a case study because the critical edition used for the Index Thomisticus, the 1961 Marietti Editio Leonina (Gauthier et al., 1882), is still in use today and because an ongoing treebanking effort of the text will, in future, provide us with the linguistic data needed to further refine the experiments described here (Passarotti, 2011).

   1. Our English translation reads: ‘It will be even harder to automatically compare all of Thomas against all of the texts of one or multiple authors to check for the presence of implicit sources. Such a task would only be possible under two conditions: firstly, the texts of the authors quoted by Thomas would have to be digitised and searchable in their entirety; secondly, one would need very powerful and sophisticated software’.
2   Related Work

2.1   The significance of text reuse

Text reuse (TR) can be summarily described as the written repetition or borrowing of text and can take different forms. Büchler et al. (2014) separate syntactic TR, such as (near-)verbatim quotations or idiomatic expressions, from semantic TR, which can manifest itself as a paraphrase, an allusion or other loose reproduction. The study of quotation is key to any philological examination of a text, as it is not only indicative of the intellectual and cultural endowment of an author, but may also shed light on the sources used, the relation between works and literary influence. Crucially, quotations may also preserve text that is now lost, thus facilitating efforts of textual reconstruction.² Owing to the magnitude of the task, the publication of a work’s complete index of references, conventionally known as Apparatus fontium or Index scriptorum, is rare (Portalupi, 1994, p. 582).

   2. One notable example is the fragmentary survival of Alexandrian scholarship at the hands of Roman philologists (who wrote commentaries known as scholia) and grammarians (Turner, 2014, p. 16).

2.2   Text reuse in Thomas Aquinas

Like the works of many of his Christian predecessors, Aquinas’ body of work teems with references to secular and Christian literature alike. In the ScG (1259-1265) Aquinas cites 170 works both explicitly and implicitly (Gauthier et al., 1882, vols. IV-XV). Explicit quotations provide information about the source text and the author and/or work, and can either be direct or indirect (Gauthier et al., 1882, vol. XVI, pp. XVI-XXII). Implicit reuses, in the ScG and in general, are more elusive, as they are almost never syntactically or lexically faithful to the original text, thus making them hard for both machines and humans to spot (Portalupi, 1994, p. 582).³ Durantel notes that Aquinas’ tendency in TR is to borrow only what is necessary to fit the flow of his narrative without significant semantic or syntactic deviation from the original (Durantel, 1919, p. 63). And yet, Pelster’s observation on Aquinas’ paraphrastic reuse of Aristotle might suggest greater deviation (Pelster, 1935, p. 331).⁴

Roberto Busa’s effort in the late 1940s resulted in the creation of the Index Thomisticus, a manually-lemmatised version of Thomas Aquinas’ opera omnia (Jones, 2016). Among the annotations, the Index Thomisticus tags tokens forming explicit quotations as QL if literal (ad litteram) and QS if a paraphrase (ad sensum), and tokens forming implicit quotations as QR to indicate a reference or citation alluding to another text. An example quotation in the ScG containing a mixed annotation is:

    [. . .] ratio(QL) vero(QL) significata(QL) per(QL) nomen(QL) est(QL) definitio(QL) secundum(QR) philosophum(QR) in(QR) IV(QR) Metaph.(QR)⁵

The (QL) portion of this example contains the literal quote, while the second (QR) portion provides the reference.

   3. For problems with implicit quotations, see Haverfield (1916, p. 197) and Fowler (1997, p. 15). For automatic allusion detection, see Bamman and Crane (2008).
   4. “Da Thomas die Schriften des Aristoteles [. . .] gewöhnlich nur dem Gedanken nach, nicht wörtlich anführt.” In English: ‘Since Thomas usually quotes the writings of Aristotle [. . .] paraphrastically, not literally.’
   5. Book 1, chap. 12, n. 4. Our English translation reads: ‘[. . .] according to the philosopher in Metaph. IV, the meaning of a name is its definition’.

2.3   Historical text reuse detection

HTRD is a Natural Language Processing (NLP) task aimed at identifying syntactic and semantic TR in historical sources. The computational analysis of historical languages is particularly challenging as the tools at our disposal are often trained on a synchronic rather than diachronic state of a language⁶ and on controlled textual corpora. Eger et al. (2015) and Passarotti (2010) tested the performance of seven different taggers, including TreeTagger (Schmid, 1994), for different training sets and tag-sets of medieval (church) Latin texts, reporting accuracies of just under 96% and of 96.75% for PoS-tagging, and of around 90% and of 89.90% for morphological analysis, respectively. These results have yet to be generalised to other variants of Latin and can be improved upon with the provision of additional training corpora, treebanked and semantically-tagged, the creation of corpora containing intertexts, or with the expansion of lexical resources, such as the Latin WordNet (Minozzi, 2017, p. 130).

The extent to which the limitations of these resources and taggers (e.g., in the resolution of homographs) affect HTRD tools, including Tesserae (Coffee et al., 2013), Passim (Smith et al., 2015)⁷ and TRACER (Büchler, 2013), is not yet fully understood. Reasons for this are the field’s lack of progress caused by “inconsistent standards and the scattering of insights across publications” (Coffee, 2018), the general failure of HTRD studies to publish negative results, and the quasi-absence of gold standards for testing. To our knowledge, the only projects to have published computed results from intertextual studies on historical sources are the Proteus Project (English and Latin) (Yalniz et al., 2011), the Chinese Text Project (early Chinese) (Sturgeon, 2017), Commonplace Cultures (English and Latin) (Gladstone and Cooney, forthcoming), SHEBANQ (Hebrew) (Naaijer and Roorda, 2016), Samtla (Search and Mining Tools for Language Archives) (language-independent) (Harris et al., 2018), and Tesserae (Latin), but of these only the last discloses tool configurations.

   6. See Janda and Joseph (2005) for the dichotomy.
   7. https://github.com/dasmiq/passim

3   Methodology

3.1   Gold Standard

To facilitate the classification of automatically-detected reuse, all QL-, QS- and QR-annotated tokens were extracted from the Index Thomisticus. Of the total 24,416 sentences constituting the ScG, the 7,396 (30.29%) containing any combination of QL, QS and QR were stored in a tabular file, which we define as the Index Thomisticus Gold Standard of TR (hereafter IT-GS). The number of sentences containing only QL tokens (1,139) compared to that of sentences containing only QS tokens (2,270) corroborates expert assertions about Aquinas’ paraphrastic style of TR.

3.2   Text acquisition and preparation

For the sake of processing efficiency, out of the ScG’s 170 source works we began with a set of five readily available texts. These are Philosophiae Consolationis and De Trinitate of Boethius, De Deo Socratis of Apuleius, Cicero’s De Divinatione and the Moerbeke Latin translation of Aristotle’s Metaphysica. The texts were acquired from different sources and cleaned of all paratextual information. The clean texts were then segmented by sentence, PoS-tagged and lemmatised with the TreeTagger Brandolini parameter file (with an average accuracy of 93.72%), whose tag-set provides the degree of granularity needed in this experiment.⁸ Finally, a script was used to format sentences to TRACER requirements.

   8. The Brandolini tag-set was manually mapped against that of Morpheus (Crane, 1991), which TRACER uses as a reference. Ambiguously-lemmatised word forms were not disambiguated.

3.3   Text reuse detection with TRACER

The HTRD on this corpus was performed (server-side) with TRACER, a language-agnostic framework comprising hundreds of information retrieval (IR) algorithms designed to work with historical and modern languages alike.⁹ TRACER is a Java command-line tool driven by an XML configuration file, which users can modify to fit their detection needs. TRACER follows a six-step architecture,¹⁰ which demystifies the detection process by storing the computed output of each step on disk so that users can more easily follow the processing chain and locate errors in it, if any. TRACER is resilient to OCR-noise and capable of detecting both (near-)verbatim quotations and looser forms of TR. The detection of paraphrase requires the use of linguistic resources to help TRACER match a word against its synsets and an inflected form against its base-form. For synonym detection, we extracted synonymous relations from the Latin WordNet. TR identified with TRACER was manually compared against the IT-GS to separate the True Positives (TP) from the False Positives (FP), and to identify False Negatives (FN).

   9. https://doi.org/21.11101/
   10. The six steps are: Preprocessing, Featuring, Selection, Linking, Scoring and Postprocessing.

4   Results

4.1   Philosophiae Consolationis

To detect both verbatim quotations and paraphrase, TRACER was optimised for recall over precision and configured to work with single words as features, to ignore the top 20% most frequent words,¹¹ to link text pairs with a minimum overlap of 5 features,¹² to expand the query to synonyms, and to return only those aligned text pairs presenting an overall sentence similarity of at least 50%.¹³ Of the eight reuses indicated in the Editio Leonina, we were unable to precisely locate one as it alludes to four paragraphs of text;¹⁴ of the remaining seven, as shown in Figure 1, TRACER identified three (43%). Upon close inspection, two FNs were affected by the 20% threshold of feature removal, for example:

    Boethius 1.4.105: Unde haud iniuria tuorum quidam familiarium quaesivit: “Si quidem deus”, inquit, “est, unde mala?”¹⁵

    Aquinas 3.71.10: [. . .] introducit quendam philosophum quaerentem: si deus est, unde malum?¹⁶

Here, the tokens si, est and unde were ignored as they fell within the pool of the 20% most frequent words removed.

A third FN was matched on the basis of feature overlap but did not reach the 50% sentence similarity threshold; the fourth could not be identified because of a missing synonymous relation in the Latin WordNet (i.e., gaudium-beatitudo)¹⁷ and its insufficient feature overlap. The resulting F1-score is 4.6 · 10⁻³.

   11. The parameter, known as feature density, is a language-independent measure used to decontaminate the texts and to contain the number of results based on chance repetition; an 80% feature density means that TRACER ignores or removes the most frequent types that cover 20% of the tokens.
   12. For a 24k sentence corpus such as this, an overlap of 5 is statistically significant (Büchler, 2013, p. 134).
   13. The value was chosen on the basis of previous experiments as a good trade-off between precision and recall. The similarity measure used is Broder’s containment, which is particularly suited to documents or sentences of uneven length (Broder, 1997).
   14. This reuse would have doubtless been overlooked by TRACER too owing to the absence of features to compare.
   15. Our English translation reads: ‘It is not wrong that a certain acquaintance of yours has questioned: “If in fact God exists,” he asks, “where is evil from?”’
   16. Our English translation reads: ‘(Boethius) introduces a certain philosopher who asks: “If God exists, where is evil from?”’
   17. Incidentally, this relation is mapped neither in BabelNet (bn:00042905n) nor in ConceptNet (http://conceptnet.io/c/la/gaudium) (as of 8 June 2018).

4.2   De Trinitate

Given the results of the previous analysis, for this second investigation the feature removal and the sentence similarity values were lowered to 10% and 40% respectively, thus optimising for even higher recall (10,349 total sentences aligned). Of the four known reuses, TRACER identified three. The 40% similarity threshold was essential to the identification of one reuse (where the score is 0.4375); the FN, which was indeed found on the basis of an eight-word overlap but did not meet the minimum sentence similarity threshold, revealed another missing synonymous relation in the WordNet (i.e., disciplinatus-eruditus)¹⁸ and a failed alignment of the variants temptare (Boethius) and tentare (Aquinas) owing to inconsistent TreeTagger lemmatisation (tempto and tento, respectively). The F1-score for this analysis was 5.6 · 10⁻⁴.

   18. Also present in neither BabelNet nor ConceptNet.

4.3   De Deo Socratis

This work of Apuleius is quoted twice in the ScG. Of the two reuses, TRACER was able to detect one in full and only parts of the second. The second reuse spans three sentences and is mostly paraphrastic, with only three words annotated in the Index Thomisticus as QL (sunt animo passiva).¹⁹ To capture the fullest range of reuse diversity, TRACER’s feature removal was set to 10%, the overlap to 3 and the overall similarity to 20%. However, as sunt (form of the verb sum ‘to be’) is the most frequent word across the texts, TRACER’s inbuilt feature removal prevented the detection of the short QL portion of the reuse; the QR+QS portions, on the other hand, were successfully detected. We counted both results as TPs, resulting in an F1-score of 2.6 · 10⁻⁵.

   19. [daemones] [. . .] sunt animo passiva or ‘demons are emotional in mind’ (Jones, 2017, pp. 372-373).

4.4   De Divinatione

The only recorded reuse that Aquinas makes of Cicero’s text is implicit and alludes to a block of text, making it difficult to manually pinpoint with precision. To detect as loose a similarity as possible, the TRACER search was cast with the same configuration used in the previous analysis. No reuse, however, was found.

4.5   Metaphysica

The Editio Leonina lists 97 reuses of Aristotle’s Metaphysica. As previously mentioned, Pelster describes Aquinas’ reuse of the Latin translation of the Metaphysica as more paraphrastic than literal. Our manual examination of the texts and the results of TRACER confirmed this observation, in that we could not manually locate seven reuses (due to their strong allusiveness) and a fault-tolerant TRACER configuration (removal of the top 10% most frequent words, overlap of 3 features and an overall sentence similarity of 40%) yielded 19 TPs only (6 out of 15 QL²⁰ and 13 out of 75 QR+QS). The F1-score resulting from this analysis is 3.8 · 10⁻⁴.

   20. The QL quotations in the ScG seem to refer to a different Latin translation than that available to us, which would explain why some instances of QL went undetected.
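
The interplay of feature removal, minimum overlap and containment-based similarity described in Section 4.1 can be illustrated with a minimal Python sketch. This is not TRACER's implementation (TRACER is a Java tool configured through an XML file): the function names, the hand-assigned lemmas for the Boethius/Aquinas pair, the cut-off rule for frequent types and the choice of the smaller feature set as the containment denominator are all assumptions made for illustration.

```python
from collections import Counter


def most_frequent_types(sentences, removal_share):
    """Most frequent types whose cumulative token share stays within removal_share.

    Assumption: this follows the wording of the feature density footnote
    (types covering a given share of the tokens); TRACER's rule may differ.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    removed, covered = set(), 0
    for typ, n in counts.most_common():
        if (covered + n) / total > removal_share:
            break
        removed.add(typ)
        covered += n
    return removed


def containment(a, b):
    """Broder's containment of feature set a in feature set b: |a & b| / |a|."""
    return len(a & b) / len(a) if a else 0.0


def link(source, target, stoplist, min_overlap, min_similarity):
    """Return (linked, overlap, score) for two lemmatised sentences."""
    a, b = set(source) - stoplist, set(target) - stoplist
    overlap = len(a & b)
    # Assumption: score the smaller feature set against the larger one.
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    score = containment(smaller, larger)
    return overlap >= min_overlap and score >= min_similarity, overlap, score


# Hand-assigned, illustrative lemmas (not TreeTagger output).
boethius = ["unde", "haud", "iniuria", "tuus", "quidam", "familiaris", "quaero",
            "si", "quidem", "deus", "inquam", "sum", "unde", "malum"]
aquinas = ["introduco", "quidam", "philosophus", "quaero",
           "si", "deus", "sum", "unde", "malum"]

# si, est (lemma sum) and unde are the tokens reported as removed in Section 4.1.
removed = {"si", "sum", "unde"}

with_removal = link(boethius, aquinas, removed, min_overlap=5, min_similarity=0.5)
without_removal = link(boethius, aquinas, set(), min_overlap=5, min_similarity=0.5)
print(with_removal)
print(without_removal)
```

Under these assumed lemmas, removing si, sum and unde reduces the shared features from seven to four, which falls below the minimum overlap of 5; this is in line with the explanation given for this FN, although the actual TRACER features may have differed.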
Figure 1 – For every TRACER analysis, a MySQL table is created to store and manually evaluate the
results against the IT-GS. The evaluation table for Philosophiae Consolationis illustrated here contains
a wealth of information, including full citation information for both works, the TRACER settings used
for the detection task, the Index Thomisticus quotation annotations, the result classification (into True
Positive and False Negative), as well as the feature overlap and the overall similarity value of the aligned
sentences. The reuse in the highlighted row, for instance, was correctly identified by TRACER on the
basis of a 9-word overlap and an overall sentence similarity of 90%.
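
The F1-scores reported in Section 4 follow from the TP, FP and FN counts recorded in evaluation tables such as the one in Figure 1. The sketch below shows the standard computation in Python. The TP and FN values are those given for Philosophiae Consolationis; the FP count is not reported in the text, so the value used here is a placeholder chosen purely for illustration, and the function name is our own.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# TP and FN as given in Section 4.1 (3 of the 7 located reuses were found).
tp, fn = 3, 4
# Placeholder only: the number of FPs is NOT stated in the text.
fp_placeholder = 1000

precision, recall, f1 = precision_recall_f1(tp, fp_placeholder, fn)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

With a recall-oriented configuration, a large number of FPs pulls precision, and with it F1, towards zero even when recall is moderate, which is one plausible reading of the small F1 values reported above.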

5   Discussion

Our results show that the FNs emerging from the computational analyses were largely caused by Aquinas’ paraphrastic and allusive TR style, which at times challenged our own ability to spot similarities, even with the help of the critical edition. The allusions that we could identify generally retain the semantics of the alluded-to texts, thus confirming Durantel’s insights. While a number of these negative results were also directly tied to lacunae in the Latin WordNet and to inconsistent lemmatisation, the flexibility and methodological transparency of TRACER allowed us to locate error sources and accordingly tune configurations to work around these issues (e.g., by increasing the feature overlap and/or lowering the sentence similarity scoring thresholds). Even so, TRACER’s panlingual feature removal parameter affected the retrieval of shorter instances of reuse, particularly those containing forms of the highly frequent verb sum.

The manual evaluation of TRACER results against the IT-GS for the creation of an Index fontium computatus was time-consuming, not least because of a number of reference inaccuracies in the critical edition itself (in one case, the reference is off by ten lines). Nevertheless, the creation of the index is proving essential to the assessment of TRACER’s fitness for purpose on Latin texts.

As far as the usability of the tool is concerned, TRACER’s detection power is offset by its cumbersome setup, which is unfriendly to those who are not familiar with the command line, NLP basics and/or Java (stack traces). This issue is being addressed with the development of a user manual (Franzini et al., 2018).

6   Conclusion

This article describes a computational text reuse study on Latin texts designed to evaluate the performance of TRACER, a language-agnostic IR text reuse detection engine. The results obtained were manually evaluated against a gold standard and are contributing to the creation of an Index fontium computatus, both to assess TRACER’s efficacy and to provide a test-bed against which analogous IR systems can be measured and thus compared to TRACER. Our study shows that despite the known limitations of existing linguistic resources for Latin, the diverse spectrum of paraphrastic reuse encountered and its own language-agnosticism, TRACER is equipped to detect a wide range of explicit text reuse in the ScG, be that short or long, verbatim or paraphrastic, and implicit reuse only if coupled with explicit. To increase the detection accuracy, we are implementing a black/white list to give users the power to control words or multi-word expressions to be ignored or retained in the detection; furthermore, we plan on re-running these analyses with the disambiguated linguistic annotation currently being added to the text of the ScG (Passarotti, 2015) to measure its impact on this particular IR task.

The data used and generated in the current study is available from: https://github.com/CIRCSE/text-reuse-aquinas.

The authors would like to thank Eleonora Litta for proofreading this article and the anonymous reviewers for their valuable comments. This research was funded by the German Federal Ministry of Education and Research (No. 01UG1409).
