Using and evaluating TRACER for an Index fontium computatus of the Summa contra Gentiles of Thomas Aquinas

Greta Franzini, Marco Passarotti
Università Cattolica del Sacro Cuore
{greta.franzini,marco.passarotti}@unicatt.it

Maria Moritz, Marco Büchler
Georg-August-Universität Göttingen
{mmoritz,mbuechler}@etrap.eu

Abstract

This article describes a computational text reuse study on Latin texts designed to evaluate the performance of TRACER, a language-agnostic engine for the automatic detection of text reuse. As a case study, we use the Index Thomisticus as a gold standard to measure the performance of the tool in identifying text reuse between Thomas Aquinas' Summa contra Gentiles and his sources.

1 Introduction

Thomas Aquinas (1225-1274) was a prolific medieval author from Italy: his 118 works, known as the Corpus Thomisticum, amount to 8,767,883 words (Portalupi, 1994, p. 583) and discuss a variety of topics, ranging from metaphysical to legal, political and moral theory (Kretzmann and Stump, 1993). The web of references to biblical, ecclesiastical and classical literature that stretches across the whole Corpus Thomisticum speaks to daunting erudition. In the late 1940s, Humanities Computing pioneer Father Roberto Busa (1913-2011) spearheaded a scholarly effort, known as the Index Thomisticus, to manually annotate reuse, both explicit (i.e., explicitly introduced by Aquinas as a quote) and implicit (i.e., reference to works without quotation), in the texts of Thomas Aquinas (Busa, 1980). Four decades later, Portalupi noted:

    Ancora più difficile sarà [. . .] il tentativo di confrontare automaticamente tutto Tommaso con tutti i testi di uno o più autori, per rintracciare in modo globale la presenza implicita di una fonte. Per fare questo occorrerebbe che si verificassero due condizioni: in primo luogo, gli autori di cui si studiano le presenze implicite in Tommaso dovrebbero essere informatizzati e interrogabili nella totalità delle loro opere; in secondo luogo, bisognerebbe disporre di un software molto potente e raffinato. (Portalupi, 1994, p. 583) [1]

[1] Our English translation reads: 'It will be even harder to automatically compare all of Thomas against all of the texts of one or multiple authors to check for the presence of implicit sources. Such a task would only be possible under two conditions: firstly, the texts of the authors quoted by Thomas would have to be digitised and searchable in their entirety; secondly, one would need very powerful and sophisticated software'.

Today, a once visionary task is conceivable, giving way to studies such as the present one, which poses the following research question: to what extent can historical text reuse detection (HTRD) software detect explicit and implicit text reuse in the writings of Thomas Aquinas? To this end, we test the performance of TRACER, a text reuse detection framework, for the creation of an Index fontium computatus (a computed index of text reuse). The Summa contra Gentiles (ScG) was chosen as a case study because the critical edition used for the Index Thomisticus, the 1961 Marietti Editio Leonina (Gauthier et al., 1882), is still in use today and because an ongoing treebanking effort of the text will, in future, provide us with the linguistic data needed to further refine the experiments described here (Passarotti, 2011).

2 Related Work

2.1 The significance of text reuse

Text reuse (TR) can be summarily described as the written repetition or borrowing of text and can take different forms. Büchler et al. (2014) separate syntactic TR, such as (near-)verbatim quotations or idiomatic expressions, from semantic TR, which can manifest itself as a paraphrase, an allusion or other loose reproduction. The study of quotation is key to any philological examination of a text, as it is not only indicative of the intellectual and cultural endowment of an author, but may shed light on the sources used, the relation between works and literary influence. Crucially, quotations may also preserve text that is now lost, thus facilitating efforts of textual reconstruction. [2] Owing to the magnitude of the task, the publication of a work's complete index of references, conventionally known as Apparatus fontium or Index scriptorum, is rare (Portalupi, 1994, p. 582).

[2] One notable example is the fragmentary survival of Alexandrian scholarship at the hands of Roman philologists (who wrote commentaries known as scholia) and grammarians (Turner, 2014, p. 16).

2.2 Text reuse in Thomas Aquinas

As with many of his Christian predecessors, Aquinas' body of work teems with references to secular and Christian literature alike. In the ScG (1259-1265) Aquinas cites 170 works both explicitly and implicitly (Gauthier et al., 1882, Vols. IV-XV). Explicit quotations provide information about the source text and the author and/or work, and can be either direct or indirect (Gauthier et al., 1882, vol. XVI, pp. XVI-XXII). Implicit reuses, in the ScG and in general, are more elusive, as they are almost never syntactically or lexically faithful to the original text, thus making them hard for both machines and humans to spot (Portalupi, 1994, p. 582). [3] Durantel notes that Aquinas' tendency in TR is to borrow only what is necessary to fit the flow of his narrative without significant semantic or syntactic deviation from the original (Durantel, 1919, p. 63). And yet, Pelster's observation on Aquinas' paraphrastic reuse of Aristotle might suggest greater deviation (Pelster, 1935, p. 331). [4]

[3] For problems with implicit quotations, see (Haverfield, 1916, p. 197) and (Fowler, 1997, p. 15). For automatic allusion detection, see (Bamman and Crane, 2008).
[4] "Da Thomas die Schriften des Aristoteles [. . .] gewöhnlich nur dem Gedanken nach, nicht wörtlich anführt." In English: 'Since Thomas usually quotes the writings of Aristotle [. . .] only according to their sense, not literally.'

Roberto Busa's effort in the late 1940s resulted in the creation of the Index Thomisticus, a manually-lemmatised version of Thomas Aquinas' opera omnia (Jones, 2016). Among the annotations, the Index Thomisticus tags tokens forming explicit quotations as QL if literal (ad litteram) and QS if a paraphrase (ad sensum), and tokens forming implicit quotations as QR to indicate a reference or citation alluding to another text. An example quotation in the ScG containing a mixed annotation is:

    [. . .] ratio(QL) vero(QL) significata(QL) per(QL) nomen(QL) est(QL) definitio(QL) secundum(QR) philosophum(QR) in(QR) IV(QR) Metaph.(QR) [5]

[5] Book 1, chap. 12, n. 4. Our English translation reads: '[. . .] according to the philosopher in Metaph. IV, the meaning of a name is its definition'.

The first (QL) portion of this example contains the literal quote, while the second (QR) portion provides the reference.

2.3 Historical text reuse detection

HTRD is a Natural Language Processing (NLP) task aimed at identifying syntactic and semantic TR in historical sources. The computational analysis of historical languages is particularly challenging as the tools at our disposal are often trained on a synchronic rather than diachronic state of a language [6] and on controlled textual corpora. Eger et al. (2015) and Passarotti (2010) tested the performance of seven different taggers, including TreeTagger (Schmid, 1994), for different training sets and tag-sets of medieval (church) Latin texts, showing accuracies just below 96% and 96.75% for PoS-tagging, and around 90% and 89.90% for morphological analysis, respectively. These results have yet to be generalised to other variants of Latin and can be improved upon with the provision of additional training corpora, treebanked and semantically-tagged, the creation of corpora containing intertexts, or with the expansion of lexical resources, such as the Latin WordNet (Minozzi, 2017, p. 130).

[6] See Janda and Joseph (2005) for the dichotomy.

The extent to which the limitations of these resources and taggers (e.g., correct resolution of homographs) affect HTRD tools, including Tesserae (Coffee et al., 2013), Passim (Smith et al., 2015) [7] and TRACER (Büchler, 2013), is not yet fully understood. Reasons for this are the field's lack of progress caused by "inconsistent standards and the scattering of insights across publications" (Coffee, 2018), the general failure of HTRD studies to publish negative results, and the quasi-absence of gold standards for testing. To our knowledge, the only projects to have published computed results from intertextual studies on historical sources are the Proteus Project (English and Latin) (Yalniz et al., 2011), the Chinese Text Project (early Chinese) (Sturgeon, 2017), Commonplace Cultures (English and Latin) (Gladstone and Cooney, forthcoming), SHEBANQ (Hebrew) (Naaijer and Roorda, 2016), Samtla (Search and Mining Tools for Language Archives) (language-independent) (Harris et al., 2018), and Tesserae (Latin), but of these only the last discloses tool configurations.

[7] https://github.com/dasmiq/passim

3 Methodology

3.1 Gold Standard

To facilitate the classification of automatically-detected reuse, all QL-, QS- and QR-annotated tokens were extracted from the Index Thomisticus. Of the total 24,416 sentences constituting the ScG, the 7,396 (30.29%) containing any combination of QL, QS and QR were stored in a tabular file, which we define as the Index Thomisticus Gold Standard of TR (hereafter IT-GS). The number of sentences containing only QL tokens (1,139) compared to that of sentences containing only QS tokens (2,270) corroborates expert assertions about Aquinas' paraphrastic style of TR.

3.2 Text acquisition and preparation

For the sake of processing efficiency, out of the ScG's 170 source works we began with a set of five readily available texts. These are Philosophiae Consolationis and De Trinitate of Boethius, De Deo Socratis of Apuleius, Cicero's De Divinatione and the Moerbeke Latin translation of Aristotle's Metaphysica. The texts were acquired from different sources and cleaned of all paratextual information. The clean texts were then segmented by sentence, PoS-tagged and lemmatised with the TreeTagger Brandolini parameter file (with an average accuracy of 93.72%), whose tag-set provides the degree of granularity needed in this experiment. [8] Finally, a script was used to format sentences to TRACER requirements.

[8] The Brandolini tag-set was manually mapped against that of Morpheus (Crane, 1991), which TRACER uses as a reference. Ambiguously-lemmatised word forms were not disambiguated.

3.3 Text reuse detection with TRACER

The HTRD on this corpus was performed (server-side) with TRACER, a language-agnostic framework comprising hundreds of information retrieval (IR) algorithms designed to work with historical and modern languages alike. [9] TRACER is a Java command-line tool driven by an XML configuration file, which users can modify to fit their detection needs. TRACER follows a six-step architecture, [10] which demystifies the detection process by storing the computed output of each step on disk so that users can more easily follow the processing chain and locate any errors in it. TRACER is resilient to OCR noise and capable of detecting both (near-)verbatim quotations and looser forms of TR. The detection of paraphrase requires the use of linguistic resources to help TRACER match a word against its synsets and an inflected form against its base-form. For synonym detection, we extracted synonymous relations from the Latin WordNet. TR identified with TRACER was manually compared against the IT-GS to separate the True Positives (TP) from the False Positives (FP), and to identify False Negatives (FN).

[9] https://doi.org/21.11101/0000-0007-C9CA-3
[10] The six steps are: Preprocessing, Featuring, Selection, Linking, Scoring and Postprocessing.

4 Results

4.1 Philosophiae Consolationis

To detect both verbatim quotations and paraphrase, TRACER was optimised for recall over precision and configured to work with single words as features, to ignore the top 20% most frequent words, [11] to link text pairs with a minimum overlap of 5 features, [12] to expand the query to synonyms, and to return only those aligned text pairs presenting an overall sentence similarity of at least 50%. [13] Of the eight reuses indicated in the Editio Leonina, we were unable to precisely locate one as it alludes to four paragraphs of text; [14] of the remaining seven, as shown in Figure 1, TRACER identified three (42%). Upon close inspection, two FNs were affected by the 20% threshold of feature removal, for example:

    Boethius 1.4.105 Unde haud iniuria tuorum quidam familiarium quaesivit: "Si quidem deus", inquit, "est, unde mala?" [15]

    Aquinas 3.71.10 [. . .] introducit quendam philosophum quaerentem: si deus est, unde malum? [16]

[11] The parameter, known as feature density, is a language-independent measure used to decontaminate the texts and to contain the number of results based on chance repetition; an 80% feature density means that TRACER ignores or removes the most frequent types that cover 20% of the tokens.
[12] For a 24k sentence corpus such as this, an overlap of 5 is statistically significant (Büchler, 2013, p. 134).
[13] The value was chosen on the basis of previous experiments as a good trade-off between precision and recall. The similarity measure used is Broder's containment, which is particularly suited to documents or sentences of uneven length (Broder, 1997).
[14] This reuse would have doubtless been overlooked by TRACER too owing to the absence of features to compare.
[15] Our English translation reads: 'It is not wrong that a certain acquaintance of yours has questioned: "If in fact God exists," he asks, "where is evil from?"'
[16] Our English translation reads: '(Boethius) introduces a certain philosopher who asks: "If God exists, where is evil from?"'

Here, the tokens si, est and unde were ignored as they fell within the pool of the 20% most frequent words removed.

One reuse was successfully identified on the basis of feature overlap but did not amount to a 50% sentence similarity; and the fourth missed reuse could not be identified because of a missing synonymous relation in the Latin WordNet (i.e., gaudium-beatitudo) [17] and its insufficient feature overlap. The resulting F1-score is 4.6 · 10^-3.

[17] Incidentally, this relation is also not mapped in BabelNet (bn:00042905n) nor in ConceptNet (http://conceptnet.io/c/la/gaudium) (as of 8 June 2018).

4.2 De Trinitate

Given the results of the previous analysis, for this second investigation the feature removal and the sentence similarity values were lowered to 10% and 40% respectively, thus optimising for even higher recall (10,349 total sentences aligned). Of the four known reuses, TRACER identified three. The 40% similarity threshold was essential to the identification of one reuse (where the score is 0.4375); the FN, which was indeed found on the basis of an eight-word overlap but did not meet the minimum sentence similarity threshold, revealed another missing synonymous relation in the WordNet (i.e., disciplinatus-eruditus) [18] and a failed alignment of the variants temptare (Boethius) and tentare (Aquinas) owing to inconsistent TreeTagger lemmatisation (tempto and tento, respectively). The F1-score for this analysis was 5.6 · 10^-4.

[18] Also present in neither BabelNet nor ConceptNet.

4.3 De Deo Socratis

This work of Apuleius is quoted twice in the ScG. Of the two reuses, TRACER was able to detect one in full and only parts of the second. The second reuse spans three sentences and is mostly paraphrastic, with only three words annotated in the Index Thomisticus as QL (sunt animo passiva). [19] To capture the fullest range of reuse diversity, TRACER's feature removal was set to 10%, the overlap to 3 and the overall similarity to 20%. However, as sunt (a form of the verb sum 'to be') is the most frequent word across the texts, TRACER's inbuilt feature removal prevented the detection of the short QL portion of the reuse; the QR+QS portions, on the other hand, were successfully detected. We counted both results as TPs, resulting in an F1-score of 2.6 · 10^-5.

[19] [daemones] [. . .] sunt animo passiva or 'demons are emotional in mind' (Jones, 2017, pp. 372-373).

4.4 De Divinatione

The only recorded reuse that Aquinas makes of Cicero's text is implicit and alludes to a block of text, making it difficult to manually pinpoint with precision. To detect as loose a similarity as possible, the TRACER search was cast with the same configuration used in the previous analysis. No reuse, however, was found.

4.5 Metaphysica

The Editio Leonina lists 97 reuses of Aristotle's Metaphysica. As previously mentioned, Pelster describes Aquinas' reuse of the Latin translation of the Metaphysica as more paraphrastic than literal. Our manual examination of the texts and the results of TRACER confirmed this observation, in that we could not manually locate seven reuses (due to their strong allusiveness) and a fault-tolerant TRACER configuration (removal of the top 10% most frequent words, overlap of 3 features and an overall sentence similarity of 40%) yielded only 19 TPs (6 out of 15 QL [20] and 13 out of 75 QR+QS). The F1-score resulting from this analysis is 3.8 · 10^-4.

[20] The QL quotations in the ScG seem to refer to a different Latin translation than that available to us, which would explain why some instances of QL went undetected.

Figure 1 – For every TRACER analysis, a MySQL table is created to store and manually evaluate the results against the IT-GS.
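The manual TP/FP/FN classification stored in such an evaluation table is the input from which precision, recall and F1 figures like those in Section 4 are derived. The following is a minimal Python sketch of that arithmetic, assuming the standard definitions of the three measures. It is not the authors' evaluation script: the `Counts` class and the function names are illustrative, and the false-positive count in the demo is a made-up placeholder, because per-work FP totals are not reported in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Outcome counts for one detection run (illustrative names, not TRACER's)."""
    tp: int  # gold-standard reuses that were detected
    fp: int  # returned alignments that are not gold-standard reuses
    fn: int  # gold-standard reuses that were missed


def precision(c: Counts) -> float:
    """Share of returned alignments that are genuine reuses."""
    returned = c.tp + c.fp
    return c.tp / returned if returned else 0.0


def recall(c: Counts) -> float:
    """Share of gold-standard reuses that were detected."""
    relevant = c.tp + c.fn
    return c.tp / relevant if relevant else 0.0


def f1(c: Counts) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Section 4.1 reports 3 detected and 4 missed reuses for Philosophiae
# Consolationis. The FP count is NOT given in the text; 1000 is a
# hypothetical placeholder used only to make the example runnable.
demo = Counts(tp=3, fp=1000, fn=4)

print(f"precision = {precision(demo):.4f}")
print(f"recall    = {recall(demo):.4f}")
print(f"F1        = {f1(demo):.4f}")
```

Since F1 can be rewritten as 2·TP / (2·TP + FP + FN), a recall-oriented configuration that returns thousands of candidate alignments pulls the score down sharply even when a fair share of the known reuses is found. This may be one reason why the reported F1-scores are so small, although the text does not state it explicitly.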
The evaluation table for Philosophiae Consolationis illustrated in Figure 1 contains a wealth of information, including full citation information for both works, the TRACER settings used for the detection task, the Index Thomisticus quotation annotations, the result classification (into True Positive and False Negative), as well as the feature overlap and the overall similarity value of the aligned sentences. The reuse in the highlighted row, for instance, was correctly identified by TRACER on the basis of a 9-word overlap and an overall sentence similarity of 90%.

5 Discussion

Our results show that the FNs emerging from the computational analyses were largely caused by Aquinas' paraphrastic and allusive TR style, which at times challenged our own ability to spot similarities, even with the help of the critical edition. The allusions that we could identify generally retain the semantics of the alluded-to texts, thus confirming Durantel's insights. While a number of these negative results were also directly tied to lacunae in the Latin WordNet and to inconsistent lemmatisation, the flexibility and methodological transparency of TRACER allowed us to locate error sources and accordingly tune configurations to work around these issues (e.g., by increasing the feature overlap and/or lowering the sentence similarity scoring thresholds). Notwithstanding, TRACER's panlingual feature removal parameter affected the retrieval of shorter instances of reuse, particularly those containing forms of the highly frequent verb sum.

The manual evaluation of TRACER results against the IT-GS for the creation of an Index fontium computatus was time-consuming, not least because of a number of reference inaccuracies in the critical edition itself (in one case, the reference is off by ten lines). Nevertheless, the creation of the index is proving essential to the assessment of TRACER's fitness for purpose on Latin texts.

As far as the usability of the tool is concerned, TRACER's detection power is offset by its cumbersome setup, which is unfriendly to those who are not familiar with the command line, NLP basics and/or Java (stack traces). This issue is being addressed with the development of a user manual (Franzini et al., 2018).

6 Conclusion

This article describes a computational text reuse study on Latin texts designed to evaluate the performance of TRACER, a language-agnostic IR text reuse detection engine. The results obtained were manually evaluated against a gold standard and are contributing to the creation of an Index fontium computatus both to assess TRACER's efficacy and to provide a test-bed against which analogous IR systems can be measured and thus compared to TRACER. Our study shows that despite the known limitations of existing linguistic resources for Latin, the diverse spectrum of paraphrastic reuse encountered and its own language-agnosticism, TRACER is equipped to detect a wide range of explicit text reuse in the ScG, be that short or long, verbatim or paraphrastic, and implicit reuse only if coupled with explicit reuse. To increase the detection accuracy, we are implementing a black/white list to give users the power to control words or multi-word expressions to be ignored or retained in the detection; furthermore, we plan on re-running these analyses with the disambiguated linguistic annotation currently being added to the text of the ScG (Passarotti, 2015) to measure its impact on this particular IR task.

The data used and generated in the current study is available from: https://github.com/CIRCSE/text-reuse-aquinas.

Acknowledgments

The authors would like to thank Eleonora Litta for proofreading this article and the anonymous reviewers for their valuable comments. This research was funded by the German Federal Ministry of Education and Research (No. 01UG1409).

References

David Bamman and Gregory Crane. 2008. The Logic and Discovery of Textual Allusion. In Proceedings of the ACL Workshop LaTeCH - Language Technology for Cultural Heritage Data. ACL. http://hdl.handle.net/10427/42685.

Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES '97, pages 21–29, Washington, DC, USA. IEEE Computer Society. http://dl.acm.org/citation.cfm?id=829502.830043.

Marco Büchler, Philip R. Burns, Martin Müller, Emily Franzini, and Greta Franzini. 2014. Towards a Historical Text Re-use Detection. In Chris Biemann and Alexander Mehler, editors, Text Mining, pages 221–238. Springer International Publishing, Cham. http://link.springer.com/10.1007/978-3-319-12655-5_11.

Marco Büchler. 2013. Informationstechnische Aspekte des Historical Text Re-use. PhD Thesis. http://www.qucosa.de/fileadmin/data/qucosa/documents/10851/Dissertation.pdf.

Roberto Busa. 1980. The annals of humanities computing: The Index Thomisticus. Computers and the Humanities, 14(2):83–90, October. http://www.jstor.org/stable/30207304.

Neil Coffee, Jean-Pierre Koenig, Shakthi Poornima, Christopher W. Forstall, Roelant Ossewaarde, and Sarah L. Jacobson. 2013. The Tesserae Project: intertextual analysis of Latin poetry. Literary and Linguistic Computing, 28:221–228. https://doi.org/10.1093/llc/fqs033.

Neil Coffee. 2018. An Agenda for the Study of Intertextuality. Transactions of the American Philological Association, 148:205–223. https://muse.jhu.edu/article/693654.

Gregory Crane. 1991. Generating and Parsing Classical Greek. Literary and Linguistic Computing, pages 243–245. https://doi.org/10.1093/llc/6.4.243.

Jean Durantel. 1919. Saint Thomas et le Pseudo-Denis. Librairie Félix Alcan, Paris. http://archive.org/details/cuasaintthomaset00dura.

Steffen Eger, Tim vor der Brück, and Alexander Mehler. 2015. Lexicon-assisted tagging and lemmatization in Latin: A comparison of six taggers and two lemmatization methods. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 105–113. http://www.aclweb.org/anthology/W15-3716.

Don Fowler. 1997. On the Shoulders of Giants: Intertextuality and Classical Studies. Materiali e discussioni per l'analisi dei testi classici, 39:13–34. http://www.jstor.org/stable/40236104.

Greta Franzini, Emily Franzini, Kirill Bulert, Marco Büchler, and Maria Moritz. 2018. TRACER: A User Manual. https://tracer.gitbook.io/-manual/.

R. A. Gauthier, L. J. Bataillon, A. Oliva, T. de Vio Cajetan, Commissio Leonina, and Dominicans. 1882. Sancti Thomae Aquinatis Doctoris Angelici Opera Omnia iussu edita Leonis XIII P.M. Ex Typographia Polyglotta S.C. de Propaganda Fide, Rome.

Clovis Gladstone and Charles Cooney. forthcoming. Opening New Paths for Scholarship: Algorithms to Track Text Reuse in ECCO. Digitizing Enlightenment.

Martyn Harris, Mark Levene, Dell Zhang, and Dan Levene. 2018. Finding Parallel Passages in Cultural Heritage Archives. Journal on Computing and Cultural Heritage, 11(3):15:1–15:24. http://doi.acm.org/10.1145/3195727.

Francis John Haverfield. 1916. Tacitus during the Late Roman Period and the Middle Ages. The Journal of Roman Studies, 6:196–201. https://doi.org/10.2307/296272.

Richard D. Janda and Brian D. Joseph. 2005. On Language, Change, and Language Change – Or, Of History, Linguistics, and Historical Linguistics. In Brian D. Joseph and Richard D. Janda, editors, The Handbook of Historical Linguistics, pages 3–181. Wiley-Blackwell, Oxford.

Steven E. Jones. 2016. Roberto Busa, S. J., and the Emergence of Humanities Computing: The Priest and the Punched Cards. Routledge, March.

Christopher P. Jones, editor. 2017. Apuleius. Apologia. Florida. De Deo Socratis, volume 534 of Loeb Classical Library. Harvard University Press.

Norman Kretzmann and Eleonore Stump, editors. 1993. The Cambridge Companion to Aquinas. Cambridge University Press, Cambridge; New York, May.

Stefano Minozzi. 2017. Latin WordNet, una rete di conoscenza semantica per il latino e alcune ipotesi di utilizzo nel campo dell'Information Retrieval. In Paolo Mastandrea, editor, Strumenti digitali e collaborativi per le Scienze dell'Antichità, number 14 in Antichistica, pages 123–134. http://doi.org/10.14277/6969-182-9/ANT-14-10.

Martijn Naaijer and Dirk Roorda. 2016. Parallel Texts in the Hebrew Bible, New Methods and Visualizations. CoRR, abs/1603.01541. http://arxiv.org/abs/1603.01541.

Marco Passarotti. 2010. Leaving behind the less-resourced status. The case of Latin through the experience of the Index Thomisticus Treebank. In 7th SaLTMiL Workshop on Creation and use of basic lexical resources for less-resourced languages, LREC 2010, Valletta, Malta, 23 May 2010.

Marco Passarotti. 2011. Language Resources. The State of the Art of Latin and the Index Thomisticus Treebank Project. In Marie-Sol Ortola, editor, Corpus anciens et Bases de données, ALIENTO. Échanges sapientiels en Méditerranée, volume 2, pages 301–320. Presses universitaires de Nancy, Nancy.

Marco Passarotti. 2015. What you can do with linguistically annotated data. From the Index Thomisticus to the Index Thomisticus Treebank. In Vijgen Roszak Piotr, editor, Reading Sacred Scripture with Thomas Aquinas. Hermeneutical Tools, Theological Questions and New Perspectives, pages 3–44. Brepols.

F. Pelster. 1935. Die Uebersetzungen der aristotelischen Metaphysik in den Werken des hl. Thomas von Aquin: Ein Beitrag. Gregorianum, 16(3):325–348. http://www.jstor.org/stable/23567607.

Enzo Portalupi. 1994. L'uso dell'"Index Thomisticus" nello studio delle fonti di Tommaso d'Aquino: Considerazioni generali e questioni di metodo. Rivista di Filosofia Neo-Scolastica, 86(3):573–585. http://www.jstor.org/stable/43062344.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of International Conference on New Methods in Language Processing, Manchester, UK. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger1.pdf.

David A. Smith, Ryan Cordell, and Abby Mullen. 2015. Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers. American Literary History, 27(3):E1–E15. http://dx.doi.org/10.1093/alh/ajv029.

Donald Sturgeon. 2017. Unsupervised identification of text reuse in early Chinese literature. Digital Scholarship in the Humanities. https://academic.oup.com/dsh/advance-article/doi/10.1093/llc/fqx024/4583485.

James Turner. 2014. Philology: The Forgotten Origins of the Modern Humanities. Princeton University Press, Princeton and Oxford.

Ismet Zeki Yalniz, Ethem F. Can, and R. Manmatha. 2011. Partial Duplicate Detection for Large Book Collections. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 469–474. http://doi.acm.org/10.1145/2063576.2063647.