The Challenges of Multilingualism in the Search for Ancient Wisdom: A Case Study of VERITRACE’s Text Matching Tool

Jeffrey C. Wolf*1, Nicolò Cantoni1, Eszter Kovács1, Demetrios Paraschos1, and Cornelis J. Schilt1

1 Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium

Abstract
The ERC-funded VERITRACE project is applying the latest digital tools, including machine learning algorithms, to a large corpus of early modern texts in order to trace the influence of ancient wisdom writings on the development of early modern natural philosophy. The project's innovative capabilities include text matching, in which an entire query text is used as a search ‘query’ across a much larger comparison corpus. This poses challenges when the query and comparison corpora are multilingual. This paper explores these issues using VERITRACE’s Text Matching tool.

Keywords
digital humanities, distant reading, text matching, machine translation, multilingual texts, similarity

1. Introduction
The European Research Council is funding the ambitious five-year project (2023-2028) Traces de la Verité: The reappropriation of ancient wisdom in early modern natural philosophy, aka VERITRACE (ERC-StG Project VERITRACE, 101076836) [1]. Led by Professor Cornelis J. Schilt at the Vrije Universiteit Brussel (VUB), VERITRACE aims to uncover the influence of prominent ancient wisdom writings on natural philosophical discourse in the early modern period.1 VERITRACE applies sophisticated digital analysis techniques, including machine learning algorithms, to a large corpus of early modern texts, tracing both explicitly referenced and more subtle uses of the Corpus Hermeticum (including the Asclepius), the Chaldean and Sibylline Oracles, and the Orphic Hymns.
Moreover, it analyses how these texts were being used (employing Latent Semantic Analysis, among other tools), with what sentiment they were discussed (using Sentiment Analysis) by their proponents and antagonists, and how these debates were influenced by key episodes in the transmission history of these texts. VERITRACE will provide the first-ever comprehensive analysis of ancient wisdom’s role in shaping early modern natural philosophy, and it will do so by making use of new methodologies never employed at this scale to interpret the early modern history of science.

VERITRACE draws on printed books, the most ubiquitous intellectual materials found in this period. Although early modern debates followed other modes of discourse, such as oral discussions and the circulation of manuscripts and letters, these often pertained to small circles of select readers. Books, on the other hand, were everywhere.

Humanities-Centred AI (CHAI), 4th Workshop at the 47th German Conference on Artificial Intelligence, September 23, 2024, Würzburg, Germany
∗ Corresponding author.
Jeffrey.Charles.Wolf@vub.be (J. Wolf); Nicolo.Cantoni@vub.be (N. Cantoni); Eszter.Kovacs@vub.be (E. Kovács); Demetrios.Paraschos@vub.be (D. Paraschos); Cornelis.Johannes.Schilt@vub.be (C. Schilt)
0009-0007-9879-4476 (J. Wolf); 0009-0008-4438-4778 (N. Cantoni); 0000-0002-5900-8301 (E. Kovács); 0009-0004-6785-8033 (D. Paraschos); 0000-0001-7826-8355 (C. Schilt)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1 Much of the text in this introductory section has been adapted from the VERITRACE ERC-funding proposal [2]. A variation of this introductory material, along with portions of the case study of Everard’s English translation of the Divine Pymander, will be printed in a forthcoming special issue of the journal Society and Politics [3].
Indeed, even if we only focus on works in Latin, English, German, French, Dutch, and Italian, the number of books that have come down to us from the early modern period is staggering, presenting many challenges – and opportunities. Some of these challenges uniquely characterise VERITRACE as a digital humanities project. These features include:

1. Multilingual Complexity: The project grapples with at least six different languages, both modern and classical. Many digital humanities projects only contend with one or two languages, especially English. Since many natural language processing (NLP) techniques were initially developed for English-speaking users, the multilingual nature of VERITRACE raises its own set of challenges. In fact, even when available tools are comprehensive in terms of modern languages, they sometimes exclude, or have limited support for, classical ones like Latin.2

2. Longue durée: Spanning nearly two centuries (1540-1728), VERITRACE must account for significant shifts in historical context and linguistic meaning. An interpretation that applies to a smaller slice of the data cannot be assumed to apply to the whole, given changes in historical context and linguistic meaning over time.

3. Big Data Management: With a corpus comprising hundreds of thousands of texts, the project requires sophisticated data handling and analysis techniques, beyond simple search processes. It will be resource intensive and sometimes require different tools and solutions, given the size of the data collections with which we are working.

4. Complex Data Integration: Because our data comes from different sources held in separate institutions and collected over long periods of time, there are inherent challenges to integrating and harmonising the data. This necessitates paying careful attention to data cleaning and data transformation, along with careful documentation, so that we have a solid basis for subsequent analysis.
VERITRACE must also grapple with the familiar challenges of any distant reading project: uncertain accuracy of the underlying digital texts (OCR quality), the parameter-dependent nature of various NLP techniques, and so forth.

1.1. Distant Reading
Traditional methods of tracing textual influence would require an enormous research team or a drastically reduced scope. But this is where digital techniques come in, most notably from the field of distant reading, which have been developed specifically to query large corpora. These techniques, closely related to natural language processing, allow for the analysis of large text corpora, identifying patterns and uncovering both prominent and neglected works, the latter termed 'the great unread' by Margaret Cohen [4, 5]. Famously, early modern writers would rarely include references to their source material, which provides one of the key challenges for the project. Recent advancements in book digitisation have greatly expanded the potential of distant reading approaches. Improved OCR technology now yields meaningful results even with suboptimal text recognition [6, 7]. Online repositories, like those of the Bibliothèque nationale de France, provide standardised data for content extraction, facilitating large-scale analysis [8, 9].

2 For example, Latin is not available as a default trained pipeline package in the open-source natural language processing library spaCy (https://spacy.io/usage/models), although Patrick J. Burns has created LatinCy to fill this gap (https://spacy.io/universe/project/latincy). The Natural Language Toolkit (NLTK: https://www.nltk.org) likewise has limited support for Latin (e.g. for tokenisation), though the Classical Language Toolkit (CLTK: http://cltk.org) – which does support Latin – has been developed to supplement this. Another example: OpenSearch has no built-in language analyser for Latin (https://opensearch.org/docs/latest/analyzers/language-analyzers/).
VERITRACE’s chosen Distant Reading Corpus (DRC) consists of several hundred thousand works from important European library collections, written in Latin, French, German, Dutch, English, and Italian, including:

• Early English Books Online (EEBO) (ProQuest), which in its EEBO-TCP format developed by the Text Creation Partnership3 contains about 60,000 English and Latin texts published between 1540 and 1700 (hereafter 'EEBO', unless we refer specifically to our custom version of EEBO, which we call ‘VEEBO’)
• Gallica (Bibliothèque nationale de France), which contains almost 125,000 books published between 1540 and 1728 in a variety of languages including French, Italian, Dutch, and Latin (hereafter 'Gallica')
• The Digitale Sammlungen of the Bavarian State Library, which contain more than 340,000 books published between 1540 and 1728, including in Latin, German, French, Greek, Italian, and Dutch, among others (hereafter 'BSB')

This carefully selected corpus enables VERITRACE to make substantive claims about the existence and evolution of the prisca sapientia tradition, e.g. how prevalent it was and the level of interest in it over time. Did curiosity about the Corpus Hermeticum, for instance, decline after the first quarter of the seventeenth century, or not? By interrogating a truly representative sample of books, we can make reasonable claims about levels of interest and prevalence. A rigorously statistical frame of mind underpins our approach, and we believe the sources we have chosen can be the basis for constructing a representative sample of books published in Europe between 1540 and 1728.

2. Monolingual Text Matching
The above has been a general introduction to VERITRACE and how it will harness digital techniques in the pursuit of its intellectual goals. We turn now to a specific tool in development, which we call Text Matching.
In the following discussion, we see some of the promise, as well as the challenges, inherent in using such a tool, especially with multilingual corpora. To make the following observations more concrete, we will work with a specific research question to explore our Text Matching tool: what was the influence (however vaguely defined) of the first English translation of the Divine Pymander (1650) upon the subsequent generation of thinkers, who published texts in English between 1650 and 1680? We approach this in terms of the influence of a specific query text upon a much larger comparison corpus. Sometimes, especially in the Figures below, we refer, intuitively, to a source text and a target corpus, but the query text and comparison corpus terminology is more generalisable and to be preferred.4 The query text is the first translation into English of a work from the Corpus Hermeticum; namely, John Everard’s The divine Pymander of Hermes Mercurius Trismegistus, published in 1650 [10], and the comparison corpus (a subset of our larger Distant Reading Corpus) consists of all the English-language texts contained in EEBO published between 1650 and 1680: 18,633 individual texts in total.

VERITRACE provides the user with the ability to conduct simple and more complex searches of the comparison corpus, including keyword searches, fuzzy searches, wildcard searches, and exact phrase searches, among others. So traditional keyword search is part of the VERITRACE toolkit, but our Text Matching tool moves beyond this, for we want the ability to match entire passages from our query text with similar ones from the comparison corpus. Similar lexical phrasing, regardless of the precise words used, should be discoverable. In other words, we are building a kind of early modern plagiarism detector.5 Text Matching, as we call this, can likewise be seen as a more ambitious kind of search.

3 https://textcreationpartnership.org/about-the-tcp/
It does not take a single keyword or phrase as input but instead the entire query text itself – all its sentences, individually and collectively. We then attempt to find the most similar matching sentences and passages (groups of sentences) in the comparison corpus and rank them based on similarity to the sentences found in the query text. Before further discussion, let us see this in action. First, we want to identify the most similar matching sentences between query text and comparison corpus (see next page):

4 We are being cautious about the terms in use here. In this case, the query text is indeed the source text, and the comparison corpus is the target corpus – we are asking what influence the source text had on the subsequent target corpus. We assume some kind of cause and historical effect. But the Text Matching tool works in the other direction as well. Perhaps one has a set of all of Isaac Newton’s works, and one wants to know if they had an influence on a single text of a later thinker’s work. In that case, it might make more sense to use the later text as the query text and Newton’s collective works as the comparison corpus, as that better aligns with traditional information retrieval. But because we precompute all the vectors beforehand, computationally it makes little difference whether we choose a one-to-many instead of a many-to-one comparison. We retain bidirectionality. Therefore, we favour the more neutral ‘query text’ and ‘comparison corpus’ terminology.
5 At this stage of the VERITRACE project, this is meant more metaphorically than in the strict sense that we are consciously using standard methods of plagiarism detection (though there is some overlap). For a description of some research in plagiarism detection – including the use of translated texts – see [11], especially 49.4.

Figure 1. The most similar matching sentences between query text and comparison corpus.
Please note a few details about this result (see Figure 1). First, the ‘Similarity Score’ column at the far right contains score values, which should be interpreted as ‘relevance scores’ (relative to each other). The values provide a ranking of the most relevant match results of the query text matched against the comparison corpus. The similarity score ranges from 0 (no similarity) to 1 (exact similarity, i.e. identical), with higher values indicating greater similarity between the query and comparison texts being matched. Another detail is that, within our comparison corpus, there appears to be another edition of the Divine Pymander, published seven years later [12]. It is almost identical to our query text, which is why sentences from the 1657 edition of the Divine Pymander are the top matches (with similarity scores of ‘1.00’) to sentences in the query text (the 1650 edition). This is exactly what we would expect if the editions are virtually identical to each other. It provides a ‘common sense’ check on the validity of our Text Matching tool.

2.1. Under the Hood: TF-IDF and Cosine Similarity
Because the concept of a similarity score is so central to what we are doing, it is worth pausing to look ‘under the hood’ to see how we compute it. This is important to understand because we will be using similarity scores extensively in the examples that follow. Here is the relevant part of the NLP pipeline: we begin our text normalisation with some basic preprocessing of the ‘raw’ text from our data sources. This ‘raw’ text is, in practice, generally derived from files obtained from our data sources – xml, html, or hocr files, depending on the initial source – which we then parse to extract the text we will use for downstream analysis. The extracted text itself is often saved in individual txt files for ease of use.
The parsing and extraction process is full of pitfalls (not to mention long compute times, given the number of files), and there are many considerations and nuances we must weigh in order to capture the most accurate version of the digitised transcription of the original printed text (this is not the place to discuss these intricacies further). We then segment the query text and the comparison corpus texts into sentences (with a default minimum word count of 5, which is adjustable) and groups of sentences (with a default ‘chunk size’ of 3 sentences, also adjustable). There are multiple decisions here too: we prefer to segment the pre-tokenised raw text into sentences rather than segment sentences from the tokenised text. This adds some time and inefficiency to the pipeline, but we believe it is more accurate because errors in tokenisation could compound into errors in segmentation. With the sentences in hand, we then tokenise them using the spaCy library.6 There are many ways to optimise the tokenisation process at this stage. We have added some custom rules to handle early modern English, as opposed to the 21st-century English that spaCy assumes in its trained English pipeline. The results seem sufficient for now, but we continue to iterate on this part of the pipeline. Also, we prefer to use LatinCy for tokenising Latin texts, and, in general, each language requires its own customisations. Once tokenisation is done, we vectorise the sentences and sentence chunks using the scikit-learn machine learning library,7 extracting their features with TF-IDF (Term Frequency-Inverse Document Frequency), a well-known and popular algorithm in vector semantics.

6 https://spacy.io
7 https://scikit-learn.org/stable/
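The segmentation step just described can be sketched in a few lines. This is a simplification, not our production pipeline: it uses spaCy's blank English pipeline with the rule-based sentencizer (our actual pipeline adds custom rules for early modern English and segments before tokenising), and the function name and one-pass design are illustrative.

```python
import spacy

# Blank English pipeline with a rule-based sentence boundary detector;
# no trained model download is needed for this sketch.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def segment(raw_text, min_words=5, chunk_size=3):
    """Split raw text into sentences (dropping very short ones) and into
    non-overlapping chunks of `chunk_size` sentences."""
    doc = nlp(raw_text)
    sentences = [sent.text.strip() for sent in doc.sents
                 if sum(1 for tok in sent if tok.is_alpha) >= min_words]
    chunks = [" ".join(sentences[i:i + chunk_size])
              for i in range(0, len(sentences), chunk_size)]
    return sentences, chunks
```

Sentences with fewer than `min_words` word tokens are filtered out, mirroring the adjustable minimum word count of 5 mentioned above.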
In particular, we use scikit-learn's TfidfVectorizer, which converts “a collection of raw documents to a matrix of TF-IDF features.” TF-IDF allows us to convert text into sparse numerical vectors by assigning weights to document terms (words) based on their frequency in a document, offset by assigning a higher weight to terms that occur in only a few documents in a larger corpus.8 That is to say, words that occur frequently within a document but are comparatively unique to it (relative to a larger corpus) are given a higher weight under the TF-IDF paradigm. Note that this treats the meaning of a word simplistically as a function of the number of nearby words [13]. Word order is essentially ignored in this semantics.

Using TF-IDF as our vector semantics model here is a choice, and it needs defending.9 After all, there are more advanced approaches readily available, including dense vector word embeddings like word2vec or GloVe, or dense embeddings generated by the latest transformer models (e.g. BERT, GPT, and their descendants).10 As the discussion below will show, the demands of our multilingual corpus are pushing us in that direction – especially towards transformer-based embeddings.11 But it is worth seeing both why that is the case and also why TF-IDF is still worth retaining, even if newer models can produce more semantically accurate embeddings.12 TF-IDF still provides a nice balance between simplicity and effectiveness, and, as Jurafsky and Martin note, it is “a great baseline, the simple thing to try first.”13 And so we do.

Vectorisation using TF-IDF is not sufficient to produce similarity scores, of course. We must compute those, and here again there is more than one option, though using the cosine of the angle between vectors as a measure of similarity is the predominant paradigm in the ‘vector space model for scoring’ – it is formally referred to as cosine similarity ([14], Section 6.3, p. 120).
Because TF-IDF represents each textual unit as a numerical vector – a point in vector space (this is the well-known vector space model) – we can compute similarity by computing the cosine of the angle between the two vectors.14

8 This is an oversimplification of sorts; there are different ways to compute the TF-IDF weight, and many variations of it. See [13], especially pp. 108-114.
9 There is often a confusing array of terminology in NLP, in contrasting use both among academics and DH developers. Because we find the discussion of vector semantics in [13] to be admirably clear, and because it is one of the leading textbooks in the field, we prefer to use their terminology where possible. Two other texts have been particularly helpful in situating our approach: the classic textbook on information retrieval by Manning et al. [14] – for our ‘text matching’ is a kind of information retrieval, with a query and ranked results retrieval. Finally, a recent reference work on many of these topics is [15], though the quality of the chapters is variable and not all of them have been recently updated despite the 2022 publication date, so some caution is warranted. But it does have a very helpful Glossary (pp. 1243-1290), and [16] and [17] are most relevant to our discussion. This is usefully compared to the older NLP handbook by Indurkhya and Damerau [18]. For a more applied approach with illuminating case studies, see [9]. Despite the older code, [19] is also still worth consulting.
10 Many of the latest models, Transformer-based or otherwise, are readily available on the open-source machine learning platform Hugging Face (https://huggingface.co). For word2vec, see https://www.tensorflow.org/text/tutorials/word2vec. For GloVe, see https://nlp.stanford.edu/projects/glove/.
11 Attempts have been made to use TF-IDF on multilingual corpora by modifying it to use subword tokenisation (STF-IDF) [20]. An open-source model ‘Text2Text’ has been created to implement this. We have not evaluated the claims or the model.
The smaller the angle between two vectors, the more similar they are, and the closer the cosine of that angle is to 1. The similarity score is just this cosine similarity metric. Note that, when using TF-IDF, it will always lie between 0 and 1 (even though the cosine of an angle between two vectors can vary between -1 and 1) because term frequency values cannot be negative ([13], p. 111).15

2.2. Text Matching: Initial Results
Let’s return to the Text Matching results. For illustrative purposes, it is not very interesting to see a list of the results from the 1657 Divine Pymander (refer again to Figure 1), given the unsurprising, near-identical nature of the texts, so what happens if we exclude these (see next page)?

12 Jurafsky and Martin claim that dense vector embeddings universally generate more accurate results than sparse vector models, including TF-IDF: “It turns out that dense vectors work better in every NLP task than sparse vectors” (117). It is not entirely clear why this is the case, though a plausible theory, they suggest, is that the smaller parameter space (e.g. 300-dimensional dense vectors vs. 50,000-dimensional sparse ones) better avoids overfitting and enhances generalisation. It also better represents synonymy. See [13], p. 117.
13 See [13], p. 113. Note that we are using ‘TF-IDF’ here as a stand-in for both its traditional application and also more sophisticated variations of it, like the Okapi BM25 algorithm, which we are likely to prefer for our lexical similarity scoring. It is still within the family of TF-IDF models, however. For Okapi BM25, see [14], especially 11.4.3.
14 See Ch. 6 in [14]. Jurafsky and Martin (see their historical notes on pp. 129-131) attribute the original vector space model to Salton [21] in the realm of information retrieval, though Osgood had already suggested in 1957 that the meaning of a word could be represented as a point in a multidimensional ‘semantic’ space [22].
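The bounds of the TF-IDF cosine similarity score can be illustrated with a small, self-contained example (toy sentences; scikit-learn's `cosine_similarity` stands in for our scoring code): identical texts score 1.0 and texts sharing no vocabulary score 0.0, matching the 0-to-1 range described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "but if thou shut up thy soul in thy body",
    "but if thou shut up thy soul in thy body",   # identical to the first
    "completely unrelated vocabulary appears here",  # no shared terms
]

vectors = TfidfVectorizer().fit_transform(texts)
scores = cosine_similarity(vectors)  # pairwise score matrix, shape (3, 3)

print(round(scores[0, 1], 2))  # identical sentences -> 1.0
print(round(scores[0, 2], 2))  # no shared terms -> 0.0
```

Real matches, like the Traherne and Cudworth results discussed below, fall between these two extremes.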
15 There are a handful of cosine terms that are easy to mix up: cosine, cosine of an angle, cosine similarity, and cosine distance. Cosine simply refers to the well-known trigonometric function of the same name, normally defined as the ratio of the length of the side of a right triangle adjacent to the angle to the length of the hypotenuse (https://mathworld.wolfram.com/Cosine.html). This also explains what is meant by the ‘cosine of an angle’ – it is just the evaluation of the cosine function for a specific angle. Cosine similarity is a measure of the similarity between two vectors of the same dimensionality that is derived from the cosine of the angle between those two vectors (see the helpful discussion in section 6.4 of [13]). When two vectors are more similar, the angle between them is smaller (and the cosine of that angle is larger). Therefore, cosine of an angle and cosine similarity are interrelated. Cosine distance is also related because it is simply 1 – cosine similarity (https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/cosdist.htm). But we will not use cosine distance directly in this discussion.

Figure 2. The most similar matching sentences between query text and comparison corpus, excluding results from the 1657 edition of the Divine Pymander.

Not all the sentences that match are particularly meaningful (e.g. see the second result in Figure 2 above).
But we also find that many of the most similar sentences come from Thomas Traherne’s Christian ethicks (1675) [23], and it appears that he is copying directly from the query text (or the 1657 edition), though he sometimes alters the language in minor ways (see the two results we have highlighted).16 So this is certainly more interesting and worthy of follow-up: is Traherne acknowledging the Divine Pymander as his source, or pretending he has written this himself?17 The VERITRACE Text Matching tool provides a line of investigation.18

Now, sentence matching is a start, but what we would really like is to find entire passages – groups of sentences (sometimes referred to as ‘chunks’) – that match between our query text and the comparison corpus. And indeed, this is the next step. Once again, we first exclude the matching passages from the 1657 Divine Pymander and then display the following results (see next page):

16 The connection between Traherne and the Corpus Hermeticum is not a novel observation, though the Text Matching tool is a new way to observe it. Already by the 1960s, historians were connecting the two [24].
17 It is also worth emphasising that finding matching sentences or passages between two texts does not prove that the author of the later text used the earlier one as a source. As we have been reminded by the existence of both a 1650 and a 1657 edition of the Divine Pymander, any number of similar editions could have been used. More generally, it could be that both texts drew on a third source. We cannot rule that out, which is why the Text Matching tool provides grounds for further inquiry but should not be considered determinative evidence in itself. It is a tool of inquiry – not proof.
18 In this instance, Traherne is in fact summarising what Hermes Trismegistus says in his ‘Poemander’. He is not trying to claim the thoughts as his own. See p.
443, which introduces this discussion: “Trismegistus counteth thus, First GOD, secondly the World, thirdly Man: the World for Man, and Man for GOD. Of the Soul that which is sensible is Mortal, but that which is reasonable Immortal… This in his Poemander.” The matching sentence found in the second highlighted result in Figure 2 (“But if thou shut up thy Soul in thy Body…”) is printed on p. 447 [23].

Figure 3. The most similar matching passages (groups of sentences) between query text and comparison corpus, excluding results from the 1657 edition of the Divine Pymander.

The most similar passage between the query text and the comparison corpus comes from Ralph Cudworth’s The True Intellectual System of the Universe (1678) [25] – see Figure 3. Scholars have long known about the influence of the Corpus Hermeticum on the so-called Cambridge Platonists like Ralph Cudworth and Henry More, and here is lexical evidence to support this [26]. Our Text Matching tool highlights all matching words from the passages in yellow, so one can see how they overlap or differ. Notice that the passages in question are not exact matches – instead, they have minor differences in language and meaning, yet the corpus passages appear to be drawn from the query text. Subject to further confirmation, we appear to be observing shades of influence. This is what we hoped to see, and there are a variety of intellectual questions that could be pursued here, with just this small sample of results.

3. Text Matching: From Mono- to Multilingual
Thus far, we have seen how the VERITRACE Text Matching tool functions when the query text and the comparison corpus share the same language. But the VERITRACE project is inherently multilingual, encompassing texts in six different languages. We want the ability to handle query texts and comparison texts in any of them.
Suppose, for example, that we want a Latin book as our query text instead of an English one – that is, we want to search our English-language corpus with this Latin text. How could this work with our existing Text Matching tool? Because of the difference in languages – and hence vocabulary – it would seem impossible to find matching sentences and groups of sentences between the query and comparison texts, at least using TF-IDF.

3.1. A Simple Multilingual Corpus
In order to investigate this, let us examine a simple 3-text multilingual corpus:

TEXT 1
I. Corpus omne perseverare in statu suo quiescendi vel movendi uniformiter in directum, nisi quatenus illud a viribus impressis cogitur statum suum mutare.
II. Mutationem motus proportionalem esse vi motrici impressæ, et fieri secundum lineam rectam qua vis illa imprimitur.
III. Actioni contrariam semper et æqualem esse reactionem: sive corporum duorum actiones in se mutuo semper esse æquales et in partes contrarias dirigi.

TEXT 2
The quick brown fox jumps over the lazy dog. This sentence is commonly used as a typing exercise because it contains every letter of the English alphabet. It has nothing to do with Isaac Newton.

TEXT 3
I. Every body perseveres in its state of rest, or of uniform motion in a right line, unless it is compelled to change that state by forces impress’d thereon.
II. The alteration of motion is ever proportional to the motive force impress’d; and is made in the direction of the right line in which that force is impress’d.
III. To every Action there is always opposed an equal Reaction: or the mutual actions of two bodies upon each other are always equal, and directed to contrary parts.

TEXT 1, in Latin, includes Isaac Newton’s three laws of motion (printed artificially together) as found in the third edition of his Philosophiae naturalis principia mathematica (1726) [27]. TEXT 3, in English, is Andrew Motte’s 1729 translation of those very same Latin passages [28].
The assumption here is that the Motte translation is, more or less, a typical English translation of a Latin text from the early modern period and therefore ought to be representative of some of the multilingual editions we have in the VERITRACE corpus. And, finally, TEXT 2, in English, is a dummy passage, often used as a baseline in natural language processing to check our assumptions about the effect of various NLP transformations, with a few extra lines added by VERITRACE so that it resembles a passage (a group of sentences) instead of a single sentence.

3.2. Some Queries
Let us try some queries, but before we examine the results, please take note of the three different measures of similarity we are using for illustrative purposes here (see Figure 4 below). Because we use TF-IDF-based cosine similarity (i.e. cosine similarity between vectors created through TF-IDF-based vectorisation) in our Text Matching, we have highlighted that metric as the one to focus on. But each metric measures slightly different kinds of similarity. Jaccard similarity, our second metric, is a term-based similarity measure, evaluated “as the number of shared terms over the number of all unique terms in both strings” [29, p. 14]. Finally, our N-gram Overlap metric is simply a variation on Jaccard similarity, but it applies the Jaccard metric to n-grams (bigrams of words, by default) rather than single terms, so it has a wider span. Longest Common Substring (LCS) and Levenshtein distance measures would have been natural here too, but for the sake of simplicity, we have chosen just three.

Now, to the queries. First, as a ‘gut check’ of our assumptions, we will use TEXT 3, the Motte English translation, as our search query across the 3-text corpus. Here are the results:

Figure 4. Motte’s English translation of Newton’s Latin (TEXT 3) is used as a search query across the 3-text multilingual corpus. The results agree with common sense.
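The three metrics can be sketched as follows. This is a hedged simplification: it uses whitespace tokenisation rather than our spaCy pipeline, and the function names are illustrative. Applied to fragments of the Latin and English versions of Newton's first law, all three lexical measures collapse towards zero across the language barrier, anticipating the results below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(a, b):
    # Cosine similarity between TF-IDF vectors fitted on just the two texts
    m = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(m[0], m[1])[0, 0]

def jaccard(a, b):
    # Shared terms over all unique terms in both strings
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def ngram_overlap(a, b, n=2):
    # Jaccard computed over word n-grams (bigrams by default)
    grams = lambda text: {tuple(text.lower().split()[i:i + n])
                          for i in range(len(text.split()) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

latin = "corpus omne perseverare in statu suo quiescendi vel movendi"
english = "every body perseveres in its state of rest or of uniform motion"

print(jaccard(latin, english))        # near zero: only 'in' is shared
print(ngram_overlap(latin, english))  # zero: no shared bigrams
```

All three are lexical measures: they see strings and term sets, not meanings, which is precisely the limitation the cross-language experiment below exposes.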
First, as we expect, when TEXT 3 is the search query, it should match identically with TEXT 3 in the corpus – as it does. We do not expect, nor do we find, much similarity with TEXT 2, the dummy text (only 0.254). And TEXT 1, in Latin, is very dissimilar to our query text, given the differences in language and vocabulary. Again, this is what we expect, and it looks like our small case study confirms common-sense assumptions so far. What we want to explore, however, is using a Latin text as our query to search across our comparison corpus. Let’s do that now:

Figure 5. Newton’s Latin text (TEXT 1) is used as the search query across the multilingual corpus.

When TEXT 1 (Newton’s Latin) is used as our search query, it matches identically with that same text in our corpus (see Figure 5). And, for our dummy text (TEXT 2), there is almost no similarity, as we would expect. But the match with TEXT 3 is troublesome – there is almost no similarity here (only 0.085), even though it is an early modern English translation of the Latin query. This is not surprising given the language differences, but it is not what we want. Instead, we would prefer a much closer similarity between a text and its translation in another language, given the semantic similarity between the two. The basic problem is that, up to now, we have only been comparing lexical similarity (syntactical similarity between characters or strings) – not semantic similarity (similarity between broader contextual units of meaning, regardless of the specific characters, strings or words used). And, while that works well enough for monolingual text matching, it seemingly will not work across the language barrier. So why not align the languages? We can translate the Latin search query into English and then compare it to the comparison corpus. Given the size of the VERITRACE corpus, we must do this automatically using machine translation; manual, human translation would be too time consuming.
Even a few years ago, the quality of machine translation simply would not have been good enough to attempt this, but it has improved rapidly since then and continues to do so. Whether it is good enough for our purposes is a subject for investigation. Let’s try it:

Figure 6. Newton’s Latin text (TEXT 1) is machine translated into a Machine Translated Query (now in English), which is then used as the search query across the multilingual corpus.

We take Newton’s Latin text (TEXT 1), apply machine translation to it, and then send this machine-translated query (now in English) across our comparison corpus (see Figure 6). For automated machine translation, we used Google Translate through the ‘deep-translator’ library.19 While we chose Google Translate here, we have not yet done a systematic analysis of which translation tools would best meet our needs (further optimisation is possible). In any case, notice that now, even though we began with the Newton Latin text (TEXT 1), it is no longer considered similar to TEXT 1 in the corpus, because we are using its English translation. That is to be expected. The same goes for the similarity to the dummy text (TEXT 2). But see what happened to the comparison with TEXT 3 – the Motte translation is now considered significantly similar (0.675) to the machine-translated version of Newton’s original Latin text (TEXT 1). This is quite a bit better and should allow us, in theory, to match texts that are semantically similar, even when they are in different languages. There are some important limitations we should be aware of. The quality of the translation clearly has a big impact on the effectiveness of Text Matching across languages. And we must remember that when we create the machine translation, at least as set up above, we are creating a translation into 21st-century English, which can differ substantially from the early modern English found in our corpus.
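In code, the translate-then-match step might look as follows. This is a sketch: the matching function is the same plain TF-IDF comparison as before, and the deep-translator call, which needs network access, is kept behind the `__main__` guard so any `translate(text) -> str` function could be dropped in instead:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def match_query(query_en: str, corpus: list[str]) -> list[float]:
    """Similarity of an (already translated) English query to each corpus text."""
    vectors = TfidfVectorizer().fit_transform([query_en] + corpus)
    return cosine_similarity(vectors[0:1], vectors[1:])[0].tolist()


if __name__ == "__main__":
    # Network call; requires `pip install deep-translator`.
    from deep_translator import GoogleTranslator

    latin = ("Corpus omne perseverare in statu suo quiescendi vel movendi "
             "uniformiter in directum")
    query_en = GoogleTranslator(source="la", target="en").translate(latin)
    print(match_query(query_en, ["Every body perseveres in its state of rest, "
                                 "or of uniform motion in a right line",
                                 "The quick brown fox jumps over the lazy dog"]))
```

The exact scores depend on what Google Translate returns for the Latin, but the translated query should now score far higher against the Motte text than against the dummy text.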
These differences between 21st-century and early modern English negatively impact the effectiveness of our Text Matching, as indicated by the low N-gram overlap between the texts: despite the TF-IDF cosine similarity, individual sequences of 2 consecutive words (bigrams) in the two translations are quite dissimilar. The machine translation, in other words, does not use many of the same sequences of words as the Motte translation. For this example, we can measure the similarity between the two translations (Motte’s and the machine translation):

19 https://pypi.org/project/deep-translator/#translation-for-humans

Figure 7. A similarity comparison between the machine translation of Newton’s Latin text and Motte’s early modern English translation (TEXT 3)

The machine translation into 21st-century English is significantly similar to the 18th-century Motte translation (0.692) but far from identical (see Figure 7). Indeed, even while the TF-IDF keywords are fairly similar, individual sequences of words (N-grams) are rather less so. Whether this is due more to the limitations of the machine translation of the original Latin or to the lexical differences between contemporary and early modern English is unclear, but in either case it reduces the effectiveness of the text matching. We should also keep in mind that the nature of translation itself – especially translation in the early modern era – is not intended to produce lexically equivalent texts but semantically similar ones, to varying degrees, sometimes rather loosely. Thus, we should not demand that the machine translation and the original Motte translation be identical. Nonetheless, are they similar enough to produce meaningful Text Matching results? Based on our sample corpus and the similarity scores above, we cannot know for sure.
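The pattern seen in Figure 7 – keywords fairly similar, word sequences less so – can be reproduced in miniature. In the sketch below, the second sentence is an invented present-day rendering, not actual Google Translate output, and Motte's spelling is modernised ('impressed') so that tokenisation does not interfere:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_cos(a: str, b: str) -> float:
    """TF-IDF cosine similarity between two sentences."""
    v = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(v[0:1], v[1:2])[0, 0])


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word bigrams."""
    def grams(s: str) -> set[tuple[str, str]]:
        t = s.lower().split()
        return {(t[i], t[i + 1]) for i in range(len(t) - 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


# Motte (spelling modernised) vs a hypothetical modern rendering of law II.
motte = "The alteration of motion is ever proportional to the motive force impressed"
modern = "The change of motion is proportional to the motive force impressed"

print(tfidf_cos(motte, modern))       # keyword-level similarity: fairly high
print(bigram_jaccard(motte, modern))  # bigram overlap: noticeably lower
```

Even this close pair of renderings shows the gap: shared vocabulary pushes the TF-IDF score up, while small changes in wording break many of the bigrams.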
We have dramatically increased the similarity scores by introducing machine translation, which is surely a step in the right direction, but the only way to know if they are ‘good enough’ is to attempt to match some texts from our actual corpus – to test the Text Matching tool ‘in the wild’, with more than one language.

3.3. Multilingual Text Matching ‘in the Wild’

To test this, we can re-run our original research query. Instead of using Everard’s English translation of the Divine Pymander as our query text, we will use what may have been his original Latin source text: Marsilio Ficino’s 1471 De potestate et sapientia Dei [30].20 This should give us a sense of whether our results using machine translation are ‘good enough’. So, now our search ‘query’ will be an entire text in Latin, which we then translate into English using automated machine translation, before searching across our comparison corpus, which remains the one we used with our original Everard example – our corpus of primarily 17th-century English-language texts. If this approach is sound, we should get fairly similar results to what we found when we used Everard as our query text. We should not demand the very same results, however, given the nature of translation and some of the limitations of the approach explained above. But do we at least obtain ‘fairly similar’ ones?

20 While we have chosen Ficino’s 1471 edition in this example, some recent work [31] points to Patrizi’s 1591 translation instead [32]. In fact, the VERITRACE Close Reading Corpus, once finalised, should be able to establish this with determinative evidence because it will allow us to compare, in minute detail, all the editions used. Nonetheless, for our purposes here, the Ficino source text should be sufficient, even if we must revise our assumption later.

Figure 8. The most similar sentences matched between the machine-translated Ficino query text and the comparison corpus of English-language texts.
In Figure 8 we find the most similar sentences to the machine-translated Ficino query text. It is a relief to see some familiar results: the 1657 Divine Pymander has the most similar sentence (a long, meaningful one), and Ralph Cudworth appears as well. But, at the same time, we are not getting the exact same results that we did when Everard was the query text. The results are similar but not identical. Now what about the most similar sentence groups (chunks)?

Figure 9. The most similar sentence groups (chunks) matched between the machine-translated Ficino query text and the comparison corpus of English-language texts.

Here again, it would be worrisome – and a sign of the deficiencies of our approach – if most of these results did not come from the 1657 Divine Pymander. In fact, except for the top matching result above (more on that in a moment), the top 10 matches all come from the 1657 edition – just as they did when Everard was our query text (see Figure 9 for the top 4 results). And what about that top result? It is the exception that proves the rule, for if one looks closely, one can see that La Peyrère is paraphrasing directly from the Divine Pymander (‘Wherefore the same Poemander would have eternity to be in God, and the world in eternity…’) [33]. These results, we believe, are enough to show proof of concept for a multilingual approach using automated machine translation. They appear to be ‘good enough’ to obtain meaningful and reliable results. And this is before we refine the approach. For instance, we could include transformation rules for each translation such that the resulting translation is closer in syntax to its early modern variant (in whatever language), instead of relying on 21st-century vocabulary and syntax. Or we could include a ‘Translation Matrix’ that maps translated terms between source and target languages. Both steps should bump up the similarity scores.
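The transformation-rule idea can be illustrated with a few hand-written substitutions. The rules below are hypothetical examples, not VERITRACE's actual rule set; a real set would be derived from the corpus and handle far more cases:

```python
import re

# Hypothetical normalisation rules: align a few early modern English forms
# with their modern equivalents, so that a 21st-century machine translation
# and a 17th/18th-century text share more surface vocabulary. A production
# rule set would need to be much more careful (e.g. "he'd" must not
# become "heed").
RULES = [
    (r"(\w+)'d\b", r"\1ed"),  # impress'd -> impressed, compell'd -> compelled
    (r"\bhath\b", "has"),
    (r"\bdoth\b", "does"),
]


def normalise(text: str) -> str:
    """Apply each substitution rule in turn."""
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text
```

Applying such normalisation to both the machine translation and the comparison corpus before vectorisation should raise lexical similarity scores for genuinely matching passages.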
We can also cross-reference our Text Matching results with some of the more corpus-based measures we are using in VERITRACE (not discussed in this paper), including Latent Semantic Analysis and Latent Dirichlet Allocation, which have been used on multilingual corpora before [34, 35]. Automated machine translation is not our only option, however. There are exciting possibilities with the latest multilingual transformer models. In fact, we can obtain even more accurate results – without the extra step of automated translation – by using some of these on our corpus. This is not without trade-offs – as we mention below – but there are great opportunities here. To illustrate this, let’s return to our simple 3-text corpus and use some multilingual transformer models to compare the Latin search query to the corpus:

Figure 10. Newton’s Latin text (TEXT 1) is used as a search query across the multilingual corpus and processed using a set of multilingual transformer models – instead of TF-IDF.

Here we send Newton’s Latin text (TEXT 1) as our search query across the corpus – without performing any additional machine translation first. The multilingual transformer models have been trained on many languages and have no trouble handling multilingual texts. We have provided three of these for comparison, but the LaBSE model appears to perform best in this instance, so we have highlighted its results above (see Figure 10). LaBSE, originally developed by Google, is a language-agnostic BERT sentence embedding model that supports 109 languages out of the box.21 It “is trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. So, it can be used for mining for translations of a sentence in a larger corpus.”22 As we expect, the TEXT 1 query is identical to itself, and the dummy English TEXT 2 is not very similar to the Latin query.
But the results for TEXT 3 should make us sit up and take notice: the LaBSE model gives a cosine similarity score of 0.887 – a significantly higher score than we were able to achieve using automated translation. The model appears to detect the deep semantic similarity between the Latin query and its English translation. This is very much what we had hoped for when we began this investigation. If this kind of result were to hold up across all six languages, then we will have found a very powerful tool indeed. The VERITRACE team will therefore explore the use of multilingual transformer models with our Text Matching tool. But we should also be cautious, for there are definite trade-offs in using these new tools. They require significantly more computational resources, add complexity, and are much less interpretable. We cannot yet understand how these models come to the conclusions they do, nor can we consistently reproduce the same results. There is an element of non-determinism in their method, which makes reproducible research much harder to achieve [36]. Still, automated machine translation is also built on the advances of transformer models, so once we introduce it into our project, we must confront this ‘black box’ technology.

21 https://huggingface.co/sentence-transformers/LaBSE
22 https://www.kaggle.com/models/google/labse/tensorFlow2/labse/1?tfhub-redirect=true

This is the direction NLP has been headed in the past few years, and it would be obstinate to ignore these developments entirely. A final point about capabilities: we are just scratching the surface of what one can do, even with our more traditional tools. For Text Matching, we have so far used only one query text at a time, but we could instead use our entire Close Reading Corpus, or a subset of it, e.g. all editions of the Corpus Hermeticum, as the query. We could then look for matches and similarities to this larger source collection.
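A comparison like the LaBSE run in Figure 10 can be reproduced with the open-source sentence-transformers package (an assumption about tooling on our part; the paper does not state which implementation produced the figure). Because the model weights are downloaded on first use, the model call is kept behind the `__main__` guard:

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Downloads the LaBSE weights on first run; requires
    # `pip install sentence-transformers`.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/LaBSE")
    texts = [
        "Corpus omne perseverare in statu suo quiescendi vel movendi "
        "uniformiter in directum",                        # TEXT 1 (Latin)
        "The quick brown fox jumps over the lazy dog",    # TEXT 2 (dummy)
        "Every body perseveres in its state of rest, or of uniform "
        "motion in a right line",                         # TEXT 3 (Motte)
    ]
    t1, t2, t3 = model.encode(texts)
    print("Latin vs dummy:", cosine(t1, t2))
    print("Latin vs Motte:", cosine(t1, t3))  # the translation pair scores far higher
```

No translation step is needed: the Latin query and its English translation are embedded into the same vector space and compared there directly.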
We will also use our entire multilingual VERITRACE text collection of c. 430,000 texts as the comparison corpus – not just the 18,633 English texts we limited ourselves to for this specific research question. Using many query and comparison texts will, of course, dramatically increase the computational demands of the Text Matching task, and there may be a practical limit here. Whether that limit lies at 100 or 1,000 or 100,000 texts, we have yet to explore. The VERITRACE team must also consider whether the increased accuracy of transformer models is worth the trade-offs and complexity they bring with them.

4. Next Steps: A Two-Pronged Approach to Text Matching

What are the lessons for VERITRACE in this small, multilingual case study? To conclude this paper, we outline our current thinking about what it means for the creation of a more robust version of multilingual text matching. The VERITRACE Text Matching tool should be able to measure lexical similarity between highly similar words or phrases (irrespective of their ‘semantic’ meaning) from different texts. If a text from the comparison corpus, for example, repeats verbatim a passage from the query text, then no matter what model we use, this should generate a similarity score of 1. And if two sentences or passages are very similar from a lexical standpoint – they tend to use many of the same words, even with different semantic meanings – that too should show up as high lexical similarity. This is, again, a sort of plagiarism detector, at least of the simplistic kind, where one author simply re-uses exact or very similar words and phrases from another work or set of works. And for this matching task, TF-IDF with cosine similarity is a great choice for identifying similar texts, because TF-IDF (and its variations, like Okapi BM25) is an effective, surface-level tool, focused on a text’s most important keywords and vocabulary. There are drawbacks, however, to using TF-IDF-based cosine similarity alone.
If we limit ourselves to that, we are likely to get some false positives: passages that appear to overlap based on vocabulary and lexical similarity but have different meanings. Consider the pair of sentences: “He couldn’t desert his post at the plant” and “Desert plants can survive droughts” [37]. They do not mean remotely the same thing – they have vastly different semantic meanings – but they share some of the same key vocabulary, so they would likely receive a high similarity score if matched on lexical similarity alone. There would undoubtedly be false negatives as well, if we consider passages that have been extensively paraphrased. One can imagine a more sophisticated plagiariser – continuing our metaphor – who changes most of the words from a borrowed passage but keeps its sense and meaning. A simple example drawn from [38]: consider “Peter is a handsome boy” and “Peter is a good-looking lad.” They are arguably quite close in semantic meaning, but their keywords (beyond the proper name) do not overlap and would therefore not be identified as similar using TF-IDF. Now, we are not trying to identify early modern plagiarism per se, but traces of influence between ancient wisdom texts and natural philosophical discourse. The need, however, is the same: to be able to identify both lexical and deep semantic similarity between query and comparison texts. Therefore, if we want to avoid too many false positives and false negatives, we cannot limit ourselves to TF-IDF alone. To capture paraphrasing, we need a nuanced semantic similarity tool, like the one transformer models seemingly supply. And this is just in terms of monolingual text matching. If we want to capture any sort of cross-lingual similarity at all, we need an effective semantic similarity tool for that as well. Indeed, at a minimum, this is what our case study above has demonstrated.
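Both failure modes can be seen in a small sketch (using scikit-learn, with English stop words removed). On sentences this short the absolute scores are low, but the point is that lexical scoring places the unrelated pair and the paraphrase pair in the same band – it cannot separate spurious vocabulary overlap from genuine paraphrase:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def pair_similarity(a: str, b: str) -> float:
    """TF-IDF cosine similarity between two sentences, stop words removed."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform([a, b])
    return float(cosine_similarity(vectors[0:1], vectors[1:2])[0, 0])


# False positive risk: shared vocabulary ('desert'), unrelated meanings.
fp = pair_similarity("He couldn't desert his post at the plant",
                     "Desert plants can survive droughts")

# False negative: close paraphrase, almost no shared content words.
fn = pair_similarity("Peter is a handsome boy",
                     "Peter is a good-looking lad")

print(fp, fn)  # both nonzero and in the same low band
```

A purely lexical matcher, in other words, rewards the first pair for vocabulary it happens to share while giving the second pair no credit for its shared meaning.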
That means our Text Matching tool ought to take a two-pronged approach: it needs to capture and identify, at one end of the spectrum, surface-level lexical similarity and, at the other end, deep, contextual, semantic similarity. TF-IDF-based cosine similarity is therefore likely to be our tool of choice for the lexical similarity metric. For the other prong, we can use a multilingual transformer model, like LaBSE (with cosine similarity), to capture deep semantic similarity. Both can be used together, as complements, for monolingual text matching, but multilingual text matching will lean more heavily on the latter by using a multilingual transformer model.23 This two-pronged approach is becoming standard, in fact, with the latest iterations of vector search. Vector database software vendors, for instance, already advertise sparse search (e.g. TF-IDF) vs. dense search (e.g. using transformers), as well as hybrid search, which combines the two result sets.24 Multilingual hybrid search is, unfortunately, harder to find.25 In any case, whether we rely on an implementation using open-source vector database software or customise some variation of our own, pursuing this general hybrid approach to Text Matching is a logical next step. In short, the capabilities of VERITRACE will be expanded significantly as we proceed, and we look forward to sharing our results with the academic community.

23 An interesting question is: does it make sense to consider lexical similarity in a multilingual context? What would that mean? This is beyond the scope of this paper, but a promising line of investigation is found in work done on cross-lingual plagiarism detection. For languages that are not too dissimilar lexically, character n-gram vectors have been tried [39]. More recently, attempts have been made using multilingual word clusters or sets of multilingual thesauri, combined with automated translation [40, 41].
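The fusion step at the heart of hybrid search can be sketched as a simple score combination. This is a generic sketch, not any particular vendor's implementation: vector databases offer several fusion schemes (including rank-based fusion), of which this convex combination of normalised scores is one common variant:

```python
def hybrid_scores(sparse: dict[str, float], dense: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Fuse sparse (lexical) and dense (semantic) similarity scores.

    Both score dicts map document id -> score, assumed normalised to [0, 1].
    alpha = 1.0 is pure sparse (lexical) search, alpha = 0.0 pure dense
    (semantic) search; documents missing from one result set score 0.0 there.
    """
    docs = set(sparse) | set(dense)
    return {doc: alpha * sparse.get(doc, 0.0) + (1 - alpha) * dense.get(doc, 0.0)
            for doc in docs}
```

With the weight tuned per task, monolingual matching could lean toward the sparse prong, while multilingual matching would push the weight toward the dense, transformer-based prong.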
24 https://weaviate.io/blog/hybrid-search-explained
25 Weaviate, for instance, offers multilingual semantic search but not lexical or keyword search in anything other than English: https://weaviate.io/blog/weaviate-non-english-languages

Acknowledgements

VERITRACE would like to acknowledge funding from the European Research Council as part of ERC-StG Project VERITRACE (101076836). We would also like to thank Dr. Klaus Ceynowa, the Director General of the Bayerische Staatsbibliothek, for his support, as well as the entire staff of the Munich Digitisation Centre. VERITRACE also thanks: Dr. Arthur der Weduwen, the Co-Director and Project Manager of the Universal Short Title Catalogue (USTC: https://www.ustc.ac.uk), for allowing us access to raw USTC database files; Doug Knox, Martin Mueller, Craig Berry, Joseph Loewenstein, and Anupam Basu of the EarlyPrint Project (https://earlyprint.org) for helpful advice right at the start of our project; Paul F. Schaffner, Manager of the Text Creation Partnership (EEBO-TCP) and Senior Associate Librarian, Digital Content & Collections, at the University of Michigan Library, for clear explanations of the EEBO dataset; Samuel Pizelo and Arthur Koehl, from Project Quintessence (http://quintessence.ds.lib.ucdavis.edu) at the UC Davis Data Lab, who shared their approach to managing data; and, finally, audience feedback on our project at panel presentations at the Scientiae Conference in Brussels (June 2024), the European Society for the History of Science in Barcelona (Sept. 2024), and the 4th Humanities-Centred A.I. Workshop in Würzburg (Sept. 2024).

References

[1] VERITRACE project website. URL: https://veritrace.eu.
[2] C. J. Schilt, Traces de la Verité: The reappropriation of ancient wisdom in early modern natural philosophy, VERITRACE (ERC-2022-STG-101076836), 2022. URL: https://veritrace.eu/wp-content/uploads/2023/04/Project-Traces-de-la-Verite-Condensed.pdf.
[3] J. C.
Wolf, From Data Acquisition to Latent Semantic Analysis: Developing VERITRACE’s Computational Approach to Tracing the Influence of Ancient Wisdom in Early Modern Philosophy, Society and Politics 18(1:35) (April 2024, forthcoming 2025).
[4] M. Cohen, Narratology in the Archive of Literature, Representations 108 (2009).
[5] D. Reid, Distant Reading, ‘the Great Unread’, and 19th-Century British Conceptualizations of the Civilizing Mission: A Case Study, Journal of Interdisciplinary History of Ideas 15 (2019).
[6] M. J. Hill, S. Hengchen, Quantifying the Impact of Dirty OCR on Historical Text Analysis: Eighteenth Century Collections Online as a Case Study, Digital Scholarship in the Humanities 34, no. 4 (2019).
[7] P. Kurhekar, S. Nigam, S. Pillai, Automated Text and Tabular Data Extraction From Scanned Document Images, in: Data Management, Analytics and Innovation, Proceedings of ICDMAI 2021, 1 (2021), pp. 169-182.
[8] K. Imai, Quantitative Social Science: An Introduction, Princeton University Press, Princeton and Oxford, 2018.
[9] F. Karsdorp, M. Kestemont, A. Riddell, Humanities Data Analysis: Case Studies with Python, Princeton University Press, Princeton and Oxford, 2021.
[10] H. Trismegistus, The divine Pymander of Hermes Mercurius Trismegistus, in XVII. books. Translated formerly out of the Arabick into Greek, and thence into Latine, and Dutch, and now out of the original into English; by that learned divine Doctor Everard. Printed by Robert White, London, [1650]. A digital transcription of this text can be found online. URL: https://sacred-texts.com/eso/pym/index.htm.
[11] M. P. Oakes, Author Profiling and Related Applications, in: R. Mitkov (ed.), Oxford Handbook of Computational Linguistics, 2nd ed., Oxford University Press, Oxford, UK, 2022, pp. 1165-1197.
[12] H.
Trismegistus, Hermes Mercurius Trismegistus his Divine pymander in seventeen books: together with his second book called Asclepius, containing fifteen chapters with a commentary / translated formerly out of the Arabick into Greek, and thence into Latine, and Dutch, and now out of the original into English by Dr. Everard. Printed by J.S. for Thomas Brewster, London, 1657.
[13] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd ed. Online manuscript released August 20, 2024. URL: https://web.stanford.edu/~jurafsky/slp3.
[14] C. D. Manning, P. Raghavan, H. Schütze, An Introduction to Information Retrieval, Cambridge Online Edition, Cambridge University Press, Cambridge, UK, 2009.
[15] R. Mitkov (ed.), Oxford Handbook of Computational Linguistics, 2nd ed., Oxford University Press, Oxford, UK, 2022.
[16] O. Levy, Word Representation, in: R. Mitkov (ed.), Oxford Handbook of Computational Linguistics, 2nd ed., Oxford University Press, Oxford, UK, 2022, pp. 334-358.
[17] R. Mihalcea, S. Hassan, Similarity, in: R. Mitkov (ed.), Oxford Handbook of Computational Linguistics, 2nd ed., Oxford University Press, Oxford, UK, 2022, pp. 415-434.
[18] N. Indurkhya, F. J. Damerau (Eds.), Handbook of Natural Language Processing, 2nd ed., Chapman and Hall/CRC, Boca Raton, FL, 2010.
[19] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O’Reilly Media, Inc., Sebastopol, CA, 2009.
[20] A. Wangperawong, Multilingual Search with Subword TF-IDF, arXiv, 29 Sept 2022.
[21] G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall, Upper Saddle River, NJ, 1971.
[22] C. E. Osgood, G. J. Suci, P. H. Tannenbaum, The Measurement of Meaning, University of Illinois Press, Urbana, IL, 1957.
[23] T.
Traherne, Christian ethicks, or, Divine morality opening the way to blessedness, by the rules of vertue and reason, London: Printed for Jonathan Edwin, 1675. URL: https://name.umdl.umich.edu/A63047.0001.001.
[24] C. Marks, Thomas Traherne and Hermes Trismegistus, Renaissance News 19/2 (1966), pp. 118-131.
[25] R. Cudworth, The true intellectual system of the universe. The first part wherein all the reason and philosophy of atheism is confuted and its impossibility demonstrated, London: Printed for Richard Royston, 1678. URL: https://name.umdl.umich.edu/A35345.0001.001.
[26] D. P. Walker, The Ancient Theology: Studies in Christian Platonism from the Fifteenth to the Eighteenth Century, Cornell University Press, Ithaca, NY, 1972.
[27] I. Newton, Philosophiae naturalis principia mathematica. Editio tertia aucta & emendata, London: Apud Guil. & Joh. Innys, 1726.
[28] A. Motte (ed.), The mathematical principles of natural philosophy. By Sir Isaac Newton. Translated into English by Andrew Motte. To which are added, The laws of the moon’s motion, according to gravity. By John Machin ... In two volumes [volume 1], London: Printed for Benjamin Motte, 1729.
[29] W. H. Gomaa, A. A. Fahmy, A Survey of Text Similarity Approaches, International Journal of Computer Applications 68(13) (2013), pp. 13-18. https://doi.org/10.5120/11638-7118.
[30] M. Ficino, De potestate et sapientia Dei, Treviso: Gerardus de Lisa, 1471.
[31] W. J. Hanegraaff, A Suggestive Inquiry into Hermetic Rebirth: Nondual Noēsis and Bodily Fluids in Victorian England, in: S. Perez, B. van Rijn, J. Schlieter (Eds.), Intentional Transformative Experiences, De Gruyter, Berlin, pp. 149-178.
[32] F. Patrizi, Nova de universis philosophia, Ferrara: Apud Benedictum Mammarellum, 1591.
[33] I. La Peyrère, Men before Adam. Or a discourse upon the twelfth, thirteenth, and fourteenth verses of the fifth chapter of the Epistle of the Apostle Paul to the Romans. By which are prov’d, that the first men were created before Adam.
[A theological systeme upon that presupposition, that men were before Adam. The first part.], London: F. Leach, 1656.
[34] A. A. P. Ratna, P. D. Purnamasari, B. A. Adhi, F. A. Ekadiyanto, M. Salman, M. Mardiyah, D. J. Winata, Cross-Language Plagiarism Detection System Using Latent Semantic Analysis and Learning Vector Quantization, Algorithms 10(69), 2017. https://doi.org/10.3390/a10020069.
[35] T. K. Landauer, D. S. McNamara, S. Dennis, W. Kintsch (Eds.), Handbook of Latent Semantic Analysis, Psychology Press, New York, NY, 2007.
[36] J. E. Dobson, Interpretable Outputs: Criteria for Machine Learning in the Humanities, Digital Humanities Quarterly 15(2), 2021.
[37] A. Ng, N. Namjoshi, Understanding and applying text embeddings [MOOC], DeepLearning.AI. URL: https://www.deeplearning.ai/short-courses/google-cloud-vertex-ai/.
[38] R. Ferreira, R. D. Lins, S. J. Simske, F. Freitas, M. Riss, Assessing sentence similarity through lexical, syntactic and semantic analysis, Computer Speech and Language 39 (2016), pp. 1-28.
[39] P. McNamee, J. Mayfield, Character n-gram tokenization for European language text retrieval, Information Retrieval 7 (2004), pp. 73-97.
[40] M. Potthast, A. Eiselt, L. A. Barrón-Cedeño, B. Stein, P. Rosso, Overview of the 3rd international competition on plagiarism detection, in: CEUR Workshop Proceedings, Vol. 1177, 2011.
[41] K. Avetisyan, A. Malajyan, T. Ghukasyan, A. Avetisyan, A Simple and Effective Method of Cross-Lingual Plagiarism Detection, arXiv, 5 April 2023.