A plagiarism detection procedure in three steps: selection, matches and "squares"∗

Chiara Basile, Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126, Bologna, Italy, basile@dm.unibo.it
Dario Benedetto, Dip. Matematica, Sapienza, Univ. Roma, P.le Aldo Moro 2, 00185, Roma, Italy, benedetto@mat.uniroma1.it
Emanuele Caglioti, Dip. Matematica, Sapienza, Univ. Roma, P.le Aldo Moro 2, 00185, Roma, Italy, caglioti@mat.uniroma1.it
Giampaolo Cristadoro, Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126, Bologna, Italy, cristadoro@dm.unibo.it
Mirko Degli Esposti, Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126, Bologna, Italy, desposti@dm.unibo.it

∗ We thank the organizers of the PAN-09 competition for the stimulating challenges, and C. Cattuto and V. Loreto for bringing the contest to our attention.

Abstract: We present a detailed description of an algorithm tailored to detect external plagiarism in the PAN-09 competition. The algorithm is divided into three steps: a first reduction of the size of the problem, in which ten candidate source documents are selected for each suspicious document by means of an n-gram distance on suitably recoded texts; a search for matches after a T9-like recoding; and a "joining algorithm" that merges the selected matches and is able to detect obfuscated plagiarism. The results are briefly discussed.

Keywords: n-grams, plagiarism, coding, string matching

1 Introduction

In this short paper we describe our approach to automatic plagiarism detection. In particular, we discuss how we were able to borrow and adapt techniques and ideas recently developed by some of us for different, but related, problems such as Authorship Attribution (AA) (Benedetto, Caglioti, and Loreto, 2002; Basile et al., 2008). While some of the authors have gained a certain expertise in the field of AA over the last years, this is the first time we face plagiarism detection. The algorithm we describe has been tailored to the "1st International Competition on Plagiarism Detection" (Stein et al., 2009) and does not pretend to be optimal for generic situations. Indeed, we joined the competition while being completely unaware of the relevant literature, and thus the main aim of this paper is to contribute to a detailed comparison of the different approaches to the contest, which, we hope, will enlarge the applicability of the methods and foster a deeper integration and combination of different ideas.

2 The problem and the datasets

The contest was divided into two different challenges: external and internal plagiarism. We concentrated on external plagiarism only. For the sake of completeness we recall the main goal (see (PAN-09-Organizers, 2009) for more details): given a set of suspicious documents and a set of source documents, the task is to find all text passages in the suspicious documents which have been plagiarized and the corresponding text passages in the source documents.

The organizers provided a training corpus, composed of 7214 source documents and 7214 suspicious documents, each with an associated XML file containing the information about the plagiarized passages. A first statistical analysis shows that text lengths vary from a few hundred to 2.5 million characters, while the total number of plagiarized passages is 37046. Moreover, exactly half of the suspicious texts contain no plagiarism, and about 25% of the source documents are not used to plagiarize any suspicious document. The length of the plagiarized passages has a very peculiar distribution, see Figure 1: there are no passages with length in the window 6000-12000 characters, and even for long texts there is no plagiarism longer than 30000 characters. A remarkable fact is that about 13% of the plagiarized passages consist of translated plagiarism.

[Figure 1: Distribution of plagiarized passage lengths in the training corpus (percentage of plagiarized passages, logarithmic scale, versus length in characters).]

Similarly, the competition corpus is composed of 7215 source documents and 7214 suspicious documents (obviously without any associated XML files). The length statistics are very close to those of the training corpus, see Figure 2.

[Figure 2: Text length distribution for the training corpus and for the competition corpus (percentage of texts, logarithmic scale, versus text length in characters, for source and suspicious texts).]

The overall score is calculated as the ratio between F-measure and granularity over the whole set of detected characters (see (Stein et al., 2009) for more details).

3 The method

3.1 A first selection

For each suspicious document, we first identify a small subset of source documents for a second, deeper (and computationally demanding) analysis. We start by calculating a suitable distance between each suspicious and each source text. Then, for each suspicious text, the 10 source neighbors closest according to this distance are kept for further analysis.

In order to bring the computational time to a practical scale, the texts were first transformed into sequences of word lengths, so that for example the sentence "To be, or not to be: that is the question" becomes simply 2223224238. All word lengths greater than 9 were cut to 9, so that the new alphabet consists of the nine symbols {1, . . . , 9}. These coded versions of the texts are on average 82.5% shorter than the original ones, and so computation times are greatly reduced.
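As an illustration, the following is a minimal Python sketch of this word-length coding; the exact tokenization used by the authors is not specified, so here a "word" is simply assumed to be a maximal run of alphabetic characters.

import re

def word_length_coding(text):
    # Recode a text as the sequence of its word lengths, capped at 9.
    # Assumption: a word is a maximal run of alphabetic characters.
    words = re.findall(r"[A-Za-z]+", text)
    return "".join(str(min(len(w), 9)) for w in words)

# Example from the text:
print(word_length_coding("To be, or not to be: that is the question"))  # 2223224238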
The distance between each suspicious and each source text has been computed by comparing in a suitable way the frequency vectors of the 8-grams of the coded versions. The n-gram distance used here has already proved successful in Authorship Attribution problems (Basile et al., 2008). To be more precise, after a first experiment based on bigram frequencies presented in 1976 by Bennett (Bennett, 1976), Kešelj et al. published in 2003 a paper in which n-gram frequencies were used to define a similarity distance between texts (Kešelj et al., 2003). The distance introduced in (Basile et al., 2008) and used here should be considered a natural development of the previous ones: once the value of n has been fixed, usually between 4 and 8, n-gram frequencies are calculated for a given text x. If we denote by ω an arbitrary n-gram, by f_x(ω) the relative frequency with which ω appears in the text x, and by D_n(x) the so-called n-gram dictionary of x, that is, the set of all n-grams which have nonzero frequency in x, then for any pair of texts x and y we can define:

d_n(x, y) := \frac{1}{|D_n(x)| + |D_n(y)|} \sum_{\omega \in D_n(x) \cup D_n(y)} \frac{|f_y(\omega) - f_x(\omega)|}{f_y(\omega) + f_x(\omega)}

This is exactly the distance that has been used in (Basile et al., 2008), together with a suitable voting procedure, for approaching a two-class classification problem.
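The following is a minimal sketch of how d_n can be computed from two coded texts; the function names and the use of Python dictionaries for the frequency vectors are ours, but the quantity computed is the distance defined above.

from collections import Counter

def ngram_freqs(coded_text, n=8):
    # Relative frequencies of the character n-grams of a (coded) text.
    grams = Counter(coded_text[i:i + n] for i in range(len(coded_text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def ngram_distance(fx, fy):
    # The distance d_n(x, y) defined above; fx and fy hold the n-gram
    # dictionaries D_n(x) and D_n(y) with their relative frequencies.
    norm = len(fx) + len(fy)
    dictionary = set(fx) | set(fy)
    return sum(abs(fy.get(w, 0.0) - fx.get(w, 0.0)) /
               (fy.get(w, 0.0) + fx.get(w, 0.0)) for w in dictionary) / norm

# For each suspicious text, the 10 source texts with the smallest distance
# are kept for the second phase, e.g.:
# d = ngram_distance(ngram_freqs(word_length_coding(susp)),
#                    ngram_freqs(word_length_coding(src)))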
The parameter n = 8 for the n-gram distance was chosen here as a compromise between a good recall (the fraction of plagiarized characters coming from the first 10 nearest neighbors of each suspicious text is 81%) and acceptable computation times (about 2.3 days for the whole corpus). Since it was impossible to do repeated computations on the whole corpus, the optimization was done using a small subset of 160 suspicious texts and 300 source texts, suitably selected by imposing total and plagiarism length distributions comparable to those of the whole corpus.

Note that a recall of 81% is a very good result for the 8-gram distance, since the method just described has basically no hope of recognizing the 13% of translated plagiarism. Therefore, at least on the small subset of the training corpus, only about 6% of the (non-translated) plagiarized passages are lost in this phase.

3.2 T9 and matches

After identifying the 10 "most probable plagiarist sources" for each suspicious document, we perform a more detailed analysis to detect the plagiarized passages. The idea is to look for common subsequences (matches) longer than a fixed threshold. To this goal we need to recover some of the information lost in the first phase, by first rewriting the texts in the original alphabet and then using a different (and less "lossy") coding. We perform a T9-like coding: this system is typically used for assisted writing on most mobile phones. The idea is to translate three or four different letters into the same character, for example {a,b,c} → 2, {d,e,f} → 3 and so on. The symbol 0 is used for newlines and blank spaces, and 1 for all symbols other than these two and the letters of the alphabet. The new alphabet for the coded texts is therefore made up of 10 symbols: {0, 1, 2, . . . , 9}. Note that the use of the T9 "compression", which could seem strange at first sight, can be justified by observing that a "long" T9 sequence (10-15 characters) has in most cases an "almost unique" translation into a sentence which makes sense in the original language.
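A minimal sketch of such a coding follows; the text only spells out the first two groups ({a,b,c} → 2 and {d,e,f} → 3), so the grouping of the remaining letters is assumed here to follow the standard phone keypad.

def t9_coding(text):
    # T9-like recoding on a 10-symbol alphabet: 0 for blanks and newlines,
    # 1 for any other non-letter symbol, 2-9 for groups of letters.
    # Assumption: standard keypad grouping abc/def/ghi/jkl/mno/pqrs/tuv/wxyz.
    groups = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]
    table = {ch: str(d) for d, letters in enumerate(groups, start=2) for ch in letters}
    coded = []
    for ch in text.lower():
        if ch in " \n":
            coded.append("0")
        else:
            coded.append(table.get(ch, "1"))
    return "".join(coded)

# Example:
print(t9_coding("To be, or not to be"))  # 8602310670668086023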
The "true" matches between a suspicious and a source document are now found: from any possible starting position in the suspicious document, the longest match in the source document is calculated (possibly more than one with the same length). If the length is larger than a fixed threshold and the match is not a submatch of a previously detected one, it is stored in a list. Here we take advantage of the choice of the T9 encoding, which uses ten symbols: for any starting position in the source document the algorithm stores the last previous position of the same string of length 7, and for any possible string of length 7 the last occurrence in the source file is memorized in a vector of size 10^7. With respect to other methods (suffix trees or sorting, for instance), in this way we can search for the maximum match in the source document while avoiding the comparison of the shorter ones.

Running this part of the algorithm on a common PC for the whole corpus of 7214 texts took about 40 hours. The threshold for the match length was fixed, somewhat arbitrarily, to 15 for texts shorter than 500000 characters and to 25 for longer texts.
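The sketch below illustrates this match search on the T9-coded texts. It uses a Python dictionary of length-7 substrings in place of the 10^7 vector, and a simplified submatch filter, so it should be read as our reconstruction of the procedure described above rather than the authors' exact implementation.

def find_matches(susp, src, min_len=15, k=7):
    # Chain together, for every position of the source, the previous
    # occurrence of the same length-k string; then, for every position of
    # the suspicious text, walk the chain and keep the longest extension.
    last = {}                      # last occurrence of each k-gram seen so far
    prev = [-1] * len(src)         # previous occurrence of the k-gram starting here
    for j in range(len(src) - k + 1):
        gram = src[j:j + k]
        prev[j] = last.get(gram, -1)
        last[gram] = j

    matches, covered_until = [], 0
    for i in range(len(susp) - k + 1):
        j = last.get(susp[i:i + k], -1)
        best_len, best_j = 0, -1
        while j >= 0:              # all source positions sharing this k-gram
            l = k
            while i + l < len(susp) and j + l < len(src) and susp[i + l] == src[j + l]:
                l += 1
            if l > best_len:
                best_len, best_j = l, j
            j = prev[j]
        # keep sufficiently long matches only, discarding submatches of the last one
        if best_len >= min_len and i + best_len > covered_until:
            matches.append((i, best_j, best_len))
            covered_until = i + best_len
    return matches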
3.3 Looking for "squares"

The previous phase gives a long list of matches for each suspicious-source pair of documents. Since the plagiarized passages had undergone various levels of obfuscation, typically the matches are "close" to each other in the suspicious text but are taken from not necessarily subsequent places in the source text. By representing the pair of texts in a bidimensional plot, with the suspicious text on the x axis and the source text on the y axis, each match of length l, starting at x in the suspicious document and at y in the source document, draws a line from (x, y) to (x + l, y + l). The result is often something similar to Figure 3: there are some random (short) matches all around the plane, but there are places where matches accumulate, forming lines or something similar to a square. Non-obfuscated plagiarism corresponds to lines, i.e. a single long match or many short matches in close succession both in the suspicious and in the source text, whereas, intuitively, obfuscated plagiarism corresponds to "squares": here the matching sequences appear in a different order in the source and in the suspicious document.

[Figure 3: suspiciousdocument00814.txt vs. sourcedocument04005.txt. Orange points correspond to the positions of matching chars (measured in number of chars from the beginning) between a suspicious and a source document of the training corpus. A "square" of matches corresponds to an obfuscated plagiarism.]

Figure 4 is an example of what can happen when there is no plagiarism: matches are uniformly spread around the plane and do not accumulate anywhere. Obviously these are just two of the many possible settings: longer texts or the presence of "noise" (e.g. long sequences of blanks, tables of numbers, ...) can give rise to a much higher density of matches, substantially increasing the difficulty of identifying the correct plagiarized passages.

[Figure 4: suspiciousdocument00814.txt vs. sourcedocument03464.txt. Orange points correspond to the positions of matching chars (measured in number of chars from the beginning) between a suspicious and a source document of the training corpus. No plagiarism is present in this case.]

In order to provide a single quadruple (x, y, lx, ly) of starting points and lengths for each detected plagiarized passage, we need an algorithm that joins the "cloud" of matches of each "square". Note that the algorithm performing this task needs to be scalable with the plagiarism lengths, which can vary from a few hundred up to tens of thousands of characters.

The algorithm used here joins two matches if the following conditions hold simultaneously (a sketch of this step is given at the end of this section):

1. the matches are subsequent in the x coordinate;

2. the distance between the projections of the matches on the x axis is greater than or equal to zero (no superimposed plagiarism) but smaller than or equal to the lx of the longer of the two sequences, scaled by a certain δx;

3. the distance between the projections of the matches on the y axis is greater than or equal to zero but smaller than or equal to the ly of the longer of the two sequences, scaled by a certain δy.

Merging now repeatedly the segments which are superimposed either in x or in y, we obtain some quadruples which correspond roughly to the "diagonals" of the "squares" in Figure 3. We finally run the algorithm once again, using smaller parameters δx and δy, in order to reduce the granularity from 2 to approximately the optimal value of 1. Figure 5 shows the result of the algorithm for the pair of texts of Figure 3 (blue), and Figure 6 shows the very good superimposition with the actual plagiarized passages (black), as derived from the XML file.

[Figure 5: suspiciousdocument00814.txt vs. sourcedocument04005.txt. Detected plagiarism for the pair of texts of the training corpus indicated at the top of the plot. Single matches in orange, joined matches in blue.]

[Figure 6: suspiciousdocument00814.txt vs. sourcedocument04005.txt. Plagiarized passages for the pair of texts of the training corpus indicated at the top of the plot. Single matches in orange, actual plagiarism in black. Note the perfect superimposition between the blue lines in Figure 5 and the black lines here.]

The procedure just described has been developed and tuned for the competition on a restricted time schedule, but it would be quite interesting to compare its efficiency with what can be achieved by standard clustering algorithms. We plan to do this in the future.
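Putting the three conditions above and the repeated merging together, the following is a minimal sketch of one joining pass over the match list of a suspicious-source pair; the handling of superimposed segments and the ordering of the operations are our own reading of the description, not the original Mathematica code.

def join_matches(matches, delta_x=3.0, delta_y=3.0):
    # One pass of the "joining algorithm". Each match is a quadruple
    # (x, y, lx, ly); raw matches from the previous phase have lx == ly.
    joined = sorted(matches)           # subsequent in the x coordinate
    changed = True
    while changed:
        changed, out = False, []
        for m in joined:
            if not out:
                out.append(m)
                continue
            x0, y0, lx0, ly0 = out[-1]
            x1, y1, lx1, ly1 = m
            gap_x = x1 - (x0 + lx0)    # gap between the x projections
            gap_y = y1 - (y0 + ly0)    # gap between the y projections
            if (0 <= gap_x <= delta_x * max(lx0, lx1) and
                    0 <= gap_y <= delta_y * max(ly0, ly1)):
                # replace the two matches with their bounding quadruple
                out[-1] = (x0, y0, x1 + lx1 - x0, y1 + ly1 - y0)
                changed = True
            else:
                out.append(m)
        joined = out
    return joined

# The detection runs this step twice, first with the larger deltas and then,
# on the result, with the smaller ones (see Section 3.4 for the values used).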
3.4 Tuning the parameters

The "joining algorithm" described above depends on four parameters: δx and δy for the first joining phase, and the rescaled δx and δy for the second joining phase. Our choice of the actual values has been dictated essentially by the lack of time, and no rigorous and efficient optimization has been performed. Driven by very few trials and some heuristics, we decided to use the following values: δx = δy = 3 for the first phase and δx = δy = 0.5 for the second.

It is important to remark that different choices of the δ values yield different detection results. For example, increasing their values typically results in a larger recall and a better granularity, but also in a smaller precision. A further analysis of these dependencies could provide (in future developments) a controllable way of trading off precision, recall and granularity, depending on the plagiarism detection task under consideration. A promising strategy that we plan to explore in the future consists in a dynamical tuning of these parameters according, for example, to the density of matches or to the lengths of the two texts under consideration.

The match-joining algorithm was developed using Mathematica 7, and it ran on a common PC for about 20 hours on the whole data set of the competition.

4 Results and comments

The algorithm described gave the following results on the competition corpus (Stein et al., 2009):

• Precision: 0.6727
• Recall: 0.6272
• F-measure: 0.6491
• Granularity: 1.0745
• Overall score: 0.6041

The overall score is the third best result, after the 0.6957 and 0.6093 of the first two participants. We stress that the overall score drops considerably starting from the fourth position (0.3045), the fifth (0.1885), and so on. Moreover, while the winner has better results in precision, recall and granularity alike, our precision and recall are better than those of the second-placed participant, while our granularity is worse.

The competition had a very tight schedule, therefore many improvements are possible. In particular, it may be that the match-joining problem can be advantageously formulated in the framework of Hidden Markov Models. Also, it would be worth seeing how standard clustering algorithms perform in this case. We are eager to compare techniques and ideas with the other participants of the contest.

References

Basile, C., D. Benedetto, E. Caglioti, and M. Degli Esposti. 2008. An example of mathematical authorship attribution. Journal of Mathematical Physics, 49:125211-125230.

Benedetto, D., E. Caglioti, and V. Loreto. 2002. Language Trees and Zipping. Physical Review Letters, 88(4):048702-1-048702-4.

Bennett, W.R. 1976. Prentice-Hall, Englewood Cliffs, NJ.

PAN-09-Organizers. 2009. Proceedings of PAN-09.

Stein, B., M. Potthast, A. Eiselt, P. Rosso, and A. Barrón-Cedeño. 2009. http://www.webis.de/pan-09/competition.php.

Kešelj, V., F. Peng, N. Cercone, and C. Thomas. 2003. N-gram-based author profiles for authorship attribution.