Tackling the PAN'09 External Plagiarism Detection Corpus with a Desktop Plagiarism Detector*

James A Malcolm, University of Hertfordshire, College Lane, Hatfield, Herts. j.a.malcolm@herts.ac.uk
Peter C R Lane, University of Hertfordshire, College Lane, Hatfield, Herts. p.c.lane@herts.ac.uk

* We thank Bob Dickerson, who developed the core from which the original Ferret code was developed.

Abstract: Ferret is a fast and effective tool for detecting similarities in a group of files. Applying it to the PAN'09 corpus required modifications to meet the requirements of the competition, mainly to deal with the very large number of files, the large size of some of them, and to automate some of the decisions that would normally be made by a human operator. Ferret was able to detect numerous files in the development corpus that contain substantial similarities not marked as plagiarism, but it also identified quite a lot of pairs where random similarities masked actual plagiarism. An improved metric is therefore indicated if the "plagiarised" or "not plagiarised" decision is to be automated.

Keywords: Ferret, n-grams, plagiarism detection, similarity

1 Introduction

In this paper we describe how we approached the challenge of the PAN'09 Plagiarism Detection Competition using the Ferret plagiarism detection software. We outline Ferret's strengths in normal use, highlight the difficulties we had in using Ferret for the competition task, and describe the results of the improvements that we made as a result of entering the competition.

The "external plagiarism analysis" task of the PAN'09 Plagiarism Detection Competition (International Competition on Plagiarism Detection, 2009) is an example of category 1 plagiarism (Lyon and Malcolm, 2002), as we have the source(s) in our hand. This suggests that Ferret is the tool for the job.

Ferret (Lyon, Barrett, and Malcolm, 2004; Lyon, Malcolm, and Barrett, 2005) is a tool that (when used on student work) is primarily good for detecting collusion rather than plagiarism (though it has been extended to generate search terms to drive an Internet search (Malcolm and Lane, 2008b)). It is a desktop plagiarism detector, which means that it is fast and interactive. It has to be fast, because a human is waiting for the results, and because it is interactive, human input is available and appropriate: after all, plagiarism is an academic judgement, not something that can be measured by a machine (Flint, Clegg, and Macdonald, 2006; Lyon, Barrett, and Malcolm, 2004). Before tackling the competition we had done little to automate the decision-making process ("is this plagiarism or not?"); Ferret tells its user which pairs to look at (and helps in reviewing those pairs) but leaves the actual decision to him or her. There is therefore no need for the software to draw a dividing line – a ranked list is sufficient.

The competition did highlight what we knew to be Ferret's strengths and weaknesses: we came second on recall, but precision was poor, as Ferret is too fine grained in its identification of similarities.

2 Ferret's Strengths

Assuming, as in the competition, that we already have the sources of all the copying, there are two tasks in identifying plagiarism in a collection of documents: finding which pairs of documents to look at and (once a pair has been selected for further examination) finding the blocks of matching text.

In the case of the competition there appear to be 7214² comparisons to be made. In the more general case usually considered by Ferret, every file needs to be compared with every other, giving n(n-1)/2 comparisons (which for source and suspicious files together is 104,076,378 pairs). But it is important to note that, because of the way Ferret works, these comparisons are not made explicitly; as the documents are read, Ferret creates a data structure which enables the most similar pair of documents to be selected without making a direct comparison of those two documents. To be more specific, it remembers every three-word sequence (triple) that it has seen, and which input files that triple appears in. Similar files are those with the largest number of common triples. This simple approach is what makes Ferret fast.
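By way of illustration, here is a minimal sketch of this kind of triple index, written in Ruby (the language of the post-processing scripts described in section 4); it is not Ferret's own code, and the tokenisation and helper names are assumptions made for the example. Every word triple maps to the set of files it occurs in, so the most similar pairs can be found by counting shared triples rather than by comparing documents directly. The resemblance value computed at the end is the Jaccard coefficient R discussed in section 3.2.

    require 'set'

    # Split a text into word triples (3-grams); the tokenisation is a
    # simplifying assumption for this sketch.
    def triples(text)
      text.downcase.scan(/[a-z0-9]+/).each_cons(3).map { |t| t.join(' ') }
    end

    # Build an inverted index: triple => set of files containing it,
    # plus the number of distinct triples per file.
    def index_files(paths)
      index = Hash.new { |h, k| h[k] = Set.new }
      sizes = {}
      paths.each do |path|
        ts = triples(File.read(path)).to_set
        sizes[path] = ts.size
        ts.each { |t| index[t] << path }
      end
      [index, sizes]
    end

    # Count the triples shared by each pair of files that co-occur under
    # at least one triple; no direct document-to-document comparison.
    def common_triples(index)
      counts = Hash.new(0)
      index.each_value do |files|
        files.to_a.combination(2) { |a, b| counts[[a, b].sort] += 1 }
      end
      counts
    end

    # Resemblance R: shared triples over the union of the two triple sets.
    def resemblance(counts, sizes)
      counts.map { |(a, b), c| [[a, b], c.to_f / (sizes[a] + sizes[b] - c)] }.to_h
    end

    # Example use (the corpus path is hypothetical):
    #   index, sizes = index_files(Dir['corpus/*.txt'])
    #   scores = resemblance(common_triples(index), sizes)
    #   scores.max_by(10) { |_, r| r }   # the ten most similar pairs

Sorting the resulting scores in descending order gives the ranked list that, as noted above, is all an interactive user needs.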
The Ferret user interface then allows the operator to display the similar documents side by side with the similar text highlighted. For large documents, it can take as long to display one pair as it does to find all the similarities in the set (we mention this to highlight the speed of the first phase, rather than as a deficiency of the user interface).

There is no specific support in Ferret for finding the sections of a source that have been copied. This is done by the operator (although he or she can click on any one of the matching triples to find where it appears in the two compared documents).

We expect to look at the most similar pairs (in descending order of similarity) and stop when we judge that plagiarism is no longer occurring. In effect this is a corpus-specific threshold (partly mitigating the problem mentioned in section 3.2).

3 Adapting for the Competition

The problems that have to be solved in order to use Ferret for the competition relate firstly to the large volume of data to be examined and secondly to the difference between how we would normally use Ferret and how the competition is run.

3.1 Scale

To deal with the large volume of data in the competition, we had to divide the input into batches that were processed in turn, as we did not have a machine with sufficient RAM to deal with all the data in one go. We estimate that 32GB would be enough; our machine had 9GB (for normal use, Ferret runs in 512KB or less). If batching is necessary, it is most efficient if half the available memory is used for source documents, and half for suspicious documents.

If M is the available memory, the number of batches of source documents, N_O, will be 2|O|/M (where |O| is the total size of all the source documents in the corpus). The number of batches of suspicious documents, N_U, should be 2|U|/M. The number of runs, N_O · N_U, will thus be 4|O||U|/M², which is quadratic in corpus size; doubling the memory available will make the system four times faster.
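As a small worked illustration of this arithmetic (a sketch only; the sizes used below are made-up figures, not measurements of the corpus):

    # Sketch of the batching arithmetic above; M, |O| and |U| must be in the
    # same units, and the example figures are invented.
    def runs_needed(source_total, suspicious_total, memory)
      n_o = (2.0 * source_total / memory).ceil      # batches of source documents
      n_u = (2.0 * suspicious_total / memory).ceil  # batches of suspicious documents
      n_o * n_u                                     # total runs, roughly 4|O||U|/M^2
    end

    runs_needed(36, 36, 9)   # => 64 runs with 9 units of memory
    runs_needed(36, 36, 18)  # => 16 runs: doubling memory quarters the work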
3.2 Automation

We needed to automate the decision between "plagiarised" and "innocent similarity". At present Ferret supports two possible metrics: a count of the number of common triples, and the Jaccard coefficient, 'R'.

Some of the files are very big, but these are mixed with quite small files, so our current metric does not work very well. The copied chunks vary from a couple of sentences upwards, but it is the huge size of some of the files that causes the problem, because the number of randomly matching triples in a huge file is bigger than the size of one of the smaller copied chunks.

Examining results for the first 100 suspicious files in the development corpus, we found that we should take the 50 most similar pairs to catch all the suspicious files where artificial plagiarism had been introduced.

Ferret picks out many very small similarities. Eliminating these "accidental" similarities (common triples, typically isolated) was a problem that we had to address with code if we were to have success in the competition. It was not a problem we had seen before because of the way we use Ferret.

3.3 Improvements

Our submission was the first run of the complete system.

We later revised it to take some account of the order of triples in the source document when deciding if matching triples in the suspicious document are part of a matching block or just a random match. This code is quite slow, and there are now a lot of parameters that can be adjusted to change how well Ferret would perform in the competition (a sketch of how they fit together follows the list):

• How many matching triples is considered too small to be worth using as input to the second phase: currently fewer than 50 (or R < 0.007).

• How many documents to keep in the "most similar pairs" list: 5 on the first (submitted) run, but considering 50.

• The unmatched gap between matching sections that can be merged: 1 in the first run; considering up to 4.

• How many sections before or after the current section we can jump when merging: no restriction in the first run; considering a range from 5 before to 10 after.
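The following sketch (ours, for illustration; it is not the revised Ferret code, and the data representation is an assumption) shows how these parameters might combine: weak pairs are discarded before the second phase, and matching sections in the suspicious document are merged into blocks when the unmatched gap between them is small enough.

    # A sketch of how the parameters above might fit together.
    MIN_TRIPLES = 50      # pairs with fewer common triples are dropped ...
    MIN_R       = 0.007   # ... unless R reaches this value (whether the two
                          # thresholds combine with "and" or "or" is our guess)
    PAIRS_KEPT  = 5       # most-similar sources kept per suspicious document
    MAX_GAP     = 1       # unmatched sections tolerated inside a merged block

    def candidate?(common_triples, r)
      common_triples >= MIN_TRIPLES || r >= MIN_R
    end

    # positions: sorted indices of the sections of the suspicious document
    # that contain a matching triple.  Runs separated by at most max_gap
    # unmatched sections are merged into [first, last] blocks.
    def merge_blocks(positions, max_gap = MAX_GAP)
      blocks = []
      positions.each do |pos|
        if blocks.any? && pos - blocks.last[1] <= max_gap + 1
          blocks.last[1] = pos
        else
          blocks << [pos, pos]
        end
      end
      blocks
    end

    candidate?(120, 0.002)           # => true
    merge_blocks([3, 4, 6, 12, 13])  # => [[3, 6], [12, 13]]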
4 How the system operates

4.1 Identifying Similar Pairs

First we run Ferret on the complete corpus. A bash script, make-input-document-list, creates an input file for the ferret -f (definition file) option. Several copies of the make-input-document-list script are combined in a script that does multiple runs of Ferret on (small) groups of suspicious files against (largish) groups of source files, to produce a set of output files.

In normal use, Ferret displays a list of the most similar pairs; the user only looks at the most similar, and because the rest of the list is in memory there is little extra cost involved.

4.2 Identifying Sources

We read the output files generated by step 1 to select the likely sources for a given suspicious document. This uses a ruby script, process-output, which runs ferret -x to produce an XML file highlighting the similarities in a particular suspicious-source pair whenever the number of matches and/or the resemblance metric meets hard-coded (but easy to change) constraints. This produces a set of XML files (one for each ferret -x run); these are scanned by another ruby script, read-ferret-xml-files, to produce output in the required XML format. Formatting the results to meet the competition requirements raised a minor difficulty: the required source offset was not available in our system; fortunately it did not appear to be part of the evaluation metric.
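The sketch below illustrates the shape of this step; it is not the actual process-output script. The column layout assumed for the phase 1 output files and the exact ferret -x command line are guesses made for the example; only the -x option itself comes from the description above, and the threshold values are those listed in section 3.3.

    # Assumed phase 1 layout: one line per pair giving suspicious file,
    # source file, number of common triples and R.
    def likely_sources(phase1_file, min_triples: 50, min_r: 0.007)
      selected = Hash.new { |h, k| h[k] = [] }
      File.foreach(phase1_file) do |line|
        suspicious, source, common, r = line.split
        selected[suspicious] << source if common.to_i >= min_triples || r.to_f >= min_r
      end
      selected
    end

    # One XML report per selected pair (argument order is assumed);
    # read-ferret-xml-files then scans these reports and emits the
    # competition's required XML format.
    def report_pair(suspicious, source, out_xml)
      system('ferret', '-x', out_xml, suspicious, source)
    end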
5 Resource Analysis

Tackling more than about 500 files at a time on a 1GB laptop led to it thrashing. On a 9GB server, about 5000 files at a time could be dealt with comfortably.

For the development corpus, the output from Ferret was divided into 146 files: each file has results for about half of the source files (numbers 1-3999 or 4000-7214) and about 100 suspicious files. As explained above, this is not the best way of organising the data, but it was initially easier to test a few suspicious files at a time.

Step 1 produces a lot of output. As an indication of the scale of the problem, running the Unix wc (word count) utility on the complete set of output files took about 37 minutes, involving 7 minutes of CPU time. The average size of the files is around 45MB, and the total size about 6.5GB.

This emphasises the quadratic nature of plagiarism detection: every suspicious file has to be checked against every possible source file. In the case of the competition this is 52,041,796 comparisons. As mentioned earlier, Ferret usually compares every file with every other, but fortunately we had already implemented a grouping facility (it has several other applications) whereby files in a group are not compared with each other, but only with files in other groups. We should have filtered out the least similar pairs before generating the output from phase 1, as the set of phase 1 output files is considerably larger than the set of suspicious documents.

6 Results

Here are our observations on running the system we built around Ferret on the development corpus (we do not yet know the "correct answers" for the competition corpus).

We group the pairs into 5 types depending on the kind of artificial plagiarism:

• raw plagiarism (without obfuscation)

• low obfuscation

• high obfuscation

• plagiarism by translation

• no artificial plagiarism

We plotted the value of R for each of these types: figure 1 shows (for the first 500 suspicious documents of each type in the development corpus) how documents which were identified as plagiarised (with various degrees of obfuscation) differed from those where no similarity was intended.

Figure 1: R for the first 500 pairs of each type

We see immediately that Ferret is (as expected) ineffective at detecting plagiarism by translation (the line on top of the x-axis), so this is left out of the later graphs.

We also note some very high values for R, and not just where there is introduced plagiarism. Some of the pairs where there was no artificial plagiarism showed very high similarity using Ferret; there are some very high values of R before the graph flattens off. In total 49 pairs (in the development corpus) have R > 0.5. Twelve of these are pairs where no plagiarism is alleged. We looked at each of these pairs in detail, and present the results in Table 1.

Suspicious  Source   R     Cause
00569       04692   .533   editions
01302       05069   .917   read me
01656       04163   .501   editions
01756       03972   .662   editions
01862       03766   .640   editions
02483       05054   .550   fragments
02730       01431   .629   fragments
04740       04973   .717   editions
05096       00620   .622   fragments
05959       06555   .706   editions
05964       06148   .816   editions
06188       05357   .798   plagiarism?

Table 1: Most similar non-plagiarised pairs

The worst case was suspicious document 1302, which had R = 0.91 when compared to source document 5069. It turned out that these were both Project Gutenberg (1971-2009) "READ ME" documents with no other text. This could be viewed as an accident on the part of the compilers of the corpus (Potthast et al., 2009).

Some of the other similarities are more interesting, such as R = 0.80 between "The Impossibles" and "Out Like a Light", both by the same author. Despite the considerable number of small changes between the two documents, Wikipedia Authors (2009) suggest this is a re-publication of the same work under a new title.

Most of the high similarities in documents not alleged to be plagiarised were different editions of the same work, such as a volume which is repeated (with numerous spelling corrections) in the collected works, or a "Second Edition Revised and Enlarged" with obviously a considerable overlap with the first edition.

In most of these 12 cases, we have different editions of the same work, or different collections containing most of the same material. A few are incomplete fragments, possibly arising from the way in which Project Gutenberg used to be distributed, so it is hard to tell how the similarity to other fragments came about.

We would presume that R values below 0.5 also indicate similar situations, as previous work has shown that R > 0.04 is the limit for "accidental similarity", but maybe a larger value is appropriate for this corpus.

Figure 2: R for the first 100 pairs of each type

In figure 2 (where R is plotted for only the first 100 pairs of each type), it can be seen more clearly that the graphs for obfuscated plagiarism are higher than for raw, maybe because the obfuscated cases are longer: the average amount of source material in suspicious documents containing raw (un-obfuscated) plagiarism was 20,628, whereas for low and high obfuscation the average lengths were about 50% higher, at 30,402 and 33,330 respectively.

Figure 3 compares R for the four types across the full range of pairs. Here the x-axis is a percentage of the total number of pairs of the type, and the most similar pairs of each type (R > 0.4) have been omitted. The big difference between plagiarised and non-plagiarised is evident, but we also note that at the right-hand side of the diagram these lines merge. This is because the influence of small pieces of plagiarism in large documents on the R metric is rather too small.

Figure 3: R for all the pairs of each type

7 Conclusions and Future Work

Producing a corpus with no similar text except that which has been added deliberately is hard: there is far too much duplicate data on the Internet to make it easy. It is for this reason that this paper concentrates on the development corpus, as we know which parts of it are supposed to be plagiarised. However, the competition organisers have done an excellent job in encouraging research.

The competition has clearly shown us that a better metric than R is needed. As we have suggested before (Malcolm and Lane, 2008a), a metric that takes account of the order of the similar features looks promising. Calculating the longest common subsequence of triples would probably work well, but is computationally costly; we want to take care not to slow Ferret down.
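For concreteness, the standard dynamic-programming computation of the longest common subsequence is sketched below over two sequences of triples. This is an illustration of the idea, not something Ferret currently implements, and its O(nm) table is precisely the cost concern raised above.

    # Length of the longest common subsequence of two triple sequences,
    # using the standard DP recurrence with a single rolling row.
    def lcs_length(a, b)
      prev = Array.new(b.size + 1, 0)
      a.each do |x|
        curr = [0]
        b.each_with_index do |y, j|
          curr << (x == y ? prev[j] + 1 : [curr[j], prev[j + 1]].max)
        end
        prev = curr
      end
      prev.last
    end

    lcs_length(%w[a b c d e], %w[a x c d y])   # => 3 ("a", "c", "d")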
We need to develop better approaches to spanning gaps caused by obfuscation, especially in very long files, as our current algorithm can still take two isolated triples that happen to be in the right order, for example a b c X k l m, and turn them into a seven-word match, which is probably long enough to appear in the output. We should also optimise the other parameters listed in subsection 3.3.

Ferret's strength is its speed: we were able to upgrade our machine from 9 to 32GB of RAM, and can now process the entire competition corpus in a single Ferret run of 1h42m. This works out at an effective comparison rate of 50,000 pairs per second. The input is read at 450kB/s (this includes all of Ferret's processing, including writing out the very large results file).

References

Flint, Abbi, Sue Clegg, and Ranald Macdonald. 2006. Exploring staff perceptions of student plagiarism. Journal of Further and Higher Education, 30(2):145-156, May.

International Competition on Plagiarism Detection. 2009. http://www.webis.de/pan-09/competition.php.

Lyon, Caroline, Ruth Barrett, and James Malcolm. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies Conference, June.

Lyon, Caroline and James Malcolm. 2002. Experience of plagiarism detection and prevention in higher education. In Proceedings of the World Congress, Networked Learning in a Global Environment: Challenges and Solutions for Virtual Education. ICSC-NAISO Academic Press.

Lyon, Caroline, James Malcolm, and Ruth Barrett. 2005. The Ferret copy detector: finding similar passages in large quantities of text. In Submitted to the 43rd Annual Meeting of the Association for Computational Linguistics.

Malcolm, James A. and Peter C. R. Lane. 2008a. An approach to detecting article spinning. In Proceedings of the Third International Conference on Plagiarism.

Malcolm, James A. and Peter C. R. Lane. 2008b. Efficient search for plagiarism on the web. In Proceedings of i-TCE 2008.

Potthast, Martin et al. (editors). 2009. PAN Plagiarism Corpus PAN-PC-09. http://www.webis.de/research/corpora.

Project Gutenberg. 1971-2009. http://www.gutenberg.org/.

Wikipedia Authors. 2009. http://en.wikipedia.org/wiki/Mark_Phillips_(author).