University of Sheffield Lab Report for PAN at CLEF 2010

Rao Muhammad Adeel Nawab, Mark Stevenson and Paul Clough
University of Sheffield, UK.
{r.nawab|m.stevenson}@dcs.shef.ac.uk, p.d.clough@sheffield.ac.uk

Abstract. This paper describes the University of Sheffield entry for the 2nd international plagiarism detection competition (PAN 2010). Our system attempts to identify extrinsic plagiarism. A three-stage approach is used: pre-processing, candidate document selection (using word n-grams) and detailed analysis (using the Running Karp-Rabin Greedy String Tiling string matching algorithm). This approach achieved an overall performance of 0.20 in the official evaluation, with a precision of 0.40, recall of 0.16 and granularity of 1.21.

Keywords: extrinsic plagiarism detection, greedy string tiling, GST, n-grams

1 Introduction

This paper describes the University of Sheffield entry for the 2nd international plagiarism detection competition (PAN 2010). The system we submitted only attempted to identify extrinsic plagiarism: in this scenario the assumption is that the source (i.e. original) document(s) for a plagiarized document are hidden within a large collection of documents that can be accessed, i.e. within a closed set of documents. The task is to identify the portions of the documents that have been plagiarized by aligning them with the corresponding text within the relevant source documents.

1.1 Related Work

A simple approach to extrinsic plagiarism detection is to compare each suspicious document with a collection of source documents to find the potential source(s) of a plagiarized text. However, if the reference collection is very large, for instance the World Wide Web, the computational resources required can be prohibitive for many approaches. The vector space model is considered a baseline for the external plagiarism detection task. This model has been used to detect duplicate and near-duplicate documents [7]. A variant of this model was used for copy detection at the word level (the SCAM system) [13] and at the sentence level (the COPS system) [2]. Fingerprinting is another popular approach. Fingerprints, a means of uniquely identifying texts, are used to represent the contents of the source documents. To query a collection of documents, fingerprints are also generated for the query document, and the number of fingerprints the source and query documents have in common determines the similarity between them. This approach has also been used by other researchers [12,8]. Another approach that has proved effective is based upon n-gram overlap. The source and suspicious documents are converted into fixed-length n-grams (of either characters or words) and the proportion of common n-grams is used to determine the similarity between the documents. This approach has been applied using different similarity measures, such as the Jaccard, Dice and Overlap coefficients [9,10,1,6]. Similar approaches were used by several of the systems that participated in the 1st international PAN-PC-09 plagiarism detection competition [15].

2 External Plagiarism Detection

We divided the task of external plagiarism detection into three phases: preprocessing, candidate document selection and detailed analysis using Running Karp-Rabin Greedy String Tiling (RKR-GST).

2.1 Preprocessing

In the first phase we used TextCat (http://www.let.rug.nl/vannoord/TextCat/) [3], an existing language recognizer, to identify the language of each source document. Documents that were identified as being in German or Spanish were translated into English using the Google Translate tools (http://translate.google.com/; see [4] for an overview of the Google translation tools). Each document was preprocessed by converting the text to lowercase and removing any non-alphanumeric characters. Each document was then converted into a set of word n-grams (n = 5). Word n-grams of length 5 were chosen for the text representation since they were found to generate high recall in the 2009 PAN competition (see [15]).
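As an illustration of this stage, the following is a minimal sketch of the normalisation and word 5-gram extraction described above; the use of Python, the function names and the treatment of whitespace are our own assumptions rather than details of the submitted system.

import re

def preprocess(text):
    # Lowercase the text and strip non-alphanumeric characters, keeping
    # whitespace so that word boundaries survive (an assumption on our part).
    return re.sub(r'[^a-z0-9\s]', ' ', text.lower())

def word_ngrams(text, n=5):
    # Return the set of distinct word n-grams (n = 5 in the paper) for a document.
    tokens = preprocess(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}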
2.2 Candidate Document Selection

Detailed comparison of each suspicious document with all the source documents could be computationally prohibitive. Therefore, for each suspicious document we attempt to identify likely source documents. The similarity between two documents can be computed by counting the number of n-grams they have in common. This technique has been used in copy detection and plagiarism detection [2,13,14,10]. We used one such technique, the overlap coefficient [11], to compare each suspicious document with the source documents. The overlap coefficient, Sim_overlap(A, B), is computed using Equation 1, where S(A, n) and S(B, n) are the sets of distinct n-grams in documents A and B respectively. Values for the overlap coefficient range from 0 to 1, where 0 means that the two documents are entirely different and 1 means that they are exactly the same.

    Sim_overlap(A, B) = |S(A, n) ∩ S(B, n)| / min(|S(A, n)|, |S(B, n)|)    (1)

The potential source documents were sorted in descending order according to their overlap score and the top 10 selected. Any documents with an overlap score below a certain threshold, α_overlap, are discarded and the remaining source-suspicious pairs are passed to the next stage of processing (a small illustrative sketch of this stage is given at the end of Section 2).

2.3 Detailed Analysis

In the final stage of processing, detailed analysis is carried out using the Greedy String Tiling (GST) algorithm. This performs an alignment between the contents of the suspicious document and each source document to identify potential plagiarized sections from the set of source-suspicious pairs generated by the previous stage. The advantage of using GST compared with alternative string similarity algorithms, such as longest common subsequence or edit distance, is its ability to detect block moves: the transposition of a substring of contiguous words is treated as a single move rather than each word being considered separately. GST has previously been used in program code plagiarism detection [18], biological sequence alignment [17,16] and measuring text re-use in journalism [5]. The algorithm has a run time of O(n^3), but it has been optimised to run in linear time using the Running Karp-Rabin Greedy String Tiling (RKR-GST) string matching algorithm [16]. An example of the output of GST for a fragment of source text and a rewritten (or plagiarized) version is shown below; [1] and [2] indicate aligned matches between the two texts.

Source:  A dog[1] bit the postman[2].
Rewrite: The postman[2] was bitten by a dog[1].

Match merging heuristics were applied to the output of GST to identify contiguous portions of plagiarized text in the suspicious documents. GST computes the length of the longest tile (a tile is a consecutive subsequence of maximal length that occurs as a one-to-one pairing between the two input strings) in common between the suspicious and source documents. If this is greater than a pre-defined length, α_length, the tiles are combined to identify plagiarized sections. If the distance between adjacent tiles is less than or equal to a given number of characters, α_merge, then they are combined into a single section.
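As a concrete illustration of the candidate selection stage (Section 2.2), the sketch below computes Equation 1 over n-gram sets and applies the top-10 and α_overlap filtering; the function names are our own, and the default threshold of 0.005 is taken from the tuning described in Section 3.1 rather than being hard-wired in the original system.

def overlap_coefficient(ngrams_a, ngrams_b):
    # Equation 1: |S(A, n) ∩ S(B, n)| / min(|S(A, n)|, |S(B, n)|).
    if not ngrams_a or not ngrams_b:
        return 0.0
    return len(ngrams_a & ngrams_b) / min(len(ngrams_a), len(ngrams_b))

def select_candidates(suspicious_ngrams, source_ngrams_by_id, top_k=10, alpha_overlap=0.005):
    # Rank the source documents by overlap with the suspicious document,
    # keep the top_k and discard any pair whose score falls below alpha_overlap.
    scored = sorted(((overlap_coefficient(suspicious_ngrams, ngrams), doc_id)
                     for doc_id, ngrams in source_ngrams_by_id.items()), reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k] if score >= alpha_overlap]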
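To make the detailed analysis stage (Section 2.3) more concrete, the following is a minimal sketch of the naive O(n^3) Greedy String Tiling algorithm over word tokens, together with a simple merging step for adjacent tiles. The submitted system used the optimised RKR-GST variant [16] and measured merge gaps in character offsets, so this should be read as an illustration of the idea rather than the actual implementation; min_match_length corresponds to α_mml and max_gap to α_merge.

def greedy_string_tiling(pattern, text, min_match_length=5):
    # Naive Greedy String Tiling over two token lists; returns tiles as
    # (pattern_start, text_start, length) triples. min_match_length plays
    # the role of alpha_mml (set to 5 in Section 3.1).
    tiles = []
    p_marked = [False] * len(pattern)
    t_marked = [False] * len(text)
    while True:
        max_match = min_match_length - 1
        matches = []
        for p in range(len(pattern)):
            for t in range(len(text)):
                j = 0
                while (p + j < len(pattern) and t + j < len(text)
                       and pattern[p + j] == text[t + j]
                       and not p_marked[p + j] and not t_marked[t + j]):
                    j += 1
                if j > max_match:
                    matches, max_match = [(p, t, j)], j
                elif j == max_match and j >= min_match_length:
                    matches.append((p, t, j))
        if max_match < min_match_length:
            break
        for p, t, j in matches:
            # Skip matches that clash with a tile already marked in this pass.
            if any(p_marked[p:p + j]) or any(t_marked[t:t + j]):
                continue
            for k in range(j):
                p_marked[p + k] = True
                t_marked[t + k] = True
            tiles.append((p, t, j))
    return tiles

def merge_adjacent_tiles(spans, max_gap=250):
    # Merge (start, end) offset spans within one document whenever the gap
    # between adjacent spans is at most max_gap (alpha_merge = 250 characters).
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

In such a sketch a source-suspicious pair would only be reported if its longest tile exceeded α_length, as described above.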
3 Evaluation

3.1 System Development

Candidate Document Selection (Section 2.2). We initially experimented with using a copy detection system (an implementation of SCAM [13]) to extract candidate documents. However, due to processing speed and memory problems this approach had to be re-designed. The α_overlap threshold was set using a small corpus of 895 source and 88 suspicious documents from the first PAN competition (the PAN-PC-09 corpus). We calculated the precision, recall and F1 measures for overlap score thresholds from 0.001 to 0.100. The best performing threshold, 0.005, gave a precision, recall and F1 measure of 0.74, 0.93 and 0.78 respectively (a small sketch of such a threshold sweep is shown after Table 1).

Detailed Analysis (Section 2.3). To avoid accidental matches a parameter, α_mml, was used to set the minimum length of tiles that could participate in merges. We set α_mml to 5. A value of 1 produced the best performance but the processing time was prohibitive. The α_length parameter was set using a small set of 22 suspicious and 125 source documents from the PAN-PC-09 corpus. The length of the longest matching substring was computed for each source document and each suspicious document in which it appeared. This value was always greater than 10. A small corpus of 30 suspicious documents from the PAN-PC-09 corpus was used to find a suitable value for the α_merge parameter for combining adjacent tiles to create a section. For α_merge = 250 characters, our heuristic achieved an overall performance of 0.58, precision of 0.87, recall of 0.49, and granularity of 1.12.

3.2 System Performance

Our system achieved an overall performance of 0.20, precision of 0.40, recall of 0.16 and granularity of 1.21 in the formal evaluation. Table 1 shows the performance of the candidate document selection (Section 2.2) and detailed analysis (Section 2.3) stages. Various performance metrics are shown for each stage and the results are broken down for each of the different types of obfuscation used in the corpus.

The focus of the candidate selection step was to maximise recall, thereby ensuring that the source document is included in the set that is returned so that it can be identified in the detailed analysis stage. However, the recall values are quite low, indicating that the source document is often missed. Recall also varies according to the type of obfuscation. The best recall (0.7105) is obtained for low obfuscation and the worst for simulated plagiarism (0.4154).

The results are somewhat different for the detailed analysis stage, where the best performance is obtained for simulated plagiarism. We analyzed the documents and found that the simulated obfuscated documents contain cases that have not been obfuscated, and that the number of source documents for this type of obfuscation is very small compared to the other types.

              Candidate Document Selection   Detailed Analysis
Obfuscation   Precision  Recall   F1         Precision  Recall   PlagDet Score  Granularity
None          0.8446     0.6693   0.7468     0.6366     0.2162   0.3205         1.0100
Low           0.7963     0.7105   0.7510     0.5960     0.1934   0.2619         1.1660
High          0.6729     0.6907   0.6817     0.6918     0.1775   0.2041         1.6110
Simulated     0.7643     0.4154   0.5383     0.6659     0.3980   0.4864         1.0330
Translated    0.9622     0.5941   0.7346     0.3994     0.2219   0.2610         1.3290

Table 1. Performance of the candidate selection and detailed analysis stages for various types of obfuscation.
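The α_overlap tuning described in Section 3.1 amounts to sweeping candidate-selection thresholds and scoring the retrieved source sets; the sketch below shows one way such a sweep could be organised. The data layout and the use of a mean F1 across suspicious documents are our own assumptions for illustration, not a description of the original experiment.

def precision_recall_f1(retrieved, relevant):
    # Set-based precision, recall and F1 for one suspicious document.
    if not retrieved or not relevant:
        return 0.0, 0.0, 0.0
    tp = len(retrieved & relevant)
    p, r = tp / len(retrieved), tp / len(relevant)
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

def sweep_threshold(scores, gold, thresholds):
    # scores: {suspicious_id: {source_id: overlap score}}
    # gold:   {suspicious_id: set of true source ids}
    # Returns the threshold giving the best mean F1 over suspicious documents.
    best_f1, best_threshold = -1.0, None
    for threshold in thresholds:
        f1s = []
        for susp_id, per_source in scores.items():
            retrieved = {src for src, score in per_source.items() if score >= threshold}
            f1s.append(precision_recall_f1(retrieved, gold.get(susp_id, set()))[2])
        mean_f1 = sum(f1s) / len(f1s) if f1s else 0.0
        if mean_f1 > best_f1:
            best_f1, best_threshold = mean_f1, threshold
    return best_threshold

For example, sweep_threshold(scores, gold, [i / 1000 for i in range(1, 101)]) would cover the 0.001 to 0.100 range mentioned above.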
3.3 Sources of Error

Preprocessing (Section 2.1). Documents in German and Spanish were translated using Google Translate. These automatic translations contained errors that may have affected the accuracy of later processing stages.

Candidate Document Selection (Section 2.2). Analysis of the system performance (Section 3.2) showed that the source documents are often not identified.

Detailed Analysis (Section 2.3). Several parameters (α_overlap, α_length and α_merge) were set using small training corpora. It is possible that larger training corpora would lead to better values for these thresholds. Processing limitations led α_mml to be set to 5. This is a relatively high value, which caused plagiarized sections with different types of obfuscation to be missed.

4 Conclusion

This paper described the University of Sheffield's entry to the 2nd international PAN plagiarism detection competition, which attempted to identify extrinsic plagiarism. Our system did not attempt to identify intrinsic plagiarism. A three-stage approach was used: preprocessing, candidate selection and detailed analysis. The approach achieved an overall performance of 0.20, with precision of 0.40, recall of 0.16 and granularity of 1.21, in the formal evaluation, in which extrinsic and intrinsic plagiarism cases were evaluated together.

In the future we plan to improve our approaches for both the candidate document selection and detailed analysis stages. For the candidate document selection stage we plan to experiment with approaches from information retrieval and copy detection (e.g. SCAM) to identify potential source documents, since these may be faster than the approach that is currently used and may also improve recall. For the detailed analysis stage we plan to experiment with alignment techniques that can cope better with high obfuscation, and also plan to make more use of approaches from natural language processing to improve matching. We also want to improve our approach to computing byte offsets in the source documents.

Acknowledgements

Rao Muhammad Adeel Nawab thanks the COMSATS Institute of Information Technology, Lahore, Pakistan for funding this work under the Faculty Development Program (FDP).

References

1. Bao, J., Shen, J., Liu, X., Liu, H., Zhang, X.: Finding Plagiarism Based on Common Semantic Sequence Model. In: Lecture Notes in Computer Science, vol. 3129, pp. 640–645. Springer (2004)
2. Brin, S., Davis, J., Garcia-Molina, H.: Copy detection mechanisms for digital documents. In: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 398–409. ACM, New York, NY, USA (1995)
3. Cavnar, W.B., Trenkle, J.M.: N-Gram-Based Text Categorization. In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175. UNLV Publications/Reprographics (1994)
4. Chen, J., Bao, Y.: Cross-language search: The case of Google Language Tools. First Monday 14(3) (2009)
5. Clough, P., Gaizauskas, R., Piao, S., Wilks, Y.: Measuring text reuse. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, pp. 152–159 (2002)
6. Clough, P., Stevenson, M.: Developing a Corpus of Plagiarised Short Answers. Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis. Springer (2010)
7. Hoad, T., Zobel, J.: Methods for identifying versioned and plagiarized documents. Journal of the American Society for Information Science and Technology 54, 203–215 (2003)
8. Kent, C., Salim, N., Johor, M.: Features Based Text Similarity Detection. Journal of Computing 2 (2010)
9. Lane, P., Lyon, C., Malcolm, J.: Demonstration of the Ferret plagiarism detector. In: Proceedings of the 2nd International Plagiarism Conference (2006)
10. Lyon, C., Malcolm, J., Dickerson, B.: Detecting short passages of similar text in large document collections. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pp. 118–125 (2001)
11. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA (1999)
12. Schleimer, S., Wilkerson, D., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 76–85. ACM (2003)
13. Shivakumar, N., Garcia-Molina, H.: SCAM: A copy detection mechanism for digital documents. In: Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, Austin, Texas (1995)
14. Shivakumar, N., Garcia-Molina, H.: Building a scalable and accurate copy detection mechanism. In: Proceedings of the First ACM International Conference on Digital Libraries, pp. 160–168 (1996)
15. Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E.: 3rd PAN Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse. In: 25th Annual Conference of the Spanish Society for Natural Language Processing, SEPLN 2009, pp. 1–77 (2009)
16. Wise, M.: Running Karp-Rabin Matching and Greedy String Tiling. Tech. Rep. 463, Basser Department of Computer Science, Sydney University (1993)
17. Wise, M.: Neweyes: A system for comparing biological sequences using the running Karp-Rabin greedy string-tiling algorithm. In: Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, pp. 393–401 (1995)
18. Wise, M.: YAP3: Improved detection of similarities in computer program and other texts. In: Proceedings of the Twenty-Seventh SIGCSE Technical Symposium on Computer Science Education, Philadelphia, USA, pp. 130–134 (1996)