From English to Persian: Conversion of Text Alignment for Plagiarism Detection Lee Gillam Anna Vartapetiance Department of Computer Science Department of Computer Science University of Surrey University of Surrey UK UK l.gillam@surrey.ac.uk a.vartapetiance@surrey.ac.uk ABSTRACT In PAN 11, this approach gained 4th place, with This paper briefly describes the approach taken to Persian PlagDet=0.2467329, Recall=0.1500480, Precision=0.7106536, Plagiarism Detection based on modification to the approach used Granularity=1.0058894. In 2012, we showed good granularity, for PAN between 2011 and 2014 in order to adapt to Persian. This with high recall and precision for non-obfuscated text, but not effort has offered us the opportunity to evaluate detection such great recall when faced with higher orders of obfuscation, performance for the same approach with another language. A key and subsequent results are consistent or slightly better. part of the motivation remains that of undertaking plagiarism In this paper, we assess our efforts against texts in Persian, with detection in such a way as to make it highly unlikely that the some commentary on the effects of the make-up of the dataset. content being matched against could be determined based on the Section 2 presents findings in respect to the training data of matches made, and hence to allow for privacy. Persian Plagdet, and Section 3 addresses the test data. Section 4 concludes the paper and considers future work. CCS Concepts • Information systems → Near-duplicate and plagiarism 2. TRAINING DATA detection • Information systems → Evaluation of retrieval results. The Persian Plagdet training data, dated 2016-07-17 comprises Keywords 1563 source documents and 1525 suspicious documents. There Plagiarism detection; text alignment; Persian; PAN are some 2749 associated annotations, the breakdown of which is shown below in Table 1. 1. INTRODUCTION Detection of plagiarism has been shown to be beneficial to both Table 1. Breakdown of 2749 potential cases of plagiarism education and research, ensuring that students and researchers within training data. alike are demonstrating their own understanding and findings. The 01-no-plagiarism 783 28% same techniques can be used against the document archives of an 02-no-obfuscation 208 8% organization to improve how such information is managed. And 03-random-obfuscation 1611 59% we have successfully demonstrated how related techniques can be 04-simulated-obfuscation 147 5% used to protect by preventing the accidental propagation of Total files 2749 corporate information deemed to be of high value in an Innovate UK project on Intellectual Property Protecting Cloud Services in Supply Chains (IPCRESS) collaboratively with Jaguar Land It is striking that there is such a significant minority of Rover and GeoLang Ltd [1]. This work was grounded in our unobfuscated cases, in contrast to 59% ‘random’. A high- previous PAN efforts, e.g. [2], undertaken with respect to finding performing system must therefore be able to address the nature of matching text without, at the time of the match, directly using the this randomness, and this was uncovered during training as it textual content itself (e.g. n-grams) or using patterns as could be seemed difficult to push performance beyond a certain point. uniquely resolved to the textual content. The approach [3] Following re-implementation of our approach in Python for ease produces a minimal representation of an input text by of inspection (and necessary upgrading to Python 3.4 so that wide distinguishing content words from auxiliary words, and producing characters were handled properly), and a switch to a Persian matchable binary patterns directly from these dependent on the stoplist1, initial tests were conducted to ascertain performance number of classes of interest in each. This acts like hashing, but against the training data. no effort is taken to ensure collision-avoidance; indeed, the approach actively encourages hash collision over short distances, Key parameters for the system are: as acts to prevent reverse-engineering of the patterns, and uses the 1. length used for match (windowSize, measured in words) number of coincident matches to indicate the extent of similarity. 2. distance within which two non-overlapping matches Such an approach is, therefore, more suited for longer initial can be merged (merge_dist, measured in characters) pattern matching. Further, with the intention to undertake match without access to the textual content, there is a need for subsequent verification of potential matches based on access to 1 the text, which can be undertaken automatically here, but would Obtained from: https://github.com/kharazi/persian- be anticipated to involve delegation of permissions within the stopwords/blob/master/persian, comprising some 778 words. An kinds of system envisaged. alternative of just 330 words was also considered but not used: http://www.ranks.nl/stopwords/persian 3. minimum proportion of shared words in matched To begin with, if passages are constructed which would not be segments (match_threshold), required due to the meaningful within the language, there would seem to be limited methodology used as verification of match gain from its treatment here, as there is a question over how this 4. minimum length of a final match (min_length), used reflects the reality of the problem being addressed. In addition, if later to filter out short matches. the extent of change is high as would require significant human effort to reproduce, the likelihood of such highly edited passages Brute force efforts were used to obtain approximate values for in the wild would seem to be lessened unless approaches are parameters 1-3, with: partially automated and do not undergo post-editing. It also 1. windowSize from 15 to 35, in increments of 5 becomes difficult to address the difference between necessary 2. merge_dist from 100 to 225 in increments of 25, and inclusion within a focused discussion, and an act of deliberate then honed in increments of 5 copying. 3. match_threshold from 0.5 to 0.95 in increments of 5 Table 3. Breakdown of scoring against specific subsets of the The best initial combination was windowSize=30, training data. merge_dist=210, match_threshold=0.75, producing the following: 02 before 02 03 04 content Table 2. Best scores on training data following initial brute check force determination. Plagdet Score 0.5740 0.9050 0.0031 0.0426 Plagdet Score 0.1894 Recall 0.1055 Recall 0.9817 0.9729 0.0016 0.0357 Precision 0.9542 Precision 0.4055 0.8461 0.0476 0.0605 Granularity 1.0043 Granularity 1.0000 1.0000 1.0000 1.0769 Following the investigation into the make-up of the training data, % cases 8% 8% 59% 5% explained above, the performance against each subset was evaluated. Initially, we determine the detection capability with Consider the example passages below with matches between match_threshold=0 (second column), i.e. without content source and suspicious, with colour used to identify those passages checking, then look at each subset with checking (subsequent shared. These passages are not fully matchable: lowermost in the columns), as shown in Table 3. table are two matched sub-passages where the maximal fragment length is 5, and the sub-matches are mostly smaller. Clearly, when Such scores demonstrate acceptable performance for the approach the initial extents are short this will favour those approaches that used, which is geared towards copying with low levels of address short n-grams. In addition, these passages include extents obfuscation, as even without subsequent checking of matches of text in that are quite different between the two, and this brings recall is very high for non-obfuscated text. The low recall for implications to the treatment of the gap between passages as well obfuscation therefore leads us to explore the nature of the kind of as to any verification step as addresses word overlap. obfuscation in use, and here the use of so-called random obfuscation within the training data certainly merits discussion. Table 4. Matching segments in random obfuscation, within annotations provided for the training data. Highlighted parts lowermost show differences between matched segments, as assists in identifying favourable n-gram sizes for matching. Suspicious . ‫ ساختواى اصلی ًادرستیست کتاب شیعی تشکیل قصیذٍ دادُاًذ‬، ‫ کاٍ است‬، ‫ کذخذا ّ ُوَ خزّص‬، ‫ گٌج‬، ْ‫ٍ عِذآبخ‬ ‫ اّلیي بیشیلِپیلَ ًاهَ طفزٍ اًذر‬. ‫ابزاُین گلستاى در اعتزاض کزدى اسن غیزقاًًْی تیغ قٌذک ّ جزاحی شذٍ کتابش بَ جوِْری اسالهی ایزاى دّ کار هیٌْیسذ‬ . ‫ اگز خْارّبارفزّشی بزای شزم کٌِسال‬، ‫ شزهآّر اًتشاراتیای‬، ‫ خطاب بَ هذیز است یا کتاب در آى یغوا شذٍ ًْشتَ_شذٍ_است‬۳۱۳۱ ‫ تیز قوز‬۱۳ ‫تپَ تاریخ‬ ٍ‫ هي ّقتن را اس تلف ًویکٌن هعصیتبار ساّیَ کَ اس هٌافی عالقَ هٌذی کساًی چٌیي کزدُاًذ شکایتی اشغال بَ را هاخذ بَ اقاهتگا‬. ‫ چاپ ًاهَ ایي‬. ‫ایي اقذام شوا‬ ‫ ًویشٌاسن … بِتز است ایي باشذ کش رفتٌِا ّ هثلَ کزدًِا بواًذ بزقزار شخ بزای عولَٔ شک‬، ‫ببزم کاٍ چٌیي هقام ّ هزجع را اگز ُن باشذ‬ Source ‫ ساختارُای اصلی ایي کتاب را تشکیل دادُاًذ‬،‫ کَ هظِز بیذاری است‬،‫ کذخذا ّ اس ُوَ هِوتز خزّص‬،‫ گٌج‬،ٍ‫جشیز‬. ‫ خطاب بَ هذیز‬۳۱۳۱ ٍ‫ تیز ها‬۱۳ ‫ اّلیي ًاهَ در تاریخ‬.‫ابزاُین گلستاى در اعتزاض بَ چاپ غیزقاًًْی ّ جزاحی شذٍ کتابش در ایزاى دّ ًاهَ هیٌْیسذ‬ ‫اًتشاراتیای کَ کتاب در آى چاپ شذٍ ًْشتَ شذٍ است‬: َ‫ هي ّقتن را بیش اس تلف ًویکٌن کَ اس کساًی ک‬.‫ اسن ایي کار ًادرستیست‬.‫ ایي اقذام شوا ًادرست بْدٍ است‬.‫ اگز بزای شزم هعٌایی باقیواًذٍ باشذ‬،‫شزهآّر است‬ ‫ ًویشٌاسن … بِتز است ایي کش رفتٌِا ّ هُثلَ کزدًِا بواًذ بزای‬،‫ کَ چٌیي هقام ّ هزجع را اگز ُن باشذ‬،‫چٌیي کزدُاًذ شکایتی بَ هزجعی یا بَ هقاهی ببزم‬ َ‫عولَٔ شکٌج‬. ‫گٌج کذخذا ّ *** ُوَ *** خزّص‬ ‫خزّص‬ ‫هِوتز‬ َ‫ُو‬ ‫اس‬ ّ ‫کذخذا‬ ‫گٌج‬ ‫شزهآّر‬ ‫است‬ ٍ‫شذ‬ َ‫ًْشت‬ ٍ‫شذ‬ ‫یغوا‬ ‫آى‬ ‫در‬ ‫کتاب‬ ‫یا‬ ‫است‬ ‫هذیز‬ َ‫ب‬ ‫خطاب‬ ۳۱۳۱ ‫قوز‬ ‫تیز‬ ۱۳ ‫تاریخ‬ ‫شزهآّر‬ ‫است‬ ٍ‫شذ‬ َ‫ًْشت‬ ٍ‫شذ‬ ‫چاپ‬ ‫آى‬ ‫در‬ ‫کتاب‬ َ‫ک‬ ‫اًتشاراتی‬ ‫هذیز‬ َ‫ب‬ ‫خطاب‬ ۳۱۳۱ ٍ‫ها‬ ‫تیز‬ ۱۳ ‫تاریخ‬ Assuming that the test data would be formulated similarly to the reasons for the detection performance in respect to random training data, peak performance would be somewhat constrained, obfuscation after it was apparent that an improved plagiarism and with time available, further tests were performed using rather detection score could not be achieved. shorter initial word numbers (windowSize values) and lower values for other parameters. 5. ACKNOWLEDGMENTS The authors gratefully recognize prior contributions of Neil Table 5. Example scores from further brute force Cooke, Scott Notley, Peter Wrobel, Neil Newbold and Henry determination, showing increases in recall for smaller Cooke in respect to the approach in previous competitions, and by windowSize, at the cost of drops in precision and granularity. Neil Cooke and Peter Wrobel to the patents generated from these. We recognize, also, prior support from the EPSRC and JISC merge_dist 50 50 50 (EP/I034408/1) and by Innovate UK (TSB, 169201). The authors windowSize 20 10 8 are also grateful for the efforts of the Persian Plagdet organizers in match_threshold 0.35 0.35 0.35 formulating proceedings [5] and for system and data provision [6], min_length 200 200 200 [7], [8]. Plagdet Score 0.1478 0.3669 0.4231 6. REFERENCES Recall 0.0859 0.3055 0.4568 [1] Gillam, L., Notley, S., Broome, S. and Garside, D. 2015 Precision 0.9514 0.8493 0.7586 IPCRESS: Tracking Intellectual Property through Supply Granularity 1.0935 1.3370 1.5453 Chains in Clouds. In Raghavendra Rao, N. (ed.) Enterprise Management Strategies in the Era of Cloud Computing. IGI- 3. TEST DATA Global. From Table 4, above, values used for the submission based on test [2] Cooke, N., Gillam, L., Wrobel, P. Cooke, H. and Al-Obaidli, data are in the final column. Results from all participants are F. 2011 "A high performance plagiarism detection system". shown below in Table 5. Although only achieving 8 th place of 9, Proceedings of the 3rd PAN workshop. with all other participants are from Iranian institutions, we are satisfied that we have managed to maintain the core of our [3] Cooke, N and Gillam, L. 2012. System, process and method approach, which we were already aware was only robust to a for the detection of common content in multiple documents certain extent of obfuscation and is not readily tuned to random in an electronic system. U.S. Patent filing US13/307,428, effects as may not necessarily be readable unless via a codebook. filed 30th November 2011. [4] The Computer Language Benchmarks Game: Table 6. Results of all participants. http://benchmarksgame.alioth.debian.org/u64q/compare.php? lang=python3&lang2=gpp Rank Plagdet Granularity Precision Recall [5] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and Potthast, M., 2016. Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016. In 1 0.92204 1.00146 0.92688 0.91919 2 0.90593 1 0.95927 0.85820 Working notes of FIRE 2016 - Forum for Information 3 0.87103 1 0.89258 0.85049 Retrieval Evaluation, Kolkata, India, December 7-10, 2016, 4 0.83015 1.03968 0.92034 0.79602 CEUR Workshop Proceedings, CEUR-WS.org. 5 0.80083 1.0 0.93337 0.70124 [6] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P. 6 0.77496 1.22759 0.96383 0.83615 2010. An Evaluation Framework for Plagiarism Detection. In 7 0.72662 1 0.74962 0.70499 Huang, C-R and Jurafsky, D. (eds.), 23rd International 8 0.39968 1.52803 0.75484 0.41407 Conference on Computational Linguistics (COLING 10), 9 0.38994 3.53698 0.90002 0.80659 pages 997-1005, Stroudsburg, Pennsylvania, August 2010.Association for Computational Linguistics. 4. CONCLUSIONS [7] Gollub, T., Stein, B. and Burrows, S.. 2012. Ousting Ivory In this paper, we briefly described the approach taken to Persian Tower Research: Towards a Web Framework for Providing Plagiarism Detection, based on modification to the approach used Experiments as a Service. In Hersh, B., Callan, J., Maarek, for PAN between 2011 and 2014. Detection performance for Y. and Sanderson. M. (eds.), 35th International ACM Persian is appropriate with respect to the nature of application we Conference on Research and Development in Information have in mind, and the large proportion of randomly obfuscated Retrieval (SIGIR 12), pages 1125-1126, August 2012. ACM. data, allied to the manner in which obfuscation is conducted, ISBN 978-1-4503-1472-5. limits what our approach would achieve. Runtime performance is [8] Potthast, M., Gollub, T., Rangel,, F., Rosso, P., Stamatatos, also inadequate, however this is believed due in significant part to E. and Stein, B.. Improving the Reproducibility of PAN's using a Python implementation instead of our main C++ codebase Shared Tasks: Plagiarism Detection, Author Identification, – and we are aware that some suggest Python 3 is significantly and Author Profiling. In Kanoulas, E. et al, editors, slower than C++ for the majority of tasks [4]. Evaluating a Information Access Evaluation meets Multilinguality, standard n-gram approach, via C++, would be expected to Multimodality, and Visualization. 5th International improve detection performance against these data. Conference of the CLEF Initiative (CLEF 14), pages 268- It is worth noting that the first author has no familiarity with 299, Berlin Heidelberg New York, September 2014. Persian languages, and only sought the co-author’s advice on Springer. ISBN 978-3-319-11381-4.