Mahak Samim: A Corpus of Persian Academic Texts for Evaluating Plagiarism Detection Systems

Morteza Rezaei Sharifabadi
Computer Research Center of Islamic Sciences
Tehran, I. R. Iran
m.rezaei@noornet.net

Seyed Ahmad Eftekhari
Computer Research Center of Islamic Sciences
Tehran, I. R. Iran
s.ahmad.ef@gmail.com

ABSTRACT
In this paper we introduce Mahak Samim, a plagiarism detection corpus that consists of Persian academic texts in which plagiarism cases are embedded. This corpus, which can be used for evaluating plagiarism detection systems, consists of more than five thousand artificial plagiarism cases with various lengths and diverse degrees of obfuscation. The development process and the features of the corpus are described here.

CCS Concepts
• Information systems ➝ Information retrieval ➝ Retrieval tasks and goals ➝ Near-duplicate and plagiarism detection.

Keywords
plagiarism detection; evaluation corpus; Persian; academic texts.

1. INTRODUCTION
Plagiarism is defined as “copying or closely imitating the work of another writer, composer, etc., without permission and with the intention of passing the results off as original work” [12]. Plagiarism detectors are software programs developed to detect cases of such misconduct in documents. The PAN evaluation lab series has provided a framework for evaluating plagiarism detection systems. This framework relies on plagiarism corpora, which are collections of text that include cases of plagiarism. Plagiarism detection systems receive the corpus texts as input, and their ability to detect the plagiarism cases embedded in the texts is examined.

Since plagiarism detectors are not entirely language independent, there is a need for plagiarism corpora in various languages. In recent years a couple of Persian plagiarism detection systems have been developed. Proper evaluation of these systems depends on reliable Persian plagiarism corpora.

In this paper we introduce Mahak Samim¹, a corpus suitable for evaluating Persian plagiarism detectors. We first briefly review previous work in this field and then introduce our own approach. The paper concludes with a summary and an outlook for further work.

¹ Samim-Noor is a commercial plagiarism detection system developed by the Computer Research Center of Islamic Sciences. Mahak in Persian means “Touchstone”.

2. RELATED WORK
Prior to the PAN evaluation lab series, plagiarism corpora were rare. The corpora used in the PAN labs held in 2009 [4], 2010 [5] and 2011 [6] had basically the same structure. The documents used in these corpora were books from Project Gutenberg. The corpora were used to evaluate both external and intrinsic plagiarism detection. In external plagiarism detection, suspicious documents are checked against a collection of source documents, whereas in intrinsic plagiarism detection, suspicious documents are analyzed in isolation for changes in writing style and similar signals. Fifty percent of the documents were used as source documents and fifty percent as suspicious documents. The corpora contain plagiarism cases with different lengths and various degrees of artificial and simulated obfuscation. Artificial obfuscation includes techniques such as automatically shuffling and replacing words; simulated obfuscation was achieved by crowdsourcing the obfuscation task. The major shortcoming of the corpora presented in these years was their relatively small size. The plagiarism detectors were expected to include a stage of heuristic retrieval in which they selected a group of candidate documents from the total collection of source documents. However, since the size of the corpora was not large enough, the systems skipped this stage. PAN 2012 [7] addressed this issue and adopted a new approach for developing the plagiarism corpus. For this purpose a number of professional writers were asked to write articles containing plagiarism on a set of topics. A one-billion-document corpus resembling the web was used as the collection of source documents. The writers compiled their articles by searching through this huge collection. In PAN 2013 [9] and PAN 2014 [8], expanded versions of the 2012 corpus were used.

In PAN 2015 [10] a corpus construction task was introduced, in which participants were asked to provide their own plagiarism corpora. Eight plagiarism corpora were provided for this task, two of which included Persian documents. Khoshnavataher et al. [3] present a monolingual Persian corpus based on about 2100 Wikipedia articles, with artificially obfuscated plagiarism cases, intended for the evaluation of extrinsic plagiarism detection. Asghari et al. [1] use Wikipedia documents and a Persian-English sentence-aligned corpus to develop a bilingual plagiarism detection corpus.

3. CORPUS DEVELOPMENT

3.1 Document Collection
Academic papers are one of the major types of texts subject to plagiarism. In order to cover such texts in our corpus, we collected Persian papers from peer-reviewed journals. We crawled the websites of journals listed in the System for Evaluation of Scientific Journals² (affiliated to Iran’s Ministry of Science, Research and Technology) and downloaded papers from journal websites that provide free full-text access to their articles in plain-text format. Table 1 shows the number of documents in each subject, as grouped by the System for Evaluation of Scientific Journals, and Table 2 provides information about the document lengths.

² http://journals.msrt.ir/

Table 1. Statistics of number of documents per subject
Subject                              Number of documents
Humanities                           2697
Science                              1204
Veterinary Science                   469
Agriculture and Natural Resources    281
Engineering                          38
Art and Architecture                 18
Total                                4707

Table 2. Statistics of document lengths
Document Length               Percent of Documents
short (1-3000 words)          20 %
medium (3000-6000 words)      50 %
long (6000-30000 words)       30 %

3.2 Source / suspicious documents
In plagiarism corpora, the document collection is usually split into two main subgroups, i.e. source documents and suspicious documents. Source documents are documents from which parts of text are selected as plagiarism cases. These parts are then inserted into the text of the so-called suspicious documents. In other words, suspicious documents are documents that include text taken from source documents. We follow PAN's tradition of using half of the documents as source documents and half as suspicious documents. The subjects of the papers were taken into consideration while dividing the collection into halves, i.e. 50 percent of the papers in humanities were used as source documents and 50 percent as suspicious documents, and likewise for the other subjects.

3.3 Plagiarism per document
Fifty percent of the suspicious documents have no plagiarism cases. As mentioned in [11], the documents without plagiarism allow one to determine whether or not a detector can distinguish plagiarism cases from overlaps that occur naturally between random documents. Statistics of plagiarism per document in the rest of the suspicious documents, i.e. 25 percent of the whole corpus, are given in Table 3.

Table 3. Statistics of plagiarism per document in documents with plagiarism
Plagiarism Per Document    Percent of Documents
hardly (5%-20%)            30 %
medium (20%-50%)           25 %
much (50%-80%)             30 %
entirely (>80%)            15 %

3.4 Plagiarism case length
Our corpus consists of a total of 5862 plagiarism cases with lengths between 50 and 5000 words. Table 4 shows the statistics. Long plagiarism cases may include more than one sentence.

Table 4. Statistics of lengths of plagiarism cases
Plagiarism Case Length      Percent of Cases
Short (50-150 words)        34 %
Medium (300-500 words)      33 %
Long (3000-5000 words)      33 %

3.5 Topic match
The six general topic categories of the papers used in our corpus were introduced in Table 1. Fifty percent of the plagiarism cases were made between papers with the same topic (intra-topic cases) and fifty percent between papers with different topics (inter-topic cases).

3.6 Obfuscation types
In many cases, plagiarized texts are manipulated by those committing plagiarism in order to avoid being detected by plagiarism detection systems or human readers. Developers of plagiarism corpora use different techniques to include such obfuscations in their plagiarism cases. An overview of the different types of obfuscation in our plagiarism cases is given in Table 5.

Table 5. Statistics of Obfuscation types
Obfuscation                  Percent of Cases
None                         40 %
Random Text Operations
  > low obfuscation          20 %
  > high obfuscation         20 %
Semantic Word Variation
  > low obfuscation          10 %
  > high obfuscation         10 %

As shown in Table 5, 40 percent of the plagiarism cases have no obfuscation. As explained in [11], since the writing style of the original author is preserved in plagiarism cases without obfuscation, these cases are especially appropriate for evaluating intrinsic plagiarism detection. Random text operations are operations such as adding, deleting and substituting words, all of which are done randomly. Semantic word variation, on the other hand, is the random substitution of words with their synonyms. We use the Comprehensive Dictionary of Persian Synonyms and Antonyms³ as a resource for extracting synonyms. The terms “low obfuscation” and “high obfuscation” in Table 5 indicate the degree of obfuscation, i.e. how many words have been added, deleted or substituted.

³ The plain-text version of this dictionary can be downloaded from this link: http://dadegan.ir/catalog/D3911124a

4. SUMMARY AND FUTURE WORK
As explained above, Mahak Samim is a plagiarism corpus which can be used for evaluating both intrinsic and external plagiarism detection systems. In order to preserve overall balance, many factors (plagiarism per document, plagiarism case length, topic match, obfuscation type, and obfuscation degree) were taken into consideration while preparing each plagiarism case. The corpus files are prepared according to the format of previous PAN corpora, which include XML files that record the starting point of each plagiarism case in the relevant source and suspicious documents and the length of the plagiarism case.

Plagiarism cases in our corpus are cases of “artificial plagiarism”. Using “real plagiarism” cases in plagiarism corpora is problematic due to ethical, legal, and financial issues [11]. However, we may enrich our corpus by adding cases of simulated plagiarism. Other types of artificial obfuscation, such as POS-preserving word shuffling, could also be employed. The corpus may be easily expanded with both academic papers and other types of documents such as books and web articles.

This paper has been submitted to the PAN@FIRE2016 Shared Task on Persian Plagiarism Detection and Text Alignment Corpus Construction [2], and the corpus is available through the Peykaregan⁴ website.

⁴ www.peykaregan.com

5. ACKNOWLEDGMENTS
Special thanks to Dr. Martin Potthast for his valuable help and to Dr. Mahdi Behnia and Mr. Amirhossein Rajabzadeh Assarha, our colleagues in the Computer Research Center of Islamic Sciences, for their support and their comments.

6. REFERENCES
[1] Asghari, H., Khoshnava, K., Fatemi, O. and Faili, H., 2015. Developing Bilingual Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus. In Working Notes Papers of the CLEF 2015.
[2] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and Potthast, M., 2016. Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[3] Khoshnavataher, K., Zarrabi, V., Mohtaj, S. and Asghari, H., 2015. Developing Monolingual Persian Corpus for Extrinsic Plagiarism Detection Using Artificial Obfuscation. In Working Notes Papers of the CLEF 2015.
[4] Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B. and Rosso, P., 2009. Overview of the 1st international competition on plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse.
[5] Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P., 2010. Overview of the 2nd international competition on plagiarism detection. In Notebook Papers of CLEF 2010 LABs and Workshops.
[6] Potthast, M., Eiselt, A., Barrón Cedeño, L.A., Stein, B. and Rosso, P., 2011. Overview of the 3rd international competition on plagiarism detection. In CEUR Workshop Proceedings.
[7] Potthast, M., Gollub, T., Hagen, M., Kiesel, J., Michel, M., Oberländer, A., Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B., 2012. Overview of the 4th international competition on plagiarism detection. In Working Notes Papers of CLEF 2012 Evaluation Labs and Workshop.
[8] Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P. and Stein, B., 2014. Overview of the 6th international competition on plagiarism detection. Working Notes Papers of the CLEF 2014 Evaluation Labs, CEUR Workshop Proceedings.
[9] Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E. and Stein, B., 2013. Overview of the 5th international competition on plagiarism detection. In CLEF Conference on Multilingual and Multimodal Information Access Evaluation (pp. 301-331). CELCT.
[10] Potthast, M., Hagen, M., Göring, S., Rosso, P. and Stein, B., 2015. Towards data submissions for shared tasks: first experiences for the task of text alignment. Working Notes Papers of the CLEF 2015, pp.1613-0073.
[11] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P., 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd international conference on computational linguistics: Posters (pp. 997-1005). Association for Computational Linguistics.
[12] Reitz, J.M., 1996. ODLIS: Online dictionary for library and information science. Libraries Unlimited.
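
As an illustration of the random text operations described in Section 3.6 (randomly adding, deleting and substituting words), the following Python fragment gives a minimal sketch. It is not the procedure used to build Mahak Samim: the function name random_text_operations, the filler vocabulary, and the ratios assigned to “low” and “high” obfuscation are all hypothetical, since the paper does not state the actual values.

```python
import random

# Hypothetical obfuscation degrees: the share of word positions to modify.
# The paper does not state the actual ratios; these values are assumptions.
DEGREES = {"low": 0.1, "high": 0.4}


def random_text_operations(words, vocabulary, degree="low", seed=0):
    """Apply random add/delete/substitute operations to a list of words."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    ratio = DEGREES[degree]
    result = []
    for word in words:
        if rng.random() >= ratio:
            result.append(word)  # leave this word untouched
            continue
        operation = rng.choice(["add", "delete", "substitute"])
        if operation == "add":
            # keep the word and insert a random vocabulary word after it
            result.extend([word, rng.choice(vocabulary)])
        elif operation == "substitute":
            # replace the word with a random vocabulary word
            result.append(rng.choice(vocabulary))
        # "delete": append nothing, so the word is dropped
    return result


source = "plagiarism detectors are evaluated on annotated corpora".split()
vocabulary = ["text", "system", "paper", "journal"]
low = random_text_operations(source, vocabulary, degree="low", seed=1)
high = random_text_operations(source, vocabulary, degree="high", seed=1)
print(" ".join(low))
print(" ".join(high))
```

Semantic word variation could follow the same pattern, with the substitution drawn from a synonym list for the word being replaced rather than from a general vocabulary.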