A Text Alignment Corpus for Persian Plagiarism Detection

Fatemeh Mashhadirajab
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
f.mashhadirajab@mail.sbu.ac.ir

Mehrnoush Shamsfard
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
m-shams@sbu.ac.ir

Razieh Adelkhah
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
r.adelkhah@yahoo.com

Fatemeh Shafiee
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
f.shafiee@hotmail.com

Chakaveh Saedi
NLX Lab, University of Lisbon, Department of Informatics, Portugal
Ch_saedi@sbu.ac.ir

ABSTRACT
This paper describes how a Persian text alignment corpus was constructed to evaluate plagiarism detection systems. This corpus is in PAN format and contains 11,089 documents and more than 11,603 plagiarism cases. Efforts were made to simulate various types of plagiarism manually, semi-automatically, or automatically in this large-scale corpus.

CCS Concepts
• Information systems → Near-duplicate and plagiarism detection.
• Information systems → Evaluation of retrieval results.

Keywords
Plagiarism detection; Text alignment corpus; Types of plagiarism; Corpus construction.

1. INTRODUCTION
Plagiarism is the use of others' phrases, solutions, ideas, or results without faithful citation. The considerable worldwide growth of plagiarism in recent years emphasizes the importance of dealing with this phenomenon. Plagiarism is an ethical challenge in science to which there are many contributing factors; however, the development of plagiarism detection systems can at least reduce its growth. The PAN competition, which has been held yearly since 2009, is one well-known example in the plagiarism detection area. Such competitions provide a suitable setting for comparing researchers' different approaches and solutions. Having a suitable evaluation corpus is one of the most important requirements in such a competition. This article describes how the corpus for the text alignment corpus construction task in Persian Plagdet 2016 [1] was constructed.

Researchers have produced different taxonomies of plagiarism types [19, 20, 21]. The taxonomy of plagiarism presented by Alzahrani et al. [2] is shown in Fig. 1. This taxonomy was used in the current study to construct a data set for evaluating plagiarism detection systems. In the second section, we review available text alignment corpora, and in the third section the method for developing the corpus is described. The fourth section explains how each mentioned type of plagiarism is simulated, and finally, dataset statistics for the constructed corpus are given.

2. RELATED WORK
Numerous text alignment datasets, including the PAN plagiarism corpora, have been employed to evaluate text alignment algorithms in plagiarism detection competitions since 2009 [3, 8, 9, 16, 17, 18]. The first text alignment data set, developed by PAN in 2010 [3], includes 27,073 documents in English and 68,558 cases of plagiarism. In this massive data set, plagiarism cases are generally produced with two strategies. Simulated Plagiarism is the first strategy, in which 907 people were asked to rewrite the given original texts so that the meaning of the original is not changed but its surface form is replaced with different words and phrases. Artificial Plagiarism is the second strategy, in which automated methods are used to change the text. The techniques used in this strategy are divided into three categories. The first category inserts, removes, and replaces words and short phrases; the second replaces words with their synonyms, antonyms, hyponyms, or hypernyms; and the third relocates words in a sentence that share the same POS tag.

Another text alignment corpus was used by PAN to evaluate algorithms in 2013 and 2014 [9, 18]. This corpus includes 3,653 suspicious documents and 4,774 source documents in English and 8,000 cases of plagiarism. It consists of three types of obfuscation strategies: Random obfuscation, Cyclic Translation obfuscation, and Summary obfuscation. Random obfuscation uses techniques similar to the Artificial Plagiarism strategy. In Cyclic Translation obfuscation, a text is manually or automatically translated into another language and, after editing, is translated back into the source language. To simulate Summary obfuscation, which is considered a plagiarism technique, PAN used the evaluation corpora of automatic summarization systems.

Moreover, in 2015, instead of inviting text alignment algorithms, PAN asked participants to submit text alignment data sets, and a total of 8 data sets were submitted to PAN 2015 [22-29]. These data sets are in different languages and use a variety of techniques to obfuscate the text. Alvi's corpus [22] is among the submitted corpora; it includes 272 documents in English and 150 plagiarism cases. Alvi uses character-substitution, human-retelling, and synonym-replacement techniques to obfuscate text. Asghari [27] submitted a Persian-English parallel corpus to PAN 2015. This corpus includes 27,115 documents and 11,200 plagiarism cases. Cheema's corpus [23] includes 1,000 documents in English and 250 plagiarism cases. In this corpus, in order to obfuscate texts, a number of students of different academic courses were asked to select and rewrite a number of texts related to their fields and to put them inside documents on the same subject, such as Wikipedia documents. A bilingual English-Urdu corpus that includes 1,000 documents and 270 plagiarism cases was also submitted to the PAN 2015 competition by Hanif [24]. In this corpus, machine translation with and without manual correction of the results was used, with the random-obfuscation strategy applied to some translation results to obfuscate the text. Khoshnavataher [26] presented a corpus in Persian that includes 2,111 documents and 823 plagiarism cases. To obfuscate, the Random obfuscation technique was used, together with the no-obfuscation technique, in which a piece of the source document is added to the suspicious document without any change. Kong [25] also took part in the PAN 2015 competition with 160 documents in Chinese and 152 cases of plagiarism. To obfuscate text, Kong asked a number of volunteers to write a paper on specified topics. Mohtaj's corpus [28] was also submitted to PAN 2015, with 4,261 documents in English and 2,781 plagiarism cases. In this corpus, the no-obfuscation, random-obfuscation, and simulated-obfuscation techniques are used to obfuscate text. Palkovskii [29] makes use of the PAN 2013-2014 corpus to prepare a corpus that includes 5,057 documents in English and 4,185 plagiarism cases. Obfuscation is based on the random-obfuscation, no-obfuscation, cyclic-translation-obfuscation, and summary-obfuscation techniques.

In the rest of this paper we describe the construction method we employed to develop a text alignment corpus for evaluating Persian plagiarism detection systems.

3. TEXT ALIGNMENT CORPUS CONSTRUCTION
The goal in text alignment is to identify the plagiarized segments for each given pair of source and suspicious documents [8]. In this study, a text alignment corpus is created to evaluate plagiarism detection systems on Persian scientific documents. The procedure followed to build this corpus is described below.

a. Data Source Preparation
We use some documents from the source document collection of the Mahtab plagiarism detection system [15] to construct our text alignment corpus. The Mahtab plagiarism detector is developed at the Shahid Beheshti University NLP Lab. The goal of Mahtab is to detect plagiarized articles in the fields of computer science and engineering. Our text alignment corpus contains 11,089 documents. They are all articles or theses in the fields of computer science and engineering and electrical engineering, with the following distribution:
• 4,500 documents from Wikipedia articles;
• 1,500 documents from CSICC¹ articles (2004-2015);
• 1,500 documents from articles and theses available from online stores;
• 3,589 documents from free Persian resources including mag-iran², iran-doc³, SID⁴, prozhe⁵, and MatlabSite⁶.

b. Documents Clustering
Since all documents in the corpus are in the field of computer science, there is a general similarity among them. The method proposed for document clustering is to estimate cluster features first and then to perform clustering based on these features. Finally, an optimization process improves the results. To extract features, all words in a document are extracted and stemmed using STeP-1 [4]. Each word is then labeled based on Table 1, which was introduced by Makrehchi [5]. For each document, an n-bit histogram vector $V = (v_1, v_2, \ldots, v_n)$ is produced, where $n$ is the number of features. If feature $i$ exists in the document, $v_i = 1$; otherwise, $v_i = 0$. Afterwards, these vectors are clustered with the K-means algorithm using cosine similarity. To optimize the extracted features in a cluster, the sum of all vectors of the cluster is computed and used to produce $H = (h_1, h_2, \ldots, h_n)$, where $h_1$ indicates the number of documents containing the first feature. $H$ is produced for all clusters; Equation (1) can be used to calculate the weight of each feature in the corresponding cluster, where $f_c$ indicates the number of clusters containing this feature. The features are sorted in descending order of weight. Afterwards, the first 100 words of each cluster are considered as the features for that cluster.
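The vector construction and clustering just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`binary_vector`, `cosine`, `kmeans_cosine`, `cluster_histogram`), the toy feature list, and the hand-picked initial centroids are all hypothetical, and stemming with STeP-1, the word labeling of Table 1, and the feature weighting of Equation (1) are left out.

```python
import math
import random


def binary_vector(doc_words, features):
    """Build the n-bit vector V: v_i = 1 if feature i occurs in the document."""
    present = set(doc_words)
    return [1 if f in present else 0 for f in features]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans_cosine(vectors, k, init_indices=None, iterations=20, seed=0):
    """Simple K-means: each vector joins the centroid with the highest cosine similarity.

    init_indices (an assumption of this sketch) selects the starting centroids;
    if omitted, k documents are sampled at random.
    """
    if init_indices is None:
        init_indices = random.Random(seed).sample(range(len(vectors)), k)
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_histogram(vectors, labels, cluster):
    """Build H for one cluster: h_i = number of member documents containing feature i."""
    members = [v for v, lab in zip(vectors, labels) if lab == cluster]
    return [sum(col) for col in zip(*members)] if members else []


# Toy example with already-stemmed English words standing in for Persian stems.
features = ["network", "router", "pixel", "image"]
docs = [
    ["network", "router", "packet"],
    ["router", "network"],
    ["image", "pixel", "filter"],
    ["pixel", "image"],
]
vectors = [binary_vector(d, features) for d in docs]
labels = kmeans_cosine(vectors, k=2, init_indices=[0, 2])
print(labels, cluster_histogram(vectors, labels, 0))
```

The histogram returned by `cluster_histogram` is the per-cluster sum of binary vectors that the text calls H; ranking its entries by the weight of Equation (1) and keeping the top 100 would give the cluster features.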
To improve clustering accuracy, the membership degree to each cluster must be calculated, and documents must be placed in the most similar cluster. The membership degree of each document is calculated from three quantities: the number of all seen cluster features (the first 100 words of each cluster based on their weights are considered as cluster features) in the corresponding document, the number of cluster features occurring in the document, and the document length.

Table 1. Three categories of words in a corpus [5]

Frequency of the term     Document frequency
in the corpus             Low          Medium       High
High                      Key word     Feature      Stop Word
Medium                    Key word     Feature      Stop Word
Low                       Stop Word    Stop Word    Stop Word

Fig. 1. A taxonomy of plagiarism [2]

¹ Computer Society of Iran Computer Conference, http://csi.org.ir
² http://mag-iran.com
³ http://www.irandoc.ac.ir
⁴ http://sid.ir
⁵ http://www.prozhe.com
⁶ www.MatlabSite.com

c. Suspicious Documents Selection
Some documents are randomly selected from each cluster as suspicious documents. Almost half of the documents are employed as source documents and the other half as suspicious documents. Half of the suspicious documents are considered no-plagiarism documents, and the other half are used to produce plagiarized documents.

d. Source Set for a Plagiarism Document
For each plagiarism document in a cluster, a set of source documents named Dsrc is selected that contains no repeated document and no document very similar to the suspicious document. A source document can be used in many suspicious documents, so each time a suspicious document may select any source document from the corresponding cluster; a selected document may therefore have been selected by this suspicious document before. Moreover, if the similarity between the source document and the suspicious document is more than 50% before plagiarism passages are added to the suspicious document, the pair is not a good selection, because even if a hard obfuscation strategy is used, the plagiarism passages may be discovered by simple similarity detection algorithms. To create Dsrc for each plagiarism document, a document from the corresponding cluster is selected randomly; if the similarity based on the SimHash method [10] between the selected document and any document in Dsrc is more than 50%, the document is considered repeated; otherwise, it is included in Dsrc. This step continues until there are at least 3 documents in Dsrc. A Dsrc contains a suspicious document and at least 2 source documents. The SimHash method is employed because of the noticeable results achieved in [11]. In this phase, the source and suspicious document pairs are specified. In this way, 3,867 paired documents (source-suspicious) are produced to be included in the corpus.

e. Source Set for a No-plagiarism Document
For each no-plagiarism document, a source set is selected as described in step d. However, in this step sentence-level similarity detection is carried out for each randomly selected source and suspicious document, based on the Jaccard similarity measure and a threshold of 0.9; if the two documents share no matching sentences, the source document is added to Dsrc. Using this method, 2,630 pairs of documents are produced in this phase.

f. Source Documents Segmentation
In this step, a document is first divided into its paragraphs. Each subsequence of paragraphs that contains at least 300 words is considered a segment. If a paragraph contains fewer than 300 words, it is combined with the next paragraph. Ultimately, all segments contain at least 300 words.

g. Determine the Length of Plagiarized Segments in each Suspicious Document
The number of plagiarized segments employed in a suspicious document depends on the source document length and the length of the plagiarized segments. To decide the number of segments to use from a source document in its paired suspicious document, all paired documents are first labeled. Each randomly selected pair of documents is labeled as "entirely," "much," "medium," or "hardly" as described below.
• Entirely: The length of the source document is more than 80% of the length of the suspicious document.
• Much: The length of the source document is between 50% and 80% of the length of the suspicious document.
• Medium: The length of the source document is about 20%-50% of the length of the suspicious document.
• Hardly: The length of the source document is less than 20% of the length of the suspicious document.
If the number of paired documents with the same label is more than one-fourth of the number of paired documents with a label of smaller length that does not have enough paired documents, the label with the lower length is assigned; thus, a uniform distribution is obtained.

h. Segment Extraction
From each source document, some segments are randomly selected. The number of selected segments is based on the classification in step g.

i. Segment Obfuscation
This study offers a strategy to manually, semi-automatically, or automatically produce each type of plagiarism mentioned in Alzahrani's taxonomy of plagiarism. In this step, each segment is obfuscated based on one strategy and added to one suspicious document. It is noteworthy that all obfuscated segments included in a document must be obfuscated using the same strategy, because according to the PAN corpus format there is no overlap between suspicious documents in different strategies [9], and only one type of plagiarism should be employed in each suspicious document.

j. Obfuscated Segment Insertion
In this step, each obfuscated segment is inserted into a suspicious document at a randomly chosen position.

4. STRATEGIES FOR PLAGIARISM TYPES

• Automatic Translation
According to the types of plagiarism in Fig. 1, translation is a type of plagiarism that is divided into automatic and manual translation. Hanif et al.
[24] use the automatic translation strategy to obfuscate documents in their corpus. Moreover, the PAN 2013-2014 corpus [9, 18] uses the cyclic translation strategy. We use all three of these strategies in our corpus (described under Automatic Translation, Manual Translation, and Cyclic Translation). For the automatic translation strategy, the selected sections are translated from Persian to English by Google Translate and the results are checked by Hunspell. Then they are added to the English suspicious documents. 306 paired documents are produced using this strategy.

• Exact Copy
In this strategy, the segments produced in step h are inserted into a suspicious document with no obfuscation. Using this strategy, 324 paired documents are produced.

• Near Copy
According to Fig. 1, Near Copy [2] is a type of plagiarism that consists of insertion, deletion, substitution, and sentence split or join methods. To create this kind of plagiarism, the segments produced in step h are obfuscated through deletion, insertion, sentence replacement, and sentence division. With this method, some randomly selected sentences are deleted from the segment and replaced with randomly selected sentences from the suspicious document. Then, some randomly selected sentences are swapped. Finally, complex sentences are identified and broken into simple sentences. To do this, the complex sentence identifier developed at the Shahid Beheshti University NLP Lab is employed. Each complex sentence in the segment is replaced with its main and subordinate clauses. 457 paired documents are produced based on this strategy.

• Modified Copy
The taxonomy of plagiarism [2] includes a type of plagiarism called Modified Copy. To obfuscate a text using this strategy, the Persian sentence understanding and generation system introduced by Adelkhah et al. [7] is employed. This system performs a bidirectional conversion between Persian sentences and their semantic representation. It changes each sentence to its semantic representation and then generates a Persian sentence from that representation. To clarify, this system is composed of 2 sub-systems: 1) semantic representation production for sentences (sentence understanding), and 2) sentence production based on semantic representation (sentence generation). It is noteworthy that in the sentence production phase, in addition to structural changes, there may be chunk relocations in a sentence or word relocations in a chunk. The aim of this system is to produce sentences with the same meaning (deep structure) but different surface structures and words. Using this strategy, 465 paired documents are created.

• Text Manipulation (Paraphrasing)
Text Manipulation is performed as described earlier in Modified Copy. The difference here is the word replacement in the sentence generation phase. Each word is replaced with a synonym retrieved from FarsNet (Persian WordNet) [14] or FavaNet (WordNet of the computer domain) [13]. Hence, in addition to structure modification, the generated sentence contains different words from the original sentence, although the concept remains the same. Chunks may be moved inside a sentence; however, words are not moved within a chunk. Using this method, 604 paired documents are produced.

• Text Manipulation (Summarizing)
The goal in this step is to obfuscate a text document using summarization methods. To create such cases, the automatic Persian summarizer introduced by Shafiee et al. [6] is used, and 506 paired documents are produced.

• Manual Translation
The suspicious documents in this step are English articles in the field of computer engineering, and the source documents are Persian articles in the same field. The English articles are clustered as described in step b, and an equivalent English cluster is produced for each Persian one. Then, for each suspicious document, a source document from its equivalent Persian cluster is randomly selected. Based on what was described in steps f, g, and h, some sections of the source document are selected. Afterwards, these sections are translated by experts in the field of computer engineering and are added to the suspicious documents as described in step j. Seven hundred paired documents are produced using this strategy, which can be employed to evaluate cross-language (Persian-English) similarity detection systems.

• Cyclic Translation
With the cyclic translation strategy, the selected sections are translated from English to Persian using Google Translate, and the results are checked by Negar, a Persian spell checker developed at the NLP Lab of Shahid Beheshti University. The selected sections are then translated again from Persian to English. Finally, the results are checked by Hunspell and added to the English suspicious documents. Using this method, 388 paired documents are created.

• Idea Adoption (semantic-based meaning)
The goal in this step is to represent the main idea of a source document using new wording. Since most source documents are computer-related theses and articles, automatic idea extraction would be a complex task here, for which no highly accurate system is yet available. Hence, the researchers asked computer experts to rewrite the idea of each document in their own words. To simplify the task, only important sections of source documents, such as the abstract, were considered. Source documents were distributed among three computer PhD candidates and 30 computer MS students, and 109 paired documents were produced.

5. DATASET STATISTICS
Overall, employing all the mentioned strategies, 11,603 plagiarism cases and 6,497 paired documents are produced, of which 2,650 are no-plagiarism, 780 are no obfuscation, and 3,067 are obfuscated ones. The dataset statistics are shown in Table 2.

Table 2. Dataset statistics for our corpus

documents                          11089
plagiarism cases                   11603
languages                          fa

Document purpose
  source documents                 48%
  suspicious documents
    with plagiarism                28%
    w/o plagiarism                 24%

Document length
  short (<10 pages⁷)               64%
  medium (10-100 pages)            35%
  long (>100 pages)                1%

Plagiarism per document
  hardly (<20%)                    25%
  medium (20%-50%)                 20%
  much (50%-80%)                   26%
  entirely (>80%)                  29%

Case length
  short (<1k characters)           37%
  medium (1k-3k characters)        55%
  long (>3k characters)            8%

Obfuscation synthesis approaches
  Exact Copy                       8%
  Near Copy                        12%
  Modified Copy                    12%
  Paraphrasing                     16%
  Summary                          13%
  Manual Translation               18%
  Automatic Translation            8%
  Cyclic Translation               10%
  semantic-based meaning           3%

⁷ A page is measured as 1500 chars.

6. CONCLUSION
This article describes a methodology for building a Persian corpus for evaluating plagiarism detection systems. This large-scale corpus is in PAN format. To produce this corpus, the focus is on the simulation of different types of plagiarism. Different strategies are employed to create obfuscation in each plagiarism category; hence, a variety of plagiarism types are created in large volume.

7. REFERENCES
[1] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and Potthast, M. 2016. Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[2] Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 2.
[3] Potthast, M., Stein, B. et al. 2010. An Evaluation Framework for Plagiarism Detection. Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, Beijing, ACL.
[4] Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard Text Preparation for Persian Language. CAASL3, Third Workshop on Computational Approaches to Arabic Script-Languages.
[5] Makrehchi, M. and Kamel, M. 2004. A fuzzy set approach to extracting keywords from abstracts. North American Fuzzy Information Processing Society, NAFIPS 2003, Banff, Canada.
[6] Shafiee, F. and Shamsfard, M. 2015. The automatic Persian summarizer. The 20th Computer Society of Iran Computer Conference.
[7] Adelkhah, R., Sadeghi, R. and Shamsfard, M. 2016. Persian sentence understanding and generation: a mutual conversion. The 21st Computer Society of Iran Computer Conference.
[8] Potthast, M., Göring, S. et al. 2015. Towards Data Submissions for Shared Tasks: First Experiences for the Task of Text Alignment. Working Notes Papers of the CLEF 2015 Evaluation Labs, CEUR Workshop Proceedings, (September 2015), ISSN 1613-0073.
[9] Potthast, M., Hagen, M., Gollub, T. et al. 2013. Overview of the 5th International Competition on Plagiarism Detection. Working Notes Papers of the CLEF 2013 Evaluation Labs and Workshop, (September 2013), ISBN 978-88-904810-3-1.
[10] Manku, G. S., Jain, A. and Sarma, A. D. 2007. Detecting Near-Duplicates for Web Crawling. Data mining.
[11] Kamran, K., Ahmadi, A. and Kazemivanhari, F. 2013. Plagiarism detection in Persian text using Fingerprint algorithms. The 21st Iranian Conference on Electrical Engineering.
[12] Davarpanah, M. R., Sanji, M. and Aramideh, M. 2009. Farsi Lexical Analysis and StopWord List. Library Hi Tech, vol. 27, pp. 435-449.
[13] Iran Telecommunication Research Center (ITRC), 2013. Buali Sina University. http://217.218.62.234:8080/.
[14] Shamsfard, M., Hesabi, A., Fadaei, H. et al. 2010. Semi Automatic Development of FarsNet; The Persian WordNet. Proceedings of 5th Global WordNet Conference.
[15] Mashhadirajab, F. and Shamsfard, M. 2014. Plagiarism Detection in Persian documents. Master's thesis. Shahid Beheshti University.
[16] Potthast, M., Eiselt, A. et al. 2011. Overview of the 3rd International Competition on Plagiarism Detection. Notebook Papers of CLEF 2011 Labs and Workshops, (September 2011), ISBN 978-88-904810-1-7.
[17] Potthast, M., Gollub, T. et al. 2012. Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop, Working Notes Papers, (September 2012), ISBN 978-88-904810-3-1.
[18] Potthast, M., Hagen, M. et al. 2014. Overview of the 6th International Competition on Plagiarism Detection. CLEF 2014 Evaluation Labs and Workshop, Working Notes Papers, (September 2014).
[19] Joy, M. S., Sinclair, J. E. et al. 2013. Student perspectives on source-code plagiarism. International Journal for Educational Integrity, vol. 9, no. 1, pp. 3-19.
[20] Joy, M. S., Cosma, G. et al. 2009. A Taxonomy of Plagiarism in Computer Science. Proceedings of EDULEARN09 Conference, (July 2009), ISBN 978-84-612-9802-0.
[21] Naik, R. R., Landge, M. B., Mahender, C. N. et al. 2015. A Review on Plagiarism Detection Tools. International Journal of Computer Applications, vol. 125, no. 11.
[22] Alvi, F., Stevenson, M., Clough, P. et al. 2015. The short stories corpus. Notebook for PAN at CLEF.
[23] Cheema, W., Najib, F., Ahmed, S. et al. 2015. A corpus for analyzing text reuse by people of different groups. Notebook for PAN at CLEF.
[24] Hanif, I., Nawab, A., Arbab, A. et al. 2015. Cross-language urdu-english (clue) text alignment corpus. Notebook for PAN at CLEF.
[25] Kong, L., Lu, Z., Han, Y. et al. 2015. Source retrieval and text alignment corpus construction for plagiarism detection. Notebook for PAN at CLEF.
[26] Khoshnavataher, K., Zarrabi, V., Mohtaj, S. and Asghari, H. 2015. Developing monolingual Persian corpus for extrinsic plagiarism detection using artificial obfuscation. Notebook for PAN at CLEF.
[27] Asghari, H., Khoshnavataher, K., Fatemi, O. and Faili, H. 2015. Developing bilingual plagiarism detection corpus using sentence aligned parallel corpus. Notebook for PAN at CLEF.
[28] Mohtaj, S., Asghari, H. and Zarrabi, V. 2015. Developing monolingual english corpus for plagiarism detection using human annotated paraphrase corpus. Notebook for PAN at CLEF.
[29] Palkovskii, Y. and Belov, A. 2015. Submission to the 7th international competition on plagiarism detection. http://www.uni-weimar.de/medien/webis/events/pan-15, http://www.clef-initiative.eu/publication/working-notes, From the Zhytomyr State University and SkyLine LLC.