A Plagiarism Detection Approach Based on SVM for Persian Texts Fezeh Esteki Faramarz Safi Esfahani Department of Computer Engineering, Department of Computer Engineering, Najafabad Branch, Najafabad Branch, Islamic Azad University, Islamic Azad University, Najafabad, Iran, Najafabad, Iran fe.esteki@Gmail.com fsafi@iaun.ac.ir ABSTRACT as tree edit distance [6], semantic-based techniques such as Plagiarism is defined as an unauthorized act of using or adapting WordNet Bi-Gram [22], etc. are noteworthy. One of the best ways others’ works and ideas without referring to them. Numerous to integrate a fine collection out of the mentioned methods is to methods have been proposed to detect plagiarism in different use Machine Learning Algorithms. languages; however, not a lot has been accomplished in Persian. There are not many methods to detect plagiarism in Persian texts. The present study has utilized statistical and semantic features to Among them we can point to [1] which used a combinational determine the functionality of Support Vector Machines (SVMs) fuzzy approach and [18] which utilized Vector Space Model in detecting acts of plagiarism in Persian. To increase accuracy, a method. Determining the level of similarity in above-mentioned stemmer was designed to stem Persian words. The statistical and methods necessitates setting some threshold limits based on semantic features were used to train and apply the SVM. The training data. This may lead to wrong results on other sets of data. statistical features used are Jaccard coefficient, Dice coefficient, To avoid this, machine learning algorithms can be utilized [7]. Levenshtein distance, and Longest Common Subsequence. To Among these algorithms, SVM has shown better performance detect semantic similarities, a new method called “Index Words over other algorithms [14]. It is notable that there is not any Replacement” was proposed. The proposed framework was tested research which has applied SVM for plagiarism detection in on PAN data set. The results show the precision of 0.93337, recall Persian texts. of 0.70124 and Plagdet of 0.80083. In the current study, the functionality of a Support Vector Machine (SVM) to detect plagiarism in Persian texts through CCS Concepts extrinsic method has been studied. The SVM algorithm has shown • Text mining➝ Paraphrase detection➝ Plagiarism superior applicability over other algorithms at solving 2-class detection➝ Support Vector Machine classification problems with multiple features and data. Statistical Keywords techniques have been used to train the machine and apply it to detect plagiarism since they operate faster and show better Plagiarism detection; Text similarity; Paraphrase detection; SVM; performance in NLP applications. Jaccard coefficient, Dice Support vector machine; Persian texts; Farsi texts; coefficient, Levenshtein distance, and Longest Common 1. INTRODUCTION Subsequence are the statistical techniques which will be briefly Vast presence of e-texts on the web and their availability has discussed in the following sections. The proposed framework caused a remarkable increase in plagiarism [15]. Plagiarism is works as follows: first, each of the mentioned techniques is divided into different categories including code stealing, applied to sentences in both suspicious and original documents. paraphrasing, summarizing, translating and copying [3]. Changes Outputs (mostly digits) are then compared and make an attribute made to a text can be lexical changes at which words are replaced vector to train the SVM. Using a set of attribute vectors, the SVM with their synonyms, or structural changes at which sentence is trained and then applied to detect plagiarism. In addition to the structure undergo some changes to make the piece untraceable [8]. mentioned methods, a new procedure called “Index Words To detect plagiarism in texts which have been changed thoroughly Replacement” which uses a semantic Database named FarsNet (have not been exactly copied), some more complicated methods [24] is proposed to detect semantic similarities. The proposed are needed [13]. method chooses a word from a collection of synonyms to mark it as the index. It then detects words related to the index word in Plagiarism detection methods are divided into two main categories both original and suspicious text and replace them with the index. including Intrinsic and Extrinsic. Through extrinsic methods, a Using the mentioned techniques, level of similarity is detected at suspicious document is compared with some reference documents the final step. and similar parts are labelled as plagiarized. In intrinsic methods, no reference collection is used; parts whose writing style is 2. RELATED WORK different from the rest of the text are defined as plagiarized. The There are a few methods for detecting similarities and plagiarism majority of researches conducted in plagiarism field have utilized in monolingual Persian texts. Some of these methods can be found extrinsic methods [23]. in [1] and [18]. Numerous methods have been suggested to detect plagiarism In [1], a combinational fuzzy approach was used to detect among which statistical-based Techniques such as N-gram and plagiarism. This approach utilized a dataset for e-learning domain. Longest Common Subsequence [9], syntax-based techniques such Using the dataset jargon, the approach broke each sentence into general and knowledge domain. Then it calculated Skip-gram, N- implemented in java language. The schema of the proposed gram and Number of words for each part. At the end, it identified approach is presented in figure 1. Different phases of the proposed the level of similarity based on fuzzy rules. plagiarism detection process will be discussed below. Mahdavi et al. [18] used vector space model. This model transferred all texts in the database and the source text to attribute vectors. It then evaluated cosine similarity among semi-similar texts and finally detected similarity level by measuring the overlapping area of Tri-gram. The SVM has not been used for detecting text similarities in Persian texts yet. In this research, we used SVM to detect plagiarism in monolingual Persian texts. What follows reviews some researches which have adopted SVM to detect similarities between English texts. In [2], a method which used SVM and Naïve Bayes techniques to detect plagiarism was employed. The study used fingerprint similarity, latent semantic analysis, word pair and word similarity attributes to train the SVM. The word similarity attribute was applied to detect similar words and the word pair was employed to detect non-similar words with similar meanings. The Results showed that plagiarism detection by SVM has more accuracy compared to that by Naïve Bayes. Plagiarism detection by counting similar tokens and taking advantage of SVM method was the approach proposed in [25]. At Figure 1. The schema of proposed approach in this study. the first step of this approach, words were labelled and named as Token. At the next level, sentence tokens were compared with each other; meanwhile, the tokens were being replaced with their 3.1 Preprocessing Phase similar words and comparison process was repeated over and over At this phase, original and suspicious texts were preprocessed. again to increase the accuracy of semantic similarity detection This phase includes normalization, stemming and eliminating stop process. This approach used The Latent Semantic Analysis (LSA), words. The normalization process standardized the space between Translation Edit Rate–Plus (TER-P), Dictionary-based Similarity, words, their forms and punctuation marks. Then, words were Maximum Similarity and BLEU attributes to train and test the stemmed and their prefixes and suffixes were eliminated. SVM. Since there was not a perfect stemmer implemented in Java Similarity detection through the use of SVM by utilizing programming language for Persian and the available stemmers did statistical and semantic attributes was the main goal of the not eliminate some affixes and also the verbs were not stemmed proposed systems in [7] and [14]. In these systems, statistical by existing stemmers, in this study, a stemmer which stems the similarity was detected using Skip-gram and LCS attributes and words based on part of speech categories was designed and used. semantic similarity was determined using Noun/Verb Similarity Measure, Lin Similarity Measure, Cardinal Number Attribute and After stemming, stop words were eliminated by using the Proper Name Attribute. The semantic features were extracted available stop word list. Stop words are frequently used words in a using a Tree Tagging tool. This tool labels words based on the text. The majority of these words do not have an independent semantic relationships existed in WordNet database. The Cardinal meaning and are used to establish relationships between main Number Attribute was used to detect similarities between numbers words. and the Name Attribute Proper was applied to do so between names. [5] proposed a system to detect similarities between sentences 3.2 Suggested Approach: Index Word using SVM and statistical attributes. The system divided the Replacement sentences of both suspicious and similar documents into segments A great deal of methods proposed to detect plagiarism in English and then calculated the statistical attributes for each segment. This texts use WordNet lexicon database. A similar database has been research proposed a new feature called EDU ((Elementary created for Persian which is called FarsNet [24]. This database is Discourse Unit) Similarity, and utilized BLEU, NIST, TER, very limited in terms of semantic relationships (hypernyms and TERP, METEOR and BADGER attributes parallel to EDU hyponyms), so instead of extracting semantic features from the Similarity. This features usually used in automatic translation sentences by using this database, we suggested a new approach systems. that uses synonyms to detect semantic similarity between sentences. The proposed approach operates as follows: an index word is chosen out of each collection of synonyms. Then, if there is a match for each of the index words in original and suspicious 3. METHODOLOGY texts, it will be replaced by its related index word. For example, The present study has proposed a new method based on SVM to from the lexical set of “beautiful”, “good looking” and “pretty”, detect plagiarism in Persian texts. Statistical attributes were used the word “beautiful” is selected as the index and it will be to train and apply the SVM. A brand-new approach called “Index substituted with its similar words- “good looking” or “pretty”– if a Word Replacement” which will be discussed below was proposed text includes them. to determine semantic similarities. The suggested algorithm was 3.3 Extracting Features from a Text 3.5 Plagiarism Detection Using SVM Jaccard similarity, Dice similarity, Levenshtein distance and The statistical attributes of sentences were extracted from both Longest Common Subsequence are the statistical attributes which original and suspicious texts and classified by SVM. If a pair is have been used in the present study. They are calculated as labelled as “PLAGIARISM”, it means that the suspicious one has follows: been stolen; otherwise, the sentence is considered as original. 3.3.1 Levenshtein distance 3.5.1 Support Vector Machine (SVM) Levenshtein distance between two strings is the minimum Support vector machines which use statistical learning techniques number of editing operators required to change one string into were proposed by Vapnik in 1998. These algorithms utilize the another. By editing operators, one means the operations of strategy of maximizing the distance between a hyperplane and insertion, deletion and substitution of words. [16] It is calculated training samples to choose a proper hyperplane in order to classify by equation 1. data correctly. When there is noise in training data, using a liner classifier (hyperplane) for classifying training data is impossible, so primary samples are mapped into a higher space in non-linear Equation 1 form. In the new and larger space, data will be linearly classified by a kernel function using the proper hyperplane without raising the complexity of computations. In fact, kernel function uses the similarity between data in the original space to find similarities 3.3.2 Jaccard coefficient between vectors in a larger space. Kernel function can be selected For text document, the Jaccard coefficient is the number of form polynomial functions, RBF function, hyperbolic tangent or common words to the number of words which are not common other proper functions. [12] between two texts. [11]. The formula presented in equation 2. The present study has used a Support Vector Machine with RBF (Radial Base Function) kernel. Equation 2. 3.6 Evaluation 3.6.1 The utilized data set for training the SVM 3.3.3 Dice coefficient The dataset used for training the SVM is the one used in [21]. The The Dice measure is very similar to the Jaccard measure [17]. It dataset consists of 3 different categories of text as follows: 1) is computed by equation 3. TMC which contains news texts, 2) IRANDOC which includes texts related to sciences and technology and 3) Selected texts from Prozheh.com which consists of students’ researches. Paraphrased Equation 3. texts which are based on this collection have been generated manually and mechanically and are divided into four categories: “Non-plagiarism” category that includes non-similar texts. “Synonyms replacement” category that contains texts whose 3.3.4 Longest Common Subsequence words have been replaced with their synonyms. “Change It measures the longest sequence of words with the same order in Structured” category that includes texts whose sentence structures two string [22]. This attribute is calculated using Equation 4. have been changed, and “Combined category” consists of texts with changed structures and word-replaced-with-synonym vocabulary. Sim (A, B) Equation 4. 3.6.2 The utilized data set for testing and evaluating 3.4 SVM Training Process the proposed system At this phase, the non-similar sentence pairs were extracted from Evaluation of the proposed system was conducted on a dataset the xml files in “non-plagiarism” part of dataset and the similar introduced for Persian by [4] in PAN 2016 competition. This sentence pairs were derived from the xml files in other parts of dataset contains a collection of Source and Suspicious document dataset. It is notable that in the utilized training dataset, instead of pairs which have been divided into four categories with regard to paragraphs, the sentences are tagged in xml files. After extracting different levels of obfuscation. “No-obfuscation” category the sentence pairs, the similar index words, explained in includes copied and clear stolen texts. “Random-obfuscation” section3.2, were substituted in sentences. Then, the statistical group contains documents which have been changed artificially techniques, mentioned in section3.3, were applied to sentences and by a machine. “Obfuscation-simulated” group keeps and the attributes for each pair were extracted. Using the numeric documents which have been changed manually and by man and outputs, some training vectors were created to train the machine. bear high levels of obfuscation. Attribute vectors of similar sentences were labelled as “PLAGIARISM” class and those of non-similar sentences were 3.6.3 Evaluation results Parameters used in PAN2016 include Recall, Precision and labelled as “NONPLAIGIARISM” class. Plagdet which have been proposed by Potthast and colleagues [19], [20]. Also, the evaluation platform is designed by [10]. Evaluation results can be found in figure 2. [6] Bille, P. 2005. A survey on tree edit distance and related problems. Theoretical computer science, 2005. 337(1): p. 217-239. [7] A. Chitra and C. Kumar. 2010. Paraphrase identification using machine learning techniques. In Proceedings of the 12th international conference on Networking, VLSI and signal processing, 2010: p. 245-249. [8] Chong, M.Y.M. 2013. A study on plagiarism detection and plagiarism direction identification using natural language processing techniques. [9] Dagan, I. and Glickman, O. 2004. Probabilistic textual entailment: Generic applied modeling of language variability. Learning Methods for Text Understanding and Mining, 2004: p. 26-29. [10] Gollub, T., Stein, B. and Burrows, S. 2012. Ousting Ivory Tower Research: Towards a Web Framework for Providing Figure 2. The evaluation results Experiments as a Service. In Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), August 2012. ACM. ISBN 4. CONCLUSION 978-1-4503-1472-5. p. 1125-1126. An extrinsic SVM-based method was proposed to detect plagiarism in Persian texts in the current study. Also, the [11] Huang, A. 2008. Similarity measures for text document functionality and performance of SVM method to detect clustering. In Proceedings of the sixth new zealand computer plagiarism in Persian texts was evaluated. To train the SVM, a science research student conference (NZCSRSC2008), combination of statistical attributes were used. A new approach Christchurch, New Zealand. called “Index Word Replacement” was suggested to detect [12] Joachims, T. 1998. Text categorization with support vector semantic similarities. According to the results, it can be concluded machines: Learning with many relevant features. In that statistical methods operate effectively at similarity detection European conference on machine learning. Springer. processes. As further suggestions, testing different statistical or semantic and syntactic attributes to train the machine and [13] Keck, C. 2006. The use of paraphrase in summary writing: A evaluating the following improvements to the system can be comparison of L1 and L2 writers. Journal of Second another field of further research. Language Writing, 2006. 15(4): p. 261-278. [14] Kozareva, Z. and Montoyo, A. 2006. Paraphrase identification on the basis of supervised machine learning 5. REFERENCES techniques. In Advances in natural language processing. 2006, Springer. p. 524-533. [1] Ahangarbahan, H. and Montazer, G.A. 2015. A Mixed Fuzzy [15] Leung, C.-H. and Chan, Y.-Y. 2007. A natural language Similarity Approach to Detect Plagiarism in Persian Texts. In processing approach to automatic plagiarism detection. In International Work-Conference on Artificial Neural Proceedings of the 8th ACM SIGITE conference on Networks. Springer. Information technology education. 2007. ACM. [2] Alfikri, Z.F. and Purwarianti, A. 2014. Detailed Analysis of [16] Levenshtein, V. I. 1966. Binary codes capable of correcting Extrinsic Plagiarism Detection System Using Machine deletions, insertions and reversals. In Soviet physics doklady, Learning Approach (Naive Bayes and SVM). Indonesian 1966, p. 707. Journal of Electrical Engineering and Computer Science, [17] Ljubešić, N., Boras, D., Bakarić, N., Njavro, J. 2008. 2014. 12(11): p. 7884-7894. Comparing measures of semantic similarity. In Information [3] Alzahrani, S.M., Salim, N. and Abraham, A. 2012. Technology Interfaces, 2008. ITI. 30th International Understanding plagiarism linguistic patterns, textual features, Conference on. IEEE. and detection methods. IEEE Transactions on Systems, Man, [18] Mahdavi, P., Siadati, Z. and Yaghmaee, F. 2014. Automatic and Cybernetics, Part C (Applications and Reviews), 2012. external Persian plagiarism detection using vector space 42(2): p. 133-149. model. In Computer and Knowledge Engineering (ICCKE), [4] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and 2014 4th International eConference on. IEEE. Potthast, M., 2016. Algorithms and Corpora for Persian [19] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P. Plagiarism Detection: Overview of PAN at FIRE 2016. In 2010. An evaluation framework for plagiarism detection. In Working notes of FIRE 2016 - Forum for Information Proceedings of the 23rd international conference on Retrieval Evaluation, Kolkata, India, December 7-10, 2016, computational linguistics: Posters. 2010. Association for CEUR Workshop Proceedings, CEUR-WS.org. Computational Linguistics. [5] Bach, N.X., Le Minh, N. and Shimazu, A. 2014. Exploiting [20] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E. discourse information to identify paraphrases. Expert and Stein, B. 2014. Improving the Reproducibility of PAN's Systems with Applications, 2014. 41(6): p. 2832-2841. Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In Evangelos Kanoulas et al, [23] Seaward, L. and S. Matwin. 2009. Intrinsic plagiarism editors, Information Access Evaluation meets detection using complexity analysis. In Proc. SEPLN. Multilinguality, and Visualization. 5th International [24] Shamsfard, M., Hesabi, A., Fadaei, H., Mansoory, N., Conference of the CLEF Initiative (CLEF 14), Berlin Famian, A., Bagherbeigi, S., Fekri, E., Monshizadeh, M. and Heidelberg New York, September2014. Springer. ISBN 978- Assi, S.M. 2010. Semi automatic development of farsnet; the 3-319-11381-4. p. 268-299. persian wordnet. In Proceedings of 5th Global WordNet [21] Rakian, S., Esfahani, F.S. and Rastegari, H. 2015. A Persian Conference, Mumbai, India. Fuzzy Plagiarism Detection Approach. Journal of [25] Vigneshvaran, P., Jayabalan, E. and Kathiravan, A.V. 2014. Information Systems and Telecommunication (JIST), 2015. An Eccentric Approach for Paraphrase Detection Using 3(3): p. 182-190. Semantic Matching and Support Vector Machine. In [22] Roostaee, M., Fakhrahmad, S.M., Sadreddini M.H. and Intelligent Computing Applications (ICICA), 2014 Khalili, A. 2014. Efficient calculation of sentence semantic International Conference. IEEE. similarity: a proposed scheme based on machine learning approaches and NLP techniques. Scientific Journal of Review, 2014. 3(3): p. 94-106.