A Text Alignment Algorithm Based on Prediction of Obfuscation Types Using SVM Neural Network

Fatemeh Mashhadirajab
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
f.mashhadirajab@mail.sbu.ac.ir

Mehrnoush Shamsfard
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
m-shams@sbu.ac.ir

ABSTRACT
In this paper, we describe our text alignment algorithm, which achieved the first rank in the Persian Plagdet 2016 competition. The Persian Plagdet corpus includes several obfuscation strategies, and information about the type of obfuscation helps plagiarism detection systems use their most suitable algorithm for each type. For this purpose, we use an SVM neural network to classify documents according to the type of obfuscation strategy used in a document pair. Then, we set the parameter values in our text alignment algorithm based on the detected type of obfuscation. The results of our algorithm on the test and training datasets of Persian Plagdet 2016 are reported in this article.

CCS Concepts
• Information systems → Near-duplicate and plagiarism detection • Text mining → Paraphrase detection → Plagiarism detection → Support Vector Machine.

Keywords
Plagiarism detection; Text alignment; SVM neural network.

1. INTRODUCTION
In recent years, the automatic discovery of plagiarism has attracted many researchers, and many plagiarism detection systems have been developed [5, 6, 7]. Plagiarism detection has been a task of the PAN competition (http://pan.webis.de/), held every year since 2009 to evaluate participants' plagiarism detection algorithms. At PAN, the plagiarism detection task is divided into two subtasks: source retrieval and text alignment. The task of source retrieval is to retrieve, from the set of source documents, the documents similar to a suspicious document; the task of text alignment is to extract all plagiarized passages from a given source-suspicious document pair. Figure 1 shows the different parts of a plagiarism detection system.

As mentioned, in text alignment the documents in the evaluation datasets are divided into two categories: source documents and suspicious documents. Each suspicious document contains one or more parts of a source document in its original, edited, or rephrased form. The duty of text alignment, which is the focus of this paper, is to find the plagiarized parts of the source document in the suspicious document for each pair of source and suspicious documents [8]. The Persian Plagdet 2016 competition (http://ictrc.ac.ir/plagdet/), a subtask of the PAN FIRE 2016 competition (http://fire.irsi.res.in/fire/2016/home), is held for the Persian language; that is, the text alignment algorithms are evaluated on a Persian corpus.

In this paper we discuss our proposed algorithm, which participated in Persian Plagdet 2016 and ranked first among the participants. Our approach first uses a neural network to detect the type of obfuscation in each document pair; it then sets the parameters of the text alignment algorithm based on the detected type. The rest of the paper explains the proposed algorithm, with special focus on the obfuscation type detection module, and then discusses the results of evaluating the system on the Persian Plagdet 2016 corpus.

2. RELATED WORK
At the PAN competitions, text alignment algorithms are evaluated on corpora that contain different types of obfuscation. For example, in the PAN 2013-2014 competitions the evaluation corpus contained the obfuscation types none, random, translation, and summary. In the PAN text alignment corpora it is assumed that just one type of obfuscation is employed in each document pair. Based on this assumption, most participants try to predict the obfuscation strategy used in a document pair and detect similarities based on the predicted type. At the PAN 2014 competition, Glinos' algorithm [10] divided all plagiarism documents into two categories: order-based and non-order-based. Order-based plagiarism involves the none and random obfuscations; non-order-based plagiarism involves the translation and summary obfuscations. They use the Smith-Waterman algorithm [13] to detect aligned sequences in document pairs and thereby find the order-based plagiarism cases. If no aligned sequences are found, the document pair is given to a clustering component that detects non-order-based plagiarism cases. Sanchez-Perez et al. [1, 9] categorize the document pairs of the PAN 2014 corpus into three categories, Verbatim, Summary, and Other plagiarism cases, and set the parameters of their algorithm based on the category. They use the Longest Common Substring (LCS) algorithm to find every single common sequence of words (th-Verbatim). If at least one Verbatim case has been found, the document pair is considered Verbatim plagiarism. If no Verbatim case has been found and the length of the plagiarism fragments in the suspicious document is much smaller than the length of the source fragments, the document pair is considered Summary plagiarism; otherwise it is considered an Other plagiarism case. Palkovskii et al. [11] use a graphical clustering algorithm to detect the type of plagiarism in a document pair. They classify the document pairs of the PAN 2014 text alignment corpus into four categories: Verbatim Plagiarism, Random Plagiarism, Summary type Plagiarism, and Undefined type. Afterward, they set the parameters based on the detected type of plagiarism. In the Persian Plagdet 2016 corpus there are three types of obfuscation: none, random, and simulated.
Figure 1. Plagiarism detection systems.

In our proposed approach, the document pairs of the Persian Plagdet 2016 corpus are classified into two categories: Verbatim plagiarism and Simulated plagiarism. We use an SVM neural network to detect the type of plagiarism; the network has been trained on the obfuscation types of the Persian Plagdet 2016 training corpus. We then set the parameters based on the detected type of plagiarism.

3. METHODOLOGY
Our proposed text alignment algorithm, like many other text alignment algorithms [12], includes four stages: preprocessing, seeding, extension, and filtering. Each of these four stages is explained in this section. In addition, Figure 2 gives an overall scheme of our text alignment algorithm showing these four stages.

3.1 Preprocessing
In the preprocessing stage, the text is first segmented into sentences and tokenized by STeP_1 [2]; stopwords [4] are removed, and the inflectional and derivational stems of the tokens are extracted and restored by STeP_1. Preprocessing is done for each pair of suspicious and source documents, and their sentences are given to the seeding stage.

3.2 Seeding
In this stage, the purpose is to extract similar sentence pairs, which we call seeds, from the source and suspicious documents. Our seeding method is initially based on the method introduced by [1]; we expanded that method with an SVM neural network that predicts the obfuscation type, in order to adjust the parameters and gain better results. In this approach, based on the vector space model (VSM), a tf-idf vector is first calculated for every sentence of the suspicious and source documents, where tf is the term frequency in the corresponding sentence and idf is the inverse sentence frequency. Then the similarity of each sentence pair of the suspicious and source documents is calculated using the cosine measure and the Dice coefficient according to Eqs. 1, 2 and 3:

cos(x_i, y_j) = (x_i · y_j) / (|x_i| |y_j|)                             (1)

dice(x_i, y_j) = 2 |d(x_i) ∩ d(y_j)| / (|d(x_i)| + |d(y_j)|)            (2)

d(v) = { t : w(t, v) > 0 }                                              (3)

where x_i is the vector of the ith sentence of the suspicious document, y_j is the vector of the jth sentence of the source document, |.| is the Euclidean length, and d(v) is the set of terms t with non-zero weight w(t, v) in vector v. The cosine measure and the Dice coefficient are calculated for all pairs of sentences. If the similarity of two vectors x_i and y_j is more than the threshold of 0.3 (chosen based on [1]) on both criteria above, the pair of sentences is considered a seed; for the pairs of sentences whose similarity is more than 0.1 and less than 0.3 (chosen based on our experiments), the similarity is evaluated semantically. For this purpose, the type of obfuscation strategy used in the document pair is first determined using an SVM neural network (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). We use the cosine similarity between all pairs of sentences of the suspicious and source documents to create the SVM input vector, which is calculated as follows. An 8-bit vector b = (b_1, ..., b_8) is considered for each document pair: the range of similarities is divided into 8 intervals, and each bit of the vector corresponds to one of these intervals and indicates whether there are two sentences in the document pair whose similarity lies in the corresponding interval. In other words, b_k indicates a sentence similarity between the values (k-1)/8 and k/8: for k = 1, ..., 8, if there are sentence pairs in the document pair whose cosine similarity is between (k-1)/8 and k/8, then b_k = 1; otherwise, b_k = 0.

This vector is given to the SVM neural network, previously trained on the Persian Plagdet 2016 training dataset, and the obfuscation strategy of the document pair is predicted. We set the maximum and minimum similarity thresholds of the semantic similarity measure based on the type of obfuscation and the amount of similarity between the pairs of sentences. To calculate the semantic similarity, we use FarsNet [3] to extract the synsets of each term and STeP_1 to extract its inflectional and derivational stems. Thus, for each term t, a set of words syn(t) is considered, as shown in Figure 3. Then, for each term t of vector x_i, if syn(t) overlaps syn(t') for a term t' of vector y_j, then t in x_i is replaced by t' of y_j. Finally, the cosine and Dice similarities are calculated for the two resulting vectors and averaged with the cosine and Dice results of the previous stage; if the result is greater than the threshold, the pair x_i and y_j is considered a seed. The set of seeds obtained in this stage enters the extension stage.

3.3 Extension
The purpose of the extension stage is to extract the longest similar passages from the suspicious and source documents. As shown in Figure 2, extension consists of two parts: clustering and validation. In the clustering part, the document is clustered into pieces, so that each piece contains a number of seeds whose (similarity) distance does not exceed a threshold. In the validation part, among the pairs of passages created by clustering, those that are not similar enough are removed. Again, for the extension stage we adopt and enhance the method proposed by [1]; the difference is that in the validation stage we use the semantic similarity measure instead of the cosine measure to determine the similarity between pairs of passages.

Figure 2. The proposed text alignment algorithm: preprocessing (sentence splitting, tokenizing, stopword removal, stemming with STeP_1), seeding (tf-idf, cosine measure, Dice coefficient, SVM classification on the training dataset, parameter setting, semantic similarity with FarsNet), extension (clustering, validation), and filtering (resolving overlapping cases, removing small plagiarism passages).
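As an illustration, the core seeding computations described above (tf-idf sentence vectors, the cosine and Dice measures, and the 8-bit SVM input vector) can be sketched in Python. This is a minimal sketch, not the authors' C#.Net implementation: the whitespace tokenization stands in for STeP_1, and the equal-width similarity intervals are our reading of the 8-interval description.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """tf-idf vector per sentence; idf is the inverse *sentence* frequency,
    as in the seeding description above."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    sf = Counter(t for d in docs for t in d)  # number of sentences containing each term
    return [{t: tf * math.log(n / sf[t]) for t, tf in d.items()} for d in docs]

def cosine(x, y):
    """Cosine measure over sparse vectors stored as dicts (Eq. 1)."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def dice(x, y):
    """Dice coefficient: overlap of the sets of terms with non-zero weight (Eq. 2)."""
    dx = {t for t, w in x.items() if w}
    dy = {t for t, w in y.items() if w}
    return 2 * len(dx & dy) / (len(dx) + len(dy)) if dx or dy else 0.0

def svm_input_vector(susp_vecs, src_vecs, bins=8):
    """8-bit SVM input vector: bit k is set iff some sentence pair's cosine
    similarity falls in the k-th of `bins` equal-width intervals of [0, 1]."""
    bits = [0] * bins
    for x in susp_vecs:
        for y in src_vecs:
            k = min(int(cosine(x, y) * bins), bins - 1)
            bits[k] = 1
    return bits

# Tiny demonstration: one copied sentence and one unrelated sentence per document.
susp = ["the cat sat on the mat", "a dog barked loudly"]
src = ["the cat sat on the mat", "birds sing at dawn"]
vecs = tfidf_vectors(susp + src)  # idf computed over all sentences of the pair
sx, sy = vecs[:2], vecs[2:]
print(dice(sx[0], sy[0]))        # identical sentences -> 1.0
print(svm_input_vector(sx, sy))  # -> [1, 0, 0, 0, 0, 0, 0, 1]
```

In the demonstration, the copied sentence pair lights the top similarity interval while the unrelated pairs light the bottom one, which is exactly the kind of signature the SVM classifier can separate by obfuscation type.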
3.4 Filtering
The filtering stage removes passages that either overlap or are too short. To remove overlapping passages, we use the method proposed in [1]. To remove too-short passages, we use a recursive algorithm: if the length of a passage is less than a threshold, we first assume that other seeds existed in this passage but were not identified. We therefore decrease the threshold of the semantic similarity measure, go back to the seeding stage, extract the seeds based on the new threshold, and repeat all the stages. If the passage is still not long enough this time, it is removed.

Figure 3. New vectors for semantic similarity (the vectors x_i and y_j rewritten using the synset overlaps; n and m denote the numbers of terms in x_i and y_j, respectively).

4. RESULTS
We implemented our algorithm in C#.Net and evaluated it based on the PAN evaluation setup [15, 16, 17]. In the evaluation stage, we ran our algorithm on the Persian Plagdet 2016 training and test datasets [14]. The results of this evaluation are shown in Table 1. As can be seen, the results of our algorithm on the training and test corpora are very close. The training corpus in this competition includes a variety of obfuscation strategies: None, Random, and Simulated. Table 2 shows the results of our algorithm on each of the obfuscation strategies in the training dataset. In Table 2, column P_1 shows the results of our algorithm on the types of obfuscation in the training dataset when the semantic similarity measure is not used. Column P_2 shows the results using the semantic similarity measure. Column P_3 shows the results after adding the semantic similarity criterion and also adjusting the parameters based on the type of obfuscation detected by the neural network. As can be seen, in column P_2, adding the semantic similarity criterion improves the recall for all types of obfuscation in the training corpus, but the precision declines in some cases. In column P_3, it is seen that by adding the neural network for diagnosing the type of obfuscation, and setting the parameters based on the detected type, precision and recall improve dramatically on all types of obfuscation.

Table 1. The results of the proposed text alignment algorithm on the Persian Plagdet 2016 corpus

Corpus    Plagdet  Recall  Precision  Granularity
Test      0.92     0.92    0.93       1.00
Training  0.94     0.94    0.93       1.00

Table 2. The proposed algorithm on the types of obfuscation in the Persian Plagdet 2016 training dataset

Obfuscation  Measure      P_1   P_2   P_3
None         Plagdet      0.94  0.96  0.97
             Recall       0.96  0.98  0.99
             Precision    0.92  0.95  0.94
             Granularity  1     1     1
Random       Plagdet      0.81  0.84  0.94
             Recall       0.78  0.84  0.93
             Precision    0.85  0.84  0.94
             Granularity  1     1     1
Simulated    Plagdet      0.55  0.69  0.86
             Recall       0.41  0.61  0.83
             Precision    0.84  0.80  0.91
             Granularity  1     1     1

5. CONCLUSIONS AND FUTURE WORK
We described our algorithm for the task of text alignment and presented the results of its evaluation on the training and test datasets of Persian Plagdet 2016, where it achieved the best result among the participants. In our method, we used an SVM neural network to identify the type of obfuscation and then set the parameters on that basis; the results showed that this is effective in improving precision and recall. In the future, we are going to improve the semantic similarity measure in the seeding stage of our system; we want to use a neural network to estimate the semantic similarity of pairs of sentences. We also want to use methods such as genetic algorithms to adjust the parameters automatically.

6. REFERENCES
[1] Sanchez-Perez, M. A., Gelbukh, A. F., Sidorov, G. 2015. Dynamically adjustable approach through obfuscation type recognition. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum (Toulouse, France, September 8-11, 2015). CEUR Workshop Proceedings, vol. 1391. CEUR-WS.org.
[2] Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard text preparation for Persian language. In CAASL3 - Third Workshop on Computational Approaches to Arabic Script-based Languages.
[3] Shamsfard, M. 2008. Developing FarsNet: A lexical ontology for Persian. In Proceedings of the 4th Global WordNet Conference.
[4] Davarpanah, M. R., Sanji, M. and Aramideh, M. 2009. Farsi lexical analysis and stop word list. Library Hi Tech, vol. 27, pp. 435-449.
[5] Fiedler, R. and Kaner, C. 2010. Plagiarism detection services: How well do they actually perform? IEEE Technology and Society Magazine, pp. 37-43.
[6] Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, vol. 42, no. 2.
[7] Ali, A. M. E. T., Abdulla, H. M. D. and Snasel, V. 2011. Survey of plagiarism detection methods. In IEEE Fifth Asia Modelling Symposium (AMS), pp. 39-42.
[8] Potthast, M., Göring, S. 2015. Towards data submissions for shared tasks: First experiences for the task of text alignment. In Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings. CEUR-WS.org. ISSN 1613-0073.
[9] Sanchez-Perez, M., Sidorov, G., Gelbukh, A. 2014. The winning approach to text alignment for text reuse detection at PAN 2014. In Notebook for PAN at CLEF 2014 (15-18 September, Sheffield, UK). CEUR Workshop Proceedings, vol. 1180, pp. 1004-1011. CEUR-WS.org. ISSN 1613-0073.
[10] Glinos, D. 2014. A hybrid architecture for plagiarism detection - Notebook for PAN at CLEF 2014. In CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers (15-18 September, Sheffield, UK). CEUR-WS.org. ISSN 1613-0073.
[11] Palkovskii, Y. and Belov, A. 2014. Developing high-resolution universal multi-type n-gram plagiarism detector - Notebook for PAN at CLEF 2014. In CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers (15-18 September, Sheffield, UK). CEUR-WS.org. ISSN 1613-0073.
[12] Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B. 2014. Overview of the 6th International Competition on Plagiarism Detection. In Working Notes for CLEF 2014 Conference (Sheffield, UK, 15-18 September). CEUR Workshop Proceedings, vol. 1180, pp. 845-876. CEUR-WS.org.
[13] Smith, T., Waterman, M. 1981. Identification of common molecular subsequences. Journal of Molecular Biology, vol. 147(1), pp. 195-197.
[14] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and Potthast, M. 2016. Algorithms and corpora for Persian plagiarism detection: Overview of PAN at FIRE 2016. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation (Kolkata, India, December 7-10, 2016). CEUR Workshop Proceedings. CEUR-WS.org.
[15] Potthast, M., Stein, B., Barrón-Cedeño, A., and Rosso, P. 2010. An evaluation framework for plagiarism detection. In 23rd International Conference on Computational Linguistics (COLING 10), pp. 997-1005.
[16] Gollub, T., Stein, B. and Burrows, S. 2012. Ousting ivory tower research: Towards a web framework for providing experiments as a service. In 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), pp. 1125-1126. ACM. ISBN 978-1-4503-1472-5.
[17] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., and Stein, B. 2014. Improving the reproducibility of PAN's shared tasks: Plagiarism detection, author identification, and author profiling. In Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), pp. 268-299. Springer, Berlin Heidelberg New York. ISBN 978-3-319-11381-4.