A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

Nava Ehsan
School of Electrical and Computer Engineering, College of Engineering, University of Tehran
n.ehsan@ece.ut.ac.ir

Azadeh Shakery
School of Electrical and Computer Engineering, College of Engineering, University of Tehran
shakery@ut.ac.ir

ABSTRACT
The task of plagiarism detection entails two main steps: suspicious candidate retrieval and pairwise document similarity analysis, also called detailed analysis. In this paper we focus on the second sub-task. We report our monolingual plagiarism detection system, which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passages, a plagiarism detection method based on a vector space model, insensitive to context reordering, is presented. We evaluate the performance in terms of precision, recall, granularity and the plagdet score.

CCS Concepts
•Information systems → Near-duplicate and plagiarism detection; •Applied computing → Document analysis;

1. INTRODUCTION
The task of plagiarism detection comprises two main steps: candidate document retrieval and pairwise document detection, also called detailed analysis. Some researchers also include a third step, called post-processing, where the extracted passage pairs are cleaned, filtered, and possibly visualized for later presentation [18]. The detailed analysis step of the plagiarism detection task retrieves passages of text which have originated from another document. This process compares source and suspicious pairs and retrieves plagiarized fragments. In this paper we introduce a vector space model for this task. The proximity of words is considered by dividing the text into passages. After creating the passages, we do not take into account the order of the words within each passage. Thus, this approach is insensitive to punctuation, extra white space and permutation of the document context within the passages. We also do not use any language-specific feature, so our approach is applicable to any language. The result we obtained on the training data was about 0.82 with respect to the plagdet score.

The rest of the paper is organized as follows: Section 2 outlines related work in monolingual plagiarism detection. Section 3 describes the pairwise document similarity approach. Experimental results are discussed in Section 4. Conclusions and future work are reported in Section 5.
2. RELATED WORK
Plagiarism detection can be classified into two classes, intrinsic and external, also called detection without reference and with reference, respectively. Intrinsic detection refers to methods that use style analysis to detect parts of the text that are inconsistent in terms of writing style [15, 14]. The aim of external plagiarism detection is not only finding the suspicious text but also finding the reference of the suspicious text. In monolingual plagiarism detection, the suspicious text could be an exact copy or a modified copy [17], and it should be large enough to be more than just a coincidence.

One method for monolingual plagiarism detection is comparing fragments of suspicious and source documents using fingerprint indexing. The winnowing approach [22], which is used in the widely used plagiarism detection tool MOSS, is based on fingerprint indexing.

There are different approaches for monolingual plagiarism detection. Some of these approaches can be classified into fingerprinting [22, 13], string matching [8], stopword-based methods [24], vector space models [12, 5], probability models [3], classification [16], semantic models [6, 7] and structural models [23]. According to the monolingual experiments in [4], paraphrasing can make plagiarism detection more difficult. There has been some work on detecting paraphrased sentences in monolingual texts [1].

In recent years the PAN competition has offered an evaluation environment for plagiarism detection algorithms, along with evaluation corpora. The PAN@FIRE Persian Plagdet 2016 competition [2] provides a set of suspicious and source documents written in Persian, and the task is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections. This external plagiarism detection task provides a setting to evaluate Persian plagiarism detection systems.

3. PAIRWISE DOCUMENT SIMILARITY ANALYSIS
In this section we deal with pairwise document similarity analysis and introduce a method based on a vector space model to detect plagiarized fragments of a specified set of suspicious and source document pairs. We assume that the source and suspicious pairs are pre-defined for the system. Following the PAN competition terminology, this process is called the detailed analysis stage. The problem of pairwise document analysis is defined as follows: let S' be a suspicious document and S be a source document that is likely to contain passages similar to some passages in S'. S and S' are compared section-wise using a retrieval model. A plagiarism case is reported if, for a pair of sections (S_f, S'_f'), a similarity above a threshold is detected.

According to [18], the detection approaches for this sub-task include three building blocks: (1) seeding, (2) match merging, and (3) extraction filtering. In the following subsections, we describe them in detail.

3.1 Seeding
Given a suspicious document and a source document, matches between the two documents are identified using seed heuristics [18]. In our approach, a preprocessing phase is performed first. Given a document, we substitute the Arabic characters ي and ك with the Persian characters ی and ک, since the two letters have different character encodings. Then, stopwords, punctuation and extra white space are removed and tokens are extracted.

Since plagiarism usually happens in parts of the text, a plagiarism detection method should be able to detect local similarities where only a short passage may be common to both documents. Thus, there is a requirement to segment the texts into fragments. For each document pair, we split the texts into sentences using the ".", "?" and "!" marks. We choose a number of consecutive sentences as the smallest unit of plagiarism. Documents are divided into fragments, each containing n sentences, with one sentence of overlap. The sensitivity of the algorithm with respect to the parameter n is shown in Section 4.

After sentence splitting, a vector space model is created. The terms of the source document are considered as the vocabulary, and a binary weighting scheme is used by setting the i-th index to 1 if the i-th term occurs in the fragment and to 0 otherwise. We regard a source passage S_f and a suspicious passage S'_f' as a pair of plagiarism candidates if their cosine similarity is greater than a threshold t1 and at least three terms of the source text are found in the suspicious fragment. The last criterion is added to avoid retrieving fragments with coincidental similarity. S_f is considered as the plagiarism source of S'_f' if it has the maximum cosine similarity among all source fragments with similarity above the threshold. The vector creation approach is similar to the Eurovoc-based model proposed in [20], except that we use it for monolingual rather than cross-lingual texts. Thus, this approach could be easily adapted to cross-language plagiarism detection.
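To make the seeding stage concrete, the following Python sketch reproduces the procedure described above under stated assumptions; it is not the system's actual code. The threshold T1, the fragment size N, the empty stopword placeholder and all function names are illustrative.

```python
# A minimal sketch of the seeding stage (illustrative, not the authors' implementation).
import re
from math import sqrt

T1 = 0.3          # cosine-similarity threshold t1 (assumed value)
N = 3             # sentences per fragment (assumed value)
STOPWORDS = set() # a real Persian stopword list would be plugged in here

def normalize(text):
    # Unify Arabic and Persian forms of yeh/kaf so identical words match.
    return text.replace("\u064A", "\u06CC").replace("\u0643", "\u06A9")

def tokens(sentence):
    words = re.findall(r"\w+", normalize(sentence))
    return [w for w in words if w not in STOPWORDS]

def fragments(text, n=N):
    # Split on ".", "?", "!" and group n sentences with one-sentence overlap.
    sents = [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]
    step = max(n - 1, 1)
    return [" ".join(sents[i:i + n]) for i in range(0, len(sents), step)]

def cosine(a, b):
    # Binary weighting: each fragment is represented by the set of its terms.
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

def seeds(source_text, susp_text):
    src = [set(tokens(f)) for f in fragments(source_text)]
    sus = [set(tokens(f)) for f in fragments(susp_text)]
    pairs = []
    for j, sv in enumerate(sus):
        best = None
        for i, fv in enumerate(src):
            sim = cosine(fv, sv)
            # Require similarity above t1 and at least three shared source terms,
            # keeping only the best-matching source fragment per suspicious fragment.
            if sim > T1 and len(fv & sv) >= 3:
                if best is None or sim > best[2]:
                    best = (i, j, sim)
        if best:
            pairs.append(best)
    return pairs  # (source fragment index, suspicious fragment index, similarity)
```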
3.2 Match Merging
Once seed matches between a suspicious and a source document have been found, they are merged into aligned passages of maximal length, which are then reported as plagiarism detections [18]. To improve performance with respect to the granularity metric, we merge adjacent suspicious and source fragments and report them as a single plagiarism case. If the number of characters between two detected fragments of the source and suspicious documents is below a threshold t2, those fragments are considered adjacent and are merged into a single plagiarism case.
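A minimal sketch of this merging rule is given below. It assumes detections are represented as pairs of character spans in the source and suspicious documents and that both gaps must fall below t2, a detail the description above does not spell out.

```python
# Hedged sketch of match merging: neighbouring detections are merged when the
# character gap between them is below the adjacency threshold t2 (1500 in Section 4).
T2 = 1500

def merge_adjacent(detections, t2=T2):
    """detections: list of ((src_start, src_end), (susp_start, susp_end)) tuples."""
    merged = []
    for det in sorted(detections, key=lambda d: d[1][0]):  # order by suspicious offset
        if merged:
            (ps, pe), (qs, qe) = merged[-1]
            (cs, ce), (ds, de) = det
            # Merge when both the source-side and suspicious-side gaps are below t2.
            if cs - pe < t2 and ds - qe < t2:
                merged[-1] = ((ps, max(pe, ce)), (qs, max(qe, de)))
                continue
        merged.append(det)
    return merged
```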
3.3 Extraction Filtering
Given a set of aligned passages, a passage filter removes all aligned passages that do not meet certain criteria, for example overlapping or extremely short passages [18]. After retrieving potential plagiarized fragments in the previous steps, the sentences within a fragment are partitioned into non-overlapping n-grams. For the extraction filtering step we applied a method similar to the result filtering approach proposed in [10] to exclude false positive detections and improve precision. We use a dynamic programming algorithm to find the alignments between the n-grams of the source and suspicious texts, and the null alignments are then excluded from the start and end of the reported fragments.
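The sketch below illustrates the filtering idea only approximately: it partitions a detected fragment into non-overlapping 9-term n-grams and trims unmatched n-grams from both ends, but it replaces the dynamic programming n-gram alignment of [10] with a simple shared-term test, so it is an assumption-laden stand-in rather than the actual method.

```python
# Simplified stand-in for extraction filtering (not the alignment algorithm of [10]).
NGRAM = 9

def ngrams(terms, n=NGRAM):
    # Non-overlapping n-grams; the final partition may hold fewer terms.
    return [terms[i:i + n] for i in range(0, len(terms), n)]

def trim_fragment(susp_terms, source_terms, n=NGRAM, min_shared=1):
    source_vocab = set(source_terms)
    grams = ngrams(susp_terms, n)
    # An n-gram "aligns" here if it shares at least min_shared terms with the source
    # (min_shared is an assumed stand-in for the real alignment criterion).
    aligned = [len(set(g) & source_vocab) >= min_shared for g in grams]
    if not any(aligned):
        return []  # the whole detection is filtered out
    first = aligned.index(True)
    last = len(aligned) - 1 - aligned[::-1].index(True)
    # Keep only the span between the first and last aligned n-grams.
    return [t for g in grams[first:last + 1] for t in g]
```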
4. RESULTS
The results of the Persian detailed analysis subtask of PAN@FIRE Persian Plagdet 2016 [2] on the training data are summarized in Tables 1, 2 and 3. The experimentation platform TIRA is used for our evaluations [11, 19].

The parameters t1 and n are, respectively, the similarity threshold and the number of consecutive sentences described in Section 3.1. The adjacency threshold t2, described in Section 3.2, is set to 1500 characters. The n value for n-gram creation described in Section 3.3 is chosen to be 9 for each sentence, except for the final partition of a sentence, which may include fewer or more than 9 terms.

The evaluation metrics are described in [21]. The evaluation is based on macro-averaged precision and recall. In addition, the granularity measure characterizes the power of a detection algorithm: it shows whether a plagiarism case is detected as a whole or in several pieces. The plagdet score combines the three metrics, precision, recall and granularity, and is defined as follows for plagiarism cases S and plagiarism detections R [21], where F1 is the harmonic mean of precision and recall:

    plagdet(S, R) = F1 / log_2(1 + granularity(S, R))                    (1)

                   Precision  Recall  Granularity  Plagdet
(t1 = 0.4, n = 5)  0.8532     0.5357  1            0.6582
(t1 = 0.3, n = 5)  0.7630     0.7486  1            0.7557
(t1 = 0.2, n = 5)  0.4004     0.8151  1            0.5370
(t1 = 0.3, n = 3)  0.7867     0.8304  1            0.8080
(t1 = 0.3, n = 2)  0.8482     0.7876  1            0.8168
Table 1: Results of the detailed analysis sub-task using macro-averaged precision, recall, granularity, and the plagdet score

                   Precision  Recall  F1
(t1 = 0.4, n = 5)  0.9434     0.5478  0.6931
(t1 = 0.3, n = 5)  0.8484     0.7631  0.8035
(t1 = 0.2, n = 5)  0.4379     0.8292  0.5731
(t1 = 0.3, n = 3)  0.8931     0.8518  0.8720
(t1 = 0.3, n = 2)  0.9069     0.8162  0.8592
Table 2: Results of the detailed analysis sub-task at case level

                   Precision  Recall  F1
(t1 = 0.4, n = 5)  0.9457     0.5534  0.6982
(t1 = 0.3, n = 5)  0.8798     0.7701  0.8213
(t1 = 0.2, n = 5)  0.6266     0.8342  0.7156
(t1 = 0.3, n = 3)  0.9312     0.8555  0.8918
(t1 = 0.3, n = 2)  0.9452     0.8220  0.8793
Table 3: Results of the detailed analysis sub-task at document level

The tables show that the plagdet score improves as the number of sentences in a fragment decreases. The reason could be that there are more short plagiarized texts than long ones in the dataset. The increase in precision when comparing different numbers of sentences per fragment at t1 = 0.3 suggests that decreasing the number of sentences may have a higher impact on decreasing rather than increasing the false positive detections.

We also analysed the effect of the extraction filtering part of the approach and found that without this stage, for (t1 = 0.3, n = 3), the precision was 0.6026 and the plagdet score was 0.7087. This shows that this step improved the plagdet score by about 14 percent.

We used 0.3 for t1 and 5 sentences for the parameter n (number of consecutive sentences) on the test set. The results on the test set are shown in Table 4.

                   Precision  Recall  Granularity  Plagdet
(t1 = 0.3, n = 5)  0.7496     0.7050  1            0.7266
Table 4: Results of the detailed analysis sub-task on the test set
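As a quick sanity check of Equation (1), the plagdet value reported in Table 1 for (t1 = 0.3, n = 5) can be recomputed from its precision, recall and granularity:

```python
# Recompute the plagdet score of Equation (1) for one row of Table 1.
from math import log2

def plagdet(precision, recall, granularity):
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    return f1 / log2(1 + granularity)

print(round(plagdet(0.7630, 0.7486, 1), 4))  # -> 0.7557, matching Table 1
```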
5. CONCLUSION AND FUTURE WORK
The task of plagiarism detection entails two sub-tasks, suspicious candidate retrieval and pairwise document similarity analysis. We introduce a pairwise document analysis approach for the Persian language. An approach based on a vector space model is described for computing pairwise document similarity. The principle of this approach is that sentences containing more common words are more likely to be a source of plagiarism. The method contains three building blocks: seeding, match merging and extraction filtering. Our work is tested on a Persian corpus which offers an evaluation environment for plagiarism detection algorithms. The proposed approach is insensitive to context reordering and can be applied to any language.

The detailed analysis sub-task could be improved by expanding the representative words of the document to find appropriate substitutes for a word in context, in order to capture intelligent plagiarism. For this purpose, the noise that word expansion may introduce needs to be minimized. Other weighting schemes, such as tf-idf weighting, could be applied in comparing the vectors of the texts. A complete plagiarism detection system could be developed by adding a candidate selection step [9] before pairwise document analysis.

Acknowledgement
We would like to acknowledge the assistance and information provided by Hossein Nasr Esfahani and Mahsa Shahshahani.

6. REFERENCES
[1] I. Androutsopoulos and P. Malakasiotis. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135–187, 2009.
[2] H. Asghari, S. Mohtaj, O. Fatemi, H. Faili, P. Rosso, and M. Potthast. Algorithms and corpora for Persian plagiarism detection: Overview of PAN at FIRE 2016. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] A. Barrón-Cedeño, P. Rosso, and J.-M. Benedí. Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing, pages 523–534. Springer, 2009.
[4] A. Barrón-Cedeño, M. Vila, M. A. Martí, and P. Rosso. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4):917–947, 2013.
[5] S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In ACM SIGMOD Record, volume 24, pages 398–409. ACM, 1995.
[6] C.-Y. Chen, J.-Y. Yeh, and H.-R. Ke. Plagiarism detection using ROUGE and WordNet. Journal of Computing, pages 34–44, 2010.
[7] M. Chong and L. Specia. Lexical generalisation for word-level matching in plagiarism detection. In RANLP, pages 704–709, 2011.
[8] P. Clough and M. Stevenson. Developing a corpus of plagiarised short answers. Language Resources and Evaluation, 45(1):5–24, 2011.
[9] N. Ehsan and A. Shakery. Candidate document retrieval for cross-lingual plagiarism detection using two-level proximity information. Information Processing & Management, 52(6):1004–1017, 2016.
[10] N. Ehsan, F. W. Tompa, and A. Shakery. Using a dictionary and n-gram alignment to improve fine-grained cross-language plagiarism detection. In Proceedings of the 2016 ACM Symposium on Document Engineering, pages 59–68. ACM, 2016.
[11] T. Gollub, B. Stein, and S. Burrows. Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In B. Hersh, J. Callan, Y. Maarek, and M. Sanderson, editors, 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), pages 1125–1126. ACM, Aug. 2012.
[12] C. Grozea, C. Gehl, and M. Popescu. Encoplot: Pairwise sequence matching in linear time applied to plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse, pages 10–18, 2009.
[13] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web, pages 141–150. ACM, 2007.
[14] S. Meyer zu Eißen and B. Stein. Intrinsic Plagiarism Detection. In Advances in Information Retrieval. 28th European Conference on IR Research (ECIR 06), volume 3936 of Lecture Notes in Computer Science, pages 565–569, Berlin Heidelberg New York, 2006. Springer.
[15] G. Oberreuter and J. D. Velásquez. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9):3756–3763, 2013.
[16] R. C. Pereira, V. P. Moreira, and R. Galante. A new approach for cross-language plagiarism analysis. In Multilingual and Multimodal Information Access Evaluation, volume 6360, pages 15–26. 2010.
[17] M. Potthast, A. Barrón-Cedeño, B. Stein, and P. Rosso. Cross-language plagiarism detection. Language Resources and Evaluation, 45(1):45–62, 2011.
[18] M. Potthast, T. Gollub, M. Hagen, J. Kiesel, M. Michel, A. Oberländer, M. Tippmann, A. Barrón-Cedeño, P. Gupta, P. Rosso, et al. Overview of the 4th international competition on plagiarism detection. In CLEF (Online Working Notes/Labs/Workshop), 2012.
[19] M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, and B. Stein. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, and E. Toms, editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), pages 268–299, Berlin Heidelberg New York, Sept. 2014. Springer.
[20] M. Potthast, B. Stein, and M. Anderka. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530. 2008.
[21] M. Potthast, B. Stein, A. Barrón-Cedeño, and P. Rosso. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 997–1005. Association for Computational Linguistics, 2010.
[22] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM, 2003.
[23] A. Si, H. V. Leong, and R. W. Lau. CHECK: a document plagiarism detection system. In Proceedings of the 1997 ACM Symposium on Applied Computing, pages 70–77. ACM, 1997.
[24] E. Stamatatos. Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527, 2011.