An n-gram based Method for Nearly Copy Detection in Plagiarism Systems

Behrouz Minaei
Iran University of Science and Technology
Tehran, Iran
b_minaei@iust.ac.ir

Mahdi Niknam
University of Qom
Qom, Iran
niknam.znt@gmail.com

ABSTRACT
Plagiarism, as a form of "intellectual-property theft", has existed for as long as human research and artistic activities. In recent years, however, easy access to the web, massive information databases and modern communication systems have turned plagiarism into a serious problem for publishers, researchers and research institutions. In this paper, we introduce an n-gram based method for identifying similar textual parts of two documents. Our evaluation shows that the method achieves high accuracy and good efficiency at the same time.

CCS Concepts
• Information systems → Near-duplicate and plagiarism detection
• Information systems → Evaluation of retrieval results

Keywords
Plagiarism detection; text alignment; n-gram

1. INTRODUCTION
Several definitions of plagiarism have been given in the scientific literature. In [9], plagiarism is described as the "wrongful appropriation and stealing and publication of another author's language, thoughts, ideas, or expressions and the representation of them as one's own original work." Plagiarism, intellectual robbery or idea thievery, has also been defined as the "attribution of others' research or literary creativity, or a part of it or a text derived from it, to oneself, as if one had created it oneself." Intellectual robbery is called literary plagiarism if committed in the sphere of literature, artistic plagiarism if perpetrated in the sphere of the arts, and academic plagiarism if done in scientific fields.

Plagiarism, as "stealing of intellectual property", has a history that coincides with the emergence of man's research and artistic activities. But easy access to the web, massive information databases and, in general, modern means of communication have in recent years made plagiarism a serious problem for publishers, researchers and research institutions. On the other hand, the country's rapid scientific growth in recent years has increased the possibility of both intentional and unintentional academic plagiarism.

2. DEFINITION OF PLAGIARISM DETECTION
In order to develop a plagiarism detection algorithm, article [7] defines the problem as follows. A plagiarism case can be represented as a tuple s = 〈splg, dplg, ssrc, dsrc〉 composed of the following parts:

• dsrc: the original document from which the plagiarized work has been derived.
• ssrc: the part of the original document that has been stolen.
• dplg: the document in which plagiarism has been detected.
• splg: the part of the document dplg in which the plagiarism occurs.

Figure 1. Elements constituting a plagiarism case

A plagiarism detector's task is to report a case of plagiarism as a detection r = 〈rplg, dplg, rsrc, d'src〉. The detection r asserts that the part rplg of the document dplg has been plagiarized from the part rsrc of the document d'src, and it aims at approximating s as closely as possible. Following [7], if the following conditions hold between s and r, it can be said that r has recognized s:

• rplg ∩ splg ≠ ∅,
• rsrc ∩ ssrc ≠ ∅, and
• d'src = dsrc.

3. THE GENERAL APPROACH IN DETECTING PLAGIARISM
Article [8] offers a general categorization of the activities required to identify academic plagiarism against a huge amount of external resources; most works in the field of academic plagiarism detection, although very different in their implementation details, follow the general approach proposed there. This approach consists of the following three stages:

• Heuristic retrieval: since the suspicious document dplg may have used any document available in the dataset D, and the total amount of data is usually huge, comparing the suspicious document with every document in the dataset is infeasible. Therefore, in the first step a subset Dx of D is selected such that the original source documents of dplg are most likely to occur in it.

• Detail analysis: in this phase the document dplg is compared with each document dx in Dx in order to identify pairs (splg, sx) that are highly similar to each other, with splg ∈ dplg and sx ∈ dx.

• Post-processing: to reduce inaccuracies in the output, the results of the previous phase are filtered so that the overall performance of the algorithm is optimized. Filters can include deleting very short cases, merging two or more cases, or removing cases that carry a proper reference to their source.

These three steps are shown in Figure 2.

5. MODELING THE PROBLEM IN TWO-DIMENSIONAL SPACE
For a better understanding of the plagiarism detection problem, the comparison of the two documents dplg and dsrc can be shown as a scatter plot. In this visualization, the word index in the document dsrc is displayed on the horizontal axis and the word index in the document dplg on the vertical axis. If a word occurs in both documents, the pair of its indices in dplg and dsrc is drawn as a spot on the graph. If a piece of text is copied from one document into the other without any modification, the plagiarized area forms a diagonal line. Figure 3 shows the similar parts of two documents without plagiarism. As the graph shows, a number of words are normally shared by any two documents; this similarity appears as scattered dark spots in the plot.
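The scatter-plot model described above can be sketched in a few lines of Python (an illustrative sketch only, not the authors' implementation; the function name and the toy sentences are ours):

```python
# Build the scatter-plot point set for two documents: every word shared by
# both texts contributes a spot (x, y), where x is its word index in d_src
# and y its word index in d_plg. An unmodified copy shows up as a diagonal run.

def scatter_points(d_src, d_plg):
    src_words = d_src.lower().split()
    plg_words = d_plg.lower().split()
    # Index the source words once so matching is O(n + m) rather than O(n * m).
    positions = {}
    for x, word in enumerate(src_words):
        positions.setdefault(word, []).append(x)
    points = []
    for y, word in enumerate(plg_words):
        for x in positions.get(word, []):
            points.append((x, y))
    return points

points = scatter_points("the quick brown fox jumps", "a quick brown fox appears")
# The copied phrase "quick brown fox" yields the diagonal run (1, 1), (2, 2), (3, 3).
```

Plotting such pairs with any charting tool reproduces pictures like those in Figures 3 and 4.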
The number of similar spots and the level of darkness of the graph differ according to the similarity of the subjects of the two documents. Figure 4 displays two documents in which a part of one document occurs exactly in the other one. The diagonal line in the graph represents a sequence of words used in the other text in the same order.

By modifying the plagiarized text, one causes changes in the diagonal line. In the following, we examine different kinds of text modification and their effects on the corresponding picture.

Figure 5. Visualization of copying text with a low replacement rate

Figure 6. Visualization of copying text with a high replacement rate

Figure 7. Visualization of copying text with sentence displacement

Figure 2. Three Phases of a Plagiarism Detection System

In this paper, we present a method to compare two documents and to identify their similar parts. This method can be used in the "detail analysis" phase.

4. TEXT FRAGMENTATION UNIT IN PLAGIARISM DETECTION ALGORITHMS
In all text processing tasks, the text must be split into chunks so that the processing algorithms can be applied to them. This is also necessary in plagiarism detection. Various processing units have been used in plagiarism detection research, such as the character, word, sentence or paragraph; the document is chunked based on the chosen unit. The smaller the units are (e.g., word and character), the more chunks the document is divided into. This increases computation time, but it makes it possible to identify even small snippets of copied text. If the pieces are long (as with sentences and paragraphs), the number of pieces decreases and the speed rises, but the algorithm loses the accuracy needed to identify small copied pieces of text.

COPS (Copy Protection System) is a part of the Digital Library Project at Stanford University. This system uses the sentence as its comparison unit and is therefore unable to identify overlaps within sentences [3]. To cover COPS' inefficiencies, the SCAM system was developed, which uses the word as its comparison unit. The CHECK system uses the structural information of the text to compare documents; its comparison unit is the paragraph, and its complete dependence on document structure is its main constraint [8]. The system introduced in [5], PPChecker, uses the sentence unit to compare documents. In their study, [6] use n-grams with n = 3 to convert each document to a set of tri-grams and then compare two documents by counting their common tri-grams. The system introduced in [2] uses a combination of the two methods [6] and [5]: it divides the suspicious document into separate sentences and converts the original documents to tri-grams; to search, the sentences of the suspicious document are converted to tri-grams and then looked up among the tri-grams of the original texts.

Figure 3. Visualization of similarities in two documents without plagiarism

Figure 4. Visualization of an exact copy of a text in another document

As an example, consider the following two near-paraphrase sentences, translated here from the Persian originals: "The present study shows that the average of the scores obtained by the wrestlers of both tested groups is very high." and "The present study has shown that the average of the marks obtained by the wrestlers of both examined groups is very high."

Figure 8. Modeling two pieces of text for detecting similar parts
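The tri-gram modeling of such a sentence pair can be sketched as follows (an illustrative sketch with English stand-ins for the Persian example pair; the function names are ours, and positions are counted from 0 here):

```python
# Turn each sentence into word tri-grams and record every shared tri-gram as
# a pair (i, j): i is its position in the suspicious text, j in the source text.

def word_trigrams(text):
    words = text.lower().split()
    return [" ".join(words[k:k + 3]) for k in range(len(words) - 2)]

def match_points(src, plg):
    src_grams = word_trigrams(src)
    plg_grams = word_trigrams(plg)
    return [(i, j)
            for i, gram in enumerate(plg_grams)
            for j, other in enumerate(src_grams)
            if gram == other]

src = "the present study shows that the average scores of both groups are very high"
plg = "the present study has shown that the average scores of both groups are high"
print(match_points(src, plg))
# The long shared phrase produces the diagonal run (5, 4), (6, 5), ..., (10, 9).
```

Runs of consecutive pairs like these are exactly what the proposed method groups into a single plagiarism case.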
Taking the presented pictures into consideration, it can be said that "the process of detecting plagiarism is the identification of the areas with higher density in the picture."

6. SUGGESTED METHOD
In this section, we present the proposed method for identifying the similar parts of two documents. To identify the cases of plagiarism in the two documents dsrc and dplg, we use the visual model presented in the previous section. The steps of the algorithm can be outlined as follows:

• A list of n-grams is generated for each document, and each n-gram, according to its order of appearance in the document, is assigned an integer counter starting at 1.

• Every n-gram that occurs in both documents is represented as a pair (i, j), where i is the position of the n-gram in the document dplg and j is its position in the document dsrc. For example, Figure 8 shows how the two example sentences given above are modeled for detecting their similar parts.

• The process starts from the beginning of the list. If both (i, j) and (i+1, j+1) occur in the list, they are put into one plagiarism case, and this continues until the end of the list. In this way, all n-grams occurring consecutively in both documents dplg and dsrc are grouped into one plagiarism case. The purpose of this step is to identify all similar cases in the two documents; plagiarism cases that occur without any changes are identified completely at this step. Two parts of documents that resemble each other completely are shown in Figure 9.

Figure 9. Detecting parts that are identical in two documents

• To overcome the changes made to the text and to reduce the number of detected cases, nearby cases are merged: if the distance between two cases is lower than a threshold relative to their length, the cases are merged.

Figure 10. Merging the adjacent identical cases

• Finally, to reduce misidentified instances, the cases whose length is lower than a threshold are removed from the list.

7. EVALUATION
To evaluate the proposed method, the Persian dataset introduced at PAN 2016 [1] and the evaluation method of [7] and [4] were used. As explained in the previous sections, the proposed method has two adjustable parameters. The first is the proximity threshold for merging nearby similar parts; the second is the length threshold for removing the cases that are shorter than a specified length. To find the best threshold values, we searched the range of 20 to 50 characters in steps of 5 for the proximity threshold and the range of 50 to 100 characters in steps of 10 for the length threshold. The algorithm was then run for each possible combination of parameter values, and in each case the output was scored with the evaluation metrics. The best results were obtained with a proximity threshold of 45 characters and a length threshold of 90 characters. The outcome of the algorithm under the evaluation metrics is shown in Table 1.

Table 1. Algorithm results based on the evaluation metrics

Plagdet    Granularity    Recall    Precision
 0.84         1.04         0.79       0.93

8. CONCLUSION
The process of identifying plagiarism has two main steps: heuristic retrieval and detail analysis. In this paper, we presented a method for the second step of the plagiarism detection process, in order to identify the similar parts of two documents. In this method, n-grams are generated from each document and the n-grams of the two documents are compared to identify the similar parts; in the next step, similar parts that are closer to each other than a threshold are merged. The proposed method, without engaging in complex text analysis techniques, can detect copied and nearly copied parts of two documents with reasonable accuracy.

9. REFERENCES
[1] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P. and Potthast, M. 2016. Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016. Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[2] Barrón-Cedeño, A. and Rosso, P. 2009. On automatic plagiarism detection based on n-grams comparison. Advances in Information Retrieval. Springer, 696–700.
[3] Brin, S., Davis, J. and Garcia-Molina, H. 1995. Copy detection mechanisms for digital documents. ACM SIGMOD Record (1995), 398–409.
[4] Improving the Reproducibility of PAN's Shared Tasks. Springer. http://link.springer.com/chapter/10.1007%2F978-3-319-11382-1_22. Accessed: 2016-11-04.
[5] Kang, N., Gelbukh, A. and Han, S. 2006. PPChecker: Plagiarism pattern checker in document copy detection. Text, Speech and Dialogue (2006), 661–667.
[6] Lyon, C., Barrett, R. and Malcolm, J. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. Plagiarism: Prevention, Practice and Policies (2004).
[7] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P. 2010. An evaluation framework for plagiarism detection. Proceedings of the 23rd International Conference on Computational Linguistics: Posters (2010), 997–1005.
[8] Stein, B., zu Eissen, S.M. and Potthast, M. 2007. Strategies for retrieving plagiarized documents. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2007), 825–826.
[9] 2016. Plagiarism. Wikipedia, the free encyclopedia.