An n-gram based Method for Nearly Copy Detection in Plagiarism Systems

Behrouz Minaei
Iran University of Science and Technology
Tehran, Iran
b_minaei@iust.ac.ir

Mahdi Niknam
University of Qom
Qom, Iran
niknam.znt@gmail.com

ABSTRACT
Plagiarism, as a form of "intellectual-property theft", has existed for as long as human research and artistic activities. In recent years, however, easy access to the web, massive information databases and modern communication systems have turned plagiarism into a serious problem for publishers, researchers and research institutions. In this paper, we introduce an n-gram based method for identifying similar textual parts of two documents. Our evaluation shows that the method achieves high accuracy and good efficiency at the same time.

CCS Concepts
• Information systems → Near-duplicate and plagiarism detection
• Information systems → Evaluation of retrieval results

Keywords
Plagiarism detection; text alignment; n-gram

1. INTRODUCTION
Several definitions of plagiarism have been given in the scientific literature. In [9], plagiarism is described as the "wrongful appropriation and stealing and publication of another author's language, thoughts, ideas, or expressions and the representation of them as one's own original work." Plagiarism, intellectual robbery or idea thievery, has also been defined as the "attribution of others' research or literary creativity, or a part of it or a text derived from it, to oneself, as if one had created it oneself." Intellectual robbery is called literary plagiarism if committed in the sphere of literature, artistic plagiarism if perpetrated in the sphere of the arts, and academic plagiarism if done in scientific fields.

Plagiarism, as "stealing of intellectual property", has a history that coincides with the emergence of man's research and artistic activities. But easy access to the web, massive information databases and, in general, modern means of communication have in recent years made plagiarism a serious problem for publishers, researchers and research institutions. On the other hand, the country's rapid scientific growth in recent years has increased the possibility of both intentional and unintentional academic plagiarism.

2. DEFINITION OF PLAGIARISM DETECTION
In order to develop a plagiarism detection algorithm, article [7] defines the problem as follows. A plagiarism case can be represented as a tuple s = 〈splg, dplg, ssrc, dsrc〉 composed of the following parts:

• dsrc: the original document from which the plagiarized work has been derived.
• ssrc: the part of the original document that has been stolen.
• dplg: the document in which plagiarism has been detected.
• splg: the part of the document dplg in which the plagiarism occurs.

Figure 1. Elements constituting a plagiarism case

A plagiarism detector's task is to report a case of plagiarism as a detection r = 〈rplg, dplg, rsrc, d'src〉. The detection r asserts that the part rplg of the document dplg has been plagiarized from the part rsrc of the document d'src, and it aims at approximating s as closely as possible. Following [7], if the following conditions hold between s and r, it can be said that r has recognized s:

• rplg ∩ splg ≠ ∅,
• rsrc ∩ ssrc ≠ ∅, and
• d'src = dsrc.

3. THE GENERAL APPROACH IN DETECTING PLAGIARISM
Article [8] offers a general categorization of the activities required to identify academic plagiarism against a huge amount of external resources; most works in the field of academic plagiarism detection, although very different in their implementation details, follow the general approach proposed there. This approach consists of the following three stages:

• Heuristic retrieval: since the suspicious document dplg may have used any document available in the dataset D, and the total amount of data is usually huge, comparing the suspicious document with every document in the dataset is infeasible. Therefore, in the first step a subset Dx of D is selected such that the original source documents of dplg are most likely to occur in it.

• Detail analysis: in this phase the document dplg is compared with each document dx in Dx in order to identify pairs (splg, sx) that are highly similar to each other, with splg ∈ dplg and sx ∈ dx.

• Post-processing: to reduce inaccuracies in the output, the results of the previous phase are filtered so that the overall performance of the algorithm is optimized. Filters can include deleting very short cases, merging two or more cases, or removing cases that carry a proper reference to their source.

These three steps are shown in Figure 2.

5. MODELING THE PROBLEM IN TWO-DIMENSIONAL SPACE
For a better understanding of the plagiarism detection problem, the comparison of the two documents dplg and dsrc can be shown as a scatter plot. In this visualization, the word index in the document dsrc is displayed on the horizontal axis and the word index in the document dplg on the vertical axis. If a word occurs in both documents, the pair of its indices in dplg and dsrc is drawn as a spot on the graph. If a piece of text is copied from one document into the other without any modification, the plagiarized area forms a diagonal line. Figure 3 shows the similar parts of two documents without plagiarism. As the graph shows, a number of words are normally shared by any two documents; this similarity appears as scattered dark spots in the plot.
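The scatter-plot model described above can be sketched in a few lines of Python (an illustrative sketch only, not the authors' implementation; the function name and the toy sentences are ours):

```python
# Build the scatter-plot point set for two documents: every word shared by
# both texts contributes a spot (x, y), where x is its word index in d_src
# and y its word index in d_plg. An unmodified copy shows up as a diagonal run.

def scatter_points(d_src, d_plg):
    src_words = d_src.lower().split()
    plg_words = d_plg.lower().split()
    # Index the source words once so matching is O(n + m) rather than O(n * m).
    positions = {}
    for x, word in enumerate(src_words):
        positions.setdefault(word, []).append(x)
    points = []
    for y, word in enumerate(plg_words):
        for x in positions.get(word, []):
            points.append((x, y))
    return points

points = scatter_points("the quick brown fox jumps", "a quick brown fox appears")
# The copied phrase "quick brown fox" yields the diagonal run (1, 1), (2, 2), (3, 3).
```

Plotting such pairs with any charting tool reproduces pictures like those in Figures 3 and 4.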
The number of similar spots and the level of darkness of the graph differ according to the similarity of the subjects of the two documents. Figure 4 displays two documents in which a part of one document occurs exactly in the other one. The diagonal line in the graph represents a sequence of words used in the other text in the same order.

By modifying the plagiarized text, one causes changes in the diagonal line. In the following, we examine different kinds of text modification and their effects on the corresponding picture.

Figure 5. Visualization of copying text with a low replacement rate

Figure 6. Visualization of copying text with a high replacement rate

Figure 7. Visualization of copying text with sentence displacement

Figure 2. Three Phases of a Plagiarism Detection System

In this paper, we present a method to compare two documents and to identify their similar parts. This method can be used in the "detail analysis" phase.

4. TEXT FRAGMENTATION UNIT IN PLAGIARISM DETECTION ALGORITHMS
In all text processing tasks, the text must be split into chunks so that the processing algorithms can be applied to them. This is also necessary in plagiarism detection. Various processing units have been used in plagiarism detection research, such as the character, word, sentence or paragraph; the document is chunked based on the chosen unit. The smaller the units are (e.g., word and character), the more chunks the document is divided into. This increases computation time, but it makes it possible to identify even small snippets of copied text. If the pieces are long (as with sentences and paragraphs), the number of pieces decreases and the speed rises, but the algorithm loses the accuracy needed to identify small copied pieces of text.

COPS (Copy Protection System) is a part of the Digital Library Project at Stanford University. This system uses the sentence as its comparison unit and is therefore unable to identify overlaps within sentences [3]. To cover COPS' inefficiencies, the SCAM system was developed, which uses the word as its comparison unit. The CHECK system uses the structural information of the text to compare documents; its comparison unit is the paragraph, and its complete dependence on document structure is its main constraint [8]. The system introduced in [5], PPChecker, uses the sentence unit to compare documents. In their study, [6] use n-grams with n = 3 to convert each document to a set of tri-grams and then compare two documents by counting their common tri-grams. The system introduced in [2] uses a combination of the two methods [6] and [5]: it divides the suspicious document into separate sentences and converts the original documents to tri-grams; to search, the sentences of the suspicious document are converted to tri-grams and then looked up among the tri-grams of the original texts.

Figure 3. Visualization of similarities in two documents without plagiarism

Figure 4. Visualization of an exact copy of a text in another document

As an example, consider the following two near-paraphrase sentences, translated here from the Persian originals: "The present study shows that the average of the scores obtained by the wrestlers of both tested groups is very high." and "The present study has shown that the average of the marks obtained by the wrestlers of both examined groups is very high."

Figure 8. Modeling two pieces of text for detecting similar parts
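The tri-gram modeling of such a sentence pair can be sketched as follows (an illustrative sketch with English stand-ins for the Persian example pair; the function names are ours, and positions are counted from 0 here):

```python
# Turn each sentence into word tri-grams and record every shared tri-gram as
# a pair (i, j): i is its position in the suspicious text, j in the source text.

def word_trigrams(text):
    words = text.lower().split()
    return [" ".join(words[k:k + 3]) for k in range(len(words) - 2)]

def match_points(src, plg):
    src_grams = word_trigrams(src)
    plg_grams = word_trigrams(plg)
    return [(i, j)
            for i, gram in enumerate(plg_grams)
            for j, other in enumerate(src_grams)
            if gram == other]

src = "the present study shows that the average scores of both groups are very high"
plg = "the present study has shown that the average scores of both groups are high"
print(match_points(src, plg))
# The long shared phrase produces the diagonal run (5, 4), (6, 5), ..., (10, 9).
```

Runs of consecutive pairs like these are exactly what the proposed method groups into a single plagiarism case.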
Taking the presented pictures into consideration, it can be said that "the process of detecting plagiarism is the identification of the areas with higher density in the picture."

6. SUGGESTED METHOD
In this section, we present the proposed method for identifying the similar parts of two documents. To identify the cases of plagiarism in the two documents dsrc and dplg, we use the visual model presented in the previous section. The steps of the algorithm can be outlined as follows:

• A list of n-grams is generated for each document, and each n-gram, according to its order of appearance in the document, is assigned an integer counter starting at 1.

• Every n-gram that occurs in both documents is represented as a pair (i, j), where i is the position of the n-gram in the document dplg and j is its position in the document dsrc. For example, Figure 8 shows how the two example sentences given above are modeled for detecting their similar parts.

• The process starts from the beginning of the list. If both (i, j) and (i+1, j+1) occur in the list, they are put into one plagiarism case, and this continues until the end of the list. In this way, all n-grams occurring consecutively in both documents dplg and dsrc are grouped into one plagiarism case. The purpose of this step is to identify all similar cases in the two documents; plagiarism cases that occur without any changes are identified completely at this step. Two parts of documents that resemble each other completely are shown in Figure 9.

Figure 9. Detecting parts that are identical in two documents

• To overcome the changes made to the text and to reduce the number of detected cases, nearby cases are merged: if the distance between two cases is lower than a threshold relative to their length, the cases are merged.

Figure 10. Merging the adjacent identical cases

• Finally, to reduce misidentified instances, the cases whose length is lower than a threshold are removed from the list.

7. EVALUATION
To evaluate the proposed method, the Persian dataset introduced at PAN 2016 [1] and the evaluation method of [7] and [4] were used. As explained in the previous sections, the proposed method has two adjustable parameters. The first is the proximity threshold for merging nearby similar parts; the second is the length threshold for removing the cases that are shorter than a specified length. To find the best threshold values, we searched the range of 20 to 50 characters in steps of 5 for the proximity threshold and the range of 50 to 100 characters in steps of 10 for the length threshold. The algorithm was then run for each possible combination of parameter values, and in each case the output was scored with the evaluation metrics. The best results were obtained with a proximity threshold of 45 characters and a length threshold of 90 characters. The outcome of the algorithm under the evaluation metrics is shown in Table 1.

Table 1. Algorithm results based on the evaluation metrics

Plagdet    Granularity    Recall    Precision
 0.84         1.04         0.79       0.93

8. CONCLUSION
The process of identifying plagiarism has two main steps: heuristic retrieval and detail analysis. In this paper, we presented a method for the second step of the plagiarism detection process, in order to identify the similar parts of two documents. In this method, n-grams are generated from each document and the n-grams of the two documents are compared to identify the similar parts; in the next step, similar parts that are closer to each other than a threshold are merged. The proposed method, without engaging in complex text analysis techniques, can detect copied and nearly copied parts of two documents with reasonable accuracy.

9. REFERENCES
[1] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P. and Potthast, M. 2016. Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016. Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[2] Barrón-Cedeño, A. and Rosso, P. 2009. On automatic plagiarism detection based on n-grams comparison. Advances in Information Retrieval. Springer, 696–700.
[3] Brin, S., Davis, J. and Garcia-Molina, H. 1995. Copy detection mechanisms for digital documents. ACM SIGMOD Record (1995), 398–409.
[4] Improving the Reproducibility of PAN's Shared Tasks. Springer. http://link.springer.com/chapter/10.1007%2F978-3-319-11382-1_22. Accessed: 2016-11-04.
[5] Kang, N., Gelbukh, A. and Han, S. 2006. PPChecker: Plagiarism pattern checker in document copy detection. Text, Speech and Dialogue (2006), 661–667.
[6] Lyon, C., Barrett, R. and Malcolm, J. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. Plagiarism: Prevention, Practice and Policies (2004).
[7] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P. 2010. An evaluation framework for plagiarism detection. Proceedings of the 23rd International Conference on Computational Linguistics: Posters (2010), 997–1005.
[8] Stein, B., zu Eissen, S.M. and Potthast, M. 2007. Strategies for retrieving plagiarized documents. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2007), 825–826.
[9] 2016. Plagiarism. Wikipedia, the free encyclopedia.