<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An n-gram based Method for Nearly Copy Detection in Plagiarism Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Behrouz Minaei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahdi Niknam</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Iran University of Science and Technology</institution>
          , Tehran,
          <country country="IR">Iran</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Qom</institution>
          , Qom,
          <country country="IR">Iran</country>
        </aff>
      <abstract>
        <p>Plagiarism, as a form of "intellectual property theft", has existed since humans began research and artistic activities. In recent years, however, easy access to the web, massive databases of information and, more generally, communication systems have turned plagiarism into a serious issue for publishers, researchers and research institutions. In this paper, we introduce an n-gram based method for identifying similar textual parts between two documents. Our evaluation shows that the method achieves both high accuracy and good efficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>Plagiarism detection</kwd>
        <kwd>text alignment</kwd>
        <kwd>n-gram</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Several definitions of plagiarism appear in the scientific
literature. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] plagiarism is described as the
“wrongful appropriation and stealing and publication of another
author's language, thoughts, ideas, or expressions and the
representation of them as one's own original work.” Plagiarism,
intellectual robbery or idea theft, has also been defined as the
“attribution of others' research or literary creativity, or a part of
it or a text derived from it, to oneself, as if one had created it
oneself.” Intellectual robbery is called literary plagiarism if
committed in the sphere of literature, artistic plagiarism if
perpetrated in the sphere of arts, and academic plagiarism if done
in scientific fields.
      </p>
      <p>Plagiarism, as the “stealing of intellectual property”, has a history
that coincides with the emergence of human research and artistic
activities. In recent years, however, easy access to the web, massive
databases of information and, in general, communication means have
made plagiarism a serious problem for publishers, researchers and
research institutions. At the same time, the country's rapid
scientific growth in recent years has increased the likelihood of both
intentional and unintentional academic plagiarism.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DEFINITION OF PLAGIARISM DETECTION</title>
      <p>
        In order to develop a plagiarism detection algorithm, article [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
defines the problem as follows. A case of plagiarism can be written
as a tuple s = ⟨splg, dplg, ssrc, dsrc⟩ composed of the following
parts:
      </p>
      <list list-type="bullet">
        <list-item><p>dsrc: the original document from which the plagiarized work has been derived.</p></list-item>
        <list-item><p>ssrc: the part of the original document that has been stolen.</p></list-item>
        <list-item><p>dplg: the document in which plagiarism has been detected.</p></list-item>
        <list-item><p>splg: the part of the document dplg in which plagiarism has occurred.</p></list-item>
      </list>
      <p>A plagiarism detector's task is to report a detection as a tuple
r = ⟨rplg, dplg, rsrc, d'src⟩. The detection r associates the part rplg
of the document dplg with the part rsrc of the document d'src, and it
aims to give a maximum estimation of s. It can be said that r has
recognized s when the following conditions hold between s and r:
rplg overlaps splg, rsrc overlaps ssrc, and d'src = dsrc.</p>
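The detection condition above can be made concrete with a short sketch. The representation is an assumption for illustration: cases and detections are modeled as character-offset ranges inside their documents, and the names Case, overlaps and detects are ours, not the paper's.

```python
# Sketch of the detection condition from the framework of [7].
# The Case type and field names are illustrative assumptions.
from typing import NamedTuple


class Case(NamedTuple):
    plg_span: range   # s_plg / r_plg: offsets inside d_plg
    src_span: range   # s_src / r_src: offsets inside d_src
    src_doc: str      # identifier of the source document


def overlaps(a: range, b: range) -> bool:
    """True if the two offset ranges share at least one position."""
    return a.start < b.stop and b.start < a.stop


def detects(r: Case, s: Case) -> bool:
    """r recognizes s when both spans overlap and the source documents match."""
    return (r.src_doc == s.src_doc
            and overlaps(r.plg_span, s.plg_span)
            and overlaps(r.src_span, s.src_span))
```

A detection that touches the true case in both documents counts as a hit even if the boundaries are only approximate; precision and recall then measure how tight that approximation is.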
    </sec>
    <sec id="sec-4">
      <title>3. THE GENERAL APPROACH IN DETECTING PLAGIARISM</title>
      <p>
        Article [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] offers a general categorization of the activities
required to identify academic plagiarism against a huge amount of
external resources; most works in the field of academic plagiarism
detection, although very different in implementation details, follow
the general approach offered in that article. The approach consists
of the following three stages:
      </p>
      <list list-type="bullet">
        <list-item><p>Heuristic retrieval: since the plagiarism-suspect document dplg
may have used any document available in the dataset D, and the total
amount of data is usually very large, comparing the suspicious
document with every document in the dataset is not feasible.
Therefore, in the first step, a collection Dx is selected as a subset
of D such that the original document dsrc is most likely to occur in
it.</p></list-item>
        <list-item><p>Detailed analysis: in this phase the document dplg is compared
with each of the documents in Dx to identify pairs (splg, sx) that
have a high similarity to each other, where splg ∈ dplg and
sx ∈ dx.</p></list-item>
        <list-item><p>Post-processing: to reduce output inaccuracy, in this stage
the results of the previous phase are filtered so that the overall
effectiveness of the algorithm is optimized. Filters can include a
range of activities such as deleting cases of short length, merging
two or more cases, or removing cases that carry correct
references.</p></list-item>
      </list>
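The three stages compose naturally into a pipeline. The sketch below is a minimal skeleton under stated assumptions: retrieve_candidates, compare and the min_length threshold are placeholder names, not the paper's code.

```python
# Skeleton of the three-stage approach (heuristic retrieval,
# detailed analysis, post-processing). All names are illustrative.
def detect_plagiarism(d_plg, dataset, retrieve_candidates, compare,
                      min_length=50):
    # 1. Heuristic retrieval: keep only the most promising source documents.
    candidates = retrieve_candidates(d_plg, dataset)
    # 2. Detailed analysis: align d_plg against every candidate.
    cases = []
    for d_src in candidates:
        cases.extend(compare(d_plg, d_src))
    # 3. Post-processing: drop detections that are too short to be reliable.
    return [c for c in cases if c["length"] >= min_length]
```

The method proposed later in this paper plugs into the compare slot of such a pipeline.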
      <p>These three steps are shown in Figure 2.</p>
      <p>In this paper, we present a method to compare two documents and
to identify their similar parts. This method can be used in the
"detailed analysis" phase.</p>
    </sec>
    <sec id="sec-6">
      <title>4. TEXT FRAGMENTATION UNIT IN</title>
    </sec>
    <sec id="sec-7">
      <title>PLAGIARISM DETECTION ALGORITHMS</title>
      <p>In all text-processing tasks, the text must be split into chunks so
that processing algorithms can be applied to them; this is also
necessary in plagiarism detection. Various processing units have been
used in plagiarism detection research, such as the character, word,
sentence or paragraph. The document is chunked based on the chosen
unit. The smaller the units are (e.g., word and character), the more
chunks the document is divided into. This increases computation time,
but it makes it possible to identify even short snippets of copied
text. If the pieces are long (as with sentences and paragraphs), the
number of pieces decreases and the processing speed rises, but the
algorithm loses the ability to identify small copied pieces of the
text.</p>
      <p>
        COPS (Copy Protection System) is part of the Digital Library
Project at Stanford University. This system uses the sentence as its
unit for comparing documents and is therefore unable to identify
overlaps within sentences [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To cover COPS's shortcomings, the
SCAM system was developed, using the word as the unit for comparing
documents.
      </p>
      <p>
        The CHECK system uses the structural information of the text to
compare documents; its comparison unit is the paragraph. Its
complete dependence on document structure is its main limitation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
A system called PPChecker, introduced in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], uses the sentence as the unit for comparing documents. In their
study, the authors of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] use
n-grams with n=3 to convert each document into a set of tri-grams
and then compare two documents by counting the tri-grams they have
in common.
      </p>
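The tri-gram comparison attributed to [6] can be sketched in a few lines. The normalization by the smaller set is an assumption for illustration; the exact similarity formula used in [6] is not given here.

```python
# Sketch of tri-gram set comparison: each document becomes a set of
# word tri-grams; similarity is the share of common tri-grams.
# The min-normalization is an assumed containment-style score.
def trigrams(text: str) -> set:
    """Set of consecutive 3-word tuples from a whitespace-tokenized text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Fraction of shared tri-grams relative to the smaller document."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))
```

Because tri-grams span word boundaries, even short copied passages produce shared tuples, while paraphrases that reorder words break most of them.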
      <p>
        A system introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] uses a
combination of the two methods of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Its search
strategy is to divide the suspicious document into separate sentences
and to convert the original documents into tri-grams. To search, the
sentences of the suspicious document are first converted into
tri-grams and then looked up among the tri-grams of the original
texts.
      </p>
    </sec>
    <sec id="sec-8">
      <title>5. MODELING THE PROBLEM IN TWO-DIMENSIONAL SPACE</title>
      <p>For a better understanding of the plagiarism detection problem, we
can visualize the comparison between the two documents dplg and dsrc
using a scatter plot. In this visualization, the word index of the
document dsrc is displayed on the horizontal axis and the word index
of the document dplg on the vertical axis. If a word occurs in both
documents, the pair of its indices in dplg and dsrc is shown as a
point on the graph. If a piece of text is copied from the other
document without any modification, the plagiarized area appears as a
diagonal line. Figure 3 shows the similar parts of two non-plagiarized
documents. As the graph shows, any two documents normally share
several words; this similarity appears as a dark patch in the scatter
plot. The number of matching points and the darkness of the graph
differ according to how similar the subjects of the two documents are.
Figure 4 displays two documents in which a part of one document
occurs exactly in the other. The diagonal line in the graph represents
a sequence of words used in the other text in the same order.
Considering these pictures, it can be said that "the process of
detecting plagiarism is the identification of areas with higher
density in the picture."</p>
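The point set behind such a scatter plot is cheap to compute. A minimal sketch, assuming whitespace tokenization and lowercase matching:

```python
# For every word shared by the two documents, emit one (j, i) point
# pairing its positions; a copied passage shows up as a diagonal run.
from collections import defaultdict


def match_points(d_src: str, d_plg: str):
    """Return (x, y) points: x = word index in d_src, y = index in d_plg."""
    positions = defaultdict(list)
    for j, word in enumerate(d_src.lower().split()):
        positions[word].append(j)
    points = []
    for i, word in enumerate(d_plg.lower().split()):
        for j in positions.get(word, []):
            points.append((j, i))
    return points
```

Plotting these points with any scatter-plot tool reproduces the visualization described above; dense diagonal runs are the candidate plagiarism cases.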
    </sec>
    <sec id="sec-10">
      <title>6. SUGGESTED METHOD</title>
      <p>In this section, we present the proposed method for identifying
similar parts in two documents.</p>
      <p>To identify cases of plagiarism in the two documents dsrc and dplg,
we use the visual model presented in the previous section. The steps
of the algorithm can be outlined as follows:</p>
      <list list-type="bullet">
        <list-item><p>The list of n-grams is generated from each document, and each
n-gram, according to its order of appearance in the document, is
assigned an integer counter starting at 1.</p></list-item>
        <list-item><p>Every n-gram that occurs in both documents is recorded as a pair
(i, j), where i is the n-gram's position in the document dplg and j is
its position in the document dsrc. For example, Figure 8 shows how two
similar Persian sentences are modeled for detecting similar parts; in
translation they read roughly: "The present study shows that the mean
of the scores obtained by the wrestlers of both test groups is very
high." and "The present study has shown that the mean of the scores
obtained by the wrestlers of both examined groups is very
high."</p></list-item>
        <list-item><p>The process starts from the beginning of the list. If both (i, j)
and (i+1, j+1) occur in the list, they are put into one plagiarism
case, and this continues until the end of the list. In this way, all
n-grams that occur consecutively in both documents dplg and dsrc are
identified as one plagiarism case. The purpose of this step is to
identify all similar cases in the two documents; plagiarism cases that
occur without any changes are completely identified at this step. Two
parts of documents that resemble each other completely are shown in
Figure 9.</p></list-item>
        <list-item><p>To cope with changes made to the text and to reduce the number of
detected cases, nearby cases are merged: if the distance between two
cases is lower than a threshold relative to their length, the cases
are merged.</p></list-item>
        <list-item><p>Finally, to reduce misidentified instances, cases whose length is
below a threshold are removed from the list.</p></list-item>
      </list>
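The steps above can be sketched compactly. This is an illustrative implementation under stated assumptions, not the paper's code: it works on word n-grams (the paper's unit is not fixed here), and the merge_gap and min_len thresholds are placeholders for the character thresholds tuned in the next section.

```python
# Sketch of the proposed method: n-gram indexing, chaining of
# consecutive matches, merging of nearby cases, length filtering.
def ngrams(words, n=3):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def find_cases(d_plg, d_src, n=3, merge_gap=5, min_len=3):
    plg, src = d_plg.lower().split(), d_src.lower().split()
    # Steps 1-2: every n-gram shared by both documents becomes a point (i, j).
    index = {}
    for j, g in enumerate(ngrams(src, n)):
        index.setdefault(g, []).append(j)
    points = sorted((i, j) for i, g in enumerate(ngrams(plg, n))
                    for j in index.get(g, []))
    # Step 3: chain consecutive points (i, j), (i+1, j+1) into one case,
    # stored as [start_in_plg, start_in_src, length].
    cases, point_set = [], set(points)
    for i, j in points:
        if (i - 1, j - 1) in point_set:
            continue                      # already part of an earlier chain
        length = 1
        while (i + length, j + length) in point_set:
            length += 1
        cases.append([i, j, length])
    # Step 4: merge cases whose gap in both documents is below merge_gap.
    merged = []
    for c in cases:
        if merged and c[0] - (merged[-1][0] + merged[-1][2]) < merge_gap \
                and c[1] - (merged[-1][1] + merged[-1][2]) < merge_gap:
            prev = merged[-1]
            prev[2] = max(prev[0] + prev[2], c[0] + c[2]) - prev[0]
        else:
            merged.append(c)
    # Step 5: drop cases shorter than min_len n-grams.
    return [c for c in merged if c[2] >= min_len]
```

An unmodified copy of five words inside otherwise different texts is detected as a single case spanning three consecutive tri-grams; shorter accidental matches are removed by the final filter.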
    </sec>
    <sec id="sec-11">
      <title>7. EVALUATION</title>
      <p>
        To evaluate the proposed method, we used the Persian dataset
introduced at the PAN shared task at FIRE 2016 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the evaluation method of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Given the explanations in the previous sections, the proposed method
has two adjustable parameters. The first is the proximity threshold
for merging nearby similar parts; the second is the length threshold
for removing cases shorter than a specified length. To find the best
threshold values, we tried proximity thresholds from 20 to 50
characters in steps of 5 and length thresholds from 50 to 100
characters in steps of 10. The algorithm was then run for every
combination of parameter values, and in each case the output was
scored with the evaluation metrics. The best results were obtained
with a proximity threshold of 45 characters and a length threshold of
90 characters. The outcome of the algorithm on the evaluation metrics
is shown in Table 1.
      </p>
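The summary score used by this evaluation framework, plagdet as defined in [7], combines the metrics in Table 1: the harmonic mean of precision and recall, discounted by the logarithm of granularity. A minimal sketch:

```python
# Plagdet score from the evaluation framework of [7]:
# plagdet = F1(precision, recall) / log2(1 + granularity).
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)
```

With the values in Table 1, plagdet(0.93, 0.79, 1.04) evaluates to about 0.83, consistent with the reported 0.84 once rounding of the input metrics is taken into account.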
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Results of the proposed method on the evaluation metrics.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Plagdet</th><th>Recall</th><th>Precision</th><th>Granularity</th></tr>
          </thead>
          <tbody>
            <tr><td>0.84</td><td>0.79</td><td>0.93</td><td>1.04</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-12">
      <title>8. CONCLUSION</title>
      <p>The process of identifying plagiarism has two main steps: heuristic
retrieval and detailed analysis. In this paper, we presented a method
for the second step of the plagiarism detection process, identifying
the similar parts of two documents. In this method, n-grams are
generated from each document, and by comparing the n-grams of the two
documents the similar parts are identified. In the next step, similar
parts that are closer to each other than a threshold are merged. The
proposed method, without resorting to complex text-analysis
techniques, can detect copied and nearly copied parts in two
documents with reasonable accuracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Asghari</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohtaj</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fatemi</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faili</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016</article-title>
          . In
          <source>Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 7-10,
          <year>2016</year>
          , CEUR Workshop Proceedings, CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>On automatic plagiarism detection based on n-grams comparison</article-title>
          .
          <source>Advances in Information Retrieval</source>
          . Springer.
          <fpage>696</fpage>
          -
          <lpage>700</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>1995</year>
          .
          <article-title>Copy detection mechanisms for digital documents</article-title>
          .
          <source>ACM SIGMOD Record</source>
          (
          <year>1995</year>
          ),
          <fpage>398</fpage>
          -
          <lpage>409</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <article-title>Improving the Reproducibility of PAN's Shared Tasks</article-title>
          . Springer. http://link.springer.com/chapter/10.1007%2F978-3-319-11382-1_22. Accessed: 2016-11-04.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>PPChecker: Plagiarism pattern checker in document copy detection</article-title>
          .
          <source>Text, Speech and Dialogue</source>
          (
          <year>2006</year>
          ),
          <fpage>661</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Lyon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Malcolm</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector</article-title>
          .
          <source>Plagiarism: Prevention, Practice and Policies</source>
          . (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>An evaluation framework for plagiarism detection</article-title>
          .
          <source>Proceedings of the 23rd International Conference on Computational Linguistics: Posters</source>
          (
          <year>2010</year>
          ),
          <fpage>997</fpage>
          -
          <lpage>1005</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>zu Eissen</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Strategies for retrieving plagiarized documents</article-title>
          .
          <source>Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          (
          <year>2007</year>
          ),
          <fpage>825</fpage>
          -
          <lpage>826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <year>2016</year>
          .
          <article-title>Plagiarism. Wikipedia, the free encyclopedia</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>