<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic External Plagiarism Detection Using Passage Similarities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Clara Vania</string-name>
          <email>clara.vania@ui.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirna Adriani</string-name>
          <email>mirna@cs.ui.ac.id</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fakultas Ilmu Komputer, Universitas Indonesia, Kampus Depok, Depok 16424</institution>
          ,
          <country country="ID">Indonesia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we report our approach to detecting external plagiarism. In the pre-processing stage, we identify non-English documents and translate them into English using an online translation tool. We then index the source documents and retrieve those most similar to each suspicious document. We divide the retrieved documents into passages of twenty sentences each. Plagiarism is detected by counting the overlapping word n-grams between suspicious and source passages.</p>
      </abstract>
      <kwd-group>
        <kwd>plagiarism detection</kwd>
        <kwd>overlapping n-grams</kwd>
        <kwd>passage retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Nowadays, plagiarism happens easily and is increasingly difficult to detect. With
advances in technology, especially the Internet, plagiarism can occur across
languages and with different levels of obfuscation. People can easily copy and paste,
paraphrase, or translate websites, papers, or other sources from the Internet
without citing the source, and present the result as their own work. This
situation motivates the construction of an accurate automatic plagiarism detector, that is,
a tool that detects whether a suspicious document contains plagiarized
work.</p>
      <p>
        In recent years, several studies on text plagiarism detection have been
published. Mozgovoy et al.
        <xref ref-type="bibr" rid="ref7">(Mozgovoy, Kakkonen, and Sutinen, 2007)</xref>
        develop a natural language parser that finds swapped words and phrases in order to detect
intentional plagiarism. Chen et al.
        <xref ref-type="bibr" rid="ref3">(Chen, Yeh, and Ke, 2010)</xref>
        use n-gram
co-occurrence statistics to detect verbatim copying, while LCS (Longest Common
Subsequence) is used to handle text modification.
      </p>
      <p>
        According to Potthast et al.
        <xref ref-type="bibr" rid="ref1 ref9">(Potthast et al., 2009)</xref>
        , it is still difficult to determine
the best system or algorithm for detecting plagiarism because there is no controlled
evaluation environment in which to compare results. The PAN track on Plagiarism
Detection was held last year to address this problem. The
track offers two tasks for detecting text plagiarism automatically: external plagiarism
detection and intrinsic plagiarism detection. External plagiarism detection aims to find the
plagiarized sections in a suspicious document together with their corresponding source documents, while
intrinsic plagiarism detection finds plagiarized sections without comparing the suspicious
documents to the source documents.
      </p>
      <p>
        Grozea et al.
        <xref ref-type="bibr" rid="ref4 ref5">(Grozea, Gehl, and Popescu, 2009)</xref>
        use a character 16-gram VSM
(Vector Space Model) as their retrieval model and find the documents most similar to
each suspicious document using the cosine similarity score. To extract the pairs of
sections, they join the matches using Monte Carlo optimization. Basile et al.
        <xref ref-type="bibr" rid="ref1 ref9">(Basile et al., 2009)</xref>
        use a word 8-gram VSM to retrieve similar documents and
their “joining algorithm” to extract the plagiarized passages. Kasprzak et al.
        <xref ref-type="bibr" rid="ref1 ref5 ref9">(Kasprzak et al., 2009)</xref>
        apply a word 5-gram VSM to retrieve documents that
share at least 20 n-grams with each suspicious document. They then extract pairs
of sections that share at least 20 matching n-grams and at most 49 non-matching
n-grams.
      </p>
      <p>In this paper, we report our approach to detecting external plagiarism.
The remainder of this paper is organized as follows: Section 2 discusses our
plagiarism detection method, Section 3 describes the evaluation, and Section 4
concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>2 External Plagiarism Detection</title>
      <p>In this section, we describe the method that we use for plagiarism detection.
The method consists of four main steps: preprocessing,
finding candidate documents, extracting similar passages, and post-processing.</p>
      <sec id="sec-2-1">
        <title>2.1 Preprocessing Phase</title>
        <p>The pre-processing phase mainly consists of analyzing the corpus. The PAN ’10 corpus
(http://pan.webis.de/) consists of 11,148 source documents and 15,925 suspicious documents. The
corpus contains not only English documents but also documents in several other languages, and the
external plagiarism cases include cross-lingual cases. We therefore
begin by identifying the language of each document using an automatic
language identifier. The results show that non-English documents occur only
in the source document set. The language identifier recognizes 10,480 English
documents, 474 German documents, and 194 Spanish documents. We then
translate all non-English documents into English using an online language
translator and substitute the translated documents for the originals in the
corpus.</p>
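        <p>The language identifier and the translation tool are not named in this section, so the following is only a minimal sketch of the language-filtering step, assuming a simple stopword-counting heuristic. The stopword lists and the names <monospace>identify_language</monospace> and <monospace>needs_translation</monospace> are hypothetical and chosen for illustration; they are not the tools used in our experiments. The last statement only checks that the reported per-language counts add up to the number of source documents.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical, very small stopword lists used only for this illustration.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "was", "for"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu", "den", "mit"},
    "es": {"el", "la", "los", "y", "que", "en", "es", "un", "por", "con"},
}

# Per-language counts of source documents as reported in the text.
REPORTED_COUNTS = {"en": 10480, "de": 474, "es": 194}


def identify_language(text):
    """Guess the language whose stopwords occur most often in the text.

    Ties (including texts with no known stopword) fall back to the first
    language in STOPWORDS, which is English here.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)


def needs_translation(text):
    """True if the document should be sent to the translator."""
    return identify_language(text) != "en"


# Sanity check on the reported corpus statistics.
assert sum(REPORTED_COUNTS.values()) == 11148
```
]]></preformat>
        <p>In such a setup, only the documents for which <monospace>needs_translation</monospace> returns true would be passed to the translator, and their English versions would replace them in the corpus.</p>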
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Finding Candidate Documents</title>
        <p>Finding candidate documents is a document retrieval task in which the
suspicious documents serve as queries. In this phase, we index all the source
documents and use the suspicious documents as queries. We use Lucene
(http://lucene.apache.org) to index and
retrieve the corpus. Lucene is an open source information retrieval system based
on a combination of the Boolean model and the Vector Space Model. During
indexing, we remove stopwords but do not apply any stemming
algorithm. For each suspicious document (used as a query), we retrieve the
10 most similar source documents.</p>
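        <p>To make the retrieval step concrete, the following is a minimal sketch of vector-space retrieval with stopword removal and no stemming. It is a pure-Python stand-in, not Lucene: the TF-IDF weighting, the stopword list, and the function names <monospace>build_index</monospace> and <monospace>retrieve</monospace> are assumptions made for illustration, and Lucene's actual scoring differs in its details.</p>
        <preformat><![CDATA[
```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on"}


def tokenize(text):
    """Lowercase, split into words, and drop stopwords (no stemming)."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]


def build_index(sources):
    """Return TF-IDF vectors for each source document and the IDF table."""
    term_counts = {doc_id: Counter(tokenize(text))
                   for doc_id, text in sources.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n_docs = len(sources)
    idf = {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}
    vectors = {doc_id: {t: tf * idf[t] for t, tf in counts.items()}
               for doc_id, counts in term_counts.items()}
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_text, vectors, idf, k=10):
    """Return the ids of the k source documents most similar to the query."""
    counts = Counter(tokenize(query_text))
    query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    ranked = sorted(vectors, key=lambda d: cosine(query, vectors[d]), reverse=True)
    return ranked[:k]
```
]]></preformat>
        <p>Each suspicious document would be passed whole as <monospace>query_text</monospace>, with <monospace>k=10</monospace> matching the number of candidate source documents kept per suspicious document.</p>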
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Extracting Similar Passages</title>
        <p>We divide the top 10 source documents and the suspicious documents into small
passages of 20 sentences each. We then index the passages and, for each
suspicious passage, retrieve the source passages that are similar to it. We
use only the top 5 most similar source passages for each suspicious passage.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4 Post-processing Phase</title>
        <p>
          In the post-processing phase, we analyze each pair of suspicious and source passages. We filter the
top 5 most similar source passages by removing the pairs that have a low
similarity score. After that, we compute the overlapping n-grams
          <xref ref-type="bibr" rid="ref2 ref6">(Broder, 1997;
Lyon et al., 2001)</xref>
          between the two passages of each pair. For the final result, we keep the passage
pairs that have at least three overlapping 6-grams. Small n-gram parameters are
used because the passages themselves are small (twenty sentences).
        </p>
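        <p>The passage splitting and the n-gram overlap test can be sketched as follows. This is an illustration under stated assumptions rather than our exact implementation: the sentence splitter is a naive regular expression, the n-grams are taken over lowercased words, and all function names are hypothetical. The passage size (20 sentences), the n-gram length (6), and the threshold (three overlapping n-grams) follow the text; passage retrieval and the similarity-score filter are omitted.</p>
        <preformat><![CDATA[
```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def make_passages(text, size=20):
    """Group consecutive sentences into passages of `size` sentences."""
    sentences = split_sentences(text)
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), size)]


def word_ngrams(text, n=6):
    """Return the set of word n-grams (as tuples) in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_count(suspicious, source, n=6):
    """Number of distinct n-grams shared by the two passages."""
    return len(word_ngrams(suspicious, n) & word_ngrams(source, n))


def is_detection(suspicious, source, n=6, min_overlap=3):
    """Keep a passage pair if it shares at least `min_overlap` n-grams."""
    return overlap_count(suspicious, source, n) >= min_overlap
```
]]></preformat>
        <p>Because the test relies on exact word sequences, a pair in which the copied text has been reworded shares few or no 6-grams and is therefore not reported.</p>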
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Evaluation</title>
      <p>
        We did not have time to run our method on the training corpus, so the evaluation
was done only on the test corpus. Based on the evaluation measures provided by
the organizers
        <xref ref-type="bibr" rid="ref8">(Potthast et al., 2010)</xref>
        , the detailed scores of our algorithm are shown in
Table 2.
      </p>
      <p>Our results show that our method achieves a fairly good precision score (we ranked 4th
on this measure) but a very low recall score. In terms of
precision, 91.14% of our detections are correct and 8.86% are incorrect.
In terms of recall, our detector finds only 26.2% of the
overall plagiarism cases.</p>
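      <p>As a reminder of how the two scores relate, the sketch below computes precision and recall over sets of detected and truly plagiarized units, together with their harmonic mean. It is a simplified illustration: the toy offsets are invented, the function names are hypothetical, and the official measure of the organizers also takes the granularity of the detections into account, which is not modelled here.</p>
      <preformat><![CDATA[
```python
def precision_recall(detected, plagiarized):
    """Precision and recall over sets of detected and plagiarized units."""
    correct = len(detected & plagiarized)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(plagiarized) if plagiarized else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy example with character offsets (invented data, not from the corpus).
detected = set(range(100, 150))      # 50 characters flagged by a detector
plagiarized = set(range(120, 320))   # 200 characters actually plagiarized
toy_precision, toy_recall = precision_recall(detected, plagiarized)

# Harmonic mean of the precision and recall reported in this paper;
# a derived value, not an official score.
reported_f = f_measure(0.9114, 0.2620)
```
]]></preformat>
      <p>The harmonic mean is pulled toward the lower of the two scores, which is why a low recall weighs heavily on the overall result even when precision is high.</p>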
      <p>Based on our results, we need to explore further how to handle plagiarism with
different levels of obfuscation. The translation step at the early stage is quite
effective in overcoming cross-language plagiarism, but in the detailed analysis step, passage
retrieval and n-gram overlap can handle only exact-match
plagiarism. Plagiarism that involves word modification, such as synonym substitution, word
reordering, and paraphrasing, still cannot be identified by our method.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>We report our participation in identifying external plagiarism in CLEF 2010. We
use overlapping word n-grams to measure plagiarism between pairs of
passages found in the documents. Our method achieves high precision (0.9114) but
low recall (0.2620). It can identify cross-language
plagiarism, but it fails to detect plagiarism that involves word modifications.
In the future, we will handle word variations and develop methods to detect
plagiarism with different levels of obfuscation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Basile</surname>
          </string-name>
          et al.
          <year>2009</year>
          .
          <article-title>A Plagiarism Detection Procedure in Three Steps: Selection, Matches and “Squares”</article-title>
          . In Stein et al. (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Broder</surname>
            ,
            <given-names>A Z.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>On the resemblance and containment of documents</article-title>
          .
          In
          <source>Compression and Complexity of Sequences</source>
          .
          <publisher-name>IEEE Computer Society</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Chien-Ying</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Jen-Yuan</given-names>
            <surname>Yeh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hao-Ren</given-names>
            <surname>Ke</surname>
          </string-name>
          .
          <article-title>Plagiarism Detection using ROUGE and WordNet</article-title>
          .
          <source>Journal of Computing</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ), pages
          <fpage>34</fpage>
          -
          <lpage>44</lpage>
          ,
          <year>March 2010</year>
          . https://sites.google.com/site/journalofcomputing/.
          ISSN
          <issn>2151-9617</issn>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Grozea</surname>
            ,
            <given-names>Cristian</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Gehl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marius</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection</article-title>
          . In Stein et al. (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Kasprzak</surname>
            ,
            <given-names>Jan</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Michal</given-names>
            <surname>Brandejs</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Miroslav</given-names>
            <surname>Křipač</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Finding Plagiarism by Evaluating Document Similarities</article-title>
          . In Stein et al. (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Lyon</surname>
          </string-name>
          et al.
          <year>2001</year>
          .
          <article-title>Detecting short passages of similar text in large document collections</article-title>
          .
          In
          <source>Conference on Empirical Methods in Natural Language Processing (EMNLP 2001)</source>
          . pp.
          <fpage>118</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Mozgovoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kakkonen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Sutinen</surname>
          </string-name>
          .
          <article-title>Using Natural Language Parsers in Plagiarism Detection</article-title>
          .
          In
          <source>Proceedings of the SLaTE'07 Workshop</source>
          , Pennsylvania, USA,
          <year>October 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Potthast</surname>
          </string-name>
          , Martin et al.
          <year>2010</year>
          .
          <article-title>An Evaluation Framework for Plagiarism Detection</article-title>
          .
          In
          <source>Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010)</source>
          , Beijing, China,
          <year>August 2010</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Potthast</surname>
          </string-name>
          , Martin et al.
          <year>2009</year>
          .
          <article-title>Overview of the 1st International Competition on Plagiarism Detection</article-title>
          . In Benno Stein, Paolo Rosso, Efstathios Stamatatos, Moshe Koppel, and Eneko Agirre, editors,
          <source>SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>September 2009</year>
          .
          <publisher-name>CEUR-WS.org</publisher-name>
          .
          ISSN
          <issn>1613-0073</issn>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>