<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>External &amp; Intrinsic Plagiarism Detection: VSM &amp; Discourse Markers based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sameer Rao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parth Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khushboo Singhal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasenjit Majumder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DA-IICT</institution>
          ,
          <addr-line>Gandhinagar</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the performance of a plagiarism detection system that can detect both external and intrinsic plagiarism in text, and reports its results on the PAN-PC-2011 test corpus. We investigated Vector Space Model (VSM) based techniques for detecting external plagiarism cases and discourse-marker-based features for detecting intrinsic plagiarism cases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Automatic plagiarism detection has attracted considerable attention from
researchers. Because no single algorithm has emerged as the state of the art,
many systems are evaluated at PAN every year. In the external setting of
plagiarism detection, the system has to find evidence of plagiarism in a pool of
source documents. Sometimes no source documents are available to compare a
suspicious document against; in such cases intrinsic plagiarism detectors play
a major role. We present a Vector Space Model (VSM) [2] based approach for
external plagiarism detection and a discourse-marker-based approach for
intrinsic plagiarism detection.</p>
    </sec>
    <sec id="sec-2">
      <title>Detection</title>
      <p>For the external plagiarism detection setting, the PAN-PC-2011<sup>1</sup> dataset contains
11093 suspicious documents and 11093 source documents. The total size of the
corpus is 4.5 GB.</p>
      <sec id="sec-2-1">
        <title>Algorithm</title>
        <p>We convert all non-English documents to English using a two-stage strategy.
First, we identify the language of each document using Google Language Identifier<sup>2</sup>,
and then we translate all non-English documents into English using the Google
Translator API<sup>3</sup>. We noticed that some words had character-level differences in
our system and hence were not translated properly, which in turn degraded the
translation of the sentences containing them.</p>
        <p>Candidate Selection: We use a VSM-based approach to select the candidate
documents. All the source documents are indexed, and each suspicious
document is issued as a query against this index. We take as candidates the top 250 source
documents in the ranked list or those with a similarity score greater than 0.01,
whichever gives fewer documents. Using two parameters for the upper bound
works well because many suspicious documents were not plagiarised at all.
For such documents the similarity score rapidly drops below 0.01,
so we save computation by not analysing all 250 candidate
documents. We nevertheless always analyse at least the top 20 documents, because we found
that some suspicious documents containing only a very small amount of plagiarism have a
similarity score below 0.01 even for the top-ranked documents. Here, the similarity score is
the cosine of the angle between the source document vector (d) and the suspicious document
vector (q), that is, their dot product normalised by their lengths:
<disp-formula><label>(1)</label><tex-math>\cos\theta = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert}</tex-math></disp-formula></p>
        <p>
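The cosine score and the cut-off rule above can be sketched in Python as follows. This is a minimal illustration rather than the system's actual implementation: the function names cosine and select_candidates, the whitespace tokenisation and the raw term-frequency weights are all assumptions, since the text does not specify the term weighting or the indexing engine.
<preformat>
```python
import math
from collections import Counter


def cosine(d, q):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * q.get(term, 0) for term, weight in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


def select_candidates(query, sources, max_rank=250, min_rank=20, min_score=0.01):
    """Rank the sources against the query and apply the cut-off rule.

    A source is kept if it is among the top `min_rank` documents, or if it
    is among the top `max_rank` and its score is greater than `min_score`.
    """
    q = Counter(query.lower().split())  # assumed: raw term frequencies
    scored = [(cosine(Counter(text.lower().split()), q), doc_id)
              for doc_id, text in sources.items()]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    kept = []
    for rank, (score, doc_id) in enumerate(scored[:max_rank]):
        if rank >= min_rank and not score > min_score:
            break  # scores only decrease from here on
        kept.append((doc_id, score))
    return kept
```
</preformat>
With the default parameters, a suspicious document always keeps its 20 top-ranked sources, and the loop stops at the first lower-ranked source whose score is not greater than 0.01.
</p>
        <p>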
Detailed Analysis: Last year we used overlapping 7-word n-grams to compare
the sections of suspicious and source documents [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This time we used a window-based
similarity score to detect plagiarism. First, we take a 7-word n-gram from the
suspicious document and look for it in the source document. If it matches, we treat it
as a potential case of plagiarism, because seven consecutive matching words are
strong evidence. From the matching point we then take a 25-word window
in both the suspicious and the source document and calculate the similarity score, after
removing a small set of stop-words from each window. We chose a 25-word window
because the smallest case of plagiarism can be 200 characters long, which corresponds
to roughly 25 words. We set the similarity threshold for plagiarism to
0.50, which means that at least 50% of the words in the window match. We stop matching
windows once 8 consecutive windows have a similarity score below 0.50. Allowing
8 tolerance windows helps to improve granularity when obfuscation is very high
for some sentences in between. The tolerance windows also make it
possible to keep a high similarity threshold, which avoids false positives, while still
maintaining granularity.
        </p>
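        <p>The seed-and-extend procedure described above can be sketched as follows. The sketch is a simplified reading of the text, not the original code: the stop-word list, the choice of similarity (the fraction of the suspicious window's remaining words that also occur in the source window), the whitespace tokenisation, and the names window_similarity, extend_match and detect are assumptions made for illustration. Offsets are in words, not characters.</p>
<preformat>
```python
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative list only


def window_similarity(susp_window, src_window):
    """Fraction of non-stop-words of the suspicious window found in the source window."""
    susp = [w for w in susp_window if w not in STOP_WORDS]
    src = {w for w in src_window if w not in STOP_WORDS}
    if not susp:
        return 0.0
    return sum(1 for w in susp if w in src) / len(susp)


def extend_match(susp, src, i, j, window=25, threshold=0.5, tolerance=8):
    """Extend a seed match at susp[i] / src[j] one window at a time.

    Returns the end offset (in words) of the last window that reached the
    threshold, or i if no window did. Matching stops after `tolerance`
    consecutive windows below the threshold.
    """
    end, misses, k = i, 0, 0
    while tolerance > misses:
        s_win = susp[i + k * window: i + (k + 1) * window]
        t_win = src[j + k * window: j + (k + 1) * window]
        if not s_win or not t_win:
            break  # ran off the end of one of the documents
        if window_similarity(s_win, t_win) >= threshold:
            end, misses = i + k * window + len(s_win), 0
        else:
            misses += 1
        k += 1
    return end


def detect(susp_text, src_text, n=7):
    """Return (start, end) word offsets of detected cases in the suspicious text."""
    susp, src = susp_text.lower().split(), src_text.lower().split()
    seeds = {}
    for j in range(len(src) - n + 1):
        # Only the first source position of each n-gram is remembered.
        seeds.setdefault(tuple(src[j:j + n]), j)
    cases, i = [], 0
    while len(susp) - n >= i:
        j = seeds.get(tuple(susp[i:i + n]))
        end = extend_match(susp, src, i, j) if j is not None else i
        if end > i:
            cases.append((i, end))
            i = end
        else:
            i += 1
    return cases
```
</preformat>
        <p>Because a match is extended in whole windows, the reported end of a case is only accurate to within one 25-word window.</p>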
        <p><sup>3</sup> Google Language Translator: http://translate.google.com/</p>
        <p>We merge consecutive plagiarism cases if they are at most 500 characters apart.
This improves granularity when the algorithm has detected one case as
many small cases because of obfuscation. If a suspicious document annotated by
our algorithm has no plagiarism case longer than 160 characters, we
consider that document plagiarism-free.</p>
        <p>
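The two post-processing rules for external detection, merging nearby cases and treating documents with only short cases as plagiarism-free, could be written as below. The function names are hypothetical, spans are assumed to be (start, end) character offsets, and "500 characters apart" is read here as a gap of at most 500 characters.
<preformat>
```python
def merge_cases(cases, max_gap=500):
    """Merge (start, end) character spans whose gap is at most max_gap."""
    merged = []
    for start, end in sorted(cases):
        if merged and max_gap >= start - merged[-1][1]:
            # Close enough to the previous case: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_plagiarism_free(cases, min_length=160):
    """True if no detected case is longer than min_length characters."""
    return not any(end - start > min_length for start, end in cases)
```
</preformat>
In this reading the length filter is applied after merging, so several short fragments that merge into one long case still count as plagiarism.
</p>
        <p>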
The main idea behind intrinsic algorithms is to find the sections that
are out of harmony with the rest of the document in terms of writing style and/or
author style. This year we also tried to address this problem.</p>
        <p>The algorithm calculates the distance between two normalized feature
vectors: one is computed from the whole document, while the other represents a
section of the document. Sections are partially overlapping windows of 2000 characters
taken with a step size of 200. All sections for which the style change value is
greater than 2.0 are marked as plagiarized. Consecutive plagiarized sections
that are at most 500 characters apart are merged into a single plagiarized case to
maintain a proper granularity value.</p>
        <p>Features based on frequent character n-grams were used to detect style change in [3],
while the frequencies of different pronouns, closed-class words, stem suffixes and
punctuation marks, together with the average length of a statement, were used to
characterise author style in [4, 5]. We have combined these features and also added the
frequency of discourse markers. We believe that some authors use certain words more
often than others and that these words are generally discourse markers.</p>
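        <p>The sectioning and thresholding step of the intrinsic algorithm can be sketched as follows. The names sliding_sections and flag_sections are hypothetical, the step size is assumed to be in characters like the window size, and the feature extractor and the distance function are passed in as parameters because this sketch does not fix their definitions.</p>
<preformat>
```python
def sliding_sections(text, size=2000, step=200):
    """Yield (start, end) character offsets of partially overlapping sections."""
    start = 0
    while True:
        end = min(start + size, len(text))
        yield start, end
        if end == len(text):
            break  # the last section reaches the end of the document
        start += step


def flag_sections(text, features, distance, threshold=2.0, size=2000, step=200):
    """Return the sections whose style differs from the whole document.

    `features` maps a string to a feature vector; `distance` compares the
    whole-document vector with a section vector.
    """
    doc_vector = features(text)
    return [(s, e) for s, e in sliding_sections(text, size, step)
            if distance(doc_vector, features(text[s:e])) > threshold]
```
</preformat>
        <p>Sections flagged this way overlap heavily, so they still have to be merged into cases as described in the text.</p>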
        <p>Discourse markers: Discourse markers are words that do not change the
meaning of the text. They are used either as filler elements in the text or out of the
author's habit. People use them frequently, most likely twice in every 2 or
3 sentences, so the frequency of such words can help us detect a change in the author's
style. A few discourse markers in English are "well", "actually",
"basically", "then", "means", etc. Such commonly used discourse markers are added
as additional dimensions of our stylometric feature vector.</p>
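        <p>As a rough illustration of how marker frequencies could become feature dimensions, the sketch below counts the five markers listed above. The function name, the regular-expression tokeniser and the normalisation by token count are assumptions; the full marker list used by the system is not given in the text. Note that plain word counting cannot tell a discourse use of "well" from its ordinary adverbial use.</p>
<preformat>
```python
import re

# Only the markers named in the text; the system's full list is not known.
DISCOURSE_MARKERS = ("well", "actually", "basically", "then", "means")


def discourse_marker_features(text, markers=DISCOURSE_MARKERS):
    """Relative frequency of each marker among the word tokens of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {m: tokens.count(m) / total for m in markers}
```
</preformat>
        <p>Each key of the returned dictionary is one dimension that can be appended to the other stylometric features before normalisation.</p>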
        <p>Style change function: The distance between normalized stylometric feature
vectors is calculated using the style change function
<disp-formula><label>(2)</label><tex-math>d_1(A, B) = \sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2</tex-math></disp-formula>
where A and B are the normalized vectors of the complete document and of the extracted
section of the document, respectively, and g is a dimension of the stylometric vector.
Further details of the function can be found in [3].</p>
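        <p>The style change function above is straightforward to compute once the feature vectors are stored as dictionaries. The sketch below assumes non-negative feature values and reads P(A) as the set of dimensions with a non-zero value in A; the function name style_change is an illustrative choice.</p>
<preformat>
```python
def style_change(a, b):
    """d1(A, B): sum over the dimensions g of A of (2(fA - fB) / (fA + fB))^2."""
    total = 0.0
    for g, f_a in a.items():
        if f_a == 0:
            continue  # g is not in P(A) under the assumption stated above
        f_b = b.get(g, 0.0)
        total += (2.0 * (f_a - f_b) / (f_a + f_b)) ** 2
    return total
```
</preformat>
        <p>A dimension that is present in A but absent from B contributes the maximum value of 4 to the sum, while a dimension with equal values in A and B contributes nothing.</p>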
      </sec>
      <sec id="sec-2-2">
        <title>Results and Analysis</title>
        <p>The corpus contains 4753 documents for the intrinsic setting. Performance
results are reported in Table 1. The main causes of our weak performance
are low recall and a large number of false positive detections. We fixed the same
feature dimensions for all documents, but some of those features do not apply to a
particular author and play a negative role in the calculation of the style change function.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We tested the performance of a VSM-based approach for external plagiarism
detection and learnt that VSM can handle obfuscation better, but one has to
carefully manage the precision of the system. We plan to investigate this
issue further to improve precision. The VSM-based technique for retrieving candidate
documents is very fast, but one has to go deep into the ranked list. Our external
plagiarism detection system still needs parameter tuning, which we plan
to carry out in the near future. We also tried novel discourse-marker-based features
along with some well-known features and were able to detect intrinsic plagiarism.
We will consider other features and techniques that help to remove false
positives, for which we need to analyse how uniform an author's
style can be within a single document.</p>
      <p>2. G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing.
Communications of the ACM, v.18 n.11, pp. 613-620, Nov. 1975.</p>
      <p>3. Stamatatos, E.: Intrinsic Plagiarism Detection Using Character n-gram Profiles. In:
3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse.
pp. 38-46 (2009).</p>
      <p>4. Mario Zechner, Markus Muhr, Roman Kern and Michael Granitzer. External and
Intrinsic Plagiarism Detection Using Vector Space Models. In: 3rd PAN Workshop.
Uncovering Plagiarism, Authorship and Social Software Misuse. pp. 47-55 (2009).</p>
      <p>5. Markus Muhr, Roman Kern, Mario Zechner and Michael Granitzer. External and
Intrinsic Plagiarism Detection using a Cross-Lingual Retrieval and Segmentation
System. Lab Report for PAN (2010).</p>
      <p>6. Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P. An Evaluation Framework for
Plagiarism Detection. In: Proc. of the 23rd International Conference on Computational
Linguistics, COLING-2010, Beijing, China, August 23-27, pp. 997-1005.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Parth</given-names>
            <surname>Gupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Rao</surname>
          </string-name>
          and
          <string-name>
            <given-names>Prasenjit</given-names>
            <surname>Majumder</surname>
          </string-name>
          .
          <article-title>External Plagiarism Detection: N-Gram Approach using Named Entity Recognizer: Lab Report for PAN at CLEF 2010</article-title>
          . In Braschler et al.
          <source>ISBN 978-88-904810-0-0.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>