<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Normalization based Stop-Word approach to Source Code Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saimadhav Heblikar</string-name>
          <email>saimadhavheblikar@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Poorva Sharma</string-name>
          <email>poorvasharma0615@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manogna Munnangi</string-name>
          <email>manogna08@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Channa Bankapur</string-name>
          <email>channabankapur@pes.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PES Institute of Technology</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PES University</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>6</fpage>
      <lpage>9</lpage>
      <abstract>
<p>This paper is a report of PES Institute of Technology's participation in the Cross Language Detection of Source Code Reuse (CL-SOCO) task at FIRE 2015 [1]. We approach this task as a text-document plagiarism task, without considering the formal grammatical structure of the programming languages. We use normalization of commonly used identifiers to detect pairs of programs which have the same objective. We also find that entirely removing these normalized operations improves the system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Keywords</title>
      <p>Source code reuse, plagiarism detection</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
<p>Vast amounts of software code have become easily
available on the Internet. Sites like Stack Overflow make
solutions to common problems readily available. In such a scenario,
software developers are tempted to copy and paste code from
one place to another. This can cause the owners of the
software legal, ethical, licensing and maintenance problems
in the long run. Software plagiarism also affects
competitive programming competitions like ACM ICPC. The sheer
scale of resources available to plagiarize from, and the
number of possibly plagiarized documents, makes source
code plagiarism detection a daunting task.</p>
<p>
        Plagiarism detection in software source code is different
from the text plagiarism detection task. One of the popular
approaches to text plagiarism detection is the bag-of-words model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
However, this is not useful in a software source code context,
as a small set of programming constructs is bound to be
reused repeatedly, even in programs doing altogether different things.
      </p>
<p>
        There currently exist tools like MOSS [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and JPlag [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
which try to solve this problem. MOSS stands for "Measure
Of Software Similarity" and is a system for detecting
similarity in software. JPlag is a system for detecting software
similarity considering text features as well as programming
language features. Both JPlag and MOSS are used in
academic environments.
      </p>
    </sec>
    <sec id="sec-3">
<title>2. TASK DESCRIPTION</title>
<p>The Cross-Language Source Code Reuse (CL-SOCO) track of FIRE
2015 deals with the detection of plagiarism in software source
code. The cross-language aspect deals with detecting
plagiarism from C to Java sources.</p>
<p>The training set given to us consisted of 599 C and 599
Java files. These files were numbered from 001.c to 599.c
and 001.java to 599.java. Files with the same number
represented a plagiarized case; that is, 012.c and 012.java
represent a reuse case, while 012.c and 021.java do not. Since
some of these files were generated using a tool, they
contained parse errors. This was true for both the C and Java
data-sets.</p>
<p>The test set given to us consisted of 79 C files and 79
Java files. These files were numbered from 300.c to 378.c
and 000.java to 078.java.</p>
      <p>
        Both the training and test corpus are available at [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
<p>It is important to note that we do not have to mention
the direction of reuse, that is, whether the reuse was from
C to Java or from Java to C.</p>
    </sec>
    <sec id="sec-4">
<title>3. CURRENT WORK</title>
<p>In this section, we describe the state of research in the
field of plagiarism detection in general, and source code
plagiarism detection in particular.</p>
    </sec>
    <sec id="sec-5">
<title>3.1 Bag-of-words model</title>
<p>
        In this model, the document is represented as a
bag-of-words; in practice, a multi-set. It disregards order
and grammar, but accounts for multiplicity. Bag-of-words
is shown to work well for the text plagiarism detection task
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Its performance is not satisfactory for the programming
language plagiarism detection task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the reason being that
similar programming constructs are bound to appear a very high
number of times in programs which may nonetheless
be doing entirely different things.
      </p>
</sec>
    <sec id="sec-5b">
      <title>3.2 NLP techniques</title>
      <p>
        Common Natural Language Processing techniques like word
n-grams are used to detect similarity between documents.
Some works also consider using features of the text, like the
number of white-spaces, average indentation, and other stylistic
features, for evaluation. A popular tool is XPLAG [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
<title>3.3 Longest common sub-string</title>
<p>
        Tools like JPlag [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] make use of the longest common
substring (LCS) approach. This is a pairwise approach:
the similarity between a pair is decided by the length of their
longest common substring.
      </p>
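<p>As an illustration, a minimal dynamic-programming sketch (in
Python) of the LCS length computation over token sequences is given
below; the actual implementations in these tools are more
sophisticated.</p>
      <preformat>
def lcs_length(a, b):
    """Length of the longest common (contiguous) substring of the
    token sequences a and b, by dynamic programming."""
    best = 0
    # prev[j] = length of the common suffix ending at a[i-1], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

lcs_length(list("abcdef"), list("zcdem"))  # 3, for "cde"
      </preformat>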
    </sec>
    <sec id="sec-7">
<title>4. SYSTEM DESCRIPTION</title>
<p>
        We build upon the existing work described in Section 3.1.
We work on the bag-of-words model, modifying it to support
term weighting, and use word 1-grams as features. Our
approach differs from XPLAG [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the sense that we do not
consider any of the other NLP techniques described in
Section 3.2.
      </p>
<p>In this section we describe our approach. We present three
iterative runs, each built upon and improving over its
predecessor. Only the preprocessing stage varies between runs.
The first run is the baseline run. The second run is the
normalization run. The third run builds upon the second
and removes any normalized operation or identifier; we refer to
this third run as using the removal of stop-words, the stop-words
being the normalized operations or identifiers.</p>
<p>We divide the workflow into four stages: preprocessing,
vectorization, similarity, and deciding.</p>
    </sec>
    <sec id="sec-8">
<title>4.1 Preprocessing</title>
<p>The preprocessing stage is divided into two parts. The first
part is the same for all approaches and is described below.</p>
      <p>In the first part of preprocessing, runs of consecutive
whitespace are converted to a single whitespace. The
code is then converted to lower case. Any accents in the text
are stripped. The source code is then passed to a lexer. The
lexer removes lexemes like +, -, *, /, etc. The output of the
lexer is a stream of tokens.</p>
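<p>A minimal sketch of this first part is given below, assuming
Python; the regular-expression lexer shown is only illustrative and is
not the actual lexer used in our system.</p>
      <preformat>
import re
import unicodedata

def preprocess(source):
    """Part one of preprocessing: collapse whitespace, convert to
    lower case, strip accents, and tokenize."""
    # Convert runs of consecutive whitespace to a single whitespace.
    text = re.sub(r"\s+", " ", source)
    # Convert the code to lower case.
    text = text.lower()
    # Strip any accents from the text.
    text = "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))
    # A toy lexer: keep word-like tokens and drop lexemes
    # like +, -, *, / etc.
    return re.findall(r"[a-z_][a-z0-9_.]*", text)

tokens = preprocess('int main() { printf("hello"); return 0; }')
# ['int', 'main', 'printf', 'hello', 'return']
      </preformat>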
<p>Subsections 4.1.1 to 4.1.3 provide a detailed description
of the approach to the second part of the preprocessing stage for
each run.</p>
      <sec id="sec-8-1">
<title>4.1.1 Baseline</title>
<p>In this stage, no work is done; that is, there is no
transformation of the tokenized stream obtained from part one of
preprocessing. This approach serves as a baseline.</p>
      </sec>
      <sec id="sec-8-2">
<title>4.1.2 Normalization</title>
<p>The input to this approach is the output from the baseline
approach. In this stage, we study the language usage features
of the training data. We obtain frequency statistics about
the most commonly used identifiers in both languages,
C and Java, sorted by frequency in
non-increasing order. This list was pruned to those in
the top-n positions. We also considered keywords
as identifiers.</p>
<p>We then manually mapped similar identifiers/functions to
new operation identifiers, or op-codes. For example, printf
is used as an output function in C, and System.out.println is used
as an output function in Java. Both these functions perform
similar operations; therefore, we replace all occurrences of
printf and System.out.println with op1. Refer to Table 1 for
the full list of such replacements. The output from this stage
is fed to the vectorizer.</p>
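<p>A sketch of this normalization step is given below, under the
same assumptions as the previous sketch. The mapping shown is a small
illustrative excerpt; the full hand-built mapping is the one in
Table 1, and op2 is a hypothetical second op-code.</p>
        <preformat>
# Excerpt of the identifier-to-op-code mapping (see Table 1).
OP_CODE_MAP = {
    "printf": "op1",              # output function in C
    "system.out.println": "op1",  # output function in Java
    "scanf": "op2",               # hypothetical: input function in C
    "scanner": "op2",             # hypothetical: input function in Java
}

def normalize(tokens):
    """Replace commonly used identifiers/keywords with op-codes."""
    return [OP_CODE_MAP.get(tok, tok) for tok in tokens]

normalize(["printf", "url"])  # ['op1', 'url']
        </preformat>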
      </sec>
      <sec id="sec-8-3">
<title>4.1.3 Removing stopwords</title>
<p>The input to this approach is the output from the
preprocessing of the normalization approach. All op-codes which were
generated in the previous stage are removed. This can also
be seen as using a stop-word list consisting of the op-codes
generated earlier.</p>
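<p>A sketch of the stop-word removal step is given below;
normalize() is the function from the previous sketch.</p>
        <preformat>
# Every op-code produced during normalization is treated as a
# stop-word and dropped from the token stream.
OP_CODES = set(OP_CODE_MAP.values())

def remove_stopwords(tokens):
    """Normalize the tokens, then drop all op-codes."""
    return [tok for tok in normalize(tokens) if tok not in OP_CODES]
        </preformat>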
      </sec>
    </sec>
    <sec id="sec-9">
<title>4.2 Vectorizer</title>
<p>This stage receives as input a set of tokenized documents.
Each document is converted to a vector. The features
chosen to create the vector are word 1-grams and 2-grams. The
weighting factor used is term frequency-inverse document
frequency (tf-idf).</p>
<disp-formula id="eq1">
        <label>(1)</label>
        <tex-math>\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f(t, d)}{\max\{f(t', d) : t' \in d\}}</tex-math>
      </disp-formula>
      <p>We know that the term frequency (tf) of a term increases
proportionally with the number of times it appears in a document, but
it must be offset by the term's frequency in the corpus. We require
this offset because a certain set of programming language constructs
is used a very high number of times, almost always, in programs which
do very different things. Inverse document frequency (idf) serves this
purpose. The output of this stage is a set of tf-idf vectors, each
vector representing a document.</p>
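<p>A from-scratch sketch of this weighting is given below, assuming
Python and the augmented term frequency of Equation (1); for brevity it
uses word 1-grams only, whereas our vectorizer also used 2-grams.</p>
      <preformat>
import math
from collections import Counter

def tfidf_vectors(docs):
    """Convert tokenized documents into tf-idf vectors."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # df(t): number of documents containing t. idf(t) = log(N / df(t))
    # offsets constructs that occur in almost every program.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for d in docs:
        f = Counter(d)
        max_f = max(f.values())
        # tf(t, d) = 0.5 + 0.5 * f(t, d) / max f(t', d), per Equation (1).
        vectors.append([(0.5 + 0.5 * f[t] / max_f) * idf[t]
                        if t in f else 0.0 for t in vocab])
    return vocab, vectors
      </preformat>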
<p>We group the output into a training set and a testing set.
The training set is passed to the similarity phase; the
testing set is passed to the deciding phase.</p>
</sec>
    <sec id="sec-9b">
      <title>4.3 Similarity Phase</title>
      <p>The input to this stage is the set of vectors representing the
documents in the training set. This set was divided into two subsets,
corresponding to the C files and the Java files, and a cross product
was taken between them. The cross product set represents every
possible (C, Java) pair from the training data, i.e., every C file is
compared with every Java file.</p>
      <disp-formula id="eq2">
        <label>(2)</label>
        <tex-math>\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}</tex-math>
      </disp-formula>
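<p>A sketch of this phase is given below, under the same assumptions;
c_vecs and java_vecs are hypothetical dictionaries mapping file names
to the tf-idf vectors produced by the vectorizer.</p>
      <preformat>
import math

def cosine(a, b):
    """Cosine similarity of two vectors, per Equation (2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical example vectors for two files.
c_vecs = {"012.c": [0.1, 0.7, 0.0]}
java_vecs = {"012.java": [0.2, 0.6, 0.1]}

# The cross product: compare every C file with every Java file.
similarities = {(c, j): cosine(vc, vj)
                for c, vc in c_vecs.items()
                for j, vj in java_vecs.items()}
      </preformat>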
<p>As mentioned in Section 2, we know that a (C, Java) file
pair is plagiarized if and only if the files have the same
number; in any other case they are not plagiarized. We use
this ground truth to record the mean and median cosine
similarity values over all plagiarized cases and all non-plagiarized
cases.</p>
    </sec>
    <sec id="sec-10">
<title>4.4 Deciding Phase</title>
<p>The input to this stage is the set of vectors from the
testing data, together with similarity statistics such as the mean and
median for the plagiarized cases from the similarity phase. The set of
vectors corresponding to the test data was divided into sets
corresponding to C files and Java files, and a cross product was
taken between these two sets. This cross product set
represents every possible (C, Java) pair from the test data. For
every pair in the cross product set, the cosine similarity is
computed. A pair was declared plagiarized when its similarity
exceeded a threshold, chosen as the mean or median obtained from the
similarity phase. For all three runs, we chose the
mean value from the previous phase as the threshold. The
reason for choosing the mean over the median is given below.</p>
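<p>A sketch of the deciding rule is given below, under the same
assumptions; train_similarities and plagiarized_pairs are hypothetical
names for the outputs recorded during the similarity phase, and
similarities is the test cross product from the previous sketch.</p>
      <preformat>
from statistics import mean

# Hypothetical similarity-phase outputs: similarities over the training
# cross product and the ground-truth (same-numbered) plagiarized pairs.
train_similarities = {("012.c", "012.java"): 0.83,
                      ("012.c", "021.java"): 0.24}
plagiarized_pairs = [("012.c", "012.java")]

# Threshold: mean cosine similarity over the plagiarized training pairs.
threshold = mean(train_similarities[p] for p in plagiarized_pairs)

# A test pair is reported as a reuse case when its cosine similarity
# is above the threshold.
detected = [pair for pair, s in similarities.items() if s &gt; threshold]
      </preformat>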
      <sec id="sec-10-1">
<title>4.4.1 Choosing threshold for deciding</title>
<p>The statistics in Table 2 are results from the baseline
approach. The input documents were the training set itself.
There were 599×599 (C, Java) pairs, of which 599 were
plagiarized cases. We measured the mean and median
values of cosine similarity over all the plagiarized cases.</p>
<p>We see that using the mean as threshold gives us slightly lower
precision, but far better recall, and therefore a better F1
value. The number of false positives is higher using the
mean as threshold, but the difference between using the mean and the
median as threshold tends to 0 (0.00014) when expressed
as a percentage.</p>
        <p>Precision is the fraction of retrieved instances that are
relevant, while recall is the fraction of relevant instances
that are retrieved.</p>
        <p>F1 score is the harmonic mean of precision and recall.
Since it takes both precision and recall into equal
consideration, it's an overall measure of relevance.</p>
</sec>
    </sec>
    <sec id="sec-10b">
      <title>5. RESULTS</title>
      <p>As we can see in the results in Table 3, the precision of all
three runs is 1. This means all the retrieved results are
relevant, i.e., there are no false positives.</p>
<p>Next, we compare the different approaches.</p>
<p>Comparison between baseline and normalization:
The reuse cases &lt;345.c and 033.java&gt;, &lt;351.c and
043.java&gt; and &lt;368.c and 061.java&gt; are present in
normalization but not in baseline. Our reasoning is as
follows: since in the baseline there is no transformation of the
tokens obtained from the lexer after preprocessing,
certain tokens in C and Java which have the same
meaning (perform similar actions) might not have been
considered the same. This reduces the cosine similarity between
the files under consideration.</p>
<p>For example, the statements strcpy(url, "") (in 345.c) and
url = "" (in 033.java) do the same thing, but they are not considered
the same by the lexer. The reuse case of &lt;312.c and 005.java&gt; is
present in baseline but not in normalization. Since only the
top-n positions of the frequency statistics regarding commonly
used identifiers/keywords in C and Java were considered for
the normalization, it would have missed certain
identifiers/keywords that perform the same actions in both C
and Java.</p>
<p>Comparison between baseline and stop-word removal:
The reuse case of &lt;337.c and 034.java&gt; is present in
baseline but not in stop-word removal, for reasons similar
to those mentioned above (recall that the output of the
normalization preprocessing is fed to the stop-word removal
preprocessing stage).</p>
<p>The reuse cases of &lt;321.c and 049.java&gt;, &lt;331.c and
065.java&gt;, &lt;345.c and 033.java&gt;, &lt;351.c and 043.java&gt;,
&lt;359.c and 029.java&gt;, &lt;368.c and 051.java&gt;, &lt;373.c
and 058.java&gt;, &lt;374.c and 006.java&gt;, &lt;375.c and
042.java&gt; and &lt;376.c and 074.java&gt; are all present in
stop-word removal but not in baseline. A possible reason
is that, since the op-codes are removed and a direct mapping
between the identifiers or keywords of the two languages has been
done, there is a higher chance of matching similar
constructs and identifiers.</p>
<p>Comparison between normalization and stop-word removal:
The reuse cases in bold letters above, along with &lt;312.c
and 005.java&gt;, are all present in stop-word removal but not in
the normalization run.</p>
<p>The reuse case of &lt;337.c and 034.java&gt; is present in
normalization but not in stop-word removal. This may be
because in normalization there are op-codes for similar
identifiers, while in stop-word removal we remove the op-codes.
When many identifiers across the two languages have a similar
meaning, it can become difficult or inaccurate to keep track of
them without using op-codes (with op-codes, all of them are
mapped to a single op-code).</p>
</sec>
    <sec id="sec-11">
<title>6. FUTURE SCOPE AND CONCLUSION</title>
    <sec id="sec-12">
<title>6.1 Improving recall</title>
<p>In order to improve the recall, we may do the following.
Use a combination of the methods used for normalization
and stop-word removal: in cases where a large
number of similar identifiers exists, an op-code can be used; in
others, direct mapping can be used. More
identifiers can be grouped under one op-code; for example, as and
when an identifier is encountered that has the same meaning
as the identifiers in an already defined set, it can be added to
that set. The value of 'n' in deciding the top-n identifiers by
frequency can also be varied.</p>
    </sec>
    <sec id="sec-13">
<title>6.2 Improving the normalization and stop-word removal procedure</title>
<p>We mention certain aspects which were lacking in our
system, with suggestions on how they can be improved in
the future. These suggestions keep in mind how the
system can be made generic, that is, able to support as many
language pairs as possible.</p>
      <sec id="sec-13-1">
<title>6.2.1 Same language normalization automation</title>
<p>Most programming languages provide many functions or
identifiers to do the same thing. For example, C provides
printf, fprintf and puts to output a string. We may use
contextual information to automate the process of detecting such
functions or identifiers. Once contextually similar pairs are
detected, they may be assigned op-codes if they are indeed
doing similar things.</p>
      </sec>
      <sec id="sec-13-2">
<title>6.2.2 Cross language normalization automation</title>
<p>The approach of Section 6.2.1 may be used to decide
whether it would be feasible to automate the process of
generating op-codes across languages for similar operations.</p>
      </sec>
</sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] Flores, E., Rosso, P., Moreno, L. and Villatoro-Tello, E.: PAN@FIRE 2015: Overview of CL-SOCO Track on the Detection of Cross-Language SOurce COde Re-use. In Proceedings of the Seventh Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, 4-6 December 2015.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] Plagiarism Detection: 2015. https://theory.stanford.edu/~aiken/moss/. Accessed: 2015-10-18.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[3] Prechelt, L., Malpohl, G. and Philippsen, M.: Finding plagiarisms among a set of programs with JPlag. Journal of Universal Computer Science, 8(11):1016-1038, 2002.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[4] Barrón-Cedeño, A. and Rosso, P.: On Automatic Plagiarism Detection Based on n-Grams Comparison. In Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval (ECIR '09), Mohand Boughanem, Catherine Berrut, Josiane Mothe, and Chantal Soulé-Dupuy (Eds.). Springer-Verlag, Berlin, Heidelberg, 696-700, 2009. DOI: http://dx.doi.org/10.1007/978-3-642-00958-7_69</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[5] Ganguly, D. and Jones, G.: DCU@FIRE-2014: An Information Retrieval Approach for Source Code Plagiarism Detection.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[6] Arwin, C. and Tahaghoghi, S. M. M.: Plagiarism detection across programming languages. In Proceedings of the 29th Australasian Computer Science Conference, Volume 48, Australian Computer Society, Inc., 2006.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>