=Paper=
{{Paper
|id=Vol-1587/T1-2
|storemode=property
|title=Normalization based Stop-Word approach to Source Code Plagiarism Detection
|pdfUrl=https://ceur-ws.org/Vol-1587/T1-2.pdf
|volume=Vol-1587
|authors=Saimadhav Heblikar,Poorva Sharma,Manogna Munnangi,Channa Bankapur
|dblpUrl=https://dblp.org/rec/conf/fire/HeblikarSMB15
}}
==Normalization based Stop-Word approach to Source Code Plagiarism Detection==
Saimadhav Heblikar, PES Institute of Technology, Bangalore, India (saimadhavheblikar@gmail.com)
Poorva Sharma, PES Institute of Technology, Bangalore, India (poorvasharma0615@gmail.com)
Manogna Munnangi, PES Institute of Technology, Bangalore, India (manogna08@gmail.com)
Channa Bankapur, PES University, Bangalore, India (channabankapur@pes.edu)
ABSTRACT

This paper is a report of PES Institute of Technology's participation in the Cross Language Detection of Source Code Reuse (CL-SOCO) task at FIRE 2015 [1]. We approach this task as a text document plagiarism task, without considering the formal grammatical structure of the programming languages. We use normalization of commonly used identifiers to detect pairs of programs which have the same objective. We also find that entirely removing these normalized operations improves the system.

CCS Concepts

• Information systems → Similarity measures; Clustering and classification

Keywords

Source code reuse, Plagiarism detection

1. INTRODUCTION

Vast amounts of software code have become easily available on the Internet. Sites like Stackoverflow make solutions to common problems readily available. In such a scenario, software developers are tempted to copy and paste code from one place to another. This can cause the owners of the software legal, ethical, licensing and maintenance problems in the long run. Software plagiarism also affects competitive programming competitions like ACM ICPC. The sheer scale of resources available to plagiarize from, and the possible number of plagiarized documents, makes source code plagiarism detection a daunting task.

Plagiarism detection in software source code differs from the text plagiarism detection task. One of the popular approaches to text plagiarism detection is the bag-of-words model [4]. However, this is not directly useful in a software source code context, as a small set of programming constructs is bound to be reused repeatedly, whilst doing altogether different things.

There currently exist tools like MOSS [2] and JPLAG [3] which try to solve this problem. MOSS stands for "Measure Of Software Similarity" and is a system for detecting similarity in software. JPLAG is a system for detecting software similarity considering text features as well as programming language features. Both JPLAG and MOSS are used in academic environments.

2. TASK DESCRIPTION

The Cross Language Source Code Reuse (CL-SOCO) track of FIRE 2015 deals with the detection of plagiarism in software source code. The cross-language aspect deals with detecting plagiarism between C and Java sources.

The training set given to us consisted of 599 C and 599 Java files, numbered from 001.c to 599.c and 001.java to 599.java. Files with the same number represented a plagiarized case; that is, 012.c and 012.java represent a reuse case, while 012.c and 021.java do not. Since some of these files were generated using a tool, they contained parse errors. This was true for both the C and the Java data-set.

The test set given to us consisted of 79 C files and 79 Java files, numbered from 300.c to 378.c and 000.java to 078.java.

Both the training and test corpora are available at [1].

It is important to note that we do not have to report the direction of reuse, that is, whether the reuse was from C to Java or from Java to C.

3. CURRENT WORK

In this section, we describe the state of research in the field of plagiarism detection in general, and source code plagiarism detection in particular.

3.1 Bag-of-words model

In this model, a document is represented as a bag of words; in practice, a multi-set. It disregards order and grammar, but accounts for multiplicity. Bag of words is shown to work well for the text plagiarism detection task [4]. Its performance is not satisfactory for the programming language plagiarism detection task [5], the reason being that similar programming constructs are bound to appear a very high number of times in programs which may nevertheless be doing entirely different things.
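
To make this weakness concrete, the following minimal Python sketch (ours, not part of any of the systems above) builds bag-of-words multisets for two toy programs that behave differently; almost all of their tokens still overlap, so a plain bag-of-words comparison rates them as highly similar.

 from collections import Counter

 # Two toy token streams: one counts up, the other counts down,
 # yet both are built from the same common constructs.
 prog_a = "int main { int i for i = 0 i < 10 i ++ printf i }"
 prog_b = "int main { int i for i = 10 i > 0 i -- printf i }"

 bag_a = Counter(prog_a.split())  # bag-of-words = multiset of tokens
 bag_b = Counter(prog_b.split())

 shared = sum((bag_a & bag_b).values())   # multiset intersection
 print(shared / sum(bag_a.values()))      # ~0.88: high overlap, different behavior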
3.2 NLP Techniques

Common Natural Language Processing techniques such as word n-grams are used to detect similarity between documents. Some works also consider features of the text, such as the number of white-spaces, average indentation and other stylistic features, for evaluation. A popular tool is XPLAG [6].

3.3 Longest common substring

Tools like JPLAG [3] make use of the longest common substring (LCS) approach. This is a pairwise approach: the similarity between a pair is decided by the length of their longest common substring.
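
As an illustration of the LCS idea (a sketch of ours, not JPLAG's actual implementation), the following Python function computes the length of the longest common substring of two token sequences by dynamic programming; normalizing by the shorter sequence's length is one common choice, not prescribed here.

 def longest_common_substring(a: list[str], b: list[str]) -> int:
     """Length of the longest contiguous run of tokens shared by a and b."""
     best = 0
     prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
     for i in range(1, len(a) + 1):
         cur = [0] * (len(b) + 1)
         for j in range(1, len(b) + 1):
             if a[i - 1] == b[j - 1]:
                 cur[j] = prev[j - 1] + 1  # extend the matching run
                 best = max(best, cur[j])
         prev = cur
     return best

 def lcs_similarity(a: list[str], b: list[str]) -> float:
     # Normalize by the shorter document so the score lies in [0, 1].
     return longest_common_substring(a, b) / min(len(a), len(b))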
4. SYSTEM DESCRIPTION

We build upon the existing work described in Section 3.1. We work on the bag-of-words model, modifying it to support term weighting, and we use word 1-grams as features. Our approach differs from XPLAG [6] in the sense that we do not consider the other NLP techniques described in Section 3.2.

In this section we describe our approach. We present three iterative runs, each built upon and improving over its predecessor. Only the preprocessing stage varies between the runs. The first run is the baseline run. The second run is the normalization run. The third run builds upon the second and removes every normalized operation or identifier; we refer to this third run as the removal of stop-words, the stop-words being the normalized operations or identifiers.

We divide the workflow into four stages: preprocessing, the vectorizer, the similarity phase and the deciding phase.
4.1 Preprocessing

The preprocessing stage is divided into two parts. The first part is the same for all approaches and is described below. In it, runs of consecutive whitespace are converted to a single whitespace. The code is then converted to lower case, and any accents in the text are stripped. The source code is then passed to a lexer, which removes lexemes like +, -, *, / etc. The output of the lexer is a stream of tokens.

Subsections 4.1.1 to 4.1.3 provide a detailed description of the second part of the preprocessing stage for each run.
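
A minimal sketch of this first part (our reconstruction; the paper does not name its lexer, so a simple regular-expression tokenizer that drops operator lexemes stands in for it):

 import re
 import unicodedata

 def preprocess(source: str) -> list[str]:
     # Collapse runs of consecutive whitespace into a single space.
     source = re.sub(r"\s+", " ", source)
     # Convert the code to lower case.
     source = source.lower()
     # Strip any accents from the text.
     source = "".join(c for c in unicodedata.normalize("NFKD", source)
                      if not unicodedata.combining(c))
     # Stand-in lexer: keep word-like tokens and numbers,
     # dropping lexemes like + - * /.
     return re.findall(r"[a-z_][a-z0-9_]*|[0-9]+", source)

 print(preprocess('int  Main() { printf("héllo"); }'))
 # ['int', 'main', 'printf', 'hello']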
4.1.1 Baseline

In this run no further work is done; that is, there is no transformation of the tokenized stream obtained from part one of preprocessing. This approach serves as the baseline.
4.1.2 Normalization

The input to this approach is the output from the baseline approach. In this stage, we study the language usage features of the training data. We obtain frequency statistics about the most commonly used identifiers in both languages, C and Java, sorted by frequency in non-increasing order. This list was pruned to the top-n positions. We also considered keywords as identifiers.

We then manually mapped similar identifiers/functions to new operation identifiers, or op-codes. For example, printf is used as an output function in C, and System.out.println is used as an output function in Java. Both functions perform similar operations; therefore, we replace all occurrences of printf and System.out.println with op12. Refer to Table 1 for the full list of such replacements. The output of this stage is fed to the vectorizer.

Table 1: Normalization List

 Op-Code  Identifiers assigned to Op-Code
 op1      len, strlen, length, size
 op2      stdio, stdlib, system
 op3      size, sizeof
 op4      struct, typedef, class, object
 op5      string, str, StringFunctions
 op6      list, iter
 op7      new, malloc
 op8      argv, argValue
 op9      rand
 op10     argc, args
 op11     pthis, pthread
 op12     print, fprintf, printf, sprintf, println, System.out.println, System.out.print, System.out.printf, puts, putchar, fputs
 op13     array, charAt
 op14     ret, return
 op15     file, fd
 op16     int, integer
 op17     char, character
 op18     bool, Boolean, boolean
 op19     float, Float
 op20     scanf, scanner, gets, getch, getchar
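
A sketch of the replacement step (ours; the map is abridged to a few rows of Table 1):

 # Abridged normalization map from Table 1 (the full map covers op1..op20).
 OPCODES = {
     "printf": "op12", "fprintf": "op12", "println": "op12", "puts": "op12",
     "len": "op1", "strlen": "op1", "length": "op1",
     "new": "op7", "malloc": "op7",
     "scanf": "op20", "scanner": "op20", "gets": "op20",
 }

 def normalize(tokens: list[str]) -> list[str]:
     # Replace every mapped identifier with its op-code; leave the rest alone.
     return [OPCODES.get(tok, tok) for tok in tokens]

 print(normalize(["printf", "x", "malloc", "n"]))
 # ['op12', 'x', 'op7', 'n']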
4.1.3 Removing stopwords

The input to this approach is the output of the normalization preprocessing. All op-codes generated in the previous stage are removed. This can also be seen as using a stop-word list consisting of the op-codes generated earlier.
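
The corresponding filter is small (our sketch), treating the op-codes themselves as the stop-word list:

 def remove_stopwords(tokens: list[str]) -> list[str]:
     # Stop-word list: the op-codes produced by normalization (op1..op20).
     stopwords = {f"op{i}" for i in range(1, 21)}
     return [tok for tok in tokens if tok not in stopwords]

 print(remove_stopwords(["op12", "x", "op7", "n"]))  # ['x', 'n']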
4.2 Vectorizer

This stage receives as input a set of tokenized documents. Each document is converted to a vector. The features chosen to create the vectors are word 1-grams and 2-grams. The weighting factor used is term frequency-inverse document frequency (tf-idf):

tf(t, d) = 0.5 + \frac{0.5 \cdot f(t, d)}{\max\{f(t', d) : t' \in d\}} \qquad (1)

We know that the term frequency (tf) of a term increases proportionally with the number of times it appears in a document, but it should be offset by the term's frequency in the corpus. We require this offset because certain programming language constructs are used a very high number of times, almost always, in programs which do very different things; inverse document frequency (idf) serves this purpose. The output of this stage is a set of tf-idf vectors, each vector representing a document.

We group the output into a training set and a testing set. The training set is passed to the similarity phase; the testing set is passed to the deciding phase.
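
The weighting can be sketched as follows (ours): the augmented term frequency of equation (1) combined with one standard idf variant. The paper does not state its exact idf formula, and only word 1-grams are shown here for brevity (the system also uses 2-grams).

 import math
 from collections import Counter

 def augmented_tf(tokens: list[str]) -> dict[str, float]:
     # Equation (1): tf(t, d) = 0.5 + 0.5 * f(t, d) / max_t' f(t', d).
     counts = Counter(tokens)
     max_f = max(counts.values())
     return {t: 0.5 + 0.5 * f / max_f for t, f in counts.items()}

 def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
     n = len(docs)
     # One standard smoothed idf: idf(t) = log(n / df(t)) + 1 (an assumption).
     df = Counter(t for doc in docs for t in set(doc))
     idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}
     return [{t: tf * idf[t] for t, tf in augmented_tf(doc).items()}
             for doc in docs]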
Table 2: Comparison of mean vs. median for deciding threshold

                               Mean   Median
 Threshold (similarity value)  0.644  0.776
 False positives               139    88
 F1                            0.450  0.324
 Precision                     0.738  0.775
4.3 Similarity Phase

The input to this stage is the set of vectors representing the documents in the training set. This set was divided into two sets corresponding to the C files and the Java files, and a cross product was taken between them. The cross product set represents every possible (C, Java) pair from the training data, that is, every C file compared with every Java file. For every pair, we compute the cosine similarity:

\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (2)
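
As a small sketch (ours), equation (2) over the sparse tf-idf vectors produced by the vectorizer:

 import math

 def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
     # Equation (2): dot product divided by the product of Euclidean norms.
     dot = sum(w * b.get(t, 0.0) for t, w in a.items())
     norm_a = math.sqrt(sum(w * w for w in a.values()))
     norm_b = math.sqrt(sum(w * w for w in b.values()))
     return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0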
As mentioned in Section 2, a (C, Java) file pair is plagiarized if and only if the two files have the same file number; in any other case they are not plagiarized. We use the above statistic to record the mean and median cosine similarity values over all plagiarized cases and all non-plagiarized cases.
4.4 Deciding Phase

The input to this stage is the set of vectors from the testing data, together with similarity statistics, such as the mean or median for the plagiarized cases, from the similarity phase. The set of vectors corresponding to the test data was divided into sets corresponding to C files and Java files, and a cross product was taken between these two sets; the cross product set represents every possible (C, Java) pair from the test data. For every pair in the cross product set, the cosine similarity is computed. A pair is declared plagiarized when its similarity exceeds a threshold, chosen as the mean or median obtained from the similarity phase. For all three runs, we chose the mean value from the previous phase as the threshold. The reason for choosing the mean over the median is given below.
4.4.1 Choosing the threshold for deciding

The statistics in Table 2 are results from the baseline approach, with the training set itself as input. There were 599 × 599 (C, Java) pairs, of which 599 were plagiarized cases. We measured the mean and median values of cosine similarity over all the plagiarized cases.

We see that using the mean as threshold gives us slightly lower precision, but a far better recall, and therefore a better F1 value. The number of false positives is higher using the mean as threshold, but the difference between using the mean or the median tends to 0 (0.00014) when expressed as a fraction of all pairs.
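
The resulting decision rule is a simple threshold test (our sketch, reusing cosine_similarity from the similarity-phase sketch above):

 from statistics import mean

 def train_threshold(plagiarized_pairs):
     # Mean cosine similarity over the known plagiarized (C, Java) training pairs.
     return mean(cosine_similarity(c, j) for c, j in plagiarized_pairs)

 def decide(test_pairs, threshold):
     # Report a test pair as reuse when its similarity exceeds the threshold.
     return [(c, j) for c, j in test_pairs
             if cosine_similarity(c, j) > threshold]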
5. RESULTS AND ANALYSIS

Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. The F1 score is the harmonic mean of precision and recall; since it takes both into equal consideration, it is an overall measure of relevance.

Table 3: Results for the cross language collection

 Preprocessing approach  F1     Precision  Recall
 Baseline                0.683  1.000      0.519
 Normalization           0.697  1.000      0.534
 Removing stopwords      0.740  1.000      0.603

As the results in Table 3 show, the precision of all three runs is 1. This means all the retrieved results are relevant, i.e., there are no false positives.

Next, we compare the different approaches.

Comparison between baseline and normalization. The reuse cases <345.c and 033.java>, <351.c and 043.java> and <368.c and 061.java> are found by normalization but not by the baseline. Our reasoning is as follows. Since, in the baseline, there is no transformation of the tokens obtained from the lexer after preprocessing, certain tokens in C and Java that have the same meaning (perform similar actions) might not have been considered the same, which reduces the cosine similarity between the files under consideration. For example, the statements

 strcpy(url, "")   (in 345.c)

and

 url = ""          (in 033.java)

do the same thing, but they are not considered the same by the lexer.

The reuse case <312.c and 005.java> is found by the baseline but not by normalization. Since only the top-n positions of the frequency statistics of commonly used identifiers/keywords in C and Java were considered for the normalization, it would have missed certain identifiers/keywords that perform the same actions in both C and Java.

Comparison between baseline and stop-word removal. The reuse case <337.c and 034.java> is found by the baseline but not by stop-word removal, for reasons similar to those mentioned above (recall that the output of the normalization preprocessing is fed into the stop-word removal preprocessing stage).

The reuse cases <321.c and 049.java>, <331.c and 065.java>, <345.c and 033.java>, <351.c and 043.java>, <359.c and 029.java>, <368.c and 051.java>, <373.c and 058.java>, <374.c and 006.java>, <375.c and 042.java> and <376.c and 074.java> are all found by stop-word removal but not by the baseline. The possible reason is that, since the op-codes are removed after a direct mapping between the identifiers or keywords of both languages has been made, there is a higher chance of matching similar constructs and identifiers.

Comparison between normalization and stop-word removal. The reuse cases in bold letters above, along with <312.c and 005.java>, are all found by stop-word removal but not by the normalization run.

The reuse case <337.c and 034.java> is found by normalization but not by stop-word removal. This may be because in normalization there are op-codes for similar identifiers, whereas in stop-word removal the op-codes are removed. When many identifiers across the two languages have a similar meaning, it may become difficult or inaccurate to keep track of them without using op-codes (with op-codes, all of them share a single op-code).

6. FUTURE SCOPE AND CONCLUSION
6.1 Improving recall

In order to improve recall, we may use a combination of the methods used for normalization and stop-word removal: where a large number of similar identifiers exists, an op-code can be used; in other cases, a direct mapping can be used. More identifiers can also be grouped under one op-code; for example, whenever an identifier is encountered that has the same meaning as the identifiers in an already defined set, it can be added to that set. Finally, the value of n used in selecting the top-n identifiers by frequency can be varied.
6.2 Improving the normalization and stop-word removal procedure

We mention certain aspects in which our system was lacking, along with suggestions for how it can be improved in the future. These suggestions keep in mind how the system can be made generic, that is, able to support as many language pairs as possible.
6.2.1 Same-language normalization automation

Most programming languages provide many functions or identifiers to do the same thing. For example, C provides printf, fprintf and puts to output a string. Contextual information could be used to automate the process of detecting such functions or identifiers. Once contextually similar pairs are detected, they may be assigned op-codes if they are indeed doing similar things.
6.2.2 Cross-language normalization automation

The approach of Section 6.2.1 may be used to decide whether it would be feasible to automate the process of generating op-codes across languages for similar operations.
7. REFERENCES

[1] Flores, E., Rosso, P., Moreno, L., Villatoro-Tello, E.: PAN@FIRE 2015: Overview of CL-SOCO Track on the Detection of Cross-Language SOurce COde Re-use. In: Proceedings of the Seventh Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, 4-6 December 2015.

[2] Plagiarism Detection: https://theory.stanford.edu/~aiken/moss/, 2015. Accessed: 2015-10-18.

[3] Prechelt, L., Malpohl, G., Philippsen, M.: Finding plagiarisms among a set of programs with JPlag. Journal of Universal Computer Science, 8(11):1016-1038, 2002.

[4] Barrón-Cedeño, A., Rosso, P.: On Automatic Plagiarism Detection Based on n-Grams Comparison. In: Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval (ECIR '09), Mohand Boughanem, Catherine Berrut, Josiane Mothe, and Chantal Soule-Dupuy (Eds.), Springer-Verlag, Berlin, Heidelberg, 696-700, 2009. DOI: http://dx.doi.org/10.1007/978-3-642-00958-7_69.

[5] Ganguly, D., Jones, G.: DCU@FIRE-2014: An Information Retrieval Approach for Source Code Plagiarism Detection.

[6] Arwin, C., Tahaghoghi, S.M.M.: Plagiarism detection across programming languages. In: Proceedings of the 29th Australasian Computer Science Conference, Volume 48, Australian Computer Society, Inc., 2006.