High Level Features for Detecting Source Code Plagiarism
            across Programming Languages

A. Ramírez-de-la-Cruz, G. Ramírez-de-la-Rosa∗, C. Sánchez-Sánchez,
H. Jiménez-Salazar, C. Rodríguez-Lucatero, W. A. Luna-Ramírez

Universidad Autónoma Metropolitana, Unidad Cuajimalpa, México D.F.

∗Corresponding author. E-mail address: gramirez@correo.cua.uam.mx


ABSTRACT
In this paper we describe the participation of the Language and Reasoning group from UAM-C in the Cross-Language SOurce COde re-use competition (CL-SOCO 2015). We propose a representation of source code pairs based on five high level features, namely: i) a lexical feature, ii) a stylistic feature, iii) a comments feature, iv) a programmer's text feature, and v) a structural feature. We combine these representations in three ways, each of which was submitted as a run to the CL-SOCO competition. The obtained results indicate that the proposed representations provide information that allows particular cases of source code re-use to be detected.

CCS Concepts
•Information systems → Content analysis and feature selection; Near-duplicate and plagiarism detection; •Applied computing → Document analysis;

Keywords
High level features; Document representation; Plagiarism detection; Source code plagiarism

1.   INTRODUCTION

Source code re-use identification has been an interesting topic on two fronts: the software industry and academia. On the one hand, software companies are very interested in protecting their own software developments; thus, they invest a great deal of effort and money in trying to do so. On the other hand, academia, mainly in computing-related areas, worries that students may plagiarize source code, whether from other students or from forums on the Internet [5].

The Source Code Re-use competition emerged in this context when, in 2014 [8], the scientific community was invited to a shared task aimed at evaluating systems that identify cases of source code re-use in a monolingual scenario. In 2015, the task considers a cross-language scenario, that is, when a programmer tries to re-use source code from one language, say Java, in another, say C.

In this paper, we present our methodology for finding source code re-use cases across the Java and C programming languages. We use a set of five high level features within a classification problem, namely: lexical, stylistic, comments, programmer's text, and structural features.

The rest of the paper is organized as follows. In Section 2 we describe the research work most closely related to our proposed methodology. In Section 3, we briefly describe the shared task. In Section 4, we describe the computation of each of our five proposed features. Section 5 presents the experimental evaluation carried out on the training set. Section 6 gives the details of the submitted runs and, finally, Section 7 presents our conclusions and some future perspectives.

2.   RELATED WORK

There are several approaches and tools for finding plagiarism in source code [10]. Some of the most representative approaches are those that try to find syntactic similarities between codes written in the same source language. Some of these are based on searching for similar n-grams or small character sequences (strings) shared by two source codes; examples are the approaches proposed by Flores et al. [7] and by Wettel et al. [15], respectively. Likewise, other approaches have tried to detect similarities in program structure; for example, the approaches proposed by Krinke [12] or Chae et al. [4] look for graph dependencies (of how methods are called) inside the abstract syntax tree.

Going beyond a single type of characteristic, the approach proposed by Ramírez et al. [13] evaluates a set of three different types of features in order to determine the similarity between source codes. The features are: 1) lexical (character n-grams), 2) structural (function names and parameter names and types), and 3) stylistic (the number of lines of code, the number of white spaces, the number of
tabulations, the number of empty lines, the number of defined functions, the average word length, the number of upper case letters, the number of lower case letters, the number of underscores, the vocabulary size, and the lexical richness). This combination has captured important aspects, with acceptable results, for determining plagiarism between pairs of Java source codes.

On the other hand, some researchers have focused on identifying similarities or code clones in software written in different languages. Basically, their methods translate the codes into an intermediate language, so that, in the end, they search for similarities within a single language. An example of such methods is the proposal of Al-Omari et al. [1], in which software clones written in different languages of the .NET framework can be recognized. In this proposal the software is analyzed once it has been transformed into the Common Intermediate Language, which is the result of compiling source code in .NET, where similarities are finally sought.

Another example of the use of an intermediate language is the work of Brixtel et al. [3]. They focus on identifying language-independent code clones; to do so, they preprocess the source codes into an intermediate language and then use an alignment method based on the parallel principle at local resolution (character level) to compute similarities between documents.

There are other proposals that do not take the source code itself as the main point; instead, they focus on metrics computed over the source code file content. That is the case of the approach proposed by Vidhya et al. [14], where two types of metrics are calculated. The first kind of metrics works at the method level: number of lines of code, arguments, function calls, local variables, conditional statements, looping statements, return statements, and assignment statements. The second kind works at the file level: number of lines of code, variables declared, methods defined, function calls, and sequences of function calls. The authors compare two documents by the differences in each of the metrics described above; then, they detect a clone case using a manual threshold on the average of the computed differences.

As can be observed, some of the methods described above are expensive to apply to large collections (for instance, those based on computing the syntax tree), others depend on translators for each programming language to be used, and others need manually set thresholds to identify re-used (plagiarized) source code pairs. In contrast to these previous methods, we propose to use five high level features that are easy to compute and that consider more than one type of aspect of the source code files.

3.   SHARED TASK DESCRIPTION

CL-SOCO, Cross-Language Detection of SOurce COde Re-use, is a shared task that focuses on cross-language source code re-use detection. Participant systems were provided with cross-lingual training and test sets of source code files. The task consists in identifying the source code pairs that have been re-used at the document level across programming languages, particularly Java and C. The details of the task are described in [9]. Note that the relevance judgments represent cases of re-use in both directions, i.e., the direction of the re-use is not detected. The provided training set has 599 pairs (Java-C) of source code documents and the test set has 79 source code documents.

4.   PROPOSED HIGH LEVEL FEATURES

In this section we describe our proposed representation of source code documents as a set of five high level features for identifying source code re-use. To compute these features, we represent a document in five different ways. In particular, for the lexical, comments, and programmer's text features, we represent each document as a set of character n-grams; for the stylistic feature, we use eleven attributes; and for the structural feature, we use another ten attributes.

The idea behind this set of high level features is to capture aspects of the source code that are inherent to the programmer rather than to a particular programming language. Thus, the stylistic feature captures information about the writing style of the programmer; the comments feature attends only to the natural language information that the programmer uses to explain the code; and the programmer's text feature takes into account the strings that the program produces, that is, text that, again, has little to do with the programming language per se. Additionally, the lexical and structural features take advantage of the fact that the two languages at hand (Java and C) share, to some extent, some syntax. Next, we describe each of the five proposed high level features.

Lexical feature. The idea behind this representation is to find a global similarity along the entire document using the representation proposed by Flores [6]. We compute this feature in the same way as described in [13]. That is, we use a bag of character trigrams where all white spaces and line breaks are deleted and the letters are lowercased. Additionally, since we know the language of each source code file a priori, we eliminate the reserved words within the document. Consequently, given two source code documents D1 and D2, each one is represented as a vector according to the vector space model [2], where the dimension of the vectors is given by the vocabulary of character trigrams in both documents. Finally, the lexical feature is computed as the cosine similarity of these two vectors (see Equation 1):

    sim(D1, D2) = (D1 · D2) / (||D1|| ||D2||)    (1)
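As an illustration, the following is a minimal sketch of this computation; the reserved-word list is a small illustrative sample, not the full Java/C keyword lists:

  import re
  from collections import Counter
  from math import sqrt

  # Illustrative sample of reserved words; the real lists would cover
  # all Java and C keywords (each file's language is known a priori).
  RESERVED = {"int", "char", "void", "return", "if", "else", "while",
              "for", "public", "static", "class", "struct"}

  def trigram_bag(source: str) -> Counter:
      """Bag of character trigrams: reserved words removed, then all
      white space and line breaks deleted, letters lowercased."""
      text = re.sub(r"\b(?:" + "|".join(RESERVED) + r")\b", "", source)
      text = re.sub(r"\s+", "", text).lower()
      return Counter(text[i:i + 3] for i in range(len(text) - 2))

  def cosine(a: Counter, b: Counter) -> float:
      """Equation 1: cosine similarity between two trigram vectors."""
      dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
      norms = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
      return dot / norms if norms else 0.0

The lexical feature of a pair of files is then cosine(trigram_bag(code1), trigram_bag(code2)).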
Stylistic feature. As in [13], we took into account a set of eleven stylistic characteristics of the programmer's writing style: the number of lines of code, the number of white spaces, the number of tabulations, the number of empty lines, the number of defined functions, the average word length, the number of upper case letters, the number of lower case letters, the number of underscores, the vocabulary size, and the lexical richness (i.e., the total number of tokens over the vocabulary size). To determine the stylistic feature we place all these attributes in a vector representation and apply the cosine similarity (Equation 1) between the two vectors.
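A minimal sketch of how this attribute vector can be collected follows; the paper does not spell out how tokens or function definitions are detected, so the patterns below are our assumptions:

  import re

  def stylistic_vector(source: str) -> list[float]:
      """Eleven stylistic attributes of a source file, in a fixed order."""
      lines = source.splitlines()
      tokens = re.findall(r"\w+", source)
      vocabulary = set(tokens)
      return [
          len(lines),                                        # lines of code
          source.count(" "),                                 # white spaces
          source.count("\t"),                                # tabulations
          sum(1 for l in lines if not l.strip()),            # empty lines
          len(re.findall(r"\w+\s*\([^)]*\)\s*\{", source)),  # defined functions (rough)
          sum(map(len, tokens)) / len(tokens) if tokens else 0.0,  # average word length
          sum(c.isupper() for c in source),                  # upper case letters
          sum(c.islower() for c in source),                  # lower case letters
          source.count("_"),                                 # underscores
          len(vocabulary),                                   # vocabulary size
          len(tokens) / len(vocabulary) if vocabulary else 0.0,  # lexical richness
      ]

The stylistic feature of a pair is the cosine similarity (Equation 1) between the two resulting vectors.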
Comments feature. As mentioned before, we think that the explanations and details of procedures given in the form of comments have particularities that are inherent to the programmer; thus, these texts capture information that may be left unmodified when the source code is re-used. Accordingly, we take everything between /* ... */ blocks and everything after // as comments.
First, we concatenate all the text written as comments in the source code document; then we compute the character trigrams over it. The procedure thereafter is the same as described for the lexical feature; that is, to determine the comments feature we use a vector representation and apply the cosine similarity as in Equation 1.
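As an illustration, a minimal sketch of the comment extraction; the regular expressions are simplified and ignore corner cases such as comment markers inside string literals:

  import re

  def extract_comments(source: str) -> str:
      """Concatenate /* ... */ block comments and // line comments."""
      blocks = re.findall(r"/\*.*?\*/", source, flags=re.DOTALL)
      lines = re.findall(r"//[^\n]*", source)
      return " ".join(blocks + lines)

The concatenated text is then turned into a bag of character trigrams, and two documents are compared with the cosine similarity of Equation 1, as for the lexical feature.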
Programmer's text feature. To compute this feature we considered all the text passed as function arguments (in print statements, for instance) or assigned as a string to some variable (e.g., x="Hello World!"). All these texts were concatenated, as with the previous features. Finally, we use the same vector representation based on character trigrams; then, by computing the cosine similarity as in Equation 1, we determine the programmer's text feature for two given source code documents.
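Since both Java and C delimit string literals with double quotes, a minimal sketch of this extraction could be the following; again, it is a simplification that ignores escaped quotes inside literals:

  import re

  def extract_strings(source: str) -> str:
      """Concatenate all double-quoted string literals, i.e. the text
      the program prints or assigns to variables."""
      return " ".join(re.findall(r'"([^"\n]*)"', source))

As before, the feature value for a pair of files is the cosine similarity (Equation 1) of the trigram bags of the extracted texts.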

Structural feature. For this feature we took into account ten attributes to represent a source code document: the number of relational operations, the number of arithmetic operations, the number of assignment statements, the number of function calls, the number of looping statements, the number of write accesses, the number of comments, the number of functions or procedures defined, the number of control flow statements, and the number of return statements. These ten attributes form a vector for each document; the similarity of two given documents is then computed with the cosine similarity of Equation 1.
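A rough sketch of how such counts can be gathered with pattern matching follows; the exact operator and keyword sets used for Java and C are not given in the paper, so the patterns below are illustrative assumptions:

  import re

  # Illustrative patterns, one per attribute, in the order listed above.
  STRUCTURAL_PATTERNS = [
      r"[<>]=?|[!=]=",                        # relational operations
      r"[+\-*/%]",                            # arithmetic operations
      r"[^=!<>+\-*/%]=[^=]",                  # assignment statements (rough)
      r"\w+\s*\(",                            # function calls (also matches definitions)
      r"\b(?:for|while|do)\b",                # looping statements
      r"\b(?:printf|println|print|write)\b",  # write accesses
      r"/\*.*?\*/|//[^\n]*",                  # comments
      r"\w+\s*\([^)]*\)\s*\{",                # functions or procedures defined
      r"\b(?:if|else|switch|case)\b",         # control flow statements
      r"\breturn\b",                          # return statements
  ]

  def structural_vector(source: str) -> list[int]:
      """Ten structural attribute counts, in a fixed order."""
      return [len(re.findall(p, source, flags=re.DOTALL)) for p in STRUCTURAL_PATTERNS]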

5.   EXPERIMENTAL EVALUATION

The evaluation was performed on the training set provided in the shared task (see Section 3). We carried out a series of experiments using single features in order to find the amount of relevant information given by each one of our high level features.

For each experiment we computed the similarity values for each source code pair given in the training set. Then, we measured the performance of each proposed representation by establishing a manual threshold above which two codes are considered plagiarized (re-used). That threshold was varied from 10 to 90 percent of similarity in increments of 10%. For each threshold we evaluated Precision, Recall, and F-measure (computed as described in the evaluation script provided by CL-SOCO 2015 at http://users.dsic.upv.es/grupos/nle/clsoco/).
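A sketch of this sweep, assuming scored_pairs holds one (similarity, is-re-use) entry per labeled training pair for the feature under evaluation:

  def sweep_thresholds(scored_pairs: list[tuple[float, bool]]) -> None:
      """Precision, Recall and F-measure at thresholds 0.1, 0.2, ..., 0.9."""
      for t in (i / 10 for i in range(1, 10)):
          predicted = [sim >= t for sim, _ in scored_pairs]
          actual = [label for _, label in scored_pairs]
          tp = sum(p and a for p, a in zip(predicted, actual))
          precision = tp / sum(predicted) if sum(predicted) else 0.0
          recall = tp / sum(actual) if sum(actual) else 0.0
          f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
          print(f"threshold={t:.1f}  P={precision:.3f}  R={recall:.3f}  F={f:.3f}")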

The results of our evaluation are given in Figures 1 to 5. [Each figure plots Precision, Recall, and F-measure against the similarity threshold for a single feature; only the captions are reproduced here.]

Figure 1: Results of classification when using the lexical feature only, with manually set thresholds.
Figure 2: Results of identification of source code re-use when using the stylistic feature only, with manually set thresholds.
Figure 3: Results of identification of source code re-use when using the comments feature only, with manually set thresholds.
Figure 4: Results of identification of source code re-use when using the programmer's text feature only, with manually set thresholds.
Figure 5: Results of identification of source code re-use when using the structural feature only, with manually set thresholds.

From these results we can observe that there are three
features (lexical, comments, and programmer's text) that perform very well in Precision; this means that, by using all the text written in natural language, almost every pair these features identify as re-used is in fact a re-use case. With the other two features, namely the stylistic and structural ones, we obtained very good Recall. Consequently, we hypothesize that by using all five high level features as the representation of each pair of codes, we may obtain a compromise between Precision and Recall and, hence, a better global performance.

It is worth mentioning that we did not consider the entire training set (599 pairs). The reason is that we realized that some pairs labeled as re-use cases were in fact very different (i.e., they solved completely different problems). In order to remove this noise from our classification models, we computed the lexical similarity of the 599 pairs; then, we computed the average (x̄) and standard deviation (σ) of those similarity values and removed every pair whose similarity fell more than one standard deviation below the mean. In this way we kept only the pairs with a similarity value greater than (x̄ − σ), which gives a total of 477 pairs.
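A sketch of this filtering step; the names pairs (the list of labeled training pairs) and lexical_similarity (the feature of Section 4) are illustrative:

  from statistics import mean, stdev

  def filter_noisy_pairs(pairs, lexical_similarity):
      """Keep only the pairs whose lexical similarity exceeds mean - stdev."""
      sims = [lexical_similarity(a, b) for a, b in pairs]
      cutoff = mean(sims) - stdev(sims)
      return [pair for pair, sim in zip(pairs, sims) if sim > cutoff]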
6.   SUBMITTED RUNS

We submitted three runs for the posed task based on our proposed representation. For our first two submissions (run 1 and run 2) we tackled the task as a binary classification problem, where the classes were re-use and not re-use. We trained our models using a Random Forest algorithm with default parameters in the Weka [11] platform; a sketch of this setup is given below, followed by the description of the three runs.
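The following is a minimal sketch of that setup. Since we used Weka, the scikit-learn classifier below is only a stand-in, and the data-handling names are ours:

  from sklearn.ensemble import RandomForestClassifier

  # X: one row per code pair with the five similarities of Section 4
  # (lexical, stylistic, comments, programmer's text, structural);
  # y: 1 for re-use, 0 for not re-use.
  def train_reuse_model(X: list[list[float]], y: list[int]) -> RandomForestClassifier:
      model = RandomForestClassifier()  # default parameters, as in Weka
      return model.fit(X, y)

  # Usage sketch:
  #   model = train_reuse_model(train_X, train_y)
  #   predictions = model.predict(test_X)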
                                                                                                        Particularly, we proposed the following features: i) lexi-
1. Monolingual model (run 1). As our five proposed features focus on aspects that have little to do with a specific programming language, we trained a model using source code re-use pairs for the C language only; this set is the one provided by SOCO 2014 [8]. It contains 79 source code files with 26 pairs of re-used source codes. Once the model was trained, we classified the test data (from CL-SOCO 2015); the results are shown in Table 1.

2. Cross-language model (run 2). Here we trained our model on the training set of cross-language examples (the subset of 477 pairs); we then classified the test data, and the results of this evaluation are shown in Table 1.

3. Lexical feature only (run 3). For this third run, we computed the lexical similarity of every single pair in the test set. Based on what we observed on the training set (Figure 1), we used a similarity threshold of 30%. The results are shown in Table 1.

Table 1: Results of our submitted runs.

                                 Precision   Recall   F-measure
  Monolingual model (run 1)        0.988     0.634      0.772
  Cross-language model (run 2)     0.620     0.771      0.687
  Lexical feature only (run 3)     0.496     0.962      0.655

In Table 1 we can see that our best model is the monolingual one. This result validates, to some extent, that re-use cases can be identified from aspects that have more to do with the text in natural language than with information about a particular programming language.

To give a better idea of the performance in this task, Table 2 compares our best system (run 1) against the second best system and the average performance of all participant systems. It is worth mentioning that our system obtained the best performance out of a total of 12 systems.

Table 2: Comparison among all participant systems.

                           Precision   Recall   F-measure
  Our run 1                  0.988     0.634      0.772
  Second best                1.000     0.603      0.752
  Average of all systems     0.916     0.611      0.706

Table 2 shows that our monolingual model globally outperforms the other systems; that is, almost every pair we identify as re-used is in fact a re-use case (Precision of 0.988), although we retrieve 63.4% of all actual re-use pairs (Recall of 0.634).

7.   CONCLUSIONS

In this paper, we have described the experiments performed by the Language and Reasoning group from UAM-C in the context of the CL-SOCO 2015 evaluation exercise. Our system was designed to address the problem of cross-language source code re-use detection by employing five high level features within a classification problem.

In particular, we proposed the following features: i) a lexical feature, ii) a stylistic feature, iii) a comments feature, iv) a programmer's text feature, and v) a structural feature. These features are oriented towards aspects that programmers leave in natural language rather than in a particular programming language.

The obtained results indicate that our proposed features can, to some extent, identify cases of source code re-use across the Java and C programming languages. A deeper analysis needs to be performed in order to determine which features are the most useful for this task and whether or not they are correlated.

8.   ACKNOWLEDGMENTS

The authors would like to thank UAM Cuajimalpa for its support.
9.   REFERENCES

[1] Al-Omari, F., Keivanloo, I., Roy, C. K., and Rilling, J. Detecting clones across Microsoft .NET programming languages. In Reverse Engineering (WCRE), 2012 19th Working Conference on (2012), IEEE, pp. 405–414.
[2] Baeza-Yates, R. A., and Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[3] Brixtel, R., Fontaine, M., Lesner, B., Bazin, C., and Robbes, R. Language-independent clone detection applied to plagiarism detection. In Source Code Analysis and Manipulation (SCAM), 2010 10th IEEE Working Conference on (2010), IEEE, pp. 77–86.
[4] Chae, D.-K., Ha, J., Kim, S.-W., Kang, B., and Im, E. G. Software plagiarism detection: a graph-based approach. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (2013), ACM, pp. 1577–1580.
[5] Chuda, D., Navrat, P., Kovacova, B., and Humay, P. The issue of (software) plagiarism: A student view. IEEE Transactions on Education 55, 1 (February 2012), 22–28.
[6] Flores, E. Reutilización de código fuente entre lenguajes de programación [Source code re-use across programming languages]. Master's thesis, Universidad Politécnica de Valencia, Valencia, España, February 2012.
[7] Flores, E., Barrón-Cedeño, A., Rosso, P., and Moreno, L. Towards the detection of cross-language source code reuse. In Natural Language Processing and Information Systems. Springer Berlin Heidelberg, 2011, pp. 250–253.
[8] Flores, E., Rosso, P., Moreno, L., and Villatoro-Tello, E. PAN@FIRE: Overview of SOCO track on the detection of SOurce COde Re-use. In Proceedings of the Sixth Forum for Information Retrieval Evaluation (FIRE 2014) (December 2014).
[9] Flores, E., Rosso, P., Moreno, L., and Villatoro-Tello, E. PAN@FIRE 2015: Overview of CL-SOCO track on the detection of cross-language SOurce COde Re-use. In Proceedings of the Seventh Forum for Information Retrieval Evaluation (FIRE 2015) (December 2015).
[10] Gondaliya, T. P., Joshi, H., and Joshi, H. Source code plagiarism detection, SCPDet: A review. International Journal of Computer Applications 105, 17 (November 2014), 27–31.
[11] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. The WEKA data mining software: An update. SIGKDD Explorations Newsletter 11, 1 (2009), 10–18.
[12] Krinke, J. Identifying similar code with program dependence graphs. In Reverse Engineering, 2001. Proceedings. Eighth Working Conference on (2001), IEEE, pp. 301–309.
[13] Ramírez-de-la-Cruz, A., Ramírez-de-la-Rosa, G., Sánchez-Sánchez, C., Luna-Ramírez, W., Jiménez-Salazar, H., and Rodríguez-Lucatero, C. UAM@SOCO 2014: Detection of source code re-use by means of combining different types of representations.
[14] Vidhya, K., Sumathi, N., and Ramya, D. Cross language higher level clone detection: between two different object oriented programming language source codes. In Proceedings of the International Conference on Inter Disciplinary Research in Engineering and Technology 2014 (2014), ASDF, pp. 21–27.
[15] Wettel, R., and Marinescu, R. Archeology of code duplication: Recovering duplication chains from small duplication fragments. In Symbolic and Numeric Algorithms for Scientific Computing, 2005. SYNASC 2005. Seventh International Symposium on (2005), IEEE, pp. 63–70.