=Paper=
{{Paper
|id=Vol-1587/T1-3
|storemode=property
|title=High Level Features for Detecting Source Code Plagiarism across Programming Languages
|pdfUrl=https://ceur-ws.org/Vol-1587/T1-3.pdf
|volume=Vol-1587
|authors=A. Ramírez-de-la-Cruz,G. Ramírez-de-la-Rosa,C. Sánchez-Sánchez,H. Jiménez-Salazar,C. Rodríguez-Lucatero,W. A. Luna-Ramírez
|dblpUrl=https://dblp.org/rec/conf/fire/Ramirez-de-la-Cruz15
}}
==High Level Features for Detecting Source Code Plagiarism across Programming Languages==
A. Ramírez-de-la-Cruz, G. Ramírez-de-la-Rosa, C. Sánchez-Sánchez, H. Jiménez-Salazar, C. Rodríguez-Lucatero, W. A. Luna-Ramírez
Universidad Autónoma Metropolitana, Unidad Cuajimalpa, México D.F.
Corresponding author e-mail: gramirez@correo.cua.uam.mx

ABSTRACT

In this paper we describe the participation of the Language and Reasoning group from UAM-C in the Cross-Language SOurce COde re-use competition (CL-SOCO 2015). We propose a representation of source code pairs based on five high level features, namely: i) a lexical feature, ii) a stylistic feature, iii) a comments feature, iv) a programmer's text feature, and v) a structural feature. We combine these representations in three ways, each of which was a run submitted to the CL-SOCO competition. The obtained results indicate that the proposed representations provide information that allows particular cases of source code re-use to be detected.

CCS Concepts
• Information systems → Content analysis and feature selection; Near-duplicate and plagiarism detection; • Applied computing → Document analysis.

Keywords
High level features; Document representation; Plagiarism detection; Source code plagiarism

1. INTRODUCTION

Source code re-use identification is of interest on two fronts: in the software industry and in academia. On the one hand, software companies are keen to protect their own software developments, and they invest considerable effort and money in doing so. On the other hand, academia, mainly in computing-related areas, worries that students may plagiarize source code, whether from other students or from forums on the Internet [5].

The Source Code Re-use competition emerged in this context: in 2014 [8] the scientific community was invited to a shared task that evaluated systems for identifying re-used source code in a monolingual scenario. In 2015, the task considers a cross-language scenario, that is, the case where a programmer re-uses source code written in one language, say Java, in another, say C.

In this paper, we present our methodology for finding source code re-use cases across the Java and C programming languages. We use a set of five high level features within a classification problem, namely: lexical, stylistic, comments, programmer's text, and structural features.

The rest of the paper is organized as follows. In Section 2 we describe the research most closely related to our methodology. In Section 3, we briefly describe the shared task. In Section 4, we describe the computation of each of our five proposed features. Section 5 presents the experimental evaluation carried out on the training set. Section 6 gives the details of the submitted runs, and finally, in Section 7, we present our conclusions and some future perspectives.

2. RELATED WORK

There are several approaches and tools for finding plagiarism in source code [10]. Some of the most representative approaches are those that look for syntactic similarities between codes written in the same source language. Some of these are based on searching for similar n-grams or small character sequences (strings) shared by two source codes; examples are the methods proposed by Flores et al. [7] and by Wettel and Marinescu [15]. Likewise, other approaches look for similarities at a deeper, structural level; for example, the approaches proposed by Krinke [12] and by Chae et al. [4] look for dependencies in graph representations of the program (e.g., program dependence graphs, or graphs of how methods are called).

Going beyond a single kind of characteristic, the approach proposed by Ramírez-de-la-Cruz et al. [13] evaluates a set of three different types of features to determine the similarity between source codes: 1) lexical (character n-grams); 2) structural (function names and parameter names and types); and 3) stylistic (the number of lines of code, white spaces, tabulations, empty lines, defined functions, average word length, upper case letters, lower case letters, underscores, vocabulary size, and lexical richness). This combination captures important aspects, with acceptable results, for determining plagiarism between pairs of Java source codes.

Other researchers have focused on identifying similarities or code clones in software written in different languages. Typically, their methods translate the codes into an intermediate language and, at the end, search for similarities within that single language. One example is the proposal of Al-Omari et al. [1], which recognizes clones of software specified in different languages of the .NET framework: the software is analyzed after being transformed into the Common Intermediate Language, the result of compiling source code in .NET, and similarities are then sought there. Another example of the use of an intermediate language is the work of Brixtel et al. [3]. They focus on identifying language-independent code clones: the source codes are preprocessed into an intermediate representation, and an alignment method based on the parallel principle at local resolution (character level) is used to compute similarities between documents.

There are other proposals that do not take the source code itself as the main point, focusing instead on the content of the source code file. That is the case of the approach proposed by Vidhya et al. [14], where two types of metrics are calculated. The first kind works at the method level: number of lines of code, arguments, function calls, local variables, conditional statements, looping statements, return statements, and assignment statements. The second kind works at the file level: number of lines of code, variables declared, methods defined, function calls, and sequences of function calls. The authors compare two documents by the differences in each of these metrics; a clone is then detected using a manual threshold on the average of the computed differences.

As can be observed, some of the methods described above are expensive to apply to large collections (for instance, those based on computing the syntax tree), others depend on translators for each programming language to be used, and others need manual thresholds to identify re-used (plagiarized) source code pairs. In contrast to these previous methods, we propose five high level features that are both easy to compute and cover more than one type of aspect of the source code files.

3. SHARED TASK DESCRIPTION

CL-SOCO, Cross-Language Detection of SOurce COde Re-use, is a shared task that focuses on cross-language source code re-use detection. Participant systems were provided with cross-lingual training and test sets of source code files. The task consists in identifying the source code pairs that have been re-used at the document level across programming languages, in particular Java and C. The details of the task are described in [9]. Note that the relevance judgments represent cases of re-use in both directions, i.e., the direction of the re-use is not to be detected. The provided training set has 599 pairs (Java-C) of source code documents, and the test set has 79 source code documents.

4. PROPOSED HIGH LEVEL FEATURES

In this section we describe our representation of source code documents as a set of five high level features for identifying source code re-use. To compute these features, we represent a document in five different ways. In particular, for the lexical, comments, and programmer's text features, we represent each document as a set of character n-grams; for the stylistic feature, we use eleven attributes; and for the structural feature we use another ten attributes.

The idea behind this set of high level features is to capture aspects of source code that are inherent to the programmer rather than to a particular programming language. Thus, the stylistic feature captures information about the writing style of the programmer; the comments feature attends only to the natural language the programmer uses to explain the code; and the programmer's text feature takes into account the strings that the program produces, that is, text that again has little to do with the programming language per se. Additionally, the lexical and structural features take advantage of the fact that the two languages at hand (Java and C) share, to some extent, some syntax. Next, we describe each of the five proposed high level features.

Lexical feature. The idea behind this representation is to find a global similarity along the entire document using the representation proposed by Flores [6]. We compute this feature in the same way as described in [13]. That is, we use a bag of character trigrams where all white spaces and line breaks are deleted and the letters are lowercased. Additionally, since we know the language of each source code file a priori, we eliminate the reserved words within the document. Given two source code documents D1 and D2, each one is represented as a vector according to the vector space model [2], where the dimension of these vectors is given by the vocabulary of character trigrams in both documents. Finally, the lexical feature is computed as the cosine similarity of the two vectors (see Equation 1):

$$\mathrm{sim}(D_1, D_2) = \frac{\vec{D}_1 \cdot \vec{D}_2}{\lVert \vec{D}_1 \rVert \, \lVert \vec{D}_2 \rVert} \qquad (1)$$
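To make the computation concrete, the following is a minimal sketch of how the lexical feature could be implemented; it is an illustration under our own assumptions, not the authors' code. In particular, the reserved-word list shown is only a small illustrative subset, and the helper names trigram_bag, cosine, and lexical_feature are ours.

```python
import re
from collections import Counter
from math import sqrt

# Illustrative subset only; the full Java/C reserved-word lists used by the
# authors are not given in the paper.
RESERVED = {"int", "char", "float", "double", "void", "return", "if", "else",
            "for", "while", "switch", "case", "public", "static", "class", "new"}

def trigram_bag(source: str) -> Counter:
    """Bag of character trigrams: lowercase, drop reserved words, then
    delete all white spaces and line breaks (lexical feature, Section 4)."""
    text = source.lower()
    text = re.sub(r"\b(?:" + "|".join(RESERVED) + r")\b", "", text)
    text = re.sub(r"\s+", "", text)
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(v1: Counter, v2: Counter) -> float:
    """Cosine similarity of Equation 1 for sparse trigram vectors."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def lexical_feature(doc1: str, doc2: str) -> float:
    return cosine(trigram_bag(doc1), trigram_bag(doc2))
```

The returned value lies in [0, 1] and is what Section 5 compares against the manually set thresholds.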
Stylistic feature. As in [13], we take into account a set of eleven stylistic characteristics of the programmer's writing style. The characteristics are: the number of lines of code, the number of white spaces, the number of tabulations, the number of empty lines, the number of defined functions, the average word length, the number of upper case letters, the number of lower case letters, the number of underscores, the vocabulary size, and the lexical richness (i.e., the total number of tokens over the vocabulary size). To determine the stylistic feature, we build a vector of these attributes for each document and apply the cosine similarity (Equation 1) between the two vectors.
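The eleven attributes are simple surface counts, so a sketch is short. How the authors count "defined functions" is not specified in the paper, so the pattern below is only a placeholder assumption; vector_cosine restates Equation 1 for dense vectors.

```python
import re
from math import sqrt

def stylistic_vector(source: str) -> list:
    """The eleven stylistic attributes listed above (in an arbitrary order)."""
    lines = source.splitlines()
    tokens = re.findall(r"\w+", source)
    vocabulary = set(tokens)
    return [
        len(lines),                                   # lines of code
        source.count(" "),                            # white spaces
        source.count("\t"),                           # tabulations
        sum(1 for l in lines if not l.strip()),       # empty lines
        len(re.findall(r"\w+\s+\w+\s*\([^;{]*\)\s*\{", source)),  # defined functions (placeholder pattern)
        sum(map(len, tokens)) / max(len(tokens), 1),  # average word length
        sum(c.isupper() for c in source),             # upper case letters
        sum(c.islower() for c in source),             # lower case letters
        source.count("_"),                            # underscores
        len(vocabulary),                              # vocabulary size
        len(tokens) / max(len(vocabulary), 1),        # lexical richness
    ]

def vector_cosine(v1, v2) -> float:
    """Equation 1 applied to dense attribute vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```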
Comments feature. As mentioned before, we believe that the explanations and details of procedures given in the form of comments have particularities that are inherent to the programmer; these texts therefore capture information that may be left unmodified when the source code is re-used. Accordingly, we treat everything inside /* ... */ blocks and everything after // as comments. We first concatenate all the text written as comments in a source code document and then compute its character trigrams. The remaining procedure is the same as described above for the lexical feature: the comments feature is obtained from the vector representation and the cosine similarity of Equation 1.

Programmer's text feature. To compute this feature, we consider all the text passed as function arguments (in print statements, for instance) or assigned as a string to some variable (e.g., x="Hello World!"). All these texts are concatenated, as with the previous features. We then use the same character-trigram vector representation and compute the cosine similarity, as in Equation 1, to determine the programmer's text feature for two given source code documents.
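Both of these features only require extracting the relevant text before running it through the same trigram pipeline shown earlier. A minimal regex-based sketch follows; the helper names are ours, and the patterns deliberately ignore corner cases the paper does not discuss (e.g., comment markers inside string literals, or single-quoted character literals).

```python
import re

# /* ... */ blocks (possibly spanning lines) and // line comments.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Double-quoted string literals, allowing escaped characters.
STRING_RE = re.compile(r'"(?:[^"\\]|\\.)*"')

def comment_text(source: str) -> str:
    """Concatenate all comment text in a source code document."""
    return "".join(COMMENT_RE.findall(source))

def programmer_text(source: str) -> str:
    """Concatenate all string literals (print arguments, assigned strings)."""
    return "".join(STRING_RE.findall(source))
```

The comments feature for a pair is then cosine(trigram_bag(comment_text(d1)), trigram_bag(comment_text(d2))), and analogously for the programmer's text feature.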
Structural feature. For this feature we take into account ten attributes of a source code document: the number of relational operations, the number of arithmetic operations, the number of assignment statements, the number of function calls, the number of looping statements, the number of write accesses, the number of comments, the number of functions or procedures defined, the number of control flow statements, and the number of return statements. These ten attributes form a vector for each document; the similarity of two given documents is then computed with the cosine similarity of Equation 1.
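The paper does not specify how each of the ten structural attributes is recognized, so the sketch below counts them with rough patterns shared by Java and C; every pattern (including treating printf/println/write calls as "write accesses") is an assumption for illustration only.

```python
import re

def structural_vector(source: str) -> list:
    """Ten structural attributes counted with rough, language-shared patterns."""
    def count(pattern):
        return len(re.findall(pattern, source))
    return [
        count(r"[<>]=?|[!=]="),                            # relational operations
        count(r"[-+*/%]"),                                 # arithmetic operations
        count(r"(?<![=!<>+\-*/%])=(?!=)"),                 # assignment statements
        count(r"\b(?!if\b|for\b|while\b|switch\b|return\b)\w+\s*\("),  # function calls
        count(r"\b(?:for|while|do)\b"),                    # looping statements
        count(r"\b(?:printf|print|println|write)\b"),      # write accesses (assumed)
        count(r"/\*|//"),                                  # comments
        count(r"\w+\s+\w+\s*\([^;{]*\)\s*\{"),             # functions/procedures defined
        count(r"\b(?:if|else|switch|case)\b"),             # control flow statements
        count(r"\breturn\b"),                              # return statements
    ]
```

The similarity of a pair is again vector_cosine(structural_vector(d1), structural_vector(d2)).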
5. EXPERIMENTAL EVALUATION

The evaluation was performed on the training set provided in the shared task (see Section 3). We carried out a series of experiments using single features in order to measure the amount of relevant information contributed by each of our high level features.

For each experiment we computed the similarity values for every source code pair in the training set. We then measured the performance of each proposed representation by establishing a manual threshold above which two codes are considered plagiarized (re-used). The threshold was varied from 10 to 90 percent similarity in increments of 10%, and for each threshold we evaluated Precision, Recall, and F-measure¹.

Figures 1 to 5 (plots omitted) show Precision, Recall, and F-measure as a function of the similarity threshold when using only the lexical (Figure 1), stylistic (Figure 2), comments (Figure 3), programmer's text (Figure 4), and structural (Figure 5) feature, each with manually set thresholds.

From these results we can observe that three features, the lexical, comments, and programmer's text features, perform very well in Precision; that is, when the text written in natural language signals a pair as re-used, the pair almost always is a true re-use case. With the other two features, stylistic and structural, we obtained very good Recall. Consequently, we hypothesize that using the five high level features together as the representation of each pair of codes may strike a compromise between Precision and Recall, and hence achieve better overall performance.

It is worth mentioning that we did not use the entire training set (599 pairs). The reason is that we noticed that some pairs labeled as re-use cases were in fact very different (i.e., they solved completely different problems). In order to remove this noise from our classification models, we computed the lexical similarity of the 599 pairs; we then computed the mean (x̄) and standard deviation (σ) of those similarity values and removed every pair whose similarity fell more than one standard deviation below the mean. In this way we kept only pairs with a similarity value greater than x̄ − σ, which gives a total of 477 pairs. Both the threshold sweep and this filtering step are sketched in code below.

¹ We compute the F-measure as described in the evaluation script provided by CL-SOCO 2015 at http://users.dsic.upv.es/grupos/nle/clsoco/
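The following is a minimal sketch of both steps under the assumption that the gold standard is available as a set of pair identifiers; the function names sweep_thresholds and filter_noisy_pairs are ours, not from the paper.

```python
from statistics import mean, stdev

def sweep_thresholds(similarities: dict, gold_pairs: set):
    """Evaluate Precision/Recall/F-measure at thresholds 0.1 .. 0.9.
    `similarities` maps a pair id to its similarity score; `gold_pairs`
    holds the pair ids labeled as re-use in the training set."""
    for t in (x / 10 for x in range(1, 10)):
        predicted = {p for p, s in similarities.items() if s >= t}
        tp = len(predicted & gold_pairs)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold_pairs) if gold_pairs else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        print(f"t={t:.1f}  P={precision:.3f}  R={recall:.3f}  F={f:.3f}")

def filter_noisy_pairs(similarities: dict) -> dict:
    """Keep only pairs whose similarity exceeds mean - stdev (477 of 599)."""
    values = list(similarities.values())
    cutoff = mean(values) - stdev(values)
    return {p: s for p, s in similarities.items() if s > cutoff}
```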
6. SUBMITTED RUNS

We submitted three runs for the posed task, all based on our proposed representation. For the first two submissions (runs 1 and 2) we tackled the problem as a binary classification task, where the classes were re-use and no re-use, and we trained our models using a Random Forest algorithm with default parameters in the Weka platform [11]; a sketch of this pipeline appears at the end of this section.

1. Monolingual model (run 1). Since our five proposed features focus on aspects that have little to do with a specific programming language, we trained a model using source code re-use pairs for the C language only; this set is the one provided by SOCO 2014 [8] and contains 79 source code files with 26 pairs of re-used source codes. Once the model was trained, we classified the test data (from CL-SOCO 2015); the results are shown in Table 1.

2. Cross-language model (run 2). Here we trained our model on the cross-language training examples (the subset of 477 pairs described in Section 5) and then classified the test data; the results of this evaluation are also shown in Table 1.

3. Lexical feature only (run 3). For this third run, we computed the lexical similarity of every pair in the test set. Based on the behavior observed on the training set (Figure 1), we used a similarity threshold of 30%. The results are shown in Table 1.

Table 1: Results of our submitted runs.

                                 Precision   Recall   F-measure
  Monolingual model (run 1)        0.988     0.634      0.772
  Cross-language model (run 2)     0.620     0.771      0.687
  Lexical feature only (run 3)     0.496     0.962      0.655

Table 1 shows that our best model is the monolingual one. This result validates, to some extent, that re-use cases can be identified from aspects that have more to do with natural language text than with information about a particular programming language.

To put this performance in context, Table 2 compares our best system (run 1) against the second best system and the average results of all participant systems. It is worth mentioning that our system obtained the best performance out of a total of 12 systems.

Table 2: Comparison among all participant systems.

                           Precision   Recall   F-measure
  Our run 1                  0.988     0.634      0.772
  Second best                1.000     0.603      0.752
  Average of all systems     0.916     0.611      0.706

Table 2 shows that our monolingual model globally outperforms the other systems: it recovers more of the actual re-use pairs than any other participant (Recall of 0.634), and 98.8% of the pairs it flags are in fact re-use cases (Precision of 0.988).
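For completeness, the sketch below shows the shape of the classification pipeline for runs 1 and 2 under two assumptions: it reuses the helper functions from the sketches in Section 4, and it substitutes scikit-learn's Random Forest for the Weka implementation the paper actually used, keeping the default parameters.

```python
from sklearn.ensemble import RandomForestClassifier

def pair_features(d1: str, d2: str) -> list:
    """Five high level similarities for one pair of source code documents."""
    return [
        lexical_feature(d1, d2),
        vector_cosine(stylistic_vector(d1), stylistic_vector(d2)),
        cosine(trigram_bag(comment_text(d1)), trigram_bag(comment_text(d2))),
        cosine(trigram_bag(programmer_text(d1)), trigram_bag(programmer_text(d2))),
        vector_cosine(structural_vector(d1), structural_vector(d2)),
    ]

def train_and_predict(train_pairs, train_labels, test_pairs):
    """train_pairs/test_pairs: lists of (doc1, doc2) source strings;
    train_labels: 1 for re-use, 0 for no re-use."""
    X_train = [pair_features(a, b) for a, b in train_pairs]
    X_test = [pair_features(a, b) for a, b in test_pairs]
    clf = RandomForestClassifier()  # default parameters, mirroring the Weka setup
    clf.fit(X_train, train_labels)
    return clf.predict(X_test)
```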
7. CONCLUSIONS

In this paper, we have described the experiments performed by the Language and Reasoning group from UAM-C in the context of the CL-SOCO 2015 evaluation exercise. Our system addresses the problem of cross-language source code re-use detection by employing five high level features within a classification problem.

In particular, we proposed the following features: i) a lexical feature, ii) a stylistic feature, iii) a comments feature, iv) a programmer's text feature, and v) a structural feature. These features are oriented more toward the traces that programmers leave in natural language than toward a particular programming language.

The obtained results indicate that our proposed features can, to some extent, identify cases of source code re-use across the Java and C programming languages. A deeper analysis needs to be performed in order to determine which features are the most useful for this task and whether or not they are correlated.

8. ACKNOWLEDGMENTS

The authors would like to thank UAM Cuajimalpa for its support.

9. REFERENCES

[1] Al-Omari, F., Keivanloo, I., Roy, C. K., and Rilling, J. Detecting clones across Microsoft .NET programming languages. In Reverse Engineering (WCRE), 2012 19th Working Conference on (2012), IEEE, pp. 405–414.
[2] Baeza-Yates, R. A., and Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[3] Brixtel, R., Fontaine, M., Lesner, B., Bazin, C., and Robbes, R. Language-independent clone detection applied to plagiarism detection. In Source Code Analysis and Manipulation (SCAM), 2010 10th IEEE Working Conference on (2010), IEEE, pp. 77–86.
[4] Chae, D.-K., Ha, J., Kim, S.-W., Kang, B., and Im, E. G. Software plagiarism detection: a graph-based approach. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (2013), ACM, pp. 1577–1580.
[5] Chuda, D., Navrat, P., Kovacova, B., and Humay, P. The issue of (software) plagiarism: A student view. IEEE Transactions on Education 55, 1 (February 2012), 22–28.
[6] Flores, E. Reutilización de código fuente entre lenguajes de programación [Source code re-use across programming languages]. Master's thesis, Universidad Politécnica de Valencia, Valencia, España, February 2012.
[7] Flores, E., Barrón-Cedeño, A., Rosso, P., and Moreno, L. Towards the detection of cross-language source code reuse. In Natural Language Processing and Information Systems. Springer Berlin Heidelberg, 2011, pp. 250–253.
[8] Flores, E., Rosso, P., Moreno, L., and Villatoro-Tello, E. PAN@FIRE: Overview of SOCO track on the detection of SOurce COde Re-use. In Proceedings of the Sixth Forum for Information Retrieval Evaluation (FIRE 2014) (December 2014).
[9] Flores, E., Rosso, P., Moreno, L., and Villatoro-Tello, E. PAN@FIRE 2015: Overview of CL-SOCO track on the detection of cross-language SOurce COde Re-use. In Proceedings of the Seventh Forum for Information Retrieval Evaluation (FIRE 2015) (December 2015).
[10] Gondaliya, T. P., Joshi, H., and Joshi, H. Source code plagiarism detection, SCPDet: A review. International Journal of Computer Applications 105, 17 (November 2014), 27–31.
[11] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. The WEKA data mining software: An update. SIGKDD Explorations Newsletter 11, 1 (2009), 10–18.
[12] Krinke, J. Identifying similar code with program dependence graphs. In Reverse Engineering, 2001. Proceedings. Eighth Working Conference on (2001), IEEE, pp. 301–309.
[13] Ramírez-de-la-Cruz, A., Ramírez-de-la-Rosa, G., Sánchez-Sánchez, C., Luna-Ramírez, W., Jiménez-Salazar, H., and Rodríguez-Lucatero, C. UAM@SOCO 2014: Detection of source code re-use by means of combining different types of representations.
[14] Vidhya, K., Sumathi, N., and Ramya, D. Cross language higher level clone detection between two different object oriented programming language source codes. In Proceedings of International Conference on Inter Disciplinary Research in Engineering and Technology 2014 (2014), ASDF, pp. 21–27.
[15] Wettel, R., and Marinescu, R. Archeology of code duplication: Recovering duplication chains from small duplication fragments. In Symbolic and Numeric Algorithms for Scientific Computing, 2005. SYNASC 2005. Seventh International Symposium on (2005), IEEE, pp. 63–70.