(CLSCR) Cross Language Source Code Reuse Detection using Intermediate Language

Dimpal Shah, Heena Jethani, Hardik Joshi
Gujarat University, Ahmedabad, India.
Dimpalshah38@gmail.com, heenahjethani@gmail.com, joshee@acm.org

ABSTRACT
In today's digital era, information access is just a click away, so computer science students have easy access to source code from many different websites. This has made it difficult for academicians to detect source code reuse in students' programming assignments. A new trend in source code reuse is reusing code by translating it into another programming language, popularly known as cross language plagiarism. Our tool CLSCR addresses this problem. CLSCR has two main components: a compiler that compiles and translates language-specific source code into a tool-specific internal format, and a similarity calculator that computes the similarity between the internal formats of different programs.

Keywords
Cross Language; CLSCR; Tokenization; Learning Management System.

1. INTRODUCTION
Identifying whether students' programming assignments are their original work or have been plagiarized from the internet is of prime importance to academicians. Many tools have been developed to address this problem, for example Sherlock, MOSS and JPlag. All of these tools detect mono language plagiarism.

Mono language plagiarism is the act of producing a source code file from another source code file in the same language by performing only text edit operations, without understanding the granularities of the program.

With advances in research in the field of information retrieval, new techniques of plagiarism have also emerged. One such technique is cross language plagiarism, a modern and smart way of plagiarizing. Cross language plagiarism comes into the picture when a student wants source code for a particular functionality in language A and, while surfing the internet, finds the exact source code for that functionality in language B. The student then plagiarizes by translating the syntax of the commands from B into the syntax of A without understanding how the code works. Our tool CLSCR detects this type of plagiarism. CLSCR works in three phases: language detection, internal format conversion and similarity computation. All three phases are explained in Section 4.

2. DEFINITION
Cross language source code reuse: Cross language plagiarism is also known as translation plagiarism. Let A1 and A2 be two programming languages with A1 ≠ A2. Cross language source code reuse is defined as the translation of a source code P1 ∈ A1 into P2 ∈ A2.

3. RELATED WORK
3.1 Tokenization
Tokenization is the preprocessing step that CLSCR performs before its main processing. It is the process of converting source code into tokens, where a token is the smallest unit that holds meaning in a program. Tokens include:
(1) Identifiers (variables, functions and labels).
(2) Literals.
(3) Operators (for example +, -, /).
(4) Keywords (for example for, while, if).
Figure 1 shows an example Java file (a small HelloWorld class with import statements) and Figure 2 shows its tokenized form; a minimal sketch of this step is given below.

Figure 1. Example of Java file
Figure 2. Tokenization of Java file
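The tokenization step can be sketched in Java as follows. This is a minimal illustration only: the class name SimpleTokenizer, the regular expression and the small keyword list are our own assumptions and are not taken from the actual CLSCR implementation.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative tokenizer: splits one line of source code into the four token
// categories listed above (keywords, identifiers, literals, operators).
public class SimpleTokenizer {

    // Tiny keyword list used only for this example.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "for", "while", "if", "class", "public", "static", "import", "new", "int"));

    // One alternative per token category, tried left to right.
    private static final Pattern TOKEN = Pattern.compile(
            "\\d+(\\.\\d+)?|\"[^\"]*\""        // literals (numbers, strings)
          + "|[A-Za-z_][A-Za-z_0-9]*"          // keywords and identifiers
          + "|[+\\-*/=<>!&|{}();,.]");         // operators and punctuation

    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String lexeme = m.group();
            if (KEYWORDS.contains(lexeme)) {
                tokens.add("KEYWORD:" + lexeme);
            } else if (Character.isLetter(lexeme.charAt(0)) || lexeme.charAt(0) == '_') {
                tokens.add("IDENTIFIER:" + lexeme);
            } else if (Character.isDigit(lexeme.charAt(0)) || lexeme.charAt(0) == '"') {
                tokens.add("LITERAL:" + lexeme);
            } else {
                tokens.add("OPERATOR:" + lexeme);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [KEYWORD:int, IDENTIFIER:sum, OPERATOR:=, IDENTIFIER:a, OPERATOR:+, LITERAL:10, OPERATOR:;]
        System.out.println(tokenize("int sum = a + 10;"));
    }
}
```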
4. DESIGN
CLSCR works in three phases:
(1) Language detection.
(2) Intermediate language generation.
(3) Similarity computation.

Figure 3. Design of CLSCR

PHASE 1: Language detection
The tokenized source code file is given as input to Phase 1. This phase detects the programming language of the file by comparing its tokens with a predefined database consisting of keywords of different programming languages (Figure 4). After detecting the programming language, Phase 1 automatically moves the input file to a specific predefined folder; for example, it moves C++ program files to the C++ folder and Java files to the Java folder.

Figure 4. Comparison between tokens and the keyword database

PHASE 2: Intermediate language (internal format) generation
Phase 2 gets its input files from the different folders, for example the C++ folder and the Java folder. The input files are then processed by the compiler. For example, we have a C++ conversion file, a part of the compiler, for translating the C++ folder files, and a Java conversion file for translating the Java folder files. These translations produce a common internal format; the internal format is a compiler-specific language file. As shown in Figure 5, the internal format files are monolingual. In short, CLSCR translates source code written in different programming languages into one intermediate language. An illustrative sketch of Phases 1 and 2 is given after Figure 5.

Figure 5. Intermediate language generation
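The following fragment sketches how Phase 1 and Phase 2 could be combined: a keyword database is used to detect the source language of a tokenized file, and a per-token mapping table rewrites language-specific constructs into shared internal-format symbols. The class PhaseSketch, the keyword lists and the mapping table are hypothetical and only illustrate the idea; they are not the actual CLSCR compiler.

```java
import java.util.*;

// Sketch of Phases 1 and 2: keyword-based language detection followed by a
// dictionary-driven rewrite into a common internal format.
public class PhaseSketch {

    // Minimal "predefined database" of language-specific keywords (illustrative).
    private static final Map<String, Set<String>> KEYWORD_DB = Map.of(
            "java", Set.of("import", "extends", "implements", "System.out.println", "public"),
            "cpp",  Set.of("#include", "cout", "std", "::", "template"));

    // Phase 1: pick the language whose keywords occur most often among the tokens.
    public static String detectLanguage(List<String> tokens) {
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : KEYWORD_DB.entrySet()) {
            int hits = 0;
            for (String t : tokens) {
                if (entry.getValue().contains(t)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Phase 2: rewrite language-specific tokens into shared internal-format symbols.
    private static final Map<String, String> TO_INTERNAL = Map.of(
            "System.out.println", "PRINT",
            "cout",               "PRINT",
            "import",             "INCLUDE",
            "#include",           "INCLUDE");

    public static List<String> toInternalFormat(List<String> tokens) {
        List<String> internal = new ArrayList<>();
        for (String t : tokens) {
            internal.add(TO_INTERNAL.getOrDefault(t, t));
        }
        return internal;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("import", "java.io", "System.out.println", "\"hello\"");
        System.out.println(detectLanguage(tokens));    // java
        System.out.println(toInternalFormat(tokens));  // [INCLUDE, java.io, PRINT, "hello"]
    }
}
```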
PHASE 3: Similarity computation
This is the last phase of CLSCR. The internal format files generated by Phase 2 are compared with each other to compute similarity. This phase uses the open source plagiarism detector SHERLOCK to calculate the similarity percentage between the internal format files (Figure 6).

Figure 6. Similarity computation

4.1 Sherlock
The SHERLOCK tool allows an instructor to examine a collection of submitted programs for similarities. Each program is stored as a single file and is written using a specific predefined language [1]. Here, our predefined language is our internal format. Sherlock uses the concept of runs and anomalies to detect similarity.

Runs and anomalies: The tool defines a run as a sequence of common lines in two files, where the sequence might not be quite contiguous; there may be a number of extra or deleted lines interrupting the sequence. The allowable size of these interruptions is called the anomalies. The similarity percentage is calculated on the basis of the length of the runs and the anomalies. Table 1 shows an example.

Table 1. Runs and anomalies

Sequence 1      Sequence 2      Sequence 3
Begin           Begin           Begin
Line2           Line2           Extra line
Line3           Extra line      Line3
Line4           Line3           Extra line
Line5           Line4           Line4
Line6           Line5           Extra line
Line7           Line7           Another line
Line8           Line8           Line7

Sequences 1 and 2 form a run with 2 anomalies.

Sherlock usage: To use Sherlock we downloaded the sherlock.C file, which is available online, and compiled it to generate an executable. All files that need to be compared for plagiarism are placed in the same folder as the executable. Sherlock is then run as a command-line program to generate a result file containing the similarity percentages of the files:

sherlock *.java > results.txt

5. EXPERIMENTS AND RESULTS
As an initial effort we have focused on only two object oriented programming languages, C++ and Java, but the approach can be extended to detect many other languages. The tool is evaluated on a data set that is checked for originality, and the degree of plagiarism is computed. For testing, the data set was collected from a third party organization and consisted of 1000+ Java and C++ programs. For now we have tested the tool only on Java and C++, but with slight modification it can be implemented to detect plagiarism between many other languages. After passing all source code files through the three phases of CLSCR, the results obtained show the similarity percentage between the various files.

6. IMPROVED EFFICIENCY
By default, CLSCR compares all files of the C++ folder with all files of the Java folder. These folders may contain 500+ files each. Comparing this large number of files is a tedious task and may take a considerable amount of time. To improve efficiency, an additional preprocessing phase can be introduced.

Preprocessing phase: This phase is implemented before Phase 2. Before the source code is converted into the intermediate language, an attribute comparison among the different source codes is performed. Attributes are general properties of the source code files; they include the number of classes, number of functions, number of objects, number of constructors, number of variables, etc.

We have assigned a weight to each of these general properties according to its importance in plagiarism detection:

Weight of class = cl
Weight of constructor = co
Weight of function = f
Weight of variable = v
Weight of object = o

We then calculate the weight total of the properties:

Weight_total = cl * (no. of classes) + co * (no. of constructors) + o * (no. of objects) + v * (no. of variables) + f * (no. of functions)

Only files having a similar weight total are compared; files with a large difference in weight total have different properties, so their degree of similarity is very low and they are ignored. A sketch of this filter is given below.
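The preprocessing filter can be sketched as follows. The concrete weight values, the threshold and the class names are assumptions chosen for illustration; the paper does not fix them.

```java
import java.util.*;

// Sketch of the preprocessing filter: compute a weighted total of the general
// properties of each file and only pair up files whose totals are close.
public class WeightFilter {

    // Counts of the general properties of one source file.
    static class Attributes {
        String fileName;
        int classes, constructors, objects, variables, functions;

        Attributes(String fileName, int classes, int constructors,
                   int objects, int variables, int functions) {
            this.fileName = fileName;
            this.classes = classes;
            this.constructors = constructors;
            this.objects = objects;
            this.variables = variables;
            this.functions = functions;
        }
    }

    // Assumed values for the weights cl, co, o, v, f from Section 6.
    static final double CL = 5, CO = 4, O = 3, V = 1, F = 2;

    static double weightTotal(Attributes a) {
        return CL * a.classes + CO * a.constructors
             + O * a.objects + V * a.variables + F * a.functions;
    }

    // Keep only the C++/Java pairs whose weight totals differ by at most `threshold`.
    static List<String[]> candidatePairs(List<Attributes> cppFiles,
                                         List<Attributes> javaFiles,
                                         double threshold) {
        List<String[]> pairs = new ArrayList<>();
        for (Attributes c : cppFiles) {
            for (Attributes j : javaFiles) {
                if (Math.abs(weightTotal(c) - weightTotal(j)) <= threshold) {
                    pairs.add(new String[]{c.fileName, j.fileName});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Attributes> cppList  = List.of(new Attributes("a.cpp", 1, 1, 2, 10, 3));
        List<Attributes> javaList = List.of(new Attributes("A.java", 1, 1, 2, 12, 3),
                                            new Attributes("B.java", 4, 3, 8, 40, 12));
        for (String[] p : candidatePairs(cppList, javaList, 5.0)) {
            System.out.println(p[0] + " <-> " + p[1]);   // a.cpp <-> A.java
        }
    }
}
```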
"Plagiarism in programming Language Technologies: Demonstration Session. Association for Computational Linguistics, 2012. assignments." Education, IEEE Transactions on 42.2 (1999): 129-133. [8] Flores, Enrique, et al. "Towards the detection of cross- language source code reuse." Natural Language Processing [2] Đurić, Zoran, and Dragan Gašević. "A source code similarity and Information Systems. Springer Berlin Heidelberg, 2011. system for plagiarism detection." The Computer Journal 250-253 (2012): bxs018. [9] Flores, Enrique, Rosso, Paolo, Moreno, Lidia and Villatoro- [3] Jadalla, Ameera, and Ashraf Elnagar. "PDE4Java: Plagiarism Tello, Esaú: PAN@FIRE 2015:” Overview of CL-SOCO Detection Engine for Java source code: a clustering Track on the Detection of Cross-Language SOurce COde Re- approach." International Journal of Business Intelligence use”. In Proceedings of the Seventh Forum for Information and Data Mining 3.2 (2008): 121-135 Retrieval Evaluation (FIRE 2015), Gandhinagar, India, 4-6 [4] Juričić, Vedran, Tereza Jurić, and Marija Tkalec. December (2015) "Performance evaluation of plagiarism detection method based on the intermediate language." (2011). 18