(CLSCR) Cross Language Source Code Reuse Detection using Intermediate Language

Dimpal Shah, Heena Jethani, Hardik Joshi
Gujarat University, Ahmedabad, India.
Dimpalshah38@gmail.com, heenahjethani@gmail.com, joshee@acm.org

ABSTRACT
In today's digital era, information access is just a click away, so computer science students have easy access to source code from many different websites. This has made it difficult for academicians to detect source code reuse in students' programming assignments. A new trend in source code reuse is reusing code by translating it into another programming language, popularly known as cross language plagiarism. Our tool CLSCR addresses this problem. CLSCR has two main components: a compiler that compiles and translates language-specific source code into a tool-specific internal format, and a similarity calculator that computes the similarity between the internal formats of different programs.

Keywords
Cross Language; CLSCR; Tokenization; Learning Management System.

1. INTRODUCTION
Identifying whether students' programming assignments are their original work or have been plagiarized from the internet is of prime importance to academicians. Many tools have been developed to address this problem, for example Sherlock, MOSS and JPlag. All of these tools detect mono language plagiarism.

Mono language plagiarism is the act of producing a source code file from another source code file in the same language by performing only text edit operations, without understanding the granularities of the program.

With advances in research in the field of information retrieval, new techniques of plagiarism have also emerged. One such technique is cross language plagiarism, a modern and smart way of plagiarizing. Cross language plagiarism comes into the picture when a student wants source code for a particular functionality in language A and, while surfing the internet, finds the exact source code for that functionality in language B. The student then plagiarizes by translating the syntax of the commands from B into the syntax of A without understanding how the code works. Our tool CLSCR detects this type of plagiarism. CLSCR works in three phases: language detection, internal format conversion and similarity computation. All three phases are explained in Section 4.

2. DEFINITION
Cross language source code reuse: Cross language plagiarism is also known as translation plagiarism. Let A1 and A2 be two programming languages with A1 ≠ A2. Cross language source code reuse is defined as the translation of a source code P1 ∈ A1 into P2 ∈ A2.

3. RELATED WORK
3.1 Tokenization
Tokenization is the preprocessing step that CLSCR performs before its main processing. It is the process of converting source code into tokens, where a token is the smallest unit that holds meaning in a program. Tokens include:
(1) Identifiers (variables, functions and labels).
(2) Literals.
(3) Operators (for example +, -, /).
(4) Keywords (for example for, while, if).
Figure 1 shows an example Java file (a small HelloWorld class with import statements) and Figure 2 shows its tokenized form; a minimal sketch of this step is given below.

Figure 1. Example of Java file
Figure 2. Tokenization of Java file
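The tokenization step can be sketched in Java as follows. This is a minimal illustration only: the class name SimpleTokenizer, the regular expression and the small keyword list are our own assumptions and are not taken from the actual CLSCR implementation.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative tokenizer: splits one line of source code into the four token
// categories listed above (keywords, identifiers, literals, operators).
public class SimpleTokenizer {

    // Tiny keyword list used only for this example.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "for", "while", "if", "class", "public", "static", "import", "new", "int"));

    // One alternative per token category, tried left to right.
    private static final Pattern TOKEN = Pattern.compile(
            "\\d+(\\.\\d+)?|\"[^\"]*\""        // literals (numbers, strings)
          + "|[A-Za-z_][A-Za-z_0-9]*"          // keywords and identifiers
          + "|[+\\-*/=<>!&|{}();,.]");         // operators and punctuation

    public static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String lexeme = m.group();
            if (KEYWORDS.contains(lexeme)) {
                tokens.add("KEYWORD:" + lexeme);
            } else if (Character.isLetter(lexeme.charAt(0)) || lexeme.charAt(0) == '_') {
                tokens.add("IDENTIFIER:" + lexeme);
            } else if (Character.isDigit(lexeme.charAt(0)) || lexeme.charAt(0) == '"') {
                tokens.add("LITERAL:" + lexeme);
            } else {
                tokens.add("OPERATOR:" + lexeme);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [KEYWORD:int, IDENTIFIER:sum, OPERATOR:=, IDENTIFIER:a, OPERATOR:+, LITERAL:10, OPERATOR:;]
        System.out.println(tokenize("int sum = a + 10;"));
    }
}
```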
4. DESIGN
CLSCR works in three phases:
(1) Language detection.
(2) Intermediate language generation.
(3) Similarity computation.

Figure 3. Design of CLSCR

PHASE 1: Language detection
The tokenized source code file is given as input to Phase 1. This phase detects the programming language of the file by comparing its tokens with a predefined database consisting of keywords of different programming languages (Figure 4). After detecting the programming language, Phase 1 automatically moves the input file to a specific predefined folder; for example, it moves C++ program files to the C++ folder and Java files to the Java folder.

Figure 4. Comparison between tokens and the keyword database

PHASE 2: Intermediate language (internal format) generation
Phase 2 gets its input files from the different folders, for example the C++ folder and the Java folder. The input files are then processed by the compiler. For example, we have a C++ conversion file, a part of the compiler, for translating the C++ folder files, and a Java conversion file for translating the Java folder files. These translations produce a common internal format; the internal format is a compiler-specific language file. As shown in Figure 5, the internal format files are monolingual. In short, CLSCR translates source code written in different programming languages into one intermediate language. An illustrative sketch of Phases 1 and 2 is given after Figure 5.

Figure 5. Intermediate language generation
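The following fragment sketches how Phase 1 and Phase 2 could be combined: a keyword database is used to detect the source language of a tokenized file, and a per-token mapping table rewrites language-specific constructs into shared internal-format symbols. The class PhaseSketch, the keyword lists and the mapping table are hypothetical and only illustrate the idea; they are not the actual CLSCR compiler.

```java
import java.util.*;

// Sketch of Phases 1 and 2: keyword-based language detection followed by a
// dictionary-driven rewrite into a common internal format.
public class PhaseSketch {

    // Minimal "predefined database" of language-specific keywords (illustrative).
    private static final Map<String, Set<String>> KEYWORD_DB = Map.of(
            "java", Set.of("import", "extends", "implements", "System.out.println", "public"),
            "cpp",  Set.of("#include", "cout", "std", "::", "template"));

    // Phase 1: pick the language whose keywords occur most often among the tokens.
    public static String detectLanguage(List<String> tokens) {
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : KEYWORD_DB.entrySet()) {
            int hits = 0;
            for (String t : tokens) {
                if (entry.getValue().contains(t)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Phase 2: rewrite language-specific tokens into shared internal-format symbols.
    private static final Map<String, String> TO_INTERNAL = Map.of(
            "System.out.println", "PRINT",
            "cout",               "PRINT",
            "import",             "INCLUDE",
            "#include",           "INCLUDE");

    public static List<String> toInternalFormat(List<String> tokens) {
        List<String> internal = new ArrayList<>();
        for (String t : tokens) {
            internal.add(TO_INTERNAL.getOrDefault(t, t));
        }
        return internal;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("import", "java.io", "System.out.println", "\"hello\"");
        System.out.println(detectLanguage(tokens));    // java
        System.out.println(toInternalFormat(tokens));  // [INCLUDE, java.io, PRINT, "hello"]
    }
}
```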
PHASE 3: Similarity computation
This is the last phase of CLSCR. The internal format files generated by Phase 2 are compared with each other to compute similarity. This phase uses the open source plagiarism detector SHERLOCK to calculate the similarity percentage between the internal format files (Figure 6).

Figure 6. Similarity computation

4.1 Sherlock
The SHERLOCK tool allows an instructor to examine a collection of submitted programs for similarities. Each program is stored as a single file and is written using a specific predefined language [1]. Here, our predefined language is our internal format. Sherlock uses the concept of runs and anomalies to detect similarity.

Runs and anomalies: The tool defines a run as a sequence of common lines in two files, where the sequence might not be quite contiguous; there may be a number of extra or deleted lines interrupting the sequence. The allowable size of these interruptions is called the anomalies. The similarity percentage is calculated on the basis of the length of the runs and the anomalies. Table 1 shows an example.

Table 1. Runs and anomalies

Sequence 1      Sequence 2      Sequence 3
Begin           Begin           Begin
Line2           Line2           Extra line
Line3           Extra line      Line3
Line4           Line3           Extra line
Line5           Line4           Line4
Line6           Line5           Extra line
Line7           Line7           Another line
Line8           Line8           Line7

Sequences 1 and 2 form a run with 2 anomalies.

Sherlock usage: To use Sherlock we downloaded the sherlock.C file, which is available online, and compiled it to generate an executable. All files that need to be compared for plagiarism are placed in the same folder as the executable. Sherlock is then run as a command-line program to generate a result file containing the similarity percentages of the files:

sherlock *.java > results.txt

5. EXPERIMENTS AND RESULTS
As an initial effort we have focused on only two object oriented programming languages, C++ and Java, but the approach can be extended to detect many other languages. The tool is evaluated on a data set that is checked for originality, and the degree of plagiarism is computed. For testing, the data set was collected from a third party organization and consisted of 1000+ Java and C++ programs. For now we have tested the tool only on Java and C++, but with slight modification it can be implemented to detect plagiarism between many other languages. After passing all source code files through the three phases of CLSCR, the results obtained show the similarity percentage between the various files.

6. IMPROVED EFFICIENCY
By default, CLSCR compares all files of the C++ folder with all files of the Java folder. These folders may contain 500+ files each. Comparing this large number of files is a tedious task and may take a considerable amount of time. To improve efficiency, an additional preprocessing phase can be introduced.

Preprocessing phase: This phase is implemented before Phase 2. Before the source code is converted into the intermediate language, an attribute comparison among the different source codes is performed. Attributes are general properties of the source code files; they include the number of classes, number of functions, number of objects, number of constructors, number of variables, etc.

We have assigned a weight to each of these general properties according to its importance in plagiarism detection:

Weight of class = cl
Weight of constructor = co
Weight of function = f
Weight of variable = v
Weight of object = o

We then calculate the weight total of the properties:

Weight_total = cl * (no. of classes) + co * (no. of constructors) + o * (no. of objects) + v * (no. of variables) + f * (no. of functions)

Only files having a similar weight total are compared; files with a large difference in weight total have different properties, so their degree of similarity is very low and they are ignored. A sketch of this filter is given below.
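The preprocessing filter can be sketched as follows. The concrete weight values, the threshold and the class names are assumptions chosen for illustration; the paper does not fix them.

```java
import java.util.*;

// Sketch of the preprocessing filter: compute a weighted total of the general
// properties of each file and only pair up files whose totals are close.
public class WeightFilter {

    // Counts of the general properties of one source file.
    static class Attributes {
        String fileName;
        int classes, constructors, objects, variables, functions;

        Attributes(String fileName, int classes, int constructors,
                   int objects, int variables, int functions) {
            this.fileName = fileName;
            this.classes = classes;
            this.constructors = constructors;
            this.objects = objects;
            this.variables = variables;
            this.functions = functions;
        }
    }

    // Assumed values for the weights cl, co, o, v, f from Section 6.
    static final double CL = 5, CO = 4, O = 3, V = 1, F = 2;

    static double weightTotal(Attributes a) {
        return CL * a.classes + CO * a.constructors
             + O * a.objects + V * a.variables + F * a.functions;
    }

    // Keep only the C++/Java pairs whose weight totals differ by at most `threshold`.
    static List<String[]> candidatePairs(List<Attributes> cppFiles,
                                         List<Attributes> javaFiles,
                                         double threshold) {
        List<String[]> pairs = new ArrayList<>();
        for (Attributes c : cppFiles) {
            for (Attributes j : javaFiles) {
                if (Math.abs(weightTotal(c) - weightTotal(j)) <= threshold) {
                    pairs.add(new String[]{c.fileName, j.fileName});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Attributes> cppList  = List.of(new Attributes("a.cpp", 1, 1, 2, 10, 3));
        List<Attributes> javaList = List.of(new Attributes("A.java", 1, 1, 2, 12, 3),
                                            new Attributes("B.java", 4, 3, 8, 40, 12));
        for (String[] p : candidatePairs(cppList, javaList, 5.0)) {
            System.out.println(p[0] + " <-> " + p[1]);   // a.cpp <-> A.java
        }
    }
}
```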
"Plagiarism in programming Language Technologies: Demonstration Session. Association for Computational Linguistics, 2012. assignments." Education, IEEE Transactions on 42.2 (1999): 129-133. [8] Flores, Enrique, et al. "Towards the detection of cross- language source code reuse." Natural Language Processing [2] Đurić, Zoran, and Dragan Gašević. "A source code similarity and Information Systems. Springer Berlin Heidelberg, 2011. system for plagiarism detection." The Computer Journal 250-253 (2012): bxs018. [9] Flores, Enrique, Rosso, Paolo, Moreno, Lidia and Villatoro- [3] Jadalla, Ameera, and Ashraf Elnagar. "PDE4Java: Plagiarism Tello, Esaú: PAN@FIRE 2015:” Overview of CL-SOCO Detection Engine for Java source code: a clustering Track on the Detection of Cross-Language SOurce COde Re- approach." International Journal of Business Intelligence use”. In Proceedings of the Seventh Forum for Information and Data Mining 3.2 (2008): 121-135 Retrieval Evaluation (FIRE 2015), Gandhinagar, India, 4-6 [4] Juričić, Vedran, Tereza Jurić, and Marija Tkalec. December (2015) "Performance evaluation of plagiarism detection method based on the intermediate language." (2011). 18