<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Source Code Authorship Attribution using Stacked classifier</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chanchal Suman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ayush Raj</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sriparna Saha</string-name>
          <email>sriparna@iitp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pushpak Bhattacharyya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff1">
          <institution>IIIT Bhubaneswar</institution>
          ,
          <country country="IN">India</country>
          <email>b518015@iiit-bh.ac.in</email>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Patna</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Source code authorship attribution is the process of identifying the author of a given piece of code. With the increasing number of software submissions on open-source repositories and platforms such as GitHub, CodaLab, Kaggle, and the Codeforces online judge, authors can copy others' code for their own products. Applications of this method include detecting plagiarized code and preventing legal issues. In this work, we apply a tf-idf-based mechanism to represent the given source code. Word and character n-grams are used to generate the code vectors. Finally, the generated vectors are fed to different machine learning classifiers for the prediction task. We applied this methodology to the dataset released by the organizers of AI-SOCO, a shared task of FIRE-2020. The problem statement for the task is "Given the predefined set of source code and their authors, the task is to build a system to determine which one of these authors wrote a given unseen before source code." An accuracy of 82.95% is achieved on the test data, which placed 10th in the competition.</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship attribution</kwd>
        <kwd>abstract syntax tree</kwd>
        <kwd>tf-idf</kwd>
        <kwd>stacking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Identifying the author of a piece of code has become crucial in many
settings, such as authorship disputes, malware attacks, logic bombs, proof of authorship,
and fraud [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Source code is far more formal and restrictive than the natural
languages used in daily life, yet there is still a large degree of flexibility in how a
program can be written [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The traditional approach divides the problem into two
parts: i) extracting a representation vector of the author’s style from the given code, and
ii) building a classifier that performs the prediction task using this representation vector. In this work,
an abstract syntax tree (AST) is used to tokenize the code, and tf-idf is applied
to word and character n-grams to generate the code representation. Different
machine learning classifiers are then used for classification. The best performance
is achieved using word bigrams with a stacked model, which is an ensemble of an
extra-trees classifier, a random forest classifier, and an XGBoost classifier. Authorship Identification of
SOurce COde (AI-SOCO) is a shared task organized at FIRE-2020 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We participated in this
task and achieved an accuracy of 82.95% on the test set.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Concepts Used</title>
      <p>This section introduces the basic concepts of the abstract syntax tree and the stacked
classifier.</p>
      <sec id="sec-2-1">
        <title>2.1. Abstract Syntax Tree</title>
        <p>
          An abstract syntax tree (AST) is a representation of a program’s code in the form of a rooted
tree. Nodes of an AST correspond to different code constructs (e.g., math operations and
variable declarations). The children of a node correspond to the smaller constructs that make up its
code. Different constructs are represented by different node types [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
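        <p>As a minimal illustration of nodes, children, and node types, the sketch below parses a one-line program and lists the constructs found in its tree. It is only a stand-in under stated assumptions: it uses Python's built-in ast module on Python source because that needs no external tooling, whereas the ASTs in this work are built for C and C++ code. The helper names node_types and children are hypothetical.</p>
        <preformat>
```python
import ast

def node_types(source):
    """Return the type names of all AST nodes found in `source`."""
    tree = ast.parse(source)  # the root is an ast.Module node
    return [type(node).__name__ for node in ast.walk(tree)]

def children(node):
    """Return the type names of the direct children of one AST node."""
    return [type(child).__name__ for child in ast.iter_child_nodes(node)]

source = "total = price * 2"
tree = ast.parse(source)
assign = tree.body[0]            # the assignment construct
print(node_types(source))        # every construct in the program
print(children(assign))          # smaller constructs inside the assignment
print(children(assign.value))    # constructs inside the multiplication
```
</preformat>
        <p>The assignment node has the assigned name and the multiplication as children, and the multiplication in turn contains its operands and operator, which mirrors the parent and child relation described above.</p>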
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Stacked Classifier</title>
        <p>Stacked generalization is an ensemble machine learning technique in which several machine
learning algorithms are combined to produce the final prediction. A stacked model uses a
meta-learning algorithm to learn how to combine the predictions from two or more machine learning
algorithms. The benefit of stacking is that combining the outputs of multiple classifiers can
yield better results than any single classifier.</p>
        <p>In stacking, different base machine learning classifiers are combined with the help
of a meta-classifier. The individual classification models are trained on the complete
training set. The outputs of these classifiers are then fed as inputs to the meta-classifier,
which predicts the final result. The meta-classifier can be trained either on the predicted class labels
or on the predicted probabilities from the ensemble.</p>
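        <p>The mechanism can be sketched by hand on toy data. The example below is an illustration under assumptions, not the models used in this work: the base models are one-feature threshold rules, the meta-classifier is a lookup table that maps each tuple of predicted class labels to the majority true label, and the data, function names, and labels are all hypothetical. As in the description above, the base models are trained on the complete training set and the meta-classifier is trained on their predicted labels.</p>
        <preformat>
```python
from collections import Counter, defaultdict

def train_stump(rows, labels, feature):
    """Base model: threshold one feature at the midpoint of the two class means."""
    means = {}
    for cls in (0, 1):
        values = [row[feature] for row, y in zip(rows, labels) if y == cls]
        means[cls] = sum(values) / len(values)
    threshold = (means[0] + means[1]) / 2
    high = 1 if means[1] > means[0] else 0  # class predicted above the threshold
    return lambda row: high if row[feature] > threshold else 1 - high

def train_stacked(rows, labels, base_models):
    """Meta-classifier: majority true label for each tuple of base predictions."""
    votes = defaultdict(Counter)
    for row, y in zip(rows, labels):
        key = tuple(model(row) for model in base_models)  # meta-features
        votes[key][y] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    fallback = Counter(labels).most_common(1)[0][0]  # for unseen tuples
    return lambda row: table.get(tuple(m(row) for m in base_models), fallback)

# Hypothetical toy data: two numeric features, binary labels.
rows = [(1.0, 5.0), (2.0, 6.0), (1.5, 1.0), (8.0, 5.5), (9.0, 1.5), (8.5, 2.0)]
labels = [0, 0, 0, 1, 1, 1]

base_models = [train_stump(rows, labels, f) for f in (0, 1)]
stacked = train_stacked(rows, labels, base_models)
print([base_models[1](row) for row in rows])  # weak base model on its own
print([stacked(row) for row in rows])         # combined prediction
```
</preformat>
        <p>On these toy rows the second base model alone mislabels two of the six training examples, while the stacked predictor reproduces all six training labels, because the meta-classifier learns which base outputs to trust.</p>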
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology Used</title>
      <p>This section describes the data, its preprocessing, vectorization, and
classification.</p>
      <sec>
        <title>3.1. Data</title>
        <p>The dataset is composed of source codes collected from open submissions to the Codeforces
online judge. It contains 100 source codes from each of 1,000 users, for a total of 100,000 codes. The data is
divided into training, development, and test sets containing 50,000, 25,000, and 25,000 codes
written in C++, respectively. All codes are correct, bug-free, compile-ready, and
written for a unique problem.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Data Preprocessing</title>
        <p>The data is preprocessed with the clang library, which generates an abstract syntax tree (AST) for
each source code. The writing style and syntax of code differ from those of natural text, so
ordinary text tokenization is not suitable for splitting code. We use the AST for
this purpose instead. First, the language of each source code is detected with the GuessLang
library for Python. All the codes are written in C or C++, so we use the clang library to
generate the abstract syntax tree. Traversing the AST yields the list of tokens for each code.</p>
      </sec>
      <sec>
        <title>3.3. Vectorization</title>
        <p>A tf-idf vectorizer is used to generate the vector representation of the given source code. The
list of tokens is extracted from the AST, and word and character n-grams are generated from
the token list. While building the vocabulary, we ignore terms whose document
frequency is strictly higher than 0.6 or strictly lower than 0.0; since a document frequency cannot
be below 0.0, only the upper bound removes terms. The code vectors are then
calculated using the traditional tf-idf mechanism. We experimented with word uni-grams, bi-grams,
and tri-grams, and with character bi-grams, tri-grams, and quad-grams.</p>
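        <p>The vectorization step can be sketched in plain Python as follows. This is a minimal illustration under assumptions rather than the implementation used in this work: the token lists are hypothetical stand-ins for the output of an AST traversal, the smoothed idf and L2 normalisation are one common tf-idf variant that the text does not specify, and the function names are invented for the example. Only the upper document-frequency bound of 0.6 is applied, since a lower bound of 0.0 removes nothing.</p>
        <preformat>
```python
import math
from collections import Counter

def word_ngrams(tokens, n_min, n_max):
    """Join every run of n consecutive tokens, for n from n_min to n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs, n_min=1, n_max=2, max_df=0.6):
    """Return (vocabulary, one sparse L2-normalised tf-idf dict per document)."""
    counts = [Counter(word_ngrams(tokens, n_min, n_max)) for tokens in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Keep a term unless its document-frequency ratio is strictly above max_df.
    vocab = sorted(t for t, df in doc_freq.items() if max_df >= df / n_docs)
    vectors = []
    for c in counts:
        # Assumed variant: tf * (ln((1 + N) / (1 + df)) + 1)
        weights = {t: c[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                   for t in vocab if c[t]}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vocab, vectors

# Hypothetical token lists, one per source code.
docs = [
    ["int", "main", "(", ")", "{", "return", "0", ";", "}"],
    ["int", "x", "=", "0", ";"],
    ["long", "long", "n", ";"],
]
vocab, vectors = tfidf_vectors(docs, n_min=1, n_max=2, max_df=0.6)
print(vocab)
print(vectors[2])
```
</preformat>
        <p>In this toy corpus the tokens ";", "int", and "0" occur in at least two of the three codes, so their document frequency exceeds 0.6 and they are dropped from the vocabulary, while author-specific n-grams such as "long long" are kept.</p>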
      </sec>
      <sec id="sec-3-2">
        <title>3.4. Classifier</title>
        <p>The representation vectors of the given codes are fed to different machine learning
classifiers. We also use a stacked model that combines three
classifiers: i) an extra-trees classifier, ii) a random forest, and iii) an XGBoost classifier.
The input is first classified by each of the three classifiers, and their outputs
are then fed to an SVM, which acts as the meta-classifier and predicts the author of the given input.
This architecture is shown in Figure 1. The extra-trees classifier and the random forest classifier
each use 150 estimators. The StackingClassifier
from the mlxtend library is used to stack the models.</p>
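        <p>A comparable configuration can be sketched with scikit-learn. Several substitutions are assumptions made for the sake of a self-contained example and are not the setup of this work: scikit-learn's StackingClassifier replaces the mlxtend one, GradientBoostingClassifier stands in for XGBoost, the SVM uses default settings, and the data is synthetic rather than tf-idf vectors of source code. Note also that scikit-learn builds the meta-classifier's inputs from cross-validated predictions, whereas the stacking described here trains the base models on the complete training set.</p>
        <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the code vectors: 3 "authors", 20 features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("extra_trees", ExtraTreesClassifier(n_estimators=150, random_state=0)),
        ("random_forest", RandomForestClassifier(n_estimators=150, random_state=0)),
        # Assumption: gradient boosting used here in place of XGBoost.
        ("boosting", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=SVC(),  # SVM as the meta-classifier
)
stack.fit(X_train, y_train)
dev_accuracy = stack.score(X_dev, y_dev)
print(f"development accuracy: {dev_accuracy:.3f}")
```
</preformat>
        <p>Swapping in other base classifiers only requires changing the estimators list, which makes it easy to compare a single model against the stacked ensemble on the same development split.</p>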
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section reports the results achieved on the development data using different
approaches. Random forest achieves an accuracy of 75.1% using the word uni-gram
based vectors; the performance for uni-grams is shown in Table 1. Accuracy
increases when uni-grams and bi-grams are used together, as shown in Table 2: the stacked model
achieves 83.20% with word n-grams (n=1, 2). Performance
decreases for vectors created from the combination of word uni-grams,
bi-grams, and tri-grams, as shown in Table 3. Character n-grams are also used to generate
feature representations, but word-based features outperform character-based features. The
experimental results are shown in Table 4. We therefore conclude that the word bi-gram-based
representation with the stacked model performs best on the development dataset.</p>
      <p>On the test data, random forest and the stacked model achieve accuracies of 81.58% and 82.95%,
respectively, using the bi-gram-based representations. The results for the test data are shown in
Table 5.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Source code authorship identification is the process of identifying who wrote a piece of code,
given a set of probable authors. In this work, we used a tf-idf mechanism to
generate the feature vectors for the codes, and the abstract syntax tree to split the
codes into tokens. Experimental results show that word bi-grams perform
better than other combinations of word and character n-grams.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Online Resources</title>
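      <p>The sources for the implementation are available via:</p>
      <list list-type="bullet">
        <list-item><p>Clang</p></list-item>
        <list-item><p>Tf-idf</p></list-item>
        <list-item><p>Source code link</p></list-item>
      </list>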
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Layton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Watters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dazeley</surname>
          </string-name>
          ,
          <article-title>Automatically determining phishing campaigns using the USCAP methodology</article-title>
          ,
          <source>in: 2010 eCrime Researchers Summit</source>
          , IEEE,
          <year>2010</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Caliskan-Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Harang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yamaguchi</surname>
          </string-name>
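          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          ,
          <article-title>De-anonymizing programmers via code stylometry</article-title>
          ,
          <source>in: 24th USENIX Security Symposium (USENIX Security 15)</source>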
          ,
          <year>2015</year>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Frantzeskou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gritzalis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Katsikas</surname>
          </string-name>
          ,
          <article-title>Source code author identification based on n-gram author profiles</article-title>
          ,
          <source>in: IFIP International Conference on Artificial Intelligence Applications and Innovations</source>
          , Springer,
          <year>2006</year>
          , pp.
          <fpage>508</fpage>
          -
          <lpage>515</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fadel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Musleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tufaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Ayyoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jararweh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Benkhelifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of the PAN@FIRE 2020 task on Authorship Identification of SOurce COde (AI-SOCO)</article-title>
          ,
          <source>in: Proceedings of the 12th Meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bogomolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacchelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bryksin</surname>
          </string-name>
          ,
          <article-title>Authorship attribution of source code: A language-agnostic approach and applicability in software engineering</article-title>
          , arXiv preprint arXiv:2001.11593 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>