<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Team Alexa at Authorship Identification of SOurce COde (AI-SOCO)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mutaz Bni Younes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nour Al-Khdour</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Computer Science, Jordan University of Science and Technology, Irbid</institution>
          ,
          <country country="JO">Jordan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>20</lpage>
      <abstract>
<p>In this paper, we discuss our team's effort on the Authorship Identification of Source Code task. The task is about finding the author of a given code. This task provides 100,000 source codes for 1,000 different users; the organizers collected 100 source codes for each user from the open submissions in the Codeforces online judge. Our team tested different approaches; the best one was an ensemble model consisting of MultinomialNB, BernoulliNB, and CodeBERTa with two different runs. Our team achieved a 93.36% accuracy score, and we ranked 3rd on the leaderboard out of 16 teams.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Previous work on identifying the author of a source code has improved over the years through
different techniques, such as better features extracted from the source code, machine
learning models, deep learning models, and state-of-the-art pre-trained models.</p>
      <p>
        Pellin [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] applied a Support Vector Machine (SVM) classifier to determine which of two authors wrote
a given source code. The model was trained on an Abstract Syntax Tree (AST) representation extracted from the code.
The classification accuracy was between 67% and 88%.
      </p>
      <p>
        For source code authorship attribution, Alsulami et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] implemented Long Short-Term
Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models that learn the
structural syntactic features of the Abstract Syntax Tree (AST). These models were evaluated on
two datasets of source codes: a Python dataset collected from Google Code Jam and
a C++ dataset collected from GitHub. The results for the datasets consisting of 25 authors
and 10 authors were as follows: 92% and 80% using LSTM, and 96% and 85% using BiLSTM.
      </p>
      <p>
        Mahbub et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed a system using the stacking ensemble method to identify the
authors of source code segments written by a group of people. The system comprises five
classifiers (DNN, random forest with CART decision trees, random forest with C4.5 decision
trees, C-SVM, and v-SVM). The model converts source codes to metrics and extracts feature
vectors as input to the classifiers; each classifier then predicts the probability of each
author.
      </p>
      <p>
        Abuhamad et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] applied a convolutional neural network (CNN) to identify the author
of programs written in C++, Java, and Python. The CNN model is fed with term
frequency-inverse document frequency (Tf-idf) features as well as word embedding
representations. Their model was evaluated on a dataset collected from Google Code Jam and showed strong
results: 96.2% accuracy for C++ programmers, 95.8% accuracy for Java programmers,
and 94.6% accuracy for Python programmers.
      </p>
      <p>
        Bogomolov et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed two language-agnostic models, Path-based Random Forest (PbRF) and
Path-based Neural Network (PbNN), evaluated on datasets of three popular programming
languages: C++, Python, and Java. Both models recorded competitive results and
outperformed state-of-the-art models; PbRF works better on datasets with a small
number of codes per author, while PbNN handles datasets with a large
number of codes per author.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>
        Authorship Identification of SOurce COde (AI-SOCO) Task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is to discover the author of a piece
of code from its programming style. The dataset was collected from the Codeforces online
judge. It is composed of 100,000 source codes written in the C++ programming language for 1,000
different users; each user has 100 source codes. The dataset is split into training, development,
and testing sets, and it is released in two parts: CSV files and a directory that
contains all source codes. Each CSV file lists the user id and all the
related source code ids in the source codes directory. Table 1 shows the distribution of the dataset.
      </p>
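      <p>For illustration, the following is a minimal sketch of how such a release could be read into memory. The file path and the column names "user_id" and "source_code_id" are assumptions made for this sketch, not the exact names used in the task release.</p>
      <preformat>
# A minimal sketch for reading the task data described above; the paths and
# column names ("user_id", "source_code_id") are assumptions for illustration.
import os
import pandas as pd

train_df = pd.read_csv("train.csv")  # one row per (user_id, source_code_id) pair

def read_source(source_code_id, codes_dir="source_codes"):
    """Load a single source code file from the source codes directory."""
    with open(os.path.join(codes_dir, str(source_code_id)), encoding="utf-8") as f:
        return f.read()

train_codes = [read_source(sid) for sid in train_df["source_code_id"]]
train_labels = train_df["user_id"].tolist()
      </preformat>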
      <p>Evaluation Metric. The evaluation measure for the task is accuracy. Accuracy [11] is the most
widely used performance measure; it is defined as the ratio of correctly predicted observations
to the total number of observations, as given in Equation 1, which is equivalent to Equation 2.</p>
      <p>Accuracy = Number of Correct Predictions / Total Number of Predictions (1)</p>
      <p>Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)</p>
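      <p>As a small illustration of this metric, the sketch below computes accuracy from predicted and gold author labels; the function name and the toy inputs are ours.</p>
      <preformat>
# Minimal illustration of the accuracy metric in Equations 1 and 2:
# the fraction of predictions that match the gold author labels.
def accuracy(predictions, gold_labels):
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

print(accuracy([1, 2, 3, 3], [1, 2, 2, 3]))  # 0.75
      </preformat>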
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section describes our proposed model (the Alexa model), from the feature extraction
phase to the final ensemble model. Our ensemble model consists of three separate models;
the first two are traditional machine learning models that require feature extraction
before training. The extracted features were character-level Tf-idf with n-grams ranging from 1 to 5,
extracted from the source codes. These features were used to train a MultinomialNB
model and a BernoulliNB model. For the third model, we used CodeBERTa. After we
trained all the models, we took the probabilities from each model and multiplied them by
weights chosen to obtain the highest accuracy score on the development dataset. The final weights were
determined for all classifiers based on multiple experiments: 0.5 for the MultinomialNB probabilities, 0.1
for the BernoulliNB probabilities, 0.4 for the probabilities from the first run of CodeBERTa, and
1.4 for the probabilities from the second run of CodeBERTa. This model ranked third on the
leaderboard among 16 teams and achieved a 93.36% accuracy score.</p>
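      <p>The following is a minimal sketch of this weighted combination, assuming the four per-user probability matrices have already been produced by the trained classifiers and share the same user ordering; the function and variable names are ours, not part of the released system.</p>
      <preformat>
# A minimal sketch of the weighted ensemble described above; each input is an
# (n_samples, 1000) array of per-user probabilities from one trained classifier.
import numpy as np

def ensemble_predict(mnb_probs, bnb_probs, codeberta_run1_probs, codeberta_run2_probs):
    combined = (0.5 * mnb_probs
                + 0.1 * bnb_probs
                + 0.4 * codeberta_run1_probs
                + 1.4 * codeberta_run2_probs)
    # The predicted author is the user with the highest weighted probability.
    return np.argmax(combined, axis=1)
      </preformat>
      <p>Because only the argmax of the weighted sum matters, the weights do not need to sum to one; they act as relative importance factors tuned on the development set.</p>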
      <p>The parameters used to train the classifiers are described below and summarized in Table
2:</p>
      <p>MultinomialNB parameters: alpha equal to 0.001 and the default values for the rest of the
parameters; we imported the model from the sklearn library [12]. The features were weighted as follows:
character-level Tf-idf Vectorizer “char”, max features=30000, ngram range=(1,5).</p>
      <p>BernoulliNB parameters: alpha equal to 0.1 and the default values for the rest of the
parameters; we imported the model from the sklearn library [12]. The features were weighted as follows:
character-level Tf-idf Vectorizer “char”, max features=30000, ngram range=(1,5).</p>
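      <p>A minimal sketch of this feature extraction and Naive Bayes training, using the stated parameters, is shown below; the function and variable names are ours, and the data loading is assumed to follow the earlier sketch.</p>
      <preformat>
# A minimal sketch of the char-level Tf-idf features and the two Naive Bayes
# models with the parameters listed above; names are ours, for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

def train_naive_bayes(train_codes, train_labels, dev_codes):
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 5), max_features=30000)
    X_train = vectorizer.fit_transform(train_codes)
    X_dev = vectorizer.transform(dev_codes)

    mnb = MultinomialNB(alpha=0.001).fit(X_train, train_labels)
    bnb = BernoulliNB(alpha=0.1).fit(X_train, train_labels)

    # Per-user probabilities on the development set, later combined in the ensemble.
    return mnb.predict_proba(X_dev), bnb.predict_proba(X_dev)
      </preformat>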
      <p>CodeBERTa first run parameters: num_train_epochs equal to 20, learning_rate equal to 3e-5,
and the default values for the rest of the parameters.</p>
      <p>CodeBERTa second run parameters: num_train_epochs equal to 20, learning_rate equal to
2e-5, and the default values for the rest of the parameters.</p>
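      <p>A minimal fine-tuning sketch for these runs is shown below, using the Hugging Face Trainer API. Only the checkpoint name, the number of users (1,000 labels), the epoch count, and the learning rates come from the text above; the dataset objects, column names, tokenization settings, and output directory are assumptions made for illustration.</p>
      <preformat>
# A hedged sketch of fine-tuning CodeBERTa-small-v1 for 1,000-way author
# classification; dataset wrapping and tokenization details are assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("huggingface/CodeBERTa-small-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "huggingface/CodeBERTa-small-v1", num_labels=1000)

def tokenize(batch):
    # "source_code" is an assumed column name in the tokenized dataset.
    return tokenizer(batch["source_code"], truncation=True, padding="max_length")

# train_dataset / dev_dataset: datasets.Dataset objects with "source_code" and
# "label" columns (assumed names), mapped through tokenize() beforehand.
args = TrainingArguments(output_dir="codeberta-run1",
                         num_train_epochs=20,
                         learning_rate=3e-5)  # 2e-5 for the second run
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=dev_dataset)
trainer.train()
      </preformat>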
      <p>We tested different models that were pre-trained on source code, such as
huggingface/CodeBERTa-small-v1 [13], microsoft/codebert-base [14], huggingface/CodeBERTa-language-id [13],
and codistai/codeBERT-small-v2 [13]. The best model was CodeBERTa-small-v1, a model based on
RoBERTa that is trained on a corpus of source code from GitHub called the CodeSearchNet dataset.
CodeBERTa supports multiple programming languages and consists of 6 layers and 84M
parameters [13].</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this research, we present a novel approach for the AI-SOCO task using an ensemble model
that consists of MultinomialNB, BernoulliNB, and CodeBERTa. The AI-SOCO task focuses on
identifying the author of a source code. The Alexa model extracts character-level Tf-idf features
from the source codes and feeds them into the MultinomialNB and BernoulliNB
models, while the CodeBERTa model takes the source codes directly as input. Our model ranks third in the
competition, with 93.36% accuracy.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gray</surname>
          </string-name>
          , S. MacDonell, P. Sallis,
          <article-title>Software forensics: Extending authorship analysis techniques to computer programs</article-title>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Alrabaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Debbabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>On the feasibility of malware authorship attribution</article-title>
          ,
          <source>in: International Symposium on Foundations and Practice of Security</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>256</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joy</surname>
          </string-name>
          ,
          <article-title>Style analysis for source code plagiarism detection</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Warwick, Coventry, UK,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <article-title>Plagiarism and authorship analysis: introduction to the special issue</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fadel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Musleh</surname>
          </string-name>
          , I. Tufaha,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Ayyoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jararweh</surname>
          </string-name>
          , E. Benkhelifa,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>Overview of the PAN@FIRE 2020 task on Authorship Identification of SOurce COde (AI-SOCO)</article-title>
          ,
          <source>in: Proceedings of the 12th meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Pellin</surname>
          </string-name>
          ,
          <article-title>Using classification techniques to determine source code authorship</article-title>
          , White Paper: Department of Computer Science, University of Wisconsin (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Alsulami</surname>
          </string-name>
          , E. Dauber,
          <string-name>
            <given-names>R.</given-names>
            <surname>Harang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mancoridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          ,
          <article-title>Source code authorship attribution using long short-term memory based networks</article-title>
          ,
          <source>in: European Symposium on Research in Computer Security</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mahbub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Z.</given-names>
            <surname>Oishie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <article-title>Authorship identification of source code segments written by multiple authors using stacking ensemble method</article-title>
          ,
          <source>in: 2019 22nd International Conference on Computer and Information Technology (ICCIT)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abuhamad</surname>
          </string-name>
          , J.-s. Rhim, T. AbuHmed, S. Ullah,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nyang</surname>
          </string-name>
          ,
          <article-title>Code authorship identification using convolutional neural networks</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>95</volume>
          (
          <year>2019</year>
          )
          <fpage>104</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bogomolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacchelli</surname>
          </string-name>
          , T. Bryksin,
          <article-title>Authorship attribution of source code: A language-agnostic approach and applicability in software engineering</article-title>
          , arXiv preprint arXiv:2001.11593 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F. X.</given-names>
            <surname>Diebold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Mariano</surname>
          </string-name>
          ,
          <article-title>Comparing predictive accuracy</article-title>
          ,
          <source>Journal of Business &amp; Economic Statistics</source>
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>134</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Husain</surname>
          </string-name>
          , H.-H. Wu, T. Gazit, M. Allamanis, M. Brockschmidt,
          <article-title>CodeSearchNet Challenge: Evaluating the State of Semantic Code Search</article-title>
          , arXiv:1909.09436 [cs, stat] (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1909.09436.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          , D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou,
          <article-title>CodeBERT: A pre-trained model for programming and natural languages</article-title>
          ,
          <year>2020</year>
          . arXiv:2002.08155.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>