<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>N-gram-based Authorship Identification of Source Code</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yunpeng Yang</string-name>
          <email>yunpengy@outlook.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong</string-name>
          <email>kongleilei@fosu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoliang Qi</string-name>
          <email>qihaoliang@fosu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper addresses the source code authorship identification task (AI-SOCO) of PAN@FIRE 2020 (Forum for Information Retrieval Evaluation): given a set of C++ source files without author labels, identify the most likely author of each file. The task has practical applications such as attributing malware to its authors and detecting cheating in academic coursework and online coding competitions. We treat the task as multi-class classification and extract word n-gram and character n-gram features to train a logistic regression classifier. In the final evaluation, our method reached an accuracy of 0.9428, ranking second. The experiments show that character n-gram features give the highest prediction accuracy among the feature sets we tested.</p>
      </abstract>
      <kwd-group>
        <kwd>N-gram</kwd>
        <kwd>Authorship Identification</kwd>
        <kwd>Source Code</kwd>
        <kwd>Multi-classification</kwd>
        <kwd>Logistic Regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>Detailed statistics of the dataset are given in Table 1 and Table 2. Comparing the two tokenization schemes, word n-grams and character n-grams, we find that features extracted with character 2-grams to 7-grams perform better. We then use the TF-IDF method to compute the TF, IDF, DF, and TF * IDF values of these features. In particular, the IDF is computed using Eq. 1:</p>
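      <p><disp-formula id="eq-1"><tex-math>\mathrm{IDF}(t) = \log \frac{1 + n_d}{1 + \mathrm{df}(d, t)} \qquad (1)</tex-math></disp-formula></p>
      <p>where n_d is the total number of documents and df(d, t) is the number of documents that contain the term t.</p>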
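      <p>As an illustration of Eq. 1, the following minimal Python sketch extracts character 2-grams to 7-grams from a few toy fragments, counts document frequencies, and evaluates the IDF. The toy fragments and the helper names char_ngrams and idf are assumptions made for this example and are not part of the original system; the natural logarithm is also an assumption.</p>
      <preformat>
```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=7):
    """Return all character n-grams of text for n from n_min to n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the raw characters, whitespace included.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def idf(term, doc_freq, n_docs):
    """IDF following Eq. 1: log((1 + n_d) / (1 + df(d, t)))."""
    return math.log((1 + n_docs) / (1 + doc_freq[term]))


# Hypothetical toy corpus: three tiny C++ fragments standing in for submissions.
docs = ["int main(){}", "int x=0;", "for(;;){}"]

# Document frequency: number of documents in which each n-gram occurs at least once.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(char_ngrams(doc)))

rare = idf("for", doc_freq, len(docs))   # "for" occurs in one document
common = idf("in", doc_freq, len(docs))  # "in" occurs in two documents
print(rare, common)
```
      </preformat>
      <p>In this toy corpus the n-gram that occurs in fewer documents receives the larger IDF, which is the behaviour the weighting relies on.</p>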
      <p>For feature filtering, we delete features whose DF values are too high or too low, and normalize the TF * IDF values of the remaining features to obtain the new features, denoted as "Features with TF-IDF weights" in Fig. 1. Finally, logistic regression is used as the multi-class classifier to train the AI-SOCO model. We use the logistic regression implementation provided by sklearn<sup>1</sup>, with the parameters set to C=1.0, max_iter=100, multi_class='ovr', penalty='l2', solver='liblinear' and tol=0.0001.</p>
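      <p>The steps described in this section can be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the system's actual code: the training fragments and author labels are hypothetical, the min_df and max_df values are placeholders because the DF thresholds are not reported, and the one-vs-rest scheme is expressed with OneVsRestClassifier because recent scikit-learn releases deprecate the multi_class argument. The L2 penalty is scikit-learn's default and is therefore not passed explicitly. Note also that TfidfVectorizer's default smoothed IDF adds 1 to the logarithm, so its weights may differ slightly from Eq. 1.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: three "authors" with two C++ fragments each.
train_codes = [
    "int main(){int a=0;return a;}",
    "int main(){int b=1;return b;}",
    "int main ( ) { long long x = 0 ; return 0 ; }",
    "int main ( ) { long long y = 1 ; return 0 ; }",
    "using namespace std;\nvoid solve(){}\nint main(){solve();}",
    "using namespace std;\nvoid solve(){}\nint main(){solve();solve();}",
]
train_authors = [
    "author_a", "author_a",
    "author_b", "author_b",
    "author_c", "author_c",
]

model = make_pipeline(
    # Character 2-grams to 7-grams with L2-normalized TF-IDF weights.
    # min_df and max_df are placeholder thresholds (assumption): with these
    # values no feature is actually filtered out.
    TfidfVectorizer(
        analyzer="char",
        ngram_range=(2, 7),
        min_df=1,
        max_df=1.0,
        norm="l2",
    ),
    # One binary logistic regression per author (one-vs-rest).
    OneVsRestClassifier(
        LogisticRegression(C=1.0, max_iter=100, solver="liblinear", tol=1e-4)
    ),
)
model.fit(train_codes, train_authors)

# Predict the author of an unseen fragment written in the first author's style.
predicted = model.predict(["int main(){int c=2;return c;}"])
print(predicted[0])
```
      </preformat>
      <p>On real data, min_df and max_df would be tuned on the development set to remove n-grams that are too rare or too common.</p>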
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset</title>
      <p>The dataset consists of source code collected from public submissions to the Codeforces online judge. It contains 100,000 source files from 1,000 authors, with 100 files per author, all written in C++. Detailed information for the dataset is given in Table 1 and Table 2.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Experimental Results</title>
      <p>Performance on the source code author identification task is measured by accuracy. Table 3 shows the final evaluation results of the top five entries.</p>
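      <p>Accuracy here is the fraction of source files whose predicted author matches the true author. A minimal sketch of the computation is given below; the function name accuracy and the label lists are hypothetical and are not results from the paper.</p>
      <preformat>
```python
def accuracy(true_authors, predicted_authors):
    """Fraction of source files whose predicted author equals the true author."""
    if len(true_authors) != len(predicted_authors):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)


# Hypothetical labels for five source files (not results from the paper).
y_true = ["a1", "a2", "a3", "a1", "a2"]
y_pred = ["a1", "a2", "a1", "a1", "a2"]
print(accuracy(y_true, y_pred))
```
      </preformat>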
      <p>We tried several feature extraction methods to observe their effects. The experimental results are shown in Table 4 and Table 5.</p>
      <p>Table 4 shows that, with word 1-gram features, the logistic regression model and the random forest model perform better than the other models. We therefore trained both models and, after analyzing the results, chose the logistic regression model.</p>
      <p>In conclusion, first, training the model on TF-IDF filtered features yields higher accuracy. Second, character n-grams perform better than word n-grams. In the final evaluation we therefore use TF-IDF filtered character 2+3+4+5+6+7-gram features to train the logistic regression model, which reaches an accuracy of 0.93556 on the development set.</p>
      <p><sup>1</sup> https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. We used the logistic regression model documented on this page for the multi-class classification task.</p>
    </sec>
    <sec id="sec-6">
      <title>4. Conclusions</title>
      <p>This paper presents an n-gram-based method for authorship identification of source code, using logistic regression as the multi-class classifier. The experimental results show that the model trained on character 2-gram to 7-gram features filtered with TF-IDF performs best, reaching a final accuracy of 0.9428. With this method, the correct author is identified for most source files.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Acknowledgements</title>
      <p>This work is supported by the National Natural Science Foundation of China (No. 61806075 and
No. 61772177), and the Natural Science Foundation of Heilongjiang Province (No. F2018029).</p>
    </sec>
    <sec id="sec-8">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
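        <mixed-citation>
          [1]
          <string-name><given-names>Ali</given-names> <surname>Fadel</surname></string-name>,
          <string-name><given-names>Husam</given-names> <surname>Musleh</surname></string-name>,
          <string-name><given-names>Ibraheem</given-names> <surname>Tuffaha</surname></string-name>,
          <string-name><given-names>Mahmoud</given-names> <surname>Al-Ayyoub</surname></string-name>,
          <string-name><given-names>Yaser</given-names> <surname>Jararweh</surname></string-name>,
          <string-name><given-names>Elhadj</given-names> <surname>Benkhelifa</surname></string-name>,
          <string-name><given-names>Paolo</given-names> <surname>Rosso</surname></string-name>.
          <article-title>Overview of the PAN@FIRE 2020 Task on Authorship Identification of SOurce COde (AI-SOCO)</article-title>.
          In:
          <source>Proceedings of the 12th Meeting of the Forum for Information Retrieval Evaluation (FIRE 2020)</source>,
          <year>2020</year>.
        </mixed-citation>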
      </ref>
    </ref-list>
  </back>
</article>