<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Hyderabad, India. EMAIL: hanyong@fosu.edu.cn (*corresponding author)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Ranking-based and Classification-based Approaches for Code Author Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tang Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyu Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujie Xu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Menghan Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiran Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhengyu Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Heilongjiang University</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In this paper, we propose two approaches, a ranking-based approach and a classification-based approach, for the source code author identification (AI-SOCO) task of FIRE 2020 (Forum for Information Retrieval Evaluation). The ranking-based approach ranks the source codes according to the number of occurrences of character 15-grams, while the classification-based approach uses the TF-IDF weights of terms as features to learn a classifier. Although the ranking-based approach is very simple, it achieves an accuracy of 0.9157 in the evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. The proposed approaches</title>
      <p>
        Like existing methods, our first approach treats the task as a classification problem.
We extract the TF-IDF features of each source code as the input and use the authors as
labels to train a multiclass classification model. We choose Random Forest [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ] as the learning
algorithm for the classifier. At prediction time, given the TF-IDF features of a source code, we
predict its author from the classification result.
      </p>
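      <p>To make the TF-IDF features concrete, the following is a minimal sketch rather than the implementation used in the experiments. It assumes whitespace tokenization and the textbook weighting tf * log(N / df); the helper names tokenize and tfidf_vectors and the three toy programs are hypothetical. Libraries such as scikit-learn apply smoothed and normalized variants of this weighting.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(code: str) -> List[str]:
    """Split source code on whitespace (an assumed, deliberately simple tokenizer)."""
    return code.split()


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document.

    Assumed weighting: raw term frequency times log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


# Toy source codes (hypothetical data, pre-split with spaces between tokens).
codes = [
    "int main ( ) { return 0 ; }",
    "int main ( ) { int x ; return x ; }",
    "long long solve ( ) { return 1 ; }",
]
vecs = tfidf_vectors(codes)
```
]]></preformat>
      <p>Terms that occur in every code, such as the parentheses here, receive weight zero, while author-specific terms such as "long" are emphasized. Each vector would then be paired with its author label to train the classifier.</p>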
      <p>
        The second approach is ranking-based. Unlike the classification-based approach,
our ranking-based approach regards the AI-SOCO task as
a ranking problem. The motivation for using a ranking-based approach stems from our research on
plagiarism detection and microblog filtering. In these tasks, we found that ranking-based models were
more effective than classification-based models [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3-6</xref>
        ]. We merge the codes written by the same author into a single document,
denoted as d<sub>i</sub>, where i indexes the authors. First, we compute the similarity between a given
source code (denoted as q) and the merged codes d<sub>i</sub> of each author i. Then we rank the d<sub>i</sub> according to
the similarity score. Finally, the author of the top-ranked document is chosen as the result. For the similarity
computation, we tried the number of character-based n-gram occurrences.
      </p>
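      <p>The ranking procedure can be sketched as follows. This is an illustration under stated assumptions, not the submitted system: the exact counting rule is not specified in the text, so the score here is assumed to be the number of n-gram occurrences of q that are matched in d_i (the sum of the minimum counts). The function names, the toy codes, and n = 5 are hypothetical; the submitted runs used 15-grams and 20-grams.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Dict, List, Tuple


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_profiles(train: List[Tuple[str, str]], n: int) -> Dict[str, Counter]:
    """Merge each author's codes into one document d_i and count its n-grams."""
    merged: Dict[str, List[str]] = {}
    for author, code in train:
        merged.setdefault(author, []).append(code)
    return {a: char_ngrams("\n".join(codes), n) for a, codes in merged.items()}


def rank_authors(query: str, profiles: Dict[str, Counter], n: int) -> List[Tuple[str, int]]:
    """Rank authors by the number of n-gram occurrences of q matched in d_i."""
    q = char_ngrams(query, n)
    scores = {
        author: sum(min(count, d[gram]) for gram, count in q.items())
        for author, d in profiles.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy training codes (hypothetical data).
train = [
    ("alice", "for(int i=0;i<n;i++) cin>>a[i];"),
    ("alice", "for(int j=0;j<m;j++) cout<<b[j];"),
    ("bob", 'for (int i = 0; i < n; ++i) { scanf("%d", &a[i]); }'),
]
profiles = build_profiles(train, n=5)
ranking = rank_authors("for(int k=0;k<n;k++) cin>>c[k];", profiles, n=5)
best_author = ranking[0][0]
```
]]></preformat>
      <p>The query shares its compact spacing and its use of cin with the first author, so that author's merged document is ranked first.</p>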
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset</title>
      <p>AI-SOCO provides a dataset consisting of 100,000 Codeforces source codes (from 1,000 different
authors, 100 source codes per author). These codes are correct, bug-free, and written in C++. 50,000 source
codes (the training set) may be used to train models, 25,000 source codes (the validation set)
may be used only to select models, and the remaining 25,000 source codes (the test set) are used for testing.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Evaluation metrics</title>
      <p>The task is evaluated using accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Model selection</title>
      <sec id="sec-6-1">
        <title>3.3.1 Model selection of classification-based approach</title>
        <p>We use the TfidfVectorizer tool provided by scikit-learn to convert text data into TF-IDF feature
vectors, and train several classifiers, namely OneVsRestClassifier(LinearSVC), DecisionTreeClassifier,
LogisticRegression, KNeighborsClassifier, and RandomForestClassifier (oob_score=False,
random_state=None), with default parameters. The results are shown in Table 1. According to
Table 1, we choose Random Forest as the classifier.</p>
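        <p>A comparison of this kind might be set up as in the sketch below. It assumes scikit-learn is installed and uses the classifiers named above with their default parameters; the toy codes, the author labels, and the variable names are hypothetical stand-ins for the training and validation sets, so the accuracies it produces say nothing about the real task.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the training and validation sets (hypothetical data).
train_codes = [
    "int main() { int n; cin >> n; cout << n; return 0; }",
    "int main() { int m; cin >> m; cout << m + 1; return 0; }",
    "int main() { int k; cin >> k; cout << k * 2; return 0; }",
    'long long solve() { long long x; scanf("%lld", &x); printf("%lld", x); }',
    'long long solve() { long long y; scanf("%lld", &y); printf("%lld", y + 1); }',
    'long long solve() { long long z; scanf("%lld", &z); printf("%lld", z * 2); }',
]
train_authors = ["a", "a", "a", "b", "b", "b"]
val_codes = [
    "int main() { int t; cin >> t; cout << t - 1; return 0; }",
    'long long solve() { long long w; scanf("%lld", &w); printf("%lld", w - 1); }',
]
val_authors = ["a", "b"]

candidates = {
    "OneVsRest(LinearSVC)": OneVsRestClassifier(LinearSVC()),
    "DecisionTree": DecisionTreeClassifier(),
    "LogisticRegression": LogisticRegression(),
    "KNeighbors": KNeighborsClassifier(),
    "RandomForest": RandomForestClassifier(oob_score=False, random_state=None),
}

# Fit the vectorizer on the training codes only, then reuse it for validation.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_codes)
X_val = vectorizer.transform(val_codes)

# Validation accuracy of each candidate classifier.
results = {}
for name, clf in candidates.items():
    clf.fit(X_train, train_authors)
    results[name] = accuracy_score(val_authors, clf.predict(X_val))

best = max(results, key=results.get)
```
]]></preformat>
        <p>Fitting the vectorizer on the training codes alone keeps the validation set usable for model selection only, in line with the dataset rules.</p>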
        <p>In the ranking-based approach, we rank each d according to its similarity score. For the
similarity measure, we tried two methods. The first uses the traditional vector space retrieval model
to compute the similarity. In our experiments,
the vector space model was computed with Lucene (an information retrieval toolkit) using its
default parameters. The second ranks the authors according to the number of occurrences
of character-based n-grams that appear in both the author's codes and the given code to be predicted.</p>
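        <p>For the vector space alternative, a minimal sketch is given below. It is not Lucene's scoring function: it assumes raw term counts and plain cosine similarity, and the tokenizer, the function names, and the toy documents are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import re
from collections import Counter


def term_vector(code: str) -> Counter:
    """Bag-of-terms vector: word-like tokens plus single punctuation symbols."""
    return Counter(re.findall(r"\w+|[^\w\s]", code))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# One merged document per author (hypothetical data) and a query code.
author_docs = {
    "alice": "int n ; cin >> n ; cout << n ;",
    "bob": "long long x ; scanf ( x ) ; printf ( x ) ;",
}
query = "int m ; cin >> m ;"

q_vec = term_vector(query)
scores = {a: cosine(q_vec, term_vector(d)) for a, d in author_docs.items()}
top_author = max(scores, key=scores.get)
```
]]></preformat>
        <p>Replacing term_vector with a character n-gram counter gives the n-gram variant of the same model.</p>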
        <p>Table 2 and Table 3 show the performance of the ranking-based approaches using the
number of n-gram occurrences and the vector space model, respectively. From Table 2 and Table 3, we can
see that the approach using the vector space model performs significantly worse than
the approach using the number of n-gram occurrences. We also note that the approach based on the
vector space model performs poorly when terms are used as features, but improves
considerably when n-grams are used as features. As shown in Table 3,
the performance of 4-grams, 5-grams, and 6-grams is lower than that of n-grams with n &gt; 10. A possible reason is that
n-grams cannot capture long-distance character features when n is small. To
some extent, this shows that n-grams with small n cannot express the writer's coding
style well. Finally, we submitted the results based on 15-gram occurrences and 20-gram occurrences.</p>
        <p>[Table: Accuracy 0.8274]</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.4. The performance of our submitted results</title>
      <p>
        We submitted three groups of results. The results of our submissions on the test data
        are shown in Table 4. Table 5 shows the best results of the top 5 teams [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>[Table 4: RandomForest; Accuracy 0.9157, 0.9105, 0.8025]</p>
      <p>[Table 5: yang1094; Alexa; AI-SOCO RoBERTa Code Baseline (6L, 12H); LAST; FSU_HLJIT]</p>
    </sec>
    <sec id="sec-8">
      <title>4. Discussion and Conclusions</title>
      <p>We proposed a ranking-based approach and a classification-based approach for the AI-SOCO
task of FIRE 2020. The ranking-based approach ranks the source codes according to the number of
occurrences of character-based n-grams, while the classification-based approach uses the
TF-IDF of terms as features to learn a classifier. Although the ranking-based approach is very simple,
it achieves acceptable performance. This shows that longer n-grams can capture the author’s coding profile
effectively. However, n-gram occurrence alone
is too narrow a feature to express the code profile fully. For example, neither the length of the code nor other
global factors are considered in the n-gram-based approach. Which granularities are more suitable for
code author identification remains a topic for further research.</p>
      <p>In addition, much work on the ranking-based approach proposed in this paper remains
unfinished because of a lack of time. The experiments are inefficient and the performance does not
meet our expectations. Approaches based on generative or discriminative models have not yet been
attempted in the evaluation. In the future, we plan to further develop the ranking-based model to
improve its performance on AI-SOCO, for instance by using a language model to model the authors’
profiles or by using a learning-to-rank algorithm to learn a model that ranks the authors.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Acknowledgements</title>
      <p>This work is supported by the National Social Science Fund of China (No. 18BYY125).</p>
    </sec>
    <sec id="sec-10">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <article-title>"Random forests"</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2001</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Geurts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ernst</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wehenkel</surname>
          </string-name>
          ,
          <article-title>"Extremely randomized trees"</article-title>
          ,
          <source>Machine Learning</source>
          ,
          <volume>63</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Lei-lei</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhong-yuan</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hao-liang</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mu-yun</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Source Retrieval Model Focused on Aggregation for Plagiarism Detection</article-title>
          .
          <source>Information Sciences</source>
          .
          <year>2019</year>
          ,
          <volume>503</volume>
          :
          <fpage>336</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Leilei</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhongyuan</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haoliang</given-names>
            <surname>Qi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhimao</given-names>
            <surname>Lu</surname>
          </string-name>
          .
          <article-title>A Ranking-based Text Matching Approach for Plagiarism Detection</article-title>
          .
          <source>IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences</source>
          .
          <year>2018</year>
          ,
          <volume>101</volume>
          :
          <fpage>799</fpage>
          -
          <lpage>810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Leilei</given-names>
            <surname>Kong</surname>
          </string-name>
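          ,
          <string-name>
            <given-names>Zhimao</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhongyuan</given-names>
            <surname>Han</surname>
          </string-name>
          ,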
          <string-name>
            <given-names>Haoliang</given-names>
            <surname>Qi</surname>
          </string-name>
          .
          <article-title>A ranking approach to source retrieval of plagiarism detection</article-title>
          .
          <source>IEICE Trans. Information and Systems</source>
          .
          <year>2017</year>
          ,
          <volume>E100-D</volume>
          (
          <issue>1</issue>
          ):
          <fpage>203</fpage>
          -
          <lpage>205</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
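          <string-name>
            <given-names>Zhongyuan</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Muyun</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leilei</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haoliang</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sheng</given-names>
            <surname>Li</surname>
          </string-name>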
          .
          <article-title>A Hybrid Model to Realtime Microblog Filtering</article-title>
          .
          <source>Chinese Journal of Electronics</source>
          .
          <year>2016</year>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ):
          <fpage>432</fpage>
          -
          <lpage>440</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
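          <string-name>
            <given-names>Ali</given-names>
            <surname>Fadel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Husam</given-names>
            <surname>Musleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ibraheem</given-names>
            <surname>Tuffaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mahmoud</given-names>
            <surname>Al-Ayyoub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yaser</given-names>
            <surname>Jararweh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elhadj</given-names>
            <surname>Benkhelifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Overview of the PAN@FIRE 2020 Task on Authorship Identification of SOurce COde (AI-SOCO)</article-title>
          . In:
          <source>Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation</source>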
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>