=Paper=
{{Paper
|id=Vol-2826/T5-2
|storemode=property
|title=UoB at AI-SOCO 2020: Approaches to Source Code Classification and the Surprising Power of n-grams
|pdfUrl=https://ceur-ws.org/Vol-2826/T5-2.pdf
|volume=Vol-2826
|authors=Alexander Crosby,Harish Tayyar Madabushi
|dblpUrl=https://dblp.org/rec/conf/fire/CrosbyM20
}}
==UoB at AI-SOCO 2020: Approaches to Source Code Classification and the Surprising Power of n-grams==
Alexander Crosby, Harish Tayyar Madabushi
University of Birmingham, Edgbaston, Birmingham, B15 2TT, United Kingdom

Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India
email: AlexCrosby@live.co.uk (A. Crosby); Harish@HarishTayyarMadabushi.com (H. T. Madabushi)
url: https://HarishTayyarMadabushi.com/ (H. T. Madabushi)
orcid: 0000-0001-5260-3653 (H. T. Madabushi)

Abstract

Authorship identification of source code is the process of identifying the composer of given source code. Code authorship identification plays an important role in many real-world scenarios, such as the detection of plagiarism and ghost writing in both education and workplace settings. Additionally, it can allow the identification of individuals or organisations that produce and distribute malware programs. In this paper we describe the experimentation and submission by team UoB to the AI-SOCO track at FIRE 2020, which achieved first place. We first perform extensive testing on a variety of techniques used in source code authorship identification, including n-gram, stylometric, and abstract syntax tree derived features. We also investigate the application of CodeBERT, a new pre-trained model that demonstrates state-of-the-art performance in natural and programming language tasks. Finally, we explore the potential of ensembling multiple models together to create a single superior model. Our winning model utilises byte-level n-grams extracted from source codes to build feature vectors that represent an author's programming style. These feature vectors are then used to train a densely connected neural network model to carry out authorship classification on previously unseen source codes, achieving an accuracy of 95.11%.

Keywords: authorship identification, source code, machine learning, n-grams

1. Introduction

Source code authorship attribution is the task of identifying the author of a given piece of code [1]. The main concept behind authorship attribution is that each author uses a number of stylistic traits when writing code that can be used as a fingerprint to distinguish one author from another [2]. Authorship identification, therefore, is accomplished by identifying these stylistic fingerprints and using statistical and machine learning models to attribute source code to an author [3].

There are a number of real-world applications of source code attribution, such as the detection of plagiarism [4], or the use of a ghost-writer in academic, workplace and other environments. Additionally, code authorship attribution can be used to identify authors of malware [5], who may attempt to conceal their identity or may obfuscate their code to hide its function and origin [6].

Authorship Identification of Source Code (AI-SOCO) 2020 is a track organised for the Forum for Information Retrieval Evaluation (FIRE) 2020 [7]. The track asks for the identification of effective strategies that solve the problem of source code authorship attribution in issues related to "cheating, in academic, work and open source environments" and that help in the detection of authors of malware [7].
The track involves a pre-defined set of 100,000 source codes written by 1,000 unique authors who had submitted code as part of Codeforces online programming competitions [8]. This dataset was broken down to create a training, development, and test set of source codes. The test set contained 25,000 source codes which did not have an attributed author, leaving 75,000 pieces of source code for training and development. Of this, 50,000 source codes were used for training and 25,000 for development. Using these datasets, participants were required to build a system that determined the author of each unlabelled source code. In the development stage, participants are ranked based on their model's performance on the development set. In the evaluation stage, participants are ranked on their model's performance on the 25,000 unlabelled source codes.

This paper describes our submissions to the AI-SOCO 2020 track (source code and data are published at https://github.com/AlexCrosby/AI-SOCO). Our multi-faceted approach evaluates a variety of different techniques previously used in authorship identification tasks and additionally creates an ensemble model which pulls on the strengths of each individual model. We find that our n-gram-based approach described in Section 3.3.2 outperformed all other models we investigated, including modern state-of-the-art approaches, and achieved first place on both the development and test datasets.

2. Related Work

This section provides an overview of different approaches to source code authorship attribution, in order to identify promising techniques that inform the construction of our models.

2.1. Stylometric Approaches

Stylometry is perhaps the oldest technique in code authorship attribution and is based on the identification of features in written text that could be linked to stylistic choices in a person's writing. One such example is the work of Krsul and Spafford [9], who proposed a total of 49 different features in 3 main areas: (1) layout-specific metrics such as indentation and comment style; (2) programming style such as variable, function and comment naming style; and (3) program structure such as usage of specific data structures, presence of debugging identifiers or assertions, and error handling.

2.2. n-gram Approaches

n-grams have been used successfully in a variety of Natural Language Processing (NLP) tasks, including spell checking, language modelling and authorship attribution of text, and have since been ported to the field of source code authorship attribution.

Frantzeskou et al. [10] introduced the Source Code Author Profile (SCAP) method, a profile-based approach that creates a profile of frequently used byte-level n-grams for each author. Classification can then be carried out by finding the profile that best matches an unlabelled source code, counting the n-grams common to the source code and each profile.

Kothari et al. [11] demonstrated that character-level n-grams could be combined with stylometric features to boost prediction accuracy, suggesting that combining different authorship identification techniques could be a promising approach to use in this track.
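To make this representation concrete, the following is a minimal sketch (not the authors' implementation) of how a source code can be decomposed into the byte-level n-grams used by SCAP and related approaches; the example source string is illustrative only.

```python
# A minimal illustrative sketch (not the authors' implementation) of
# decomposing a source code into byte-level n-grams, the basic unit used by
# SCAP and the other n-gram approaches described above.
from collections import Counter

def byte_ngrams(source: str, n: int = 6) -> Counter:
    """Count every byte-level n-gram of length n in a source code string."""
    data = source.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

code = "#include <iostream>\nint main() { return 0; }\n"
print(byte_ngrams(code, n=6).most_common(3))
```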
2.3. Abstract Syntax Tree Approaches

Abstract Syntax Trees (ASTs) are a representation of the abstract syntactic structure of a source code, organised as a hierarchical tree, and are a common source of information in program analysis tasks.

Caliskan-Islam et al. [12] proposed a stylometric feature set extracted from a source code's AST, combined with lexical and layout features elicited directly from the source code in line with more classical stylometry methodologies, and fed into a random forest classifier. Caliskan-Islam et al. [12] found that the feature that best discriminated between authors was AST node bigrams, where each bigram was made up of a node and the parent node it was directly connected to. The term frequencies of these bigrams, when used for classification, demonstrated an accuracy comparable to using the bigrams in combination with the extracted lexical and layout features.

2.4. CodeBERT

CodeBERT [13] is a pre-trained multi-layer bidirectional Transformer model based on the same architecture used in RoBERTa. Unlike RoBERTa, CodeBERT is designed to be used in both natural language and programming language applications, such as code documentation generation and natural language code search. CodeBERT produces word-level vectors for a source code using contextual information taken from the surrounding words, as opposed to information from the code's AST as is done by models such as code2vec [14] and code2seq [15].

3. Methodology

This section outlines the techniques used to generate each model used for authorship attribution in the AI-SOCO 2020 track. Each model detailed was devised to identify meaningful vector embeddings for source codes and assess their overall effectiveness, with the end goal being the ensembling of a variety of different techniques to provide an overall superior model.

3.1. Preprocessing

Depending on the features required by each model, the source codes were preprocessed to make extraction of said features possible. Two different preprocessed datasets were generated for this purpose. The first involved the removal of all comment lines and the unfolding of "#define" preprocessor directives, which may otherwise obfuscate features extracted from the program structure; this step is also a requirement for AST extraction. The second preprocessed dataset consisted of the ASTs for each source code, extracted directly from the first preprocessed dataset. Both of these preprocessing stages were carried out using the tool astminer [16] on Google Cloud Platform's Compute Engine.

3.2. Initial Experimentation

Our initial experimentations were expansions on the character count and bag-of-words term frequency-inverse document frequency (TF-IDF) vectorisation techniques used in the baseline models, plus a variant on the character count where only A-Z characters were considered, ignoring case. These vectors were all used to train four different machine learning models: Logistic Regression, k-nearest neighbours, a Naïve Bayes classifier, and a support vector classifier. While these models did not perform particularly well, achieving a best accuracy of 74.996% using the TF-IDF vectors in a Logistic Regression model as shown in Table 1, they did identify that the Naïve Bayes classifier was both quick to train and moderately accurate, making it a good candidate model for evaluating the vectorisation techniques used in further models.
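For illustration, the sketch below shows one way such a baseline pipeline could be built, assuming a scikit-learn implementation (the paper does not state which libraries were used); train_codes, train_authors, dev_codes and dev_authors are placeholders for the AI-SOCO training and development splits.

```python
# A sketch of the Section 3.2 baselines, assuming scikit-learn (the paper does
# not name its libraries). train_codes, train_authors, dev_codes and
# dev_authors are placeholders for the AI-SOCO training and development splits.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

vectoriser = TfidfVectorizer()                     # word-level TF-IDF bag of words
X_train = vectoriser.fit_transform(train_codes)
X_dev = vectoriser.transform(dev_codes)

for model in (LogisticRegression(max_iter=1000), MultinomialNB()):
    model.fit(X_train, train_authors)
    accuracy = accuracy_score(dev_authors, model.predict(X_dev))
    print(type(model).__name__, f"{accuracy:.3%}")
```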
3.3. n-gram Models

3.3.1. Source Code Author Profile Method

The SCAP method discussed in Section 2.2 was implemented based on its success in cases where only very short pieces of code are available, such as in our dataset [17]. In this method, character or byte-level n-grams are extracted from the raw source codes and used to create vector representations. A profile is generated for each author by creating the set of the L most commonly occurring n-grams across that author's source codes. Then, to calculate an unseen source code's similarity with each profile, the Simplified Profile Intersection (SPI) is measured. Letting Pa be the author profile n-gram set and Ps be the n-gram set of the previously unseen source code, the SPI value between Pa and Ps is given by the magnitude of the intersection of the two sets (see Equation 1).

SPI = |Pa ∩ Ps|   (1)

The unseen source code is hence attributed to the author profile achieving the highest SPI value.

To select the n-gram size n, the profile length L, and whether to use character or byte n-grams, an exhaustive grid search was carried out to identify the best-performing settings. In addition to a set L value, we also investigated using an unlimited profile length while excluding n-grams that only have a frequency of 1, as proposed by Tennyson and Mitropoulos [18]. Overall, this technique achieved an accuracy of 92.212% on the development dataset, as shown in Section 4.1.1.

3.3.2. Instance-based n-gram Models

A second n-gram-based approach was proposed based on the success of the SCAP method (Section 3.3.1). In this model, the raw source codes were decomposed into their constituent character or byte-level n-grams. Unlike SCAP, however, these n-grams were represented as a bag of n-grams used to train machine learning models to predict authorship through the co-occurrence of n-grams in any given source code. A Naïve Bayes classifier was used to identify the best candidate representations, and the best n-gram representation identified at this stage was then used downstream as the input to a neural network classifier model.

In addition to character and byte-level n-grams, a bag-of-words (BoW) model was also conceived. In this model, vectors represented word-level n-grams extracted from the training dataset. Unlike the other n-gram models, only word n-grams of size 1 were used; however, the entire vocabulary of the training dataset, a total of 60,770 words, was used rather than being limited to a specified maximum size, as discussed in Section 5.5.

This instance-based model achieved an accuracy of 95.416% on the development dataset, as detailed in Section 4.1.2. Due to its increased accuracy over the profile-based SCAP method, only the instance-based n-gram model was used going forwards. The final accuracy of this model on the test dataset was 95.11%, as mentioned in Section 4.4.
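As an illustration, the sketch below shows one possible implementation of this instance-based model, assuming scikit-learn for vectorisation and Keras for the classifier (neither library is named in the paper) and using the vector parameters and architecture later reported in Section 4.1.2; character-level n-grams stand in for byte-level ones, the hidden-layer activation, epoch count and batch size are assumptions, and train_codes, train_labels, dev_codes and dev_labels are placeholders for the AI-SOCO data with authors encoded as integers.

```python
# A sketch of the instance-based n-gram model, assuming scikit-learn and Keras
# (the paper does not name its libraries). Vector parameters and architecture
# follow Section 4.1.2: binary vectors over the 20,000 most frequent 6-grams,
# hidden layers of 3,000 and 2,000 units, dropout of 0.5, RMSprop at 1e-4.
# Character n-grams stand in for the paper's byte-level n-grams; the ReLU
# activation, epochs and batch size are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow import keras

vectoriser = CountVectorizer(analyzer="char", ngram_range=(6, 6),
                             max_features=20_000, binary=True)
X_train = vectoriser.fit_transform(train_codes).toarray()    # dense for brevity
X_dev = vectoriser.transform(dev_codes).toarray()

model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(3000, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(2000, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1000, activation="softmax"),          # one class per author
])
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, train_labels, validation_data=(X_dev, dev_labels),
          epochs=20, batch_size=128)
```

In practice, with 50,000 training codes and 20,000 features, the vectors would likely be kept sparse or fed in batches rather than converted to a dense array as above.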
3.4. Abstract Syntax Tree Model

In this approach, features were derived from the preprocessed dataset containing the AST structures extracted from the source codes. This model was developed based on its successful implementation by Caliskan-Islam et al. [12] and Wisse and Veenman [19]. These ASTs contain information relating to the node type, the code that relates to said node, and the relationships between each node. The nodes are tokenised by substituting each unique node with a numerical representation. Likewise, n-grams of nodes are tokenised, in which a node bigram consists of a node and its parent node, and so on. These tokens are then used to create a vector representation describing the occurrence of the most commonly occurring tokens in the training dataset. Like the n-gram model, multiple AST vector variants were evaluated in order to identify the best-performing vector parameters. These parameters were then used to generate the vectors used downstream to train a neural network classifier model. The final accuracy achieved by this model on the development dataset was 80.052%, as shown in Section 4.2.

3.5. Stylometry Model

In this model, 136 different stylistic and layout features were extracted from both the raw source codes and the preprocessed source codes without comments and "#define" directives. 100 of these 136 features were counts of the printable characters, as used in the baseline model. The remaining 36 features are documented in the repository released alongside this paper and were collated due to their common use in multiple papers on this topic [12, 20, 21]. The 136-dimensional vectors representing these features for each source code were then used to train a densely-connected neural network classifier model. This model achieved an accuracy of 75.376% on the development dataset, as shown in Table 1.

3.6. CodeBERT Model

This model introduces CodeBERT to the task of authorship attribution, a domain that, to the authors' knowledge, the model has not been applied to previously. We fine-tuned the CodeBERT model (provided at https://github.com/microsoft/CodeBERT) for authorship attribution using an NVIDIA K80 GPU on an Amazon Web Services p2.xlarge instance for 10 epochs, using the Adam optimiser at a learning rate of 2 × 10⁻⁵. As shown in Table 1, this model achieved an accuracy of 86.724% on the development dataset.

3.7. Weighted Average Ensemble

The idea behind ensembling is that each model, when independently trained, is likely to have different strengths when classifying source codes. By combining individual models, their strengths can be pulled together to get a more accurate classification than by any one model alone [22]. Following the creation of the previously mentioned models, five candidates were selected based on performance and difference in feature representations: the n-gram, BoW, AST, stylometry and CodeBERT models.

One ensembling technique experimented with was a weighted averaging procedure. In an average ensemble, this is achieved by simply averaging the SoftMax outputs from each model to get the average prediction of all models. A common problem with this method is that if one model performs significantly worse, it can drag the overall prediction accuracy down. To combat this, each model is given a different weighting, allowing some models to contribute more to the pooled classification, and others less [22]. Two different optimisation techniques were tested to identify the best weight values: Powell's conjugate direction method [23] and Differential Evolution [24]. These two methods were used in an attempt to avoid getting stuck in a local minimum through reliance on any single optimisation method. The best-performing weights on the development dataset were selected for the final ensemble.

This model outperformed all other models on the development dataset, achieving an accuracy of 95.715%, as discussed in Section 4.3. However, on the test dataset it did not manage to outperform the instance-based n-gram model, which achieved an accuracy of 95.11%, as shown in Section 4.4.
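The weight search can be illustrated as follows, assuming SciPy's implementations of Powell's method and Differential Evolution (the paper names the algorithms but not the implementation); probs is a placeholder list of each model's SoftMax output matrix on the development set and dev_labels the true author indices.

```python
# A sketch of the weighted-average ensemble and its weight search (Section
# 3.7), assuming SciPy's optimisers (the implementation is not specified in
# the paper). probs is a placeholder list of per-model SoftMax outputs on the
# development set, each of shape (n_samples, n_authors); dev_labels holds the
# true author indices.
import numpy as np
from scipy.optimize import minimize, differential_evolution

def ensemble_accuracy(weights, probs, labels):
    weights = np.clip(weights, 0.0, None)
    weights = weights / weights.sum()                      # normalise the weights
    combined = sum(w * p for w, p in zip(weights, probs))  # weighted SoftMax average
    return (combined.argmax(axis=1) == labels).mean()

loss = lambda w: -ensemble_accuracy(w, probs, dev_labels)  # minimise negative accuracy
n_models = len(probs)

powell = minimize(loss, x0=np.full(n_models, 1.0 / n_models), method="Powell")
de = differential_evolution(loss, bounds=[(0.0, 1.0)] * n_models, seed=0)

best = min((powell, de), key=lambda result: result.fun)
weights = np.clip(best.x, 0.0, None)
print("best weights:", weights / weights.sum())
```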
3.8. Discarded Methods

3.8.1. Convolutional Neural Network Model

Another deep-learning model evaluated was a Convolutional Neural Network (CNN) model. In this model, each word in the source codes is given a numerical token representation, and this ordered list of tokens is then fed into the CNN model. The model contains five main layers. The first is the embedding layer, which learns a multidimensional vector representation for each word during training, specific to our task. This sequence of embeddings is then fed into a 1-dimensional convolutional layer, which extracts features useful for classification decisions over the sequence of word embeddings. This is then fed into a MaxPooling layer to downsample the convolutional layer's output feature map, reducing model size and training time. The pooled output is then fed into the LSTM layer of the network to generate a final source code vector that is classified by a final SoftMax layer.

Unfortunately, in initial testing of this model the results were poor, failing to surpass any of our previous models with an accuracy of 73.880%, as shown in Table 1. As we did not have enough time to fully explore the capability of this model, we decided to discontinue further research on it in favour of our better-performing models.

3.8.2. Stacked Neural Network Ensemble

In addition to the weighted average ensemble discussed in Section 3.7, a stacked neural network ensembling technique was also trialled. In this model, a single multi-headed neural network was constructed. This model uses the same architectures as the individual models but concatenates them all at their final hidden layer, before the final SoftMax classification layer. This also allows all models to be trained simultaneously, allowing the models to diverge from their original versions in ways that improve ensembling. Due to the size of this model, a few concessions were made to reduce it: first, the CodeBERT model was excluded from the ensemble since it was the largest individual model by a considerable margin. Instead, the final hidden layer values from the CodeBERT model were pre-calculated and input directly into this model at the concatenation layer. Additionally, only the single best n-gram model from those discussed in Section 3.3 was used. Due to its size, this model experienced significant overfitting issues during development which would have required significant changes to overcome. It was hence discarded in favour of the more promising weighted average ensemble strategy.

4. Results

This section outlines our findings and analysis of the results of applying the models defined in Section 3. The highest accuracies achieved at each stage of experimentation on the development dataset are presented in Table 1. The n-gram-based neural network model achieved the best individual model accuracy at 95.416%. The top ensemble accuracy achieved was 95.716%, obtained through the weighted averaging ensemble of the five models discussed in Section 3.7. All models were evaluated based on their accuracy on the development dataset, as that was the only metric considered in evaluating and ranking submitted systems in the AI-SOCO 2020 track.

Table 1: Final Model Results on the Development Dataset.

Model                                          Accuracy (%)
Weighted Average Ensemble                      95.715
n-gram Model                                   95.416
SCAP Model                                     92.212
Stacking Neural Network Ensemble               89.160
CodeBERT Model                                 86.724
BoW Model                                      82.960
AST Model                                      80.052
Stylometry Model                               75.376
Word TF-IDF Logistic Regression Initial Model  74.996
Convolutional Neural Network Model             73.880

4.1. n-gram Results

4.1.1. Source Code Author Profile Method

The highest accuracy achieved by the SCAP method on the development dataset was 92.212%, obtained using n = 9 and L = 8000 with character-level n-grams.
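For concreteness, a minimal sketch of the SCAP classification step (Section 3.3.1) with the best settings reported above is shown below; author_sources is a placeholder mapping each author to a list of their training source codes.

```python
# A minimal sketch of SCAP classification (Section 3.3.1) with the best
# settings reported above: character-level 9-grams and a profile length of
# L = 8000. author_sources is a placeholder mapping each author to a list of
# their training source codes.
from collections import Counter

def char_ngrams(source: str, n: int = 9) -> Counter:
    return Counter(source[i:i + n] for i in range(len(source) - n + 1))

def build_profile(sources, n=9, L=8000):
    counts = Counter()
    for src in sources:
        counts.update(char_ngrams(src, n))
    return {gram for gram, _ in counts.most_common(L)}   # the L most frequent n-grams

profiles = {author: build_profile(srcs) for author, srcs in author_sources.items()}

def classify(source, n=9):
    grams = set(char_ngrams(source, n))
    # Simplified Profile Intersection (Equation 1): SPI = |Pa ∩ Ps|
    return max(profiles, key=lambda author: len(profiles[author] & grams))
```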
4.1.2. n-gram Model Results

As outlined in Section 3.3.2, initial vector parameter exploration was carried out using a Naïve Bayes model to find the best combination to feed to the neural network classifier model. The best overall accuracy at this stage was achieved using character-level n-gram vectors made up of the normalised feature counts of the 10,000 most commonly occurring 8-grams in the source codes, achieving an accuracy of 85.648% on the development dataset.

During the neural network hyperparameter optimisation stage, we found that different vector parameters were more effective. Experimentations on these parameters in the neural network model are shown in Table 2. The final vector parameters selected used byte-level 6-grams, with a binary representation of the 20,000 most commonly occurring 6-grams. This achieved the final accuracy of 95.416% on the development dataset, as shown in Table 1. The architecture of the neural network used two hidden layers containing 3,000 and 2,000 neurons respectively. The model also used dropout layers with a dropout rate of 0.5 between the two hidden layers and between the second hidden layer and the final output layer. The model used an initial learning rate of 1 × 10⁻⁴ with the RMSProp algorithm.

Table 2: n-gram Vector Exploration Results Using the Neural Network Classifier on the Development Dataset.

n-gram Size  Enumeration Type  n-gram Level  Accuracy (%)
6            Binary            Byte          95.416
7            Binary            Byte          95.292
5            Binary            Byte          95.240
4            Binary            Byte          95.204
6            Binary            Character     95.135
8            Binary            Byte          95.072
9            Binary            Byte          95.000
10           Binary            Byte          94.860
8            Count             Byte          93.336
8            Count             Character     93.191
6            Count             Byte          93.180
8            TF-IDF            Byte          92.080
6            TF-IDF            Byte          91.264

4.1.3. Bag-of-Words Model Results

For the BoW model, we found that the highest accuracy achieved was 82.96% on the development dataset, using a binary enumeration representation.

4.2. Abstract Syntax Tree Model Results

The highest accuracy achieved using ASTs used node unigrams. These nodes were enumerated with a binary representation, in which only the 20,000 most commonly occurring nodes were represented. Using a Naïve Bayes classifier model, this representation achieved an overall accuracy of 64.308% on the development dataset. This vector representation was then used to train a neural network classifier, achieving an accuracy of 80.052% on the development dataset, as shown in Table 1.
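To illustrate the node-bigram representation behind this model, the sketch below counts (parent, child) node bigrams using Python's built-in ast module on a toy snippet; the paper itself extracts C++ ASTs with astminer, so this is only an analogy for the idea rather than the actual pipeline.

```python
# An illustrative sketch of the (parent, child) node-bigram counting behind
# the AST model (Sections 2.3, 3.4 and 4.2). The paper extracts C++ ASTs with
# astminer; Python's built-in ast module is used here only to illustrate the
# idea on a toy snippet.
import ast
from collections import Counter

def ast_node_bigrams(source: str) -> Counter:
    """Count (parent node type, child node type) bigrams in a Python AST."""
    tree = ast.parse(source)
    bigrams = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            bigrams[(type(parent).__name__, type(child).__name__)] += 1
    return bigrams

print(ast_node_bigrams("def f(x):\n    return x + 1\n").most_common(5))
```

The counts of the most frequent node tokens and bigrams across the training set would then form fixed-length vectors for the neural network classifier, mirroring the n-gram model.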
4.3. Weighted Average Ensemble Results

Both Powell's conjugate direction method and Differential Evolution were used as optimisation techniques to find the best weights for the models of the ensemble, as discussed in Section 3.7. Both algorithms concurred on the weights for each model, which, when used in ensembling, give an accuracy of 95.716% on the development dataset, as shown in Table 1. The weights derived by Powell's method and Differential Evolution for each model were: n-gram, 0.3079; CodeBERT, 0.19504; BoW, 0.09437; AST, 0.31464; and stylometry, 0.08805.

4.4. Submitted Results

For the development phase of the AI-SOCO 2020 track, we submitted the results from our n-gram model, which achieved an accuracy of 95.416%. The weighted average ensemble result was not submitted as this phase ended prior to its completion. In this phase we achieved first place.

For the evaluation phase, we again submitted the n-gram model predictions on the test set, alongside the predictions from the weighted average ensemble. The results of our models in this phase are shown in Table 3. With these results, we again managed to take first place in this phase of the track.

Table 3: Model performance in the evaluation phase.

Model                      Accuracy (%)
n-gram Model               95.11
Weighted Average Ensemble  93.82

5. Analysis and Discussion

In this section, we discuss some observations made in two of our most powerful models that subverted the authors' expectations, as well as analysing the models' strengths and weaknesses.

5.1. Weighted Average Ensemble

While the weighted average ensemble was the best model overall, its results do highlight some issues. Firstly, there is only a 0.3% accuracy increase over the n-gram model alone. From our ensemble analysis, it is clear that whatever the other models can classify correctly, the n-gram model can typically classify correctly as well, in addition to making predictions that the other models cannot, resulting in the ensemble's component models providing little benefit. It is also evident that when models share some degree of overlap, they are less effective in the final ensemble. This indicates that the final ensemble would likely benefit from having a more diverse set of constituent models rather than models that extract features from the same raw textual data.

5.2. n-gram Model

The n-gram model was ultimately our most powerful author classifier, with only the similar SCAP method coming close to the same accuracy. This demonstrates that character and byte-level n-grams are likely the best individual representation for the source codes in our dataset. Frantzeskou et al. [17] discuss how their SCAP method has strong performance even when there is very limited training data available per author. Our results suggest that this good performance has less to do with the SCAP method itself and more to do with n-gram features being more powerful than other vector representations on these limited-feature datasets, which is why our n-gram neural network model outperformed the other models as well.

One perplexing observation is how the optimised vector representation used in the Naïve Bayes model differed significantly from the optimal vector used in the neural network model. In addition to a smaller n value being used, a binary representation performed better than the frequency count values. This is an odd observation, as the initial assumption was that more information is contained in a normalised count, since it details how often an author uses a specific n-gram rather than just whether they used it at least once. Similarly, TF-IDF of n-grams performed even worse than normalised counts. Combined with our findings in Section 4.1.2, this suggests that commonly occurring words, which would typically be weighted down by TF-IDF, are indeed important in making classification decisions, as opposed to more unique features which would not be represented in the final vectors. This poses a possible new avenue of future work for this model, in which a better selection of n-grams could be identified to make up our vector representation and potentially improve accuracy.
5.3. CodeBERT Model

Despite having demonstrated state-of-the-art performance in a number of NL-PL tasks [13], CodeBERT failed to outperform the n-gram-based models. This is perhaps unsurprising, as CodeBERT's focus is on NL-PL understanding tasks. It may be the case that classifications are being made on the basis of vocabulary used, or that the classification token does not encapsulate enough information to distinguish between 1,000 authors.

5.4. Ensemble Analysis

An ensemble analysis was carried out to investigate how each individual model contributed to the overall ensemble. This study was carried out by analysing all ensemble combinations of any two given models used in the final weighted average ensemble. Table 4 shows the accuracies of all these combinations.

Table 4: Ensemble combination accuracies (%) on the Development Dataset. Accuracies along the diagonal reflect the individual model accuracy prior to ensembling.

×           n-gram  CodeBERT  BoW     AST     Stylometry
n-gram      95.416
CodeBERT    95.528  86.724
BoW         95.472  90.564    82.96
AST         95.524  90.52     87.864  80.052
Stylometry  95.472  88.944    86.5    85.872  75.376

This table shows that any combination involving n-grams will typically be the best performing and that, overall, other models do not make a significant impact above the base n-gram accuracy. The highest accuracy achieved by an n-gram combination was the n-gram × CodeBERT ensemble, achieving 95.528% accuracy, a 0.112% increase over n-grams alone. Other combinations, however, can significantly increase the accuracy over the individual components. For example, the AST × Stylometry ensemble accuracy is 5.82% better than the AST model alone and 10.496% better than the stylometry model alone.

An ablation study was also carried out, investigating the effect of removing individual models from the final weighted average ensemble. Table 5 shows the effects of these ablations on the final ensemble accuracy.

Table 5: Ablation of Models from Weighted Average Ensemble on the Development Dataset.

Model Removed  Accuracy (%)  Difference (%)
None           95.715        0
n-gram         92.648        -3.067
CodeBERT       95.568        -0.147
BoW            95.672        -0.043
AST            95.632        -0.083
Stylometry     95.672        -0.043

Whilst it is unsurprising that the BoW and stylometry models had little impact on the final accuracies given their small weights deduced in Section 4.3, the results displayed by the AST model are curious. Much like the BoW and stylometry models, it had a relatively small impact on the final ensemble accuracy; however, in the ensemble this model is given a higher weight than any of the other constituent models, despite being second worst in terms of raw accuracy. It is not entirely clear why a model with such a significant weight has so little contribution, but it could be attributed to the diversity of the information captured by the AST vectors. Unlike the other models, which extract their features from the raw source code, the AST model's features are discovered as the result of syntax analysis carried out by a compiler and represent the actual working structure of the code. In other words, the other models have a higher degree of overlap in the information they contain, since their features all come from the same source, while the AST model has features that no other model has access to, and it is this diversity that could confer the higher weighting.

5.5. Error Analysis

To investigate why our n-gram model significantly outperformed all other models, and to explore potential avenues of work that could lead to increased accuracy, an error analysis was carried out. The first step in this process was to identify mistakes consistently made by all models. By intersecting the errors made by each individual model, a set of 822 source codes belonging to 430 different authors was identified that were never predicted correctly by any model. Over half of these authors were one-off misclassifications, leaving only 189 authors that had multiple misclassifications.
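This intersection step is straightforward to express in code; the sketch below assumes predictions is a placeholder dict mapping each model name to its list of predicted authors on the development set and true_labels the gold labels.

```python
# A sketch of the first error-analysis step (Section 5.5): intersect the
# errors of every model to find source codes no model classifies correctly.
# predictions is a placeholder dict mapping model names to predicted-author
# lists over the development set; true_labels holds the gold authors.
from collections import Counter

error_sets = [
    {i for i, (pred, gold) in enumerate(zip(preds, true_labels)) if pred != gold}
    for preds in predictions.values()
]
always_wrong = set.intersection(*error_sets)           # misclassified by every model
per_author = Counter(true_labels[i] for i in always_wrong)

print(len(always_wrong), "source codes never predicted correctly")
print(len(per_author), "distinct authors affected")
print(sum(1 for c in per_author.values() if c > 1), "authors with multiple misclassifications")
```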
By analysing the source codes from some of these authors, some consistent characteristics shared by these misclassifications were identified. A case study of the author labelled 376 provides a clear example of some of these features. Source codes from author 376 which are correctly identified by the ensemble model contain the same "#define" preprocessor directives and comment signature at the top of the source codes. Figure 1, however, exhibits three source codes from the same author that were not classified correctly. Notably, in these source codes the comment and "#define" preprocessor directives were absent or, in the case of Figure 1(a), altered from the author's typical usage. These features were also missing in all other misclassified source codes from this author. Additionally, the actual code contained is both short, typically being less than 30 lines, and lacking complex variable names, preferring single-letter identifiers. This is also consistent with a significant proportion of misidentified source codes from other authors, and it may be this lack of extractable information that leads to the models consistently getting them wrong.

[Figure 1: Misidentified Source Codes from User 376 (three C++ source codes, panels (a)-(c)).]

Next, an n-gram-specific error analysis was carried out with n = 6. To analyse the weaknesses of this model, a number of author source codes were again analysed. Figure 2 contains three such source codes from the author labelled 672. All of these source codes, along with the remaining correctly identified source codes from this author, share 52 common n-grams and, often, a unique comment at the start of the code. Curiously, however, the source code in Figure 2(c) is misidentified by the n-gram model despite the shared n-grams and comment. Upon investigating the shared features, a notable pattern emerged: the 52 common n-grams exclusively corresponded to five repeating features found in the author's source codes:

#include
using namespace std;
int main() {
scanf("%
return 0;

A significant note is that none of the 52 common n-grams were derived from the unique comment at the start of the source codes.
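The overlap check used in this analysis can be sketched as follows; author_codes is a placeholder list of one author's source codes and query_code a single (possibly misclassified) source code.

```python
# A sketch of the per-author n-gram overlap check described above (Section
# 5.5, n = 6). author_codes is a placeholder list of one author's source
# codes; query_code is a single source code to compare against them.
def char_ngram_set(source: str, n: int = 6) -> set:
    return {source[i:i + n] for i in range(len(source) - n + 1)}

common = set.intersection(*(char_ngram_set(src) for src in author_codes))
shared = common & char_ngram_set(query_code)
print(len(common), "n-grams shared by all of the author's source codes")
print(len(shared), "of these also appear in the query source code")
```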
Several conclusions can be made from this: firstly, features

[Figure 2: Source codes from the author labelled 672 (panels (a)-(c)).]