<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Boosting a kNN classifier by improving feature extraction for authorship identification of source code</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yves Bestgen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université catholique de Louvain</institution>
          ,
          <addr-line>10 place Cardinal Mercier, Louvain-la-Neuve, 1348</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <fpage>6</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>This paper presents the system developed by the LAST to identify the author of a source code. It combines the classical 1-nearest neighbor algorithm with a feature extraction step whose main characteristics are the use of indentation-aware tokenization to gather n-grams and skip-grams, which are weighted by Relevance Frequency. Its performance is over 92% correct identification of the author among 1,000 potential programmers.</p>
      </abstract>
      <kwd-group>
        <kwd>1-nearest neighbor</kwd>
        <kwd>indentation-aware tokenization</kwd>
        <kwd>skipgrams</kwd>
        <kwd>relevance frequency supervised weighting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This paper presents the system developed by the Laboratoire d’analyse statistique des textes (LAST)
for the PAN@FIRE 2020 Task on Authorship Identification of SOurce COde (AI-SOCO). The aim of this task
is to identify the programmer who wrote a given source code. As stressed by the challenge organizers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
being able to perform this task effectively should make it possible to combat cheating in academic
and scientific communities. It could also help to fight copyright infringement and to intervene
in code integrity investigations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Authorship identification is a question that has held attention for many years, especially in the
literary [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ] and history fields [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. More recently, due to the development of computers and the
Internet, work has focused on plagiarism and the re-use of sources without citing them adequately.
Identifying the author of a literary text and identifying the author of a source code are two relatively
different tasks, not only because of the differences between these genres of documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but also because the potential
authors of a literary work are usually well known and in very small numbers while the number of
potential programmers is very large. This is for example the case in the present shared task since
no less than 1,000 programmers must be discriminated. This task, therefore, appears a priori to be
particularly complex.
      </p>
      <p>
        As in other areas of machine learning, the state-of-the-art approach is based on Deep Learning
[
        <xref ref-type="bibr" rid="ref7">7, 8, 9</xref>
        ]. However, given the very large number of potential authors in this challenge, the classical
k-nearest neighbors (kNN) algorithm could prove to be very effective. In line with my previous
research ([
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for example), the main objective of the study is thus to try to develop the most efficient
non-deep-learning system without using any additional resources such as pre-trained language models
or knowledge bases. To do this, special attention was paid to the tokenization, by trying to take into
account the source code format style [10], to the feature set used, which includes skipgrams [11], and
to the weighting of these features, by adapting relevance frequency, the supervised weighting scheme
proposed by [12].
      </p>
      <p>The rest of this paper describes the materials collected by the challenge organizers, the system
developed, and the level of performance it has achieved. It also presents an analysis of the usefulness
of each of its components.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data and Challenge</title>
      <p>
        The organizers of the AI-SOCO task collected 100,000 C++ source codes from the Codeforces site
(http://codeforces.com), a web platform on which programming contests are organized: 100 codes for
each of the 1,000 selected participants. These source codes have all been considered to be correct by
Codeforces. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] provide useful statistics about this dataset.
      </p>
      <p>This dataset has been divided into three sets: the training set composed of 50 source codes for each
of the 1,000 authors, the development set consisting of 25 source codes for each author, and the test
set also made up of 25 source codes for each author.</p>
      <p>Author IDs were provided for the training and development sets, but not for the test set. Strangely
enough, the rules of the challenge forbade adding the development set to the training set to predict
the test set, even though the answers were provided. During the testing phase, each team could submit three
solutions that were evaluated using the accuracy metric, a logical choice given that the number of
instances in all categories is the same.</p>
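      <p>For concreteness, the evaluation metric can be computed as follows (a minimal sketch; the function name is mine):</p>

```python
def accuracy(predictions, gold):
    """Proportion of source codes assigned to their true author,
    the evaluation metric of the challenge."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```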
    </sec>
    <sec id="sec-3">
      <title>3. Developed System</title>
      <p>The specificity of the developed system is clearly not in the supervised classification procedure used
since it is the classical kNN algorithm, but in the features and their weighting. The choices made
to try to arrive at the best possible classifier are described below. Each option was evaluated based
on a stratified sample of 5,000 instances of the development set (instead of the full set to speed up
the evaluation). From time to time, certain choices were also evaluated on the full development set
and on the training set using the leave-one-out cross-validation (LOOCV) technique [13, 14]. Document
preprocessing and feature extraction were performed using the Statistical Analysis System (SAS). The
kNN algorithm was performed using a custom C program [15].</p>
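      <p>The LOOCV estimate mentioned above can be sketched as follows (an illustration only; the helper names are mine and the classifier is passed in as a function):</p>

```python
def loocv_accuracy(X, y, classify):
    """Leave-one-out cross-validation [13]: each training instance is
    classified using all the other instances as the training set."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # all instances except the i-th
        train_y = y[:i] + y[i + 1:]
        hits += classify(X[i], train_X, train_y) == y[i]
    return hits / len(X)
```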
      <sec id="sec-3-1">
        <title>3.1. Indentation-Aware Tokenization</title>
        <p>As underlined in the first works in source code authorship identification [16] (see [17] for a synthesis),
programmers can be discriminated by source code components such as preferred variable names and
ways of doing typical operations, but also by the code layout such as indentations and space between
and within instructions. For this reason, the tokenization step kept track of indentations. Concretely,
the contiguous spaces and tabs present at the beginning of a line are considered as tokens in the same
way as any other sequence of characters separated by space or tab. For example, the following two
lines (&gt; and &lt; marking the start and end of a line):
&gt;	for(auto &amp;p : elem.second) {&lt;
&gt;		i = p.first;&lt;
produce the following tokens (&gt; and &lt; marking the start and end of a token):
&gt;	&lt;
&gt;for(auto&lt;
&gt;&amp;p&lt;
&gt;:&lt;
&gt;elem.second)&lt;
&gt;{&lt;
&gt;		&lt;
&gt;i&lt;
&gt;=&lt;
&gt;p.first;&lt;
In this tokenization, the space and the tab are distinguished from each other.</p>
        <p>The extracted features are composed of token n-grams with n ranging from 1 to 4. To reduce the size
of the data, n-grams longer than 40 characters, an arbitrary threshold, were discarded. As indentations
are considered as "normal" tokens, they are taken into account in the n-grams.</p>
        <p>Skipgrams [11], n-grams in which one or several tokens located in the middle are replaced by a
placeholder, were also extracted: one from each 3-gram (a_b_c producing a_?_c) and three from each
4-gram (a_b_c_d producing a_?_c_d, a_b_?_d and a_?_?_d). Minimum frequency thresholds were
applied to these different types of features: 30 for 1-grams, 50 for 2-grams and 40 for 3-grams,
4-grams and skipgrams.</p>
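        <p>A minimal sketch of this indentation-aware tokenization (the function name is mine; the real system was implemented in SAS):</p>

```python
import re

def tokenize(line):
    """Split a source line on spaces/tabs, but keep the run of spaces
    and tabs at the beginning of the line as a token of its own."""
    tokens = []
    indent = re.match(r"[ \t]+", line)
    if indent:
        tokens.append(indent.group())   # the indentation token
    tokens.extend(line.split())         # remaining whitespace-separated tokens
    return tokens
```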
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Features</title>
        <p>The feature set thus combines indentation-aware token n-grams and skipgrams.</p>
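        <p>The extraction of skipgrams from the token sequence can be sketched as follows (an illustration; the function name is mine):</p>

```python
def skipgrams(tokens):
    """One skipgram from each 3-gram (a b c gives a ? c) and three from
    each 4-gram (a b c d gives a ? c d, a b ? d and a ? ? d)."""
    out = []
    for i in range(len(tokens) - 2):            # 3-grams
        a, b, c = tokens[i:i + 3]
        out.append((a, "?", c))
    for i in range(len(tokens) - 3):            # 4-grams
        a, b, c, d = tokens[i:i + 4]
        out.extend([(a, "?", c, d), (a, b, "?", d), (a, "?", "?", d)])
    return out
```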
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Weighting</title>
        <p>
          Instead of the classic tf.idf frequently used in the field (for example [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]), the system proposed here
uses relevance frequency (rf ), the supervised weighting factor of [12], which favors features that are
positively associated with the target category. Its formula is
rf = log2(2 + a / max(1, c)),
where a is the number of instances in the target category in which this feature occurs and c is the
number of instances in all other categories in which this feature occurs. As this is a supervised weighting
scheme, it is computed on the training set only. This rf weighting is combined with a binary coding
(presence or absence) of the feature.
        </p>
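        <p>As an illustration, the rf factor of [12] can be computed as follows (a sketch; the argument names follow the definitions above):</p>

```python
from math import log2

def rf_weight(a, c):
    """Relevance frequency: log2(2 + a / max(1, c)), where a is the number
    of target-category instances containing the feature and c the number
    of instances of all other categories containing it."""
    return log2(2 + a / max(1, c))

# A feature typical of one author gets a high weight; a widespread one stays low.
```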
      <p>Since the category of an instance is unknown for the test set, [12] proposed to weight the features of
the unknown instance with the weights of the category with which it is compared. While this approach is
technically feasible, it poses practical problems since it is necessary to reweight the features of an
unknown instance as many times as there are potential categories. The approach proposed by [18]
for other supervised weighting procedures was used instead. It consists of using the maximum weight
that the feature in question has received on all the categories considered in the learning set. The rf
weights used for the test set are therefore calculated by the same formula and on the same data as the
weights used for the training set.</p>
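        <p>A sketch of this maximum-weight scheme (the data structure, a dict of per-category rf weights, is my own illustration):</p>

```python
def unknown_instance_weight(feature, rf_by_category):
    """Weight used for a feature of an unknown instance, following [18]:
    the maximum rf weight that this feature received over all the
    categories of the training set."""
    return max(w.get(feature, 0.0) for w in rf_by_category.values())
```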
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Adding Centroids</title>
        <p>While it was not obvious that this factor could be of much use, the centroid of each category was
added to its 50 instances in the training set. It is obtained by calculating the mean weight of all the
features present in at least one of these 50 instances.</p>
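        <p>Under one plausible reading of this definition, the centroid is simply the per-feature mean of the author's 50 training vectors (a sketch; the array layout is mine):</p>

```python
import numpy as np

def centroid(author_vectors):
    """Mean weight of each feature over the author's training instances;
    the resulting vector is added to the training set as an extra instance."""
    return np.asarray(author_vectors, dtype=float).mean(axis=0)
```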
      </sec>
      <sec id="sec-3-5">
        <title>3.5. L2 Normalization</title>
        <p>The analyses showed that the use of L2 normalization markedly improves the system performance.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Parameters for the kNN Algorithm</title>
        <p>The analyses carried out on the development set led to fixing the kNN parameter k at 1 and to using the
classical Euclidean distance.</p>
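        <p>Putting the pieces together, the classification step can be sketched as follows (an illustrative reimplementation, not the custom C program of [15]):</p>

```python
import numpy as np

def predict_1nn(x, train_X, train_y):
    """1-nearest neighbor with Euclidean distance on L2-normalized vectors:
    the unknown code receives the author ID of its closest training code."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(train_X, dtype=float)
    x = x / np.linalg.norm(x)                          # L2 normalization
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    distances = np.linalg.norm(X - x, axis=1)
    return train_y[int(np.argmin(distances))]
```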
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Analyses and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Performance on the Test Set</title>
        <p>As allowed by the rules of the challenge, three submissions were made on the test set. The first was
based on the system described above and obtained an accuracy of 0.9217 while its accuracy on the
development set was 0.9269. The other two systems submitted tried to account for the information
provided by the challenge organizers that the test material contained exactly 25 instances of each of
the 1,000 categories. More precisely, the classification was carried out by employing the best matching
first approach (see [19] for the use of this approach in a case of pair-matching) and by stopping
assigning instances to a category when a sufficiently large number of instances had been assigned
to it. As this kind of information is obviously not available in real life and as these two submissions
scored only very slightly higher than the first one (0.9219 instead of 0.9217), they are not discussed in
more detail.</p>
        <p>The system ranked fourth in the challenge out of 16 teams with 0.0294 less than the first (0.9511),
0.0119 less than the 3rd (0.9336) and 0.006 more than the 5th (0.9157). It is also interesting to compare
its performance with the TF-IDF kNN Baseline proposed by the organizers, which is based on the
standard sklearn tokenizer, a TF-IDF weighting applied to the 10,000 most frequent features, and a
kNN classifier with  = 25. It scored 0.6278 on the test material, 0.2939 lower than the system proposed
here.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Factors Affecting the System Performance</title>
        <p>While the supervised learning procedure is very classic, the feature extraction is somewhat different from
what is usually done in the field. It is therefore useful to discuss the benefits provided by its different
components. To do this, Table 1 compares the full system described in section 3 (the system used
for the first submission) to a series of other systems obtained by ablation, modifying one component
at a time. The full system is therefore based on the indentation-aware tokenization, the n-grams (n
ranging from 1 to 4) and the skipgrams, the rf weighting, adding the centroids of the categories and
using k = 1, while for building the other systems only one of these components is modified, such as the
tokenization procedure, the weighting or the value of k. The accuracy scores reported in Table 1 were
obtained on a stratified sample of 5,000 instances taken from the development set.</p>
        <p>Table 1 indicates that three factors have an important impact on performance:
• Tokenization which, when done in the classical way, by simply separating tokens according to
the presence of space or tab, results in a performance decrease of 0.0464.
• The extracted features, since the system based only on 1-grams obtains an accuracy lower by
0.0354 compared to the full system. Adding each additional n-gram length improves
performance, but the gain provided by the 4-grams is small and less than that provided by the
skipgrams. It should be remembered, however, that three of the four types of skipgrams are based
on the 4-grams.
• The value of k, which also has a very important effect. The smaller the k, the better the performance,
a k of 25, as in the baseline proposed by the organizers, producing a loss of 0.0492.</p>
        <p>The other two factors have a much smaller impact. This is particularly the case with the addition
of centroids (+0.0006). The rf weighting brings a gain of 0.005, which is a bit disappointing.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Qualitative Analysis</title>
        <p>At first glance, it seemed difficult to imagine that it would be possible to achieve such a high level of
performance (let alone the level reached by the challenge best system) in a task that requires
identifying 1,000 different programmers based on 50 instances for each of them. So, I tried to understand why
this task was "so" simple. The main reason is that the vast majority of programmers in the datasets do
not try to hide their identity. Many of them systematically insert at the beginning or at the end of all
or almost all of their source codes metadata stored in block comments that distinguish them from all
the others. Many programmers also systematically insert preambles (idiosyncratic declarations and
insertion of standard declaration files) at the start of the code, which they systematically copy from
one program to another and therefore whose content and order are always identical.</p>
        <p>There is, however, one special case that deserves mention here. Analysis of the errors made by
the system shows that it was unable to distinguish two programmers from each other, User IDs 91
and 147. The system makes more than 50% of errors on them, the errors made on one consisting of
assigning the source code to the other programmer and vice versa. The comparison of the source
codes of these two programmers shows that they both use very frequently the same metadata and
preambles around the C++ instructions needed to solve the problem as shown in Figure 1. Without
further information, it is difficult to understand the origin of this similarity.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusion</title>
      <p>The system developed by the LAST for the PAN@FIRE 2020 Task on Authorship Identification of
SOurce COde (AI-SOCO) was aimed at obtaining the best possible performance by combining a
supervised learning algorithm among the most classical with a feature extraction step as efficient as
possible. The classifier used is the 1-nearest neighbor algorithm. The features are extracted by a fully
automatic procedure exclusively from the training set. The performance achieved is over 92%
correct identification of the author of a source code among 1,000 potential programmers, ranking the
system 4th out of 16 teams. Analyses carried out on a sample of the development set indicate that
the indentation-aware tokenization, the extraction of n-grams from 1 to 4 tokens and skipgrams, and
setting the parameter k to 1 are at the source of the efficiency of the system, as well as, but to a
lesser extent, the use of relevance frequency, the supervised weighting scheme proposed by [12]. The
performance obtained on the test set seems rather high for a system that is based on an approach
as simple as kNN and which only uses training data without the addition of external knowledge or
precomputed embeddings.</p>
      <p>An important weakness of the kNN algorithm is that it is biased in favor of the most frequent
categories. In the present task, this is not a problem since the categories are balanced, but this weakness
should be kept in mind if the system is used with other datasets in which this condition is not met.</p>
      <p>
        Another limitation of the system is that it is far from obvious that it could prove useful for the
purpose for which the AI-SOCO task was designed: to aid in the detection of cheating in the academic
community (https://sites.google.com/view/ai-soco-2020/task-description). The proposed system
benefits too much from the fact that the vast majority of programmers do not try to hide their identity,
systematically inserting at the beginning or at the end of all or almost all of their source codes,
metadata which distinguishes them from all the others. When two different programmers use the same
metadata in their source code, the system performance is very low. It would be interesting to
determine what level of performance the proposed system could achieve when using only the parts of the
source codes that contain these metadata or, even more interesting, to determine the efficiency of
this system, but also of the other systems participating in this task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], on source codes without these
metadata.
      </p>
      <p>Regarding possible developments, it could be very fruitful to try to make the tokenization more efficient
since the analyses have shown that even the partial use of code formatting is very beneficial. Taking
inspiration from [9] could be particularly relevant.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The author wishes to thank the organizers of the AI-SOCO task for putting together this valuable
event and the two anonymous reviewers for their very helpful comments. He is a Research Associate
of the Fonds de la Recherche Scientifique - FNRS (Fédération Wallonie Bruxelles de Belgique).
Computational resources have been provided by the supercomputing facilities of the Université Catholique de
Louvain (CISM/UCL) and the Consortium des Equipements de Calcul Intensif en Fédération Wallonie
Bruxelles (CECI).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[8] E. Bogomolov, V. Kovalenko, A. Bacchelli, T. Bryksin, Authorship attribution of source code: A language-agnostic approach and applicability in software engineering, CoRR abs/2001.11593 (2020).</p>
      <p>[9] S. Gorshkov, M. Nered, E. Ilyushin, D. Namiot, V. Sukhomlin, Source code authorship identification using tokenization and boosting algorithms, in: V. Sukhomlin, E. Zubareva (Eds.), Modern Information Technology and IT Education, Springer International Publishing, 2020, pp. 295–308.</p>
      <p>[10] I. Krsul, E. H. Spafford, Authorship analysis: identifying the author of a program, Comput. Secur. 16 (1997) 233–257. doi:10.1016/S0167-4048(97)00005-9.</p>
      <p>[11] Y. Wilks, Reveal: the notion of anomalous texts in a very large corpus, Tuscan Word Centre International Workshop, Certosa di Pontignano, Italy, 2005.</p>
      <p>[12] M. Lan, C. L. Tan, J. Su, Y. Lu, Supervised and traditional term weighting methods for automatic text categorization, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2009) 721–735. doi:10.1109/TPAMI.2008.110.</p>
      <p>[13] P. A. Lachenbruch, R. M. Mickey, Estimation of error rates in discriminant analysis, Technometrics 10 (1968) 1–11.</p>
      <p>[14] Y. Bestgen, S. Granger, J. Thewissen, Error patterns and automatic L1 identification, in: S. Jarvis, S. Crossley (Eds.), Approaching Transfer through Text Classification: Explorations in the Detection-based Approach, volume 3, Multilingual Matters, Bristol, UK, 2012, pp. 127–153.</p>
      <p>[15] Y. Bestgen, Détermination de la valence affective de termes dans de grands corpus de textes, in: Actes du Colloque International sur la Fouille de Texte CIFT'02, INRIA, 2002, pp. 81–94.</p>
      <p>[16] I. Krsul, Authorship analysis: identifying the author of a program, Department of Computer Science Technical Reports TR-94-030, Purdue University, 1994.</p>
      <p>[17] S. D. Burrows, Source code authorship attribution, Ph.D. thesis, RMIT University, Melbourne, 2010.</p>
      <p>[18] I. Batal, M. Hauskrecht, Boosting KNN text classification accuracy by using supervised term weighting schemes, in: D. W. Cheung, I. Song, W. W. Chu, X. Hu, J. J. Lin (Eds.), Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, ACM, 2009, pp. 2041–2044. doi:10.1145/1645953.1646296.</p>
      <p>[19] Y. Bestgen, LSVMA : au plus deux composants pour apparier des résumé à des articles, in: Actes de DEFT2011 : septième Défi Fouille de Textes, ATALA, 2011, pp. 105–114.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>A.</given-names> <surname>Fadel</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Musleh</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Tufaha</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Al-Ayyoub</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Jararweh</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Benkhelifa</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>Overview of the PAN@FIRE 2020 task on Authorship Identification of SOurce COde (AI-SOCO)</article-title>,
          in: Proceedings of The 12th meeting of the Forum for Information Retrieval Evaluation (FIRE
          <year>2020</year>), CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>C.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Niu</surname></string-name>,
          <article-title>Authorship identification of source codes</article-title>,
          in: L. Chen, C. S. Jensen, C. Shahabi, X. Yang, X. Lian (Eds.),
          <source>Web and Big Data - First International Joint Conference, APWeb-WAIM 2017</source>,
          Beijing, China, July 7-9,
          <year>2017</year>,
          Proceedings, Part I, volume
          <volume>10366</volume>
          of Lecture Notes in Computer Science, Springer, 2017, pp.
          <fpage>282</fpage>
          -
          <lpage>296</lpage>
          . doi:10.1007/978-3-319-63579-8_22.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bestgen</surname>
          </string-name>
          ,
          <article-title>Recherche en paternité littéraire : deux romans en quête d'auteur</article-title>
          , in: V. MartinSchmets (Ed.), Henri Vandeputte :
          <article-title>Oeuvres complètes</article-title>
          , volume
          <volume>3</volume>
          , Tropismes, Bruxelles,
          <year>1992</year>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W. M. A.</given-names>
            <surname>Smith,</surname>
          </string-name>
          <article-title>The revenger's tragedy: The derivation and interpretation of statistical results for resolving disputed authorship</article-title>
          ,
          <source>Computer and the Humanities</source>
          <volume>21</volume>
          (
          <year>1987</year>
          )
          <fpage>21</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Somers</surname>
          </string-name>
          ,
          <article-title>Analyse statistique du style</article-title>
          , Nauwelaerts, Louvain,
          <year>1962</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mosteller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>Inference in an authorship problem</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          <volume>58</volume>
          (
          <year>1963</year>
          )
          <fpage>275</fpage>
          -
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abuhamad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abuhmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohaisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nyang</surname>
          </string-name>
          ,
          <article-title>Large-scale and language-oblivious code authorship identification</article-title>
          , in: D. Lie, M. Mannan, M. Backes, X. Wang (Eds.),
          <source>Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS 2018</source>,
          Toronto, ON, Canada, October 15-19, 2018, ACM, 2018, pp. 101–114. doi:10.1145/3243734.3243738.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>