<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Personality Recognition Applying Machine Learning Techniques on Source Code Metrics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hugo A. Castellanos</string-name>
          <email>hacastellanosm@unal.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Nacional de Colombia Bogotá</institution>
          ,
          <country country="CO">Colombia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Source code has become a data source of interest in recent years. In the software industry, extracting source code metrics is common, mainly for quality assurance purposes. In this paper, source code metrics are used to build programmer profiles in order to identify different personality traits using machine learning algorithms. This work was done as part of the Personality Recognition in SOurce COde (PR-SOCO) shared task at the Forum for Information Retrieval Evaluation 2016 (FIRE 2016).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Text has always been of interest in information
retrieval, as text-based documents contain valuable
information about the author. During recent decades, source code
has become a source of valuable information as well. Many
efforts in this field have aimed to improve both
processes and products in the software development
industry [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The main efforts in source code analysis have focused
on forensic applications such as author recognition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
plagiarism detection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Several techniques have been used
successfully in these tasks, such as n-grams, source code
metrics, coding styles and abstract syntax trees [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Other
applications of source code analysis include feature location
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and topic identification [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], among others.
      </p>
      <p>The PR-SOCO shared task consisted of predicting the
personality traits of a programmer given a set of his/her source
codes. These source codes, like any other human production,
may be influenced by personality.</p>
      <p>In this work, the use of source code metrics is proposed
to find information about the program author, specifically
the author's personality traits based on the Big-5 personality
test. In addition, machine learning methods are used to
predict the personality traits from the extracted source
code metrics.</p>
      <p>The rest of this paper is organized as follows. Section 2
presents a general background on source code metrics.
Section 3 describes the proposed approach. Section 4 presents
the machine learning strategies. Section 5 presents the
obtained results. Finally, Section 6 concludes the paper.</p>
      <p>2. BACKGROUND ON SOURCE CODE METRICS</p>
      <p>
        According to Malhotra [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], software metrics are used to
assess the quality of the product or process used to build it.
Such metrics have the following characteristics:
      </p>
      <p>Quantitative: metrics have a value.</p>
      <p>Understandable: the way the metric is calculated must
be easy to understand.</p>
      <p>Validatable: metrics must capture the attributes which
they were designed to capture.</p>
      <p>Economical: it must be economical to capture the
metric.</p>
      <p>Repeatable: if measured several times the results should
be the same.</p>
      <p>Language independent: the metrics should not depend
on a specific language.</p>
      <p>Applicability: the metric should be applicable in any
phase of the software development.</p>
      <p>Comparable: the metric should correlate with another
metric capturing the same concept.</p>
      <p>Source code metrics must have a scale, which can be:</p>
      <p>Interval: it is given by a defined range of values.</p>
      <p>Ratio: it is a value which has an absolute minimum or
zero point.</p>
      <p>Absolute: it is a simple count of the elements of
interest.</p>
      <p>Nominal: it is a value which mainly defines a discrete
scale of values, like 1-present or 0-not present.</p>
      <p>Ordinal: it is a categorization which is intended to
order or rank, for instance levels of severity: critical,
high, medium, etc.</p>
      <p>The Halstead volume (V), described in Equation 3, is a
measure of size, but it is also interpreted as the number of
mental comparisons needed to write a program
of length N. Moreover, the difficulty (D), shown in
Equation 4, describes how difficult the program is to write. It is
closely related to volume because, as volume increases, difficulty
also does.</p>
      <list list-type="bullet">
        <list-item>
          <p>Size: usually intended to estimate cost and effort. The most popular metric in this category is source lines of code (SLOC), but in object-oriented languages size can also be measured by the number of classes, methods and attributes.</p>
        </list-item>
        <list-item>
          <p>Software quality: intended to measure the quality of the software. These metrics can be divided into the following categories:</p>
          <list list-type="bullet">
            <list-item>
              <p>Based on defects: they measure the level of defects. The main metrics of this category are defect density, defined as the number of defects per SLOC, and defect removal effectiveness, defined as the number of defects removed in a phase divided by the latent defects. If the latent defects are unknown, they can be estimated based on previous phases.</p>
            </list-item>
            <list-item>
              <p>Usability: these metrics are intended to measure user satisfaction with the software. Satisfaction can be given by ease of use and ease of learning.</p>
            </list-item>
            <list-item>
              <p>Complexity metrics [<xref ref-type="bibr" rid="ref9">9</xref>]: they are oriented to produce a measure of the difficulty of testing or maintaining a piece of source code. These metrics also give information about the number of instructions during execution.</p>
            </list-item>
            <list-item>
              <p>Testing: intended to measure the progress of testing of a piece of software.</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Object-oriented metrics: intended to measure features of the object-oriented paradigm. They can be divided into:</p>
          <list list-type="bullet">
            <list-item>
              <p>Coupling: a measure of the level of interdependence between classes, calculated by counting the number of classes called by another class.</p>
            </list-item>
            <list-item>
              <p>Cohesion: measures how many elements of a class are functionally related to each other.</p>
            </list-item>
            <list-item>
              <p>Inheritance: measures the depth of the class hierarchy.</p>
            </list-item>
            <list-item>
              <p>Reuse: a measure of the number of times that a class is reused.</p>
            </list-item>
            <list-item>
              <p>Size: intended to measure size not only in lines of code but also in the particularities of the object-oriented paradigm, like method count, attribute count, class count, etc.</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Evolutionary metrics: try to measure the evolution of a piece of software based on different elements like revisions, refactorings and bug fixes. They measure how many lines of code are new, modified or deleted.</p>
        </list-item>
      </list>
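      <p>
        Additionally, the empirical Halstead metrics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] should also
be considered. These metrics are calculated from the
operands (identifiers) and operators (keywords, ++, +).
The vocabulary (n), given in Equation 1, is the sum of the unique operators (n1)
and unique operands (n2). Length (N), described in Equation 2, is the
sum of the total number of operators (N1) and operands
(N2).
        <disp-formula><label>(1)</label><tex-math><![CDATA[n = n_1 + n_2]]></tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math><![CDATA[N = N_1 + N_2]]></tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math><![CDATA[V = N \log_2 n]]></tex-math></disp-formula>
        <disp-formula><label>(4)</label><tex-math><![CDATA[D = \frac{n_1}{2} \cdot \frac{N_2}{n_2}]]></tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math><![CDATA[E = D \cdot V]]></tex-math></disp-formula>
      </p>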
      <p>The effort (E), described in Equation 5, indicates the effort
required to write a program, which grows with its difficulty.</p>
      <p>
        Finally, the effort is the basis for calculating the time to
understand/implement (T) and the bugs delivered (B), as can be
seen in Equations 6 and 7, respectively. The time metric
is related to the Stroud number [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which is the "number
of elementary discrimination per second". Stroud claimed
that this number ranges from 5 to 20, but Halstead's
experiments indicated empirically that the best value in
this case was 18.
      </p>
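      <p>As a rough illustration of how the Halstead measures discussed in this section fit together, the following Python sketch computes them from four counts. It is not the analyzer used in this work: the function name and the example counts are hypothetical, and the formulas are the commonly quoted textbook definitions, including the assumed form of the bug estimate (effort to the power 2/3, divided by 3000).</p>
      <preformat>
```python
import math


def halstead_metrics(n1, n2, N1, N2, stroud=18):
    """Halstead measures from operator and operand counts.

    n1, n2: distinct operators and distinct operands.
    N1, N2: total operator and total operand occurrences.
    stroud: Stroud number; 18 is the value mentioned in the text.
    """
    vocabulary = n1 + n2                      # Equation 1
    length = N1 + N2                          # Equation 2
    volume = length * math.log2(vocabulary)   # Equation 3
    difficulty = (n1 / 2) * (N2 / n2)         # Equation 4
    effort = difficulty * volume              # Equation 5
    time = effort / stroud                    # Equation 6
    bugs = effort ** (2 / 3) / 3000           # Equation 7 (assumed textbook form)
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "time": time,
        "bugs": bugs,
    }


# Hypothetical counts for a small program.
metrics = halstead_metrics(n1=10, n2=6, N1=40, N2=24)
print(metrics)
```
      </preformat>
      <p>With these example counts the sketch gives a volume of 256, a difficulty of 20 and an effort of 5120; the time estimate is simply the effort divided by the Stroud number.</p>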
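      <p>
        <disp-formula><label>(6)</label><tex-math><![CDATA[T = \frac{E}{18}]]></tex-math></disp-formula>
        <disp-formula><label>(7)</label><tex-math><![CDATA[B = \frac{E^{2/3}}{3000}]]></tex-math></disp-formula>
      </p>
      <p>3. SOURCE CODE ANALYSIS FOR PERSONALITY RECOGNITION</p>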
      <p>
        Text documents contain information about the author.
In the work described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors were able to show
that certain personality traits could be predicted from
a text, in that case an essay.
      </p>
      <p>The present work starts from the hypothesis that source
code, as a form of text, leaves traces of the author's
personality traits. Within the scope of this work, source code is a text
document written by a single author. It is worth
mentioning that a solution to a single problem could be implemented in
several ways by a programmer, which gives a certain guarantee
of uniqueness.</p>
      <p>To develop this hypothesis, a method is proposed to
extract metrics from source code in order to predict
personality traits. Figure 1 summarizes the general method.
As a first step, the source examples provided are
separated into individual files. Then a set of metrics is
extracted from the source codes using a source code analyzer.
With the extracted metrics as input, machine learning
methods are applied in order to predict the personality traits
of the authors. Finally, the results are presented.</p>
      <p>The provided corpus consisted of one source code file per
person and another file indicating each author and his/her
personality traits (ground truth). Each source code file
contained several pieces of source code separated by a mark. Each
file was split into several individual files, keeping track of the
author-file relationship.</p>
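      <p>The splitting step can be sketched as follows. This is only a minimal illustration: the separator string, the author identifier and the function names are hypothetical, since the actual mark used in the corpus is not stated here.</p>
      <preformat>
```python
def split_submissions(text, separator="#####"):
    """Split one author's file into individual code pieces, dropping empty ones."""
    pieces = [piece.strip() for piece in text.split(separator)]
    return [piece for piece in pieces if piece]


def index_corpus(author_files, separator="#####"):
    """Map (author_id, piece_index) to source text, keeping the author-file relationship."""
    index = {}
    for author_id, text in author_files.items():
        for i, piece in enumerate(split_submissions(text, separator)):
            index[(author_id, i)] = piece
    return index


# Hypothetical corpus with a single author and two code pieces.
corpus = {"author01": "class A {}\n#####\nclass B {}\n#####\n"}
pieces = index_corpus(corpus)
print(pieces)
```
      </preformat>
      <p>Keying each piece by author and position means that per-file metrics can later be aggregated back to the author whose traits are being predicted.</p>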
      <p>
        An analyzer was written using ANTLR 4 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with the
Java grammar. From each individual file, the source code
metrics described in Table 1 were extracted.
      </p>
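      <p>To give an idea of what count-based metrics look like, the sketch below counts a few control-flow keywords and non-blank lines in a Java snippet. It deliberately swaps the ANTLR-based analyzer for crude regular expressions, so it is an approximation only; the function names, the chosen keywords and the definition of a source line are assumptions made for illustration.</p>
      <preformat>
```python
import re

KEYWORDS = ("for", "while", "if", "else")


def strip_comments_and_strings(java_source):
    """Crudely blank out comments and string literals so keywords inside them are not counted."""
    pattern = r'//[^\n]*|/\*.*?\*/|"(?:\\.|[^"\\])*"'
    return re.sub(pattern, " ", java_source, flags=re.DOTALL)


def count_metrics(java_source):
    """Count keyword occurrences and non-blank lines after stripping comments and strings."""
    code = strip_comments_and_strings(java_source)
    counts = {kw: len(re.findall(r"\b" + kw + r"\b", code)) for kw in KEYWORDS}
    counts["sloc"] = sum(1 for line in code.splitlines() if line.strip())
    return counts


sample = """
public class Demo {
    // for loop in a comment is ignored
    int sum(int n) {
        int s = 0;
        for (int i = 0; i != n; i++) {
            if (i % 2 == 0) { s += i; } else { s -= i; }
        }
        return s;
    }
}
"""
counts = count_metrics(sample)
print(counts)
```
      </preformat>
      <p>A grammar-based analyzer avoids the blind spots of this approach (for example character literals or unusual formatting), which is one reason to prefer a real parser for the actual extraction.</p>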
      <p>As can be seen, most of the metrics are based on counting
and averaging. All the metrics were normalized,
and the normalized data were the input to the machine learning
algorithms.</p>
      <p>
        As the extracted metrics belong to similar categories,
hierarchical clustering using Ward's method [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] was
applied. It was found that certain related metrics were too
close to each other. Therefore, they were consolidated as
follows:
      </p>
      <p>Length metrics: contains the metrics related to some
length/size measure and is calculated as the average
of: number of files, average source lines of code,
average number of classes per file, average source code lines
per class, average attributes per class, average methods
per class, average class name length, and the average
number of parameters.</p>
      <p>Complexity metrics: contains the metrics related to
algorithm complexity and is calculated as the
average of: average number of for loops, average number
of while loops, average number of if clauses, average
number of if-else clauses, and the average identifier
length.</p>
      <p>Halstead: contains all the Halstead metrics extracted
and is calculated as the average of: Halstead bugs
delivered, Halstead difficulty, Halstead effort, Halstead
time to understand or implement, and Halstead volume.</p>
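      <p>The consolidation above can be illustrated with a short sketch that normalizes each metric and then averages the metrics of each group per author. The clustering step itself is not reproduced. The metric names are hypothetical stand-ins for a subset of the metrics listed above, and z-score normalization is an assumption, since the normalization method is not specified.</p>
      <preformat>
```python
from statistics import mean, pstdev

# Hypothetical column names standing in for a subset of the metrics listed above.
GROUPS = {
    "length": ["sloc", "methods_per_class"],
    "complexity": ["for_loops", "if_clauses"],
    "halstead": ["volume", "effort"],
}


def zscore_columns(rows):
    """Standardize every metric column to zero mean and unit variance (assumed scheme)."""
    names = list(rows[0].keys())
    stats = {n: (mean(r[n] for r in rows), pstdev(r[n] for r in rows)) for n in names}
    return [
        {n: (r[n] - stats[n][0]) / stats[n][1] if stats[n][1] else 0.0 for n in names}
        for r in rows
    ]


def consolidate(rows):
    """Replace each group of related metrics by its per-author average."""
    return [
        {group: mean(r[n] for n in cols) for group, cols in GROUPS.items()}
        for r in zscore_columns(rows)
    ]


# Two hypothetical authors.
rows = [
    {"sloc": 10, "methods_per_class": 2, "for_loops": 1, "if_clauses": 3,
     "volume": 100, "effort": 1000},
    {"sloc": 30, "methods_per_class": 6, "for_loops": 3, "if_clauses": 1,
     "volume": 300, "effort": 3000},
]
averages = consolidate(rows)
print(averages)
```
      </preformat>
      <p>Normalizing before averaging keeps large-valued metrics such as Halstead effort from dominating the group average.</p>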
      <p>4. MACHINE LEARNING METHODS</p>
      <p>This section describes the machine learning methods used.
Each one corresponds to a submission sent to the
shared task: submission 1 corresponds to support vector
regression (SVR) over source code metrics, submission 2
corresponds to an extra trees regressor (ETR), and submission 3
corresponds to support vector regression over averages.</p>
      <p>4.1 Support vector regression (SVR) on metrics</p>
      <p>An SVR algorithm was used with the extracted
metrics as input. For each personality trait, an independent SVR
was trained and a 6-fold cross validation was executed over the
corpus. The best parameters according to this validation
can be seen in Table 2. Figure 2 shows the
resulting mean squared error (y axis) versus the variation of gamma
(x axis), with the best C and ε values, in logarithmic scale
in cross validation. This behavior was similar for all the
personality traits.</p>
      <p>4.2 Extra trees regressor (ETR) on metrics</p>
      <p>Another method applied was the extra trees regressor.
For each personality trait, a 6-fold cross validation was
performed. For all traits, the best value for the number of
estimators was 77.</p>
      <p>4.3 Support vector regression (SVR) on averages</p>
      <p>Based on the clustering results, an SVR was used with the
metric averages as input, i.e., length metrics, complexity
metrics, and Halstead metrics. The first step was to
calculate the variance of each average. As the variance of the
complexity metrics was too low, this average was removed and only
the length and Halstead averages were used as input.</p>
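      <p>The three submissions described in this section could be set up along the following lines with scikit-learn. This is a sketch under several assumptions rather than the configuration actually used: the data are random placeholders, the RBF kernel, the parameter grid and the variance threshold are guesses, and only the 6-fold cross validation and the 77 estimators come from the text.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 8))                 # placeholder: 48 authors x 8 normalized metrics
y = 50 + 5 * X[:, 0] + rng.normal(size=48)   # placeholder scores for a single trait

cv = KFold(n_splits=6, shuffle=True, random_state=0)   # 6-fold cross validation
grid = {                                     # hypothetical search grid
    "C": [0.1, 1, 10, 100],
    "epsilon": [0.01, 0.1, 1],
    "gamma": np.logspace(-3, 1, 5),
}


def tune_svr(features, target):
    """Grid-search an RBF SVR (assumed kernel) and return the fitted search object."""
    search = GridSearchCV(SVR(kernel="rbf"), grid,
                          scoring="neg_mean_squared_error", cv=cv)
    return search.fit(features, target)


# Submission 1 style: SVR on the individual metrics.
svr_metrics = tune_svr(X, y)

# Submission 2 style: extra trees with 77 estimators, as reported in the text.
etr = ExtraTreesRegressor(n_estimators=77, random_state=0)
etr_mse = -cross_val_score(etr, X, y, scoring="neg_mean_squared_error", cv=cv).mean()

# Submission 3 style: SVR on group averages, after dropping low-variance groups.
averages = np.column_stack([
    X[:, :3].mean(axis=1),
    0.01 * X[:, 3:5].mean(axis=1),           # artificially flat group
    X[:, 5:].mean(axis=1),
])
keep = averages.var(axis=0) > 0.01           # hypothetical variance threshold
svr_averages = tune_svr(averages[:, keep], y)

print(svr_metrics.best_params_, -svr_metrics.best_score_)
print(etr_mse, keep.tolist(), -svr_averages.best_score_)
```
      </preformat>
      <p>In the setting described above, one such model would be fitted independently for each personality trait, with the trait scores taking the place of the placeholder target.</p>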
      <p>The best parameters according to cross validation can
be seen in Table 3. The plots of γ versus error for the
best C and ε values behave similarly to the one shown
in Figure 2.</p>
      <p>with other participant results, and showing better results
than the baseline in submissions 2 and 3. Conscientiousness
followed with the best error for the SVR and Extra Tree
Regressor.</p>
      <p>
        The worst-predicted trait in terms of RMSE was Emotional
Stability/Neuroticism for all methods. Based on the results of
other participants, this was a general result [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A deeper
study of this particular trait is required to improve the
results.
      </p>
      <p>When measured with the Pearson product-moment
correlation (PC), the results differ considerably among runs. However,
submissions 2 and 3 showed much better results than
the baseline, as they indicate a stronger correlation than
the one shown by the baseline. The SVR with averages
has a notable correlation for openness, with a value of 0.3,
and for conscientiousness, with a value of -0.25. In the ETR run,
openness had the highest value, 0.29. For SVR over metrics,
openness also had the highest value, 0.28. This trait
was the most consistent among all the methods used.</p>
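      <p>For reference, the two evaluation measures discussed here, RMSE and the Pearson product-moment correlation, can be computed as in the following sketch. The function names and the example scores are hypothetical and are not results from this work.</p>
      <preformat>
```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def pearson(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical trait scores and predictions.
truth = [40.0, 50.0, 60.0, 70.0]
predicted = [45.0, 50.0, 55.0, 70.0]
error = rmse(truth, predicted)
correlation = pearson(truth, predicted)
print(error, correlation)
```
      </preformat>
      <p>The two measures answer different questions: RMSE penalizes the size of the errors, while the correlation only reflects whether predictions rise and fall with the true scores, which is why a run can look similar to the baseline on one and quite different on the other.</p>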
      <p>It is interesting that PC shows correlations with openness
and conscientiousness. This is a good result because it
indicates that the metrics used have a certain relationship with
these personality traits. In terms of RMSE, the proposed method
performed slightly better than the baseline, but
the difference is not significant, which shows that more work is
required to obtain a good predictor of personality. Therefore,
it is necessary to include more source code metrics in
this study. This could reveal that certain metrics are
related to specific personality traits.</p>
      <p>6. CONCLUSIONS AND FUTURE WORK</p>
      <p>The source code metrics extracted and used as input to
the machine learning methods were enough to get a close
prediction of several personality traits. Other approaches
can be consulted in [?], which shows other results and
approaches for the PR-SOCO task.</p>
      <p>As the PC denotes a certain correlation, in this case
particularly with openness, this could mean that the metrics
considered in this work are likely related to that
trait. However, as there are several other metrics with
different purposes, like quality, readability, etc., the use of more of
those metrics could improve the prediction. Other metrics
not considered in this study may have better relationships
with the personality traits. This work could be extended by
exploring other metrics and their relationship with each
personality trait.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhawle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          .
          <article-title>Lexical predictors of personality type</article-title>
          .
          <source>Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of North America</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>16</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Caliskan-Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Harang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yamaguchi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Greenstadt</surname>
          </string-name>
          .
          <article-title>De-anonymizing Programmers via Code Stylometry</article-title>
          .
          <source>USENIX Security Symposium</source>
          , pages
          <fpage>255</fpage>
          –
          <lpage>270</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Revelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gethers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Poshyvanyk</surname>
          </string-name>
          .
          <article-title>Feature location in source code: A taxonomy and survey</article-title>
          .
          <source>Journal of Software: Evolution and Process</source>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <fpage>53</fpage>
          –
          <lpage>95</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Halstead</surname>
          </string-name>
          .
          <source>Elements of Software Science (Operating and Programming Systems Series)</source>
          . Elsevier Science Inc., New York, NY, USA,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Holmes</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Tweedie</surname>
          </string-name>
          .
          <article-title>Forensic Stylometry: A Review of the CUSUM Controversy</article-title>
          .
          <source>Revue Informatique et Statistique dans les Sciences Humaines</source>
          , pages
          <fpage>19</fpage>
          –
          <lpage>47</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Joshi</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Argiddi</surname>
          </string-name>
          .
          <article-title>Author Identification: An Approach Based on Style Feature Metrics of Software Source Codes</article-title>
          .
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>564</fpage>
          –
          <lpage>568</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ducasse</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Gîrba</surname>
          </string-name>
          .
          <article-title>Semantic clustering: Identifying topics in source code</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <volume>49</volume>
          (
          <issue>3</issue>
          ):
          <fpage>230</fpage>
          –
          <lpage>243</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          .
          <source>Empirical Research in Software Engineering: Concepts, Analysis, and Applications</source>
          . CRC Press,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>McCabe</surname>
          </string-name>
          .
          <article-title>A complexity measure</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          , (
          <issue>4</issue>
          ):
          <fpage>308</fpage>
          –
          <lpage>320</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Parr</surname>
          </string-name>
          .
          <source>The Definitive ANTLR 4 Reference</source>
          . Pragmatic Bookshelf, 2nd edition,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Restrepo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code</article-title>
          .
          <source>In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Conte</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Dunsmore</surname>
          </string-name>
          .
          <article-title>Software science revisited: A critical analysis of the theory and its empirical support</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          , (
          <issue>2</issue>
          ):
          <fpage>155</fpage>
          –
          <lpage>165</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Ward</surname>
            <suffix>Jr</suffix>
          </string-name>
          .
          <article-title>Hierarchical grouping to optimize an objective function</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>58</volume>
          (
          <issue>301</issue>
          ):
          <fpage>236</fpage>
          –
          <lpage>244</lpage>
          ,
          <year>1963</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>