=Paper= {{Paper |id=Vol-1737/T1-4 |storemode=property |title=Personality Recognition Applying Machine Learning Techniques on Source Code Metrics |pdfUrl=https://ceur-ws.org/Vol-1737/T1-4.pdf |volume=Vol-1737 |authors=Hugo A. Castellanos |dblpUrl=https://dblp.org/rec/conf/fire/Castellanos16 }} ==Personality Recognition Applying Machine Learning Techniques on Source Code Metrics == https://ceur-ws.org/Vol-1737/T1-4.pdf
       Personality Recognition Applying Machine Learning
              Techniques on Source Code Metrics

                                                   Hugo A. Castellanos
                                              Universidad Nacional de Colombia
                                                      Bogotá, Colombia
                                             hacastellanosm@unal.edu.co


ABSTRACT
Source code has become a data source of interest in recent years. In the software industry, source code metrics are commonly extracted, mainly for quality assurance purposes. In this paper, source code metrics are used to build programmer profiles in order to identify different personality traits using machine learning algorithms. This work was done as part of the Personality Recognition in SOurce COde (PR-SOCO) shared task in the Forum for Information Retrieval Evaluation 2016 (FIRE 2016).

CCS Concepts
•Information systems → Content analysis and feature selection; •Computing methodologies → Supervised learning by regression; Cluster analysis; •General and reference → Metrics; •Software and its engineering → Parsers;

Keywords
Personality recognition; Source code metrics; Support Vector Regression

1.   INTRODUCTION
   Text has always been of interest in information retrieval, as text-based documents contain valuable information about the author. During recent decades source code has become a source of valuable information as well. Many efforts in this field have aimed to improve both processes and products in the software development industry [8].
   The main efforts in source code analysis have focused on forensic applications such as author recognition [5] and plagiarism detection [2]. Several techniques have been used successfully in these tasks, such as n-grams, source code metrics, coding styles and abstract syntax trees [6]. Other applications of source code analysis include feature location [3] and topic identification [7], among others.
   The PR-SOCO shared task consisted of predicting the personality traits of a programmer given a set of his/her source codes. These source codes, like any other human production, may be influenced by personality.
   In this work, the use of source code metrics is proposed to find information about the program author, specifically the author's personality traits based on the Big-5 personality test. In addition, machine learning methods are used to predict the personality traits from the extracted source code metrics.
   The rest of this paper is organized as follows. Section 2 presents a general background on source code metrics. Section 3 describes the proposed approach. Section 4 presents the machine learning strategies. Section 5 presents the obtained results. Finally, Section 6 concludes the paper.

2.   BACKGROUND ON SOURCE CODE METRICS
   According to Malhotra [8], software metrics are used to assess the quality of the product or of the process used to build it. Such metrics have the following characteristics:

   • Quantitative: metrics have a value.
   • Understandable: the way the metric is calculated must be easy to understand.
   • Validatable: metrics must capture the attributes they were designed to capture.
   • Economical: it must be economical to capture the metric.
   • Repeatable: if measured several times, the results should be the same.
   • Language independent: the metrics should not depend on a specific language.
   • Applicability: the metric should be applicable in any phase of software development.
   • Comparable: the metric should correlate with another metric capturing the same concept.

Source code metrics must have a scale, which can be:

   • Interval: it is given by a defined range of values.
   • Ratio: it is a value which has an absolute minimum or zero point.
   • Absolute: it is a simple count of the elements of interest.
   • Nominal: it is a value which mainly defines a discrete scale of values, like 1-present or 0-not present.
   • Ordinal: it is a categorization which is intended to order or rank, for instance levels of severity: critical, high, medium, etc.
  The metrics can be classified according to the intended measure:

   • Size: usually intended to estimate cost and effort. The most popular metric in this category is source lines of code (SLOC), but in object oriented languages size can also be measured by the number of classes, methods and attributes.
   • Software quality: intended to measure the quality of the software. These metrics can be divided into the following categories:
        – Based on defects: they measure the level of defects. The main metrics of this category are the defect density, defined as the number of defects per SLOC, and the defect removal effectiveness, defined as the number of defects removed in a phase divided by the latent defects. If the latent defects are unknown, they can be estimated based on previous phases.
        – Usability: these metrics are intended to measure user satisfaction when using the software. Satisfaction can be given by ease of use and ease of learning.
        – Complexity metrics [9]: they are oriented to measuring how difficult a piece of source code is to test or maintain. These metrics also give information about the number of instructions during execution.
        – Testing: intended to measure the progress of testing on a piece of software.
   • Object oriented metrics: intended to measure features of the object oriented paradigm. They can be divided into:
        – Coupling: measures the level of interdependence between classes. It is calculated by counting the number of classes called by another class.
        – Cohesion: measures how many elements of a class are functionally related to each other.
        – Inheritance: measures the depth of the class hierarchy.
        – Reuse: measures the number of times that a class is reused.
        – Size: intended to measure size, not only in lines of code but also in the particularities of the object oriented paradigm, like method count, attribute count, class count, etc.
   • Evolutionary metrics: try to measure the evolution of a piece of software based on different elements like revisions, refactorings and bug fixes. They measure how many lines of code are new, modified or deleted.

  Additionally, the empirical Halstead metrics [4] should also be considered. These metrics are calculated from the operands (identifiers) and operators (keywords, ++, +). The vocabulary n, given in Equation 1, is the sum of the unique operators (n_1) and unique operands (n_2). The length N, described in Equation 2, is the sum of the total number of operators (N_1) and operands (N_2).

      n = n_1 + n_2                                         (1)

      N = N_1 + N_2                                         (2)

  The Halstead volume (V), described in Equation 3, is a measure of size, but it is also interpreted as the number of mental comparisons that were needed to write a program of length N. Moreover, the difficulty (D), shown in Equation 4, describes how difficult the program is to write. It is highly related to volume, because as the volume increases the difficulty does as well.

      V = N \log_2 n                                        (3)

      D = \frac{n_1}{2} \cdot \frac{N_2}{n_2}               (4)

  The effort (E), described in Equation 5, indicates the effort required to write a program of high difficulty.

      E = D \cdot V                                         (5)

  Finally, the effort is the basis for calculating the time to understand/implement (T) and the bugs delivered (B), as can be seen in Equations 6 and 7, respectively. The time metric is related to the Stroud number [12], which is the "number of elementary discriminations per second". Stroud claimed that this number ranges from 5 to 20, but Halstead's experiments indicated empirically that the best number in this case was 18.

      T = \frac{E}{18}                                      (6)

      B = \frac{E^{2/3}}{3000}                              (7)

3.   SOURCE CODE ANALYSIS FOR PERSONALITY RECOGNITION
   Text documents contain information about the author. In the work described in [1], the authors were able to show that certain personality traits could be predicted based on a text, in this case, an essay.
   The present work starts from the hypothesis that source code, as a form of text, leaves traces of the author's personality traits. Within the scope of this work, source code is a text document written by a single author. It is worth mentioning that the solution to a single problem could be implemented in several ways by a programmer, which gives a certain guarantee of uniqueness.
   To develop this hypothesis, a method is proposed to extract metrics from source code in order to predict the personality traits. The general method is summarized in Figure 1. As a first step, the source examples provided are separated into individual files. Then a set of metrics is extracted from the source codes using a source code analyzer. With the extracted metrics as input, machine learning methods are applied in order to predict the personality traits of the authors. Finally, the results are presented.
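As an illustration of the Halstead measures in Equations 1 to 7, the following minimal sketch computes them from four counts. It assumes the standard Halstead convention in which n_1 and N_1 count operators and n_2 and N_2 count operands; the function name, the dictionary keys and the example counts are hypothetical and are not taken from the analyzer used in this work.

```python
import math


def halstead(n1, n2, N1, N2, stroud=18):
    """Compute the Halstead measures from operator and operand counts.

    n1, n2: number of distinct operators and distinct operands.
    N1, N2: total number of operators and total number of operands.
    stroud: Stroud number; 18 is the value reported by Halstead.
    """
    n = n1 + n2                  # Equation 1: vocabulary
    N = N1 + N2                  # Equation 2: length
    V = N * math.log2(n)         # Equation 3: volume
    D = (n1 / 2) * (N2 / n2)     # Equation 4: difficulty
    E = D * V                    # Equation 5: effort
    T = E / stroud               # Equation 6: time to understand/implement
    B = E ** (2 / 3) / 3000      # Equation 7: bugs delivered
    return {"n": n, "N": N, "V": V, "D": D, "E": E, "T": T, "B": B}


# Hypothetical counts for a very small program.
m = halstead(n1=4, n2=4, N1=8, N2=8)
print(m)
```

With these counts the vocabulary is 8 and the length is 16, so the volume is 16 · log2(8) = 48, the difficulty is (4/2) · (8/4) = 4 and the effort is 192.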
              Figure 1: Process summary.

   The provided corpus consisted of one source code file per person, and another file indicating each author and his/her personality traits (ground truth). Each source code file contained several pieces of source code separated by a mark. The file was split into several individual files, keeping track of the author-file relationship.
   An analyzer was written using ANTLR 4 [10] with the Java grammar. From each individual file the source code metrics described in Table 1 were extracted.

Table 1: Metrics extracted from source code
Metric                                     Basic description
Amount of files                            The total amount of files.
Average source lines of code               The average of source lines of code.
Average class number per file              The average of classes per source code file.
Average source code lines per class        The average of source code lines per class.
Average attributes per class               The average of attributes contained in a class.
Average methods per class                  The average number of methods contained in a class.
Average class name length                  The average length of a class name.
Average amount of for loops                The average amount of for loops contained in a method.
Average amount of while loops              The average amount of while loops contained in a method.
Average amount of if clauses               The average amount of if clauses contained in a method.
Average amount of if-else clauses          The average amount of if-else clauses contained in a method.
Average identifier length                  The average identifier length per file.
Average parameters                         Average number of parameters in methods.
Average cyclomatic complexity              Indicates the cyclomatic complexity average.
Average of static attributes               The average number of static attributes contained in a class.
Average of static methods                  The average of static methods contained in a class.
Halstead bugs delivered                    Indicates the number of possible bugs generated, based on the Halstead metrics.
Halstead difficulty                        An index which measures the difficulty to write the program.
Halstead effort                            An index which measures the necessary effort to write the code.
Halstead time to understand or implement   An index which indicates the time taken to write a source code.
Halstead volume                            Indicates how much information the reader needs to understand the code.

   As can be seen, most of the metrics are based on counting and averaging. All the metrics were normalized, and the normalized data were the input to the machine learning algorithms.
   As the extracted metrics come from similar categories, a hierarchical clustering using Ward's method [13] was applied. It was found that certain related metrics were too close to each other. Therefore, they were consolidated as follows:

   • Length metrics: contains the metrics related to some length/size measure. It is calculated as the average of: amount of files, average source lines of code, average class number per file, average source code lines per class, average attributes per class, average methods per class, average class name length, and the average number of parameters.

   • Complexity metrics: contains the metrics related to algorithm complexity. It is calculated as the average of: average amount of for loops, average amount of while loops, average amount of if clauses, average amount of if-else clauses, and the average identifier length.

   • Halstead: contains all the Halstead metrics extracted. It is calculated as the average of: Halstead bugs delivered, Halstead difficulty, Halstead effort, Halstead time to understand or implement, and Halstead volume.
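The normalization, clustering and consolidation steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used in this work: min-max scaling is assumed because the normalization method is not stated, SciPy's `linkage` and `fcluster` stand in for whatever clustering tool was actually used, and the function names and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def min_max_normalize(X):
    """Scale each metric (column) to [0, 1]; constant columns become 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span


def cluster_metrics(X_norm, n_groups):
    """Group metrics (columns) with Ward's hierarchical clustering."""
    Z = linkage(X_norm.T, method="ward")  # one observation per metric
    return fcluster(Z, t=n_groups, criterion="maxclust")


def consolidate(X_norm, labels):
    """Replace each group of metrics by its per-author average."""
    groups = np.unique(labels)
    return np.column_stack(
        [X_norm[:, labels == g].mean(axis=1) for g in groups]
    )


# Hypothetical data: 6 authors (rows) x 4 metrics (columns).
# Columns 0 and 1 rise together; columns 2 and 3 fall together.
X = np.array([
    [1, 2, 9, 8],
    [2, 4, 7, 7],
    [3, 6, 6, 5],
    [4, 8, 4, 4],
    [5, 10, 2, 2],
    [6, 12, 1, 1],
])

X_norm = min_max_normalize(X)
labels = cluster_metrics(X_norm, n_groups=2)
features = consolidate(X_norm, labels)
print(labels, features.shape)
```

In this toy case the two pairs of columns fall into separate clusters, so each author ends up with two averaged features; with the metrics of Table 1 the same idea would yield the three group averages listed above.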
4.     MACHINE LEARNING METHODS
  This section describes the machine learning methods used. Each one corresponds to a submission sent to the shared task: submission 1 corresponds to support vector regression (SVR) over source code metrics, submission 2 corresponds to extra trees regressor (ETR), and submission 3 corresponds to support vector regression over averages.

4.1      Support vector regression (SVR) on metrics
   An SVR algorithm was used with the extracted metrics as input. For each personality trait an independent SVR was used, and a 6-fold cross validation was executed over the corpus. The best parameters according to this validation can be seen in Table 2. Figure 2 shows the resulting mean squared error (y axis) versus the variation of γ (x axis, in logarithmic scale) with the best C and ε values in cross validation. This behavior was similar for all the personality traits.

Table 2: Best parameters for SVR with metrics according to cross validation
Personality trait         C     γ      ε
Emotional stability       32    8      2^-14
Extroversion              32    16     2^-12
Openness to experience    32    16     2^-19
Agreeableness             32    8      2^-10
Conscientiousness         32    16     2^-54

Figure 2: Variation of γ parameter in SVR versus the resulting error in cross validation with the best C and ε parameters.
[Plot: x axis, γ from 2^-6 to 2^5; y axis, values from −0.09 to −0.01.]

4.2      Extra trees regressor (ETR) on metrics
  Another method applied was the extra trees regressor. For each personality trait a 6-fold cross validation was performed. For all traits, the best value of the number of estimators parameter was 77.

4.3      Support vector regression (SVR) on averages
  Based on the clustering results, an SVR was used with the metric averages as input, i.e., length metrics, complexity metrics, and Halstead metrics. The first step was to calculate the variance. As the variance of the complexity metrics was too low, they were removed, and only the length and Halstead average metrics were used as input.
  The best parameters according to cross validation can be seen in Table 3. The plots of γ versus error for the best C and ε values behave similarly to the one shown in Figure 2.

Table 3: Best parameters for SVR with metric averages according to cross validation
Personality trait         C     γ      ε
Emotional stability       32    1/49   2^-11
Extroversion              32    2^-3   2^-10
Openness to experience    32    1/49   2^-10
Agreeableness             32    2      2^-11
Conscientiousness         32    0.5    2^-37

5.   RESULTS
   Using the mentioned algorithms with the previously described inputs and parameters, predictions were made on the test dataset. Results can be seen in Tables 4, 5, and 6.

Table 4: Results over test data with SVR using metrics as input
Personality trait         MSE     PC
Emotional stability       11.83   0.05
Extroversion              9.54    0.11
Openness to experience    8.14    0.28
Agreeableness             10.48   -0.08
Conscientiousness         8.39    -0.09

Table 5: Results over test data with Extra Tree Regressor using metrics as input
Personality trait         MSE     PC
Emotional stability       10.31   0.02
Extroversion              9.06    0.0
Openness to experience    7.27    0.29
Agreeableness             9.61    -0.11
Conscientiousness         8.47    0.16

Table 6: Results over test data with SVR using metric averages as input
Personality trait         MSE     PC
Emotional stability       10.24   0.03
Extroversion              9.01    0.01
Openness to experience    7.34    0.3
Agreeableness             9.36    0.01
Conscientiousness         9.99    -0.25

   The three proposed methods obtained similar performance in terms of Root Mean Squared Error (RMSE). The SVR with metrics has slightly better results. This could be caused by the removal of the complexity metrics.
   When evaluated with RMSE, the Openness trait was the best result in all three applied methods, being consistent
with other participant results, and showing better results than the baseline in submissions 2 and 3. Conscientiousness followed with the best error for the SVR and Extra Tree Regressor.
   The worst predicted trait with RMSE was Emotional Stability/Neuroticism in all methods; based on the results of other participants¹, this was a general result [11]. A deeper study of this particular trait is required to improve the results.
   When measured with Pearson Product-Moment Correlation (PC), the results are very different among runs. However, submissions 2 and 3 showed much better results than the baseline, because they indicate a stronger correlation than the one shown by the baseline. The SVR with averages has an important correlation in openness, with a value of 0.3, and in conscientiousness, with a value of -0.25. In the ETR run, openness was the highest value with 0.29. SVR over metrics in openness also had the highest value with 0.28. This trait

[4] M. H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA, 1977.
[5] D. I. Holmes and F. J. Tweedie. Forensic Stylometry: A Review of the CUSUM Controversy. Revue Informatique et Statistique dans les Sciences Humaines, pages 19–47, 1995.
[6] R. R. Joshi and R. V. Argiddi. Author Identification: An Approach Based on Style Feature Metrics of Software Source Codes. 4(4):564–568, 2013.
[7] A. Kuhn, S. Ducasse, and T. Gîrba. Semantic clustering: Identifying topics in source code. Information and Software Technology, 49(3):230–243, 2007.
[8] R. Malhotra. Empirical Research in Software Engineering: Concepts, Analysis, and Applications. CRC Press, 2015.
                                                                     [9] T. J. McCabe. A complexity measure. IEEE
was the most consistent among all the used methods.
                                                                         Transactions on software Engineering, (4):308–320,
   It is interesting that PC shows correlations with openness
                                                                         1976.
and conscientiousness. This is a good result because indi-
cates that the used metrics have certain relationship whit the      [10] T. Parr. The Definitive ANTLR 4 Reference.
mentioned personality traits. Compared with the baseline                 Pragmatic Bookshelf, 2nd edition, 2013.
RMSE, the proposed method performed slightly better, but            [11] F. Rangel, F. González, F. Restrepo, M. Montes, and
still it is not significant, which shows that more work is re-           P. Rosso. Pan at fire: Overview of the pr-soco track on
quired to obtain a good predictor of personality. Therefore,             personality recognition in source code. In Working
it is necessary to include more source code metrics within               notes of FIRE 2016 - Forum for Information Retrieval
this study. This could lead to find that certain metrics are             Evaluation, Kolkata, India, December 7-10, 2016,
related to specific personality traits.                                  CEUR Workshop Proceedings. CEUR-WS.org, 2016.
                                                                    [12] V. Y. Shen, S. D. Conte, and H. E. Dunsmore.
                                                                         Software science revisited: A critical analysis of the
6.     CONCLUSIONS AND FUTURE WORK                                       theory and its empirical support. IEEE Transactions
   The source code metrics extracted and used as input to                on Software Engineering, (2):155–165, 1983.
the machine learning methods were enough to get a close             [13] J. H. Ward Jr. Hierarchical grouping to optimize an
prediction of several personality traits. Other approaches               objective function. Journal of the American statistical
can be consulted in [?] which shows other results and ap-                association, 58(301):236–244, 1963.
proximations for the PR-SOCO task.
   As the PC denotes certain correlation, in this case par-
ticularly with openness, this could mean that the metrics
considered in this work are likely related to the mentioned
trait. However, as there are several other metrics with differ-
ent purposes, like quality, readability, etc., the use of more of
those metrics could improve the prediction. Other metrics
not considered in this study may have better relationships
with the personality traits. This work could be extended by
exploring other metrics an its relationship with each person-
ality trait.

7.     REFERENCES
    [1] S. Argamon, S. Dhawle, M. Koppel, and J. W.
        Pennebaker. Lexical predictors of personality type.
        Proceedings of joint annual meeting of the interface
        and The Classification Society of North America,
        pages 1–16, 2005.
    [2] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan,
        C. Voss, F. Yamaguchi, and R. Greenstadt.
        De-anonymizing Programmers via Code Stylometry.
        USENIX sec, pages 255–270, 2015.
    [3] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk.
        Feature location in source code: A taxonomy and
        survey. Journal of software: Evolution and Process,
        25(1):53–95, 2013.
1 http://www.autoritas.es/prsoco/evaluation/