A Supervised Approach for Personality Recognition in Source Code using Code Analysis Tool at FIRE 2016

Rehana Delair
SNPIT&RC
Bardoli, Gujarat
+91 9904039419
rehanad10@gmail.com

Rutal Mahajan
SNPIT&RC
Bardoli, Gujarat
+91 9426393096
rutal.mahajan@gmail.com

ABSTRACT
Personality recognition from an author’s source code is a task organized by the PR-SOCO team in conjunction with FIRE 2016, the Forum for Information Retrieval Evaluation. The aim is to identify a programmer’s personality traits from a collection of their source code. We have used various supervised learning approaches to train regression models on different sets of features extracted with the static code analysis tool Checkstyle. Based on these features, the trained regression model is used to predict the score for the different personality traits. All systems are evaluated using two metrics: Root Mean Squared Error (RMSE) and Pearson Product-Moment Correlation (PC). Using the M5Rules algorithm as the regression model, our system scored a PC of 0.62 for Openness and 0.33 for Conscientiousness, which is the best score among all submitted runs of our system as well as among all participating systems.

Keywords
Personality Recognition, Machine Learning, Regression, Big-Five Personality traits

1. INTRODUCTION
There is a lot of work going on in the area of Personality Recognition [1] [5] [6] [7]. Personality traits influence most human activities, such as the way people write [1], interact with each other, and make decisions. A programmer’s personality affects the type of software project they choose to participate in [6] or the way they write and structure their code.

Many projects use written text to identify an author’s personality. In “Whose thumb is it anyway?” [5], personal weblogs are analyzed to predict personality traits with the Support Vector Machine algorithm; the main features are word-based bi- and tri-grams. In “Finding relationships between socio-technical aspects and personality traits by mining developer e-mails” [6], developers’ e-mails are used to identify their personality.

Personality recognition from source code differs from these projects because source code has a limited scope. Programmers cannot freely choose their own words; they have to follow a set of pre-defined rules. Identifying personality from source code is therefore a difficult task.

Personality can be defined along five traits using the Big Five Theory [3], which is the most widely accepted in psychology. The five traits are extroversion (E), emotional stability / neuroticism (S, reported as N in Table 2), agreeableness (A), conscientiousness (C), and openness to experience (O).

To collect different features from the given source code, we use Checkstyle [2], a code analysis tool that performs different checks on source code. We use the Weka [4] tool to train the regression model.

The rest of this paper is structured as follows: Section 2 outlines our approach to Personality Recognition in Source Code and the tools used. Section 3 describes the training and test data. Section 4 describes the experiments and Section 5 presents the official results of this task. Finally, we conclude in Section 6.

2. Approach

2.1 Overview
The main process of personality recognition includes the following steps, which are shown in Figure 1:

     1.   Collect individual corpora
          In this step, we collect the training data. Here, these are the source codes of different programmers, provided as training data by the PR-SOCO committee [8].

     2.   Collect associated personality ratings for each participant
          In this step, we collect personality ratings for each programmer. We have used the Big-Five personality traits [3] to describe the personality of an individual. This data is also provided by the PR-SOCO committee [8].

     3.   Pre-processing
          In this step, the given files are converted into a format suitable for Checkstyle [2]. Separator lines are removed from the source code and the data is converted into actual Java files. We have also implemented a function to isolate each single program from the given training files of source code.

     4.   Extract relevant features from the texts
          In this step, the main features are identified from the given source code. We need to find features of good source code that reflect the author’s personality. For this purpose we have used the code analysis tool Checkstyle [2]. It performs different checks on the source code, such as how well the code is commented, how it is indented, whether naming conventions are followed, etc. From these checks we have collected measures for the different features.

     5.   Build statistical models of the personality ratings based on the features
          We have used different regression models to predict the personality traits: Support Vector Machine regression, Gaussian Processes, the M5P algorithm, M5Rules and Random Tree. We have used the Java API of Weka [4] to train the regression models.

     6.   Test the learned models on unseen individuals
          Using the extracted features and the trained regression models, we predicted the scores for the different personality traits.
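
As an illustration of step 4, the following is a minimal Python sketch of how Checkstyle errors and warnings could be turned into numeric features, using the per-line-of-code normalization that this paper describes alongside Table 1. It is not the implementation used for the submitted runs, which relied on the Java API of Weka. The report line format (a trailing check name in square brackets), the non-blank-line definition of "lines of code", and all function names are assumptions made for this example.

```python
import re
from collections import Counter

# Assumed plain-text report line, for example:
#   [WARN] Foo.java:12:5: Missing a Javadoc comment. [JavadocMethod]
# The trailing "[CheckName]" tag is an assumption about the report format.
CHECK_TAG = re.compile(r"\[(\w+)\]\s*$")


def count_violations(report_lines):
    """Count reported errors and warnings per check name."""
    counts = Counter()
    for line in report_lines:
        match = CHECK_TAG.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_code_lines(source):
    """Count non-blank lines of one program (a simple, assumed LOC definition)."""
    return sum(1 for line in source.splitlines() if line.strip())


def per_line_features(report_lines, source, check_names):
    """Return one feature per check: number of violations divided by lines of code."""
    counts = count_violations(report_lines)
    loc = max(count_code_lines(source), 1)  # guard against empty programs
    return [counts[name] / loc for name in check_names]


if __name__ == "__main__":
    # Hypothetical report and program, invented for illustration only.
    report = [
        "[WARN] Foo.java:1:1: Missing a Javadoc comment. [JavadocType]",
        "[WARN] Foo.java:4:9: Name 'X' must match pattern. [LocalVariableName]",
        "[WARN] Foo.java:5:9: Name 'Y' must match pattern. [LocalVariableName]",
    ]
    source = "class Foo {\n\n  void f() {\n    int X = 1;\n    int Y = 2;\n  }\n}\n"
    checks = ["JavadocType", "LocalVariableName", "Indentation"]
    print(per_line_features(report, source, checks))
```

Dividing by program length keeps long and short programs comparable; a check that never fires simply contributes a feature value of zero.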
Figure 1. Flow Diagram of the Process (diagram elements: Training Data, Preprocessing, Individual Programs, Feature Extraction [checkstyle, Errors and Warnings, Collect Features], Features, Build Regression Model, Regression Model; Test Data, Output)

2.2 Features
We have used a total of 154 features of the source code, which are extracted using the static code analysis tool Checkstyle [2]. To train the regression model, these features are divided into two categories: style based features and content based features. These are shown in Table 1.

     1.    Style based Features
           This category contains features related to the style of the code. These features come from checks on code layout and formatting problems. It contains Indentation, Headers, Javadoc comments, white spaces, Block checks, etc.

     2.    Content based Features
           This category contains features related to the content of the source code. These features come from checks on class design problems, method design problems, Annotations, Coding, Imports, Metrics, Modifiers, Naming conventions, Size violations and other miscellaneous features.

Table 1. Different Features Extracted by Checkstyle tool

Category                 Feature Name             Number of features
Style based feature      Headers                    2
Style based feature      Javadoc comments          12
Style based feature      White spaces              16
Style based feature      Block Checks               6
Content based feature    Class design problems      9
Content based feature    Annotations                7
Content based feature    Coding                    43
Content based feature    Imports                    8
Content based feature    Metrics                    6
Content based feature    Modifiers                  2
Content based feature    Naming conventions        15
Content based feature    Size violations            8
Content based feature    Miscellaneous             15
Content based feature    Regular expression         5
Total                                             154

Each single program is separated from the collection of source code and checked using Checkstyle [2]. Errors and warnings are counted and converted into a per-line-of-code format.

3. Data set
The training data set was provided by the PR-SOCO committee and consists of source code written in Java. The data consist of 49 documents, each containing a collection of source code by a different author. These source codes are labeled with the personality traits of the programmer on a continuous scale from 20 to 80.

The test data were also provided by the PR-SOCO committee [8]. They consist of 21 documents, each containing a source code collection. We have used this data to evaluate our system.

4. Experiments
We have a collection of source code written by 49 different programmers along with their personality traits. We have used this data to train our models and then tested them on the 21 unseen source code collections. Two metrics were used to evaluate the system: the average Root Mean Squared Error (RMSE) and the Pearson Product-Moment Correlation (PC) between our software’s scores and the ground-truth scores. Results are discussed in the next section.

RMSE is the square root of the mean of the squared errors, and PC measures the strength of the linear association between two variables.

We have used different supervised regression models to predict the personality traits of different authors: Support Vector Machine, Gaussian Processes, the M5P algorithm, M5Rules and the Random Tree algorithm. A Support Vector Machine represents each data item as a point in n-dimensional space. We have used the default kernel settings for the Support Vector Machine. M5P is a decision tree based algorithm and M5Rules is a rule based algorithm.
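
The two evaluation metrics defined in this section can be written down in a few lines. The Python sketch below is illustrative only and is not the official evaluation script of the task; the function names and the example scores are invented for this example, and returning 0.0 when one of the score lists is constant is a convention assumed here (the correlation is mathematically undefined in that case).

```python
import math


def rmse(predicted, actual):
    """Root Mean Squared Error between two equal-length lists of scores."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson(predicted, actual):
    """Pearson Product-Moment Correlation between two equal-length lists of scores."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return 0.0  # assumed convention: correlation is undefined for constant scores
    return cov / math.sqrt(var_p * var_a)


if __name__ == "__main__":
    # Hypothetical trait scores on the 20 to 80 scale, not taken from the task data.
    predicted = [50, 55, 60]
    actual = [40, 50, 60]
    print("RMSE:", rmse(predicted, actual))
    print("PC:", pearson(predicted, actual))
```

The example also shows why both metrics are reported: the hypothetical predictions are perfectly linearly related to the true scores, so PC is high, while RMSE is still non-zero because the predictions are offset from the true values.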
5. Results
We have submitted a total of five runs, each using a different regression algorithm: Support Vector Machine, Gaussian Processes, the M5P algorithm, M5Rules and the Random Tree algorithm.

Results obtained for the different runs of our system are shown in Table 2. Two metrics are shown for each personality trait in the form RMSE/PC, that is, Root Mean Squared Error / Pearson Product-Moment Correlation. At the bottom of Table 2, measures are shown for two baselines [8]: (a) a bag of character 3-grams with frequency weight; (b) an approach that always predicts the mean value observed in the training data. Using the M5Rules algorithm as the regression model, our system scored a PC of 0.62 for Openness and 0.33 for Conscientiousness, which is the best score among all submitted runs of our system as well as among all participating systems.

For the Neuroticism trait, our predicted scores are positively correlated with the ground truth scores, but Gaussian Processes and SMO give RMSE values close to the worst reported. For the Extroversion trait, the regression models give differing scores that are only weakly correlated with the ground truth scores. For the Openness trait, the predictions are strongly correlated with the ground truth scores (PC up to 0.62) and give good results. For Agreeableness, the predictions are negatively correlated with the ground truth scores, and this trait also yields the highest RMSE among our runs (27.78 with SMO). For Conscientiousness, the predicted scores are positively correlated with the ground truth scores.

Table 2. Official results of different runs of our system (RMSE/PC)

Run             N             E             O             A             C
M5Rules         19.07/0.2     25.22/0.08    23.62/0.62    21.47/-0.15   22.05/0.33
GP              26.36/0.19    16.67/-0.02   15.97/0.19    23.11/-0.13   21.72/0.1
M5P             18.75/0.2     25.22/0.08    20.28/0.54    21.47/-0.15   22.05/0.33
Random Tree     17.55/0.29    20.34/-0.26   16.74/0.27    21.1/-0.06    20.9/0.14
SMO             26.72/0.18    23.41/-0.11   16.25/0.13    27.78/-0.19   15.53/0.27
Baseline bow    10.29/0.06    9.06/0.12     7.74/-0.17    9.00/0.20     8.47/0.17
Baseline mean   10.26/0.00    9.06/0.00     7.57/0.00     9.04/0.00     8.54/0.00
Best Results    9.78/0.36     8.60/0.47     6.95/0.62     8.79/0.38     8.38/0.33
Worst Results   29.44/-0.29   28.80/-0.37   33.53/-0.36   28.63/-0.32   22.36/-0.31

6. Conclusion
Various supervised learning algorithms proved capable of predicting personality trait scores for different authors from their source code. Our current system does not yet refine the effect of individual extracted features on the different personality traits. Such refinement may yield better prediction results than the currently submitted runs.

7. REFERENCES
[1] Celli F., Lepri B., Biel J. I., Gatica-Perez D., Riccardi G., Pianesi F. (2014). The workshop on computational personality recognition 2014. Proc. of the ACM Int. Conf. on Multimedia, pp. 1245-1246.
[2] CheckStyle project, http://checkstyle.sourceforge.net/
[3] Costa P.T., McCrae R.R. (2008). The revised NEO personality inventory (NEO-PI-R). The SAGE handbook of personality theory and assessment 2, 179-198.
[4] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I.H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
[5] Oberlander J., Nowson S. (2006). Whose thumb is it anyway? Classifying author personality from weblog text. Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 627–634, Sydney, July 2006. Association for Computational Linguistics.
[6] Paruma-Pabón O.H., González F.A., Aponte J., Camargo J.E., Restrepo-Calle F. (2016). Finding relationships between socio-technical aspects and personality traits by mining developer e-mails. Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), ICSE.
[7] Rangel F., Celli F., Rosso P., Potthast M., Stein B., Daelemans W. (2015). Overview of the 3rd Author Profiling Task at PAN 2015. CLEF 2015 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1391.
[8] Rangel F., González F., Restrepo F., Montes M., Rosso P. (2016). PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce Code. Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016. CEUR Workshop Proceedings. CEUR-WS.org.