     PRHLT at PR-SOCO: A Regression Model for Predicting
              Personality Traits from Source Code
                                   Notebook for PR-SOCO at FIRE 2016
                          Maite Giménez                                          Roberto Paredes
           Pattern Recognition and Human Language                   Pattern Recognition and Human Language
             Technology (PRHLT) Research Center                       Technology (PRHLT) Research Center
               Universitat Politècnica de València                      Universitat Politècnica de València
           Camino de Vera s/n, 46022 Valencia, Spain                Camino de Vera s/n, 46022 Valencia, Spain
                    mgimenez@dsic.upv.es                                     rparedes@dsic.upv.es

ABSTRACT                                                           different sources: social media, essays, blog posts, etc. [1,
This paper describes our participation in the PAN@FIRE             2, 14]. Finally, it is noteworthy that previous studies [12]
Personality Recognition in Source Code (PR-SOCO) 2016              have already proven the impact of the personality traits in
shared task. We have proposed two different approaches to          the behavior of developers in the FLOSS community 1 .
tackle this task: on the one hand, each code sample from           Previously, there were some efforts to evaluate Personal-
each author was taken as an independent sample and it was          ity Recognition systems in several shared tasks, using texts
vectorized using word n-grams; on the other hand, all the          gathered from Twitter [17], YouTube Vlogs, and Mobile
code from an author was taken as a unique sample, and              Phone interactions [4]. However, the Personality Recogni-
it was vectorized using word n-grams together with hand-           tion in Source Code (PR-SOCO) shared task was the first
crafted features that may determine the personality traits of      competition where the objective was to determine the per-
an author. Regardless of the approach, a regression model          sonality of developers from the source code they wrote, lay-
was trained to classify the personality traits of the author       ing groundwork for a fair comparison between different ap-
of a sample of source code. All the systems we have sub-           proaches and future work.
mitted to be evaluated have achieved a root mean square            In this paper we describe our participation in the
error (RMSE) below the mean RMSE of the participants of            PR-SOCO task. The rest of the paper is organized as
the shared task. Moreover, one of our runs, the one that in-       follows. Section 2 defines the Personality Recognition
cluded the hand-crafted features, held the best result in the      task and Section 3 describes the data. In Section 4 the
personality trait Agreeableness. This suggests that in the         proposed model is described. Next, in Section 5, the results
absence of enough independent samples to train a machine           achieved are presented. Finally, in Section 6 our results are
learning system, hand-crafted features are able to obtain          discussed, and future work is proposed.
better results.

Keywords                                                           2.   TASK DEFINITION
                                                                     The main objective proposed by the organizers of the PR-
PR-SOCO; Author profiling; Personality Recognition; Source
                                                                   SOCO shared task was to predict the personality traits of
Code; Natural Language Processing; Machine Learning; Re-
                                                                   developers given a collection of their source code. The per-
                                                                   sonality of a developer was determined following the Five
                                                                   Factor Theory or Big Five [5, 11, 3] which is the most widely
1.   INTRODUCTION                                                  accepted in psychology. Therefore, five traits define the per-
   One of the new emerging research areas in Natural Lan-          sonality of an author. Those traits are: agreeableness (A),
guage Processing (NLP) is Personality Recognition (PR),            conscientiousness (C), extroversion (E), openness to experi-
which seeks to classify the personality traits of the author       ence (O), and emotional stability / neuroticism (N). Each
of a text. In psychology, Norman (1963) [11] pro-                  trait was labeled within a range between 20 and 80. The
posed a taxonomy for describing the personality along five         models were evaluated by the organizers using two metrics:
dimensions known as “Big Five”, which are: agreeableness,          the average Root Mean Squared Error (RMSE) as well as
conscientiousness, extroversion, openness to experience, and       the Pearson Product-Moment Correlation (PC). For further
emotional stability. Besides, this work determined that our        information about the task, please review the overview pa-
personality traits have a strong influence on our individ-         per of the task [16].
ual behavior. The work carried out by Gill (2003) [8] out-
lined that personality is projected through language.              3.   DATA
Therefore, by exploiting different kinds of NLP techniques,
                                                                      The organizers have gathered 70 samples of source code
it is possible to infer the personality of the author of a text.
                                                                   from 70 different programmers. In order to train the partic-
In addition, Personality Recognition can be useful in vari-
ous applications such as marketing, sociology, etc. [6, 7, 15,     1
                                                                     Free/Libre Open Source Software https://www.gnu.org/
18]. Also, PR can be inferred using texts extracted from           philosophy/floss-and-foss.en.html
ipants’ models, 49 samples were provided, and 21 were held
to validate the results. Each sample consists of a collection
of source code written in Java. In Table 1 the total number
of training and test samples is shown.

               Table 1: Dataset distribution

          Dataset     Source Code       Authors
          Train              1,741           49
          Test                 751           21

  We have studied the distribution of the number of samples
available for each value of each trait to classify depending
on whether we considered the number of code samples as in-
dependent (number of pieces of source code) or not (number
of authors). Figures 1 and 2 show the number of samples
available for the trait Agreeableness. Similarly, the rest of
the traits presented an equivalent distribution of the num-
ber of training samples available. It should be noted that
the number of authors, and therefore the number of training     Figure 2: Num. of code samples for each value of Agree-
samples available might be insufficient to adjust the parame-   ableness to classify (Code-Based approach).
ters of a machine learning system adequately. If we consider
each sample of code as an independent training sample, we
will have more training samples available, which might be       Author-Based approach uses all the samples of code from
useful for fighting the curse of dimensionality [9]. This has       an author including hand-crafted features in addition
led us to two different approaches that will be described in        to the word n-grams. The features considered were:
Section 4.                                                          the number of samples of code that implemented the
                                                                    same class (hf1 ), the number of allocations (hf2 ), the
                                                                    number of loops (hf3 ), the appearance of pieces of code
                                                                    suspicious of plagiarism (hf4 )2 , the number of imports
                                                                    (hf5 ), the number of functions (hf6 ), the number of
                                                                    exceptions handled (hf7 ), the number of classes devel-
                                                                    oped (hf8 ), the number of different classes developed
                                                                    (hf9 ), the number of comment lines (hf10 ), and the
                                                                    number of prints (hf11 ).

                                                                Code-Based approach assumed independence between the
                                                                    samples. This naı̈ve assumption allowed us to train
                                                                    with 1,741 samples. The CB approach relies solely
                                                                    on the n-grams found in each piece of code, without
                                                                    considering any kind of aggregated information from
                                                                    each author. It generates a prediction for each sample
                                                                    of source code. Therefore, the final prediction for an
                                                                    author is the mean of all the predictions obtained for
                                                                    each piece of code that this author wrote.
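
The two approaches described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the submitted runs: the regular expressions that approximate some of the hand-crafted counts (hf3, hf5, hf7, hf10, hf11) and the function names `handcrafted_features` and `aggregate_by_author` are hypothetical, since the text does not say how the counts were obtained.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed regular expressions approximating a subset of the hand-crafted
# features; the paper does not specify how the counts were extracted.
PATTERNS = {
    "hf3_loops": re.compile(r"\b(?:for|while)\s*\("),
    "hf5_imports": re.compile(r"^[ \t]*import\s+(?:static\s+)?[\w.*]+\s*;", re.MULTILINE),
    "hf7_exceptions": re.compile(r"\bcatch\s*\("),
    "hf10_comment_lines": re.compile(r"^[ \t]*(?://|/\*|\*)", re.MULTILINE),
    "hf11_prints": re.compile(r"\bSystem\.(?:out|err)\.print(?:ln|f)?\s*\("),
}


def handcrafted_features(java_sources):
    """Author-Based view: sum each count over all the code of one author."""
    counts = {name: 0 for name in PATTERNS}
    for source in java_sources:
        for name, pattern in PATTERNS.items():
            counts[name] += len(pattern.findall(source))
    return counts


def aggregate_by_author(author_ids, sample_predictions):
    """Code-Based view: average the per-sample predictions of each author."""
    grouped = defaultdict(list)
    for author, prediction in zip(author_ids, sample_predictions):
        grouped[author].append(prediction)
    return {author: mean(values) for author, values in grouped.items()}


# Hypothetical Java sample used only to exercise the feature counts.
example_sources = [
    "import java.util.List;\n"
    "// sum the values\n"
    "public class Sum {\n"
    "  int total(List<Integer> xs) {\n"
    "    int t = 0;\n"
    "    for (int x : xs) { t += x; }\n"
    "    System.out.println(t);\n"
    "    return t;\n"
    "  }\n"
    "}\n"
]
features = handcrafted_features(example_sources)

# Hypothetical per-sample predictions for one trait and two authors.
author_predictions = aggregate_by_author(
    ["a1", "a1", "a2", "a2", "a2"],
    [50.0, 54.0, 40.0, 43.0, 46.0],
)
```

In the Author-Based setting such counts would be concatenated to the n-gram vector of the author, whereas in the Code-Based setting only the averaging step applies.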

                                                                   As text representation, several vectorizer methods were
                                                                evaluated for each approach. The vectorizers considered
Figure 1: Num. of authors for each value of Agreeableness       were: the term frequency-inverse document frequency (tf-
to classify (Author-Based approach).                            idf) from one to four words (tfidf-words), the tf-idf from
                                                                one to four n-grams of words ignoring the terms that have
                                                                a frequency strictly higher than the threshold 0.5 and ap-
  Notably, we are not exploiting any external dataset or        plying sub-linear scaling (sublinear-1:4), the same but explor-
resource to either train or fine-tune our models.               ing n-grams from one to six words (sublinear-1:6), the tf-idf
                                                                from one to six characters (tfidf-chars), and a bag of words
                                                                (BOW). We carried out a preprocessing phase where code
4.   SYSTEM DESCRIPTION                                         snippets (e.g. a sequence of words that define a loop) were
  Given that the number of data samples available for           replaced by tokens. However, the systems that included this
training machine learning models is crucial, two approaches     2
                                                                  We supposed that those samples of code that instanti-
were evaluated. We have proposed an Author Based (AB)           ate classes that do not belong to the standard library are
approach and a Code Based (CB) approach.                        suspicious of plagiarism, e.g. the class SeparateChaining-
Table 2: RMSE achieved using a 5-fold validation over the train dataset following the Code Based approach. The mean RMSE
and the standard deviation for the 5-fold validation for each trait is reported.

          Model                    Agreeableness    Conscientiousness    Extroversion   Neuroticism        Openness
          sublinear-1:6 & ridge    6.10 (±0.67)       4.81 (±0.41)       5.55 (±0.89)   8.30 (±0.95)    4.93 (±0.52)
          sublinear-1:4 & ridge    6.08 (±0.65)       4.82 (±0.44)       5.53 (±0.87)   8.26 (±0.94)    4.95 (±0.55)
          sublinear-1:6 & LR.      6.11 (±0.85)       4.79 (±0.47)       5.94 (±0.89)   8.54 (±1.02)    4.85 (±0.43)
          sublinear-1:4 & LR.      6.07 (±0.81)       4.83 (±0.47)       5.89 (±0.84)   8.49 (±1.01)    4.91 (±0.44)
          sublinear-1:6 & RFR.     6.10 (±0.67)       5.00 (±0.72)       5.55 (±0.89)   8.30 (±0.95)    4.93 (±0.52)

phase obtained worse results than those systems without pre-     of code that implemented the same class hf1 , the appearance
processing. This phenomenon was previously reported in the       of pieces of code suspicious of plagiarism hf4 , the number
author profiling literature [1, 10]. Our results confirm that    of classes developed hf8 , and the number of different classes
the preprocessing phase also has a negative impact on the        developed hf9 .
personality recognition task from source code.
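
The vectorizers listed for the text representation could be configured in scikit-learn roughly as shown below. This is a sketch under assumptions: the mapping of each name onto `TfidfVectorizer` and `CountVectorizer` parameters is our reading of the description (in particular, the 0.5 threshold is taken to be the document-frequency cut-off `max_df`), and `toy_corpus` is a hypothetical stand-in for the Java samples.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Assumed mapping of the vectorizer names used in the text onto
# scikit-learn parameters; the exact settings are not given in the paper.
vectorizers = {
    "tfidf-words": TfidfVectorizer(ngram_range=(1, 4)),
    "sublinear-1:4": TfidfVectorizer(ngram_range=(1, 4), max_df=0.5, sublinear_tf=True),
    "sublinear-1:6": TfidfVectorizer(ngram_range=(1, 6), max_df=0.5, sublinear_tf=True),
    "tfidf-chars": TfidfVectorizer(analyzer="char", ngram_range=(1, 6)),
    "BOW": CountVectorizer(),
}

# Hypothetical toy corpus standing in for the Java source code samples.
toy_corpus = [
    "public class Stack { int size; }",
    "public class Queue { int head; }",
    "import java.util.List; for (int i = 0; i < n; i++) { total += i; }",
    "try { reader.close(); } catch (IOException e) { System.out.println(e); }",
]

# Fit every vectorizer on the same corpus and report the vocabulary size.
matrices = {name: vec.fit_transform(toy_corpus) for name, vec in vectorizers.items()}
for name, matrix in matrices.items():
    print(f"{name}: {matrix.shape[1]} features")
```

With `max_df=0.5`, a term such as `int`, which appears in three of the four toy documents, is dropped from the sublinear vocabularies but kept by `tfidf-words`.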
                                                                   We submitted five different models. Those that performed
   Moreover, both approaches used a regression model to          better during the development phase, which were:
classify the authors automatically. The machine learning
algorithms considered were: an Epsilon-Support Vector Re-             1. run 1: a Code-Based approach using sublinear-1:4 and
gression (SVR) model, a Linear Regression (LR) model,                    Ridge.
a Linear Least Squares model with l2 regularization and               2. run 2: a Code-Based approach using sublinear-1:6 and
α = 0.5 (Ridge), Linear model trained with L1 prior as reg-              Ridge.
ularizer and α = 0.5 (Lasso), a Multi-layer Perceptron clas-
sifier (MLP), a Decision Tree Regressor (DTR), and a Ran-             3. run 3: an Author-Based approach using sublinear-1:4,
dom Forest Regressor (RFR). The task was also evaluated as               the following hand-crafted features: hf1 ⊕ hf4 ⊕ hf8 ⊕
a classification problem using Support Vector Machines, and              hf9 ⊕ hf10 and Ridge.
Random Forest. Nevertheless, the classification approach
behaved worse than the regression approach. Therefore, this           4. run 4: a Code-Based approach using sublinear-1:4 and
classification approach was discarded.                                   Linear Regression (LR).
                                                                      5. run 5: a Code-Based approach using sublinear-1:6 and
   We have developed a pipeline using scikit-learn [13]. In
                                                                         Linear Regression (LR).
the CB approach, we have selected the best combination
of n-grams and the regression model using 5-fold cross-          Two baselines were provided by the organizers: a bag
validation. The selection of the models was a compromise solution.  of words 3-grams with frequency weight (bow), and an ap-
We selected those models that achieved better global RMSE        proach that always predicts the mean value observed in the
computed as the mean of the RMSE for each trait and for          training data (mean). The evaluation results for each per-
each fold:                                                       sonality trait over the test set can be found in Table 3.
                                                                    Eleven teams have presented their respective systems. In
   \[ \frac{\sum_{fold=1}^{5} \left( RMSE_{trait\&fold} \right) / 5}{5} \]
                                                                 total, 48 systems were submitted for evaluation. All the
                                                                 systems we have submitted have performed better than the
                                                                 mean of the systems proposed using the RMSE.
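
The selection criterion described above, the RMSE averaged first over the five folds and then over the five traits, can be illustrated with plain arithmetic. This is only a sketch: the function name `global_rmse`, the candidate names, and all the per-fold values are hypothetical and do not correspond to the figures in Table 2.

```python
from statistics import mean


def global_rmse(rmse_by_trait):
    """Mean over traits of the mean RMSE over the cross-validation folds."""
    return mean(mean(fold_values) for fold_values in rmse_by_trait.values())


# Hypothetical per-fold RMSE values (five folds per trait) for two candidates.
candidates = {
    "candidate_a": {
        "A": [5.5, 6.5, 6.0, 6.0, 6.0],
        "C": [5.0, 5.0, 5.0, 5.0, 5.0],
        "E": [5.0, 5.0, 5.0, 5.0, 5.0],
        "N": [8.0, 8.0, 8.0, 8.0, 8.0],
        "O": [5.0, 5.0, 5.0, 5.0, 5.0],
    },
    "candidate_b": {
        "A": [7.0, 7.0, 7.0, 7.0, 7.0],
        "C": [5.0, 5.0, 5.0, 5.0, 5.0],
        "E": [6.0, 6.0, 6.0, 6.0, 6.0],
        "N": [8.0, 8.0, 8.0, 8.0, 8.0],
        "O": [5.0, 5.0, 5.0, 5.0, 5.0],
    },
}

# The candidate with the lowest global RMSE is the one that would be kept.
best = min(candidates, key=lambda name: global_rmse(candidates[name]))
```

Because a single number summarizes all traits, a candidate can be selected even if it is not the best one for every individual trait, which is the compromise mentioned in the text.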
This has allowed us to obtain models with a competitive          Despite the results achieved during the development phase,
performance for all traits measured with the RMSE. Our           our best performing system was the one that followed the
systems were only optimized for the RMSE, which might            Author-Based approach. This system was able to achieve
affect the performance using the Pearson Correlation since       the best RMSE result in the personality trait Agreeableness.
there is no reciprocity between the RMSE and the Pearson         Nevertheless, our systems’ predictions did not find a correla-
Correlation. Conversely, in the AB approach, the best hand-      tion with the gold standard following the Pearson coefficient
crafted combination was selected applying an ablation test,      metric. Besides, neither the baselines proposed nor the best
and these features were concatenated to the word n-grams         performing participants were able to find a significant cor-
of the best model obtained for the CB approach.                  relation. The best correlation found by the participants was
                                                                 0.62 for the trait Openness, which cannot be considered a
5.   RESULTS                                                     strong positive correlation.
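
The lack of reciprocity between the RMSE and the Pearson correlation can be seen in a small worked example. This is a sketch with hypothetical gold scores. Returning 0.0 when one of the inputs is constant is a convention assumed here (the coefficient is undefined in that case); it matches the 0.0 reported for the mean baseline in Table 3.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between gold and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(x, y):
    """Pearson product-moment correlation; 0.0 by convention for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # undefined for a constant vector; assumed convention
    return cov / (dev_x * dev_y)


# Hypothetical gold scores for one trait.
gold = [40.0, 50.0, 60.0, 45.0, 55.0]

# A system that always predicts the mean of the gold scores.
mean_prediction = [sum(gold) / len(gold)] * len(gold)

# A system that ranks the authors perfectly but is offset by 10 points.
shifted_prediction = [g + 10.0 for g in gold]

print(rmse(gold, mean_prediction), pearson(gold, mean_prediction))
print(rmse(gold, shifted_prediction), pearson(gold, shifted_prediction))
```

The constant predictor obtains the lower RMSE with no correlation, while the shifted predictor obtains a perfect correlation with a higher RMSE, so optimizing one metric gives no guarantee about the other.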
   Hereafter, we will describe the results achieved by our       6.     DISCUSSION AND FUTURE WORK
best models. Table 2 shows the RMSE of our best models             In this paper we have presented our participation in the
at development time. Due to the computational complexity         PAN@FIRE Personality Recognition in Source Code 2016
of performing the grid search over two metrics, we have only     shared task. Two approaches were proposed: an Author-
used the RMSE to adjust our models.                              Based approach and a Code-Based approach. The AB ap-
   After selecting the best model for the Code-Based ap-         proach had our lowest RMSE on all traits but Extroversion. This could be
proach, we have selected the hand-crafted features that im-      explained because the samples we used to train the systems
proved the classification in the Author-Based approach. The      that followed the Code-Based approach were not indepen-
hand-crafted features selected were: the number of samples       dent. Therefore, the results we obtained in the development
Table 3: Evaluation of our participation in the PR-SOCO                  Assessment: Personality Measurement and Testing,
shared task. The first five rows, run 1 up to run 5, show                volume 2. Sage, 2008.
the results achieved by our systems. The traits are: agree-          [4] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez,
ableness (A), conscientiousness (C), extroversion (E), neu-              G. Riccardi, and F. Pianesi. The workshop on
roticism (N), and openness to experience (O). Moreover, the              computational personality recognition 2014. In
performance of the baseline systems is included, as well as              Proceedings of the 22nd ACM international conference
the minimum, maximum and mean performance obtained                       on Multimedia, pages 1245–1246. ACM, 2014.
by the participants at the shared task.
                                                                     [5] P. T. Costa and R. R. McCrae. The revised neo
              (a) RMSE achieved in the test dataset
                                                                         personality inventory (neo-pi-r). The SAGE handbook
                                                                         of personality theory and assessment, 2:179–198, 2008.
     Model               A       C       E       N         O         [6] S. Cruz, F. Q. da Silva, and L. F. Capretz. Forty years
     (CB) run 1        9.29     9.02   8.75    10.67     7.85            of research on personality in software engineering: A
     (CB) run 2        9.36     8.99   8.79    10.46     7.67            mapping study. Computers in Human Behavior,
     (AB) run 3       8.79      8.69    9.0    10.22     7.57            46:94–113, 2015.
     (CB) run 4        9.62     8.86   8.69    10.73     7.81        [7] R. Fuchs. Personality traits and their impact on
     (CB) run 5        9.71     8.89   8.65    10.65     7.79            graphical user interface design. In 2nd Workshop on
     baseline bow       9.0    8.47     9.06   10.29     7.74            Attitude, Personality and Emotions in User Adapted
     baseline mean     9.04     8.54    9.06   10.26     7.57            Interaction, 2001.
     min              8.79     8.38     8.60    9.78     6.95        [8] A. J. Gill. Personality and language: The projection
     max              28.63    22.36   28.80   29.44    33.53            and perception of personality in computer-mediated
     mean              9.72    10.74   12.27   12.75    10.49            communication. PhD thesis, University of Edinburgh,
       (b) Pearson Correlation achieved in the test dataset.         [9] E. Keogh and A. Mueen. Curse of dimensionality. In
     Model                A       C      E       N        O              Encyclopedia of Machine Learning, pages 257–258.
     (CB) run 1        0.03    -0.23   0.31    -0.22   -0.12             Springer, 2011.
     (CB) run 2          0.0   -0.19   0.28    -0.07    0.05        [10] A. McEnery and M. Oakes. Authorship
     (AB) run 3         0.33   -0.12   0.18     0.09    0.03             studies/textual statistics. 2000.
     (CB) run 4        -0.03   -0.09    0.28   -0.15   -0.05        [11] W. T. Norman. Toward an adequate taxonomy of
     (CB) run 5        -0.06   -0.12    0.3    -0.16   -0.02             personality attributes: Replicated factor structure in
     baseline bow       0.20    0.17   0.12    0.06    -0.17             peer nomination personality ratings. The Journal of
     baseline mean       0.0     0.0    0.0     0.0      0.0             Abnormal and Social Psychology, 66(6):574, 1963.
     min               -0.32   -0.31   -0.37   -0.29   -0.36        [12] O. H. Paruma-Pabón, F. A. González, J. Aponte, J. E.
     max                0.38    0.33    0.47   0.36     0.62             Camargo, and F. Restrepo-Calle. Finding
     mean              -0.01   -0.01    0.06   0.04     0.09             relationships between socio-technical aspects and
                                                                         personality traits by mining developer e-mails. In
                                                                         Proceedings of the 9th International Workshop on
phase correspond to over-fitted systems.                                 Cooperative and Human Aspects of Software
However, given that we did not have enough samples, we                   Engineering, pages 8–14. ACM, 2016.
still need to include proper techniques for data augmenta-          [13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
tion. If we were able to get more labeled data, new                      B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
approaches could be studied such as deep learning methods                R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
and word embeddings for text representation.                             D. Cournapeau, M. Brucher, M. Perrot, and
Notably, the minimum error achieved by the partici-                      E. Duchesnay. Scikit-learn: Machine learning in
pants’ proposals in terms of RMSE is close to the baseline models        Python. Journal of Machine Learning Research,
for all the personality traits, and only for some traits a corre-        12:2825–2830, 2011.
lation with the gold standard was found. This highlights the        [14] B. Plank and D. Hovy. Personality traits on twitter -
complexity of the task. Therefore, personality recognition in            or - how to get 1,500 personality tests in a week. In
source code is an open problem and new NLP approaches                    Approaches to Subjectivity, Sentiment and Social
could improve the performance of the systems.                            Approaches to Subjectivity, Sentiment and Social
                                                                         Media Analysis, pages 92–98, 2015.
                                                                    [15] D. Preotiuc-Pietro, J. Eichstaedt, G. Park, M. Sap,
