CAPS-PRC: A System for Personality Recognition in Programming Code
Notebook for PAN at FIRE16

Ivan Bilan, Eduard Saller, Benjamin Roth
Center for Information and Language Processing
Ludwig Maximilian University of Munich
Oettingenstr. 67, Munich, Germany
ivan.bilan@gmx.de, eduard@saller.io, beroth@cis.uni-muenchen.de

Mariia Krytchak
Department of Psychology
Ludwig Maximilian University of Munich
Leopoldstr. 13, Munich, Germany
mariia.krytchak@gmx.de

ABSTRACT
This paper describes the participation of the CAPS-PRC system developed at LMU Munich in the personality recognition shared task (PR-SOCO) organized by PAN at the FIRE16 Conference. The machine learning system uses the output of a Java code analyzer to investigate the structure of a given program, its length and its average variable-name length, and it also takes into account the comments a given programmer wrote. The comments are analyzed with language-independent stylometric features, including TF-IDF distribution, average word length, type/token ratio and more. The system was evaluated using Root Mean Squared Error (RMSE) and Pearson Product-Moment Correlation (PC). The best run exhibited the following results: Neuroticism (RMSE - 10.42, PC - 0.04), Extroversion (RMSE - 8.96, PC - 0.16), Openness (RMSE - 7.54, PC - 0.1), Agreeableness (RMSE - 9.16, PC - 0.04), Conscientiousness (RMSE - 8.61, PC - 0.07).

Keywords
machine learning; Big Five personality traits; source code analysis; abstract syntax tree

1.   INTRODUCTION
The main purpose of the task is to investigate whether it is possible to predict personality traits of programmers based on the source code written by them [8]. Previous research has identified a relationship between personality factors and computer programming styles using different measures of personality [2] [4]. The task considers the Big Five personality traits, which were assessed with the NEO-PI-R Inventory [5] to form the training set [8]: extroversion, emotional stability/neuroticism, agreeableness, conscientiousness, and openness to experience. The Big Five Model, i.e. five broad, fairly independent dimensions, encompasses all personality traits and is considered to describe personality in a comprehensive way. The NEO-PI-R Inventory is a statistically reliable and valid tool that operationalizes the Big Five Model through self/other-assessment and has been applied in various cross-professional and cross-cultural contexts to describe personality.

2.   EXPERIMENTAL SETUP

2.1   Approaching the problem
Based on the available research on the Big Five psychological traits [5] [3], we can see that the traits are considered to be independent of each other. For this reason, each psychological trait was viewed and analyzed individually. Figures 1 to 5 show the distribution of the training set for each psychological trait by author. Table 1 shows the mean trait distribution.
Since each programmer/author has submitted more than one program, we approach the problem from two different angles (a short sketch of both aggregation strategies follows this list):
1) The feature vectors are extracted for each programmer, by first extracting them for each program and then averaging all the underlying feature vectors into one single feature vector for the author. The classifier learns from a single feature vector per author, where the author represents one sample in the dataset.
2) The classifier is trained at the level of programs. Each program inherits the trait value of its author. The feature vectors are extracted for each program and the classifier regards each program as a training instance. To get back to the level of authors (while the final prediction should be made for the author), the predictions are averaged over all programs belonging to a certain author. The final result is a single prediction for each author based on the predictions produced for each underlying program.
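To make the two aggregation strategies concrete, the following is a minimal sketch rather than the exact code used in our runs; it assumes per-program feature vectors are already available as NumPy arrays, and the names programs_by_author, traits_by_author and regressor (any model with a scikit-learn-style fit/predict interface) are illustrative.

import numpy as np

def author_level_dataset(programs_by_author, traits_by_author):
    """Strategy 1: average all program feature vectors of an author
    into a single feature vector (one sample per author)."""
    X, y = [], []
    for author, program_vectors in programs_by_author.items():
        X.append(np.mean(program_vectors, axis=0))
        y.append(traits_by_author[author])
    return np.array(X), np.array(y)

def program_level_dataset(programs_by_author, traits_by_author):
    """Strategy 2: every program is a training instance and inherits
    the trait value of its author."""
    X, y = [], []
    for author, program_vectors in programs_by_author.items():
        for vec in program_vectors:
            X.append(vec)
            y.append(traits_by_author[author])
    return np.array(X), np.array(y)

def predict_author_from_programs(regressor, program_vectors):
    """Map program-level predictions back to a single author-level
    prediction by averaging them."""
    return float(np.mean(regressor.predict(np.array(program_vectors))))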
Figure 1: Author Distribution, Agreeableness
Figure 2: Author Distribution, Conscientiousness
Figure 3: Author Distribution, Extroversion
Figure 4: Author Distribution, Neuroticism
Figure 5: Author Distribution, Openness

       Trait           Mean Value   Standard Deviation
  Agreeableness          47.02             8.95
 Conscientiousness       46.37             6.46
   Extroversion          45.22             8.19
   Neuroticism           49.92            11.15
    Openness             49.51             6.68

Table 1: Mean Trait Distribution, Training Set

2.2   Feature Extraction

2.2.1   Abstract syntax tree
We use a grammar γ specifically designed for the analysis of a programming language β, which in the context of the task was the Java programming language. The grammar γ combined with a parser ρ provides a semantic representation of the source code called an abstract syntax tree (AST). Compared to normal parse trees there are some potential advantages. First, the generation of an AST can be interpreted as a normalization step of our feature generation. In contrast to the original source code, which contains inconsistencies such as whitespace or other unneeded characters, the AST represents a concise version of a given program. This also makes the generation of meta-features (compositions of different base features) simpler, due to the strict representation of all parts of the program that matter to the compiler. Additionally, the resulting syntax tree is not necessarily bound by the original syntactic rules of the programming language β, which allows for generalizations of the source code to occur.
In our approach, we use the frequency distribution of all known entities in the grammar to build a feature list for a given program. This shallow use of the AST provides 237 features for a given source code analysis. Examples would be the type of a variable or the nature of a statement (do, for, while, etc.). The implementation of the AST is made possible with the help of the ANTLR parser [6].
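As an illustration of this shallow AST usage, the following sketch counts how often each AST node type occurs in a Java source file. It assumes the third-party javalang package as a convenient stand-in for the ANTLR-based parser actually used in our pipeline, so the node names (and the resulting number of features) differ from the 237 grammar entities described above.

from collections import Counter

import javalang  # third-party Java parser, used here in place of ANTLR

def ast_node_type_counts(java_source: str) -> Counter:
    """Parse Java source and count the occurrences of each AST node type."""
    tree = javalang.parse.parse(java_source)
    counts = Counter()
    for _path, node in tree:          # walk every node in the tree
        counts[type(node).__name__] += 1
    return counts

example = """
public class Hello {
    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) { System.out.println(i); }
    }
}
"""
print(ast_node_type_counts(example).most_common(5))

Stacking such counts into a fixed-order vector, one dimension per grammar entity, yields the per-program AST feature vector.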
2.2.2   Custom Features
In addition to the AST, we used additional features for the source code and also for the comments. The following is an exhaustive list of all additional features used; a short sketch of how some of them are computed is given below.
Code-based features: length of the whole program (in lines of code and in characters), the average length of variable names, and which indentation style the programmer is using (tabs or spaces).
Comment-based features: type/token ratio, usage of punctuation marks, TF-IDF, the frequency of comments (block comments and inline comments counted separately), and average word length.
Author-level features: number of programs submitted (see Table 2) and average length of programs in lines of code.

 Distribution / Dataset        Train Set   Test Set
 Min. Programs per Author           6          14
 Mean Programs per Author          37          37
 Max. Programs per Author         122         109
 Total Number of Programs        1790         772
 Total Number of Authors           49          22

Table 2: Programs per Author
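Most of these features reduce to straightforward counting. The following is an illustrative sketch of a few of them (program length, average variable-name length, indentation style, and type/token ratio of comment text); it simplifies the actual extraction, e.g. the hypothetical regex for variable names stands in for the AST-based extraction used in the real system.

import re

def code_features(source: str) -> dict:
    lines = source.splitlines()
    # naive variable-name capture from simple declarations (illustrative only)
    var_names = re.findall(r"\b(?:int|long|double|float|boolean|char|String)\s+(\w+)", source)
    return {
        "length_lines": len(lines),
        "length_chars": len(source),
        "avg_var_name_length": sum(map(len, var_names)) / len(var_names) if var_names else 0.0,
        "uses_tabs": any(line.startswith("\t") for line in lines),
    }

def comment_features(comment_text: str) -> dict:
    tokens = comment_text.lower().split()
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "avg_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        "punctuation_count": sum(comment_text.count(p) for p in ".,;:!?"),
    }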
2.3   Classification
We experimented with a number of regression models such as Linear Regression, Ridge Regression, Logistic Regression and Gradient Boosted Regression. In addition, we tried to detect outliers with RANdom SAmple Consensus (RANSAC). The final system implementation did not use RANSAC, since it delivered worse results, although this technique should be further investigated with a bigger dataset.
We submitted our final runs based on two machine learning algorithms: Gradient Boosted Regression and Multinomial Logistic Regression. Gradient Boosted Regression was evaluated on the level of authors and on the level of programs, while the Multinomial Logistic Regression was implemented on the level of authors only.
The first approach is based on Gradient Boosted Regression with least squares regression as its loss function, 1100 estimators, a maximum depth of 5 for the individual regression estimators, and a learning rate of 0.1. This approach also utilized the χ2 test for feature selection to choose only the best 200 features from the AST feature extraction pipeline. It was implemented using the scikit-learn Python library [7].
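A configuration equivalent to the one just described can be expressed as a short scikit-learn pipeline. This is a minimal sketch, not our exact training script: feature extraction, tuning and the author-/program-level aggregation are omitted, and X_train/y_train are placeholders. Note that chi2 feature selection requires non-negative feature values, which holds for the AST frequency counts.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# chi2 selection of the 200 best AST features, followed by Gradient Boosted
# Regression with the hyper-parameters reported above.
gbr_pipeline = Pipeline([
    ("select", SelectKBest(chi2, k=200)),
    ("gbr", GradientBoostingRegressor(
        loss="squared_error",   # least squares loss ("ls" in older scikit-learn versions)
        n_estimators=1100,
        max_depth=5,
        learning_rate=0.1,
    )),
])

# gbr_pipeline.fit(X_train, y_train)   # X_train: non-negative feature counts
# y_pred = gbr_pipeline.predict(X_test)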
The second approach is based on a Multinomial Logistic Regression model with the l2-regularized squared loss as its objective function. That is, each feature was multiplied with a trait-specific weight, and the result of this linear combination was the input to a sigmoid activation. As the output of this prediction is in the range [0, 1], we re-scaled the trait values in the training data to the same range for computing the squared loss.
Training was done using stochastic gradient descent with a constant learning rate, and parameters were tuned on the held-out development set using random search. The search space of the parameters was: learning rate ∈ {0.01, 0.1, 1}, number of training epochs ∈ {10, 20, 50, 100, 200, 500}, regularization ∈ {0, 0.001, 0.01, 0.1, 1}, and (mini-)batch size ∈ {1, all}. The best configuration was: learning rate 1, training epochs 2, regularization 0.6, batch size all. This approach was developed with the theano Python library [1].
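The model form of this second approach, a linear combination of the features passed through a sigmoid and trained against re-scaled trait values with a squared loss, can be illustrated in a few lines of NumPy. This is only a sketch of the computation: the actual system built the equivalent computation graph and gradient updates in theano, and per-sample updates are shown here although the full-batch setting was also searched.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(X, y, lr=1.0, epochs=2, l2=0.6):
    """One trait: minimize the squared loss between sigmoid(Xw + b) and
    the trait values y, assumed to be re-scaled to [0, 1]."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = sigmoid(X[i] @ w + b)
            grad = (p - y[i]) * p * (1.0 - p)   # gradient of the halved squared loss w.r.t. the pre-activation
            w -= lr * (grad * X[i] + l2 * w)    # l2-regularized update
            b -= lr * grad
    return w, b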
3.   EXPERIMENTAL RESULTS
The dataset included 49 programmers in the training set (with 1790 programs in total) and 22 programmers in the test set (772 programs). Final evaluation was done with two different evaluation metrics: Root Mean Squared Error (RMSE) and Pearson Product-Moment Correlation (PC). In the Gradient Boosted Regression approach (GBR Approach), the system was tuned to maximize both of these metrics at the same time, while the Multinomial Logistic Regression approach (MLR Approach) concentrated on RMSE. Table 3 gives a detailed overview of the results achieved using Multinomial Logistic Regression at the level of authors. Table 4 shows the results achieved using the Gradient Boosted Regression approach at the level of authors and at the level of programs.
In general, the results are low in terms of both RMSE and PC and only slightly outperform the baseline approaches (see Table 5). Two baselines have been provided by the task organizers [8]:
1) a 3-gram character representation;
2) always predicting the mean trait value of the training dataset.
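For reference, the two evaluation metrics can be computed per trait as in the following minimal sketch using NumPy and SciPy (the official scores were computed by the task organizers).

import numpy as np
from scipy.stats import pearsonr

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson_correlation(y_true, y_pred):
    return float(pearsonr(y_true, y_pred)[0])

# Example with dummy values for one trait:
print(rmse([50, 45, 60], [48, 50, 55]), pearson_correlation([50, 45, 60], [48, 50, 55]))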
                                                                       Subramanyam, J. Sygnowski, J. Tanguay, G. van
    Personality        Author-based Evaluation Metric
      Traits             RMSE              PC
  Agreeableness           9.17            -0.12
 Conscientiousness        8.83            -0.31
   Extroversion           9.55            -0.1
   Neuroticism           10.28             0.14
    Openness              7.25            -0.1

Table 3: Results of the Multinomial Logistic Regression Approach

    Personality        Author-based         Program-based
      Traits           RMSE      PC         RMSE      PC
  Agreeableness        10.89    -0.05        9.16     0.04
 Conscientiousness      8.9      0.16        8.61     0.07
   Extroversion        11.18    -0.35        8.96     0.16
   Neuroticism         12.06    -0.04       10.42     0.04
    Openness            7.5      0.35        7.54     0.1

Table 4: Results of the Gradient Boosted Regression Approach

    Personality      3-gram characters      Mean value
      Traits           RMSE      PC        RMSE     PC
  Agreeableness        9.00      0.20       9.04    0.00
 Conscientiousness     8.47      0.17       8.54    0.00
   Extroversion        9.06      0.12       9.06    0.00
   Neuroticism        10.29      0.06      10.26    0.00
    Openness           7.74     -0.17       7.57    0.00

Table 5: Baseline Approaches
4.   CONCLUSIONS
This paper describes a system that, given the source code collection of a programmer, identifies their personality traits. While the RMSE and PC scores proved promising during development, further investigation suggested that the dataset may be too small to create an effective machine learning system. The compiler-style feature generation process using ASTs, combined with several custom features, could serve as a future baseline for similar tasks.

4.1   Future Work
The task would benefit greatly from an expanded training corpus (more samples per programmer, more programmers). The value distribution of the training set is also an important point: the current training set exhibits normally distributed scores for each Big Five trait, and a more robust system could be created by using an equal number of samples within the low, mid and high value ranges.
Additionally, further feature engineering, additional statistical analysis of the AST output, and transferring strategies from other NLP tasks involving syntax trees onto the current task could improve the system.


5.   REFERENCES
[1] R. Al-Rfou, G. Alain, A. Almahairi, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[2] C. Bishop-Clark. Cognitive style, personality, and computer programming. Computers in Human Behavior, 11(2):241–260, 1995.
[3] O. P. John and S. Srivastava. The Big Five trait taxonomy: History, measurement, and theoretical perspectives. Handbook of Personality: Theory and Research, 2:102–138, 1999.
[4] Z. Karimi, A. Baraani-Dastjerdi, N. Ghasem-Aghaee, and S. Wagner. Links between the personalities, styles and performance in computer programming. Journal of Systems and Software, 111:228–241, 2016.
[5] F. Ostendorf and A. Angleitner. NEO-PI-R: NEO-Persönlichkeitsinventar nach Costa und McCrae. Hogrefe, 2004.
[6] T. Parr. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf, 2013.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[8] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.