Pisco: A Computational Approach to Predict Personality Types from Java Source Code

Matthias Liebeck, Pashutan Modaresi, Alexander Askinadze, Stefan Conrad
Institute of Computer Science, Heinrich Heine University Düsseldorf
D-40225 Düsseldorf, Germany
{liebeck, modaresi, askinadze, conrad}@cs.uni-duesseldorf.de

ABSTRACT
We developed an approach to automatically predict the personality traits of Java developers based on their source code for the PR-SOCO challenge 2016. The challenge provides a data set consisting of source code with the associated developers' personality traits (neuroticism, extraversion, openness, agreeableness, and conscientiousness). Our approach adapts features from the authorship identification domain and utilizes features that were specifically engineered for the PR-SOCO challenge. We experiment with two learning methods: linear regression and the k-nearest neighbors regressor. The results are reported in terms of the Pearson product-moment correlation and the root mean square error.

CCS Concepts
• Computing methodologies → Artificial intelligence; Natural language processing

Keywords
Computational personality recognition; five factor model; Java source code

1. INTRODUCTION
Author profiling is a research field that deals with the prediction of user properties (e.g., age and gender prediction of an author [10]). The subfield computational personality recognition refers to an interdisciplinary field that incorporates computer science and psychology to automatically infer an author's personality based on his or her generated contents [4]. Although the generated contents can be of any form, we focus on textual contents in this work.

A popular personality model used in computational personality recognition is the five factor model [2]. According to this model, five fundamental traits make up the human personality, each consisting of several facets: neuroticism (anxiety, depression, angry hostility), extraversion (warmth, positive emotions, activity), openness (fantasy, aesthetics, values), agreeableness (trust, straightforwardness, compliance), and conscientiousness (competence, order, dutifulness).

Computational personality recognition has been applied to various domains, such as essays [8], tweets [7], and blogs [11]. An interesting but less studied application is the personality prediction of software developers based on their written source code. Unlike blogs and tweets, which are (mostly) written in natural languages, source code is written in a programming language that might not explicitly reveal the author's personality.

The study of software developers' source code has many practical applications, for instance, in the education sector for detecting plagiarism [1], in the law sector for cybercrime investigation [5], and in the technology sector for identifying the expertise level of programmers [6]. To the best of our knowledge, there have been no studies on the automatic prediction of software developers' personalities based on their source code. A tool capable of predicting the personality of a software developer based on his or her open source projects (GitHub^1, Bitbucket^2, etc.) could dramatically improve the recruitment process of companies: software development requires teamwork, and deciding whether a programmer's personality fits the team is crucial for companies.

In this paper, we introduce a machine learning approach developed in the scope of the PR-SOCO [12] shared task to automatically identify the personality type of a Java developer based on his or her source code. Participants were provided with a training set consisting of Java source code of programmers annotated with the five previously discussed personality traits, and with a test set. The aim of the PR-SOCO task is the development of approaches that predict the personality traits of the programmers in the test set.

We investigated two classes of features: structure features that depend on the programming experience of the programmer (architecture design, code complexity, etc.) and style features related to the code layout that cannot easily be changed by IDEs (comment length, variable length, etc.). We intentionally ignored layout features (line length, formatting style, etc.) since these can easily be modified by IDEs using available formatting and code cleaning functionalities [3].

The remainder of the paper is structured as follows: Section 2 describes the PR-SOCO challenge and our contribution to solving it. The results of our approach are described in Section 3. We conclude and outline future work in Section 4.

^1 https://github.com/
^2 https://bitbucket.org/

2. APPROACH
In order to process the students' Java source code, we first created knife^3, an open-source wrapper for the two Java parsers QDOX^4 and JavaParser^5. Knife parses source code into classes, methods, parameters, and variables and uses the Spark micro framework to provide the parsed code as JSON. Afterwards, pisco^6 consumes the parsed source code, extracts features, and uses machine learning to predict personality traits with linear regression and the k-nearest neighbors regressor.

2.1 Data
The data for the PR-SOCO challenge comprises solutions for different Java programming tasks that were uploaded by students, together with the results of their personality tests. Each of the five personality traits is represented by a value between 20 and 80. The students were allowed to upload more than one solution per programming task and to reuse code from previous exercises or from external resources. The training set comprises 49 data points and the test set contains 21 data points. With such a small amount of data, it can be difficult to train classifiers and to avoid outliers.

Figure 1 shows a boxplot of the personality traits in the training set. It can be observed that the median personality scores lie between 46 and 50.

Figure 1: Distribution of personality traits in the training set

The data was not cleaned by the organizers and, therefore, its quality varied. It sometimes contained debug output, empty classes, syntax errors, or even Python code. Another influencing factor is that students occasionally used external code that was copied into the project, e.g., code from programming lectures at other universities. Since the focus of this challenge is the prediction of the students' personality types, a proper filtering step for external code seems reasonable. Otherwise, the prediction of a student's personality type can be influenced by other coders' personality types. Unfortunately, we were not able to perform a plagiarism check via web search.

2.2 Implemented Features
With the source code parsed by knife, we are able to implement several style and structure features for our machine learning approach.

2.2.1 Style Features
While naming conventions are certainly a controversial topic of debate among software developers (who each have their own programming style), we believe that the naming of classes, methods, fields, and local variables is important for the understanding of code. For instance, overly short or overly long variable names can be difficult to understand. Therefore, the length of such names might correlate with how thoughtful a developer was while writing source code. We decided to use the following style features:

F1: Length of method names

F2: Length of method parameter names

F3: Length of field names

F4: Length of local variable names in methods

An interesting observation is that the training data contains a solution from one student who used a local variable name that is 75 characters long, while the mean length of local variable names over all students is 4.02 (σ = 3.89). Such an outlier can be problematic for linear regression.

2.2.2 Structure Features
We investigated ten structure features that we consider to be related to the developer's programming experience. A more experienced developer might tend to write shorter methods with fewer lines of code, or less code in general.

F5: Number of classes

F6: Cyclomatic complexity
The cyclomatic complexity [9] is a software metric that calculates the number of linearly independent paths in a program's control flow. We calculate the cyclomatic complexity per method by starting with an initial value of 1, which is increased for each occurrence of a control-flow-modifying keyword, such as if or for.
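As an illustration, the per-method computation described for F6 can be sketched as a simple keyword count over a method body. This is a minimal sketch, not the paper's actual implementation: the exact token set (here if, for, while, case, catch, the ternary operator, and the short-circuit operators) is an assumption, since the paper only names if and for.

```python
import re

# Control-flow tokens that each add one linearly independent path.
# The exact token set is an assumption; the paper only names "if" and "for".
# Note: this naive count would also fire on keywords inside strings or comments.
_CONTROL_FLOW = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")

def cyclomatic_complexity(method_body: str) -> int:
    """Approximate McCabe complexity of one Java method body:
    start at 1 and add 1 per control-flow token."""
    return 1 + len(_CONTROL_FLOW.findall(method_body))

body = """
if (x > 0) {
    for (int i = 0; i < x; i++) {
        if (i % 2 == 0) { total += i; }
    }
}
"""
print(cyclomatic_complexity(body))  # 1 + two ifs + one for = 4
```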
F7: Number of methods

F8: Number of method parameters

F9: Length of methods
We included the length of methods in our feature set since long methods can be an indicator that the single responsibility principle is violated and that the method could be refactored into multiple smaller methods. In our experiments, we tested the length of methods both in terms of the number of lines and in terms of characters (without indentation).

F10: Number of fields per class

F11: Number of local variables in methods

F12: Duplicate code measure
We noticed that some students uploaded multiple solutions with very similar looking code. They copied and pasted methods from one class to another while making small changes to the code. This motivated us to check whether a student uploaded two methods that have a high overlap.^7

The duplicate code measure was implemented as a binary feature. The code lines from all methods were tokenized and converted into bag-of-words models. Afterwards, we calculated the pairwise cosine similarity between all methods and considered two methods m_i ≠ m_j to be duplicates of each other by comparing their similarity with a threshold τ:

    DCM(m_i, m_j) := { 1  if cos(m_i, m_j) > τ
                     { 0  otherwise                      (1)

We empirically estimated τ = 0.9. A student uploaded duplicate code if DCM(m_i, m_j) = 1 for two of his or her methods m_i ≠ m_j.

F13: Usage of IDE default template text
We noticed that some students did not remove or change default IDE text content and implemented this behavior as a binary feature.

F14: Ratio of external library usage
Nowadays, developers are able to share libraries via dependency managers, which allow them to use implementations of other developers without having to write all the code from scratch. In Java, code can be grouped into packages which can be imported. This feature calculates the ratio of imports from standard Java packages to all imports.

2.2.3 Miscellaneous Features

F15: Number of empty classes
We noticed that the submitted solutions sometimes contain empty classes. This might be an indicator of how thoroughly a programmer works or how important cleaning up source code is to him or her.

F16: Ratio of unparsable solutions
This feature captures that students uploaded code that is not valid Java code. A student's solution might contain syntax errors that made it unparsable for QDOX. This is especially the case where students uploaded debug output or Python code. The feature is implemented as the ratio of parsable to unparsable solutions. It reflects how careful the students were in following instructions or in testing whether their code meets the specified requirements.

Although it might be useful to analyze code comments (e.g., the average comment length), we decided not to use features based on code comments since line and block comments may be polluted by code that was commented out.

2.3 Cross-Validation
Since most of our features are computed on a class or method basis, we need to aggregate their values into a vector representation of fixed length in order to deal with different numbers of solutions, classes, fields, methods, and parameters. To make our features more robust against outliers, we first aggregate the values per solution with a summary statistic (e.g., mean, variance, range) and then calculate their mean. Given that the choice of a summary statistic is not obvious, we decided to choose it via cross-validation on the training set.

Additionally, we noticed that the features behave differently depending on the personality trait. This encouraged us to estimate an optimal feature set for each personality trait individually. Since we have 16 features and the power set of these features contains too many combinations, it is not computationally feasible to search the entire feature space. First, we performed a cross-validation on the training set with all 16 features. Additionally, we experimented with subsets of our features and chose the subset that performed best during a 10-fold cross-validation on the training set.

3. EVALUATION
In total, 11 teams participated in the PR-SOCO shared task and submitted 48 runs.

3.1 Evaluation Metrics
Two evaluation metrics were proposed for the evaluation of the submissions. To measure the correlation between the predicted values and the gold standard values, the Pearson product-moment correlation coefficient (PC) was used. Moreover, the root mean square error (RMSE) was used to measure the average magnitude of the prediction errors. For a vector y ∈ R^n of truth values and its corresponding prediction vector ŷ ∈ R^n, the Pearson product-moment correlation and the RMSE are shown in Equations 2 and 3, respectively:

    r = Σ_{i=1..n} (y_i − ȳ)(ŷ_i − ŷ̄) / ( sqrt(Σ_{i=1..n} (y_i − ȳ)²) · sqrt(Σ_{i=1..n} (ŷ_i − ŷ̄)²) )    (2)

where ȳ and ŷ̄ denote the average values of the vectors y and ŷ, respectively, and n represents the number of data points.

    RMSE = sqrt( (1/n) Σ_{i=1..n} (y_i − ŷ_i)² )    (3)

3.2 Results
To optimize the hyperparameters (i.e., parameters that are not learned as part of the model, e.g., the summary statistics for features and the parameters that have to be set manually for the learning algorithms), we performed an exhaustive 10-fold cross-validated grid search over all hyperparameters for each personality trait individually. We used the k-nearest neighbors regressor (runs 3 and 4) and linear regression (runs 5 and 6), and optimized once to minimize the RMSE (runs 4 and 5) and once to maximize the Pearson correlation (runs 3 and 6).

^3 https://github.com/pasmod/knife
^4 https://github.com/paul-hammant/qdox
^5 https://github.com/javaparser/javaparser
^6 https://github.com/Liebeck/pisco
^7 This is not to be confused with a plagiarism check between the solutions of different students.
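Equations 2 and 3 translate directly into code. The following is a minimal, dependency-free sketch for illustration only; the shared task used its own evaluation scripts.

```python
import math

def pearson_correlation(y_true, y_pred):
    """Pearson product-moment correlation r between truth and prediction (Eq. 2)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    num = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    den = (math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
           * math.sqrt(sum((p - mean_p) ** 2 for p in y_pred)))
    return num / den

def rmse(y_true, y_pred):
    """Root mean square error (Eq. 3)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy example with trait scores in the challenge's 20-80 range:
y_true = [50, 46, 48, 52]
y_pred = [49, 47, 50, 51]
print(round(pearson_correlation(y_true, y_pred), 3))  # 0.832
print(round(rmse(y_true, y_pred), 3))                 # 1.323
```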
After observing the results of the cross-validation, we noticed that neither of the two learning algorithms could outperform the other. As a result, we decided to choose the learning algorithm for each personality trait individually, taking the one with the higher cross-validation score on the training data. This resulted in two more runs, since we once optimized for the Pearson correlation (run 1) and once for the RMSE (run 2).

The task organizers also provided two baseline approaches: a bag of character 3-grams with frequency weighting, and an approach that always predicts the mean value observed in the training data [12].

The settings of the best runs, including the selected features and the applied learning algorithm, together with their corresponding RMSE values, are summarized in Table 1. Note that the numbers listed under selected features correspond to the feature indexes introduced in Section 2.2. It is observable that the k-nearest neighbors regressor yields superior results over linear regression for all personality traits. As discussed previously, several extracted features include outliers, which can cause large residuals for linear regression. By contrast, the k-nearest neighbors regressor copes well with outliers and is therefore preferred by the grid search.

For comparison, we also provide the settings of the best runs regarding the Pearson correlation in Table 2. Similar to the case of the RMSE, the features F3, F12, F13, and F15 were identified as resulting in higher Pearson correlations. For the personality traits extroversion and agreeableness, the grid search results indicate that linear regression yields higher Pearson correlations than the k-nearest neighbors regressor.

Figure 3: Pearson's Correlation Results
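The per-trait choice between the two regressors can be sketched with scikit-learn (an assumption; the paper does not name its toolkit). For each trait, both models are scored by 10-fold cross-validation on the training features and the better one is kept; the hyperparameter grid and the scoring choice are illustrative assumptions as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def select_model_per_trait(X, y_by_trait, cv=10):
    """For each personality trait, cross-validate a k-NN regressor (over a
    small, assumed hyperparameter grid) against linear regression and keep
    the model with the better cross-validated negative-RMSE score."""
    chosen = {}
    for trait, y in y_by_trait.items():
        knn = GridSearchCV(KNeighborsRegressor(),
                           {"n_neighbors": [1, 3, 5, 7]},
                           scoring="neg_root_mean_squared_error", cv=cv)
        knn.fit(X, y)
        lr_score = cross_val_score(LinearRegression(), X, y,
                                   scoring="neg_root_mean_squared_error",
                                   cv=cv).mean()
        if knn.best_score_ >= lr_score:
            chosen[trait] = knn.best_estimator_
        else:
            chosen[trait] = LinearRegression().fit(X, y)
    return chosen

# Synthetic stand-in for the 49 training points with 16 features:
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 16))
traits = {"neuroticism": rng.normal(50, 9, 49), "openness": rng.normal(48, 7, 49)}
models = select_model_per_trait(X, traits)
print({t: type(m).__name__ for t, m in models.items()})
```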
Nevertheless, linear regression results in negative correlation coefficients for both traits. The Pearson correlations of our best runs for the individual traits can be compared to the other submissions in Figure 3.

It is also observable that the features length of field names (F3), duplicate code measure (F12), usage of IDE default template text (F13), and number of empty classes (F15) are among the most powerful predictors of personality traits.

In Figure 2, we compare our results regarding the RMSE measure to those of the other participants. Results not included between the whiskers are considered outliers and are represented by empty circles. For each personality trait, the filled circle indicates the RMSE value of our best run. For all personality traits except agreeableness, our proposed approach achieved RMSE values below the median. In particular, we achieved the lowest RMSE among all participating teams for the personality trait conscientiousness.

Figure 2: Root Mean Square Error Results

Table 1: Selected features for the best runs according to RMSE

Personality Trait   | Selected Features | Method | RMSE
Neuroticism         | 14 of 16          | k-NN   |  9.97
Extroversion        |  2 of 16          | k-NN   |  9.22
Openness            |  4 of 16          | k-NN   |  7.42
Agreeableness       | all 16            | k-NN   | 11.5
Conscientiousness   |  4 of 16          | k-NN   |  8.38

Table 2: Selected features for the best runs according to the Pearson correlation

Personality Trait   | Selected Features | Method | PC
Neuroticism         | 14 of 16          | k-NN   |  0.23
Extroversion        | 10 of 16          | LR     | -0.05
Openness            |  4 of 16          | k-NN   |  0.29
Agreeableness       | all 16            | LR     | -0.28
Conscientiousness   |  4 of 16          | k-NN   |  0.19

4. CONCLUSIONS
We presented our approach to automatically predicting personality types in the five factor model from Java source code for the PR-SOCO challenge 2016. Our architecture consists of the two components knife and pisco, which we made publicly available on GitHub. We used knife to parse the source code and pisco to extract features and to predict the personality traits.

We achieved the best root mean square error for the personality trait conscientiousness among all 11 participating teams. For the personality traits neuroticism and openness, our best runs ranked 3rd and 9th, respectively, out of 48 runs. Our RMSE result for the trait extroversion was better than the median. Unfortunately, the results in the dimension openness were not satisfactory. The results in terms of the Pearson correlation were mixed, since we achieved both positive and negative correlations.

In our future work, we want to crawl external resources in order to determine whether pieces of the source code are plagiarized. We also want to evaluate non-linear machine learning approaches. During our data analysis, we noticed that the developers sometimes used more than one natural language, for instance in comments or in variable names. We would like to investigate this behavior for possible correlations with personality types. In our work, we ignored layout features since they can easily be modified by an IDE. However, we could investigate whether a developer is consistent in using the auto-formatter of his or her IDE.

5. ACKNOWLEDGMENTS
This work was partially funded by the PhD program Online Participation, supported by the North Rhine-Westphalian funding scheme Fortschrittskollegs, by the German Federal Ministry of Economics and Technology under the ZIM program (Grant No. KF2846504), and by the IST-Hochschule University of Applied Sciences. Computational support and infrastructure were provided by the "Centre for Information and Media Technology" (ZIM) at the University of Düsseldorf (Germany).

6. REFERENCES
[1] A. Ahtiainen, S. Surakka, and M. Rahikainen. Plaggie: GNU-licensed Source Code Plagiarism Detection Engine for Java Exercises. In Proceedings of the 6th Baltic Sea Conference on Computing Education Research: Koli Calling 2006, pages 141-142. ACM, 2006.
[2] P. T. Costa and R. R. McCrae. The NEO Personality Inventory Manual. Psychological Assessment Resources, 1985.
[3] H. Ding. Extraction of Java Program Fingerprints for Software Authorship Identification. Master's thesis, Faculty of the Graduate College of the Oklahoma State University, 2002.
[4] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. De Cock. Computational personality recognition in social media. User Modeling and User-Adapted Interaction, 26(2):109-142, 2016.
[5] G. Frantzeskou and S. Gritzalis. Source Code Authorship Analysis for Supporting the Cybercrime Investigation Process. In ICETE 2004, 1st International Conference on E-Business and Telecommunication Networks, pages 85-92, 2004.
[6] T. Fritz, J. Ou, G. C. Murphy, and E. Murphy-Hill. A Degree-of-Knowledge Model to Capture Source Code Familiarity. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE '10, pages 385-394. ACM, 2010.
[7] J. Golbeck, C. Robles, M. Edmondson, and K. Turner. Predicting Personality from Twitter. In SocialCom/PASSAT, pages 149-156. IEEE, 2011.
[8] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore. Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research, 30(1):457-500, 2007.
[9] T. J. McCabe. A Complexity Measure. IEEE Transactions on Software Engineering, 2(4):308-320, 1976.
[10] P. Modaresi, M. Liebeck, and S. Conrad. Exploring the Effects of Cross-Genre Machine Learning for Author Profiling in PAN 2016. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, pages 970-977, 2016.
[11] J. Oberlander and S. Nowson. Whose thumb is it anyway? Classifying author personality from weblog text. In Proceedings of the COLING/ACL Main Conference Poster Sessions, COLING-ACL '06, pages 627-634. Association for Computational Linguistics, 2006.
[12] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.