Pisco: A Computational Approach to Predict Personality Types from Java Source Code

Matthias Liebeck, Pashutan Modaresi, Alexander Askinadze, Stefan Conrad
Institute of Computer Science, Heinrich Heine University Düsseldorf
D-40225 Düsseldorf, Germany
{liebeck, modaresi, askinadze, conrad}@cs.uni-duesseldorf.de

ABSTRACT
We developed an approach to automatically predict the personality traits of Java developers based on their source code for the PR-SOCO challenge 2016. The challenge provides a data set consisting of source code with the associated developers' personality traits (neuroticism, extraversion, openness, agreeableness, and conscientiousness). Our approach adapts features from the authorship identification domain and utilizes features that were specifically engineered for the PR-SOCO challenge. We experiment with two learning methods: linear regression and the k-nearest neighbors regressor. The results are reported in terms of the Pearson product-moment correlation and the root mean square error.

CCS Concepts
• Computing methodologies → Artificial intelligence; Natural language processing

Keywords
Computational personality recognition; five factor model; Java source code

1. INTRODUCTION
Author profiling is a research field that deals with the prediction of user properties (e.g., age and gender prediction of an author [10]). The subfield computational personality recognition refers to an interdisciplinary field that incorporates computer science and psychology to automatically infer an author's personality based on his or her generated contents [4]. Although the generated contents can be of any form, we focus on textual contents in this work.

A popular personality model used in computational personality recognition is the five factor model [2]. According to this model, five fundamental traits make up the human personality, each consisting of several facets: neuroticism (anxiety, depression, angry hostility), extraversion (warmth, positive emotions, activity), openness (fantasy, aesthetics, values), agreeableness (trust, straightforwardness, compliance), and conscientiousness (competence, order, dutifulness).

Computational personality recognition has been applied to various domains, such as essays [8], tweets [7], and blogs [11]. An interesting but less studied application is the personality prediction of software developers based on their written source code. Unlike blogs and tweets, which are (mostly) written in natural languages, source code is written in a programming language that might not explicitly reveal the author's personality.

The study of software developers' source code has many practical applications, for instance, in the education sector for detecting plagiarism [1], in the law sector for cybercrime investigation [5], and in the technology sector for identifying the expertise level of programmers [6]. To the best of our knowledge, there have been no studies on the automatic prediction of software developers' personalities based on their source code. A tool capable of predicting the personality of a software developer based on his or her open source projects (GitHub^1, Bitbucket^2, etc.) could dramatically improve the recruitment process of companies: software development requires teamwork, and deciding whether a programmer's personality fits the team is crucial for companies.

In this paper, we introduce a machine learning approach developed in the scope of the PR-SOCO [12] shared task to automatically identify the personality type of a Java developer based on his or her source code. Participants were provided with a training set consisting of Java source code of programmers annotated with the five previously discussed personality traits, and with a test set. The aim of the PR-SOCO task is the development of approaches that predict the personality traits of the programmers in the test set.

We investigated two classes of features: structure features that depend on the programming experience of the programmer (architecture design, code complexity, etc.) and style features related to the code layout that cannot easily be changed by IDEs (comment length, variable length, etc.). We intentionally ignored layout features (line length, formatting style, etc.) since these can easily be modified by IDEs using available formatting and code cleaning functionalities [3].

The remainder of the paper is structured as follows: Section 2 describes the PR-SOCO challenge and our contribution to solving it. The results of our approach are described in Section 3. We conclude and outline future work in Section 4.

^1 https://github.com/
^2 https://bitbucket.org/

2. APPROACH
In order to process the students' Java source code, we first created knife^3, an open-source wrapper for the two Java parsers QDOX^4 and JavaParser^5. Knife parses source code into classes, methods, parameters, and variables and uses the Spark micro framework to provide the parsed code as JSON. Afterwards, pisco^6 consumes the parsed source code, extracts features, and uses machine learning to predict personality traits with linear regression and the k-nearest neighbors regressor.

2.1 Data
The data for the PR-SOCO challenge comprises solutions for different Java programming tasks that were uploaded by students, together with the results of their personality tests. Each of the five personality traits is represented by a value between 20 and 80. The students were allowed to upload more than one solution per programming task and to reuse code from previous exercises or from external resources. The training set comprises 49 data points and the test set contains 21 data points. With such a small amount of data, it can be difficult to train classifiers and to avoid outliers.

Figure 1 shows a boxplot of the personality traits in the training set. It can be observed that the median personality scores lie between 46 and 50.

Figure 1: Distribution of personality traits in the training set

The data was not cleaned by the organizers and, therefore, its quality varied. It sometimes contained debug output, empty classes, syntax errors, or even Python code. Another influencing factor is that students occasionally used external code that was copied into the project, e.g., code from programming lectures at other universities. Since the focus of this challenge is the prediction of the students' personality types, a proper filtering step for external code seems reasonable. Otherwise, the prediction of a student's personality type can be influenced by other coders' personality types. Unfortunately, we were not able to perform a plagiarism check via web search.

2.2 Implemented Features
With the source code parsed by knife, we are able to implement several style and structure features for our machine learning approach.

2.2.1 Style Features
While naming conventions are certainly a controversial topic of debate among software developers (who each have their own programming style), we believe that the naming of classes, methods, fields, and local variables is important for the understanding of code. For instance, overly short or overly long variable names can be difficult to understand. Therefore, the length of such names might correlate with how thoughtful a developer was while writing source code. We decided to use the following style features:

F1: Length of method names

F2: Length of method parameter names

F3: Length of field names

F4: Length of local variable names in methods

An interesting observation is that the training data contains a solution from one student who used a local variable name that is 75 characters long, while the mean length of local variable names over all students is 4.02 (σ = 3.89). Such an outlier can be problematic for linear regression.

2.2.2 Structure Features
We investigated ten structure features that we consider to be related to the developer's programming experience. A more experienced developer might tend to write shorter methods with fewer lines of code, or less code in general.

F5: Number of classes

F6: Cyclomatic complexity
The cyclomatic complexity [9] is a software metric that calculates the number of linearly independent paths in a program's control flow. We calculate the cyclomatic complexity per method by starting with an initial value of 1, which is increased for each occurrence of a control-flow-modifying keyword, such as if or for.
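As an illustration, the per-method computation described for F6 can be sketched as a simple keyword count over a method body. This is a minimal sketch, not the paper's actual implementation: the exact token set (here if, for, while, case, catch, the ternary operator, and the short-circuit operators) is an assumption, since the paper only names if and for.

```python
import re

# Control-flow tokens that each add one linearly independent path.
# The exact token set is an assumption; the paper only names "if" and "for".
# Note: this naive count would also fire on keywords inside strings or comments.
_CONTROL_FLOW = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")

def cyclomatic_complexity(method_body: str) -> int:
    """Approximate McCabe complexity of one Java method body:
    start at 1 and add 1 per control-flow token."""
    return 1 + len(_CONTROL_FLOW.findall(method_body))

body = """
if (x > 0) {
    for (int i = 0; i < x; i++) {
        if (i % 2 == 0) { total += i; }
    }
}
"""
print(cyclomatic_complexity(body))  # 1 + two ifs + one for = 4
```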
F7: Number of methods

F8: Number of method parameters

F9: Length of methods
We included the length of methods in our feature set since long methods can be an indicator that the single responsibility principle is violated and that the method could be refactored into multiple smaller methods. In our experiments, we tested the length of methods both in terms of the number of lines and in terms of characters (without indentation).

F10: Number of fields per class

F11: Number of local variables in methods

F12: Duplicate code measure
We noticed that some students uploaded multiple solutions with very similar looking code. They copied and pasted methods from one class to another while making small changes to the code. This motivated us to check whether a student uploaded two methods that have a high overlap.^7

The duplicate code measure was implemented as a binary feature. The code lines from all methods were tokenized and converted into bag-of-words models. Afterwards, we calculated the pairwise cosine similarity between all methods and considered two methods m_i ≠ m_j to be duplicates of each other by comparing their similarity with a threshold τ:

    DCM(m_i, m_j) := { 1  if cos(m_i, m_j) > τ
                     { 0  otherwise                      (1)

We empirically estimated τ = 0.9. A student uploaded duplicate code if DCM(m_i, m_j) = 1 for two of his or her methods m_i ≠ m_j.

F13: Usage of IDE default template text
We noticed that some students did not remove or change default IDE text content and implemented this behavior as a binary feature.

F14: Ratio of external library usage
Nowadays, developers are able to share libraries via dependency managers, which allow them to use implementations of other developers without having to write all the code from scratch. In Java, code can be grouped into packages which can be imported. This feature calculates the ratio of imports from standard Java packages to all imports.

2.2.3 Miscellaneous Features

F15: Number of empty classes
We noticed that the submitted solutions sometimes contain empty classes. This might be an indicator of how thoroughly a programmer works or how important cleaning up source code is to him or her.

F16: Ratio of unparsable solutions
This feature captures that students uploaded code that is not valid Java code. A student's solution might contain syntax errors that made it unparsable for QDOX. This is especially the case where students uploaded debug output or Python code. The feature is implemented as the ratio of parsable to unparsable solutions. It reflects how careful the students were in following instructions or in testing whether their code meets the specified requirements.

Although it might be useful to analyze code comments (e.g., the average comment length), we decided not to use features based on code comments since line and block comments may be polluted by code that was commented out.

2.3 Cross-Validation
Since most of our features are computed on a class or method basis, we need to aggregate their values into a vector representation of fixed length in order to deal with different numbers of solutions, classes, fields, methods, and parameters. To make our features more robust against outliers, we first aggregate the values per solution with a summary statistic (e.g., mean, variance, range) and then calculate their mean. Given that the choice of a summary statistic is not obvious, we decided to choose it via cross-validation on the training set.

Additionally, we noticed that the features behave differently depending on the personality trait. This encouraged us to estimate an optimal feature set for each personality trait individually. Since we have 16 features and the power set of these features contains too many combinations, it is not computationally feasible to search the entire feature space. First, we performed a cross-validation on the training set with all 16 features. Additionally, we experimented with subsets of our features and chose the subset that performed best during a 10-fold cross-validation on the training set.

3. EVALUATION
In total, 11 teams participated in the PR-SOCO shared task and submitted 48 runs.

3.1 Evaluation Metrics
Two evaluation metrics were proposed for the evaluation of the submissions. To measure the correlation between the predicted values and the gold standard values, the Pearson product-moment correlation coefficient (PC) was used. Moreover, the root mean square error (RMSE) was used to measure the average magnitude of the prediction errors. For a vector y ∈ R^n of truth values and its corresponding prediction vector ŷ ∈ R^n, the Pearson product-moment correlation and the RMSE are shown in Equations 2 and 3, respectively:

    r = Σ_{i=1..n} (y_i − ȳ)(ŷ_i − ŷ̄) / ( sqrt(Σ_{i=1..n} (y_i − ȳ)²) · sqrt(Σ_{i=1..n} (ŷ_i − ŷ̄)²) )    (2)

where ȳ and ŷ̄ denote the average values of the vectors y and ŷ, respectively, and n represents the number of data points.

    RMSE = sqrt( (1/n) Σ_{i=1..n} (y_i − ŷ_i)² )    (3)

3.2 Results
To optimize the hyperparameters (i.e., parameters that are not learned as part of the model, e.g., the summary statistics for features and the parameters that have to be set manually for the learning algorithms), we performed an exhaustive 10-fold cross-validated grid search over all hyperparameters for each personality trait individually. We used the k-nearest neighbors regressor (runs 3 and 4) and linear regression (runs 5 and 6), and optimized once to minimize the RMSE (runs 4 and 5) and once to maximize the Pearson correlation (runs 3 and 6).

^3 https://github.com/pasmod/knife
^4 https://github.com/paul-hammant/qdox
^5 https://github.com/javaparser/javaparser
^6 https://github.com/Liebeck/pisco
^7 This is not to be confused with a plagiarism check between the solutions of different students.
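Equations 2 and 3 translate directly into code. The following is a minimal, dependency-free sketch for illustration only; the shared task used its own evaluation scripts.

```python
import math

def pearson_correlation(y_true, y_pred):
    """Pearson product-moment correlation r between truth and prediction (Eq. 2)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    num = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    den = (math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
           * math.sqrt(sum((p - mean_p) ** 2 for p in y_pred)))
    return num / den

def rmse(y_true, y_pred):
    """Root mean square error (Eq. 3)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Toy example with trait scores in the challenge's 20-80 range:
y_true = [50, 46, 48, 52]
y_pred = [49, 47, 50, 51]
print(round(pearson_correlation(y_true, y_pred), 3))  # 0.832
print(round(rmse(y_true, y_pred), 3))                 # 1.323
```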
After observing the results of the cross-validation, we noticed that neither of the two learning algorithms could outperform the other. As a result, we decided to choose the learning algorithm for each personality trait individually, taking the one with the higher cross-validation score on the training data. This resulted in two more runs, since we once optimized for the Pearson correlation (run 1) and once for the RMSE (run 2).

The task organizers also provided two baseline approaches: a bag of character 3-grams with frequency weighting, and an approach that always predicts the mean value observed in the training data [12].

The settings of the best runs, including the selected features and the applied learning algorithm, together with their corresponding RMSE values, are summarized in Table 1. Note that the numbers listed under selected features correspond to the feature indexes introduced in Section 2.2. It is observable that the k-nearest neighbors regressor yields superior results over linear regression for all personality traits. As discussed previously, several extracted features include outliers, which can cause large residuals for linear regression. By contrast, the k-nearest neighbors regressor copes well with outliers and is therefore preferred by the grid search.

For comparison, we also provide the settings of the best runs regarding the Pearson correlation in Table 2. Similar to the case of the RMSE, the features F3, F12, F13, and F15 were identified as resulting in higher Pearson correlations. For the personality traits extroversion and agreeableness, the grid search results indicate that linear regression yields higher Pearson correlations than the k-nearest neighbors regressor.

Figure 3: Pearson's Correlation Results
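The per-trait choice between the two regressors can be sketched with scikit-learn (an assumption; the paper does not name its toolkit). For each trait, both models are scored by 10-fold cross-validation on the training features and the better one is kept; the hyperparameter grid and the scoring choice are illustrative assumptions as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def select_model_per_trait(X, y_by_trait, cv=10):
    """For each personality trait, cross-validate a k-NN regressor (over a
    small, assumed hyperparameter grid) against linear regression and keep
    the model with the better cross-validated negative-RMSE score."""
    chosen = {}
    for trait, y in y_by_trait.items():
        knn = GridSearchCV(KNeighborsRegressor(),
                           {"n_neighbors": [1, 3, 5, 7]},
                           scoring="neg_root_mean_squared_error", cv=cv)
        knn.fit(X, y)
        lr_score = cross_val_score(LinearRegression(), X, y,
                                   scoring="neg_root_mean_squared_error",
                                   cv=cv).mean()
        if knn.best_score_ >= lr_score:
            chosen[trait] = knn.best_estimator_
        else:
            chosen[trait] = LinearRegression().fit(X, y)
    return chosen

# Synthetic stand-in for the 49 training points with 16 features:
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 16))
traits = {"neuroticism": rng.normal(50, 9, 49), "openness": rng.normal(48, 7, 49)}
models = select_model_per_trait(X, traits)
print({t: type(m).__name__ for t, m in models.items()})
```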
Nevertheless, linear regression results in negative correlation coefficients for both traits. The Pearson correlations of our best runs for the individual traits can be compared to the other submissions in Figure 3.

It is also observable that the features length of field names (F3), duplicate code measure (F12), usage of IDE default template text (F13), and number of empty classes (F15) are among the most powerful predictors of personality traits.

In Figure 2, we compare our results regarding the RMSE measure to those of the other participants. Results not included between the whiskers are considered outliers and are represented by empty circles. For each personality trait, the filled circle indicates the RMSE value of our best run. For all personality traits except agreeableness, our proposed approach achieved RMSE values below the median. In particular, we achieved the lowest RMSE among all participating teams for the personality trait conscientiousness.

Figure 2: Root Mean Square Error Results

Table 1: Selected features for the best runs according to RMSE

Personality Trait   | Selected Features | Method | RMSE
Neuroticism         | 14 of 16          | k-NN   |  9.97
Extroversion        |  2 of 16          | k-NN   |  9.22
Openness            |  4 of 16          | k-NN   |  7.42
Agreeableness       | all 16            | k-NN   | 11.5
Conscientiousness   |  4 of 16          | k-NN   |  8.38

Table 2: Selected features for the best runs according to the Pearson correlation

Personality Trait   | Selected Features | Method | PC
Neuroticism         | 14 of 16          | k-NN   |  0.23
Extroversion        | 10 of 16          | LR     | -0.05
Openness            |  4 of 16          | k-NN   |  0.29
Agreeableness       | all 16            | LR     | -0.28
Conscientiousness   |  4 of 16          | k-NN   |  0.19

4. CONCLUSIONS
We presented our approach to automatically predicting personality types in the five factor model from Java source code for the PR-SOCO challenge 2016. Our architecture consists of the two components knife and pisco, which we made publicly available on GitHub. We used knife to parse the source code and pisco to extract features and to predict the personality traits.

We achieved the best root mean square error for the personality trait conscientiousness among all 11 participating teams. For the personality traits neuroticism and openness, our best runs ranked 3rd and 9th, respectively, out of 48 runs. Our RMSE result for the trait extroversion was better than the median. Unfortunately, the results in the dimension openness were not satisfactory. The results in terms of the Pearson correlation were mixed, since we achieved both positive and negative correlations.

In our future work, we want to crawl external resources in order to determine whether pieces of the source code are plagiarized. We also want to evaluate non-linear machine learning approaches. During our data analysis, we noticed that the developers sometimes used more than one natural language, for instance in comments or in variable names. We would like to investigate this behavior for possible correlations with personality types. In our work, we ignored layout features since they can easily be modified by an IDE. However, we could investigate whether a developer is consistent in using the auto-formatter of his or her IDE.

5. ACKNOWLEDGMENTS
This work was partially funded by the PhD program Online Participation, supported by the North Rhine-Westphalian funding scheme Fortschrittskollegs, by the German Federal Ministry of Economics and Technology under the ZIM program (Grant No. KF2846504), and by the IST-Hochschule University of Applied Sciences. Computational support and infrastructure were provided by the "Centre for Information and Media Technology" (ZIM) at the University of Düsseldorf (Germany).

6. REFERENCES
[1] A. Ahtiainen, S. Surakka, and M. Rahikainen. Plaggie: GNU-licensed Source Code Plagiarism Detection Engine for Java Exercises. In Proceedings of the 6th Baltic Sea Conference on Computing Education Research: Koli Calling 2006, pages 141-142. ACM, 2006.
[2] P. T. Costa and R. R. McCrae. The NEO Personality Inventory Manual. Psychological Assessment Resources, 1985.
[3] H. Ding. Extraction of Java Program Fingerprints for Software Authorship Identification. Master's thesis, Faculty of the Graduate College of the Oklahoma State University, 2002.
[4] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. De Cock. Computational personality recognition in social media. User Modeling and User-Adapted Interaction, 26(2):109-142, 2016.
[5] G. Frantzeskou and S. Gritzalis. Source Code Authorship Analysis for Supporting the Cybercrime Investigation Process. In ICETE 2004, 1st International Conference on E-Business and Telecommunication Networks, pages 85-92, 2004.
[6] T. Fritz, J. Ou, G. C. Murphy, and E. Murphy-Hill. A Degree-of-Knowledge Model to Capture Source Code Familiarity. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE '10, pages 385-394. ACM, 2010.
[7] J. Golbeck, C. Robles, M. Edmondson, and K. Turner. Predicting Personality from Twitter. In SocialCom/PASSAT, pages 149-156. IEEE, 2011.
[8] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore. Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text. Journal of Artificial Intelligence Research, 30(1):457-500, 2007.
[9] T. J. McCabe. A Complexity Measure. IEEE Transactions on Software Engineering, 2(4):308-320, 1976.
[10] P. Modaresi, M. Liebeck, and S. Conrad. Exploring the Effects of Cross-Genre Machine Learning for Author Profiling in PAN 2016. In Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, pages 970-977, 2016.
[11] J. Oberlander and S. Nowson. Whose thumb is it anyway? Classifying author personality from weblog text. In Proceedings of the COLING/ACL Main Conference Poster Sessions, COLING-ACL '06, pages 627-634. Association for Computational Linguistics, 2006.
[12] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.