PRHLT at PR-SOCO: A Regression Model for Predicting Personality Traits from Source Code
Notebook for PR-SOCO at FIRE 2016

Maite Giménez                                      Roberto Paredes
mgimenez@dsic.upv.es                               rparedes@dsic.upv.es

Pattern Recognition and Human Language Technology (PRHLT) Research Center
Universitat Politècnica de València
Camino de Vera s/n, 46022 Valencia, Spain

ABSTRACT
This paper describes our participation in the PAN@FIRE Personality Recognition in Source Code (PR-SOCO) 2016 shared task. We proposed two different approaches to tackle this task: on the one hand, each code sample from each author was taken as an independent sample and vectorized using word n-grams; on the other hand, all the code from an author was taken as a single sample and vectorized using word n-grams together with hand-crafted features that may reveal the personality traits of an author. Regardless of the approach, a regression model was trained to predict the personality traits of the author of a sample of source code. All the systems we submitted for evaluation achieved a root mean square error (RMSE) below the mean RMSE of the participants in the shared task. Moreover, the run that included the hand-crafted features obtained the best result for the personality trait Agreeableness. This suggests that, in the absence of enough independent samples to train a machine learning system, hand-crafted features are able to obtain better results.

Keywords
PR-SOCO; Author Profiling; Personality Recognition; Source Code; Natural Language Processing; Machine Learning; Regression

1. INTRODUCTION
One of the emerging research areas in Natural Language Processing (NLP) is Personality Recognition (PR), which seeks to classify the personality traits of the author of a text. In psychology, Norman (1963) [11] proposed a taxonomy for describing personality along five dimensions known as the "Big Five": agreeableness, conscientiousness, extroversion, openness to experience, and emotional stability. This work also determined that our personality traits have a strong influence on our individual behavior. The work carried out by Gill (2003) [8] outlines how personality is projected through language. Therefore, by exploiting different kinds of NLP techniques, it is possible to infer the personality of the author of a text. In addition, Personality Recognition can be useful in various applications such as marketing, sociology, etc. [6, 7, 15, 18]. Personality can also be inferred from texts extracted from different sources: social media, essays, blog posts, etc. [1, 2, 14]. Finally, it is noteworthy that previous studies [12] have already proven the impact of personality traits on the behavior of developers in the FLOSS (Free/Libre Open Source Software, https://www.gnu.org/philosophy/floss-and-foss.en.html) community.

Previously, there have been efforts to evaluate Personality Recognition systems in several shared tasks, using texts gathered from Twitter [17], YouTube vlogs, and mobile phone interactions [4]. However, the Personality Recognition in Source Code (PR-SOCO) shared task was the first competition whose objective was to determine the personality of developers from the source code they wrote, laying the groundwork for a fair comparison between different approaches and future work.

In this paper we describe our participation in the PR-SOCO task. The rest of the paper is organized as follows. Section 2 defines the Personality Recognition task, and Section 3 describes the data. In Section 4 the proposed models are described. In Section 5 the results achieved are presented. Finally, in Section 6 our results are discussed and future work is proposed.

2. TASK DEFINITION
The main objective proposed by the organizers of the PR-SOCO shared task was to predict the personality traits of developers given a collection of their source code. The personality of a developer was determined following the Five Factor Theory, or Big Five [5, 11, 3], which is the most widely accepted in psychology. Therefore, five traits define the personality of an author. Those traits are: agreeableness (A), conscientiousness (C), extroversion (E), openness to experience (O), and emotional stability / neuroticism (N). Each trait was labeled within a range between 20 and 80. The models were evaluated by the organizers using two metrics: the average Root Mean Squared Error (RMSE) and the Pearson Product-Moment Correlation (PC). For further information about the task, please review the task overview paper [16].
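For concreteness, the sketch below shows how these two metrics can be computed for a single trait. It is an illustrative reimplementation, not the organizers' evaluation script, and the gold and predicted scores are hypothetical.

    # Illustrative computation of the two official metrics for one trait;
    # not the organizers' evaluation script. Trait scores lie between 20 and 80.
    import numpy as np
    from scipy.stats import pearsonr

    def rmse(gold, predicted):
        """Root Mean Squared Error between gold and predicted trait scores."""
        gold = np.asarray(gold, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return np.sqrt(np.mean((gold - predicted) ** 2))

    gold = [45.0, 52.5, 60.0, 38.0]   # hypothetical gold trait scores
    pred = [47.0, 50.0, 55.0, 41.0]   # hypothetical system predictions

    print(rmse(gold, pred))           # lower is better
    print(pearsonr(gold, pred)[0])    # correlation with the gold standard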
3. DATA
The organizers gathered samples of source code from 70 different programmers: 49 samples were provided to train the participants' models, and 21 were held out to validate the results. Each sample consists of a collection of source code written in Java. Table 1 shows the total number of training and test samples.

Table 1: Dataset distribution

  Dataset   Source Code   Authors
  Train     1,741         49
  Test      751           21

We studied the distribution of the number of samples available for each value of each trait, depending on whether we considered the code samples as independent (number of pieces of source code) or not (number of authors). Figures 1 and 2 show the number of samples available for the trait Agreeableness; the rest of the traits presented an equivalent distribution of training samples. It should be noted that the number of authors, and therefore the number of training samples available, might be insufficient to adequately adjust the parameters of a machine learning system. If we consider each sample of code as an independent training sample, we have more training samples available, which might be useful for fighting the curse of dimensionality [9]. This led us to the two approaches described in Section 4.

[Figure 1: Number of authors for each value of Agreeableness to classify (Author-Based approach).]

[Figure 2: Number of code samples for each value of Agreeableness to classify (Code-Based approach).]

It is noteworthy that we did not exploit any external dataset or resource to train or fine-tune our models.

4. SYSTEM DESCRIPTION
Given that the number of data samples available for training machine learning models is crucial, two approaches were evaluated: an Author-Based (AB) approach and a Code-Based (CB) approach.

The Author-Based approach uses all the samples of code from an author and includes hand-crafted features in addition to the word n-grams. The features considered were: the number of samples of code that implement the same class (hf1), the number of allocations (hf2), the number of loops (hf3), the appearance of pieces of code suspicious of plagiarism (hf4; we considered suspicious those samples of code that instantiate classes that do not belong to the standard library, e.g. the class SeparateChainingHashTable), the number of imports (hf5), the number of functions (hf6), the number of exceptions handled (hf7), the number of classes developed (hf8), the number of different classes developed (hf9), the number of comment lines (hf10), and the number of prints (hf11).
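As an illustration, the sketch below approximates a few of these counts with regular expressions. The exact extraction rules used for the submitted runs are not reproduced here, so the helper handcrafted_features and its patterns should be read as hypothetical simplifications.

    # Hypothetical approximations of some hand-crafted features (hf3: loops,
    # hf5: imports, hf6: functions, hf11: prints); the exact rules used for
    # the submitted runs may differ, e.g. in handling comments and strings.
    import re

    def handcrafted_features(java_source):
        return {
            "hf3_loops": len(re.findall(r"\b(?:for|while|do)\b", java_source)),
            "hf5_imports": len(re.findall(r"^\s*import\s", java_source, re.MULTILINE)),
            "hf6_functions": len(re.findall(
                r"\b(?:public|private|protected|static)\s+[\w<>\[\]]+\s+\w+\s*\(",
                java_source)),
            "hf11_prints": java_source.count("System.out.print"),
        }

    sample = ('import java.util.List;\n'
              'public class A { public void f() { for (;;) System.out.println("x"); } }')
    print(handcrafted_features(sample))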
The Code-Based approach assumed independence between the samples. This naïve assumption allowed us to train with 1,741 samples. The CB approach relies solely on the n-grams found in each piece of code, without considering any kind of aggregated information from each author. It generates a prediction for each sample of source code; therefore, the final prediction for an author is the mean of all the predictions obtained for the pieces of code that this author wrote.

As text representation, several vectorization methods were evaluated for each approach: the term frequency-inverse document frequency (tf-idf) over word n-grams from one to four words (tfidf-words); tf-idf over word n-grams from one to four words, ignoring the terms with a document frequency strictly higher than the threshold 0.5 and applying sub-linear scaling (sublinear-1:4); the same but exploring n-grams from one to six words (sublinear-1:6); tf-idf over character n-grams from one to six characters (tfidf-chars); and a bag of words (BOW). We also carried out a preprocessing phase in which code snippets (e.g. a sequence of words that defines a loop) were replaced by tokens. However, the systems that included this phase obtained worse results than those without preprocessing. This phenomenon was previously reported in the author profiling literature [1, 10]. Our results confirm that such a preprocessing phase also has a negative impact on personality recognition from source code.
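Since our pipeline was built with scikit-learn [13], the sublinear-1:4 representation and the per-author aggregation of the CB predictions can be sketched as follows. This is a minimal sketch under the parameters described above, not our exact implementation; the toy codes, authors, and scores are hypothetical.

    # Minimal sketch of the sublinear-1:4 representation and of the Code-Based
    # aggregation: one prediction per code sample, then the mean per author.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge

    # Word n-grams of size 1 to 4, dropping terms whose document frequency
    # exceeds 0.5, with sub-linear tf scaling.
    vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 4),
                                 max_df=0.5, sublinear_tf=True)

    codes = ["for (int i = 0; i < n; i++) { sum += i; }",
             "import java.util.Map;",
             "while (true) { break; }"]
    authors = np.array(["a1", "a1", "a2"])   # author of each code sample
    scores = [45.0, 45.0, 60.0]              # trait score of each sample's author

    X = vectorizer.fit_transform(codes)
    model = Ridge(alpha=0.5).fit(X, scores)

    # The final prediction for an author is the mean of its per-sample predictions.
    per_sample = model.predict(vectorizer.transform(codes))
    for author in np.unique(authors):
        print(author, per_sample[authors == author].mean())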
Moreover, both approaches used a regression model to predict the traits of the authors automatically. The machine learning algorithms considered were: an Epsilon-Support Vector Regression (SVR) model, a Linear Regression (LR) model, a linear least squares model with l2 regularization and α = 0.5 (Ridge), a linear model trained with an L1 prior as regularizer and α = 0.5 (Lasso), a Multi-layer Perceptron (MLP), a Decision Tree Regressor (DTR), and a Random Forest Regressor (RFR). The task was also evaluated as a classification problem using Support Vector Machines and Random Forests. Nevertheless, the classification approach behaved worse than the regression approach, so it was discarded.

We developed a pipeline using scikit-learn [13]. In the CB approach, we selected the best combination of n-grams and regression model using a 5-fold cross-validation. The selection of the models was a compromise solution: we selected those models that achieved the best global RMSE, computed as the mean over the traits of the mean RMSE over the five folds:

    RMSE_global = (1/5) Σ_{trait ∈ {A,C,E,N,O}} (1/5) Σ_{fold=1}^{5} RMSE_{trait,fold}

This allowed us to obtain models with a competitive performance for all traits as measured by the RMSE. Our systems were optimized only for the RMSE, which might affect their performance under the Pearson Correlation, since a low RMSE does not imply a high Pearson Correlation. Conversely, in the AB approach, the best hand-crafted feature combination was selected by applying an ablation test, and these features were concatenated to the word n-grams of the best model obtained for the CB approach.
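The selection criterion can be sketched with scikit-learn as follows. The helper global_rmse is ours, written to mirror the formula above; the feature matrix and trait scores in the usage example are synthetic placeholders, not the PR-SOCO data.

    # Sketch of the model-selection criterion: the global RMSE is the mean,
    # over the five traits, of the mean per-fold RMSE in a 5-fold validation.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def global_rmse(model, X, scores_per_trait, folds=5):
        """scores_per_trait maps a trait name to the gold scores of the authors."""
        per_trait = []
        for trait, y in scores_per_trait.items():
            mse = -cross_val_score(model, X, y, cv=folds,
                                   scoring="neg_mean_squared_error")
            per_trait.append(np.sqrt(mse).mean())   # mean RMSE over the folds
        return np.mean(per_trait)                   # mean over the five traits

    # Synthetic stand-in for the vectorized training data (49 authors).
    rng = np.random.RandomState(0)
    X = rng.rand(49, 20)
    traits = {t: 20 + 60 * rng.rand(49) for t in "ACENO"}

    print(global_rmse(Ridge(alpha=0.5), X, traits))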
5. RESULTS
Hereafter, we describe the results achieved by our best models. Table 2 shows the RMSE of our best models at development time. Due to the computational complexity of performing the grid search over two metrics, we used only the RMSE to adjust our models.

Table 2: RMSE achieved using a 5-fold validation over the train dataset following the Code-Based approach. The mean RMSE and the standard deviation over the five folds are reported for each trait.

  Model                  Agreeableness  Conscientiousness  Extroversion  Neuroticism   Openness
  sublinear-1:6 & Ridge  6.10 (±0.67)   4.81 (±0.41)       5.55 (±0.89)  8.30 (±0.95)  4.93 (±0.52)
  sublinear-1:4 & Ridge  6.08 (±0.65)   4.82 (±0.44)       5.53 (±0.87)  8.26 (±0.94)  4.95 (±0.55)
  sublinear-1:6 & LR     6.11 (±0.85)   4.79 (±0.47)       5.94 (±0.89)  8.54 (±1.02)  4.85 (±0.43)
  sublinear-1:4 & LR     6.07 (±0.81)   4.83 (±0.47)       5.89 (±0.84)  8.49 (±1.01)  4.91 (±0.44)
  sublinear-1:6 & RFR    6.10 (±0.67)   5.00 (±0.72)       5.55 (±0.89)  8.30 (±0.95)  4.93 (±0.52)

After selecting the best model for the Code-Based approach, we selected the hand-crafted features that improved the predictions in the Author-Based approach. The hand-crafted features selected were: the number of samples of code that implement the same class (hf1), the appearance of pieces of code suspicious of plagiarism (hf4), the number of classes developed (hf8), and the number of different classes developed (hf9).

We submitted the five models that performed best during the development phase:

1. run 1: a Code-Based approach using sublinear-1:4 and Ridge.
2. run 2: a Code-Based approach using sublinear-1:6 and Ridge.
3. run 3: an Author-Based approach using sublinear-1:4, the hand-crafted features hf1 ⊕ hf4 ⊕ hf8 ⊕ hf9 ⊕ hf10, and Ridge.
4. run 4: a Code-Based approach using sublinear-1:4 and Linear Regression.
5. run 5: a Code-Based approach using sublinear-1:6 and Linear Regression.

Two baselines were provided by the organizers: a bag of words of 3-grams with frequency weighting (bow), and an approach that always predicts the mean value observed in the training data (mean). The evaluation results for each personality trait over the test set can be found in Table 3. Eleven teams presented their respective systems; in total, 48 systems were submitted for evaluation. All the systems we submitted performed better than the mean of the proposed systems in terms of RMSE.

Table 3: Evaluation of our participation in the PR-SOCO shared task. The first five rows, run 1 up to run 5, show the results achieved by our systems. The traits are: agreeableness (A), conscientiousness (C), extroversion (E), neuroticism (N), and openness to experience (O). The performance of the baseline systems is included, as well as the minimum, maximum, and mean performance obtained by the participants in the shared task.

(a) RMSE achieved on the test dataset

  Model           A       C       E       N       O
  (CB) run 1      9.29    9.02    8.75    10.67   7.85
  (CB) run 2      9.36    8.99    8.79    10.46   7.67
  (AB) run 3      8.79    8.69    9.00    10.22   7.57
  (CB) run 4      9.62    8.86    8.69    10.73   7.81
  (CB) run 5      9.71    8.89    8.65    10.65   7.79
  baseline bow    9.00    8.47    9.06    10.29   7.74
  baseline mean   9.04    8.54    9.06    10.26   7.57
  min             8.79    8.38    8.60    9.78    6.95
  max             28.63   22.36   28.80   29.44   33.53
  mean            9.72    10.74   12.27   12.75   10.49

(b) Pearson Correlation achieved on the test dataset

  Model           A       C       E       N       O
  (CB) run 1      0.03    -0.23   0.31    -0.22   -0.12
  (CB) run 2      0.00    -0.19   0.28    -0.07   0.05
  (AB) run 3      0.33    -0.12   0.18    0.09    0.03
  (CB) run 4      -0.03   -0.09   0.28    -0.15   -0.05
  (CB) run 5      -0.06   -0.12   0.30    -0.16   -0.02
  baseline bow    0.20    0.17    0.12    0.06    -0.17
  baseline mean   0.00    0.00    0.00    0.00    0.00
  min             -0.32   -0.31   -0.37   -0.29   -0.36
  max             0.38    0.33    0.47    0.36    0.62
  mean            -0.01   -0.01   0.06    0.04    0.09

Despite the results achieved during the development phase, our best performing system was the one that followed the Author-Based approach. This system achieved the best RMSE among all participants for the personality trait Agreeableness. Nevertheless, our systems' predictions did not correlate with the gold standard according to the Pearson coefficient. Moreover, neither the proposed baselines nor the best performing participants were able to find a significant correlation: the best correlation found by the participants was 0.62 for the trait Openness, which cannot be considered a strong positive correlation.
6. DISCUSSION AND FUTURE WORK
In this paper we have presented our participation in the PAN@FIRE Personality Recognition in Source Code 2016 shared task. Two approaches were proposed: an Author-Based approach and a Code-Based approach. The AB approach performed better for all the traits. This could be explained by the fact that the samples used to train the systems that followed the Code-Based approach were not independent; therefore, the results obtained in the development phase correspond to over-fitted systems. However, given that we did not have enough samples, proper data augmentation techniques still need to be explored. If we were able to obtain more labeled data, new approaches could be studied, such as deep learning methods and word embeddings for text representation.

It is noteworthy that the minimum RMSE achieved by the participants' proposals is close to the baseline models for all the personality traits, and only for some traits was a correlation with the gold standard found. This highlights the complexity of the task. Therefore, personality recognition in source code remains an open problem, and new NLP approaches could improve the performance of the systems.

7. REFERENCES
[1] S. Argamon, S. Dhawle, M. Koppel, and J. W. Pennebaker. Lexical predictors of personality type. In Proceedings of the 2005 Joint Annual Meeting of the Interface and the Classification Society of North America, 2005.
[2] S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler. Mining the blogosphere: Age, gender and the varieties of self-expression. First Monday, 12(9), 2007.
[3] G. J. Boyle, G. Matthews, and D. H. Saklofske. The SAGE Handbook of Personality Theory and Assessment: Personality Measurement and Testing, volume 2. Sage, 2008.
[4] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez, G. Riccardi, and F. Pianesi. The workshop on computational personality recognition 2014. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 1245-1246. ACM, 2014.
[5] P. T. Costa and R. R. McCrae. The Revised NEO Personality Inventory (NEO-PI-R). The SAGE Handbook of Personality Theory and Assessment, 2:179-198, 2008.
[6] S. Cruz, F. Q. da Silva, and L. F. Capretz. Forty years of research on personality in software engineering: A mapping study. Computers in Human Behavior, 46:94-113, 2015.
[7] R. Fuchs. Personality traits and their impact on graphical user interface design. In 2nd Workshop on Attitude, Personality and Emotions in User-Adapted Interaction, 2001.
[8] A. J. Gill. Personality and language: The projection and perception of personality in computer-mediated communication. PhD thesis, University of Edinburgh, 2003.
[9] E. Keogh and A. Mueen. Curse of dimensionality. In Encyclopedia of Machine Learning, pages 257-258. Springer, 2011.
[10] A. McEnery and M. Oakes. Authorship studies/textual statistics. 2000.
[11] W. T. Norman. Toward an adequate taxonomy of personality attributes: Replicated factor structure in peer nomination personality ratings. The Journal of Abnormal and Social Psychology, 66(6):574, 1963.
[12] O. H. Paruma-Pabón, F. A. González, J. Aponte, J. E. Camargo, and F. Restrepo-Calle. Finding relationships between socio-technical aspects and personality traits by mining developer e-mails. In Proceedings of the 9th International Workshop on Cooperative and Human Aspects of Software Engineering, pages 8-14. ACM, 2016.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[14] B. Plank and D. Hovy. Personality traits on Twitter - or - how to get 1,500 personality tests in a week. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 92-98, 2015.
[15] D. Preotiuc-Pietro, J. Eichstaedt, G. Park, M. Sap, L. Smith, V. Tobolsky, H. A. Schwartz, and L. Ungar. The role of personality, age and gender in tweeting about mental illnesses. In NAACL HLT 2015, page 21, 2015.
[16] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[17] F. Rangel, P. Rosso, M. Potthast, B. Stein, and W. Daelemans. Overview of the 3rd author profiling task at PAN 2015. In CLEF, 2015.
[18] R. S. Rubin, D. C. Munz, and W. H. Bommer. Leading from within: The effects of emotion recognition and personality on transformational leadership behavior. Academy of Management Journal, 48(5):845-858, 2005.