=Paper=
{{Paper
|id=Vol-1737/T1-3
|storemode=property
|title=CAPS-PRC: A System for Personality Recognition in Programming Code
|pdfUrl=https://ceur-ws.org/Vol-1737/T1-3.pdf
|volume=Vol-1737
|authors=Ivan Bilan,Eduard Saller,Benjamin Roth,Mariia Krytchak
|dblpUrl=https://dblp.org/rec/conf/fire/BilanSRK16
}}
==CAPS-PRC: A System for Personality Recognition in Programming Code==
CAPS-PRC: A System for Personality Recognition in Programming Code
Notebook for PAN at FIRE16

Ivan Bilan, Eduard Saller, Benjamin Roth
Center for Information and Language Processing, Ludwig Maximilian University of Munich, Oettingenstr. 67, Munich, Germany
ivan.bilan@gmx.de, eduard@saller.io, beroth@cis.uni-muenchen.de

Mariia Krytchak
Department of Psychology, Ludwig Maximilian University of Munich, Leopoldstr. 13, Munich, Germany
mariia.krytchak@gmx.de

ABSTRACT
This paper describes the participation of the CAPS-PRC system, developed at LMU Munich, in the personality recognition shared task (PR-SOCO) organized by PAN at the FIRE16 conference. The machine learning system uses the output of a Java code analyzer to investigate the structure of a given program, its length and its average variable-name length, and it also takes into account the comments a programmer wrote. The comments are analyzed with language-independent stylometric features, including the TF-IDF distribution, average word length, type/token ratio and more. The system was evaluated using Root Mean Squared Error (RMSE) and Pearson Product-Moment Correlation (PC). The best run exhibited the following results: Neuroticism (RMSE 10.42, PC 0.04), Extroversion (RMSE 8.96, PC 0.16), Openness (RMSE 7.54, PC 0.1), Agreeableness (RMSE 9.16, PC 0.04), Conscientiousness (RMSE 8.61, PC 0.07).

Keywords
machine learning; Big Five personality traits; source code analysis; abstract syntax tree

1. INTRODUCTION
The main purpose of the task is to investigate whether it is possible to predict the personality traits of programmers based on the source code they have written [8]. Previous research has identified a relationship between personality factors and computer programming style, using different measures of personality [2] [4]. The task considers the Big Five personality traits, which were assessed with the NEO-PI-R Inventory [5] to form the training set [8]: extroversion, emotional stability/neuroticism, agreeableness, conscientiousness, and openness to experience. The Big Five Model, i.e. five broad, fairly independent dimensions, encompasses all personality traits and is considered to describe personality in a comprehensive way. The NEO-PI-R Inventory is a statistically reliable and valid tool that operationalizes the Big Five Model through self- and other-assessment and is used in various cross-professional and cross-cultural contexts to describe personality.

2. EXPERIMENTAL SETUP

2.1 Approaching the problem
Based on the available research on the Big Five psychological traits [5] [3], the traits are considered to be independent of each other. For this reason, each psychological trait was viewed and analyzed individually. Figures 1 to 5 show the distribution of the training set for each psychological trait by author. Table 1 shows the mean trait distribution.

(Figures 1-5: Author Distribution for Agreeableness, Conscientiousness, Extroversion, Neuroticism and Openness; not reproduced here.)

Table 1: Mean Trait Distribution, Training Set
Trait              Mean Value   Standard Deviation
Agreeableness      47.02        8.95
Conscientiousness  46.37        6.46
Extroversion       45.22        8.19
Neuroticism        49.92        11.15
Openness           49.51        6.68

Since each programmer/author submitted more than one program, we approach the problem from two different angles (a small sketch of both aggregation strategies follows this list):
1) The feature vectors are extracted for each programmer, by first extracting them for each program and then averaging all the underlying feature vectors into one single feature vector for the author. The classifier learns from a single feature vector per author, so each author represents one sample in the dataset.
2) The classifier is trained at the level of programs. Each program inherits the trait value of its author; the feature vectors are extracted for each program, and the classifier regards each program as a training instance. To get back to the level of authors (the final prediction has to be made per author), the predictions are averaged over all programs belonging to a certain author. The final result is a single prediction for each author based on the predictions produced for each underlying program.
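The two aggregation strategies can be summarized in a few lines of Python. This is a minimal sketch assuming per-program feature vectors are already available as NumPy arrays; the function and variable names are illustrative and not taken from the actual system.

```python
import numpy as np

def author_level_dataset(programs_by_author, trait_scores):
    """Angle 1: average all per-program feature vectors of an author
    into one feature vector; one training sample per author."""
    X, y = [], []
    for author, program_vectors in programs_by_author.items():
        X.append(np.mean(program_vectors, axis=0))
        y.append(trait_scores[author])
    return np.array(X), np.array(y)

def program_level_dataset(programs_by_author, trait_scores):
    """Angle 2: every program is a training sample and inherits
    the trait value of its author."""
    X, y, authors = [], [], []
    for author, program_vectors in programs_by_author.items():
        for vec in program_vectors:
            X.append(vec)
            y.append(trait_scores[author])
            authors.append(author)
    return np.array(X), np.array(y), authors

def aggregate_program_predictions(authors, predictions):
    """Map program-level predictions back to one value per author
    by averaging the predictions of that author's programs."""
    per_author = {}
    for author, pred in zip(authors, predictions):
        per_author.setdefault(author, []).append(pred)
    return {a: float(np.mean(p)) for a, p in per_author.items()}
```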
2.2 Feature Extraction

2.2.1 Abstract syntax tree
We use a grammar γ specifically designed for the analysis of a programming language β, which in the context of this task is the Java programming language. The grammar γ combined with a parser ρ provides a semantic representation of the source code called an abstract syntax tree (AST). Compared to ordinary parse trees this has some potential advantages. First, the generation of an AST can be interpreted as a normalization step of our feature generation. In contrast to the original source code, which contains inconsistencies such as whitespace or other unneeded characters, the AST represents a concise version of a given program. This also makes the generation of meta-features (compositions of different base features) simpler, due to the strict representation of exactly those parts of the program that matter to the compiler. Additionally, the resulting syntax tree is not necessarily bound by the original syntactic rules of the programming language β, which allows for generalizations of the source code.
In our approach, we use the frequency distribution of all entities known to the grammar to build a feature list for a given program. This shallow use of the AST provides 237 features for a given source code analysis. Examples are the type of a variable or the nature of a statement (do, for, while, etc.). The AST is constructed with the help of the ANTLR parser [6].
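The node-frequency idea can be illustrated with a short sketch. The actual system builds the AST with an ANTLR Java grammar; here, as an assumption purely for illustration, the third-party javalang parser is used instead, and the resulting counts of AST node types stand in for the grammar-entity frequencies described above.

```python
from collections import Counter

import javalang  # illustrative substitute for the ANTLR-based parser

def ast_node_frequencies(java_source):
    """Parse a Java program and count how often each AST node type
    occurs; such counts form the AST part of the feature vector."""
    tree = javalang.parse.parse(java_source)
    return Counter(type(node).__name__ for _, node in tree)

# A tiny program yields counts for node types such as
# ClassDeclaration, MethodDeclaration, ForStatement, ...
demo = """
public class Demo {
    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) { System.out.println(i); }
    }
}
"""
print(ast_node_frequencies(demo).most_common(5))
```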
2.2.2 Custom Features
In addition to the AST, we used further features for the source code and for the comments. The following is an exhaustive list of all additional features used.
Code-based features: length of the whole program (in lines of code and in characters), the average length of variable names, and which indentation the programmer is using (tabs or spaces).
Comment-based features: type/token ratio, usage of punctuation marks, TF-IDF, the frequency of comments (block comments and inline comments counted separately), and average word length.
Author-level features: number of programs submitted (see Table 2) and the average length of the programs in lines of code.

Table 2: Programs per Author
Distribution / Dataset      Train Set   Test Set
Min. Programs per Author    6           14
Mean Programs per Author    37          37
Max. Programs per Author    122         109
Total Number of Programs    1790        772
Total Number of Authors     49          22

2.3 Classification
We experimented with a number of regression classifiers such as Linear Regression, Ridge Regression, Logistic Regression and Gradient Boosted Regression. In addition, we tried to detect outliers with RANdom SAmple Consensus (RANSAC). The final system does not use RANSAC, since it delivered worse results; this technique should, however, be investigated further on a bigger dataset.
We submitted our final runs based on two machine learning algorithms: Gradient Boosted Regression and Multinomial Logistic Regression. Gradient Boosted Regression was evaluated on the level of authors and on the level of programs, while Multinomial Logistic Regression was implemented on the level of authors only.
The first classification approach is based on Gradient Boosted Regression with least-squares regression as its loss function, 1100 estimators, a maximum depth of 5 for the individual regression estimators, and a learning rate of 0.1. This approach also uses a χ2 test for feature selection to keep only the best 200 features from the AST feature extraction pipeline. It was implemented with the scikit-learn Python library [7].
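The stated configuration maps onto scikit-learn roughly as follows. The paper only specifies the hyperparameters, the χ2 selection and the library; the pipeline wiring below is an illustrative sketch, with one such regressor trained per Big Five trait.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

gbr_pipeline = Pipeline([
    # chi2 keeps the 200 most informative AST features
    # (chi2 requires non-negative inputs, which frequency counts are)
    ("select", SelectKBest(chi2, k=200)),
    # least-squares loss (the scikit-learn default), 1100 estimators,
    # maximum depth 5, learning rate 0.1, as described above
    ("gbr", GradientBoostingRegressor(n_estimators=1100,
                                      max_depth=5,
                                      learning_rate=0.1)),
])

# Usage, per trait:
#   gbr_pipeline.fit(X_train, y_train_trait)
#   predictions = gbr_pipeline.predict(X_test)
```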
The second approach is based on a Multinomial Logistic Regression model with an l2-regularized squared loss as its objective function. That is, each feature is multiplied by a trait-specific weight, and the result of this linear combination is the input to a sigmoid activation. As the output of this prediction lies in the range [0, 1], we re-scaled the trait values in the training data to the same range for computing the squared loss.
Training was done using stochastic gradient descent with a constant learning rate, and the parameters were tuned on a held-out development set using random search. The search space was: learning rate ∈ {0.01, 0.1, 1}, number of training epochs ∈ {10, 20, 50, 100, 200, 500}, regularization ∈ {0, 0.001, 0.01, 0.1, 1}, (mini-)batch size ∈ {1, all}. The best configuration was: learning rate 1, training epochs 2, regularization 0.6, batch size all. This approach was developed with the Theano Python library [1].
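A minimal NumPy re-implementation of this objective may help make the target rescaling and the squared loss concrete. The original model was written in Theano; the function names and the full-batch gradient step below are illustrative, with defaults set to the reported best configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_trait_model(X, y_raw, lr=1.0, epochs=2, reg=0.6):
    """Sketch of the second approach for a single trait: a linear
    combination of features fed through a sigmoid, trained with an
    l2-regularized squared loss and full-batch gradient steps."""
    # re-scale trait scores to [0, 1], the range of the sigmoid output
    y_min, y_max = y_raw.min(), y_raw.max()
    y = (y_raw - y_min) / (y_max - y_min)

    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X.dot(w) + b)          # predictions in [0, 1]
        grad_z = (p - y) * p * (1.0 - p)   # d(squared loss)/d(pre-activation)
        w -= lr * (X.T.dot(grad_z) / len(y) + reg * w)
        b -= lr * grad_z.mean()

    def predict(X_new):
        # map the sigmoid output back to the original trait scale
        return sigmoid(X_new.dot(w) + b) * (y_max - y_min) + y_min

    return predict
```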
3. EXPERIMENTAL RESULTS
The dataset included 49 programmers in the training set (with 1790 programs in total) and 22 programmers in the test set (772 programs). The final evaluation was done with two evaluation metrics: Root Mean Squared Error (RMSE) and Pearson Product-Moment Correlation (PC).
In the Gradient Boosted Regression approach (GBR approach), the system was tuned to maximize both of these metrics at the same time, while the Multinomial Logistic Regression approach (MLR approach) concentrated on RMSE. Table 3 gives a detailed overview of the results achieved with Multinomial Logistic Regression at the level of authors. Table 4 shows the results achieved with the Gradient Boosted Regression approach at the level of authors and at the level of programs.
In general, the results are low for both RMSE and PC and only slightly outperform the baseline approaches (see Table 5). Two baselines were provided by the task organizers [8]:
1) a 3-gram character representation;
2) always predicting the mean trait value of the training dataset.

Table 3: Results of the Multinomial Logistic Regression Approach (author-based)
Personality Traits   RMSE    PC
Agreeableness        9.17    -0.12
Conscientiousness    8.83    -0.31
Extroversion         9.55    -0.1
Neuroticism          10.28   0.14
Openness             7.25    -0.1

Table 4: Results of the Gradient Boosted Regression Approach
                     Author-based       Program-based
Personality Traits   RMSE     PC        RMSE     PC
Agreeableness        10.89    -0.05     9.16     0.04
Conscientiousness    8.9      0.16      8.61     0.07
Extroversion         11.18    -0.35     8.96     0.16
Neuroticism          12.06    -0.04     10.42    0.04
Openness             7.5      0.35      7.54     0.1

Table 5: Baseline Approaches
                     3-gram characters   Mean value
Personality Traits   RMSE     PC         RMSE     PC
Agreeableness        9.00     0.20       9.04     0.00
Conscientiousness    8.47     0.17       8.54     0.00
Extroversion         9.06     0.12       9.06     0.00
Neuroticism          10.29    0.06       10.26    0.00
Openness             7.74     -0.17      7.57     0.00

4. CONCLUSIONS
This paper describes a system that, given the source code collection of a programmer, identifies their personality traits. While the RMSE and PC scores proved promising during development, further investigation suggested that the dataset may be too small to train an effective machine learning system. The compiler-style feature generation process using ASTs, combined with several custom features, could serve as a future baseline for similar tasks.

4.1 Future Work
The task would benefit greatly from an expanded training corpus (more samples per programmer and more programmers). The value distribution of the training set is also an important point: the current training set exhibits normally distributed scores for each Big Five trait, and a more robust system could be created with an equal number of samples in the low, mid and high value ranges.
Additionally, further feature engineering, additional statistical analysis of the AST output, and transferring strategies from other NLP tasks involving syntax trees to the current task could improve the system.

5. REFERENCES
[1] R. Al-Rfou, G. Alain, A. Almahairi, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[2] C. Bishop-Clark. Cognitive style, personality, and computer programming. Computers in Human Behavior, 11(2):241–260, 1995.
[3] O. P. John and S. Srivastava. The Big Five trait taxonomy: History, measurement, and theoretical perspectives. Handbook of Personality: Theory and Research, 2:102–138, 1999.
[4] Z. Karimi, A. Baraani-Dastjerdi, N. Ghasem-Aghaee, and S. Wagner. Links between the personalities, styles and performance in computer programming. Journal of Systems and Software, 111:228–241, 2016.
[5] F. Ostendorf and A. Angleitner. NEO-PI-R: NEO-Persönlichkeitsinventar nach Costa und McCrae. Hogrefe, 2004.
[6] T. Parr. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf, 2013.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, Nov. 2011.
[8] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.