A Supervised Approach for Personality Recognition in Source Code using Code Analysis Tool at FIRE 2016

Rehana Delair
SNPIT&RC
Bardoli, Gujarat
+91 9904039419
rehanad10@gmail.com

Rutal Mahajan
SNPIT&RC
Bardoli, Gujarat
+91 9426393096
rutal.mahajan@gmail.com

ABSTRACT
Personality Recognition from an author's source code is a task organized by the PR-SOCO team in conjunction with FIRE 2016, the Forum for Information Retrieval Evaluation. The aim is to identify a programmer's personality traits from a collection of their source code. We used several supervised learning approaches to train regression models on features extracted with the static code analysis tool checkstyle. Based on these features, the trained regression model predicts the score for each personality trait. All systems are evaluated using two metrics: Root Mean Squared Error (RMSE) and Pearson Product-Moment Correlation (PC). Our system scored 0.62 and 0.33 PC for the traits Openness and Conscientiousness respectively, using the M5Rules algorithm as the regression model. This is the best score among all submitted runs of our system as well as among all participating systems.

Keywords
Personality Recognition, Machine Learning, Regression, Big-Five Personality traits

1. INTRODUCTION
There is a lot of ongoing work in the area of Personality Recognition [1] [5] [6] [7]. Personality traits influence most human activities, such as the way people write [1], the way they interact with each other, and the way they make decisions. A programmer's personality affects the type of software project they choose to participate in [6] and the way they write or structure their code.

Many projects use written text to identify the author's personality. In "Whose thumb is it anyway?" [5], personal weblogs are analyzed to predict personality traits using the Support Vector Machine algorithm, with word-based bi- and tri-grams as the main features. In "Finding relationships between socio-technical aspects and personality traits by mining developer e-mails" [6], developers' e-mails are used to identify their personality.

Personality Recognition from source code differs from these projects because source code has a limited scope. Programmers cannot freely choose their own words; they have to follow pre-defined rules. This makes identifying personality from source code a difficult task.

Personality can be described along five traits using the Big Five Theory [3], which is the most widely accepted in psychology. The five traits are extroversion (E), emotional stability / neuroticism (N), agreeableness (A), conscientiousness (C), and openness to experience (O).

To collect features from the given source code we use checkstyle [2], a code analysis tool that performs different checks on the source code. We use the Weka [4] tool to train the regression models.

The rest of this paper is structured as follows: Section 2 outlines our approach to Personality Recognition in Source Code, including the tools used. Section 3 describes the training and test data. Section 4 describes the experiments and Section 5 the official results of this task. Finally, we conclude in Section 6.

2. Approach

2.1 Overview
The main process of Personality Recognition includes the following steps, which are shown in Figure 1:

1. Collect individual corpora
In this step we collect the training data. In this case it consists of source code written by different programmers, provided by the PR-SOCO committee [8].

2. Collect associated personality ratings for each participant
In this step we collect the personality ratings for each programmer. We use the Big-Five personality traits [3] to describe the personality of an individual. This data is also provided by the PR-SOCO committee [8].

3. Pre-processing
In this step the given files are converted into a format suitable for checkstyle [2]. Separating lines are removed from the source code and the data is converted into actual Java files. We also implemented a function to isolate a single program from the given training files of source code.

4. Extract relevant features from the texts
In this step the main features are identified from the given source code. We need features of the source code that reflect the author's personality. For this purpose we use the code analysis tool checkstyle [2]. It performs different checks on the source code, such as how well the code is commented, how it is indented, whether naming conventions are followed, etc. From these checks we collect measures for the different features.

5. Build statistical models of the personality ratings based on the features
We use different regression models to predict the personality traits: Support Vector Machine regression, Gaussian Processes, the M5P (M5') algorithm, M5Rules and Random Tree. We use the Java API of Weka [4] to train the regression models.

6. Test the learned models on unseen individuals
Using the extracted features and the trained regression model, we predict the score for each personality trait.

[Figure 1. Flow Diagram of the Process. Boxes in the diagram: Training Data, Preprocessing, Individual Programs, checkstyle Feature Extraction, Errors and Warnings, Collect Features, Features, Build Regression Model, Regression Model, Test Data, Output.]

2.2 Features
We use a total of 154 features of the source code, extracted using the static code analysis tool checkstyle [2]. To train the regression model, these features are grouped into two categories: Style based features and Content based features. They are shown in Table 1.

1. Style based Features
This category contains features related to the style of the code. Such features come from checks on code layout and formatting problems. It contains Indentation, Headers, Javadoc comments, White spaces, Block checks, etc.

2. Content based Features
This category contains features related to the content of the source code. It covers checks on class design problems, method design problems, Annotations, Coding, Imports, Metrics, Modifiers, Naming conventions, Size violations and other miscellaneous features.

Table 1. Different Features Extracted by Checkstyle tool

Category                 Feature Name              Number of features
Style based feature      Headers                     2
                         Javadoc comments           12
                         White spaces               16
                         Block Checks                6
Content based feature    Class design problems       9
                         Annotations                 7
                         Coding                     43
                         Imports                     8
                         Metrics                     6
                         Modifiers                   2
                         Naming conventions         15
                         Size violations             8
                         Miscellaneous              15
                         Regular expression          5
Total                                              154

Each single program is separated from the collection of source code and checked using checkstyle [2]. The reported errors and warnings are counted and converted to a per-line-of-code measure.

3. Data set
The training data set was provided by the PR-SOCO committee and consists of source code written in Java. It contains 49 documents, each a collection of source code by a different author. These documents are labeled with the personality traits of the programmer on a continuous scale from 20 to 80.

The test data were also provided by the PR-SOCO committee [8]. They consist of 21 documents, each a source code collection. We used this data to evaluate our system.

4. Experiments
We have a collection of source code written by 49 different programmers along with their personality traits. We used this data to train our models and then tested them on 21 unseen source code collections. Two metrics were used to evaluate the system: the average Root Mean Squared Error (RMSE) and the Pearson Product-Moment Correlation (PC) between the scores predicted by our software and the ground-truth scores. Results on the given test data are discussed in the next section.

RMSE is the square root of the mean of the squared errors. PC is a measure of the strength of the linear association between two variables.

We used different supervised regression models to predict the personality traits of the authors: Support Vector Machine, Gaussian Processes, the M5P algorithm, M5Rules and the Random Tree algorithm. The Support Vector Machine represents each data item as a point in n-dimensional space; we used it with the default kernel settings. M5P is a decision-tree-based algorithm and M5Rules is a rule-based algorithm.

5. Results
We submitted a total of five runs, each using a different regression algorithm: Support Vector Machine (reported as SMO), Gaussian Processes (GP), the M5P algorithm, M5Rules and the Random Tree algorithm.

Results obtained for the different runs of our system are shown in Table 2. Two metrics are shown for each personality trait as RMSE/PC, that is, Root Mean Squared Error / Pearson Product-Moment Correlation. The bottom of Table 2 shows the measures for two baselines [8]: (a) a bag of character 3-grams with frequency weight, and (b) an approach that always predicts the mean value observed in the training data. Our system scored 0.62 and 0.33 PC for the traits Openness and Conscientiousness respectively, using the M5Rules algorithm as the regression model. This is the best score among all submitted runs of our system as well as among all participating systems.

For Neuroticism, our predicted scores are positively correlated with the ground-truth scores in all runs (PC from 0.18 to 0.29), but the RMSE of the Gaussian Processes and SMO runs (26.36 and 26.72) is close to the worst reported value (29.44). For Extroversion, the regression models give different scores that are only weakly correlated with the ground truth (PC from -0.26 to 0.08). For Openness, the M5Rules and M5P runs are strongly correlated with the ground truth (PC 0.62 and 0.54). For Agreeableness, the predictions of all runs are negatively correlated with the ground truth (PC from -0.19 to -0.06), and the RMSE of the SMO run (27.78) is close to the worst reported value (28.63). For Conscientiousness, predicted scores are positively correlated with the ground truth in all runs (PC from 0.1 to 0.33).

Table 2. Official results of different runs of our system (RMSE/PC)

Run              N             E             O             A             C
M5Rules          19.07/0.2     25.22/0.08    23.62/0.62    21.47/-0.15   22.05/0.33
GP               26.36/0.19    16.67/-0.02   15.97/0.19    23.11/-0.13   21.72/0.1
M5P              18.75/0.2     25.22/0.08    20.28/0.54    21.47/-0.15   22.05/0.33
Random Tree      17.55/0.29    20.34/-0.26   16.74/0.27    21.1/-0.06    20.9/0.14
SMO              26.72/0.18    23.41/-0.11   16.25/0.13    27.78/-0.19   15.53/0.27
Baseline bow     10.29/0.06    9.06/0.12     7.74/-0.17    9.00/0.20     8.47/0.17
Baseline mean    10.26/0.00    9.06/0.00     7.57/0.00     9.04/0.00     8.54/0.00
Best Results     9.78/0.36     8.60/0.47     6.95/0.62     8.79/0.38     8.38/0.33
Worst Results    29.44/-0.29   28.80/-0.37   33.53/-0.36   28.63/-0.32   22.36/-0.31

6. Conclusion
Various supervised learning algorithms proved capable of predicting personality trait scores for different authors from their source code. Our current system does not yet refine the effect of individual extracted features on the different personality traits. Such refinement may yield better prediction results than the currently submitted runs.

7. REFERENCES
[1] Celli F., Lepri B., Biel J. I., Gatica-Perez D., Riccardi G., Pianesi F. (2014). The workshop on computational personality recognition 2014. Proc. of the ACM Int. Conf. on Multimedia, pp. 1245-1246.
[2] CheckStyle project, http://checkstyle.sourceforge.net/
[3] Costa P.T., McCrae R.R. (2008). The revised NEO personality inventory (NEO-PI-R). The SAGE handbook of personality theory and assessment 2, pp. 179-198.
[4] Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I.H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1.
[5] Oberlander J., Nowson S. (2006). Whose thumb is it anyway? Classifying author personality from weblog text. Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pp. 627-634, Sydney, July 2006. Association for Computational Linguistics.
[6] Paruma-Pabón O.H., González F.A., Aponte J., Camargo J.E., Restrepo-Calle F. (2016). Finding relationships between socio-technical aspects and personality traits by mining developer e-mails. Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), ICSE.
[7] Rangel F., Celli F., Rosso P., Potthast M., Stein B., Daelemans W. (2015). Overview of the 3rd Author Profiling Task at PAN 2015. CLEF 2015 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CEUR-WS.org, vol. 1391.
[8] Rangel F., González F., Restrepo F., Montes M., Rosso P. (2016). PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde. Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016. CEUR Workshop Proceedings, CEUR-WS.org.
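
As a supplementary illustration of the two evaluation metrics defined in the Experiments section, the following is a minimal Python sketch of RMSE and the Pearson Product-Moment Correlation. It is not the official PR-SOCO evaluation script: the function names `rmse` and `pearson` and the example scores are hypothetical, and returning 0.0 for a constant prediction is an assumed convention, chosen only because the mean baseline is reported with PC 0.00.

```python
import math


def rmse(predicted, actual):
    """Square root of the mean of the squared prediction errors."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson(predicted, actual):
    """Pearson product-moment correlation between two score lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    covariance = sum((p - mean_p) * (a - mean_a)
                     for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        # Assumed convention: correlation is undefined for a constant
        # list, so report 0.0 instead of dividing by zero.
        return 0.0
    return covariance / math.sqrt(var_p * var_a)


# Hypothetical trait scores on the 20-80 scale (not task data).
predicted_scores = [50, 55, 45, 60]
true_scores = [48, 58, 40, 62]
print("RMSE:", round(rmse(predicted_scores, true_scores), 2))
print("PC:", round(pearson(predicted_scores, true_scores), 2))
```

The two metrics capture different things: a run can track the ordering of the ground-truth scores well (high PC) while still being far from them in absolute terms (high RMSE), which is why both values are reported per trait.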