UAEMex System for Identifying Personality Traits from Source Code

Eder Vázquez Vázquez, Omar González Brito, Jovani A. García, Miguel García Calderón, Gabriela Villada Ramírez, Alan J. Serrano León, René A. García-Hernández, Yulia Ledeneva

Universidad Autónoma del Estado de México, UAPT Tianguistenco. Instituto Literario, 100, Toluca, Edo. Méx. 50000, México.

eder2v@hotmail.com, gonzalezbritoomar@gmail.com, jovani_2807@hotmail.com, tonsquemike@outlook.com, inggaby.vr@gmail.com, alan.serrano.leon@outlook.com, renearnulfo@hotmail.com, yledeneva@yahoo.com

ABSTRACT
This paper describes the UAEMex participation in the Personality Recognition in Source Code (PR-SOCO 2016) task, where the principal challenge is to identify five personality traits from the source code of a developer. In the first phase of the task, a training dataset with the source codes of 50 developers, together with the degree of incidence of each personality trait, was provided. In the second phase, a test dataset with the source codes of 21 developers had to be classified. Our method consists of extracting only 41 features from the source code, including the comments, in order to classify it (we test four models). Using the evaluation metrics proposed by PR-SOCO, our system is ranked among the best systems for both evaluation metrics. Finally, using the RMSE and PC metrics, we propose a ranking measure.
Keywords
PR-SOCO; Support Vector Machine; Symbolic Regression; KNN; Neural Networks; Personality Trait; Genetic Algorithms.

1. INTRODUCTION
Personality is an inherent aspect of human nature that has an influence on a person's activities. That is, personality is the set of characteristics that describes a person and makes him or her different from others [1]. Nowadays, identifying the degree of the personality traits to determine whether a candidate fits a job is as important as skills and experience [2]. After decades of research, the Big-Five Theory is the most accepted model for assessing personality [2]. This model has a hierarchical organization of personality traits with five classes: Extroversion (E), Agreeableness (A), Conscientiousness (C), Neuroticism (N), and Openness to experience (O) [3].

Given a small set of Java source codes in the PR-SOCO task, the main objective is to identify the degree of presence of the five personality classes [4]. In order to approximate which aspects determine the personality, the NEO-PI R test, which is based on the Big-Five Theory, may be answered to measure the personality traits [3]. There are many structured surveys based on the NEO-PI R available on-line in several Web pages for anybody to predict the personality of a user. Using these aspects, we propose to extract 41 features as the main information for training four classifiers.

In this paper, we present the working notes of the UAEMex participation in the PR-SOCO 2016 task.

This paper is organized as follows. In section 2, the methodology is described. In section 3, the results for the test dataset experiments are presented. In section 4, using the evaluation metrics proposed by PR-SOCO, we rank our results against other systems by personality trait. In section 5, the conclusions are presented.

2. METHODOLOGY
The proposed methodology is divided into four steps: Corpus Analysis, Feature Extraction, Feature Representation and Classification.

2.1 Corpus Analysis
The training dataset is composed of 1741 Java source codes of 50 developers who were evaluated with the Big-Five Theory personality traits, where each trait ranges between 20 and 80. However, since the number of different values per personality trait in the samples is small, we decided to manage each program separately in order to get a good representation. The number of different values per class for every personality trait is shown in table 1.

Table 1. Source code distribution for every personality trait.

Personality Trait    Number of different values
Neuroticism          13
Extroversion         14
Openness             11
Agreeableness        14
Conscientiousness    12

2.2 Feature Extraction
Using a few source codes of our own team members, we identified some personal features in order to find similar elements. As a result, we detected that indentation, identifier and comment features are important for determining the author of such codes. These features can be extracted independently of the content or objective of the source code. The first 25 features are computed as averages and the last 16 as frequencies. The extracted features can be classified as:

Indentation features: spaces in code, spaces in the comments, spaces between classes, spaces between source code blocks, spaces between methods, spaces between control sentences and spaces inside grouping characters "(), [], {}". These features are measured with the average.

Identifier features: the presence of underscores, uppercase and lowercase letters in the name of an identifier is measured in a binary way. Also, we extract the average number of characters and the average length of the name of an identifier as features. These features are extracted for class, method and variable names. Also, the percentage of initialized variables is extracted.

Comment features: the presence of line and block comments is extracted as binary features. Also, the presence of comments written entirely in uppercase is extracted as a binary feature. Finally, the average size of the comments is extracted as a feature.

2.3 Feature Representation
For every source code, the 41 features are extracted and represented in a vector space model, where the source code 𝑆𝑖 is represented by the 41 features 𝑓𝑗 [5].

2.4 Classification
Once the source codes are represented in a vector space model, we train the system with the following classifiers. The objective of testing different classifiers is that, if the extracted features are good, then in general we should obtain good results with all of them. It is worth saying that these classifiers have been widely used in other language processing tasks; we especially trust the Symbolic Regression model, since the training dataset has only a few values per trait.
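As a concrete illustration of the extraction and representation steps, the sketch below reduces one Java source string to a small numeric vector. The feature names and formulas here are hypothetical stand-ins in the same spirit as the description above, not the paper's exact 41 features.

```python
import re

def extract_features(source: str) -> list[float]:
    """Toy indentation/identifier/comment statistics for one source file."""
    lines = source.splitlines() or [""]
    # line comments and (possibly multi-line) block comments
    comments = re.findall(r"//[^\n]*|/\*.*?\*/", source, flags=re.S)
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return [
        # indentation feature: average leading spaces per line
        sum(len(l) - len(l.lstrip(" ")) for l in lines) / len(lines),
        # identifier feature: average identifier length
        sum(map(len, identifiers)) / max(len(identifiers), 1),
        # identifier feature (binary): any underscore in an identifier
        float(any("_" in i for i in identifiers)),
        # comment feature: average comment size in characters
        sum(map(len, comments)) / max(len(comments), 1),
        # comment feature (binary): block comments present
        float("/*" in source),
    ]

code = 'int my_count = 0; // counter\n/* block comment */\n    my_count++;'
print(extract_features(code))
```

Each file's vector then becomes one row of the training matrix, labeled with its developer's five trait scores.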
2.4.1 Symbolic Regression (SR)
Finding the structure, coefficients and appropriate elements of a model at the same time as trying to solve the problem is a challenge for which no efficient mathematical method exists; traditional mathematical techniques are not the best in empirical modeling problems due to their nonlinearity. Hence, there is a need for an artificial expert which can create or define a model from the available data of a specific task without requiring an understanding of the problem [6]. Symbolic Regression is such an artificial expert, which evolves models from the available data observations [7] [8]; its main objective is to find a model which describes the relationship between the dependent variable and the independent variables as accurately as possible [9]. Because Symbolic Regression works directly with Genetic Programming, it is possible to evolve equations or mathematical functions in order to estimate the behavior of a dataset. The symbolic regression technique stands out as a viable solution to the problem of this work because it does not assume a form for the answer, but discovers it [10].

2.4.2 Support Vector Machine (SVM)
An SVM maps a set of examples as a set of points in the same space, trying to find the optimal hyperplane, defined as the hyperplane with maximal separation between two classes [11]. SVMs make predictions based on which side of the gap the points fall on [12]. In this work, we used the SVM implementation LIB-SVM [13].

2.4.3 K Nearest Neighbor (KNN)
KNN is one of the simplest machine learning algorithms, known as a lazy classifier, where the classification function is only approximated locally. KNN is trained using vectors in the feature space; each vector must have a class label. The training phase consists of storing the feature vectors and class labels of the training dataset. In the classification phase, it is necessary to define a constant 𝑘 and send an unlabeled vector to the KNN algorithm, which calculates the minimal distance between the stored examples and the input vector [14]. We use the Weka implementation of the KNN algorithm [15].

2.4.4 Back Propagation Neural Network (BP-NN)
A neural network is made of elemental processors that receive a vector as input data. The feature vector is sent to the input layer, and then every neuron processes the 𝑘-th input with the 𝑘-th weight and returns the 𝑘-th output. Neural networks are used to approximate functions according to the input data [16]. When a neural network implements back-propagation of the error, the output of the network is compared with the desired output to calculate the network error and then correct the weights of every neuron in the hidden layer [17].

3. RUN RESULTS
In this section, the results submitted for the PR-SOCO test dataset are described.

Run 1: This run was generated using symbolic regression (SR) over the vector space model, but we eliminated the source codes of five developers according to the following criteria: the developer with a high presence of all the personality traits, the developer with a low presence of all the personality traits, the developer with an average presence of all the personality traits, the developer with the most source codes and the developer with the fewest source codes.

Run 2: Similar to run 1, this run was generated using SR, but for each personality trait the developers (between 12 and 20) with an average presence of that trait were eliminated.

Run 3: For this run, the whole training dataset was used with the Back Propagation Neural Network.

Run 4: The whole training dataset was used with KNN with constant 𝑘 = 3.

Run 5: We used a genetic algorithm, but this run is not described because we found a mistake in it.

Run 6: The whole training dataset was used to classify with an SVM.

The Root Mean Square Error (RMSE) and Pearson Correlation (PC) metrics were used by the PR-SOCO task for evaluating and ranking the results. A minimal RMSE is desired for a system; in contrast, for the PC metric a value closer to 1 or -1 is desired.
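Both evaluation metrics have standard definitions and can be computed directly. The sketch below is our own illustration of those definitions, not the organizers' evaluation script.

```python
import math

def rmse(pred, obs):
    """Root mean square error between predicted and observed trait values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def pearson(pred, obs):
    """Pearson correlation: linear association between predictions and observations."""
    n = len(obs)
    mp, mo = sum(pred) / n, sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    return cov / (sp * so)

# Hypothetical predicted vs. observed trait scores for four developers
pred = [50.0, 45.0, 60.0, 40.0]
obs = [52.0, 47.0, 58.0, 42.0]
print(round(rmse(pred, obs), 2), round(pearson(pred, obs), 2))
```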
In table 2, the RMSE scores of our runs are presented, with the best score per trait marked with an asterisk.

Table 2. RMSE results of submitted runs for the test dataset.

Run   N       E       O       A       C
1     11.54   11.08   6.95*   8.98    8.53*
2     11.10   12.23   9.72    9.94    9.86
3     9.84*   12.69   7.34    9.56    11.36
4     10.67   9.49*   8.14    8.97*   8.82
6     10.86   9.85    7.57    9.42    8.53*

As can be seen, the first and sixth runs obtain the best scores; in those runs the SR and SVM classifiers were used, respectively.

In table 3, the results with the Pearson Correlation metric are shown, with the best score per trait marked with an asterisk.

Table 3. PC results of submitted runs for the test dataset.

Run   N       E        O       A       C
1     -0.29   -0.14    0.45*   0.22    0.11
2     -0.14   -0.15*   0.04    0.19    -0.30*
3     0.35*   -0.10    0.28    0.33*   -0.01
4     0.04    -0.04    0.10    0.29    -0.07
6     0.13    0        0       0       0

In PR-SOCO 2016, two evaluation metrics were used, giving two ways of ranking the results: the RMSE measures the average error between the observed and predicted values, and the PC measures the correlation between both variables. In this paper, we propose ranking the results using both the RMSE and PC measures as:

Ranking = (1 - PC) * RMSE

This measure is only applied to results with a positive correlation in the PC metric. Since the RMSE is not normalized, we propose to multiply both results. With this ranking measure, the best values are those closer to zero.
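The proposed measure is straightforward to compute; a minimal sketch, using the run-1 Openness scores from tables 2 and 3 as a worked example:

```python
def ranking(rmse: float, pc: float) -> float:
    """Combined measure Ranking = (1 - PC) * RMSE; lower is better."""
    if pc <= 0:
        raise ValueError("defined only for positive Pearson correlations")
    return (1 - pc) * rmse

# Run 1, Openness: RMSE = 6.95 and PC = 0.45 give (1 - 0.45) * 6.95
print(round(ranking(6.95, 0.45), 2))  # → 3.82
```

A low RMSE is rewarded directly, while a high positive correlation shrinks the factor (1 - PC), so only runs that do well on both metrics approach zero.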
4. RANKING RESULTS
In PR-SOCO 2016, eleven teams participated in this task, together with two baselines: the baseline bow (bl bow), based on character trigrams, and the baseline mean (bl mean), based on a method that predicts the mean value of the observed values. In table 4, the best RMSE results of those teams for every personality trait are shown according to the rank. In general, our results (uaemex) were ranked in good positions, outperforming the baselines, except for Extroversion: in the case of Neuroticism and Agreeableness we were ranked in second position, in the case of Openness we obtained the first rank, and for Conscientiousness we obtained the fourth position, between the two baselines.

Table 4. Best runs with the RMSE metric.

Rank  N              E              O              A              C
1     9.78           8.69           6.95 (uaemex)  8.6            8.38
2     9.84 (uaemex)  8.79           7.16           8.97 (uaemex)  8.39
3     9.97           8.8            7.19 (bl bow)  9.04           8.47 (bl mean)
4     10.04          8.96           7.27           9.16           8.53 (uaemex)
5     10.24          9.01           7.42           9.32           8.54 (bl bow)
6     10.26          9.06           7.57           9.36           8.59
7     10.27          9.22           7.74           9.39           8.61
8     10.28          9.49 (uaemex)  7.97           9.55           8.69
9     10.29          —              8.19           10.31          8.77
10    10.37          —              8.21           —              8.85
11    10.53          11.18          8.43           11.5           9.99
12    17.55          16.67          15.97          21.1           15.53
13    24.16          27.39          22.57          28.63          22.36

In table 5, the best PC results of those teams for every personality trait are shown according to the positive correlation results. In general, our results (uaemex) were ranked in good positions, outperforming the baseline configurations: in the case of Neuroticism, Openness, Agreeableness and Conscientiousness we were ranked in second position, except for the Extroversion trait. In general, it is possible to observe that the rank of our results for the RMSE metric corresponds with the rank of our results for the PC metric.

Table 5. Best runs with the PC metric.

Rank  N              E       O              A              C
1     0.36           0.47    0.62           0.38           0.33
2     0.35 (uaemex)  0.38    0.45 (uaemex)  0.33 (uaemex)  0.32 (uaemex)
3     0.31           0.35    0.37           0.29           0.31
4     0.29           0.31    0.33           0.21           —
5     0.27           0.31    0.3            0.21           0.21
6     0.23           0.27    0.29           0.19           0.19
7     0.14           0.16    0.12           0.06           0.16
8     0.1            0.11    0.12           0              0.13
9     0.1            0.1     0.05           -0.05          0.07
10    0.09           0.08    0              -0.07          0.06
11    0.08           0       -0.19          -0.12          0
12    0.05           -0.17   -0.23          -0.2           -0.15
13    -0.31          -0.28   —              —              —

Table 6 shows the best results evaluated with our proposed measure. As we can see in table 6, our results achieve a better balance between RMSE and PC: the uaemex team is ranked in first position for the Neuroticism and Openness traits, in second place for Agreeableness and in sixth place for Conscientiousness.

Table 6. Results with our proposed evaluation measure.

Rank  N              E       O              A              C
1     6.39 (uaemex)  5.32    3.82 (uaemex)  5.88           6.24
2     6.54           5.59    4.60           6.36 (uaemex)  6.78
3     6.74           6.03    4.79           6.71           7.03
4     7.67           6.07    5.13           6.98           7.2
5     8.84           7.52    5.26           7.55           7.47
6     8.91           7.97    7.28           8.24           7.59 (uaemex)
7     9.3            8.49    7.57           8.49           8.23
8     9.67           9.06    8.43           9.26           8.54
9     9.74           9.32    8.97           11.33          —
10    9.93           9.85    9.04           22.61          —
11    12.46          10.26   16.47          —              —
12    21.74          24.65   —              —              —
13    23.20          —       —              —              —
However, with this new ranking, for the Extroversion trait we do not outperform either baseline.

5. CONCLUSIONS
This paper presents our results in personality trait prediction, describing the participation of the UAEMex team at PR-SOCO 2016.

The submitted runs overcome the baselines despite the fact that the corpus has noise, such as repeated source code and obfuscated source code, and contains few samples. The training set has different classes of personality; the classes are unbalanced and there are not enough examples per class value. In this approach, we did not apply any preprocessing, because all the information in the corpus was considered relevant to the task. Personality trait prediction from source code is a new task and there are no reference approaches for it, so it was difficult to identify which features should be extracted.

The best results among our runs were obtained with the symbolic regression model, because its training phase tries to approximate the output of the input vector.
Also, we propose a new ranking measure that combines the RMSE and PC measures in order to obtain an approximation for evaluating results. According to our experiments on the training dataset, we note that it is better than RMSE or PC evaluated alone, since RMSE is a minimization metric and PC is a maximization metric.

6. ACKNOWLEDGMENTS
Thanks to the Autonomous University of the State of Mexico (UAEMex), the Consejo Nacional de Ciencia y Tecnología (CONACyT) and the Consejo Mexiquense de Ciencia y Tecnología (COMECyT) for the support granted for this work.

7. REFERENCES
[1] Montaño, M., Palacios, J., Gantiva, C. 2009. Teorías de la personalidad. Un análisis histórico del concepto y su medición. Psychologia. Avances de la disciplina, 81-107.
[2] Costa, P. T., McCrae, R. R. 2008. NEO PI-R. Revised NEO Personality Inventory. TEA Ediciones S.A.
[3] Hussain, S., Abbas, M., Shahzad, K., Syeda, A. 2012. Personality and career choices. African Journal of Business Management (AJBM) 6, 2255-2260.
[4] Rangel, F., González, F., Restrepo, F., Montes, M., Rosso, P. 2016. PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org.
[5] Salton, G., Wong, A., Yang, C. S. 1975. A vector space model for automatic indexing. Commun. ACM 18, 613-620.
[6] Dabhi, V. K., Vij, S. K. 2011. Empirical modeling using symbolic regression via postfix Genetic Programming. In Image Information Processing (ICIIP), 2011 International Conference on, 1-6.
[7] Koza, J. R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
[8] Murari, A., Peluso, E., Gelfusa, M., Lupelli, I., Lungaroni, M., Gaudio, P. 2015. Symbolic regression via genetic programming for data driven derivation of confinement scaling laws without any assumption on their mathematical form. Plasma Physics and Controlled Fusion 57.
[9] Kommenda, M., Affenzeller, M., Burlacu, B., Kronberger, G., Winkler, S. M. 2014. Genetic programming with data migration for symbolic regression. In Proceedings of the 2014 Conference Companion on Genetic and Evolutionary Computation, 1361-1366.
[10] Can, B., Heavey, C. 2011. Comparison of experimental designs for simulation-based symbolic regression of manufacturing systems. Computers and Industrial Engineering 61, 447-462.
[11] Hearst, M. A. 1998. Support Vector Machines. IEEE Intelligent Systems 13, 18-28.
[12] Cortes, C., Vapnik, V. 1995. Support-Vector Networks. Machine Learning 20, 273-297.
[13] Chang, C.-C., Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1-27.
[14] Stone, C. J. 1977. Consistent Nonparametric Regression. The Annals of Statistics 5, 595-620.
[15] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H. 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11, 10-18.
[16] McCulloch, W. S., Pitts, W. 1988. A logical calculus of the ideas immanent in nervous activity. In: Anderson, J. A., Rosenfeld, E. (eds.) Neurocomputing: Foundations of Research, 15-27.
[17] Rumelhart, D. E., Hinton, G. E., Williams, R. J. 1986. Learning internal representations by error propagation. In: Rumelhart, D. E., McClelland, J. L. (eds.) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, 318-362.