Personality Recognition Applying Machine Learning Techniques on Source Code Metrics

Hugo A. Castellanos
Universidad Nacional de Colombia
Bogotá, Colombia
hacastellanosm@unal.edu.co

ABSTRACT
Source code has become a data source of interest in recent years. In the software industry, the extraction of source code metrics is common, mainly for quality assurance purposes. In this paper, source code metrics are used to consolidate programmer profiles in order to identify different personality traits using machine learning algorithms. This work was done as part of the Personality Recognition in SOurce COde (PR-SOCO) shared task in the Forum for Information Retrieval Evaluation 2016 (FIRE 2016).

CCS Concepts
•Information systems → Content analysis and feature selection; •Computing methodologies → Supervised learning by regression; Cluster analysis; •General and reference → Metrics; •Software and its engineering → Parsers;

Keywords
Personality recognition; Source code metrics; Support Vector Regression

1. INTRODUCTION
Pieces of text have always been of interest in information retrieval, as text-based documents contain valuable information about their author. During recent decades, source code has become a source of valuable information as well. Many efforts in this field have been addressed to improve both processes and products in the software development industry [8].

The main efforts in source code analysis have been focused on forensic applications such as author recognition [5] and plagiarism detection [2]. Several techniques have been used successfully in these tasks, such as n-grams, source code metrics, coding styles, and abstract syntax trees [6]. Other applications of source code analysis include feature location [3] and topic identification [7], among others.

The PR-SOCO shared task consisted in predicting the personality traits of a programmer given a set of his/her source codes. These source codes, like any other production of a human being, may be influenced by personality.

In this work, the use of source code metrics is proposed to find information about the program author: specifically, the author's personality traits based on the Big-5 personality test. In addition, machine learning methods are used to predict the personality traits based on the extracted source code metrics.

The rest of this paper is organized as follows. Section 2 presents a general background on source code metrics. Section 3 describes the proposed approach. Section 4 presents the machine learning strategies. Section 5 presents the obtained results. Finally, Section 6 concludes the paper.

2. BACKGROUND ON SOURCE CODE METRICS
According to Malhotra [8], software metrics are used to assess the quality of the product or of the process used to build it. Such metrics have the following characteristics:

• Quantitative: metrics have a value.

• Understandable: the way the metric is calculated must be easy to understand.

• Validatable: metrics must capture the attributes they were designed to capture.

• Economical: it must be economical to capture the metric.

• Repeatable: if measured several times, the results should be the same.

• Language independent: the metric should not depend on a specific language.

• Applicability: the metric should be applicable in any phase of the software development.

• Comparable: the metric should correlate with another metric capturing the same concept.

Source code metrics must have a scale, which can be:

• Interval: given by a defined range of values.

• Ratio: a value which has an absolute minimum or zero point.

• Absolute: a simple count of the elements of interest.

• Nominal: a value which defines a discrete scale of values, like 1-present or 0-not present.

• Ordinal: a categorization intended to order or rank, for instance levels of severity: critical, high, medium, etc.

The metrics can be classified according to the intended measure:

• Size: usually intended to estimate cost and effort. The most popular metric in this category is the source lines of code (SLOC), but in object-oriented languages size can also be measured by the number of classes, methods, and attributes.

• Software quality: intended to measure the quality of the software. These metrics can be divided into the following categories:

  – Based on defects: they consist in measuring the level of defects. The main metrics of this category are the defect density, defined as the number of defects by SLOC, and the defect removal effectiveness, defined as the number of defects removed in a phase divided by the latent defects. If the latent defects are unknown, they can be estimated based on previous phases.

  – Usability: these metrics are intended to measure user satisfaction with the software. Satisfaction can be given by the ease of use and of learning.

  – Complexity metrics [9]: they are oriented to produce a measure of the difficulty to test or maintain a piece of source code. They also give information about the amount of instructions executed.

  – Testing: intended to measure the progress of testing over a software.

• Object-oriented metrics: intended to measure object-oriented paradigm features. They can be divided into:

  – Coupling: a measure of the level of interdependence between classes; it is calculated by counting the number of classes called by another class.

  – Cohesion: measures how many elements of a class are functionally related to each other.

  – Inheritance: measures the depth of the class hierarchy.

  – Reuse: a measure of the number of times a class is reused.

  – Size: intended to measure size not only in lines of code but also in the particularities of the object-oriented paradigm, like method count, attribute count, class count, etc.

• Evolutionary metrics: they try to measure the evolution of a software based on different elements like revisions, refactorings, and bug fixes. They measure how many lines of code are new, modified, or deleted.

Additionally, the empirical Halstead metrics [4] should also be considered. The base to calculate these metrics are the operands (identifiers) and the operators (keywords, ++, +). Equation 1 defines the vocabulary n as the sum of the unique operators (n1) and operands (n2). The length N, described in Equation 2, is the sum of the total number of operators (N1) and operands (N2).

n = n1 + n2    (1)

N = N1 + N2    (2)

The Halstead volume (V), described in Equation 3, is a measure of size, but it is also interpreted as the number of mental comparisons that were needed to write a program with length N. Moreover, the difficulty (D), shown in Equation 4, describes the difficulty to write a program. It is highly related to the volume because as the volume increases, the difficulty also does.

V = N log2 n    (3)

D = (n1 / 2) · (N2 / n2)    (4)

The effort (E), described in Equation 5, indicates the effort required to write a program of high difficulty.

E = D · V    (5)

Finally, the effort is the base to calculate the time to understand/implement (T) and the bugs delivered (B), as can be seen in Equations 6 and 7, respectively. The time metric is related to the Stroud number [12], which is the "number of elementary discriminations per second". Stroud claimed that this number ranges from 5 to 20, but Halstead's experiments indicated empirically that the best value in this case was 18.

T = E / 18    (6)

B = E^(2/3) / 3000    (7)

3. SOURCE CODE ANALYSIS FOR PERSONALITY RECOGNITION
Text documents contain information about their author. In the work described in [1], the authors were able to show that certain personality traits could be predicted based on a text, in that case an essay.

The present work starts from the hypothesis that source code, as a form of text, leaves traces of the author's personality traits. For the scope of this work, source code is a text document written by a single author. It is worth mentioning that a single problem solution could be implemented in several ways by a programmer, which gives a certain guarantee of uniqueness.

To develop this hypothesis, a method is proposed to extract metrics from source code in order to predict the personality traits. The general method is summarized in Figure 1. As a first step, the provided source examples are separated into individual files. Later, a set of metrics is extracted from the source codes using a source code analyzer. With the extracted metrics as input, machine learning methods are applied in order to predict the personality traits of the authors. Finally, the results are presented.

Figure 1: Process summary.
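To make the Halstead quantities of Section 2 concrete, the following sketch computes Equations 1-7 from operator/operand counts that an analyzer is assumed to have already extracted; the function name and the example counts are illustrative, not part of the PR-SOCO setup.

```python
import math

def halstead_metrics(n1, n2, N1, N2, stroud=18):
    """Halstead measures from unique (n1, n2) and total (N1, N2)
    operator/operand counts, following Equations 1-7."""
    n = n1 + n2                # vocabulary (Eq. 1)
    N = N1 + N2                # length (Eq. 2)
    V = N * math.log2(n)       # volume (Eq. 3)
    D = (n1 / 2) * (N2 / n2)   # difficulty (Eq. 4)
    E = D * V                  # effort (Eq. 5)
    T = E / stroud             # time to understand/implement (Eq. 6)
    B = E ** (2 / 3) / 3000    # delivered bugs (Eq. 7)
    return {"n": n, "N": N, "V": V, "D": D, "E": E, "T": T, "B": B}

# Illustrative counts for a small program
m = halstead_metrics(n1=10, n2=20, N1=50, N2=70)
```

Note that only the four raw counts are language-dependent; every derived quantity follows mechanically, which is what makes these metrics cheap to extract per file.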
Table 1: Metrics extracted from source code

Metric                                    Basic description
Amount of files                           The total number of files.
Average source lines of code              The average number of source lines of code.
Average class number per file             The average number of classes per source code file.
Average source code lines per class       The average number of source code lines per class.
Average attributes per class              The average number of attributes contained in a class.
Average methods per class                 The average number of methods contained in a class.
Average class name length                 The average length of a class name.
Average amount of for loops               The average number of for loops contained in a method.
Average amount of while loops             The average number of while loops contained in a method.
Average amount of if clauses              The average number of if clauses contained in a method.
Average amount of if-else clauses         The average number of if-else clauses contained in a method.
Average identifier length                 The average identifier length per file.
Average parameters                        The average number of parameters in methods.
Average cyclomatic complexity             The average cyclomatic complexity.
Average of static attributes              The average number of static attributes contained in a class.
Average of static methods                 The average number of static methods contained in a class.
Halstead bugs delivered                   The number of possible bugs estimated from the Halstead metrics.
Halstead difficulty                       An index which measures the difficulty to write the program.
Halstead effort                           An index which measures the effort necessary to write the code.
Halstead time to understand or implement  An index which indicates the time taken to write a source code.
Halstead volume                           Indicates how much information the reader needs to understand the code.

The provided corpus consisted of one source code file per person and another file which indicates the author and his/her personality traits (ground truth). Each source code file contained several source code pieces divided by a mark. The file was split into several individual files, keeping track of the author-file relationship.

An analyzer was written using ANTLR 4 [10] with the Java grammar. From each individual file, the source code metrics described in Table 1 were extracted. As can be seen, most of the metrics are based on counting and averaging. All the metrics were normalized, and the normalized data were the input of the machine learning algorithms.

As the extracted metrics come from similar categories, a hierarchical clustering using Ward's method [13] was applied. It was found that certain related metrics were too close to each other. Therefore, they were consolidated as follows:

• Length metrics: contain the metrics related to some length/size measure, calculated as the average among: amount of files, average source lines of code, average class number per file, average source code lines per class, average attributes per class, average methods per class, average class name length, and the average number of parameters.

• Complexity metrics: contain the metrics related to algorithm complexity, calculated as the average of: average amount of for loops, average amount of while loops, average amount of if clauses, average amount of if-else clauses, and the average identifier length.

• Halstead: contains all the extracted Halstead metrics, calculated as the average of: Halstead bugs delivered, Halstead difficulty, Halstead effort, Halstead time to understand or implement, and Halstead volume.

4. MACHINE LEARNING METHODS
In this section, the machine learning methods used are described. Each one corresponds to a submission sent to the shared task: submission 1 corresponds to support vector regression (SVR) over source code metrics, submission 2 corresponds to an extra trees regressor (ETR), and submission 3 corresponds to support vector regression over metric averages.

4.1 Support vector regression (SVR) on metrics
An SVR algorithm was used with the extracted metrics as input. For each personality trait an independent SVR was used, and a 6-fold cross validation was executed over the corpus. The best parameters according to this validation can be seen in Table 2. Figure 2 shows the resulting mean squared error (y axis) versus the variation of γ (x axis) with the best C and ε values, in logarithmic scale, in cross validation. This behavior was similar for all the personality traits.

Table 2: Best parameters for SVR with metrics according to cross validation
Personality trait        C    γ     ε
Emotional stability      32   8     2^-14
Extroversion             32   16    2^-12
Openness to experience   32   16    2^-19
Agreeableness            32   8     2^-10
Conscientiousness        32   16    2^-54

Figure 2: Variation of the γ parameter in SVR versus the resulting error in cross validation with the best C and ε parameters.

4.2 Extra trees regressor (ETR) on metrics
Another method applied was the extra trees regressor; for each personality trait a 6-fold cross validation was performed. For the number-of-estimators parameter, the best value for all traits was 77.

4.3 Support vector regression (SVR) on averages
Based on the clustering results, an SVR was used with the metric averages as input, i.e., the length metrics, complexity metrics, and Halstead metrics. The first step was to calculate the variance. As the variance of the complexity metrics was too low, they were removed and only the length and Halstead average metrics were used as input. The best parameters according to cross validation can be seen in Table 3. The plots of γ versus error for the best C and ε values have a behavior similar to the one shown in Figure 2.

Table 3: Best parameters for SVR with metric averages according to cross validation
Personality trait        C    γ      ε
Emotional stability      32   1/49   2^-11
Extroversion             32   2^-3   2^-10
Openness to experience   32   1/49   2^-10
Agreeableness            32   2      2^-11
Conscientiousness        32   0.5    2^-37

5. RESULTS
Using the mentioned algorithms with the previously described inputs and parameters, the prediction was done on the test dataset. The results can be seen in Tables 4, 5, and 6.

Table 4: Results over test data with SVR using metrics as input
Personality trait        RMSE    PC
Emotional stability      11.83   0.05
Extroversion             9.54    0.11
Openness to experience   8.14    0.28
Agreeableness            10.48   -0.08
Conscientiousness        8.39    -0.09

Table 5: Results over test data with Extra Trees Regressor using metrics as input
Personality trait        RMSE    PC
Emotional stability      10.31   0.02
Extroversion             9.06    0.0
Openness to experience   7.27    0.29
Agreeableness            9.61    -0.11
Conscientiousness        8.47    0.16

Table 6: Results over test data with SVR using metric averages as input
Personality trait        RMSE    PC
Emotional stability      10.24   0.03
Extroversion             9.01    0.01
Openness to experience   7.34    0.3
Agreeableness            9.36    0.01
Conscientiousness        9.99    -0.25

The three proposed methods obtained a similar performance in terms of Root Mean Squared Error (RMSE). The SVR with metrics has slightly better results; this could be caused by the removal of the complexity metrics. When evaluated with RMSE, the Openness trait gave the best result in all three applied methods, consistent with the results of other participants, and better than the baseline in submissions 2 and 3. Conscientiousness followed with the best error for the SVR and the Extra Trees Regressor.
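In outline, the per-trait modelling of Section 4 and the two evaluation measures used above (RMSE and Pearson correlation) can be sketched as follows. The data here are synthetic stand-ins for the normalized metric vectors and trait scores (the real corpus is not reproduced), and the parameter grid is a reduced illustration of the searches reported in Tables 2 and 3.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: 60 programmers x 14 normalized metrics, one trait score
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 14))
y = 45.0 + 5.0 * X[:, 0] + rng.normal(scale=2.0, size=60)

# Submissions 1 and 3 style: per-trait SVR, parameters tuned by 6-fold CV
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [32],
                "gamma": [2.0 ** p for p in range(-4, 5)],
                "epsilon": [2.0 ** -14, 2.0 ** -12, 2.0 ** -10]},
    cv=6, scoring="neg_mean_squared_error",
).fit(X, y)

# Submission 2 style: extra trees regressor with 77 estimators
etr = ExtraTreesRegressor(n_estimators=77, random_state=0).fit(X, y)

# Evaluation as in the shared task: RMSE and Pearson correlation (PC)
pred = search.predict(X)
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
pc = float(np.corrcoef(pred, y)[0, 1])
```

One SVR per trait keeps the five regression problems independent, which matches the per-trait parameter tables above; in the real setup the predictions would of course be made on held-out test authors rather than the training matrix.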
The worst predicted trait under RMSE was Emotional Stability/Neuroticism in all methods; based on the results of other participants¹, this was a general outcome [11]. A deeper study of this particular trait is required to improve the results.

When measured with the Pearson product-moment correlation (PC), the results are very different among runs, but submissions 2 and 3 showed much better results compared with the baseline because they indicate a stronger correlation than the one shown by the baseline. The SVR with averages has an important correlation in openness, with a value of 0.3, and in conscientiousness, with a value of -0.25. In the ETR run, openness had the highest value with 0.29. For SVR over metrics, openness also had the highest value with 0.28. This trait was the most consistent among all the used methods.

It is interesting that PC shows correlations with openness and conscientiousness. This is a good result because it indicates that the used metrics have a certain relationship with the mentioned personality traits. Compared with the baseline RMSE, the proposed method performed slightly better, but the difference is not significant, which shows that more work is required to obtain a good predictor of personality. Therefore, it is necessary to include more source code metrics in this study. This could lead to finding that certain metrics are related to specific personality traits.

6. CONCLUSIONS AND FUTURE WORK
The source code metrics extracted and used as input to the machine learning methods were enough to get a close prediction of several personality traits. Other approaches can be consulted in [11], which shows other results and approximations for the PR-SOCO task.

As the PC denotes a certain correlation, in this case particularly with openness, this could mean that the metrics considered in this work are likely related to the mentioned trait. However, as there are several other metrics with different purposes, like quality, readability, etc., the use of more of those metrics could improve the prediction. Other metrics not considered in this study may have better relationships with the personality traits. This work could be extended by exploring other metrics and their relationship with each personality trait.

7. REFERENCES
[1] S. Argamon, S. Dhawle, M. Koppel, and J. W. Pennebaker. Lexical predictors of personality type. Proceedings of the Joint Annual Meeting of the Interface and The Classification Society of North America, pages 1–16, 2005.
[2] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt. De-anonymizing programmers via code stylometry. USENIX Security, pages 255–270, 2015.
[3] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk. Feature location in source code: A taxonomy and survey. Journal of Software: Evolution and Process, 25(1):53–95, 2013.
[4] M. H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA, 1977.
[5] D. I. Holmes and F. J. Tweedie. Forensic stylometry: A review of the CUSUM controversy. Revue Informatique et Statistique dans les Sciences Humaines, pages 19–47, 1995.
[6] R. R. Joshi and R. V. Argiddi. Author identification: An approach based on style feature metrics of software source codes. 4(4):564–568, 2013.
[7] A. Kuhn, S. Ducasse, and T. Gîrba. Semantic clustering: Identifying topics in source code. Information and Software Technology, 49(3):230–243, 2007.
[8] R. Malhotra. Empirical Research in Software Engineering: Concepts, Analysis, and Applications. CRC Press, 2015.
[9] T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308–320, 1976.
[10] T. Parr. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf, 2nd edition, 2013.
[11] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[12] V. Y. Shen, S. D. Conte, and H. E. Dunsmore. Software science revisited: A critical analysis of the theory and its empirical support. IEEE Transactions on Software Engineering, (2):155–165, 1983.
[13] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

¹ http://www.autoritas.es/prsoco/evaluation/