Personality Recognition Applying Machine Learning Techniques on Source Code Metrics

Hugo A. Castellanos

hacastellanosm@unal.edu.co

Universidad Nacional de Colombia, Bogotá, Colombia

Source code has become a data source of interest in recent years. In the software industry, the extraction of source code metrics is common, mainly for quality assurance purposes. In this paper, source code metrics are used to build programmer profiles with the purpose of identifying different personality traits using machine learning algorithms. This work was done as part of the Personality Recognition in SOurce COde (PR-SOCO) shared task at the Forum for Information Retrieval Evaluation 2016 (FIRE 2016).

1. INTRODUCTION

Pieces of text have always been of interest in information retrieval, as text-based documents contain valuable information about the author. During recent decades, source code has become a source of valuable information as well. Many efforts in this field have been aimed at improving both processes and products in the software development industry [8].

The main efforts in source code analysis have focused on forensic applications such as author recognition [5] and plagiarism detection [2]. Several techniques have been used successfully in these tasks, such as n-grams, source code metrics, coding styles, and abstract syntax trees [6]. Other applications of source code analysis include feature location [3] and topic identification [7], among others.

The PR-SOCO shared task consisted of predicting the personality traits of a programmer given a set of his/her source code files. Source code, like any other human production, may be influenced by personality.

In this work, the use of source code metrics is proposed to find information about the program author, specifically the author's personality traits based on the Big-5 personality test. In addition, machine learning methods are used to predict the personality traits from the extracted source code metrics.

The rest of this paper is organized as follows. Section 2 presents a general background on source code metrics. Section 3 describes the proposed approach. Section 4 presents the machine learning strategies. Section 5 presents the obtained results. Finally, Section 6 concludes the paper.

2. BACKGROUND ON SOURCE CODE METRICS

According to Malhotra [8], software metrics are used to assess the quality of the product or of the process used to build it. Such metrics have the following characteristics:

Quantitative: metrics have a value.

Understandable: the way the metric is calculated must be easy to understand.

Validatable: metrics must capture the attributes they were designed to capture.

Economical: it must be economical to capture the metric.

Repeatable: if measured several times the results should be the same.

Language independent: the metrics should not depend on a specific language.

Applicability: the metric should be applicable in any phase of software development.

Comparable: the metric should correlate with another metric capturing the same concept.

Source code metrics must have a scale, which can be:

Interval: it is given by a defined range of values.

Ratio: it is a value which has an absolute minimum or zero point.

Absolute: it is a simple count of the elements of interest.

Nominal: it is a value which mainly defines a discrete scale of values, such as 1 (present) or 0 (not present).

Ordinal: it is a categorization which is intended to order or rank, for instance levels of severity: critical, high, medium, etc.

Software metrics can be grouped into the following categories:

Size: usually intended to estimate cost and effort. The most popular metric in this category is source lines of code (SLOC). In object oriented languages, size can also be measured by the number of classes, methods, and attributes.

Software quality: intended to measure the quality of the software. These metrics can be divided into the following categories:

  - Based on defects: they measure the level of defects. The main metrics of this category are the defect density, defined as the number of defects per SLOC, and the defect removal effectiveness, defined as the number of defects removed in a phase divided by the latent defects. If the latent defects are unknown, they can be estimated based on previous phases.

  - Usability: intended to measure user satisfaction with the software. Satisfaction can be given by ease of use and ease of learning.

  - Complexity metrics [9]: oriented to produce a measure of the difficulty of testing or maintaining a piece of source code. These metrics also give information about the number of instructions during execution.

  - Testing: intended to measure the progress of testing over a software product.

Object oriented metrics: intended to measure features of the object oriented paradigm. They can be divided into:

  - Coupling: a measure of the level of interdependence between classes. It is calculated by counting the number of classes called by another class.

  - Cohesion: measures how many elements of a class are functionally related to each other.

  - Inheritance: measures the depth of the class hierarchy.

  - Reuse: a measure of the number of times that a class is reused.

  - Size: intended to measure size not only in lines of code but also in the particularities of the object oriented paradigm, such as method count, attribute count, class count, etc.

Evolutionary metrics: try to measure the evolution of software based on different elements such as revisions, refactorings, and bug fixes. They measure how many lines of code are new, modified, or deleted.

Additionally, the empirical Halstead metrics [4] should also be considered. These metrics are calculated from the operators (keywords, ++, +) and operands (identifiers) of a program. The vocabulary (n), described in Equation 1, is the sum of the number of unique operators (n1) and unique operands (n2). The length (N), described in Equation 2, is the sum of the total number of operators (N1) and operands (N2).

$$ n = n_1 + n_2 \tag{1} $$

$$ N = N_1 + N_2 \tag{2} $$

$$ V = N \log_2 n \tag{3} $$

$$ D = \frac{n_1}{2} \cdot \frac{N_2}{n_2} \tag{4} $$

$$ E = D \cdot V \tag{5} $$

$$ T = \frac{E}{18} \tag{6} $$

$$ B = \frac{E^{2/3}}{3000} \tag{7} $$

The Halstead volume (V), described in Equation 3, is a measure of size, but it is also interpreted as the number of mental comparisons needed to write a program of length N. Moreover, the difficulty (D), shown in Equation 4, describes how difficult the program is to write. It is closely related to the volume: as the volume increases, so does the difficulty.

The effort (E), described in Equation 5, combines difficulty and volume to indicate the effort required to write the program.

Finally, the effort is the basis for calculating the time to understand/implement (T) and the bugs delivered (B), as can be seen in Equations 6 and 7, respectively. The time metric is related to the Stroud number [12], which is the "number of elementary discriminations per second". Stroud claimed that this number ranges from 5 to 20, but Halstead's experiments indicated empirically that the best value in this case was 18.
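As a worked illustration, the following Python sketch evaluates the Halstead metrics from the four basic counts. It assumes the standard Halstead definitions (V = N log2 n, D = (n1/2)(N2/n2), E = D V, T = E/18, B = E^(2/3)/3000). The class and function names are hypothetical, the counts in the usage lines are made up, and the sketch assumes at least one operand (n2 > 0).

```python
import math
from dataclasses import dataclass


@dataclass
class HalsteadCounts:
    """Raw counts for one program (illustrative container, not from the paper)."""
    n1: int  # number of unique operators
    n2: int  # number of unique operands
    N1: int  # total number of operators
    N2: int  # total number of operands


def halstead_metrics(c: HalsteadCounts, stroud: float = 18.0) -> dict:
    """Evaluate the Halstead metrics under the standard definitions."""
    vocabulary = c.n1 + c.n2                   # n = n1 + n2
    length = c.N1 + c.N2                       # N = N1 + N2
    volume = length * math.log2(vocabulary)    # V = N log2 n
    difficulty = (c.n1 / 2) * (c.N2 / c.n2)    # D = (n1 / 2) (N2 / n2)
    effort = difficulty * volume               # E = D V
    time_seconds = effort / stroud             # T = E / 18
    bugs = effort ** (2 / 3) / 3000            # B = E^(2/3) / 3000
    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "time": time_seconds,
        "bugs": bugs,
    }


# Made-up counts for a tiny program.
metrics = halstead_metrics(HalsteadCounts(n1=4, n2=4, N1=8, N2=8))
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Keeping the Stroud number as a parameter makes it easy to see how sensitive the time estimate is to the choice of 18.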

3. SOURCE CODE ANALYSIS FOR PERSONALITY RECOGNITION

Text documents contain information about their author. In the work described in [1], the authors showed that certain personality traits can be predicted from a text, in that case an essay.

The present work starts from the hypothesis that source code, as a form of text, leaves traces of the author's personality traits. Within the scope of this work, source code is a text document written by a single author. It is worth mentioning that a programmer can implement a solution to a single problem in several ways, which gives a certain guarantee of uniqueness.

To develop this hypothesis, a method is proposed to extract metrics from source code in order to predict the personality traits. Figure 1 summarizes the general method. As a first step, the provided source examples are separated into individual files. Then, a set of metrics is extracted from the source code using a source code analyzer. With the extracted metrics as input, machine learning methods are applied to predict the personality traits of the authors. Finally, the results are presented.

The provided corpus consisted of one source code file per person and another file indicating each author and his/her personality traits (ground truth). Each source code file contained several pieces of source code separated by a marker. The file was split into several individual files, keeping track of the author-file relationship.
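The splitting step can be sketched in Python as follows. This is only an illustration: the separator string, the file naming scheme, and the function name are hypothetical, since the actual marker used in the corpus is not given here.

```python
from typing import Dict

# Hypothetical separator; the real mark used in the corpus is not stated.
MARKER = "/* ---- */"


def split_sources(author_id: str, text: str, marker: str = MARKER) -> Dict[str, str]:
    """Split one author's file into pieces, keyed by a name that keeps the author id."""
    pieces = [piece.strip() for piece in text.split(marker)]
    pieces = [piece for piece in pieces if piece]  # drop empty fragments
    return {
        f"{author_id}_{index:03d}.java": piece
        for index, piece in enumerate(pieces, start=1)
    }


# Made-up content standing in for one author's file.
raw = "class A {}\n/* ---- */\nclass B {}\n/* ---- */\n"
for filename, source in split_sources("author01", raw).items():
    print(filename, "->", source)
```

Encoding the author identifier in each generated file name is one simple way to keep the author-file relationship without a separate index.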

An analyzer was written using ANTLR 4 [10] with the Java grammar. From each individual file, the source code metrics described in Table 1 were extracted.

As can be seen, most of the metrics are based on counts and averages. All the metrics were normalized, and the normalized data were the input to the machine learning algorithms.

As the extracted metrics belong to similar categories, hierarchical clustering using Ward's method [13] was applied. It was found that certain related metrics were very close to each other. Therefore, they were consolidated as follows:

Length metrics: contains the metrics related to some length/size measure. It is calculated as the average of: number of files, average source lines of code, average number of classes per file, average source code lines per class, average attributes per class, average methods per class, average class name length, and average number of parameters.

Complexity metrics: contains the metrics related to algorithm complexity. It is calculated as the average of: average number of for loops, average number of while loops, average number of if clauses, average number of if-else clauses, and average identifier length.

Halstead metrics: contains all the extracted Halstead metrics. It is calculated as the average of: Halstead bugs delivered, Halstead difficulty, Halstead effort, Halstead time to understand or implement, and Halstead volume.
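The clustering and consolidation steps can be illustrated with a short Python sketch based on SciPy. It is a sketch under assumptions rather than the procedure actually used: the data are synthetic, the metrics are assumed to be already normalized, the number of groups is passed in by hand, and the function names are hypothetical. Each metric (column) is treated as one observation so that similar metrics end up in the same cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_metrics(X: np.ndarray, n_groups: int) -> np.ndarray:
    """Cluster the metric columns with Ward linkage; returns one label per column."""
    tree = linkage(X.T, method="ward")  # rows of X.T are metrics
    return fcluster(tree, t=n_groups, criterion="maxclust")


def consolidate(X: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Replace each cluster of columns by its row-wise average."""
    groups = sorted(set(labels.tolist()))
    return np.column_stack([X[:, labels == g].mean(axis=1) for g in groups])


# Synthetic normalized data: 6 programmers (rows) x 6 metrics (columns),
# built so that the columns form three pairs of near-duplicates.
columns = [
    [0.0, 0.1, 0.0, 0.1, 0.0, 0.1],
    [0.0, 0.1, 0.1, 0.1, 0.0, 0.1],
    [1.0, 0.9, 1.0, 0.9, 1.0, 0.9],
    [1.0, 0.9, 0.9, 0.9, 1.0, 0.9],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.1, 1.0, 0.0, 1.0],
]
X = np.array(columns).T

labels = cluster_metrics(X, n_groups=3)
averages = consolidate(X, labels)
print(labels)
print(averages.shape)
```

In practice the dendrogram would be inspected before choosing the groups, as the consolidation described above names them by meaning (length, complexity, Halstead) rather than by cluster label.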

4. MACHINE LEARNING METHODS

In this section, the machine learning methods used are described. Each one corresponds to a submission sent to the shared task: submission 1 corresponds to support vector regression (SVR) over source code metrics, submission 2 to the extra trees regressor (ETR), and submission 3 to support vector regression over averages.

4.1 Support vector regression (SVR) on metrics

An SVR algorithm was used with the extracted metrics as input. For each personality trait, an independent SVR was trained, and 6-fold cross-validation was run over the corpus. The best parameters according to this validation are shown in Table 2. Figure 2 shows the resulting cross-validation mean squared error (y axis) versus gamma (x axis), in logarithmic scale, for the best C and epsilon values. This behavior was similar for all the personality traits.
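A setup of this kind can be sketched with scikit-learn as shown below. Several points are assumptions made for the example: the RBF kernel, the search grid, the shuffled folds, and the synthetic data standing in for the normalized metrics and trait scores. The function name `tune_svr` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

TRAITS = ["neuroticism", "extraversion", "openness", "agreeableness", "conscientiousness"]

# Hypothetical search grid; the values actually explored are not stated in the text.
PARAM_GRID = {
    "C": [0.1, 1.0, 10.0],
    "epsilon": [0.01, 0.1, 1.0],
    "gamma": np.logspace(-3, 1, 5).tolist(),
}


def tune_svr(X, y, seed=0):
    """Grid-search one SVR with 6-fold cross-validation; return (best params, best MSE)."""
    folds = KFold(n_splits=6, shuffle=True, random_state=seed)
    search = GridSearchCV(
        SVR(kernel="rbf"),  # kernel choice is an assumption
        PARAM_GRID,
        scoring="neg_mean_squared_error",
        cv=folds,
    )
    search.fit(X, y)
    return search.best_params_, -search.best_score_


# Synthetic stand-in for the normalized metrics (48 authors, 6 metrics) and trait scores.
rng = np.random.default_rng(0)
X = rng.random((48, 6))
scores = {t: 50 + 10 * X[:, i] + rng.normal(0, 2, 48) for i, t in enumerate(TRAITS)}

# One independent model per trait.
results = {t: tune_svr(X, scores[t]) for t in TRAITS}
for trait, (params, mse) in results.items():
    print(f"{trait}: {params} (CV MSE = {mse:.2f})")
```

Plotting the mean test score stored in `search.cv_results_` against gamma, with C and epsilon fixed at their best values, would give a curve of the kind described for Figure 2.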

4.2 Extra trees regressor (ETR) on metrics

Another method applied was the extra trees regressor. For each personality trait, 6-fold cross-validation was performed. For all traits, the best value for the number of estimators was 77.
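The selection of the number of estimators can be sketched with scikit-learn as follows. The candidate values other than 77, the shuffled folds, the random seed, and the synthetic data are assumptions for illustration, and `etr_cv_mse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score


def etr_cv_mse(X, y, n_estimators, seed=0):
    """Mean squared error of an extra trees regressor under 6-fold cross-validation."""
    model = ExtraTreesRegressor(n_estimators=n_estimators, random_state=seed)
    folds = KFold(n_splits=6, shuffle=True, random_state=seed)
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=folds)
    return float(-neg_mse.mean())


# Synthetic stand-in data for a single trait (48 authors, 6 metrics).
rng = np.random.default_rng(0)
X = rng.random((48, 6))
y = 50 + 10 * X[:, 0] + rng.normal(0, 2, 48)

# Hypothetical candidates; 77 is the value reported in the text.
candidates = [10, 77, 150]
cv_mse = {n: etr_cv_mse(X, y, n) for n in candidates}
best_n = min(cv_mse, key=cv_mse.get)
print(cv_mse, "best:", best_n)
```

On synthetic data the selected value carries no meaning; the sketch only shows how one value per trait could be chosen by cross-validated error.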

4.3 Support vector regression (SVR) on averages

Based on the clustering results, an SVR was used with the metric averages as input, i.e., length metrics, complexity metrics, and Halstead metrics. The first step was to calculate the variance of each average. As the variance of the complexity metrics was too low, they were removed, and only the length and Halstead averages were used as input.
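The variance check can be illustrated with a few lines of Python. The threshold and the feature values below are made up for the example, since the cut-off actually applied is not stated; `drop_low_variance` is a hypothetical helper name.

```python
from statistics import pvariance
from typing import Dict, List


def drop_low_variance(features: Dict[str, List[float]], threshold: float) -> Dict[str, List[float]]:
    """Keep only the features whose population variance reaches the threshold."""
    return {
        name: values
        for name, values in features.items()
        if pvariance(values) >= threshold
    }


# Made-up averages for four authors.
averages = {
    "length": [0.2, 0.5, 0.8, 0.4],
    "complexity": [0.50, 0.51, 0.50, 0.49],  # nearly constant across authors
    "halstead": [0.1, 0.9, 0.3, 0.6],
}

kept = drop_low_variance(averages, threshold=0.01)  # hypothetical threshold
print(list(kept))
```

A nearly constant feature carries little information for a regressor, which is the motivation for removing it before training.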

The best parameters according to cross-validation are shown in Table 3. The plots of gamma versus error for the best C and epsilon values behave similarly to the one shown in Figure 2.

5. RESULTS

[...] with other participant results, and showing better results than the baseline in submissions 2 and 3. Conscientiousness followed with the best error for the SVR and extra trees regressor.

The worst predicted trait in terms of RMSE was Emotional Stability/Neuroticism in all methods. Based on the results of other participants, this was a general result [11]. A deeper study of this particular trait is required to improve the results.

When measured with the Pearson product-moment correlation (PC), the results differ considerably among runs. Submissions 2 and 3 showed much better results than the baseline, because they indicate a stronger correlation than the one shown by the baseline. The SVR with averages has a notable correlation for openness, with a value of 0.3, and for conscientiousness, with a value of -0.25. In the ETR run, openness had the highest value, 0.29. For SVR over metrics, openness also had the highest value, 0.28. This trait was the most consistent among all the methods used.
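For reference, the two evaluation measures discussed here can be computed as in the following Python sketch. The function names and the sample scores are made up for illustration. Note that the Pearson correlation is undefined when either sequence is constant, which this sketch does not handle.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between true and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation; assumes neither input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Made-up trait scores for four authors.
truth = [45.0, 52.0, 60.0, 38.0]
predicted = [48.0, 50.0, 55.0, 44.0]
print(f"RMSE = {rmse(truth, predicted):.2f}, PC = {pearson(truth, predicted):.2f}")
```

The two measures answer different questions: RMSE penalizes the size of the errors, while PC only reflects whether predictions move in the same direction as the true scores, which is why a run can look similar to the baseline in RMSE and still differ in PC.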

It is interesting that PC shows correlations with openness and conscientiousness. This is a good result because it indicates that the metrics used have a certain relationship with these personality traits. Compared with the baseline RMSE, the proposed method performed slightly better, but the difference is not significant, which shows that more work is required to obtain a good predictor of personality. Therefore, it is necessary to include more source code metrics in this study. This could lead to finding that certain metrics are related to specific personality traits.

6. CONCLUSIONS AND FUTURE WORK

The source code metrics extracted and used as input to the machine learning methods were enough to get a close prediction of several personality traits. Other approaches can be consulted in [?], which shows other results and approaches for the PR-SOCO task.

As the PC denotes a certain correlation, in this case particularly with openness, this could mean that the metrics considered in this work are likely related to that trait. However, as there are several other metrics with different purposes, such as quality and readability, the use of more of those metrics could improve the prediction. Other metrics not considered in this study may have stronger relationships with the personality traits. This work could be extended by exploring other metrics and their relationship with each personality trait.

REFERENCES

[1] Argamon, Dhawle, Koppel, and J. W. Pennebaker. Lexical predictors of personality type. Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of North America, pages 1-16, 2005.

[2] Caliskan-Islam, Harang, Liu, Narayanan, Voss, Yamaguchi, and Greenstadt. De-anonymizing Programmers via Code Stylometry. USENIX sec, pages 255-270, 2015.

[3] Dit, Revelle, Gethers, and Poshyvanyk. Feature location in source code: A taxonomy and survey. Journal of Software: Evolution and Process, 25(1):53-95, 2013.

[4] M. H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA, 1977.

[5] D. I. Holmes and F. J. Tweedie. Forensic Stylometry: A Review of the CUSUM Controversy. Revue Informatique et Statistique dans les Sciences Humaines, pages 19-47, 1995.

[6] R. R. Joshi and R. V. Argiddi. Author Identification: An Approach Based on Style Feature Metrics of Software Source Codes. 4(4):564-568, 2013.

[7] Kuhn, Ducasse, and T. Gîrba. Semantic clustering: Identifying topics in source code. Information and Software Technology, 49(3):230-243, 2007.

[8] Malhotra. Empirical Research in Software Engineering: Concepts, Analysis, and Applications. CRC Press, 2015.

[9] T. J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308-320, 1976.

[10] Parr. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf, 2nd edition, 2013.

[11] Rangel, Gonzalez, Restrepo, Montes, and Rosso. PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.

[12] V. Y. Shen, S. D. Conte, and H. E. Dunsmore. Software science revisited: A critical analysis of the theory and its empirical support. IEEE Transactions on Software Engineering, (2):155-165, 1983.

[13] J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236-244, 1963.