=Paper=
{{Paper
|id=Vol-1737/T1-10
|storemode=property
|title=UAEMex System for Identifying Personality Traits from Source Code
|pdfUrl=https://ceur-ws.org/Vol-1737/T1-10.pdf
|volume=Vol-1737
|authors=Eder Vázquez Vázquez,Omar González Brito,Jovani A. García,Miguel García Calderón,Gabriela Villada Ramírez,Alan J. Serrano León,René A. García-Hernández,Yulia Ledeneva
|dblpUrl=https://dblp.org/rec/conf/fire/VazquezBGCRLGL16
}}
==UAEMex System for Identifying Personality Traits from Source Code==
<pdf width="1500px">https://ceur-ws.org/Vol-1737/T1-10.pdf</pdf>
<pre>
UAEMex System for Identifying Personality Traits from Source Code

Eder Vázquez Vázquez (1), Omar González Brito (2), Jovani A. García (3),
Miguel García Calderón (4), Gabriela Villada Ramírez (5), Alan J. Serrano León (6),
René A. García-Hernández (7), Yulia Ledeneva (8)

Universidad Autónoma del Estado de México, UAPT Tianguistenco.
Instituto Literario, 100, Toluca, Edo. Méx. 50000, México.

(1) eder2v@hotmail.com, (2) gonzalezbritoomar@gmail.com, (3) jovani_2807@hotmail.com,
(4) tonsquemike@outlook.com, (5) inggaby.vr@gmail.com, (6) alan.serrano.leon@outlook.com,
(7) renearnulfo@hotmail.com, (8) yledeneva@yahoo.com

ABSTRACT
This paper describes the UAEMex participation in the Personality Recognition in Source Code
(PR-SOCO 2016) task, where the principal challenge is to identify the five personality traits
of a developer from his or her source code. In the first phase of the task, a training dataset
with 50 programs and the degree values of personality incidence for each trait were provided.
In the second phase, a test dataset with 21 programs had to be classified. Our method extracts
only 41 features from the source code, including the comments, in order to classify it (we
test 4 models). Using the evaluation metrics proposed by PR-SOCO, our system is ranked among
the best systems for both evaluation metrics. Finally, using the RMSE and the PC metrics we
propose a ranking measure.

Keywords
PR-SOCO; Support Vector Machine; Symbolic Regression; KNN; Neural Networks; Personality Trait;
Genetic Algorithms.

1. INTRODUCTION
Personality is an inherent aspect of human nature that influences a person's activities. That
is, personality is a set of characteristics that describes a person and distinguishes him or
her from others [1]. Nowadays, identifying the degree of personality traits to determine
whether a candidate fits a job is as important as skills and experience [2]. After decades of
research, the Big-Five Theory is the most accepted model for assessing personality [2]. This
model has a hierarchical organization of personality traits with five classes: Extroversion
(E), Agreeableness (A), Conscientiousness (C), Neuroticism (N), and Openness to experience
(O) [3].

Given a small set of Java source codes in the PR-SOCO task, the main objective is to identify
the degree of presence of the five classes of personality [4]. In order to get an
approximation of what aspects determine the personality, the NEO-PI R test may be answered
(this test is based on the Big-Five theory) to measure the personality traits [3]. There are
many structured surveys based on NEO-PI R on several Web pages, available on-line for
everybody to predict the personality of the user. Using these aspects, we propose to extract
41 features as the main information for training four classifiers.

In this paper, we present the working notes of the UAEMex participation in the PR-SOCO 2016
task.

This paper is organized as follows. In section 2, the methodology is described. In section 3,
the results for the test dataset experiments are presented. In section 4, using the evaluation
metrics proposed by PR-SOCO, we rank our results against the other systems by personality
trait. In section 5, the conclusions are presented.

2. METHODOLOGY
The proposed methodology is divided into four steps: Corpus Analysis, Feature Extraction,
Feature Representation and Classification.

2.1 Corpus Analysis
The training dataset is composed of 1741 Java source codes from 50 developers who were
evaluated with the Big-Five Theory personality traits, where each trait ranges between 20 and
80. However, since the number of different values per personality trait in the samples is
small, we decided to treat each program separately in order to get a good representation. The
number of different values varies from one personality trait to another; the distribution is
shown in table 1.

Table 1. Source code distribution for every personality trait.

Personality Trait      Number of different values
Neuroticism            13
Extroversion           14
Openness               11
Agreeableness          14
Conscientiousness      12

2.2 Feature Extraction
Using a few source codes written by our team members, we identified some personal features in
order to find similar elements. As a result, we found that the indentation, identifier and
comment features are important for determining the author of such codes. These features can be
extracted independently of the content or objective of the source code. The first 25 features
were calculated as averages and the last 16 as frequencies. The extracted features can be
classified as:

Indentation Features: space in code, space in the comments, space between classes, space
between source code blocks, space between methods, space between control sentences and space in
clustering characters "(), [], {}". These features are measured as averages.

Identifier Features: The presence of underscores, uppercase letters and lowercase letters in
the name of an identifier is measured in a binary way. Also, we extract the average number of
characters and the average length of the name of an identifier as features. These features are
extracted for class, method and variable names. Also, the percentage of initialized variables
is extracted.

Comment Features: The presence of line comments and of block comments is extracted as binary
features. Also, the presence of comments with all letters in uppercase is extracted as a
binary feature. Finally, the average size of the comments is extracted as a feature.

2.3 Features Representation
For every source code, 41 features are extracted and represented in a vector space model,
where the source code S_i is represented by the 41 features f_j [5].

2.4 Classification
Once the source codes are represented in a vector space model, we train the system with the
following classifiers. The reason for testing different classifiers is that, if the extracted
features are good, we should get good results in general with these classifiers. It is worth
mentioning that these classifiers have been widely used in other language processing tasks; in
particular, we rely on the Symbolic Regression model, since the training dataset has only a
few values per trait.

2.4.1 Symbolic Regression (SR)
Finding the structure, coefficients and appropriate elements of a model while trying to solve
a problem is a challenge for which no efficient mathematical method exists; therefore,
traditional mathematical techniques are not the best choice for empirical modeling problems,
owing to their nonlinearity. Hence there is a need for an artificial expert that can create or
define a model from the available data of a specific task without requiring an understanding
of the problem [6]. Symbolic Regression is a type of artificial expert that evolves models
from available data observations [7] [8], and its main objective is to find a model that
describes the relationship between the dependent variable and the independent variables as
accurately as possible [9].

Because Symbolic Regression works directly with Genetic Programming, it is possible to evolve
equations or mathematical functions in order to estimate the behavior of a dataset. The
symbolic regression technique stands out as a viable solution to the problem of this work
because it does not assume an answer to the problem but discovers it [10].

2.4.2 Support Vector Machine (SVM)
An SVM maps a set of examples to a set of points in the same space, trying to find the optimal
hyperplane. The optimal hyperplane is defined as the hyperplane with maximal separation
between two classes [11]. An SVM makes predictions based on which side of the gap the new
examples fall on [12]. In this work, we used the LIBSVM implementation of SVM [13].

2.4.3 K Nearest Neighbor (KNN)
KNN is one of the simplest machine learning algorithms, known as a lazy classifier, where the
classification function is only approximated locally. KNN is trained using vectors in the
feature space; each vector must have a class label. The training phase consists of storing the
feature vectors and class labels of the training dataset. In the classification phase it is
necessary to define a constant k and send an unlabeled vector to the KNN algorithm, which
calculates the minimal distance between the stored classes and the input vector [14]. We use
the Weka implementation of the KNN algorithm [15].

2.4.4 Back Propagation Neural Network (BP-NN)
Neural networks are elemental processors that receive a vector as input data. The feature
vector is sent to the input layer, and then every neuron processes a k-input with a k-weight
and returns a k-output. Neural networks are used to approximate functions according to the
input data [16].

When a neural network implements error back-propagation, the output of the network is compared
with the desired output to calculate the network error and then correct the weights of every
neuron in the hidden layer [17].

3. RUN RESULTS
In this section, the results submitted for the PR-SOCO test dataset are described.

Run 1: This run was generated using symbolic regression (SR) over the vector space model, but
we eliminated the source codes of five developers according to the following criteria: the
person with a high presence in all the personality traits, the person with a low presence in
all the personality traits, the person with an average presence in all the personality traits,
the person with the most source codes and the person with the fewest source codes.

Run 2: Similar to run 1, this run was generated using SR, but for each personality trait the
developers (between 12 and 20) with an average presence of that trait were eliminated.

Run 3: For this run, the whole training dataset was used with a Back Propagation Neural
Network.

Run 4: The whole training dataset was used with KNN with constant k = 3.

Run 5: We used a genetic algorithm, but it is not described because we found a mistake.

Run 6: The whole training dataset was used for classification with an SVM.

The Root Mean Square Error (RMSE) and Pearson Correlation (PC) metrics were used by the
PR-SOCO task to evaluate and rank the results. A minimum RMSE is desired for a system. In
contrast, for the PC metric a value closer to 1 or -1 is desired. In table 2, the RMSE scores
of our runs are presented, with the best scores highlighted in bold. As can be seen, the first
and sixth runs get the best scores, where the SR and SVM classifiers were used, respectively.

Table 2. RMSE results of submitted runs for test dataset.

Run   N        E        O        A        C
1     11.54    11.08    6.95     8.98     8.53
2     11.10    12.23    9.72     9.94     9.86
3     9.84     12.69    7.34     9.56     11.36
4     10.67    9.49     8.14     8.97     8.82
6     10.86    9.85     7.57     9.42     8.53

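As an illustration of the two evaluation metrics named in section 3, the following is a
minimal Python sketch of RMSE and PC. It is not the official PR-SOCO evaluation script. The
function names and the sample scores are hypothetical, and returning 0.0 when one of the
inputs is constant (where the correlation is undefined) is an assumption of this sketch.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(observed, predicted):
    """Pearson Correlation between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        # Assumption of this sketch: report 0.0 when the correlation is undefined.
        return 0.0
    return cov / math.sqrt(var_o * var_p)


# Hypothetical trait scores on the 20-80 scale (not taken from the PR-SOCO data).
observed = [45.0, 52.0, 60.0, 38.0]
predicted = [48.0, 50.0, 55.0, 41.0]
print(round(rmse(observed, predicted), 2), round(pearson(observed, predicted), 2))
```

A lower RMSE means the predicted trait values are closer to the observed ones, while PC
measures whether the predictions rise and fall together with the observed values.
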
In table 3, the results with the Pearson Correlation metric are shown, with the best score
highlighted in bold.

Table 3. PC results of submitted runs for test dataset.

Run   N        E        O        A        C
1     -0.29    -0.14    0.45     0.22     0.11
2     -0.14    -0.15    0.04     0.19     -0.30
3     0.35     -0.10    0.28     0.33     -0.01
4     0.04     -0.04    0.10     0.29     -0.07
6     0.13     0        0        0        0

4. RANKING RESULTS
In PR-SOCO 2016, eleven teams participated in this task, together with two baselines: the
baseline bow (bl bow), based on character trigrams, and the baseline mean (bl mean), based on
a method that predicts the mean value of the observed values. In table 4, the best RMSE
results of those teams for every personality trait are shown according to the rank. In
general, our results (uaemex) were ranked in good positions, outperforming the baselines
except for Extroversion: for Neuroticism and Agreeableness we were ranked in second position,
for Openness we got the first rank, and for Conscientiousness we got the fourth position,
between the two baselines.

Table 4. Best runs with RMSE metric.

Rank  N                E                       O                A                C
1     9.78             8.6                     6.95 (uaemex)    8.79             8.38
2     9.84 (uaemex)    "                       7.16             8.97 (uaemex)    8.39
3     9.97             8.69                    7.19             9 (bl bow)       8.47 (bl bow)
4     10.04            8.8                     7.27             9.04 (bl mean)   8.53 (uaemex)
5     10.24            8.96                    7.42             9.16             8.54 (bl mean)
6     10.26 (bl mean)  9.01                    7.57 (bl mean)   9.32             8.59
7     10.27            9.06 (bl bow, bl mean)  "                9.36             8.61
8     10.28            "                       7.74 (bl bow)    9.39             8.69
9     10.29 (bl bow)   9.22                    8.19             9.55             8.77
10    10.37            9.49 (uaemex)           8.21             10.31            8.85
11    10.53            11.18                   8.43             11.5             9.99
12    17.55            16.67                   15.97            21.1             15.53
13    24.16            27.39                   22.57            28.63            22.36

A " marks a cell that the original table merges with the cell above it.

In table 5, the best PC results of those teams for every personality trait are shown according
to the positive correlation results. In general, our results (uaemex) were ranked in good
positions, outperforming the baseline configurations. For Neuroticism, Openness, Agreeableness
and Conscientiousness we were ranked in second position; the exception is the Extroversion
trait. In general, it is possible to observe that the rank of our results for the RMSE metric
corresponds with the rank of our results for the PC metric.

Table 5. Best runs with PC metric.

Rank  N                E                O                A                C
1     0.36             0.47             0.62             0.38             0.33
2     0.35 (uaemex)    0.38             0.45 (uaemex)    0.33 (uaemex)    0.32 (uaemex)
3     0.31             0.35             0.37             0.29             0.31
4     0.29             0.31             0.33             0.21             "
5     0.27             0.31             0.3              0.21             0.21
6     0.23             0.16             0.29             0.19             0.19
7     0.14             0.12 (bl bow)    0.27             0.06             0.16
8     0.1              0.11             0.12             0                0.13
9     0.1              0.1              0.05             -0.05            0.07
10    0.09             0.08             0 (bl mean)      -0.07            -0.12 (bl mean)
11    0.06 (bl bow)    0.11             -0.15            0.08 (bl mean)   "
12    0.05             0 (uaemex)       -0.17 (bl bow)   -0.19 (bl bow)   -0.2 (bl bow)
13    0 (bl mean)      0 (bl mean)      -0.31            -0.28            -0.23

A " marks a cell that the original table merges with the cell above it.

In PR-SOCO 2016, two evaluation metrics were used, giving two ways of ranking the results: the
RMSE, for measuring the average error between the observed and predicted values, and the PC,
for measuring the correlation between variables. In this paper, we propose ranking the results
using both the RMSE and PC measures as:

    Ranking = (1 - PC) * RMSE

This measure is only applied to positive correlation results in the PC metric. Since RMSE is
not normalized, we propose to multiply both results. This ranking is a metric where the best
values are those closer to zero. Table 6 shows the best results evaluated with our proposed
measure.

Table 6. Results with our proposal evaluation metric.

Ranking  N                E                O                A                C
1        6.39 (uaemex)    5.32             3.82 (uaemex)    5.88             6.24
2        6.54             5.59             4.60             6.36 (uaemex)    6.78
3        6.74             6.03             4.79             6.71             7.03 (bl bow)
4        7.67             6.07             5.13             6.98             7.47
5        8.84             7.52             5.26             7.2 (bl bow)     7.55
6        8.91             7.97 (bl bow)    7.28             8.24             7.59 (uaemex)
7        9.3              8.49             7.57 (bl mean)   8.49             8.54 (bl mean)
8        9.67 (bl bow)    9.06 (bl mean)   8.23             9.04 (bl mean)   11.33
9        9.74             9.32             8.43             9.26             -
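
The combined measure Ranking = (1 - PC) * RMSE can be checked against the scores in tables 2
and 3. The following Python sketch is only an illustration: the function names are
hypothetical, accepting a PC of exactly zero is an assumption, and cutting the result to two
decimals without rounding is also an assumption about how the values in table 6 were produced.

```python
import math


def combined_ranking(rmse, pc):
    """Combined measure from the text: (1 - PC) * RMSE (lower is better)."""
    if pc < 0:
        raise ValueError("the measure is only applied to non-negative PC values")
    return (1.0 - pc) * rmse


def truncate2(value):
    """Cut a value to two decimals without rounding (an assumption of this sketch)."""
    return math.floor(value * 100) / 100


# (RMSE, PC) pairs read from tables 2 and 3 of this paper.
examples = {
    "N, run 3": (9.84, 0.35),
    "O, run 1": (6.95, 0.45),
    "C, run 1": (8.53, 0.11),
}
for label, (rmse, pc) in examples.items():
    print(label, truncate2(combined_ranking(rmse, pc)))
```

Under these assumptions the three pairs give 6.39, 3.82 and 7.59, which are the uaemex
entries listed in table 6 for Neuroticism, Openness and Conscientiousness.
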

Table 6 (continued).

Ranking  N                E                O                A                C
10       9.93             9.85 (uaemex)    8.97             22.61            -
11       10.26 (bl mean)  23.20            16.47            -                -
12       12.46            24.65            -                -                -
13       21.74            -                -                -                -

As we can see in table 6, our results achieve a better balance between RMSE and PC. In table
6, the uaemex team is ranked in first position for the Neuroticism and Openness traits, in
second place for Agreeableness and in sixth place for Conscientiousness. However, in this new
ranking our Extroversion result does not outperform the two baselines.

5. CONCLUSIONS
This paper presents our results in personality trait prediction. We describe the participation
of UAEMex in PR-SOCO 2016.

The submitted runs outperform the baselines despite the fact that the corpus contains noise,
such as repeated source code and obfuscated source code, and has few samples.

The training set has different classes of personality. The classes are unbalanced and there
are not enough examples for the class values. In this approach, we did not do any
preprocessing because all the information in the corpus was considered relevant to the task.
Personality trait prediction from source code is a new task and there are no reference
approaches for it. It was difficult to identify which features should be extracted.

The best results among our runs were obtained with the symbolic regression model, because its
training phase tries to approximate the output of the input vector.

Also, we propose a new ranking measure that combines the RMSE and PC measures in order to get
an approximation for evaluating results. According to our experiments on the training dataset,
we note that it is better than evaluating with RMSE or PC alone. RMSE is a minimization metric
and PC is a maximization metric.

6. ACKNOWLEDGMENTS
Thanks to the Autonomous University of the State of Mexico (UAEMex), the Consejo Nacional de
Ciencia y Tecnología (CONACyT) and the Consejo Mexiquense de Ciencia y Tecnología (COMECyT)
for the support granted for this work.

7. REFERENCES
[1] Montaño, M., Palacios, J., Gantiva, C. 2009. Teorías de la personalidad. Un análisis
    histórico del concepto y su medición. Psychologia Avances de la disciplina, 81-107.
[2] Paul, C., R., M.R. 2008. NEO PI-R Revised Neo Personality Inventory. TEA Ediciones S.A.
[3] Hussain, S., Abbas, M., Shahzad, K., Syeda, A. 2012. Personality and career choices.
    African Journal of Business Management (AJBM) 6, 2255-2260.
[4] Rangel, F., González, F., Restrepo, F., Montes, M. and Rosso, P. 2016. Pan at fire:
    Overview of the pr-soco track on personality recognition in source code. In Working notes
    of FIRE 2016 – Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10,
    2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[5] Salton, G., Wong, A., Yang, C.S. 1975. A vector space model for automatic indexing.
    Commun. ACM 18, 613-620.
[6] Dabhi, V.K., Vij, S.K. 2011. Empirical modeling using symbolic regression via postfix
    Genetic Programming. Image Information Processing (ICIIP), 2011 International Conference
    on, 1-6.
[7] Koza, J.R. 1992. Genetic programming: on the programming of computers by means of natural
    selection. MIT Press.
[8] Murari, A., Peluso, E., Gelfusa, M., Lupelli, I., Lungaroni, M., Gaudio, P. 2015. Symbolic
    regression via genetic programming for data driven derivation of confinement scaling laws
    without any assumption on their mathematical form. Plasma Physics and Controlled Fusion 57.
[9] Kommenda, M., Affenzeller, M., Burlacu, B., Kronberger, G., Winkler, S. M. 2014. Genetic
    programming with data migration for symbolic regression. In: Proceedings of the 2014
    conference companion on Genetic and evolutionary computation companion, 1361-1366.
[10] Can, B., Heavey, C. 2011. Comparison of experimental designs for simulation-based
     symbolic regression of manufacturing systems. Computers and Industrial Engineering 61,
     447-462.
[11] Hearst, M.A. 1998. Support Vector Machines. IEEE Intelligent Systems 13, 18-28.
[12] Cortes, C., Vapnik, V. 1995. Support-Vector Networks. Machine Learning 20, 273-297.
[13] Chang, C.-C., Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Trans.
     Intell. Syst. Technol. 2, 1-27.
[14] Stone, C.J. 1977. Consistent Nonparametric Regression. 595-620.
[15] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H. 2009. The
     WEKA data mining software: an update. SIGKDD Explor. Newsl. 11, 10-18.
[16] McCulloch, W.S., Pitts, W. 1988. A logical calculus of the ideas immanent in nervous
     activity. In: James, A.A., Edward, R. (eds.) Neurocomputing: foundations of research,
     15-27.
[17] Rumelhart, D.E., Hinton, G.E., Williams, R. J. 1986. Learning internal representations by
     error propagation. In: David, E.R., James, L.M., Group, C.P.R. (eds.) Parallel
     distributed processing: explorations in the microstructure of cognition, vol. 1, 318-362.

</pre>