PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde

Francisco Rangel
Autoritas Consulting, Valencia, Spain
francisco.rangel@autoritas.es

Fabio A. González
MindLab Research Group, Universidad Nacional de Colombia, Bogotá, Colombia
fagonzalezo@unal.edu.co

Felipe Restrepo-Calle
MindLab Research Group, Universidad Nacional de Colombia, Bogotá, Colombia
ferestrepoca@unal.edu.co

Manuel Montes
INAOE, Mexico
mmontesg@inaoep.mx

Paolo Rosso
PRHLT Research Center, Universitat Politècnica de València
prosso@dsic.upv.es
ABSTRACT

Author profiling consists of predicting an author's characteristics (e.g. age, gender, personality) from her writing. After addressing mainly age and gender identification at PAN@CLEF, and also personality recognition in Twitter (http://pan.webis.de/), in this PAN@FIRE track on Personality Recognition from SOurce COde (PR-SOCO) we have addressed the problem of predicting an author's personality traits from her source code. In this paper, we analyse the 48 runs sent by 11 participant teams. Given a set of source codes written in Java by students who also answered a personality test, participants had to predict personality traits based on the big five model. Results have been evaluated with two complementary measures (RMSE and Pearson product-moment correlation) that make it possible to identify whether systems with low error rates may be working by random chance. No matter the approach, openness to experience is the trait for which the participants obtained the best results on both measures.

Keywords

personality recognition; source code; author profiling

1. INTRODUCTION

Personality influences most, if not all, human activities, such as the way people write [5, 25], interact with others, and the way people make decisions. For instance, in the case of developers, personality influences the criteria they consider when selecting a software project to participate in [22], and the way they write and structure their source code. Personality is defined along five traits using the Big Five Theory [7], which is the most widely accepted in psychology. The five traits are: extroversion (E), emotional stability / neuroticism (S), agreeableness (A), conscientiousness (C), and openness to experience (O).

Personality recognition may have several practical applications, for example setting up high performance teams. In software development, not only technical skills are required, but also soft skills such as communication or teamwork. The possibility of using a tool to predict personality from source code, in order to know whether a candidate may fit in a team, may be very valuable for the recruitment process. Also in education, knowing students' personality from their source code may help to improve the learning process by customising the educational offer.

In this PAN@FIRE track on Personality Recognition from SOurce COde (PR-SOCO), we have addressed the problem of predicting an author's personality from her source code. Given the source code collection of a programmer, the aim is to identify her personality traits. In the training phase, participants were provided with source codes in Java, written by computer science students, together with their personality traits. At test time, participants received the source codes of a few programmers and had to predict their personality traits. The number of source codes per programmer is kept small to reflect a real scenario such as a job interview: the interviewer could be interested in knowing the interviewee's degree of conscientiousness by evaluating just a couple of programming problems.

We encouraged participants to investigate beyond standard n-gram based features. For example, the way the code is commented, the naming convention for identifiers or the indentation may also provide valuable information. In order to encourage the investigation of different kinds of features, several runs per participant were allowed. In this paper, we describe the participation of the 11 teams that sent 48 runs.

The remainder of this paper is organised as follows. Section 2 covers the state of the art, Section 3 describes the corpus and the evaluation measures, and Section 4 presents the approaches submitted by the participants. Sections 5 and 6 discuss results and draw conclusions, respectively.

2. RELATED WORK

Pioneering research on personality recognition was carried out by Argamon et al. [27], who focused on the identification of extroversion and emotional stability. They used support vector machines with a combination of word categories and relative frequencies of function words to recognize these traits from self-reports. Similarly, Oberlander and Nowson [21] focused on the personality identification of bloggers.
Mairesse et al. [20] analysed the impact of different sets of psycholinguistic features obtained with LIWC (http://www.liwc.net/) and MRC (http://www.psych.rl.ac.uk/), showing the highest performance on the openness to experience trait.

Recently, researchers have focused on personality recognition from social media. In [14, 24, 6], the authors analysed different sets of linguistic features as well as friend counts and daily activity. In [18], the authors reported a comprehensive analysis of features such as the size of the friendship network, the number of uploaded photos or the events attended by the user. They analysed more than 180,000 Facebook users and found correlations between these features and the different traits, especially in the case of extroversion. Using the same Facebook dataset and a similar set of features, Bachrach et al. [1] reported high results when predicting extroversion automatically.

In [26], the authors analysed 75,000 Facebook messages of volunteers who had filled in a personality test and found interesting correlations between word usage and personality traits. According to them, extroverts use more social words and introverts use more words related to solitary activities. Emotionally stable people use words related to sports, vacation, beach, church or team, whereas neurotics use more words and sentences referring to depression.

Due to the interest in this field, and with the aim of defining a common evaluation framework, some shared tasks have been organised, for example: i) the Workshop on Computational Personality Recognition [5]; or ii) the Author Profiling task at PAN 2015 [25], with the objective of identifying the age, gender and personality traits of Twitter users.

Regarding programming style and personality, in [3] the authors explored the relationship between cognitive style, personality and computer programming style. More recently, the authors of [16] also related personality to programming style and performance. Whereas the 2014 [10] and 2015 [11] PAN@FIRE tracks on SOurce COde (SOCO) were devoted to detecting reuse, in 2016 we aimed at identifying personality traits from source code.

3. EVALUATION FRAMEWORK

In this section we describe the construction of the corpus, covering its particular properties, challenges and novelties. Finally, the evaluation measures are described.

3.1 Corpus

The dataset is composed of Java programs written by computer science students from a data structures course at the Universidad Nacional de Colombia. Students were asked to upload source code, responding to the functional requirements of different programming tasks, to an automated assessment tool. For each task, students could upload more than one attempted solution. The number of attempts per problem was neither limited nor discouraged in any way. There are very similar submissions among different attempts, and some of them contain compile-time or runtime errors.

Although in most cases students uploaded the right Java source code file, some of them erroneously uploaded the compiler output, debug information or even source code in another programming language (e.g. Python). A priori this seems to be noise in the dataset, and a sensible alternative could have been to remove these entries. However, we decided to keep them for the following reasons: firstly, participant teams could easily remove them if they decided to do so; secondly, it is possible that this kind of mistake is related to some personality traits, so this information can be used as a feature as well. Finally, although we encouraged the students to write their own code, some of them could have reused pieces of code from other exercises or even looked for code excerpts in books or on the Internet.

In addition, each student answered a Big Five personality test that allowed us to calculate a numerical score for each of the following personality traits: extroversion, emotional stability / neuroticism, agreeableness, conscientiousness, and openness to experience.

Overall, the dataset consists of 2,492 source code programs written by 70 students, along with the scores of the five personality traits for each student, which are provided as floating point numbers in the continuous range [20, 80]. The source codes of each student were organised in a single text file containing all her source codes, separated by a line separator. The dataset was split into training and test subsets, the first one containing the data of 49 students and the second one the data of the remaining 21. Participants only had access to the personality trait scores of the 49 students in the training dataset.

3.2 Performance measures

For evaluating the participants' approaches we have used two complementary measures: Root Mean Square Error (RMSE) and Pearson Product-Moment Correlation (PC). The motivation for using both measures is to try to understand whether a committed error is due to random chance.

We have calculated the RMSE for each trait with Equation 1:

    RMSE_t = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }    (1)

where RMSE_t is the Root Mean Square Error for trait t (neuroticism, extroversion, openness, agreeableness, conscientiousness), and y_i and \hat{y}_i are the ground truth and predicted values, respectively, for author i. Also for each trait, PC is calculated following Equation 2:

    r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } }    (2)

where each x_i and y_i are, respectively, the ground truth and the predicted value for each author i, and \bar{x} and \bar{y} are the average values.
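Both measures are straightforward to compute. The following minimal Python sketch (ours, not the official evaluation script) follows Equations 1 and 2 directly:

    import math

    def rmse(truth, pred):
        # Root Mean Square Error for one trait (Equation 1).
        n = len(truth)
        return math.sqrt(sum((y - p) ** 2 for y, p in zip(truth, pred)) / n)

    def pearson(truth, pred):
        # Pearson product-moment correlation for one trait (Equation 2).
        n = len(truth)
        mx, my = sum(truth) / n, sum(pred) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(truth, pred))
        sx = math.sqrt(sum((x - mx) ** 2 for x in truth))
        sy = math.sqrt(sum((y - my) ** 2 for y in pred))
        # Undefined (zero denominator) when the predictions are constant,
        # as with the mean baseline; Table 1 reports 0.00 in that case.
        return cov / (sx * sy)

Both functions take one list of values per test author and are applied once per trait.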
4. OVERVIEW OF THE SUBMITTED APPROACHES

Eleven teams participated in the Personality Recognition in SOurce COde shared task (http://www.autoritas.es/prsoco/). They sent 48 runs following different approaches, and 9 of them submitted working notes describing their systems. In the following, we briefly highlight the different systems.

• besumich [23] experimented with two kinds of features, bag of words and character n-grams (with n=1,2,3). In both cases, they experimented with lowercased and original case, and with three representations: binary (presence/absence), term frequency (TF) and TF-IDF. The authors trained linear, ridge and Lasso regressions. The final configuration used to send their runs combined lowercased unigrams weighted with TF-IDF (with and without space characters) with different values for the alpha parameter of the Lasso regression (a minimal sketch of this kind of pipeline is given after this list).

• bilan [2] started by analysing the code structure with the Antlr Java Code Analyzer (https://github.com/antlr): it parses the program code and produces a parse tree for it. They then use every single node of the output tree (nodes represent different code categories, like classes, loops or variables) and count the frequency distribution of these nodes (around 200 features are taken into consideration). Apart from Antlr, they obtain a set of custom features for the source code, such as the length of the whole program, the average length of variable names, the frequency of comments and their length, what indentation the programmer is using, and also the distribution and usage of various statements and decorators. They also extract features from the comments, such as the type/token ratio, the usage of punctuation marks, the average word length and a TF-IDF vector. They trained their models with two approaches: learning from each single source code, and learning from the whole set of source codes per author.

• castellanos [4] also used Antlr with the Java grammar to obtain different measures from the analysis of the source code, for example the number of files, the average lines of code, the average number of classes, the average number of lines per class, the average attributes per class, the average methods per class, the average static methods, and so on, combined with Halstead metrics [15] such as delivered bugs, difficulty, effort, time to understand or implement, and volume. For prediction, the author experimented with support vector regression, extra trees regression, and support vector regression on averages.

• delair [8] combined style features (e.g. code layout and formatting, indentation, headers, Javadoc, comments, whitespace) with content features (e.g. class design problems, method design problems, annotations, block checks, coding, imports, metrics, modifiers, naming conventions, size violations). They trained a support vector machine for regression, gaussian processes, M5, M5 rules and random trees.

• doval [9] approached the task with a shallow Long Short-Term Memory (LSTM) recurrent neural network. It works at the byte level, meaning that at each time step a new byte from the input text is processed by the network in an ordered manner. The bytes belonging to a particular source code package in an input text file are considered as a sequence, where the processing of a byte at time step t is influenced by the previous time steps t-1, t-2, ..., 0 (the initial time step). The network learning criterion is a smoothed mean absolute error, which uses a squared term when the absolute element-wise error falls below 1 (see the byte-level sketch after this list).

• gimenez [13] proposed two different approaches to tackle this task. On the one hand, each code sample from each author was taken as an independent sample and vectorized using word n-grams; on the other hand, all the codes from an author were taken as a single sample vectorized using word n-grams together with hand-crafted features (e.g. the number of codes implementing the same class, the appearance of pieces of code suspected of plagiarism, the number of developed classes, the number of different classes). In both approaches, a logistic regression model was trained.

• hhu [19] extracted structure features (e.g. number of methods per class, length of function names, cyclomatic complexity) and style features (e.g. length of methods per class, number of comments per class), but ignored layout features (e.g. indentation) because they can easily be modified by the programming IDE. They used the variance and range, besides the mean, to aggregate the frequencies, and then constructed a separate model for each trait, training both linear regression and nearest neighbour models.

• kumar [12] used multiple linear regression to model each of the five personality traits. For each personality trait, they used four features: i) the number of genuine comment words in multi-line comments, i.e., between /* and */, found in the program code; ii) the number of genuine comment words in single-line comments, i.e., comments following "//" (for both this feature and the previous one, they did not consider cases where lines of code are commented out, and the feature value is normalized by dividing it by the total number of words in the program file); iii) the number of lines with missing spaces, e.g., for (int i=1; i<=cases; i++) as opposed to for (int i = 1; i <= cases; i++), since the presence of spaces is supposed to be good programming practice (this feature value is normalized by dividing it by the total number of lines in the program file); and iv) the number of instances where the programmer imported only specific libraries (e.g. import java.io.FileNotFoundException as opposed to import java.io.*), as this is supposed to be good programming practice (this feature value was also normalized with respect to the total number of lines in the program file). A sketch of these features is given after this list.

• uaemex [28] obtained three types of features, related to: i) indentation: spaces in the code, spaces in the comments, spaces between classes, spaces between source code blocks, spaces between methods, spaces between control sentences, and spaces around the grouping characters "( ), [ ], { }"; ii) identifiers: the presence of underscore, uppercase, lowercase and numeric characters in the identifier, and the length of the identifier, extracted for each class, method and variable name, plus the percentage of initialized variables; and iii) comments: the presence of line and block comments, the size of the comments, and the presence of comments with all letters in uppercase. They experimented with symbolic regression, support vector machines, k-nearest neighbours, and neural networks.
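To make the n-gram based approaches concrete, the following is a minimal sketch of a besumich-style pipeline (lowercased character unigrams, TF-IDF weighting, Lasso regression) written with scikit-learn; the toy data, and the choice of scikit-learn itself, are our own illustration rather than the team's actual code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Lasso
    from sklearn.pipeline import make_pipeline

    # Toy data: one document per author (all her codes concatenated)
    # and one ground-truth score per author for a single trait.
    train_codes = ["public class A { int x=1; }",
                   "public class B { /* sort */ void s() {} }",
                   "import java.io.*; public class C {}"]
    train_scores = [48.0, 55.0, 52.0]  # e.g. openness, range [20, 80]

    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 1), lowercase=True),
        Lasso(alpha=0.01),  # one of the alpha values tuned by besumich
    )
    model.fit(train_codes, train_scores)
    print(model.predict(["public class D { int y = 2; }"]))

One such model is trained per trait; as discussed in Section 5, this configuration with alpha 0.01 (run 5) turned out to be particularly competitive for extroversion.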
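The kumar features can likewise be made concrete. The sketch below is our reading of their description; the regular expressions are ours, and, unlike the original system, it makes no attempt to skip commented-out lines of code:

    import re

    def kumar_features(code: str):
        # Four normalized features in the spirit of kumar [12].
        lines = code.splitlines()
        n_words = max(len(code.split()), 1)
        n_lines = max(len(lines), 1)

        # i) words inside /* ... */ multi-line comments
        block = sum(len(m.split())
                    for m in re.findall(r"/\*(.*?)\*/", code, re.S))
        # ii) words in // single-line comments
        single = sum(len(l.split("//", 1)[1].split())
                     for l in lines if "//" in l)
        # iii) lines with missing spaces around operators, e.g. "i=1"
        cramped = sum(1 for l in lines if re.search(r"\w[=<>]\w", l))
        # iv) specific (non-wildcard) imports
        specific = sum(1 for l in lines
                       if l.strip().startswith("import")
                       and not l.rstrip().endswith("*;"))

        return (block / n_words, single / n_words,
                cramped / n_lines, specific / n_lines)

The four returned values feed the per-trait linear regressions.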
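The learning criterion described for doval corresponds to what deep learning libraries call a smooth L1 (Huber-style) loss. A minimal byte-level sketch follows, assuming PyTorch purely for illustration (the working notes do not prescribe a framework, and the real model's dimensions differ):

    import torch
    import torch.nn as nn

    class ByteLSTM(nn.Module):
        # Shallow byte-level LSTM regressor in the spirit of doval [9].
        def __init__(self, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(256, 32)  # one embedding per byte value
            self.lstm = nn.LSTM(32, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)     # one trait score

        def forward(self, byte_ids):            # byte_ids: (batch, seq_len)
            h, _ = self.lstm(self.embed(byte_ids))
            return self.out(h[:, -1]).squeeze(-1)

    model = ByteLSTM()
    # Squared term when |error| < 1, absolute term otherwise.
    criterion = nn.SmoothL1Loss()
    x = torch.tensor([list(b"public class A {}")])  # raw bytes of a snippet
    loss = criterion(model(x), torch.tensor([50.0]))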
Although montejo did not send working notes, they sent us a brief description of their system. They used ToneAnalyzer (https://tone-analyzer-demo.mybluemix.net/), an IBM Watson module that proposes a value for each big five trait for a given text. The authors ran ToneAnalyzer on the source code as it is and rescaled the output to fit the right range for the traits. Similarly, lee sent us a description of their system. Their hypothesis is that, depending on the personality, there will be differences between the successive versions (steps) of the source codes. Given the i-th coder and n source codes for coder c_i, the authors sorted the codes by length, naming them c_i^0 to c_i^{n-1}. They transformed each code into a vector v_i^j using skip-thought encoding [17], and then calculated n-1 difference vectors d_i^j using the equation d_i^j = v_i^{j+1} - v_i^j. The authors map each coder into the feature space given by Sum(d_i) and Avg(d_i), and then apply a logistic regression algorithm to train a model.

Furthermore, we have provided two baselines (a sketch of both is given after this list):

• bow: a bag of character 3-grams with frequency weighting.

• mean: an approach that always predicts the mean value observed in the training data.
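The two baselines amount to something like the following reconstruction in scikit-learn. The overview does not state which regressor sits behind the bow features, so plain linear regression is our assumption here:

    from sklearn.dummy import DummyRegressor
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    # bow baseline: character 3-gram frequencies plus a regressor
    # (the choice of LinearRegression is our assumption).
    bow_baseline = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(3, 3)),
        LinearRegression(),
    )

    # mean baseline: always predict the training mean. Its Pearson
    # correlation is undefined for a constant output, which is
    # presumably why it appears as 0.00 in Table 1.
    mean_baseline = DummyRegressor(strategy="mean")

Each baseline is fitted once per trait with the training authors' concatenated codes and trait scores.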

5. EVALUATION AND DISCUSSION OF THE SUBMITTED APPROACHES

Results are presented in Table 1 in alphabetical order. Below the participants' results, a summary with the common descriptive statistics is provided for each trait. At the bottom of the table, results for the baselines are also provided. Figures 1 to 3 show the distribution of the two measures, RMSE and Pearson correlation, for all the participants except the baselines. In Figure 1 we can appreciate that there are many runs with anomalous RMSE values (outliers), whereas in Figure 2 these outliers have been removed. Looking at these figures and at the table of results, we can observe the following:

• The mean is between 10.49 and 12.75 (a difference of 2.26), with the lowest value corresponding to openness and the highest one to neuroticism.

• The median is between 8.14 and 10.77 (a difference of 2.63), with again the lowest value corresponding to openness and the highest one to neuroticism.

• The lowest differences between mean and median were obtained for conscientiousness (1.75), followed by neuroticism (1.98). The highest were obtained for extroversion (2.72), agreeableness (2.36) and openness (2.35).

• In all cases, the mean is higher than the median, and also higher than the 3rd quartile (q3), showing the effect of the outliers.

• The minimum and maximum values were both obtained for the openness trait (6.95 and 33.53 respectively).

• When removing outliers, the maximum value was obtained for extroversion (16.67).

• The lowest quartiles, both 1st and 3rd (q1 and q3), correspond to openness (7.54 and 9.58 respectively).

• The narrowest interquartile range corresponds to conscientiousness (1.22), followed by neuroticism (1.84) and openness (2.04). The widest corresponds to extroversion (3.23), followed by agreeableness (2.28).

In Figure 3 the distribution of the Pearson correlations is shown. Looking at this figure and at the table of results, we can observe the following:

• There is only one outlier in the agreeableness trait (0.38). Unfortunately, this correlation corresponds to a high RMSE value (25.53).

• The mean is between −0.01 and 0.09 (a difference of 0.10), with the lowest value corresponding to conscientiousness and agreeableness, and the highest one to openness. In any case, these values are very close to random chance.

• The median is between −0.03 and 0.08 (a difference of 0.11), with the lowest value corresponding to agreeableness and the highest one to extroversion.

• The lowest differences between mean and median were obtained for conscientiousness (0), followed by neuroticism (0.01), and extroversion, agreeableness and openness (0.02).

• The mean is higher than the median in the case of openness (0.09 vs. 0.07) and agreeableness (−0.01 vs. −0.03). The opposite occurs in the case of neuroticism (0.04 vs. 0.05), extroversion (0.06 vs. 0.08), and conscientiousness (−0.01 for both).

• The minimum value was obtained for the extroversion trait (−0.37), very close to openness (−0.36), and the maximum for openness (0.62), followed by extroversion (0.47), agreeableness (0.38), neuroticism (0.36) and conscientiousness (0.33).

• Notwithstanding the goodness of the maximum values, in most cases they correspond to high RMSE: openness (23.62), extroversion (28.80), agreeableness (25.53), and conscientiousness (22.05). Only in the case of neuroticism does the maximum Pearson correlation correspond to a low RMSE value (10.22).

• The highest q3 corresponds to openness (0.28) and extroversion (0.21), followed by conscientiousness (0.14) and neuroticism (0.14). The lowest corresponds to agreeableness (0.07).

• The narrowest interquartile range corresponds to agreeableness (0.18), followed by neuroticism (0.22), conscientiousness (0.28), extroversion (0.31) and openness (0.33).

We can conclude that, in general, systems performed similarly in terms of Pearson correlation for all the traits. However, there seem to be larger differences with respect to RMSE, where the systems obtained better results for openness than for the rest. The distributions show that the lowest sparsity occurs with conscientiousness in the case of RMSE and with agreeableness in the case of Pearson correlation, while the highest sparsity occurs with extroversion in the case of RMSE and with openness in the case of Pearson correlation.
Results for neuroticism are plotted in Figure 4. This figure represents each system's results by plotting its RMSE on the x axis and its Pearson correlation on the y axis. It is worth mentioning that the system proposed by delair in their 4th run obtained one of the highest values of Pearson correlation (0.29), although with a high RMSE (17.55). This system consists of a combination of style features (code layout and formatting, indentation...) and content features (class design, method design, imports...), trained with random trees. We can also observe a group of five points (actually six systems, since two of them obtained the same results) in the upper-left corner of the chart. These systems obtained the highest correlations with the lowest error, and they are detailed in Figure 5. We can see that all of them (except lee, which used skip-thought encoding) extracted specific features from the source code, such as the number of methods, the number of comments per class, the type of comments (/* */ vs. inline), the type of variable naming, and so on. We can also see that some of these teams obtained similar results with two of their systems. For example, kumar with their 1st and 2nd runs (they used linear regression for both runs, but tried to optimise run 2 by removing from the training set the three files which obtained the highest error in training), or hhu, which obtained their best results with their 2nd and 4th runs (both used k-NN with a different combination of features). Uaemex obtained their best result with run 3, which used neural networks. We can conclude that, for neuroticism, specific features extracted from the code (kumar, hhu, uaemex) worked better than generic features such as n-grams (besumich, which obtained low RMSE but no correlation in most cases), byte streams (doval, which obtained low RMSE but negative correlations in most cases) or text streams (montejo, which obtained high RMSE with low correlations).

In Figure 6, results for extroversion are shown. We can see that doval in their 4th run obtained the highest Pearson correlation (0.47), but with the worst RMSE (28.80). They trained an LSTM recurrent neural network on byte-level input, that is, without the need to perform feature engineering. In the upper-left corner of the figure we can see the group with the best results in both RMSE and Pearson correlation, which is detailed in Figure 7. We can highlight the superiority of besumich run 5 (lowercased character unigrams weighted with TF-IDF, training a Lasso regression algorithm with alpha 0.01), which obtained a correlation of 0.38 with an RMSE of 8.60, and kumar run 1 (code-specific features with linear regression, without optimisation), with a correlation of 0.35 and an RMSE of 8.60. It is worth mentioning that lee obtained high results with four of their approaches, which use skip-thought encoding, and something similar occurred with gimenez. The latter used a combination of word n-grams with specific features obtained from the code (the number of codes that implemented the same class, the appearance of pieces of code suspected of plagiarism, the number of classes developed, and the number of different classes developed), trained with ridge in runs 1 (8.75 / 0.31) and 2 (8.79 / 0.28), and with logistic regression in run 4 (8.69 / 0.28). In the case of extroversion we can see that common features such as n-grams (besumich) obtained good results; gimenez also used word n-grams in combination with other features, which supports this conclusion. However, byte streams (doval) again produced high RMSE with high correlation, and text streams (montejo) produced high RMSE with low correlation. In some cases, specific features obtained low RMSE but with negative correlation (bilan, hhu, uaemex). Although the bow-based baseline is not among the top performing methods, it obtained low RMSE (9.06) with an above-median correlation (0.12).

Similarly, openness results are presented in Figure 8. It is noticeable that two systems presented by delair obtained the highest correlations, but with quite high RMSE. Concretely, run 1 obtained the highest correlation (0.62) with a high RMSE (23.62), and run 3 obtained the second highest correlation (0.54) with a slightly lower RMSE (20.28); they used M5rules and M5P respectively. The systems in the upper-left corner are shown in detail in Figure 9. We can see that the best result in both RMSE and Pearson correlation was obtained by uaemex in their 1st run. This run was generated using symbolic regression with three types of features: indentation, identifiers and comments. The authors optimised this run by eliminating the source codes of five developers according to the following criteria: the person with high values in all the personality traits, the person with low values in all the personality traits, the person with average values in all the personality traits, the person with the most source codes, and the person with the fewest source codes. They also obtained high results with their 3rd run, where they trained a back propagation neural network with the whole set of training codes. The systems presented by bilan also obtained high results in different runs; concretely, using the Antlr parser to obtain features, in combination with features extracted from comments and so on, they trained gradient boosted regression and multinomial logistic regression. Similar results were obtained by castellanos, who also used Antlr combined with Halstead measures and trained an extra trees regressor (run 2) and support vector regression on averages (run 3); by kumar, with combinations of structure and style features trained with linear regression (2nd run, optimised by eliminating training files); and by hhu, also with combinations of structure and style features, with k-NN in both runs. For openness, the best performing teams used specific features extracted from the code (uaemex, kumar, hhu), even with the help of code analysers such as Antlr (castellanos, bilan). Common features seem to obtain good levels of RMSE but with low (or even negative) correlations (besumich, bow-based baseline).

In the case of agreeableness, as shown in Figure 10, we can see that doval with their 4th run obtained the highest correlation (0.38), but with a high RMSE (25.53). The systems in the upper-left corner are shown in detail in Figure 11. The best result in both measures was obtained by gimenez in their 3rd run; the team used ridge regression to train their model with a subset of code style features. It is worth mentioning that the provided baseline consisting of character n-grams appears as one of the top performing methods for this trait. For this trait it is more difficult to differentiate between common and specific features, since there are many different teams that, although they obtained low RMSE, have negative correlations: for example, besumich with character n-grams, bilan and castellanos with specific features obtained with Antlr (among others), or delair with a combination of style and content features. However, it is worth mentioning that the bow baseline obtained top results in both RMSE and Pearson correlation.
Finally, with respect to conscientiousness, results are depicted in Figure 12. We can see that four runs obtained high values of Pearson correlation but also high RMSE. Concretely, delair obtained the highest correlation (0.33) with the second highest RMSE (22.05) in their 1st and 3rd runs (M5rules and M5P respectively), and also a high correlation (0.27) with a slightly lower RMSE (15.53) in their 5th run (support vector machine for regression). Similarly, doval with their 4th run obtained a high correlation (0.32) but with a high RMSE (14.69), using an LSTM recurrent neural network with byte-level input. The systems in the upper-left corner are represented in Figure 13. In this case, the best results in terms of RMSE are not the best ones in terms of Pearson correlation: among the former, hhu with runs 1, 2 and 3, and uaemex with run 1; among the latter, lee with runs 2, 4 and 5, bilan with runs 4 and 5, and doval with run 3. It is noticeable that the provided baseline again obtained one of the best results, in this case the second best RMSE with one of the top 5 correlations. For conscientiousness, systems that used n-grams (besumich, gimenez), byte streams (doval) and text streams (montejo) performed worst in terms of Pearson correlation, with negative values in most cases, whereas the best results were achieved by combinations of structure, style and comment features (hhu, uaemex) or by features obtained by analysing the codes (bilan). However, again the bow baseline achieved top positions, especially in RMSE.

To sum up, depending on the trait, generic features such as n-grams obtained different results in comparison with specific features obtained from the code. In the case of generic features, their impact shows up especially in the correlation: they may obtain good levels of RMSE but without a good correlation. As expected, the mean-based baseline obtained no correlation, since its constant prediction carries no information about individual authors. However, its RMSE was better than the average and median results in most cases. This result supports the need to also use a measure like Pearson correlation, in order to avoid crediting low RMSE values that are due to random chance.

6. CONCLUSION

This paper describes the 48 runs sent by 11 participants to the PR-SOCO shared task at PAN@FIRE 2016. Given a set of source codes written in Java by students who answered a personality test, the participants had to predict values for the big five traits.

Results have been evaluated with two complementary measures: RMSE, which provides an overall score of the performance of a system, and Pearson product-moment correlation, which indicates whether the performance is due to random chance. In general, systems worked quite similarly in terms of Pearson correlation for all traits; larger differences were noticed with respect to RMSE. The best results were achieved for openness (6.95), in line with what was previously reported by Mairesse et al. [20]; openness was also one of the traits with the lowest RMSE at PAN 2015 [25] for most languages.

Participants used different kinds of features: from general ones, such as word or character n-grams, to specific ones obtained by parsing the code and analysing its structure, style or comments. Depending on the trait, generic features obtained competitive results compared with specific ones in terms of RMSE. However, in most cases the best RMSE obtained with these features came with low values of the Pearson correlation. In these cases, some systems seemed to be less robust, at least for some of the personality traits.

Finally, in line with the above comments, it is worth mentioning that approaches that took advantage of the training distributions (as the baseline based on means did) obtained low RMSE. However, this may be due to random chance. This supports the need for measures complementary to RMSE, such as Pearson correlation, in order to avoid misinterpretations due to a biased measure.

7. ACKNOWLEDGMENTS

Our special thanks go to all of the PR-SOCO participants. The work of the first author was partially supported by Autoritas Consulting and by the Ministerio de Economía y Competitividad de España under grant ECOPORTUNITY IPT-2012-1220-430000. The work of the fifth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under grant ALMAMATER (PrometeoII/2014/030).

8. REFERENCES

[1] Y. Bachrach, M. Kosinski, T. Graepel, P. Kohli, and D. Stillwell. Personality and patterns of facebook usage. In Proceedings of the ACM Web Science Conference, pages 36-44. ACM, New York, NY, USA, 2012.
[2] I. Bilan, E. Saller, B. Roth, and M. Krytchak. Caps-prc: A system for personality recognition in programming code - notebook for PAN at FIRE16. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] C. Bishop-Clark. Cognitive style, personality, and computer programming. Computers in Human Behavior, 11(2):241-260, 1995.
[4] H. A. Castellanos. Personality recognition applying machine learning techniques on source code metrics. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[5] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez, G. Riccardi, and F. Pianesi. The workshop on computational personality recognition 2014. In Proceedings of the ACM International Conference on Multimedia, pages 1245-1246. ACM, 2014.
[6] F. Celli and L. Polonio. Relationships between personality and interactions in facebook. In Social Networking: Recent Trends, Emerging Issues and Future Outlook, pages 41-54. Nova Science Publishers, Inc, 2013.
[7] P. T. Costa and R. R. McCrae. The revised NEO personality inventory (NEO-PI-R). The SAGE Handbook of Personality Theory and Assessment, 2:179-198, 2008.
[8] R. Delair and R. Mahajan. Personality recognition in source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[9] Y. Doval, C. Gómez-Rodríguez, and J. Vilares. Shallow recurrent neural network for personality recognition in source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[10] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. PAN@FIRE: Overview of SOCO track on the detection of source code re-use. In Notebook Papers of FIRE 2014, FIRE-2014, Bangalore, India, 2014.
[11] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. PAN@FIRE 2015: Overview of CL-SOCO track on the detection of cross-language source code re-use. In Proceedings of the Seventh Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, pages 4-6, 2015.
[12] K. Ghosh and S. Kumar-Parui. Indian Statistical Institute, Kolkata at PR-SOCO 2016: A simple linear regression based approach. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[13] M. Giménez and R. Paredes. PRHLT at PR-SOCO: A regression model for predicting personality traits from source code - notebook for PR-SOCO at FIRE 2016. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[14] J. Golbeck, C. Robles, and K. Turner. Predicting personality with social media. In CHI'11 Extended Abstracts on Human Factors in Computing Systems, pages 253-262. ACM, 2011.
[15] M. H. Halstead. Elements of Software Science. Operating and Programming Systems Series, vol. 2, 1977.
[16] Z. Karimi, A. Baraani-Dastjerdi, N. Ghasem-Aghaee, and S. Wagner. Links between the personalities, styles and performance in computer programming. Journal of Systems and Software, 111:228-241, 2016.
[17] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294-3302, 2015.
[18] M. Kosinski, Y. Bachrach, P. Kohli, D. Stillwell, and T. Graepel. Manifestations of user personality in website choice and behaviour on online social networks. Machine Learning, pages 1-24, 2013.
[19] M. Liebeck, P. Modaresi, A. Askinadze, and S. Conrad. Pisco: A computational approach to predict personality types from Java source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[20] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30(1):457-500, 2007.
[21] J. Oberlander and S. Nowson. Whose thumb is it anyway? Classifying author personality from weblog text. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 627-634. Association for Computational Linguistics, 2006.
[22] O. H. Paruma-Pabón, F. A. González, J. Aponte, J. E. Camargo, and F. Restrepo-Calle. Finding relationships between socio-technical aspects and personality traits by mining developer e-mails. In Proceedings of the 9th International Workshop on Cooperative and Human Aspects of Software Engineering, pages 8-14. ACM, 2016.
[23] S. Phani, S. Lahiri, and A. Biswas. Personality recognition working note: Team besumich. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[24] D. Quercia, R. Lambiotte, D. Stillwell, M. Kosinski, and J. Crowcroft. The personality of popular facebook users. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, pages 955-964. ACM, 2012.
[25] F. Rangel, P. Rosso, M. Potthast, B. Stein, and W. Daelemans. Overview of the 3rd author profiling task at PAN 2015. In L. Cappellato, N. Ferro, G. Jones, and E. San Juan (Eds.), CLEF 2015 Labs and Workshops, Notebook Papers, CEUR Workshop Proceedings, vol. 1391. CEUR-WS.org, 2015.
[26] H. A. Schwartz, J. C. Eichstaedt, M. L. Kern, L. Dziurzynski, S. M. Ramones, M. Agrawal, A. Shah, M. Kosinski, D. Stillwell, M. E. Seligman, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE, 8(9):773-791, 2013.
[27] S. A. Sushant, S. Argamon, S. Dhawle, and J. W. Pennebaker. Lexical predictors of personality type. In Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of North America, 2005.
[28] E. Vázquez-Vázquez, O. González-Brito, J. Armeaga-García, M. García-Calderón, G. Villada-Ramírez, A. J. Serrano-León, R. A. García-Hernández, and Y. Ledeneva. UAEMex system for identifying personality traits in source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
Figure 1: RMSE distribution.
Figure 2: RMSE distribution (without outliers).
Figure 3: Pearson correlation distribution.
Figure 4: RMSE vs. PC for neuroticism.
Figure 5: RMSE vs. PC for neuroticism (detailed).
Figure 6: RMSE vs. PC for extroversion.
Figure 7: RMSE vs. PC for extroversion (detailed).
Figure 8: RMSE vs. PC for openness.
Figure 9: RMSE vs. PC for openness (detailed).
Figure 10: RMSE vs. PC for agreeableness.
Figure 11: RMSE vs. PC for agreeableness (detailed).
Figure 12: RMSE vs. PC for conscientiousness.
Figure 13: RMSE vs. PC for conscientiousness (detailed).
Table 1: Participants' results in terms of root mean square error and Pearson product-moment correlation (each cell shows RMSE / PC).
         Team          Run    Neuroticism Extroversion      Openness Agreeableness Conscientiousness
         besumich       1    10.69 / 0.05   9.00 / 0.14   8.58 / -0.33  9.38 / -0.09    8.89 / -0.14
                        2    10.69 / 0.05   9.00 / 0.14   8.58 / -0.33  9.38 / -0.09    8.89 / -0.14
                        3    10.53 / 0.05   9.05 / 0.10   8.43 / -0.33  9.32 / -0.07    8.88 / -0.17
                        4    10.53 / 0.05   9.05 / 0.10   8.43 / -0.33  9.32 / -0.07    8.88 / -0.17
                        5    10.83 / 0.10 8.60 / 0.38     9.06 / -0.31  9.66 / -0.10    8.77 / -0.06
         bilan          1    10.42 / 0.04   8.96 / 0.16   7.54 / 0.10   9.16 / 0.04     8.61 / 0.07
                        2    10.28 / 0.14   9.55 / -0.10  7.25 / 0.29   9.17 / -0.12    8.83 / -0.31
                        3    10.77 / -0.12  9.35 / -0.07  7.19 / 0.36   8.84 / 0.21     8.99 / -0.11
                        4    12.06 / -0.04 11.18 / -0.35  7.50 / 0.35  10.89 / -0.05    8.90 / 0.16
                        5    11.95 / 0.06 11.69 / -0.37   7.46 / 0.37  11.19 / -0.05    9.10 / 0.11
         castellanos    1    11.83 / 0.05   9.54 / 0.11   8.14 / 0.28  10.48 / -0.08    8.39 / -0.09
                        2    10.31 / 0.02   9.06 / 0.00   7.27 / 0.29   9.61 / -0.11    8.47 / -0.16
                        3    10.24 / 0.03   9.01 / 0.01   7.34 / 0.30   9.36 / 0.01     9.99 / -0.25
         delair         1    19.07 / 0.20 25.22 / 0.08 23.62 / 0.62    21.47 / -0.15   22.05 / 0.33
                        2    26.36 / 0.19 16.67 / -0.02 15.97 / 0.19   23.11 / -0.13   21.72 / 0.10
                        3    18.75 / 0.20 25.22 / 0.08 20.28 / 0.54    21.47 / -0.15   22.05 / 0.33
                        4    17.55 / 0.29 20.34 / -0.26 16.74 / 0.27   21.10 / -0.06   20.90 / 0.14
                        5    26.72 / 0.18 23.41 / -0.11 16.25 / 0.13   27.78 / -0.19   15.53 / 0.27
         doval          1    11.99 / -0.01 11.18 / 0.09 12.27 / -0.05  10.31 / 0.20     8.85 / 0.02
                        2    12.63 / -0.18 11.81 / 0.21   8.19 / -0.02 12.69 / -0.01    9.91 / -0.30
                        3    10.37 / 0.14 12.50 / 0.00    9.25 / 0.11  11.66 / -0.14    8.89 / 0.15
                        4    29.44 / -0.24 28.80 / 0.47 27.81 / -0.14  25.53 / 0.38    14.69 / 0.32
                        5    11.34 / 0.05 11.71 / 0.19 10.93 / 0.12    10.52 / -0.07   10.78 / -0.12
         gimenez        1    10.67 / -0.22  8.75 / 0.31   7.85 / -0.12  9.29 / 0.03     9.02 / -0.23
                        2    10.46 / -0.07  8.79 / 0.28   7.67 / 0.05   9.36 / 0.00     8.99 / -0.19
                        3     10.22 / 0.09  9.00 / 0.18   7.57 / 0.03  8.79 / 0.33      8.69 / -0.12
                        4    10.73 / -0.15  8.69 / 0.28   7.81 / -0.05  9.62 / -0.03    8.86 / -0.09
                        5    10.65 / -0.16  8.65 / 0.30   7.79 / -0.02  9.71 / -0.06    8.89 / -0.12
         HHU            1    11.65 / 0.05 14.28 / -0.31   7.42 / 0.29  12.29 / -0.28    8.56 / 0.13
                        2     9.97 / 0.23   9.60 / -0.10  8.01 / 0.02  11.91 / -0.30    8.38 / 0.19
                        3    11.65 / 0.05 14.28 / -0.31   7.42 / 0.29  11.50 / -0.32    8.56 / 0.13
                        4     9.97 / 0.23   9.22 / -0.20  7.84 / 0.07  11.50 / -0.32    8.38 / 0.19
                        5    10.36 / 0.13   9.60 / -0.10  8.01 / 0.02  11.91 / -0.30    8.73 / -0.05
                        6    13.91 / -0.10 25.63 / -0.05 33.53 / 0.24  12.29 / -0.28   14.31 / 0.16
         kumar          1    10.22 / 0.36 8.60 / 0.35     7.16 / 0.33   9.60 / 0.09     9.99 / -0.20
                        2    10.04 / 0.27 10.17 / 0.04    7.36 / 0.27   9.55 / 0.11    10.16 / -0.13
         lee            1    10.19 / 0.10   9.08 / 0.00   8.43 / 0.00   9.39 / 0.06     8.59 / 0.00
                        2    12.93 / -0.18  9.26 / 0.26   9.58 / -0.06  9.93 / -0.02    9.18 / 0.21
                        3    9.78 / 0.31    8.80 / 0.25   8.21 / -0.36  8.83 / 0.24     9.11 / 0.05
                        4    12.20 / -0.19  8.98 / 0.31   8.82 / -0.04  9.77 / 0.07     9.03 / 0.26
                        5    12.38 / -0.16  8.80 / 0.31   9.22 / -0.15  9.70 / 0.02     9.05 / 0.31
         montejo        1    24.16 / 0.10 27.39 / 0.10 22.57 / 0.27    28.63 / 0.21    22.36 / -0.11
         uaemex         1    11.54 / -0.29 11.08 / -0.14 6.95 / 0.45    8.98 / 0.22     8.53 / 0.11
                        2    11.10 / -0.14 12.23 / -0.15  9.72 / 0.04   9.94 / 0.19     9.86 / -0.30
                        3     9.84 / 0.35 12.69 / -0.10   7.34 / 0.28   9.56 / 0.33    11.36 / -0.01
                        4    10.67 / 0.04   9.49 / -0.04  8.14 / 0.10   8.97 / 0.29     8.82 / 0.07
                        5    10.25 / 0.00   9.85 / 0.00   9.84 / 0.00   9.42 / 0.00    10.50 / -0.29
                        6    10.86 / 0.13   9.85 / 0.00   7.57 / 0.00   9.42 / 0.00     8.53 / 0.00
                 min          9.78 / -0.29  8.60 / -0.37  6.95 / -0.36  8.79 / -0.32    8.38 / -0.31
                  q1         10.36 / -0.08  9.00 / -0.10  7.54 / -0.05  9.38 / -0.11    8.77 / -0.14
                median       10.77 / 0.05   9.55 / 0.08   8.14 / 0.07   9.71 / -0.03   8.99 / -0.01
                mean         12.75 / 0.04 12.27 / 0.06 10.49 / 0.09    12.07 / -0.01   10.74 / -0.01
                  q3         12.20 / 0.14 12.23 / 0.21    9.58 / 0.28  11.66 / 0.07     9.99 / 0.14
                 max         29.44 / 0.36 28.80 / 0.47 33.53 / 0.62    28.63 / 0.38    22.36 / 0.33
                              Neuroticism Extroversion      Openness Agreeableness Conscientiousness
         baseline      bow 10.29 / 0.06      9.06 / 0.12  7.74 / -0.17  9.00 / 0.20     8.47 / 0.17
                       mean 10.26 / 0.00     9.06 / 0.00  7.57 / 0.00   9.04 / 0.00     8.54 / 0.00