PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde

Francisco Rangel
Autoritas Consulting, Valencia, Spain
francisco.rangel@autoritas.es

Fabio A. González
MindLab Research Group, Universidad Nacional de Colombia, Bogotá, Colombia
fagonzalezo@unal.edu.co

Felipe Restrepo-Calle
MindLab Research Group, Universidad Nacional de Colombia, Bogotá, Colombia
ferestrepoca@unal.edu.co

Manuel Montes
INAOE, Mexico
mmontesg@inaoep.mx

Paolo Rosso
PRHLT Research Center, Universitat Politècnica de València
prosso@dsic.upv.es
ABSTRACT

Author profiling consists of predicting an author's characteristics (e.g. age, gender, personality) from her writing. After addressing mainly age and gender identification at PAN@CLEF, and also personality recognition in Twitter (http://pan.webis.de/), in this PAN@FIRE track on Personality Recognition from SOurce COde (PR-SOCO) we have addressed the problem of predicting an author's personality traits from her source code. In this paper, we analyse the 48 runs sent by 11 participant teams. Given a set of source codes written in Java by students who also answered a personality test, participants had to predict personality traits based on the big five model. Results have been evaluated with two complementary measures (RMSE and Pearson product-moment correlation) that make it possible to identify whether systems with low error rates may be working by random chance. No matter the approach, openness to experience is the trait for which the participants obtained the best results on both measures.

Keywords

personality recognition; source code; author profiling

1. INTRODUCTION

Personality influences most, if not all, human activities, such as the way people write [5, 25], interact with others, and the way people make decisions. For instance, in the case of developers, personality influences the criteria they consider when selecting a software project to participate in [22], and the way they write and structure their source code. Personality is defined along five traits using the Big Five Theory [7], which is the most widely accepted in psychology. The five traits are: extroversion (E), emotional stability / neuroticism (S), agreeableness (A), conscientiousness (C), and openness to experience (O).

Personality recognition may have several practical applications, for example setting up high performance teams. In software development, not only technical skills are required, but also soft skills such as communication or teamwork. The possibility of using a tool to predict personality from source code, in order to know whether a candidate may fit in a team, may be very valuable for the recruitment process. Also in education, knowing students' personality from their source code may help to improve the learning process by customising the educational offer.

In this PAN@FIRE track on Personality Recognition from SOurce COde (PR-SOCO), we have addressed the problem of predicting an author's personality from her source code. Given the source code collection of a programmer, the aim is to identify her personality traits. In the training phase, participants were provided with source codes in Java, written by computer science students, together with their personality traits. At test time, participants received the source codes of a few programmers and had to predict their personality traits. The number of source codes per programmer is kept small to reflect a real scenario such as a job interview: the interviewer could be interested in knowing the interviewee's degree of conscientiousness by evaluating just a couple of programming problems.

We encouraged participants to investigate beyond standard n-gram based features. For example, the way the code is commented, the naming convention for identifiers or the indentation may also provide valuable information. In order to encourage the investigation of different kinds of features, several runs per participant were allowed. In this paper, we describe the participation of the 11 teams that sent 48 runs.

The remainder of this paper is organised as follows. Section 2 covers the state of the art, Section 3 describes the corpus and the evaluation measures, and Section 4 presents the approaches submitted by the participants. Sections 5 and 6 discuss results and draw conclusions, respectively.

2. RELATED WORK

Pioneering research on personality recognition was carried out by Argamon et al. [27], who focused on the identification of extroversion and emotional stability. They used support vector machines with a combination of word categories and relative frequencies of function words to recognize these traits from self-reports. Similarly, Oberlander and Nowson [21] focused on the personality identification of bloggers.
Mairesse et al. [20] analysed the impact of different sets of psycholinguistic features obtained with LIWC (http://www.liwc.net/) and MRC (http://www.psych.rl.ac.uk/), showing the highest performance on the openness to experience trait.

Recently, researchers have focused on personality recognition from social media. In [14, 24, 6], the authors analysed different sets of linguistic features as well as friend counts and daily activity. In [18], the authors reported a comprehensive analysis of features such as the size of the friendship network, the number of uploaded photos or the events attended by the user. They analysed more than 180,000 Facebook users and found correlations between these features and the different traits, especially in the case of extroversion. Using the same Facebook dataset and a similar set of features, Bachrach et al. [1] reported high results when predicting extroversion automatically.

In [26], the authors analysed 75,000 Facebook messages of volunteers who had filled in a personality test and found interesting correlations between word usage and personality traits. According to them, extroverts use more social words and introverts use more words related to solitary activities. Emotionally stable people use words related to sports, vacation, beach, church or team, whereas neurotics use more words and sentences referring to depression.

Due to the interest in this field, and with the aim of defining a common evaluation framework, some shared tasks have been organised, for example: i) the Workshop on Computational Personality Recognition [5]; or ii) the Author Profiling task at PAN 2015 [25], with the objective of identifying the age, gender and personality traits of Twitter users.

Regarding programming style and personality, in [3] the authors explored the relationship between cognitive style, personality and computer programming style. More recently, the authors of [16] also related personality to programming style and performance. Whereas the 2014 [10] and 2015 [11] PAN@FIRE tracks on SOurce COde (SOCO) were devoted to detecting reuse, in 2016 we aimed at identifying personality traits from source code.

3. EVALUATION FRAMEWORK

In this section we describe the construction of the corpus, covering its particular properties, challenges and novelties. Finally, the evaluation measures are described.

3.1 Corpus

The dataset is composed of Java programs written by computer science students from a data structures course at the Universidad Nacional de Colombia. Students were asked to upload source code, responding to the functional requirements of different programming tasks, to an automated assessment tool. For each task, students could upload more than one attempted solution. The number of attempts per problem was neither limited nor discouraged in any way. There are very similar submissions among different attempts, and some of them contain compile-time or runtime errors.

Although in most cases students uploaded the right Java source code file, some of them erroneously uploaded the compiler output, debug information or even source code in another programming language (e.g. Python). A priori this seems to be noise in the dataset, and a sensible alternative could have been to remove these entries. However, we decided to keep them for the following reasons: firstly, participant teams could easily remove them if they decided to do so; secondly, it is possible that this kind of mistake is related to some personality traits, so this information can be used as a feature as well. Finally, although we encouraged the students to write their own code, some of them could have reused pieces of code from other exercises or even looked for code excerpts in books or on the Internet.

In addition, each student answered a Big Five personality test that allowed us to calculate a numerical score for each of the following personality traits: extroversion, emotional stability / neuroticism, agreeableness, conscientiousness, and openness to experience.

Overall, the dataset consists of 2,492 source code programs written by 70 students, along with the scores of the five personality traits for each student, which are provided as floating point numbers in the continuous range [20, 80]. The source codes of each student were organised in a single text file containing all her source codes, separated by a line separator. The dataset was split into training and test subsets, the first one containing the data of 49 students and the second one the data of the remaining 21. Participants only had access to the personality trait scores of the 49 students in the training dataset.

3.2 Performance measures

For evaluating the participants' approaches we have used two complementary measures: Root Mean Square Error (RMSE) and Pearson Product-Moment Correlation (PC). The motivation for using both measures is to try to understand whether a committed error is due to random chance.

We have calculated the RMSE for each trait with Equation 1:

    RMSE_t = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }    (1)

where RMSE_t is the Root Mean Square Error for trait t (neuroticism, extroversion, openness, agreeableness, conscientiousness), and y_i and \hat{y}_i are the ground truth and predicted values, respectively, for author i. Also for each trait, PC is calculated following Equation 2:

    r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } }    (2)

where each x_i and y_i are, respectively, the ground truth and the predicted value for each author i, and \bar{x} and \bar{y} are the average values.
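Both measures are straightforward to compute. The following minimal Python sketch (ours, not the official evaluation script) follows Equations 1 and 2 directly:

    import math

    def rmse(truth, pred):
        # Root Mean Square Error for one trait (Equation 1).
        n = len(truth)
        return math.sqrt(sum((y - p) ** 2 for y, p in zip(truth, pred)) / n)

    def pearson(truth, pred):
        # Pearson product-moment correlation for one trait (Equation 2).
        n = len(truth)
        mx, my = sum(truth) / n, sum(pred) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(truth, pred))
        sx = math.sqrt(sum((x - mx) ** 2 for x in truth))
        sy = math.sqrt(sum((y - my) ** 2 for y in pred))
        # Undefined (zero denominator) when the predictions are constant,
        # as with the mean baseline; Table 1 reports 0.00 in that case.
        return cov / (sx * sy)

Both functions take one list of values per test author and are applied once per trait.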
4. OVERVIEW OF THE SUBMITTED APPROACHES

Eleven teams participated in the Personality Recognition in SOurce COde shared task (http://www.autoritas.es/prsoco/). They sent 48 runs following different approaches, and 9 of them submitted working notes describing their systems. In the following, we briefly highlight the different systems.

• besumich [23] experimented with two kinds of features, bag of words and character n-grams (with n=1,2,3). In both cases, they experimented with lowercased and original case, and with three representations: binary (presence/absence), term frequency (TF) and TF-IDF. The authors trained linear, ridge and Lasso regressions. The final configuration used to send their runs combined lowercased unigrams weighted with TF-IDF (with and without space characters) with different values for the alpha parameter of the Lasso regression (a minimal sketch of this kind of pipeline is given after this list).

• bilan [2] started by analysing the code structure with the Antlr Java Code Analyzer (https://github.com/antlr): it parses the program code and produces a parse tree for it. They then use every single node of the output tree (nodes represent different code categories, like classes, loops or variables) and count the frequency distribution of these nodes (around 200 features are taken into consideration). Apart from Antlr, they obtain a set of custom features for the source code, such as the length of the whole program, the average length of variable names, the frequency of comments and their length, what indentation the programmer is using, and also the distribution and usage of various statements and decorators. They also extract features from the comments, such as the type/token ratio, the usage of punctuation marks, the average word length and a TF-IDF vector. They trained their models with two approaches: learning from each single source code, and learning from the whole set of source codes per author.

• castellanos [4] also used Antlr with the Java grammar to obtain different measures from the analysis of the source code, for example the number of files, the average lines of code, the average number of classes, the average number of lines per class, the average attributes per class, the average methods per class, the average static methods, and so on, combined with Halstead metrics [15] such as delivered bugs, difficulty, effort, time to understand or implement, and volume. For prediction, the author experimented with support vector regression, extra trees regression, and support vector regression on averages.

• delair [8] combined style features (e.g. code layout and formatting, indentation, headers, Javadoc, comments, whitespace) with content features (e.g. class design problems, method design problems, annotations, block checks, coding, imports, metrics, modifiers, naming conventions, size violations). They trained a support vector machine for regression, gaussian processes, M5, M5 rules and random trees.

• doval [9] approached the task with a shallow Long Short-Term Memory (LSTM) recurrent neural network. It works at the byte level, meaning that at each time step a new byte from the input text is processed by the network in an ordered manner. The bytes belonging to a particular source code package in an input text file are considered as a sequence, where the processing of a byte at time step t is influenced by the previous time steps t-1, t-2, ..., 0 (the initial time step). The network learning criterion is a smoothed mean absolute error, which uses a squared term when the absolute element-wise error falls below 1 (see the byte-level sketch after this list).

• gimenez [13] proposed two different approaches to tackle this task. On the one hand, each code sample from each author was taken as an independent sample and vectorized using word n-grams; on the other hand, all the codes from an author were taken as a single sample vectorized using word n-grams together with hand-crafted features (e.g. the number of codes implementing the same class, the appearance of pieces of code suspected of plagiarism, the number of developed classes, the number of different classes). In both approaches, a logistic regression model was trained.

• hhu [19] extracted structure features (e.g. number of methods per class, length of function names, cyclomatic complexity) and style features (e.g. length of methods per class, number of comments per class), but ignored layout features (e.g. indentation) because they can easily be modified by the programming IDE. They used the variance and range, besides the mean, to aggregate the frequencies, and then constructed a separate model for each trait, training both linear regression and nearest neighbour models.

• kumar [12] used multiple linear regression to model each of the five personality traits. For each personality trait, they used four features: i) the number of genuine comment words in multi-line comments, i.e., between /* and */, found in the program code; ii) the number of genuine comment words in single-line comments, i.e., comments following "//" (for both this feature and the previous one, they did not consider cases where lines of code are commented out, and the feature value is normalized by dividing it by the total number of words in the program file); iii) the number of lines with missing spaces, e.g., for (int i=1; i<=cases; i++) as opposed to for (int i = 1; i <= cases; i++), since the presence of spaces is supposed to be good programming practice (this feature value is normalized by dividing it by the total number of lines in the program file); and iv) the number of instances where the programmer imported only specific libraries (e.g. import java.io.FileNotFoundException as opposed to import java.io.*), as this is supposed to be good programming practice (this feature value was also normalized with respect to the total number of lines in the program file). A sketch of these features is given after this list.

• uaemex [28] obtained three types of features, related to: i) indentation: spaces in the code, spaces in the comments, spaces between classes, spaces between source code blocks, spaces between methods, spaces between control sentences, and spaces around the grouping characters "( ), [ ], { }"; ii) identifiers: the presence of underscore, uppercase, lowercase and numeric characters in the identifier, and the length of the identifier, extracted for each class, method and variable name, plus the percentage of initialized variables; and iii) comments: the presence of line and block comments, the size of the comments, and the presence of comments with all letters in uppercase. They experimented with symbolic regression, support vector machines, k-nearest neighbours, and neural networks.
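To make the n-gram based approaches concrete, the following is a minimal sketch of a besumich-style pipeline (lowercased character unigrams, TF-IDF weighting, Lasso regression) written with scikit-learn; the toy data, and the choice of scikit-learn itself, are our own illustration rather than the team's actual code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Lasso
    from sklearn.pipeline import make_pipeline

    # Toy data: one document per author (all her codes concatenated)
    # and one ground-truth score per author for a single trait.
    train_codes = ["public class A { int x=1; }",
                   "public class B { /* sort */ void s() {} }",
                   "import java.io.*; public class C {}"]
    train_scores = [48.0, 55.0, 52.0]  # e.g. openness, range [20, 80]

    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 1), lowercase=True),
        Lasso(alpha=0.01),  # one of the alpha values tuned by besumich
    )
    model.fit(train_codes, train_scores)
    print(model.predict(["public class D { int y = 2; }"]))

One such model is trained per trait; as discussed in Section 5, this configuration with alpha 0.01 (run 5) turned out to be particularly competitive for extroversion.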
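The kumar features can likewise be made concrete. The sketch below is our reading of their description; the regular expressions are ours, and, unlike the original system, it makes no attempt to skip commented-out lines of code:

    import re

    def kumar_features(code: str):
        # Four normalized features in the spirit of kumar [12].
        lines = code.splitlines()
        n_words = max(len(code.split()), 1)
        n_lines = max(len(lines), 1)

        # i) words inside /* ... */ multi-line comments
        block = sum(len(m.split())
                    for m in re.findall(r"/\*(.*?)\*/", code, re.S))
        # ii) words in // single-line comments
        single = sum(len(l.split("//", 1)[1].split())
                     for l in lines if "//" in l)
        # iii) lines with missing spaces around operators, e.g. "i=1"
        cramped = sum(1 for l in lines if re.search(r"\w[=<>]\w", l))
        # iv) specific (non-wildcard) imports
        specific = sum(1 for l in lines
                       if l.strip().startswith("import")
                       and not l.rstrip().endswith("*;"))

        return (block / n_words, single / n_words,
                cramped / n_lines, specific / n_lines)

The four returned values feed the per-trait linear regressions.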
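The learning criterion described for doval corresponds to what deep learning libraries call a smooth L1 (Huber-style) loss. A minimal byte-level sketch follows, assuming PyTorch purely for illustration (the working notes do not prescribe a framework, and the real model's dimensions differ):

    import torch
    import torch.nn as nn

    class ByteLSTM(nn.Module):
        # Shallow byte-level LSTM regressor in the spirit of doval [9].
        def __init__(self, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(256, 32)  # one embedding per byte value
            self.lstm = nn.LSTM(32, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)     # one trait score

        def forward(self, byte_ids):            # byte_ids: (batch, seq_len)
            h, _ = self.lstm(self.embed(byte_ids))
            return self.out(h[:, -1]).squeeze(-1)

    model = ByteLSTM()
    # Squared term when |error| < 1, absolute term otherwise.
    criterion = nn.SmoothL1Loss()
    x = torch.tensor([list(b"public class A {}")])  # raw bytes of a snippet
    loss = criterion(model(x), torch.tensor([50.0]))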
Although montejo did not send working notes, they sent us a brief description of their system. They used ToneAnalyzer (https://tone-analyzer-demo.mybluemix.net/), an IBM Watson module that proposes a value for each big five trait for a given text. The authors ran ToneAnalyzer on the source code as it is and rescaled the output to fit the right range for the traits. Similarly, lee sent us a description of their system. Their hypothesis is that, depending on the personality, there will be differences between the successive versions (steps) of the source codes. Given the i-th coder and n source codes for coder c_i, the authors sorted the codes by length, naming them c_i^0 to c_i^{n-1}. They transformed each code into a vector v_i^j using skip-thought encoding [17], and then calculated n-1 difference vectors d_i^j using the equation d_i^j = v_i^{j+1} - v_i^j. The authors map each coder into the feature space given by Sum(d_i) and Avg(d_i), and then apply a logistic regression algorithm to train a model.

Furthermore, we have provided two baselines (a sketch of both is given after this list):

• bow: a bag of character 3-grams with frequency weighting.

• mean: an approach that always predicts the mean value observed in the training data.
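The two baselines amount to something like the following reconstruction in scikit-learn. The overview does not state which regressor sits behind the bow features, so plain linear regression is our assumption here:

    from sklearn.dummy import DummyRegressor
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    # bow baseline: character 3-gram frequencies plus a regressor
    # (the choice of LinearRegression is our assumption).
    bow_baseline = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(3, 3)),
        LinearRegression(),
    )

    # mean baseline: always predict the training mean. Its Pearson
    # correlation is undefined for a constant output, which is
    # presumably why it appears as 0.00 in Table 1.
    mean_baseline = DummyRegressor(strategy="mean")

Each baseline is fitted once per trait with the training authors' concatenated codes and trait scores.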

5. EVALUATION AND DISCUSSION OF THE SUBMITTED APPROACHES

Results are presented in Table 1 in alphabetical order. Below the participants' results, a summary with the common descriptive statistics is provided for each trait. At the bottom of the table, results for the baselines are also provided. Figures 1 to 3 show the distribution of the two measures, RMSE and Pearson correlation, for all the participants except the baselines. In Figure 1 we can appreciate that there are many runs with anomalous RMSE values (outliers), whereas in Figure 2 these outliers have been removed. Looking at these figures and at the table of results, we can observe the following:

• The mean is between 10.49 and 12.75 (a difference of 2.26), with the lowest value corresponding to openness and the highest one to neuroticism.

• The median is between 8.14 and 10.77 (a difference of 2.63), with again the lowest value corresponding to openness and the highest one to neuroticism.

• The lowest differences between mean and median were obtained for conscientiousness (1.75), followed by neuroticism (1.98). The highest were obtained for extroversion (2.72), agreeableness (2.36) and openness (2.35).

• In all cases, the mean is higher than the median, and also higher than the 3rd quartile (q3), showing the effect of the outliers.

• The minimum and maximum values were both obtained for the openness trait (6.95 and 33.53 respectively).

• When removing outliers, the maximum value was obtained for extroversion (16.67).

• The lowest quartiles, both 1st and 3rd (q1 and q3), correspond to openness (7.54 and 9.58 respectively).

• The narrowest interquartile range corresponds to conscientiousness (1.22), followed by neuroticism (1.84) and openness (2.04). The widest corresponds to extroversion (3.23), followed by agreeableness (2.28).

In Figure 3 the distribution of the Pearson correlations is shown. Looking at this figure and at the table of results, we can observe the following:

• There is only one outlier in the agreeableness trait (0.38). Unfortunately, this correlation corresponds to a high RMSE value (25.53).

• The mean is between −0.01 and 0.09 (a difference of 0.10), with the lowest value corresponding to conscientiousness and agreeableness, and the highest one to openness. In any case, these values are very close to random chance.

• The median is between −0.03 and 0.08 (a difference of 0.11), with the lowest value corresponding to agreeableness and the highest one to extroversion.

• The lowest differences between mean and median were obtained for conscientiousness (0), followed by neuroticism (0.01), and extroversion, agreeableness and openness (0.02).

• The mean is higher than the median in the case of openness (0.09 vs. 0.07) and agreeableness (−0.01 vs. −0.03). The opposite occurs in the case of neuroticism (0.04 vs. 0.05), extroversion (0.06 vs. 0.08), and conscientiousness (−0.01 for both).

• The minimum value was obtained for the extroversion trait (−0.37), very close to openness (−0.36), and the maximum for openness (0.62), followed by extroversion (0.47), agreeableness (0.38), neuroticism (0.36) and conscientiousness (0.33).

• Notwithstanding the goodness of the maximum values, in most cases they correspond to high RMSE: openness (23.62), extroversion (28.80), agreeableness (25.53), and conscientiousness (22.05). Only in the case of neuroticism does the maximum Pearson correlation correspond to a low RMSE value (10.22).

• The highest q3 corresponds to openness (0.28) and extroversion (0.21), followed by conscientiousness (0.14) and neuroticism (0.14). The lowest corresponds to agreeableness (0.07).

• The narrowest interquartile range corresponds to agreeableness (0.18), followed by neuroticism (0.22), conscientiousness (0.28), extroversion (0.31) and openness (0.33).

We can conclude that, in general, systems performed similarly in terms of Pearson correlation for all the traits. However, there seem to be larger differences with respect to RMSE, where the systems obtained better results for openness than for the rest. The distributions show that the lowest sparsity occurs with conscientiousness in the case of RMSE and with agreeableness in the case of Pearson correlation, while the highest sparsity occurs with extroversion in the case of RMSE and with openness in the case of Pearson correlation.
Results for neuroticism are plotted in Figure 4. This figure represents each system's results by plotting its RMSE on the x axis and its Pearson correlation on the y axis. It is worth mentioning that the system proposed by delair in their 4th run obtained one of the highest values of Pearson correlation (0.29), although with a high RMSE (17.55). This system consists of a combination of style features (code layout and formatting, indentation...) and content features (class design, method design, imports...), trained with random trees. We can also observe a group of five points (actually six systems, since two of them obtained the same results) in the upper-left corner of the chart. These systems obtained the highest correlations with the lowest error, and they are detailed in Figure 5. We can see that all of them (except lee, which used skip-thought encoding) extracted specific features from the source code, such as the number of methods, the number of comments per class, the type of comments (/* */ vs. inline), the type of variable naming, and so on. We can also see that some of these teams obtained similar results with two of their systems. For example, kumar with their 1st and 2nd runs (they used linear regression for both runs, but tried to optimise run 2 by removing from the training set the three files which obtained the highest error in training), or hhu, which obtained their best results with their 2nd and 4th runs (both used k-NN with a different combination of features). Uaemex obtained their best result with run 3, which used neural networks. We can conclude that, for neuroticism, specific features extracted from the code (kumar, hhu, uaemex) worked better than generic features such as n-grams (besumich, which obtained low RMSE but no correlation in most cases), byte streams (doval, which obtained low RMSE but negative correlations in most cases) or text streams (montejo, which obtained high RMSE with low correlations).

In Figure 6, results for extroversion are shown. We can see that doval in their 4th run obtained the highest Pearson correlation (0.47), but with the worst RMSE (28.80). They trained an LSTM recurrent neural network on byte-level input, that is, without the need to perform feature engineering. In the upper-left corner of the figure we can see the group with the best results in both RMSE and Pearson correlation, which is detailed in Figure 7. We can highlight the superiority of besumich run 5 (lowercased character unigrams weighted with TF-IDF, training a Lasso regression algorithm with alpha 0.01), which obtained a correlation of 0.38 with an RMSE of 8.60, and kumar run 1 (code-specific features with linear regression, without optimisation), with a correlation of 0.35 and an RMSE of 8.60. It is worth mentioning that lee obtained high results with four of their approaches, which use skip-thought encoding, and something similar occurred with gimenez. The latter used a combination of word n-grams with specific features obtained from the code (the number of codes that implemented the same class, the appearance of pieces of code suspected of plagiarism, the number of classes developed, and the number of different classes developed), trained with ridge in runs 1 (8.75 / 0.31) and 2 (8.79 / 0.28), and with logistic regression in run 4 (8.69 / 0.28). In the case of extroversion we can see that common features such as n-grams (besumich) obtained good results; gimenez also used word n-grams in combination with other features, which supports this conclusion. However, byte streams (doval) again produced high RMSE with high correlation, and text streams (montejo) produced high RMSE with low correlation. In some cases, specific features obtained low RMSE but with negative correlation (bilan, hhu, uaemex). Although the bow-based baseline is not among the top performing methods, it obtained low RMSE (9.06) with an above-median correlation (0.12).

Similarly, openness results are presented in Figure 8. It is noticeable that two systems presented by delair obtained the highest correlations, but with quite high RMSE. Concretely, run 1 obtained the highest correlation (0.62) with a high RMSE (23.62), and run 3 obtained the second highest correlation (0.54) with a slightly lower RMSE (20.28); they used M5rules and M5P respectively. The systems in the upper-left corner are shown in detail in Figure 9. We can see that the best result in both RMSE and Pearson correlation was obtained by uaemex in their 1st run. This run was generated using symbolic regression with three types of features: indentation, identifiers and comments. The authors optimised this run by eliminating the source codes of five developers according to the following criteria: the person with high values in all the personality traits, the person with low values in all the personality traits, the person with average values in all the personality traits, the person with the most source codes, and the person with the fewest source codes. They also obtained high results with their 3rd run, where they trained a back propagation neural network with the whole set of training codes. The systems presented by bilan also obtained high results in different runs; concretely, using the Antlr parser to obtain features, in combination with features extracted from comments and so on, they trained gradient boosted regression and multinomial logistic regression. Similar results were obtained by castellanos, who also used Antlr combined with Halstead measures and trained an extra trees regressor (run 2) and support vector regression on averages (run 3); by kumar, with combinations of structure and style features trained with linear regression (2nd run, optimised by eliminating training files); and by hhu, also with combinations of structure and style features, with k-NN in both runs. For openness, the best performing teams used specific features extracted from the code (uaemex, kumar, hhu), even with the help of code analysers such as Antlr (castellanos, bilan). Common features seem to obtain good levels of RMSE but with low (or even negative) correlations (besumich, bow-based baseline).

In the case of agreeableness, as shown in Figure 10, we can see that doval with their 4th run obtained the highest correlation (0.38), but with a high RMSE (25.53). The systems in the upper-left corner are shown in detail in Figure 11. The best result in both measures was obtained by gimenez in their 3rd run; the team used ridge regression to train their model with a subset of code style features. It is worth mentioning that the provided baseline consisting of character n-grams appears as one of the top performing methods for this trait. For this trait it is more difficult to differentiate between common and specific features, since there are many different teams that, although they obtained low RMSE, have negative correlations: for example, besumich with character n-grams, bilan and castellanos with specific features obtained with Antlr (among others), or delair with a combination of style and content features. However, it is worth mentioning that the bow baseline obtained top results in both RMSE and Pearson correlation.
Finally, with respect to conscientiousness, results are depicted in Figure 12. We can see that four runs obtained high values of Pearson correlation but also high RMSE. Concretely, delair obtained the highest correlation (0.33) with the second highest RMSE (22.05) in their 1st and 3rd runs (M5rules and M5P respectively), and also a high correlation (0.27) with a slightly lower RMSE (15.53) in their 5th run (support vector machine for regression). Similarly, doval with their 4th run obtained a high correlation (0.32) but with a high RMSE (14.69), using an LSTM recurrent neural network with byte-level input. The systems in the upper-left corner are represented in Figure 13. In this case, the best results in terms of RMSE are not the best ones in terms of Pearson correlation: among the former, hhu with runs 1, 2 and 3, and uaemex with run 1; among the latter, lee with runs 2, 4 and 5, bilan with runs 4 and 5, and doval with run 3. It is noticeable that the provided baseline again obtained one of the best results, in this case the second best RMSE with one of the top 5 correlations. For conscientiousness, systems that used n-grams (besumich, gimenez), byte streams (doval) and text streams (montejo) performed worst in terms of Pearson correlation, with negative values in most cases, whereas the best results were achieved by combinations of structure, style and comment features (hhu, uaemex) or by features obtained by analysing the codes (bilan). However, again the bow baseline achieved top positions, especially in RMSE.

To sum up, depending on the trait, generic features such as n-grams obtained different results in comparison with specific features obtained from the code. In the case of generic features, their impact shows up especially in the correlation: they may obtain good levels of RMSE but without a good correlation. As expected, the mean-based baseline obtained no correlation, since its constant prediction carries no information about individual authors. However, its RMSE was better than the average and median results in most cases. This result supports the need to also use a measure like Pearson correlation, in order to avoid crediting low RMSE values that are due to random chance.

6. CONCLUSION

This paper describes the 48 runs sent by 11 participants to the PR-SOCO shared task at PAN@FIRE 2016. Given a set of source codes written in Java by students who answered a personality test, the participants had to predict values for the big five traits.

Results have been evaluated with two complementary measures: RMSE, which provides an overall score of the performance of a system, and Pearson product-moment correlation, which indicates whether the performance is due to random chance. In general, systems worked quite similarly in terms of Pearson correlation for all traits; larger differences were noticed with respect to RMSE. The best results were achieved for openness (6.95), in line with what was previously reported by Mairesse et al. [20]; openness was also one of the traits with the lowest RMSE at PAN 2015 [25] for most languages.

Participants used different kinds of features: from general ones, such as word or character n-grams, to specific ones obtained by parsing the code and analysing its structure, style or comments. Depending on the trait, generic features obtained competitive results compared with specific ones in terms of RMSE. However, in most cases the best RMSE obtained with these features came with low values of the Pearson correlation. In these cases, some systems seemed to be less robust, at least for some of the personality traits.

Finally, in line with the above comments, it is worth mentioning that approaches that took advantage of the training distributions (as the baseline based on means did) obtained low RMSE. However, this may be due to random chance. This supports the need for measures complementary to RMSE, such as Pearson correlation, in order to avoid misinterpretations due to a biased measure.

7. ACKNOWLEDGMENTS

Our special thanks go to all of the PR-SOCO participants. The work of the first author was partially supported by Autoritas Consulting and by the Ministerio de Economía y Competitividad de España under grant ECOPORTUNITY IPT-2012-1220-430000. The work of the fifth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under grant ALMAMATER (PrometeoII/2014/030).

8. REFERENCES

[1] Y. Bachrach, M. Kosinski, T. Graepel, P. Kohli, and D. Stillwell. Personality and patterns of facebook usage. In Proceedings of the ACM Web Science Conference, pages 36-44. ACM, New York, NY, USA, 2012.
[2] I. Bilan, E. Saller, B. Roth, and M. Krytchak. Caps-prc: A system for personality recognition in programming code - notebook for PAN at FIRE16. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] C. Bishop-Clark. Cognitive style, personality, and computer programming. Computers in Human Behavior, 11(2):241-260, 1995.
[4] H. A. Castellanos. Personality recognition applying machine learning techniques on source code metrics. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[5] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez, G. Riccardi, and F. Pianesi. The workshop on computational personality recognition 2014. In Proceedings of the ACM International Conference on Multimedia, pages 1245-1246. ACM, 2014.
[6] F. Celli and L. Polonio. Relationships between personality and interactions in facebook. In Social Networking: Recent Trends, Emerging Issues and Future Outlook, pages 41-54. Nova Science Publishers, Inc, 2013.
[7] P. T. Costa and R. R. McCrae. The revised NEO personality inventory (NEO-PI-R). The SAGE Handbook of Personality Theory and Assessment, 2:179-198, 2008.
[8] R. Delair and R. Mahajan. Personality recognition in source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[9] Y. Doval, C. Gómez-Rodríguez, and J. Vilares. Shallow recurrent neural network for personality recognition in source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[10] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. PAN@FIRE: Overview of SOCO track on the detection of source code re-use. In Notebook Papers of FIRE 2014, FIRE-2014, Bangalore, India, 2014.
[11] E. Flores, P. Rosso, L. Moreno, and E. Villatoro-Tello. PAN@FIRE 2015: Overview of CL-SOCO track on the detection of cross-language source code re-use. In Proceedings of the Seventh Forum for Information Retrieval Evaluation (FIRE 2015), Gandhinagar, India, pages 4-6, 2015.
[12] K. Ghosh and S. Kumar-Parui. Indian Statistical Institute, Kolkata at PR-SOCO 2016: A simple linear regression based approach. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[13] M. Giménez and R. Paredes. PRHLT at PR-SOCO: A regression model for predicting personality traits from source code - notebook for PR-SOCO at FIRE 2016. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[14] J. Golbeck, C. Robles, and K. Turner. Predicting personality with social media. In CHI'11 Extended Abstracts on Human Factors in Computing Systems, pages 253-262. ACM, 2011.
[15] M. H. Halstead. Elements of Software Science. Operating and Programming Systems Series, vol. 2, 1977.
[16] Z. Karimi, A. Baraani-Dastjerdi, N. Ghasem-Aghaee, and S. Wagner. Links between the personalities, styles and performance in computer programming. Journal of Systems and Software, 111:228-241, 2016.
[17] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294-3302, 2015.
[18] M. Kosinski, Y. Bachrach, P. Kohli, D. Stillwell, and T. Graepel. Manifestations of user personality in website choice and behaviour on online social networks. Machine Learning, pages 1-24, 2013.
[19] M. Liebeck, P. Modaresi, A. Askinadze, and S. Conrad. Pisco: A computational approach to predict personality types from Java source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[20] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30(1):457-500, 2007.
[21] J. Oberlander and S. Nowson. Whose thumb is it anyway? Classifying author personality from weblog text. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 627-634. Association for Computational Linguistics, 2006.
[22] O. H. Paruma-Pabón, F. A. González, J. Aponte, J. E. Camargo, and F. Restrepo-Calle. Finding relationships between socio-technical aspects and personality traits by mining developer e-mails. In Proceedings of the 9th International Workshop on Cooperative and Human Aspects of Software Engineering, pages 8-14. ACM, 2016.
[23] S. Phani, S. Lahiri, and A. Biswas. Personality recognition working note: Team besumich. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[24] D. Quercia, R. Lambiotte, D. Stillwell, M. Kosinski, and J. Crowcroft. The personality of popular facebook users. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, pages 955-964. ACM, 2012.
[25] F. Rangel, P. Rosso, M. Potthast, B. Stein, and W. Daelemans. Overview of the 3rd author profiling task at PAN 2015. In L. Cappellato, N. Ferro, G. Jones, and E. San Juan (Eds.), CLEF 2015 Labs and Workshops, Notebook Papers, CEUR Workshop Proceedings, vol. 1391. CEUR-WS.org, 2015.
[26] H. A. Schwartz, J. C. Eichstaedt, M. L. Kern, L. Dziurzynski, S. M. Ramones, M. Agrawal, A. Shah, M. Kosinski, D. Stillwell, M. E. Seligman, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE, 8(9):773-791, 2013.
[27] S. A. Sushant, S. Argamon, S. Dhawle, and J. W. Pennebaker. Lexical predictors of personality type. In Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of North America, 2005.
[28] E. Vázquez-Vázquez, O. González-Brito, J. Armeaga-García, M. García-Calderón, G. Villada-Ramírez, A. J. Serrano-León, R. A. García-Hernández, and Y. Ledeneva. UAEMex system for identifying personality traits in source code. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
Figure 1: RMSE distribution.
Figure 2: RMSE distribution (without outliers).
Figure 3: Pearson correlation distribution.
Figure 4: RMSE vs. PC for neuroticism.
Figure 5: RMSE vs. PC for neuroticism (detailed).
Figure 6: RMSE vs. PC for extroversion.
Figure 7: RMSE vs. PC for extroversion (detailed).
Figure 8: RMSE vs. PC for openness.
Figure 9: RMSE vs. PC for openness (detailed).
Figure 10: RMSE vs. PC for agreeableness.
Figure 11: RMSE vs. PC for agreeableness (detailed).
Figure 12: RMSE vs. PC for conscientiousness.
Figure 13: RMSE vs. PC for conscientiousness (detailed).
Table 1: Participants' results in terms of root mean square error and Pearson product-moment correlation (each cell shows RMSE / PC).
         Team          Run    Neuroticism Extroversion      Openness Agreeableness Conscientiousness
         besumich       1    10.69 / 0.05   9.00 / 0.14   8.58 / -0.33  9.38 / -0.09    8.89 / -0.14
                        2    10.69 / 0.05   9.00 / 0.14   8.58 / -0.33  9.38 / -0.09    8.89 / -0.14
                        3    10.53 / 0.05   9.05 / 0.10   8.43 / -0.33  9.32 / -0.07    8.88 / -0.17
                        4    10.53 / 0.05   9.05 / 0.10   8.43 / -0.33  9.32 / -0.07    8.88 / -0.17
                        5    10.83 / 0.10 8.60 / 0.38     9.06 / -0.31  9.66 / -0.10    8.77 / -0.06
         bilan          1    10.42 / 0.04   8.96 / 0.16   7.54 / 0.10   9.16 / 0.04     8.61 / 0.07
                        2    10.28 / 0.14   9.55 / -0.10  7.25 / 0.29   9.17 / -0.12    8.83 / -0.31
                        3    10.77 / -0.12  9.35 / -0.07  7.19 / 0.36   8.84 / 0.21     8.99 / -0.11
                        4    12.06 / -0.04 11.18 / -0.35  7.50 / 0.35  10.89 / -0.05    8.90 / 0.16
                        5    11.95 / 0.06 11.69 / -0.37   7.46 / 0.37  11.19 / -0.05    9.10 / 0.11
         castellanos    1    11.83 / 0.05   9.54 / 0.11   8.14 / 0.28  10.48 / -0.08    8.39 / -0.09
                        2    10.31 / 0.02   9.06 / 0.00   7.27 / 0.29   9.61 / -0.11    8.47 / -0.16
                        3    10.24 / 0.03   9.01 / 0.01   7.34 / 0.30   9.36 / 0.01     9.99 / -0.25
         delair         1    19.07 / 0.20 25.22 / 0.08 23.62 / 0.62    21.47 / -0.15   22.05 / 0.33
                        2    26.36 / 0.19 16.67 / -0.02 15.97 / 0.19   23.11 / -0.13   21.72 / 0.10
                        3    18.75 / 0.20 25.22 / 0.08 20.28 / 0.54    21.47 / -0.15   22.05 / 0.33
                        4    17.55 / 0.29 20.34 / -0.26 16.74 / 0.27   21.10 / -0.06   20.90 / 0.14
                        5    26.72 / 0.18 23.41 / -0.11 16.25 / 0.13   27.78 / -0.19   15.53 / 0.27
         doval          1    11.99 / -0.01 11.18 / 0.09 12.27 / -0.05  10.31 / 0.20     8.85 / 0.02
                        2    12.63 / -0.18 11.81 / 0.21   8.19 / -0.02 12.69 / -0.01    9.91 / -0.30
                        3    10.37 / 0.14 12.50 / 0.00    9.25 / 0.11  11.66 / -0.14    8.89 / 0.15
                        4    29.44 / -0.24 28.80 / 0.47 27.81 / -0.14  25.53 / 0.38    14.69 / 0.32
                        5    11.34 / 0.05 11.71 / 0.19 10.93 / 0.12    10.52 / -0.07   10.78 / -0.12
         gimenez        1    10.67 / -0.22  8.75 / 0.31   7.85 / -0.12  9.29 / 0.03     9.02 / -0.23
                        2    10.46 / -0.07  8.79 / 0.28   7.67 / 0.05   9.36 / 0.00     8.99 / -0.19
                        3     10.22 / 0.09  9.00 / 0.18   7.57 / 0.03  8.79 / 0.33      8.69 / -0.12
                        4    10.73 / -0.15  8.69 / 0.28   7.81 / -0.05  9.62 / -0.03    8.86 / -0.09
                        5    10.65 / -0.16  8.65 / 0.30   7.79 / -0.02  9.71 / -0.06    8.89 / -0.12
         HHU            1    11.65 / 0.05 14.28 / -0.31   7.42 / 0.29  12.29 / -0.28    8.56 / 0.13
                        2     9.97 / 0.23   9.60 / -0.10  8.01 / 0.02  11.91 / -0.30    8.38 / 0.19
                        3    11.65 / 0.05 14.28 / -0.31   7.42 / 0.29  11.50 / -0.32    8.56 / 0.13
                        4     9.97 / 0.23   9.22 / -0.20  7.84 / 0.07  11.50 / -0.32    8.38 / 0.19
                        5    10.36 / 0.13   9.60 / -0.10  8.01 / 0.02  11.91 / -0.30    8.73 / -0.05
                        6    13.91 / -0.10 25.63 / -0.05 33.53 / 0.24  12.29 / -0.28   14.31 / 0.16
         kumar          1    10.22 / 0.36 8.60 / 0.35     7.16 / 0.33   9.60 / 0.09     9.99 / -0.20
                        2    10.04 / 0.27 10.17 / 0.04    7.36 / 0.27   9.55 / 0.11    10.16 / -0.13
         lee            1    10.19 / 0.10   9.08 / 0.00   8.43 / 0.00   9.39 / 0.06     8.59 / 0.00
                        2    12.93 / -0.18  9.26 / 0.26   9.58 / -0.06  9.93 / -0.02    9.18 / 0.21
                        3    9.78 / 0.31    8.80 / 0.25   8.21 / -0.36  8.83 / 0.24     9.11 / 0.05
                        4    12.20 / -0.19  8.98 / 0.31   8.82 / -0.04  9.77 / 0.07     9.03 / 0.26
                        5    12.38 / -0.16  8.80 / 0.31   9.22 / -0.15  9.70 / 0.02     9.05 / 0.31
         montejo        1    24.16 / 0.10 27.39 / 0.10 22.57 / 0.27    28.63 / 0.21    22.36 / -0.11
         uaemex         1    11.54 / -0.29 11.08 / -0.14 6.95 / 0.45    8.98 / 0.22     8.53 / 0.11
                        2    11.10 / -0.14 12.23 / -0.15  9.72 / 0.04   9.94 / 0.19     9.86 / -0.30
                        3     9.84 / 0.35 12.69 / -0.10   7.34 / 0.28   9.56 / 0.33    11.36 / -0.01
                        4    10.67 / 0.04   9.49 / -0.04  8.14 / 0.10   8.97 / 0.29     8.82 / 0.07
                        5    10.25 / 0.00   9.85 / 0.00   9.84 / 0.00   9.42 / 0.00    10.50 / -0.29
                        6    10.86 / 0.13   9.85 / 0.00   7.57 / 0.00   9.42 / 0.00     8.53 / 0.00
                 min          9.78 / -0.29  8.60 / -0.37  6.95 / -0.36  8.79 / -0.32    8.38 / -0.31
                  q1         10.36 / -0.08  9.00 / -0.10  7.54 / -0.05  9.38 / -0.11    8.77 / -0.14
                median       10.77 / 0.05   9.55 / 0.08   8.14 / 0.07   9.71 / -0.03   8.99 / -0.01
                mean         12.75 / 0.04 12.27 / 0.06 10.49 / 0.09    12.07 / -0.01   10.74 / -0.01
                  q3         12.20 / 0.14 12.23 / 0.21    9.58 / 0.28  11.66 / 0.07     9.99 / 0.14
                 max         29.44 / 0.36 28.80 / 0.47 33.53 / 0.62    28.63 / 0.38    22.36 / 0.33
                              Neuroticism Extroversion      Openness Agreeableness Conscientiousness
         baseline      bow 10.29 / 0.06      9.06 / 0.12  7.74 / -0.17  9.00 / 0.20     8.47 / 0.17
                       mean 10.26 / 0.00     9.06 / 0.00  7.57 / 0.00   9.04 / 0.00     8.54 / 0.00