             Shallow recurrent neural network for personality
                       recognition in source code

Yerai Doval
Grupo COLE, Departamento de Computación
E.S. de Enxeñaría Informática, Universidade de Vigo
Campus As Lagoas, 32004 – Ourense (Spain)
yerai.doval@uvigo.es

Carlos Gómez-Rodríguez
Grupo LYS, Departamento de Computación
Facultade de Informática, Universidade da Coruña
Campus de Elviña, 15071 – A Coruña (Spain)
cgomezr@udc.es

Jesús Vilares
Grupo LYS, Departamento de Computación
Facultade de Informática, Universidade da Coruña
Campus de Elviña, 15071 – A Coruña (Spain)
jvilares@udc.es

ABSTRACT
Personality recognition in source code constitutes a novel task in the field of author profiling on written text. In this paper we describe our proposal for the PR-SOCO shared task at FIRE 2016, which is based on a shallow recurrent LSTM neural network that tries to predict five personality traits of the author given a source code fragment. Our preliminary results show that it should be possible to tackle the problem at hand with our approach, but also that there is still room for improvement through more complex network architectures and training processes.

CCS Concepts
•Applied computing → Law, social and behavioral sciences; Psychology; Document analysis; •Human-centered computing → Text input; •Social and professional topics → User characteristics; •Computing methodologies → Natural language processing; Neural networks;

Keywords
personality recognition, source code, recurrent neural network, LSTM

1.   INTRODUCTION
   Written text can tell us a lot about its author. Demographic information such as age, gender or specific personality traits of the author can be inferred by a human expert by merely observing a written text fragment [7]. This task is called author profiling, and it can also be applied to other channels such as speech or body language. But detecting the patterns which allow for this kind of information extraction is not restricted to humans, as we will see in this work.
   Source code is another form of written text, and it is becoming very accessible as software developers are now able to easily publish their work on the Web through services such as Github (https://github.com/) or Bitbucket (https://bitbucket.org/). Although more constrained and formal than natural language, source code text may also have something to tell us about its author, as there is still room for personal preferences in its writing. For instance, some coders tend to use block delimiters even when they are not necessary, or to add a certain number of blank lines in order to clearly separate two function declarations. Moreover, variable and function names are custom made by the coder, and comments include information in natural language. Therefore, it sounds reasonable to take advantage of this type of pattern to attempt to extract information about the author of a source code fragment, which constitutes a novel task in the author profiling field.
   In this work, we describe our contribution to the Personality Recognition in SOurce COde (PR-SOCO) shared task [13], held in conjunction with FIRE 2016. The objective of this task is to quantify five personality traits of the author of a given source code fragment, namely, the standard traits from the Big Five Theory [6]: extroversion, emotional stability/neuroticism, agreeableness, conscientiousness and openness to experience. To achieve this, we propose using a shallow recurrent neural network that, taking as input the sequence of bytes in an input source code text, will try to predict the five values for the corresponding traits of its author. By reading the most elementary unit available for encoded text, the byte (in most cases directly aligned with individual characters), we seek to find all possible useful patterns carved deep into the text. Furthermore, with this approach we are not limiting our models to those patterns a human can grasp, but we are enabling the neural network to extract any information it may consider useful for the task.
   The results obtained with our shallow networks are encouraging with respect to the root mean squared error metric (RMSE), which is aligned with the smoothed mean absolute error criterion employed in our training process. However, they do not perform so well for Pearson correlation (PC), which we have not considered at this time. We have also found that the use of more layers in our networks can improve their performance, agreeing with previous work [17].

2.   RELATED WORK
   There has been a recent surge of interest in author profiling related to personality recognition [14, 3].
   For written text, traditional author profiling approaches tend to rely on lexical and syntactical features, such as identification of key words, part-of-speech tags [1] or n-grams [10], paired with statistical models such as Hidden Markov Models. There is also work which studies the application of
these traditional techniques on short informal texts, which
often translates into lower performance figures than those
obtained for regular texts [18].
   However, author profiling is not restricted to written text.
Mairesse et al. [12] extend this type of analysis to speech, where features such as sound frequencies and the duration of pauses made by the speaker are considered. Biel et al. [2] go one step further by analysing YouTube videos and adding
what they call “nonverbal cues” to the feature set, which
take into consideration the different types of motion that can
be observed in the video. There are even approaches that
analyse the structure and topology of the social network of
subjects [16].
   Regarding the psychological aspects of this work, the proposed task relies on the so-called Big Five Theory [6] to establish the personality traits to be predicted: extroversion, neuroticism, agreeableness, conscientiousness, and openness to experience. It is worth noting that although in trait theory there are more than five traits, the most extended theoretical approaches reduce their total number to five, as in the case of the Big Five Theory, or even to just three: neuroticism, extraversion and psychoticism [15].

Figure 1: Simplified view of the neural network used. The first layer has 256 neurons, one per possible input byte value. The second layer, the only hidden layer in this case, consists of recurrent LSTM units. The last layer is the output layer, with exactly 5 neurons, one for each trait we want to predict.
3.   THE PROPOSED APPROACH
   Written text, either natural language or source code, can
be viewed as a sequence of basic elements such as sentences, words or characters, to name a few possibilities. Given a particular domain, we can choose the sequential view of the input text which best fits our needs. In our case, source code is full of reserved keywords such as if, return or while, so a word-based approach may seem appropriate at first, as the word vocabulary seems to be relatively fixed and reduced. However, the problem then comes with the custom names given by the coder to classes, variables, functions, etc., which have an unpredictable nature and do not fit well in a strict vocabulary approach. Furthermore, it is desirable for the vocabulary of sequence elements to be as small as possible, since its size determines the required size of the input layer of our models. In order to keep things simple, we will not follow a word-level or character-level approach but a pure byte-level approach, thus limiting the size of the vocabulary to 256 possible byte values.
   To process these byte sequences we use recurrent neural networks, as they are a natural fit for sequential data [5]. Thus, each byte from the input sequence is fed to the network at each time step through the input layer, which transforms byte values into internal representations that can be manipulated by the hidden layers of the network. Moreover, the output of these hidden layers is influenced not only by the current input but also by some of the information retained from the bytes processed at previous time steps. This is achieved thanks to the recurrent connections added to the neurons in these layers of the network. Once the final byte of the input sequence has been processed, the output of the last hidden layer at the last time step is used to perform a linear transformation and produce as a result a vector of five values, each of them corresponding to a particular personality trait (as described in Section 2). In order to achieve this, the network has to be properly trained to return relevant values at its output and not just random garbage. In this case, we have configured it to minimize the difference between its obtained output and the desired one for each input sequence. More precisely, we have used a smoothed mean absolute error as the training criterion of the network, which uses a squared term if the absolute element-wise error falls below one, making it less sensitive to outlier data and preventing exploding gradients, a common problem in neural network training [8].
   It is worth noting that, in contrast with previous work (see Section 2), our approach does not require a feature engineering phase, as neural networks, during their training process, reflect the most relevant features of the input domain in the values of their parameters, sometimes referred to as weights.
   Lastly, we opted for feeding the network with input instances (sequences) which are independent from each other, so that the important dependence relationships (patterns) between the elements (bytes) of a sequence may be observed by our model. For this reason, we have constructed sequences from whole source code packages. As these sequences can be quite long (see Section 4), traditional recurrent networks may have problems recalling important information extracted at the beginning of the input sequence while they are processing its last elements. In order to address this limitation, we use long short-term memory (LSTM) units as the neurons in the hidden layers of the network [9]. These units pack a memory cell and other elements that manipulate its contents, thus enabling them to remember important information from the distant past of an input sequence. See Figure 1 for a simplified visual representation of this model.
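   For illustration purposes, the following is a minimal sketch of the architecture and training criterion just described. It is not our actual implementation, which uses Torch and torch-rnn (see Section 4), but an approximation written with the PyTorch library; the hidden size of 300 follows the configurations used in our experiments, while the dummy batch at the end is purely illustrative.

```python
import torch
import torch.nn as nn

class ByteLSTMRegressor(nn.Module):
    """Shallow byte-level LSTM regressor for the five Big Five trait scores."""

    def __init__(self, hidden_size=300, num_layers=1):
        super().__init__()
        # 256 possible byte values; byte 0 doubles as the padding symbol.
        self.embed = nn.Embedding(256, hidden_size, padding_idx=0)
        self.lstm = nn.LSTM(hidden_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        # Linear transformation from the last hidden state to the five traits.
        self.out = nn.Linear(hidden_size, 5)

    def forward(self, byte_seqs):          # byte_seqs: (batch, seq_len) of byte ids
        h = self.embed(byte_seqs)          # (batch, seq_len, hidden)
        output, _ = self.lstm(h)           # (batch, seq_len, hidden)
        last = output[:, -1, :]            # state after the final byte (padding is on the left)
        return self.out(last)              # (batch, 5) predicted trait values

# Smoothed mean absolute error: 0.5*e^2 if |e| < 1, |e| - 0.5 otherwise.
criterion = nn.SmoothL1Loss()

model = ByteLSTMRegressor()
optimizer = torch.optim.Adam(model.parameters())

# One illustrative training step on a dummy batch of padded byte sequences.
x = torch.randint(0, 256, (10, 4823))      # 10 sequences of the average length
y = torch.rand(10, 5) * 60 + 20            # trait scores in the 20-80 range
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```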
4.   EXPERIMENTS
   Our models were implemented using the scientific framework Torch [4] and the recurrent neural network library torch-rnn [11]. We took advantage of the GPU computing capabilities of these resources using an Nvidia GTX Titan X. For further implementation details, the source code will be made available at https://cloud.wyffy.com/index.php/s/EphokbtRuQ43BWc.
  int integ01;  ->  105 110 116  32 105 110 116 101 103  48  49  59
  int i2;       ->    0   0   0   0   0 105 110 116  32 105  50  59

  int i1;       ->    0 105 110 116  32 105  49  59
  int in1;      ->  105 110 116  32 105 110  49  59

Figure 2: Text is represented as sequences of byte values which are then gathered into batches (grey filled rectangles) where they are padded with zeros at the beginning.

Figure 3: Training and validation error evolution through 100 training epochs for three different model configurations (1x300, 1x500 and 2x300 hidden layers x neurons per layer; x-axis: epoch, y-axis: error).
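   As a concrete illustration of the encoding shown in Figure 2, the following sketch (plain Python, not our actual preprocessing code; the helper name encode_batch is ours for illustration) maps each text to its byte values and left-pads every sequence in a batch with zeros:

```python
def encode_batch(texts):
    """Encode texts as byte-value sequences and left-pad them with zeros
    so that all sequences in the batch share the same length."""
    seqs = [list(text.encode("utf-8")) for text in texts]
    max_len = max(len(s) for s in seqs)
    return [[0] * (max_len - len(s)) + s for s in seqs]

batch = encode_batch(["int integ01;", "int i2;"])
# [[105, 110, 116, 32, 105, 110, 116, 101, 103, 48, 49, 59],
#  [0, 0, 0, 0, 0, 105, 110, 116, 32, 105, 50, 59]]
```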
Training and validation
First of all, we preprocessed the training corpus provided by the PR-SOCO organization to best fit the training procedure of the neural network. In this vein, we appended the personality information (i.e. the personality trait values) for each one of the 49 developers right after the end marker of the package of their source code files. This way, our input format contains the input sequence for the network above the marker and the desired output one line below. Then we merged all resulting files into a single one and shuffled the instances, resulting in a total of 1600 instances. The validation dataset was built by taking the first 141 instances from the resulting file, leaving the rest for training.
   The shortest meaningful sequence in the training corpus has length 34, the longest 27654, and the average one is approximately 4823 characters long. Regarding the personality scores in the training corpus, their values fall in the range 20–80 and their means are: 49.92 for neuroticism, 45.22 for extroversion, 49.51 for openness, 47.02 for agreeableness and 46.37 for conscientiousness.
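   A minimal sketch of this corpus preparation step is shown below. It does not reproduce the exact PR-SOCO file format: the end marker string, the toy corpus and the helper name build_instance are illustrative assumptions.

```python
import random

END_MARKER = "<<<END_OF_PACKAGE>>>"   # illustrative stand-in for the actual end marker

def build_instance(package_text, traits):
    """One instance: the package text above the end marker,
    the desired output (five trait values) one line below it."""
    return package_text + "\n" + END_MARKER + "\n" + " ".join(str(t) for t in traits)

# Toy example with two developers; the real data comes from the PR-SOCO corpus.
corpus = [
    ("class A { int i1; }", (49.0, 45.0, 50.0, 47.0, 46.0)),
    ("class B { int in1; }", (52.0, 44.0, 48.0, 46.0, 47.0)),
]
instances = [build_instance(text, traits) for text, traits in corpus]

# Merge, shuffle and split: with the real 1600 instances, the first 141
# shuffled instances form the validation set and the rest the training set.
random.shuffle(instances)
val_set, train_set = instances[:141], instances[141:]
```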
   In order to benefit from the processing power of the GPU, we gathered input sequences into batches. Since all sequences in a given batch must have the same length, we padded the shorter sequences with zeros at the beginning. Unfortunately, this was not exactly the case throughout our experiments: until very recently the padding was being added to the end of shorter sequences instead, giving rise to a bug where these sequences were automatically discarded in the training process. This bug did not affect experiment settings with a batch size of 1.
   We show in Figure 2 how some sample input texts are represented as sequences of byte values which are then gathered into batches where they are appropriately padded with zeros at the beginning.
   The training process consists of 100 full cycles (epochs) through the training corpus. The time needed to accomplish this depends on the complexity of the network and the batch size used. As an example, one epoch in a network with two hidden layers of 300 neurons each and a batch size of 1 can take up to 4.8 hours, while using a batch size of 10 reduces the training time to 2.6 hours. Similarly, with a batch size of 10, a network formed by a single 300-neuron hidden layer needs 2 hours to train through one epoch. It is important to note that these figures would be lower if we did not run multiple training processes in parallel.
   In Figure 3 we show our preliminary experiments to attest to the capacity of our models to tackle the task at hand. Although the training behaviour observed for these models was acceptable and invited us to use them against the test corpus (which we will describe shortly), they were affected by the padding bug mentioned earlier and cannot be considered clear evidence of the performance of the models. Since the batch size was set to 10, the bug caused the models to train with a tenth of the total training and validation instances. In any case, as we can see in the graph, adding at least one extra hidden layer to the architecture seems beneficial for the generalization capabilities of a neural network trained for this task (steady training data fit and lower final validation error), while adding neurons to a sole hidden layer proves counterproductive.

Testing and official results
The test corpus supplied by the PR-SOCO organization did not undergo a preprocessing stage such as the one described above. In this case we have to evaluate 21 developers whose source code is fragmented into 750 test instances. The maximum sequence length observed is 33550, the minimum 114 and the mean 3743.
   For the five runs performed on the test data, we have used five different models differing in the number of hidden layers and the batch size employed, as detailed in Table 1.

             #hlayers   batch size
  run01-v2      1           10
  run02         2           10
  run03-v2      1           20
  run04         1            1
  run05-v2      1            1

Table 1: Number of hidden layers and batch size of the models used for the test runs. All of them have 300 neurons per hidden layer. The difference between run04 and run05-v2 is the training time, greater in the latter case.
                N       E       O       A       C
  run01-v2    11.99   11.18   12.27   10.31    8.85
  run02       12.63   11.81    8.19   12.69    9.91
  run03-v2    10.37   12.50    9.25   11.66    8.89
  run04       29.44   28.80   27.81   25.53   14.69
  run05-v2    11.34   11.71   10.93   10.52   10.78
  task mean   12.75   12.27   10.49   12.07   10.74

Table 2: Official PR-SOCO RMSE results over 5 runs. Personality traits: (N)euroticism, (E)xtroversion, (O)penness, (A)greeableness and (C)onscientiousness. run02 is the only run affected by the batch padding bug.

                N       E       O       A       C
  run01-v2   -0.01    0.09   -0.05    0.20    0.02
  run02      -0.18    0.21   -0.02   -0.01   -0.30
  run03-v2    0.14    0.00    0.11   -0.14    0.15
  run04      -0.24    0.47   -0.14    0.38    0.32
  run05-v2    0.05    0.19    0.12   -0.07   -0.12
  task mean   0.04    0.06    0.09   -0.01   -0.01

Table 3: Official PR-SOCO PC results over 5 runs. Personality traits: (N)euroticism, (E)xtroversion, (O)penness, (A)greeableness and (C)onscientiousness.

Figure 4: Training and validation error evolution for 1- and 2-hidden-layer networks not affected by the padding bug (configurations 1x300 and 2x300; x-axis: epoch, y-axis: error).
   All of these models have 300 neurons per hidden layer and have been trained on the whole training corpus, including the validation part. Note that the difference between run04 and run05-v2 is the training time, longer in the latter case. The only run affected by the padding bug was run02.
   In Tables 2 and 3 we can see our official results for the PR-SOCO task. In general, the correlation scores are quite low, while the RMSE figures are acceptable (considering that they beat the task average) except for run04, whose better results in correlation might be attributed to mere coincidence. On the other hand, we see that the RMSE scores for run02 are quite good despite it being the only case affected by the batch padding bug mentioned above. This fact seems to be related to the benefits provided by the extra network layer that the corresponding model has with respect to the rest. We can also observe, in the difference between run04 and run05-v2, that allowing the model to train for longer periods of time is indeed useful to attain good performance. Finally, at this time the data available do not allow us to extract any particular conclusion about the influence of the batch size on our results.
   It is worth noting that, unfortunately, we could not re-run the 2-hidden-layer network without the padding bug against the test corpus because of time constraints. Nevertheless, in order to confirm the hypothesis that adding an extra layer to the network is beneficial to its performance, we have conducted some a posteriori experiments with the training corpus. In Figure 4 we can see how the 2-hidden-layer network obtains, once again, better generalization capabilities than the 1-hidden-layer network.
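   For reference, the two official metrics reported in Tables 2 and 3, RMSE and Pearson correlation per trait, correspond to their standard definitions. A minimal sketch for a single trait, assuming the predicted and gold scores are given as plain Python lists (the example values are illustrative only):

```python
import math

def rmse(preds, golds):
    """Root mean squared error between predicted and gold trait scores."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(preds))

def pearson(preds, golds):
    """Pearson correlation coefficient between predicted and gold trait scores."""
    n = len(preds)
    mp, mg = sum(preds) / n, sum(golds) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(preds, golds))
    sd_p = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sd_g = math.sqrt(sum((g - mg) ** 2 for g in golds))
    return cov / (sd_p * sd_g)

print(rmse([50.0, 47.0, 52.0], [49.92, 45.22, 49.51]))
print(pearson([50.0, 47.0, 52.0], [49.92, 45.22, 49.51]))
```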
5.   CONCLUSIONS
   Source code is a form of written text which has become very accessible in recent years. While more constrained and formal than natural language due to its very nature, it also allows for some personal preferences to pour down into its structure and content, giving rise to the possibility of author profiling on it.
   In this paper we have presented our proposal for personality recognition in source code. Viewing such text as a sequence of characters (or bytes), we have used shallow recurrent neural networks as our personality trait predictors. In order to maximize the pattern detection capabilities of our model, we have fed entire source code packages as sequence inputs to the network. The network learning criterion was a smoothed mean absolute error, less sensitive to outliers than RMSE or the mean absolute error.
   Given the encouraging results obtained, we think that our approach may be a viable one to tackle this problem. On the one hand, the RMSE figures obtained, which are aligned with the criterion we were optimizing for, are positive considering that we have used a shallow network, whose expressive power is limited, with large input sequences. On the other hand, we have found some hints pointing at better performance when using deeper neural networks and training them for longer periods of time, which may constitute immediate ways of improving our results.
   As future lines of work, we will try to improve our results by adding more layers to our neural network, in a one-by-one fashion until we see no more significant improvement, and also by introducing a new training criterion that considers the correlation between instances. Another interesting research line would be the study and visualization of the activation mechanisms which occur within the network at evaluation time in order to try to interpret the patterns, or features, that the model has previously extracted during the training phase. In other words, to analyse the behaviour of the network in order to observe human-interpretable patterns and thus distil the knowledge condensed in the network.

6.   ACKNOWLEDGMENTS
   This work has been partially funded by the Spanish Ministerio de Economía y Competitividad through projects FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R, and by Xunta de Galicia through an Oportunius program grant.
   We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU used for this research.

7.   REFERENCES
 [1] S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni. Gender, genre, and writing style in formal written texts. TEXT, 23(3):321–346, 2003.
 [2] J.-I. Biel, O. Aran, and D. Gatica-Perez. You Are
     Known by How You Vlog: Personality Impressions
     and Nonverbal Behavior in Youtube. In ICWSM, 2011.
 [3] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez,
     G. Riccardi, and F. Pianesi. The Workshop on
     Computational Personality Recognition 2014. In
     Proceedings of the 22nd ACM International
     Conference on Multimedia, pages 1245–1246. ACM,
     2014.
 [4] R. Collobert, K. Kavukcuoglu, and C. Farabet.
     Torch7: A Matlab-like environment for machine
     learning. In BigLearn, NIPS Workshop, number
     EPFL-CONF-192376, 2011.
 [5] J. T. Connor, R. D. Martin, and L. E. Atlas.
     Recurrent neural networks and robust time series
     prediction. IEEE Transactions on Neural Networks,
     5(2):240–254, 1994.
 [6] P. T. Costa and R. R. McCrae. Revised NEO
     personality inventory (NEO PI-R) and NEO
     five-factor inventory (NEO FFI): Professional manual.
     Psychological Assessment Resources, 1992.
 [7] D. P. Crowne. Personality theory. Don Mills, Ont.:
     Oxford University Press, 2007.
 [8] R. Girshick. Fast R-CNN. In Proceedings of the IEEE
     International Conference on Computer Vision, pages
     1440–1448, 2015.
 [9] S. Hochreiter and J. Schmidhuber. Long short-term
     memory. Neural computation, 9(8):1735–1780, 1997.
[10] J. Houvardas and E. Stamatatos. N-gram feature
     selection for authorship identification. In International
     Conference on Artificial Intelligence: Methodology,
     Systems, and Applications, pages 77–86. Springer,
     2006.
[11] N. Léonard, S. Waghmare, and Y. Wang. RNN:
     Recurrent library for torch. arXiv preprint
     arXiv:1511.07889, 2015.
[12] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K.
     Moore. Using linguistic cues for the automatic
     recognition of personality in conversation and text.
     Journal of Artificial Intelligence Research, 30:457–500,
     2007.
[13] F. Rangel, F. González, F. Restrepo, M. Montes, and
     P. Rosso. PAN at FIRE: Overview of the PR-SOCO
     Track on Personality Recognition in SOurce COde. In
     Working notes of FIRE 2016 - Forum for Information
     Retrieval Evaluation, Kolkata, India, December 7-10,
     2016, CEUR Workshop Proceedings. CEUR-WS.org,
     2016.
[14] F. Rangel, P. Rosso, M. Potthast, B. Stein, and
     W. Daelemans. Overview of the 3rd Author Profiling
     Task at PAN 2015. In CLEF, 2015.
[15] H. Eysenck and S. B. G. Eysenck. Manual of the Eysenck
     Personality Questionnaire, 1975.
[16] J. Staiano, B. Lepri, N. Aharony, F. Pianesi, N. Sebe,
     and A. Pentland. Friends don't lie: inferring
     personality traits from social network structure. In
     Proceedings of the 2012 ACM Conference on
     Ubiquitous Computing, pages 321–330. ACM, 2012.
[17] M. Telgarsky. Benefits of depth in neural networks.
     CoRR, abs/1602.04485, 2016.
[18] C. Zhang and P. Zhang. Predicting gender from blog
     posts. Technical report, University of Massachusetts
     Amherst, USA, 2010.