=Paper=
{{Paper
|id=Vol-1737/T1-6
|storemode=property
|title=Shallow Recurrent Neural Network for Personality Recognition in Source Code
|pdfUrl=https://ceur-ws.org/Vol-1737/T1-6.pdf
|volume=Vol-1737
|authors=Yerai Doval,Carlos Gómez-Rodríguez,Jesús Vilares
|dblpUrl=https://dblp.org/rec/conf/fire/DovalGV16
}}
==Shallow Recurrent Neural Network for Personality Recognition in Source Code==
Yerai Doval (Grupo COLE, Departamento de Computación, E.S. de Enxeñaría Informática, Universidade de Vigo, Campus As Lagoas, 32004 – Ourense, Spain) yerai.doval@uvigo.es

Carlos Gómez-Rodríguez (Grupo LYS, Departamento de Computación, Facultade de Informática, Universidade da Coruña, Campus de Elviña, 15071 – A Coruña, Spain) cgomezr@udc.es

Jesús Vilares (Grupo LYS, Departamento de Computación, Facultade de Informática, Universidade da Coruña, Campus de Elviña, 15071 – A Coruña, Spain) jvilares@udc.es

ABSTRACT

Personality recognition in source code constitutes a novel task in the field of author profiling on written text. In this paper we describe our proposal for the PR-SOCO shared task at FIRE 2016, which is based on a shallow recurrent LSTM neural network that tries to predict five personality traits of the author given a source code fragment. Our preliminary results show that it should be possible to tackle the problem at hand with our approach, but also that there is still room for improvement through more complex network architectures and training processes.

CCS Concepts

•Applied computing → Law, social and behavioral sciences; Psychology; Document analysis; •Human-centered computing → Text input; •Social and professional topics → User characteristics; •Computing methodologies → Natural language processing; Neural networks

Keywords

personality recognition, source code, recurrent neural network, LSTM

1. INTRODUCTION

Written text can tell us a lot about its author. Demographic information such as age, gender or specific personality traits of the author can be inferred by a human expert from the sole observation of a written text fragment [7]. This task is called author profiling, and it can also be applied to other channels such as speech or body language. But detecting the patterns which allow for this kind of information extraction is not restricted to humans, as we will see in this work.

Source code is another form of written text, and it is becoming very accessible as software developers are now able to easily publish their work on the Web through services such as GitHub (https://github.com/) or Bitbucket (https://bitbucket.org/). Although more constrained and formal than natural language, source code may also have something to tell us about its author, as there is still room for personal preferences in its writing. For instance, some coders tend to use block delimiters even when they are not necessary, or to add a certain number of blank lines in order to clearly separate two function declarations. Moreover, variable and function names are custom made by the coder, and comments include information in natural language. Therefore, it sounds reasonable to take advantage of this type of patterns to attempt to extract information about the author of a source code fragment, which constitutes a novel task in the author profiling field.

In this work, we describe our contribution to the Personality Recognition in SOurce COde (PR-SOCO) shared task [13], held in conjunction with FIRE 2016. The objective of this task is to quantify five personality traits of the author of a given source code fragment, namely the standard traits from the Big Five Theory [6]: extroversion, emotional stability/neuroticism, agreeableness, conscientiousness and openness to experience. To achieve this, we propose using a shallow recurrent neural network that, taking as input the sequence of bytes in a source code text, will try to predict the five values for the corresponding traits of its author. By reading the most elementary unit available for encoded text, the byte (in most cases directly aligned with individual characters), we seek to find all possible useful patterns carved deep into the text. Furthermore, with this approach we are not limiting our models to those patterns a human can grasp, but enabling the neural network to extract any information it may consider useful for the task.

The results obtained with our shallow networks are encouraging with respect to the root mean squared error (RMSE) metric, which is aligned with the smoothed mean absolute error criterion employed in our training process. However, they do not perform so well for Pearson correlation (PC), which we have not considered at this time. We have also found that the use of more layers in our networks can improve their performance, agreeing with previous work [17].
2. RELATED WORK

There has been a recent surge of interest in author profiling related to personality recognition [14, 3].

For written text, traditional author profiling approaches tend to rely on lexical and syntactic features, such as identification of key words, part-of-speech tags [1] or n-grams [10], paired with statistical models such as Hidden Markov Models. There is also work which studies the application of these traditional techniques to short informal texts, which often translates into lower performance figures than those obtained for regular texts [18].

However, author profiling is not restricted to written text. Mairesse et al. [12] extend this type of analysis to speech, where features such as sound frequencies and the duration of pauses made by the speaker are considered. Biel et al. [2] go one step further by analysing YouTube videos and adding what they call "nonverbal cues" to the feature set, which take into consideration the different types of motion that can be observed in the video. There are even approaches that analyse the structure and topology of the social network of the subjects [16].

Regarding the psychological aspects of this work, the proposed task relies on the so-called Big Five Theory [6] to establish the personality traits to be predicted: extroversion, neuroticism, agreeableness, conscientiousness, and openness to experience. It is worth noting that although trait theory comprises more than five traits, the most extended theoretical approaches reduce their total number to five, as in the case of the Big Five Theory, or even to just three: neuroticism, extraversion and psychoticism [15].

3. THE PROPOSED APPROACH

Written text, either natural language or source code, can be viewed as a sequence of basic elements such as sentences, words or characters, to name a few possibilities. Given a particular domain, we can choose the sequential view of the input text which best fits our needs. In our case, source code is full of reserved keywords such as if, return or while, so a word-based approach may seem appropriate at first, as the word vocabulary seems to be relatively fixed and reduced. However, the problem then comes with the custom names given by the coder to classes, variables, functions, etc., which have an unpredictable nature and do not fit well into a strict vocabulary approach. Furthermore, it is desirable for the vocabulary of sequence elements to be as small as possible, since it determines the required size of the input layer of our models. In order to keep things simple, we will not follow a word-level or character-level approach but a pure byte-level approach, thus limiting the size of the vocabulary to 256 possible byte values.
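As an illustration of this byte-level view (this is a minimal Python sketch, not part of our Torch implementation; the helper name and the UTF-8 assumption are ours), a source code fragment is simply mapped to the sequence of its byte values:

<pre>
# Minimal sketch: represent a source code fragment as its byte values (0-255).
def to_byte_sequence(code: str) -> list:
    """Return the byte values of a code fragment, assuming UTF-8 encoding."""
    return list(code.encode("utf-8"))

print(to_byte_sequence("int i1;"))
# -> [105, 110, 116, 32, 105, 49, 59]
</pre>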
To process these byte sequences we will use recurrent neural networks, as they are a perfect fit for sequential data [5]. Thus, each byte from the input sequence is fed to the network at each time step through the input layer, which transforms byte values into internal representations that can be manipulated by the hidden layers of the network. Moreover, the output of these hidden layers is influenced not only by the current input but also by some of the information retained from the bytes processed at previous time steps. This is achieved thanks to the recurrent connections added to the neurons in these layers of the network. Once the final byte of the input sequence has been processed, the output of the last hidden layer at the last time step is used to perform a linear transformation and produce as a result a vector of five values, each of them corresponding to a particular personality trait (as described in Section 2). In order to achieve this, the network had to be accurately trained to return relevant values at its output and not just random garbage. In this case, we have configured it to minimize the difference between its obtained output and the desired one for each input sequence. More precisely, we have used a smoothed mean absolute error as the training criterion of the network, which uses a squared term when the absolute element-wise error falls below one, making it less sensitive to outlier data and preventing exploding gradients, a common problem in neural network training [8].

It is worth noting that, in contrast with previous work (see Section 2), our approach does not require a feature engineering phase, as neural networks, during their training processes, reflect the most interesting features of the input domain in the values of their parameters, sometimes referred to as weights.

Lastly, we opted for feeding the network with input instances (sequences) which are independent from each other, so that the important dependence relationships (patterns) between the elements (bytes) of a sequence may be observed by our model. For this reason, we have constructed sequences from whole source code packages. As these sequences can be quite long (see Section 4), traditional recurrent networks may have problems recalling important information extracted at the beginning of the input sequence while they are processing its last elements. In order to address this limitation, we use long short-term memory (LSTM) units as the neurons in the hidden layers of the network [9]. These units pack a memory cell and other elements that manipulate its contents, thus enabling them to remember important information from the distant past of an input sequence. See Figure 1 for a simplified visual representation of this model.

Figure 1: Simplified view of the neural network used. The first layer has 256 neurons, one per possible input byte value. The second layer, the only hidden layer in this case, consists of recurrent LSTM units. The last layer is the output layer, with exactly 5 neurons, one for each trait we want to predict.
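As a rough sketch of this architecture and training criterion, the code below expresses a comparable model in PyTorch; our actual implementation uses Torch7 and torch-rnn (see Section 4), so the class and parameter names here are illustrative only. In its usual form, the smoothed mean absolute error over an element-wise error e is 0.5 e^2 when |e| < 1 and |e| - 0.5 otherwise.

<pre>
# Illustrative PyTorch sketch of the model described above (the actual
# implementation uses Torch7 + torch-rnn; names and sizes are examples).
import torch.nn as nn

class TraitRegressor(nn.Module):
    def __init__(self, hidden_size=300, num_layers=1):
        super().__init__()
        # Input layer: one embedding entry per possible byte value (0-255).
        self.embed = nn.Embedding(256, hidden_size)
        # Hidden layer(s) of recurrent LSTM units.
        self.lstm = nn.LSTM(hidden_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        # Output layer: a linear transformation to the five trait values.
        self.out = nn.Linear(hidden_size, 5)

    def forward(self, byte_ids):            # byte_ids: (batch, seq_len) integer tensor
        hidden, _ = self.lstm(self.embed(byte_ids))
        return self.out(hidden[:, -1, :])   # use only the last time step

# Smoothed mean absolute error: squared term below 1, absolute error above.
criterion = nn.SmoothL1Loss()
</pre>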
4. EXPERIMENTS

Our models were implemented using the scientific framework Torch [4] and the recurrent neural network library torch-rnn [11]. We took advantage of the GPU computing capabilities of these resources using an Nvidia GTX Titan X. For further implementation details, the source code will be made available at https://cloud.wyffy.com/index.php/s/EphokbtRuQ43BWc.

Training and validation

First of all, we preprocessed the training corpus given by the PR-SOCO organization to best fit the training procedure of the neural network. In this vein, we merged the personality information (i.e. the personality trait values) of each one of the 49 developers right after the end marker of the package of their source code files. This way our input format contains the input sequence to the network above the marker and the desired output one line below. Then we merged all resulting files into a single one and shuffled the instances, resulting in a total of 1600 instances. The validation dataset was built by taking the first 141 instances from the resulting file, leaving the rest for training.

The shortest meaningful sequence in the training corpus has length 34, the longest 27654, and the average one is approximately 4823 characters long. Regarding the personality scores in the training corpus, their values fall in the range 20–80 and their means are: 49.92 for neuroticism, 45.22 for extroversion, 49.51 for openness, 47.02 for agreeableness and 46.37 for conscientiousness.

In order to benefit from the processing power of the GPU, we gathered input sequences into batches. Since all sequences in a given batch must have the same length, we padded the shorter sequences with zeros at the beginning. Unfortunately, this was not exactly the case throughout our experiments: until very recently the padding was being added to the end of shorter sequences instead, giving rise to a bug where these sequences were automatically discarded in the training process. This bug did not affect experiment settings with a batch size of 1.
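The left padding itself can be sketched in a few lines of Python (an illustration only, not the actual Torch batching code): shorter byte sequences in a batch are filled with leading zeros up to the length of the longest one.

<pre>
# Illustrative left padding of byte sequences into a same-length batch
# (zeros added at the beginning); not the actual Torch batching code.
def pad_batch(sequences):
    """Left-pad every byte sequence with zeros up to the longest one."""
    max_len = max(len(s) for s in sequences)
    return [[0] * (max_len - len(s)) + list(s) for s in sequences]

batch = pad_batch(["int integ01;".encode(), "int i2;".encode()])
# [[105, 110, 116, 32, 105, 110, 116, 101, 103, 48, 49, 59],
#  [0, 0, 0, 0, 0, 105, 110, 116, 32, 105, 50, 59]]
</pre>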
We show in Figure 2 how some sample input texts are represented as sequences of byte values which are then gathered into batches where they are appropriately padded with zeros at the beginning.

Figure 2: Text is represented as sequences of byte values which are then gathered into batches (grey filled rectangles) where they are padded with zeros at the beginning.

The training process consists of 100 full cycles (epochs) through the training corpus. The time needed to accomplish this depends on the complexity of the network and the batch size used. As an example, one epoch in a network with two hidden layers of 300 neurons each and a batch size of 1 can take up to 4.8 hours, while using a batch size of 10 reduces the training time to 2.6 hours. Similarly, with the batch size set to 10, a network formed by a single hidden layer of 300 neurons needs 2 hours to train through one epoch. It is important to note that these figures would be lower if we did not run multiple training processes in parallel.

In Figure 3 we show our preliminary experiments to attest the capacity of our models to tackle the task at hand. Although the behaviour observed at training time for these models was acceptable and invited us to use them against the test corpus (which we will describe shortly), they were affected by the padding bug mentioned earlier and cannot be considered clear evidence of the performance of the models. Since the batch size was set to 10, the bug caused the models to train with a tenth of the total training and validation instances. In any case, as we can see in the graph, it seems beneficial for the generalization capabilities of a neural network trained for this task to add at least one extra hidden layer to its architecture (steady fit of the training data and lower final validation error), while adding neurons to a single hidden layer turns out to be counterproductive.

Figure 3: Training and validation error evolution through 100 training epochs for three different model configurations (1x300, 1x500 and 2x300 hidden layers x neurons).

Testing and official results

The test corpus supplied by the PR-SOCO organization did not undergo a preprocessing stage such as the one described above. In this case we have to evaluate 21 developers whose source code is fragmented into 750 test instances. The maximum sequence length observed is 33550, the minimum 114 and the mean 3743.

For the five runs performed on the test data, we have used five different models differing in the number of hidden layers and in the batch size employed, which are detailed in Table 1. All of them have 300 neurons per hidden layer and have been trained with the whole training corpus, including the validation part. Note that the difference between run04 and run05-v2 is the training time, longer in the latter case. The only run affected by the padding bug was run02.

Table 1: Number of hidden layers and batch size of the models used for the test runs. All of them have 300 neurons per hidden layer. The difference between run04 and run05-v2 is the training time, greater in the latter case.

  run        #hidden layers  batch size
  run01-v2   1               10
  run02      2               10
  run03-v2   1               20
  run04      1               1
  run05-v2   1               1
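The official evaluation reports root mean squared error (RMSE) and Pearson correlation (PC) per trait. As a reference for how these two measures are computed, the following minimal Python sketch evaluates hypothetical predicted and gold trait values; it is not the official PR-SOCO evaluation script.

<pre>
# Minimal sketch of the two evaluation measures, RMSE and Pearson correlation,
# over hypothetical per-developer trait values; not the official PR-SOCO script.
import math

def rmse(pred, gold):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

def pearson(pred, gold):
    n = len(gold)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gold)
    return cov / math.sqrt(var_p * var_g)

# Hypothetical example values for a single trait over three developers.
print(rmse([48.0, 52.5, 45.0], [50.0, 47.0, 46.0]),
      pearson([48.0, 52.5, 45.0], [50.0, 47.0, 46.0]))
</pre>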
In Tables 2 and 3 we can see the official results we obtained for the PR-SOCO task. In general, the correlation scores are quite low, while the RMSE figures are acceptable (considering that they beat the task average) except for run04, whose better results in correlation might be attributed to mere coincidence. On the other hand, we see that the RMSE scores for run02 are quite good despite it being the only case affected by the batch padding bug mentioned above. This fact seems to be related to the benefit provided by the extra network layer that the corresponding model has with respect to the rest. We can also observe, in the difference between run04 and run05-v2, that allowing the model to train for longer periods of time is indeed useful to attain good performance. Finally, at this time the available data do not allow us to extract any particular conclusion about the influence of the batch size on our results.

Table 2: Official PR-SOCO RMSE results over 5 runs. Personality traits: (N)euroticism, (E)xtroversion, (O)penness, (A)greeableness and (C)onscientiousness. run02 is the only run affected by the batch padding bug.

  run        N      E      O      A      C
  run01-v2   11.99  11.18  12.27  10.31  8.85
  run02      12.63  11.81  8.19   12.69  9.91
  run03-v2   10.37  12.5   9.25   11.66  8.89
  run04      29.44  28.8   27.81  25.53  14.69
  run05-v2   11.34  11.71  10.93  10.52  10.78
  task mean  12.75  12.27  10.49  12.07  10.74

Table 3: Official PR-SOCO PC results over 5 runs. Personality traits: (N)euroticism, (E)xtroversion, (O)penness, (A)greeableness and (C)onscientiousness.

  run        N      E      O      A      C
  run01-v2   -0.01  0.09   -0.05  0.2    0.02
  run02      -0.18  0.21   -0.02  -0.01  -0.3
  run03-v2   0.14   0.0    0.11   -0.14  0.15
  run04      -0.24  0.47   -0.14  0.38   0.32
  run05-v2   0.05   0.19   0.12   -0.07  -0.12
  task mean  0.04   0.06   0.09   -0.01  -0.01

It is worth noting that, unfortunately, we could not re-run the 2-hidden-layer network without the padding bug against the test corpus because of time constraints. Nevertheless, in order to confirm the hypothesis that adding an extra layer to the network is beneficial to its performance, we have conducted some a posteriori experiments with the training corpus. In Figure 4 we can see how the 2-hidden-layer network obtains, once again, better generalization capabilities than the 1-hidden-layer network.

Figure 4: Training and validation error evolution for 1 and 2-hidden-layer networks not affected by the padding bug.

5. CONCLUSIONS

Source code is a form of written text which has become very accessible in recent years. While more constrained and formal than natural language due to its very nature, it also allows some personal preferences to pour down into its structure and content, giving rise to the possibility of author profiling on it.

In this paper we have presented our proposal for personality recognition in source code. Viewing such text as a sequence of characters (or bytes), we have used shallow recurrent neural networks as our personality trait predictors. In order to maximize the pattern detection capabilities of our model, we have fed entire source code packages as sequence inputs to the network. The network learning criterion was a smoothed mean absolute error, less sensitive to outliers than RMSE or the mean absolute error.

Given the encouraging results obtained, we think that our approach may be a viable one to tackle this problem. On the one hand, the RMSE figures obtained, which are aligned with the criterion we were optimizing for, are positive considering that we have used a shallow network, whose expressive power is limited, with large input sequences. On the other hand, we have found some hints pointing at better performance when using deeper neural networks and training them for longer periods of time, which may constitute immediate ways of improving our results.

As future lines of work, we will try to improve our results by adding more layers to our neural network, in a one-by-one fashion until we see no further significant improvement, and also by introducing a new training criterion that considers the correlation between instances. Another interesting research line would be the study and visualization of the activation mechanisms which occur within the network at evaluation time, in order to try to interpret the patterns, or features, that the model has extracted during the training phase; in other words, to analyse the behaviour of the network, observe human-interpretable patterns, and thus distil the knowledge condensed in the network.

6. ACKNOWLEDGMENTS

This work has been partially funded by the Spanish Ministerio de Economía y Competitividad through projects FFI2014-51978-C2-1-R and FFI2014-51978-C2-2-R, and by Xunta de Galicia through an Oportunius program grant. We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU used for this research.

7. REFERENCES
[1] S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni. Gender, genre, and writing style in formal written texts. TEXT, 23(3):321–346, 2003.
[2] J.-I. Biel, O. Aran, and D. Gatica-Perez. You Are Known by How You Vlog: Personality Impressions and Nonverbal Behavior in Youtube. In ICWSM, 2011.
[3] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez, G. Riccardi, and F. Pianesi. The Workshop on Computational Personality Recognition 2014. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 1245–1246. ACM, 2014.
[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[5] J. T. Connor, R. D. Martin, and L. E. Atlas. Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, 5(2):240–254, 1994.
[6] P. T. Costa and R. R. MacCrae. Revised NEO personality inventory (NEO PI-R) and NEO five-factor inventory (NEO FFI): Professional manual. Psychological Assessment Resources, 1992.
[7] D. P. Crowne. Personality theory. Don Mills, Ont.: Oxford University Press, 2007.
[8] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] J. Houvardas and E. Stamatatos. N-gram feature selection for authorship identification. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer, 2006.
[11] N. Léonard, S. Waghmare, and Y. Wang. RNN: Recurrent library for Torch. arXiv preprint arXiv:1511.07889, 2015.
[12] F. Mairesse, M. A. Walker, M. R. Mehl, and R. K. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30:457–500, 2007.
[13] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[14] F. Rangel, P. Rosso, M. Potthast, B. Stein, and W. Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. In CLEF, 2015.
[15] H. J. Eysenck and S. B. G. Eysenck. Manual of the Eysenck Personality Questionnaire, 1975.
[16] J. Staiano, B. Lepri, N. Aharony, F. Pianesi, N. Sebe, and A. Pentland. Friends don't lie: inferring personality traits from social network structure. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pages 321–330. ACM, 2012.
[17] M. Telgarsky. Benefits of depth in neural networks. CoRR, abs/1602.04485, 2016.
[18] C. Zhang and P. Zhang. Predicting gender from blog posts. Technical report, University of Massachusetts Amherst, USA, 2010.