Personality Recognition in Source Code Working Note: Team BESUMich

Shanta Phani
Information Technology, IIEST, Shibpur
Howrah 711103, West Bengal, India
shantaphani@gmail.com

Shibamouli Lahiri
Computer Science and Engineering, University of Michigan
Ann Arbor, MI 48109
lahiri@umich.edu

Arindam Biswas
Information Technology, IIEST, Shibpur
Howrah 711103, West Bengal, India
abiswas@it.becs.ac.in

ABSTRACT
In this paper, we describe the results of source code personality identification from Team BESUMich. We used a set of simple, robust, scalable, and language-independent features on the PR-SOCO dataset. Using a leave-one-coder-out strategy for model selection, we obtained the minimum RMSE on the test data for extroversion, and competitive results for the other personality traits.

CCS Concepts
• Computing methodologies → Natural language processing; Supervised learning by regression;

Keywords
personality; source code; regression; RMSE; Pearson correlation; extroversion; neuroticism; openness; agreeableness; conscientiousness

1. INTRODUCTION
Personality is an important element of human sociology and psychology. It determines and underscores our day-to-day decisions, shopping and dating behaviors, educational aptitude, and emotional intelligence, to name a few. It is therefore no coincidence that the source code a programmer writes tends to be influenced by his/her personality. While the traditional Author Profiling task consists of predicting an author's demographics (e.g., age, gender, personality) from his/her writing, in the PR-SOCO shared task [15] the goal was to predict a programmer's personality from his/her source code. Personality traits influence most human activities, including but not limited to the way people write [4, 14], interact with others, and make decisions. For example, in the case of programmers, personality traits may influence the criteria they use to select which open-source software projects to participate in [11], and the way they write and organize their code.

In PR-SOCO, given a source code collection of a programmer, the goal was to identify his/her personality. Personality was defined according to five traits using the Big Five Theory [6]: extroversion (E), neuroticism (S), agreeableness (A), conscientiousness (C), and openness to experience (O). Each programmer was rated on a numeric scale on each of the five traits. Training and test data consisted of such ratings, along with code snippets from the developers. Since the response variable was a real number rather than a class label, we used a regression framework to model the supervised learning problem. We used a set of simple, robust, scalable, and language-independent features (Section 3), and optimized the root mean squared error (RMSE) averaged across all five traits in a leave-one-out cross-validation strategy. When applied to the test data, one of our runs achieved the minimum RMSE for extroversion.
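Written out, the per-trait evaluation metric and our model-selection objective are as follows (the definitions are standard; the per-trait notation is ours):

    \mathrm{RMSE}_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{(t)} - \hat{y}_i^{(t)}\right)^2},
    \qquad \text{objective} = \frac{1}{5}\sum_{t=1}^{5}\mathrm{RMSE}_t,

where y_i^{(t)} is the gold rating of programmer i on trait t, \hat{y}_i^{(t)} is the model's prediction, and n is the number of programmers.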
The rest of this paper is organized as follows. We discuss relevant literature in Section 2. Section 3 gives details on the PR-SOCO task, especially the data and task description. We also describe our features, regressors, and experimental methodology in that section, especially delineating why we chose these features instead of code-style features. Section 4 provides experimental evaluation, and important insights that we gained along the way. We conclude in Section 5, outlining our contributions, limitations, and directions for future research. Relevant terminology is introduced as and when it first appears in the paper.

2. RELATED WORK
Personality recognition usually falls under the purview of author profiling [2, 3, 8, 14, 16]. Argamon et al. [2] showed that authors of informal texts could be successfully classified according to high or low neuroticism, and high or low extroversion. Four different sets of lexical features were used: a standard function word list, conjunctive phrases, modality indicators, and appraisal adjectives and modifiers. Appraisal use was found to be the best predictor for neuroticism, and function words worked best for extroversion. An SVM SMO classifier was used on essays written by college students.

Argamon et al. [3] extended this study in 2009 to take into account gender, age, native language, and personality. Three different corpora were used, in conjunction with content-based and style-based features. Bayesian Multinomial Regression (BMR) was used as the classifier [9]. Style features were found to be very informative for personality traits. The most discriminative style features indicated that neurotics tended to refer to themselves.

Estival et al. [8] created an email dataset annotated with ten traits: five demographic (gender, age, geographic origin, level of education, native language), and five psychometric (the same ones mentioned in Section 1). They further designed a Text Attribution Tool (TAT), and subjected their data to this tool for rigorous validation, normalization, linguistic analysis, processing, and parsing. Three types of features (character-level, lexical, and structural) were extracted. It was shown that a combination of features performed best, and beat the baseline.

Rangel et al. [16] presented the Author Profiling Task at PAN 2013. The task consisted of age and gender classification in English and Spanish, and a special exercise on identifying adult-adult sexual conversations, and fake profiles for sexual predators. The task was extended by Rangel et al. in 2015 [14] to include four languages (English, Spanish, Italian, and Dutch), Big Five personality traits, and Twitter users. The participants used content-based features (bag of words, word n-grams, term vectors, tfidf n-grams, named entities, dictionary words, slang words, ironic words, sentiment words, emotional words), and style-based features (frequencies, punctuation, POS, verbosity measures, and several tweet-specific statistics such as mentions, hashtags, and URLs). The highest accuracies in gender identification were achieved in Dutch and Spanish, with values over 95%.

While all the above studies are important, and ground-breaking in some cases, we found none that looked into personality recognition from source code. From that perspective, the PR-SOCO shared task [15] breaks new ground.

3. TASK DESCRIPTION
The PR-SOCO task [15] released a set of text files for 70 programmers: 49 as training data, and 21 as test. Each text file consisted of several source code snippets. The number of code snippets varies significantly from programmer to programmer. We show the distribution of snippets in Table 1. Notably, the distribution follows a power law with exponent α = 2.86 for the training data, and 3.06 for the test data (statistically plausible in both cases; cf. [5]); a sketch of how such exponents can be estimated follows Table 1.

Table 1: Statistics of the distribution of the number of code snippets in the PR-SOCO dataset. α represents the power-law exponent of the distribution. We also give the corresponding p-value (> 0.05 indicates that the power law is a plausible fit).

Training data
Min  Median  Mean   Max  SD     TOTAL  α     p-value
5    29      35.53  121  24.35  1741   2.86  0.91

Test data
Min  Median  Mean   Max  SD     TOTAL  α     p-value
13   28      35.76  108  22.98  751    3.06  1.0
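A minimal sketch of how such exponents can be estimated, following the maximum-likelihood procedure of Clauset et al. [5], is given below. It uses the third-party powerlaw package and a placeholder count list; this is an illustration, not the exact script behind Table 1.

    # Sketch: fitting a discrete power law to per-programmer snippet counts,
    # following Clauset et al. [5]. Requires: pip install powerlaw
    import powerlaw

    # Placeholder: snippet counts, one entry per programmer (real data has 49/21 entries).
    snippet_counts = [5, 18, 29, 35, 47, 62, 121]

    fit = powerlaw.Fit(snippet_counts, discrete=True)  # MLE fit; x_min chosen by KS distance
    print("alpha =", fit.power_law.alpha)  # exponent (2.86 train / 3.06 test in Table 1)
    print("x_min =", fit.power_law.xmin)
    # The p-values in Table 1 correspond to the goodness-of-fit bootstrap described
    # in Clauset et al. [5]; that test is carried out separately and is not shown here.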
Furthermore, there is considerable similarity among the programmers in the way they wrote code. This stems from two factors: (a) the programmers were given standardized coding questions (prompts) to implement, and (b) they were not precluded from using the Internet and copy-pasting code from it. Moreover, programmers often wrote comments and named variables in non-English languages (we detected Spanish in a manual investigation), and also submitted run information (which should ideally remain separate from the code).

All the above observations indicate that the data contains much noise. While we could have opted for a serious filtering and preprocessing step, such a procedure was considered potentially harmful, because we could end up removing useful information such as coding style and unique developer signatures. Note also that much of the source code is not natural language, so standard NLP tools such as parsers, named entity recognizers, and POS taggers would have been of little use in this scenario. Explicit code-style indicators such as commenting and indentation patterns could have been useful, but the possibility of code copy-pasted from the Internet renders such features unreliable. Since comments and run information were intermixed with code, we needed a set of simple, robust, powerful, scalable, and language-independent features.

We are of the opinion that the only type of features offering all five of the above desiderata is word and character n-grams. They kill two birds with one stone: they are robust and resistant to copy-pasting from the Internet (because of the shingling property much used in plagiarism research [1]), and they are very effective at discriminating between author styles (as evidenced in authorship attribution studies [7, 13, 17]).

We therefore experimented with the following two categories of features: (1) bag of words, and (2) character n-grams (n = 1, 2, 3), with and without space characters and punctuation symbols. For each category, we experimented with lowercased and original-case formatting, and three representations: binary (presence/absence), term frequency (tf), and tfidf. Word n-grams (n = 2, 3) and combinations of different feature types (feature fusion; cf. [10]) could not be explored due to sparsity and runtime issues, which we would like to investigate in the future. A sketch of this feature pipeline is given below.
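The following sketch shows how this feature extraction could look in scikit-learn. The paper fixes only the feature definitions, not the implementation; the make_preprocessor helper and the exact character sets for the variants are our own illustration.

    # Sketch (our own illustration): character n-gram features in the
    # AC / SS / PP / SP variants of Tables 2-4.
    import string
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    def make_preprocessor(drop_chars, lowercase=True):
        # A custom preprocessor replaces scikit-learn's built-in lowercasing,
        # so we lowercase here ourselves when requested.
        table = str.maketrans("", "", drop_chars)
        return lambda doc: (doc.lower() if lowercase else doc).translate(table)

    VARIANTS = {
        "AC": "",                        # all characters
        "SS": " ",                       # all characters except space
        "PP": string.punctuation,        # all characters except punctuation
        "SP": " " + string.punctuation,  # all characters except space and punctuation
    }

    docs = ["public class Foo { /* one concatenated submission per programmer */ }"]

    for name, drop in VARIANTS.items():
        # binary and tf representations: CountVectorizer(binary=True / binary=False);
        # tfidf: TfidfVectorizer. ngram_range=(2, 2) or (3, 3) gives bi-/trigrams.
        vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 1),
                                     preprocessor=make_preprocessor(drop))
        X = vectorizer.fit_transform(docs)  # shape: (num_programmers, vocabulary_size)
        print(name, X.shape)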
We used three different regression models (general linear models) from the scikit-learn package [12]: Linear Regression, Ridge Regression, and Lasso. For Linear and Ridge Regression, we used default parameter settings. For Lasso, we tuned the α parameter as described in the next section. In the next section, we will also see how the combinations of different features and regressors perform.

4. RESULTS
As mentioned in Section 1, we performed leave-one-coder-out cross-validation on the training data to find the optimal feature-regressor combination, as well as the optimal parameter settings. We used the average of the five RMSEs (one per personality trait) as our objective function, and sought to minimize this mean RMSE. The reason we did not use the Pearson correlation coefficient (ρ) or its square (R²) is that there is some debate as to whether one should use plain R² or adjusted R²; RMSE avoids this debate.
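A sketch of this evaluation loop is shown below, assuming a feature matrix X with one row per programmer and a five-column matrix Y of trait ratings. The helper name loco_mean_rmse is ours, not from the original system.

    # Sketch: leave-one-coder-out cross-validation, scored by the mean RMSE
    # across the five traits. X: (n_coders, n_features); Y: (n_coders, 5).
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import LeaveOneOut

    def loco_mean_rmse(X, Y, make_model=lambda: Lasso()):
        preds = np.zeros_like(Y, dtype=float)
        for train_idx, test_idx in LeaveOneOut().split(X):
            for t in range(Y.shape[1]):  # one independent regressor per trait
                model = make_model()
                model.fit(X[train_idx], Y[train_idx, t])
                preds[test_idx, t] = model.predict(X[test_idx])
        # RMSE per trait over all held-out coders, then averaged across traits.
        rmse_per_trait = np.sqrt(((preds - Y) ** 2).mean(axis=0))
        return rmse_per_trait.mean()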
The main results are shown in Tables 2 through 4. Note that, overall, Linear Regression performs the worst, with very high RMSEs across most feature combinations. This is expected, because the mapping from features to ratings is likely highly non-linear. Ridge Regression and Lasso perform much better, with the best values coming from Lasso on lowercased character unigrams, for binary, tf, and tfidf alike. This is a rather surprising finding, as it shows two things: (a) a handful of very simple character unigrams can capture a very complex, highly non-linear output space, and (b) character unigrams beat more complex features in expressive power.

Table 2: RMSE of leave-one-out cross-validation on the training data (default parameter settings for Lasso) with binary (presence/absence) feature representation. Minimum values are marked with *. AW = all words, AC = all characters, SS = all characters except space, PP = all characters except punctuation, SP = all characters except space and punctuation.

Feature Category                 Feature Type  Linear Regression  Ridge Regression  Lasso
Word unigrams                    AW            2.82e12            8.77              8.89
Word unigrams (lowercased)       AW            5.53e12            8.78              8.95
Character unigrams               AC            5.83e12            8.80              8.66
                                 SS            1.34e13            8.80              8.66
                                 PP            4.14e12            8.82              8.65
                                 SP            1.21e12            8.82              8.65
Character bigrams                AC            5.48e11            8.97              8.64
                                 SS            5.16e11            9.40              8.89
                                 PP            3.51e11            9.75              8.64
                                 SP            4.19e11            9.39              8.86
Character trigrams               AC            3.91e12            8.65              8.72
                                 SS            4.72e12            8.61              8.68
                                 PP            5.32e12            8.60              8.83
                                 SP            7.53e12            8.73              8.81
Character unigrams (lowercased)  AC            7.99e12            8.73              8.54
                                 SS            1.54e13            8.73              8.54
                                 PP            2.43e13            8.66              8.51*
                                 SP            1.02e13            8.66              8.51*
Character bigrams (lowercased)   AC            2.50e11            9.11              8.51*
                                 SS            2.80e11            10.11             8.82
                                 PP            2.12e11            9.82              8.60
                                 SP            4.85e11            9.89              8.87
Character trigrams (lowercased)  AC            7.00e12            8.72              8.73
                                 SS            6.35e12            8.69              8.77
                                 PP            5.17e12            8.70              8.81
                                 SP            5.55e12            8.86              8.92

Table 3: RMSE of leave-one-out cross-validation on the training data (default parameter settings for Lasso) with tf feature representation. The minimum value is marked with *. AW = all words, AC = all characters, SS = all characters except space, PP = all characters except punctuation, SP = all characters except space and punctuation.

Feature Category                 Feature Type  Linear Regression  Ridge Regression  Lasso
Word unigrams                    AW            2.63e11            9.07              8.96
Word unigrams (lowercased)       AW            4.50e11            9.14              9.01
Character unigrams               AC            8.75               8.73              8.60
                                 SS            8.77               8.75              8.63
                                 PP            8.72               8.70              8.63
                                 SP            8.78               8.77              8.71
Character bigrams                AC            5.17e7             13.90             9.23
                                 SS            1.29e9             14.23             8.61
                                 PP            1.82e7             13.22             9.31
                                 SP            4.02e8             14.97             8.77
Character trigrams               AC            3.03e8             9.53              8.89
                                 SS            2.55e11            9.58              9.02
                                 PP            2.91e8             10.09             9.22
                                 SP            7.80e11            10.21             9.06
Character unigrams (lowercased)  AC            8.61               8.59              8.49
                                 SS            8.67               8.65              8.56
                                 PP            8.48               8.47              8.43*
                                 SP            8.52               8.51              8.48
Character bigrams (lowercased)   AC            1.18e7             16.43             8.80
                                 SS            1.14e9             17.02             8.67
                                 PP            157.72             14.69             9.22
                                 SP            95.91              16.16             8.97
Character trigrams (lowercased)  AC            6.55e8             9.85              9.17
                                 SS            1.27e11            9.93              9.40
                                 PP            1.96e8             10.89             9.81
                                 SP            2.69e10            11.24             9.45

Table 4: RMSE of leave-one-out cross-validation on the training data (default parameter settings for Lasso) with tfidf feature representation. Minimum values are marked with *. AW = all words, AC = all characters, SS = all characters except space, PP = all characters except punctuation, SP = all characters except space and punctuation.

Feature Category                 Feature Type  Linear Regression  Ridge Regression  Lasso
Word unigrams                    AW            1.56e12            8.70              8.63
Word unigrams (lowercased)       AW            1.61e12            8.72              8.66
Character unigrams               AC            8.73               8.61              8.47
                                 SS            8.73               8.61              8.47
                                 PP            8.79               8.74              8.60
                                 SP            8.79               8.74              8.60
Character bigrams                AC            3.12e10            13.20             9.51
                                 SS            3.69e10            15.10             9.66
                                 PP            2.71e10            18.91             8.87
                                 SP            9.82e9             19.09             8.90
Character trigrams               AC            8.12e11            9.45              9.62
                                 SS            1.96e12            9.48              9.01
                                 PP            1.86e12            9.46              9.22
                                 SP            2.74e12            9.82              9.40
Character unigrams (lowercased)  AC            8.56               8.49              8.40*
                                 SS            8.56               8.49              8.40*
                                 PP            8.55               8.53              8.48
                                 SP            8.55               8.53              8.48
Character bigrams (lowercased)   AC            3.79e9             16.36             9.67
                                 SS            9.19e9             16.58             9.81
                                 PP            161.94             20.48             8.88
                                 SP            4.38e10            22.97             9.20
Character trigrams (lowercased)  AC            2.02e12            9.56              10.15
                                 SS            1.86e12            9.44              8.81
                                 PP            7.17e11            10.28             9.93
                                 SP            3.98e11            10.59             9.05

As a next step, we proceeded to tune the Lasso parameter α, which governs the shrinkage of the coefficients. Note from Tables 2 through 4 that the lowest RMSE came from lowercased character unigrams with tfidf. Hence, we fixed this combination and tweaked the α parameter of Lasso (see the sketch after the following list). We obtained the following five top-performing combinations:

1. all characters, Lasso α = 0.05, mean RMSE = 8.38.

2. all non-space characters, Lasso α = 0.05, mean RMSE = 8.38.

3. all characters, Lasso α = 0.1, mean RMSE = 8.4.

4. all non-space characters, Lasso α = 0.1, mean RMSE = 8.4.

5. all characters, Lasso α = 0.01, mean RMSE = 8.41.
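The sweep itself can be a small grid search on top of the cross-validation loop; a sketch reusing the hypothetical loco_mean_rmse helper from above, with placeholder data in place of the real feature matrix:

    # Sketch: tuning Lasso's alpha on lowercased character-unigram tfidf features.
    # X and Y are random stand-ins; in the real setup they come from the pipeline above.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.random((49, 60))        # placeholder for the tfidf character-unigram matrix
    Y = rng.random((49, 5)) * 10.0  # placeholder for the five trait ratings

    for alpha in (0.01, 0.05, 0.1, 0.5, 1.0):
        score = loco_mean_rmse(X, Y, make_model=lambda a=alpha: Lasso(alpha=a))
        print(f"alpha = {alpha}: mean RMSE = {score:.2f}")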
We used the corresponding models on the test data as our five runs. The final results of the five runs are shown in Table 5. Our Run 5 achieved the best RMSE on extroversion (8.60), and competitive results on the other traits. We believe that with more parameter tuning and feature engineering (e.g., word n-grams), we can improve on our existing system and advance the state of the art in this challenging and interesting task.

Table 5: RMSE of the five submitted runs on the test data.

Run ID  Neuroticism  Extroversion  Openness  Agreeableness  Conscientiousness
1       10.69        9.00          8.58      9.38           8.89
2       10.69        9.00          8.58      9.38           8.89
3       10.53        9.05          8.43      9.32           8.88
4       10.53        9.05          8.43      9.32           8.88
5       10.83        8.60          9.06      9.66           8.77

5. CONCLUSION
In this paper, we reported the design, feature engineering, and evaluation of the BESUMich system submitted to the PR-SOCO Shared Task [15]. One of our runs achieved the best RMSE on extroversion, and all five runs performed competitively. We could not experiment with word n-grams due to sparsity and runtime issues, but we hope to resolve these in future work.

Future research directions include a more rigorous feature engineering and parameter tuning step, along with feature ranking to identify which features are the most important for this task. Another interesting idea would be to explore the learning curve, to see how much training data is sufficient to obtain reasonable RMSE values. Similarly, a feature curve would indicate a reasonable vocabulary size for the experiments we performed. Overall, we are hopeful that our methodology, combined with the methods presented by other participants, will significantly advance future research in this domain.
6. REFERENCES
[1] S. M. Alzahrani, N. Salim, and A. Abraham. Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods. Trans. Sys. Man Cyber Part C, 42(2):133–149, Mar. 2012.
[2] S. Argamon, S. Dhawle, M. Koppel, and J. W. Pennebaker. Lexical Predictors of Personality Type. In Proceedings of the 2005 Joint Annual Meeting of the Interface and the Classification Society of North America, 2005.
[3] S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler. Automatically Profiling the Author of an Anonymous Text. Commun. ACM, 52(2):119–123, Feb. 2009.
[4] F. Celli, B. Lepri, J.-I. Biel, D. Gatica-Perez, G. Riccardi, and F. Pianesi. The Workshop on Computational Personality Recognition 2014. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 1245–1246, New York, NY, USA, 2014. ACM.
[5] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-Law Distributions in Empirical Data. SIAM Rev., 51(4):661–703, Nov. 2009.
[6] P. T. Costa Jr. and R. R. McCrae. The Revised NEO Personality Inventory (NEO-PI-R). The SAGE Handbook of Personality Theory and Assessment, 2:179–198, 2008.
[7] H. J. Escalante, T. Solorio, and M. Montes-y-Gómez. Local Histograms of Character N-grams for Authorship Attribution. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 288–298, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[8] D. Estival, T. Gaustad, S. B. Pham, W. Radford, and B. Hutchinson. Author Profiling for English Emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING'07), pages 263–272, 2007.
[9] A. Genkin, D. D. Lewis, and D. Madigan. Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics, 49(3):291–304, 2007.
[10] U. G. Mangai, S. Samanta, S. Das, and P. R. Chowdhury. A Survey of Decision Fusion and Feature Fusion Strategies for Pattern Classification. IETE Technical Review, 27(4):293–307, 2010.
[11] O. H. Paruma-Pabón, F. A. González, J. Aponte, J. E. Camargo, and F. Restrepo-Calle. Finding Relationships between Socio-technical Aspects and Personality Traits by Mining Developer E-mails. In Proceedings of the 9th International Workshop on Cooperative and Human Aspects of Software Engineering, CHASE '16, pages 8–14, New York, NY, USA, 2016. ACM.
[12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[13] F. Peng, D. Schuurmans, S. Wang, and V. Keselj. Language Independent Authorship Attribution using Character Level Language Models. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics - Volume 1, EACL '03, pages 267–274, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[14] F. Rangel, F. Celli, P. Rosso, M. Potthast, B. Stein, and W. Daelemans. Overview of the 3rd Author Profiling Task at PAN 2015. 2015.
[15] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[16] F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, and G. Inches. Overview of the Author Profiling Task at PAN 2013. In CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pages 352–365. CELCT, 2013.
[17] U. Sapkota, S. Bethard, M. Montes, and T. Solorio. Not All Character N-grams Are Created Equal: A Study in Authorship Attribution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–102, Denver, Colorado, May–June 2015. Association for Computational Linguistics.