=Paper=
{{Paper
|id=Vol-1737/T1-9
|storemode=property
|title=Indian Statistical Institute Kolkata at PR-SOCO 2016: A Simple Linear Regression Based Approach
|pdfUrl=https://ceur-ws.org/Vol-1737/T1-9.pdf
|volume=Vol-1737
|authors=Kripabandhu Ghosh,Swapan Kumar Parui
|dblpUrl=https://dblp.org/rec/conf/fire/GhoshP16
}}
==Indian Statistical Institute Kolkata at PR-SOCO 2016: A Simple Linear Regression Based Approach==
Kripabandhu Ghosh and Swapan Kumar Parui, Indian Statistical Institute, Kolkata, India (kripa.ghosh@gmail.com, swapan.parui@gmail.com)

===ABSTRACT===
We participated in the PR-SOCO task hosted at FIRE 2016 and tried some basic approaches which we look to improve in the future. We defined some simple features from the source code which, in our opinion, were indicative of the manner in which the code was written and which might give some clues about the personality of the programmer. We built a multiple linear regression model from the training data and applied this model on the test data. The results show that our method produces good prediction performances for Neuroticism, Extroversion and Openness.

===CCS Concepts===
* Computing methodologies → Supervised learning by regression
* Information systems → Information extraction

===Keywords===
BIG5 personality; Source code; Linear regression

===1. INTRODUCTION===
Much work has been done on predicting user personality from text written in a natural language (e.g., Facebook status updates [2]). The task of predicting the age, gender and personality traits of Twitter users has also been attempted in the author profiling task [3], one of the tasks of PAN/CLEF 2015 [5]. However, the PR-SOCO 2016 task [4] presents a different and possibly more challenging problem. The main challenge lies in the fact that in this task the BIG5 personality traits [1] need to be predicted from source code, which is written within the strict lexical and syntactic bounds of a programming language. This is likely to limit the usual vocabulary which the programmer could have used in a natural language composition. So we looked to employ simple means of judging the quality of the program code, hoping to gain insights into the personality of the programmer.

Firstly, we tried to evaluate the "readability" of the code by automatically detecting the tendency of the programmer to provide useful comments, i.e., comments which describe the functionality and purpose of different segments of the code. On the other hand, we considered the presence of commented-out lines of code in the source file to be undesirable. We also considered the judicious use of spaces in the code to be a good programming practice which improves readability. For the readability aspect, three features (MLC, SLC and NES) are defined in the next section. Secondly, we tried to judge the efficiency of the code. Since we were not provided with the problem statement or input data for which the source codes were written, we had no way to evaluate the algorithmic efficiency of the code. However, we noticed that a particular feature can be used to understand the efficiency of the code, to some extent. For the efficiency aspect, one feature (IS) is defined in the next section.

We believe that these four features can predict the personality of a person. For example, a person with prominent Neuroticism<sup>1</sup> exhibits low emotional stability and so is likely to be less methodical in writing code. Persons with high Extroversion,<sup>2</sup> on the other hand, are likely to express themselves and possibly provide meaningful comments in their code. We discuss these features in the following section. Next, we use these features for predicting the personality traits. We model a multiple linear regression<sup>3</sup> for each BIG5 personality trait. That is, each BIG5 personality value, for a given user, is predicted from these features extracted from her program code. In the multiple linear regression framework, each of the BIG5 traits is the dependent variable and the four features are the explanatory variables.

The rest of the paper is arranged as follows: we describe the proposed methodology in Section 2, present the results in Section 3 and conclude in Section 4.

===2. METHODOLOGY===

====2.1 Feature selection====
We used four features (explanatory variables) for multiple linear regression. Here each of the BIG5 traits is the dependent variable. The feature values were extracted from the source code of each program file. The features are as follows (examples are shown in Table 1):

1. Multi-line comments (MLC): This is the number of genuine comment words in multi-line comments, i.e., between /* and */, found in the program code. In Table 1, we show a genuine comment under the Positive example. We have not counted the cases where lines of code were commented out, as shown under the Negative example. To extract this feature from a source code file, we first read the lines within /* and */. Then we eliminated any instances of program code by searching for a regular expression containing the symbols ; and = and function calls of the form [a-zA-Z][a-zA-Z]*[ ]*( (e.g., System.out.println("Even");) used in Java code. This feature value was normalized by dividing it by the total number of words in the program file.

2. Single-line comments (SLC): This is the number of genuine comment words in single-line comments, i.e., comments following // (as shown in Table 1, under the Positive example). Here also, we have not counted the cases where lines of code were commented out (as shown in Table 1, under the Negative example). To extract this feature value, we simply determined the number of words following // in the code and then eliminated the occurrences of program code by the procedure used for the feature MLC. This feature value was normalized by dividing it by the total number of words in the program file.

3. Non-existent spaces (NES): This is the number of lines containing missing spaces, e.g., i=1; i<=casos; as shown in Table 1 under the Positive example, as opposed to i = 1; i <= casos; as shown under the Negative example. We have considered this feature since the presence of spaces is supposed to be a good programming practice. This feature was extracted by identifying the lines of code satisfying the regular expression [a-z][a-z]* [a-z][a-z]*[=<>+] (e.g., int i=1). This feature value was normalized by dividing it by the total number of lines in the program file.

4. Import Specific (IS): This is the number of instances where the programmer imported only the specific libraries needed (e.g., import java.io.FileNotFoundException as opposed to import java.io.*). These examples are also shown in Table 1. We have considered this feature since importing specific libraries is supposed to be a good programming practice which reduces compilation time. In addition, the choice of specific libraries may indicate experience and proficiency in programming: a good programmer is supposed to know which specific libraries will be useful, while an inexperienced programmer is more likely to import all the libraries to somehow get the job done. This feature was extracted by counting all the instances of "import" not ending with a "*". This feature value was also normalized by the total number of lines in the program file.

====2.2 Multiple linear regression model====
For each BIG5 trait, we define a multiple linear regression model<sup>4</sup> for a program code p, given as follows:

 score_BIG5(p) = α + β1·MLC(p) + β2·SLC(p) + β3·NES(p) + β4·IS(p)   (1)

We calculate the values of the parameters α and βi, i = 1, 2, 3, 4, from the training data using the linear regression implementation in R.<sup>5</sup> Here, score_BIG5 is the dependent variable and MLC, SLC, NES and IS are the explanatory variables.

===3. RESULTS===
We submitted two runs as follows:

1. Run1.txt: The values of the dependent variables were generated on the test data using the regression equation (1) learned from the training data.

2. Run2.txt: For this run, for each BIG5 trait, we calculated the values of the dependent variables given by the linear regression equation (1) on the training set. We then calculated the error between the predicted value and the actual value for each of the 49 training samples and removed the samples with the three highest error values. We then trained the linear regression on the new training set and calculated the regression coefficients. Finally, the values of the dependent variables were calculated on the test data. The purpose of this run is to remove some outliers from the training set.

The performances of these two runs are shown in Tables 2 and 3. Table 2 reports the results in terms of RMSE. The table also reports two official baselines (bow and mean) and the best results reported among all the submitted runs (Reported best).<sup>6</sup> In RMSE, our run Run1.txt produced the best performance for Extroversion. This run also produced good performances for Neuroticism and Openness when compared with the baselines. Table 3 reports the results in terms of the Pearson Product-Moment Correlation (PC), again alongside the two official baselines and the reported best. In PC, our run Run1.txt produced the best performance for Neuroticism. This run also produced good performances for Extroversion and Openness when compared with the baselines.

Table 4 shows the regression coefficient values learned from the training data for each BIG5 trait, as used for Run1. Since our predictions for Neuroticism, Extroversion and Openness are promising, we try to draw some inferences from Table 4 for these traits, as follows.

Neuroticism: The large negative value of β2 indicates that a person who frequently provides single-line comments (SLC) in her code is likely to exhibit a low level of Neuroticism. This agrees with our intuition that a Neurotic person is not organized in her coding. However, the positive (though smaller in magnitude) value of β1 indicates that a person who provides multi-line comments (MLC) is likely to have a high level of Neuroticism. Also, the negative value of β3 indicates that a person who does not provide the necessary spaces in the code is likely to have a low level of Neuroticism. These two coefficient values somewhat contradict our intuition that a Neurotic person is necessarily chaotic while writing code. But the large negative value of β4 indicates that a person who tends to import libraries selectively is likely to have a low level of Neuroticism, which again agrees with our intuition.

Extroversion: The positive values of β1, β2 and β4 indicate that a person who tends to provide genuine comments (both multi-line and single-line) and to import specific libraries is likely to have high Extroversion. But the positive value of β3 indicates that an Extrovert may not provide appropriate spaces in her code. The value of β2 is much higher than the other coefficients, which implies that a person with a tendency to provide genuine single-line comments is likely to have high Extroversion.

Openness: The observations about Openness are similar to those about Extroversion.

However, the prediction results show that our features are possibly not suitable indicators for Agreeableness and Conscientiousness.

===4. CONCLUSION===
We see that these simple and intuitive features, extracted from samples of written program code, yield promising prediction results for Neuroticism, Extroversion and Openness. We also gain some interesting insights into the relationship between these three traits and the features: for example, Neuroticism has a strong negative correlation with the tendency to write genuine single-line comments, while Extroversion has a strong positive correlation with the same tendency. However, these features are not adequate for predicting Agreeableness and Conscientiousness. We look to explore other features in the future.

===Footnotes===
1. https://en.wikipedia.org/wiki/Neuroticism as seen on 26th October, 2016
2. https://en.wikipedia.org/wiki/Extraversion_and_introversion as seen on 26th October, 2016
3. https://en.wikipedia.org/wiki/Linear_regression as seen on 26th October, 2016
4. https://en.wikipedia.org/wiki/Linear_regression as seen on 11th October, 2016
5. https://www.r-bloggers.com/r-tutorial-series-multiple-linear-regression/
6. These values are reported at http://www.autoritas.es/prsoco/evaluation/

===5. REFERENCES===
[1] P. Costa and R. McCrae. The Revised NEO Personality Inventory (NEO-PI-R). In The SAGE Handbook of Personality Theory and Assessment, pages 179–198, 2008.

[2] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. De Cock. Computational personality recognition in social media. User Modeling and User-Adapted Interaction, pages 1–34, 2016.

[3] F. M. R. Pardo, F. Celli, P. Rosso, M. Potthast, B. Stein, and W. Daelemans. Overview of the 3rd author profiling task at PAN 2015. In Working Notes of CLEF 2015, Conference and Labs of the Evaluation Forum, Toulouse, France, September 8-11, 2015.

[4] F. Rangel, F. González, F. Restrepo, M. Montes, and P. Rosso. PAN at FIRE: Overview of the PR-SOCO track on personality recognition in source code. In Working Notes of FIRE 2016, Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.

[5] E. Stamatatos, M. Potthast, F. M. R. Pardo, P. Rosso, and B. Stein. Overview of the PAN/CLEF 2015 evaluation lab. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings, pages 518–538, 2015.
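To make the feature extraction of Section 2.1 concrete, the four normalized features can be sketched as follows. This is a minimal Python sketch, not the authors' pipeline: the CODE_LIKE pattern used to discard commented-out code is an assumed approximation of the regular expression the paper describes for MLC and SLC, while NO_SPACE is the NES pattern quoted in the paper.

```python
import re

# CODE_LIKE is an assumed approximation of the paper's filter for
# commented-out code (; and = symbols, or a function call "name(");
# NO_SPACE is the NES regular expression quoted in Section 2.1.
CODE_LIKE = re.compile(r';|=|[A-Za-z][A-Za-z0-9_]*\s*\(')
NO_SPACE = re.compile(r'[a-z][a-z]* [a-z][a-z]*[=<>+]')

def extract_features(source: str) -> dict:
    words = source.split()
    lines = source.splitlines()
    n_words = max(len(words), 1)  # guard against empty files
    n_lines = max(len(lines), 1)

    # MLC: words inside /* ... */ blocks that do not look like code
    mlc = 0
    for block in re.findall(r'/\*(.*?)\*/', source, flags=re.DOTALL):
        if not CODE_LIKE.search(block):
            mlc += len(block.split())

    # SLC: words after // on lines that do not look like commented-out code
    slc = 0
    for line in lines:
        if '//' in line:
            comment = line.split('//', 1)[1]
            if not CODE_LIKE.search(comment):
                slc += len(comment.split())

    # NES: lines with missing spaces around operators, per the paper's regex
    nes = sum(1 for line in lines if NO_SPACE.search(line))

    # IS: import statements naming a specific class rather than ending in ".*"
    imp = sum(1 for line in lines
              if line.strip().startswith('import')
              and not line.strip().rstrip(';').endswith('*'))

    # MLC and SLC are normalized by word count, NES and IS by line count
    return {'MLC': mlc / n_words, 'SLC': slc / n_words,
            'NES': nes / n_lines, 'IS': imp / n_lines}
```

Run on a Java file such as the Table 1 examples, this yields one four-dimensional feature vector per program, ready to be fed to the regression of Section 2.2.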
{| class="wikitable"
! Feature !! Positive example !! Negative example
|-
| MLC || /** * Make the hash table logically empty. */ || /*System.out.println("Even"); printQ(qEven); System.out.println("Odd"); printQ(qOdd);*/
|-
| SLC || // Create a new double-sized, empty table || //String[] ss = linea.readLine().split(" ");
|-
| NES || for (int i=1; i<=casos; i++) || for (int i = 1; i <= casos; i++)
|-
| IS || import java.io.FileNotFoundException || import java.io.*
|}
Table 1: Positive examples (i.e., conforming to the feature requirement) and negative examples (i.e., not conforming to the feature requirement) of the features. For MLC and SLC, the positive examples show genuine comments while the negative examples show lines of code that are commented out. For NES, the positive example shows a case where space is absent while the negative example shows a case where spaces are present. For IS, the positive example shows a case where a specific library is imported while in the negative example all the libraries are imported.

{| class="wikitable"
! Method !! NEUROTICISM !! EXTROVERSION !! OPENNESS !! AGREEABLENESS !! CONSCIENTIOUSNESS
|-
| Run1.txt || 10.22 || '''8.60''' || 7.16 || 9.60 || 9.99
|-
| Run2.txt || 10.04 || 10.17 || 7.36 || 9.55 || 10.16
|-
| Baseline (bow) || 10.29 || 9.06 || 7.74 || 9.00 || 8.47
|-
| Baseline (mean) || 10.26 || 9.06 || 7.57 || 9.04 || 8.54
|-
| Reported best || 9.78 || 8.60 || 6.95 || 8.79 || 8.38
|}
Table 2: Root Mean Squared Error (RMSE). The best result produced by our submitted runs when compared to all the submitted runs is shown in bold.

{| class="wikitable"
! Method !! NEUROTICISM !! EXTROVERSION !! OPENNESS !! AGREEABLENESS !! CONSCIENTIOUSNESS
|-
| Run1.txt || '''0.36''' || 0.35 || 0.33 || 0.09 || -0.20
|-
| Run2.txt || 0.27 || 0.04 || 0.27 || 0.11 || -0.13
|-
| Baseline (bow) || 0.06 || 0.12 || -0.17 || 0.20 || 0.17
|-
| Baseline (mean) || 0.00 || 0.00 || 0.00 || 0.00 || 0.00
|-
| Reported best || 0.36 || 0.47 || 0.62 || 0.38 || 0.33
|}
Table 3: Pearson Product-Moment Correlation (PC). The best result produced by our submitted runs when compared to all the submitted runs is shown in bold.
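Equation (1) is an ordinary least-squares problem in five parameters. The authors fit it with the linear regression implementation in R; the following numpy version is an assumed equivalent for illustration (the helper names fit_big5_regression and predict_big5 are ours, not the authors'):

```python
import numpy as np

def fit_big5_regression(X, y):
    """Fit score = alpha + b1*MLC + b2*SLC + b3*NES + b4*IS by ordinary
    least squares. X is an (n, 4) feature matrix, y the trait scores."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef[0], coef[1:]                   # (alpha, betas)

def predict_big5(alpha, betas, X):
    """Apply equation (1) to a feature matrix."""
    return alpha + np.asarray(X, dtype=float) @ betas
```

One such model is fitted per BIG5 trait, giving the five coefficient rows of Table 4.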
{| class="wikitable"
! BIG5 Trait !! α (Intercept) !! β1 (MLC) !! β2 (SLC) !! β3 (NES) !! β4 (IS)
|-
| Neuroticism || 55.30 || 10.82 || -331.58 || -57.15 || -282.14
|-
| Extroversion || 39.58 || 50.49 || 261.44 || 67.38 || 163.28
|-
| Openness || 46.63 || 46.07 || 98.92 || 28.20 || 49.48
|-
| Agreeableness || 42.521 || -1.103 || 78.905 || 90.909 || 196.740
|-
| Conscientiousness || -1.708 || -1.708 || 225.988 || -67.633 || 135.353
|}
Table 4: The regression coefficients for Run1.
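The outlier-removal step of Run2 (fit on all 49 training samples, drop the three with the highest prediction error, then refit) can be sketched as follows, again as a hedged numpy approximation of the authors' R workflow; fit_without_outliers is a hypothetical helper name:

```python
import numpy as np

def fit_without_outliers(X, y, n_drop=3):
    """Fit, drop the n_drop training samples with the largest absolute
    prediction error, then refit on the remaining samples (Run2, Section 3)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # first fit on all samples
    errors = np.abs(A @ coef - y)                  # per-sample training error
    keep = np.argsort(errors)[:-n_drop]            # drop the n_drop worst
    coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
    return coef                                    # [alpha, b1, b2, b3, b4]
```

Samples with grossly mismatched trait scores thus no longer pull the fitted plane, which is the intent of Run2.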