Indian Statistical Institute, Kolkata at PR-SOCO 2016: A Simple Linear Regression Based Approach

Kripabandhu Ghosh, Indian Statistical Institute, Kolkata, India (kripa.ghosh@gmail.com)
Swapan Kumar Parui, Indian Statistical Institute, Kolkata, India (swapan.parui@gmail.com)



ABSTRACT
We participated in the PR-SOCO task hosted at FIRE 2016 and tried some basic approaches which we look to improve in the future. We defined some simple features from the source code which, in our opinion, were indicative of the manner in which the code was written and which might give some clues about the personality of the programmer. We built a multiple linear regression model from the training data and applied this model to the test data. The results show that our method produces good prediction performance for Neuroticism, Extroversion and Openness.

CCS Concepts
• Computing methodologies → Supervised learning by regression; • Information systems → Information extraction;

Keywords
BIG5 personality; Source code; Linear regression

1. INTRODUCTION
Much work has been done on predicting user personality from text written in a natural language (e.g., Facebook status updates [2]). The task of predicting the age, gender and personality traits of Twitter users has also been attempted in the author profiling task [3], one of the tasks of PAN/CLEF 2015 [5]. However, the PR-SOCO 2016 task [4] presents a different and possibly more challenging problem. The main challenge lies in the fact that in this task the BIG5 personality traits [1] need to be predicted from source code, which is written within the strict lexical and syntactic bounds of a programming language. This is likely to limit the usual vocabulary which the programmer could have used in a natural language composition. So, we looked to employ simple means of judging the quality of the program code, hoping to gain insights into the personality of the programmer.

Firstly, we tried to evaluate the "readability" of the code by automatically detecting the tendency of the programmer to provide useful comments in the code. By useful comments we mean the ones which describe the functionality and purpose of different segments of the code. However, we considered the presence of commented-out lines of code in the source file to be undesirable. We also considered the judicious use of spaces in the code to be a good programming practice, which was also supposed to improve readability. For the readability aspect, three features (MLC, SLC and NES) are defined in the next section. Secondly, we tried to judge the efficiency of the code. Since we were not provided with the problem statement or the input data for which the source codes were written, we had no way to evaluate the algorithmic efficiency of the code. However, we noticed that a particular feature can be used to understand the efficiency of the code, to some extent. For the efficiency aspect, one feature (IS) is defined in the next section.

We believe that these four features can predict the personality of a person. For example, a person with prominent Neuroticism (https://en.wikipedia.org/wiki/Neuroticism, as seen on 26th October, 2016) exhibits low emotional stability and so is likely to be less methodical in writing code. Persons with high Extroversion (https://en.wikipedia.org/wiki/Extraversion_and_introversion, as seen on 26th October, 2016), on the other hand, are likely to express themselves and possibly provide meaningful comments in their code. We discuss these features in the following section. Next, we use these features for predicting the personality traits. We fit a multiple linear regression (https://en.wikipedia.org/wiki/Linear_regression, as seen on 26th October, 2016) for each BIG5 personality trait. That is, each BIG5 personality value, for a given user, is predicted from the features extracted from her program code. In the multiple linear regression framework, each of the BIG5 traits is the dependent variable and the four features are the explanatory variables.

The rest of the paper is arranged as follows: we describe the proposed methodology in Section 2, present the results in Section 3 and conclude in Section 4.

2. METHODOLOGY

2.1 Feature selection
We used four features (explanatory variables) for multiple linear regression. Here each of the BIG5 traits is the dependent variable. The feature values were extracted from the source code of each program file. The features are as follows (examples are shown in Table 1; a sketch of the extraction procedure is given after the list):
1. Multi-line comments (MLC): This is the number of genuine comment words in multi-line comments, i.e., between /* and */, found in the program code. In Table 1, we see a case of a genuine comment under Positive example. We have not considered the cases where lines of code were commented out, as shown under Negative example. To extract this feature from a source code file, we first read the lines between /* and */. Then we eliminated any instances of program code by searching for a regular expression that matches the symbols ; and = and function calls of the form [a-zA-Z][a-zA-Z]*[ ]*( (e.g., System.out.println("Even");) used in Java code. This feature value was normalized by dividing it by the total number of words in the program file.

2. Single-line comments (SLC): This is the number of genuine comment words in single-line comments, i.e., comments following "//" (as shown in Table 1, under Positive example). Here also, we have not considered the cases where lines of code were commented out (as shown in Table 1, under Negative example). To extract this feature value, we simply determined the number of words following "//" in the code. Then we eliminated the occurrences of program code by the procedure used for the feature MLC. This feature value was normalized by dividing it by the total number of words in the program file.

3. Non-existent spaces (NES): This is the number of lines containing non-existent spaces, e.g., i=1; i<=casos; as shown in Table 1 under Positive example, as opposed to i = 1; i <= casos; as shown in Table 1 under Negative example. We have considered this feature since the presence of spaces is supposed to be a good programming practice. This feature was extracted by identifying the lines of code satisfying the regular expression [a-z][a-z]* [a-z][a-z]*[=<>+] (e.g., int i=1). This feature value was normalized by dividing it by the total number of lines in the program file.

4. Import Specific (IS): This is the number of instances where the programmer imported only the specific libraries needed (e.g., import java.io.FileNotFoundException as opposed to import java.io.*). These examples are also shown in Table 1. We have considered this feature since it is supposed to be a good programming practice to use specific libraries, which reduces compilation time. In addition, the choice of specific libraries may indicate experience and proficiency in programming, because a good programmer is supposed to know which specific libraries will be useful. An inexperienced programmer, on the other hand, is more likely to "import" all the libraries to somehow get the job done. This feature was extracted by considering all the instances of "import" not ending with a "*". This feature value was also normalized by the total number of lines in the program file.
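The following is a minimal sketch of this extraction procedure, re-implemented in Python purely for illustration; it is not our actual tooling. The function name extract_features is ours, and the regular expressions only approximate the patterns described in the list above.

```python
import re

# Approximations of the patterns described above (illustrative only):
# fragments that look like code inside a comment, and lines missing spaces.
CODE_LIKE = re.compile(r'[;=]|[a-zA-Z][a-zA-Z]*\s*\(')
NO_SPACE = re.compile(r'[a-z][a-z]* [a-z][a-z]*[=<>+]')

def extract_features(source):
    """Return the normalized (MLC, SLC, NES, IS) values for one program file."""
    lines = source.splitlines()
    total_words = max(len(source.split()), 1)
    total_lines = max(len(lines), 1)

    # MLC: words inside /* ... */ blocks that do not look like commented-out code.
    mlc = sum(len(block.split())
              for block in re.findall(r'/\*(.*?)\*/', source, re.DOTALL)
              if not CODE_LIKE.search(block))

    slc = nes = imp = 0
    for line in lines:
        # SLC: words after "//", unless the comment looks like commented-out code.
        if '//' in line:
            comment = line.split('//', 1)[1]
            if not CODE_LIKE.search(comment):
                slc += len(comment.split())
        # NES: lines with non-existent spaces, e.g., "int i=1".
        if NO_SPACE.search(line):
            nes += 1
        # IS: import statements naming a specific class rather than ending in "*".
        stripped = line.strip()
        if stripped.startswith('import') and not stripped.rstrip(';').endswith('*'):
            imp += 1

    # MLC and SLC are normalized by word count, NES and IS by line count.
    return (mlc / total_words, slc / total_words,
            nes / total_lines, imp / total_lines)
```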
2.2 Multiple linear regression model
For each BIG5 trait, we define a multiple linear regression model (https://en.wikipedia.org/wiki/Linear_regression, as seen on 11th October, 2016) for a program code p, given as follows:

    score_BIG5(p) = α + β1 MLC(p) + β2 SLC(p) + β3 NES(p) + β4 IS(p)        (1)

We calculate the values of the parameters α and βi, i = 1, 2, 3, 4, from the training data using the linear regression implementation in R (https://www.r-bloggers.com/r-tutorial-series-multiple-linear-regression/). Here score_BIG5 is the dependent variable and MLC, SLC, NES and IS are the explanatory variables.
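As an illustration of this step (and not our actual R script), the same ordinary least-squares fit of equation (1) can be written in Python with scikit-learn; the random arrays below are placeholders standing in for the real feature matrix and trait scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Placeholder data standing in for the 49 training programmers:
# one row of (MLC, SLC, NES, IS) per programmer and one trait score per row.
X_train = rng.random((49, 4))
y_train = rng.uniform(20, 80, 49)

# Ordinary least squares fit of equation (1); one such model per BIG5 trait.
model = LinearRegression().fit(X_train, y_train)
alpha = model.intercept_        # α in equation (1)
b1, b2, b3, b4 = model.coef_    # β1, β2, β3, β4 in equation (1)

# Predicted trait scores for unseen programmers (placeholder test data).
X_test = rng.random((10, 4))
y_pred = model.predict(X_test)
```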
3. RESULTS
We submitted two runs, as follows:

1. Run1.txt: The values of the dependent variables were generated on the test data using the regression equation (1) learned from the training data.

2. Run2.txt: For this run, for each BIG5 trait, we calculated the values of the dependent variables given by the linear regression equation (1) on the training set. We then calculated the error between the predicted value and the actual value for each of the 49 training samples. We removed the three samples in the training set with the highest error values, trained the linear regression again on the reduced training set and recalculated the regression coefficients. Finally, the values of the dependent variables were calculated on the test data. The purpose of this run is to remove some outliers from the training set (a sketch of this procedure is given after the list).
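In code, the Run2 outlier removal amounts to one round of fitting, ranking the training samples by absolute residual, and refitting; a minimal sketch, continuing the placeholder names of the previous listing:

```python
# Rank training samples by absolute residual under the first fit and
# drop the three worst, as in Run2.
residuals = np.abs(model.predict(X_train) - y_train)
keep = np.argsort(residuals)[:-3]   # all but the 3 highest-error samples

# Refit on the reduced training set and predict on the test data.
model2 = LinearRegression().fit(X_train[keep], y_train[keep])
y_pred_run2 = model2.predict(X_test)
```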
The performances of these two runs are shown in Tables 2 and 3. Table 2 reports the results in terms of Root Mean Squared Error (RMSE). The table also reports two official baselines (bow and mean) and the best result reported among all the submitted runs (Reported best); these values are reported at http://www.autoritas.es/prsoco/evaluation/. In terms of RMSE, our run Run1.txt produced the best performance for Extroversion. This run also produced good performances for Neuroticism and Openness when compared with the baselines.

Table 3 reports the results in terms of the Pearson Product-Moment Correlation (PC). The table also reports the two official baselines (bow and mean) and the best results reported among all the submitted runs (Reported best). In terms of PC, our run Run1.txt produced the best performance for Neuroticism. This run also produced good performances for Extroversion and Openness when compared with the baselines.
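For reference, the two evaluation measures can be computed as follows (a minimal sketch; y_true and y_pred are placeholders for the gold and predicted trait scores):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error, as reported in Table 2 (lower is better)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def pearson(y_true, y_pred):
    """Pearson Product-Moment Correlation, as reported in Table 3 (higher is better)."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```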
Table 4 shows the regression coefficient values learned from the training data for each BIG5 trait, as used for Run1. Since our predictions for Neuroticism, Extroversion and Openness are promising, we try to draw some inferences from Table 4 for these traits, as follows.

Neuroticism: The negative value of high magnitude of β2 indicates that a person who frequently provides Single Line Comments (SLC) in her code is likely to exhibit a low level of Neuroticism. This agrees with our intuition that a Neurotic person is not organized in her coding. However, the positive value (though of relatively lower magnitude) of β1 indicates that a person who provides Multi Line Comments (MLC) is likely to have a high level of Neuroticism. Also, the negative value of β3 indicates that a person who does not provide necessary spaces in the code is likely to have a low level of Neuroticism. These two coefficient values somewhat contradict our intuition that a Neurotic person is necessarily chaotic in nature while writing code. But the negative value of high magnitude of β4 indicates that a person who tends to import libraries selectively is likely to have a low level of Neuroticism, which again agrees with our intuition.

Extroversion: The positive values of β1, β2 and β4 indicate that a person who tends to provide genuine comments (both Multi Line and Single Line) and to import specific libraries in her code is likely to have high Extroversion. But the positive value of β3 indicates that an Extrovert may not provide appropriate spaces in her code. The value of β2 is much higher than the other coefficients, which implies that a person with a tendency to provide genuine Single Line Comments is likely to have high Extroversion.

Openness: The observations about Openness are similar to those about Extroversion.

However, the prediction results show that our features are possibly not suitable indicators for Agreeableness and Conscientiousness.
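As a concrete illustration of equation (1), the Neuroticism coefficients of Table 4 can be applied to a program file as follows; the feature values below are hypothetical, chosen only to show the arithmetic:

```python
# Equation (1) with the Neuroticism coefficients from Table 4.
alpha, b1, b2, b3, b4 = 55.30, 10.82, -331.58, -57.15, -282.14
mlc, slc, nes, imp = 0.02, 0.05, 0.10, 0.01  # hypothetical normalized features

score = alpha + b1 * mlc + b2 * slc + b3 * nes + b4 * imp
print(round(score, 2))  # 55.30 + 0.22 - 16.58 - 5.72 - 2.82, i.e., about 30.4
```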

4. CONCLUSION
We see that these simple and intuitive features yield promising prediction results for Neuroticism, Extroversion and Openness, as inferred from samples of written program code. We gain some interesting insights into the relationship of these three traits with these features: for example, Neuroticism has a strong negative correlation with the tendency to write genuine Single Line Comments, while Extroversion has a strong positive correlation with the same tendency. However, these features are not adequate for predicting Agreeableness and Conscientiousness. We look to explore other features in the future.

5. REFERENCES
[1] P. Costa and R. McCrae. The Revised NEO
    Personality Inventory (NEO-PI-R). In The SAGE
    Handbook of Personality Theory and Assessment, pages
    179–198, 2008.
[2] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli,
    M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and
    M. De Cock. Computational personality recognition in
    social media. User Modeling and User-Adapted
    Interaction, pages 1–34, 2016.
[3] F. M. R. Pardo, F. Celli, P. Rosso, M. Potthast,
    B. Stein, and W. Daelemans. Overview of the 3rd
    author profiling task at PAN 2015. In Working Notes of
    CLEF 2015 - Conference and Labs of the Evaluation
    forum, Toulouse, France, September 8-11, 2015., 2015.
[4] F. Rangel, F. González, F. Restrepo, M. Montes, and
    P. Rosso. PAN at FIRE: Overview of the PR-SOCO track
    on personality recognition in source code. In Working
    Notes of FIRE 2016 - Forum for Information Retrieval
    Evaluation, Kolkata, India, December 7-10, 2016,
    CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[5] E. Stamatatos, M. Potthast, F. M. R. Pardo, P. Rosso,
    and B. Stein. Overview of the PAN/CLEF 2015
    evaluation lab. In Experimental IR Meets
    Multilinguality, Multimodality, and Interaction - 6th
    International Conference of the CLEF Association,
    CLEF 2015, Toulouse, France, September 8-11, 2015,
    Proceedings, pages 518–538, 2015.
    Feature   Positive example                                Negative example
    MLC       /**                                             /*System.out.println("Even");
                * Make the hash table logically empty.          printQ(qEven);
                */                                              System.out.println("Odd");
                                                                printQ(qOdd);*/
    SLC       // Create a new double-sized, empty table       //String[] ss = linea.readLine().split(" ");
    NES       for (int i=1; i<=casos; i++)                    for (int i = 1; i <= casos; i++)
    IS        import java.io.FileNotFoundException            import java.io.*

Table 1: The table shows positive examples (i.e., conforming to the feature requirement) and negative
examples (i.e., not conforming to the feature requirement) of features. For MLC and SLC, the positive
examples show cases of genuine comments while the negative examples show cases where lines of code are
commented out. For NES, the positive example shows a case where space is absent while the negative example
shows a case where spaces are present. For IS, the positive example shows a case where a specific library is
imported while in the negative example, all the libraries are imported.




     Method        NEUROTICISM     EXTROVERSION          OPENNESS        AGREEABLENESS            CONSCIENTIOUSNESS
    Run1.txt          10.22            8.60                 7.16              9.60                       9.99
    Run2.txt          10.04            10.17                7.36              9.55                       10.16
 Baseline (bow)       10.29            9.06                 7.74              9.00                       8.47
 Baseline (mean)      10.26            9.06                 7.57              9.04                       8.54
  Reported best        9.78            8.60                 6.95              8.79                       8.38

Table 2: Root Mean Squared Error (RMSE). The best result produced by our submitted runs when compared
to all the submitted runs is shown in bold.




     Method        NEUROTICISM     EXTROVERSION          OPENNESS        AGREEABLENESS            CONSCIENTIOUSNESS
    Run1.txt           0.36            0.35                 0.33              0.09                       -0.20
    Run2.txt           0.27            0.04                 0.27              0.11                       -0.13
 Baseline (bow)        0.06            0.12                -0.17              0.20                       0.17
 Baseline (mean)       0.00            0.00                 0.00              0.00                       0.00
  Reported best        0.36            0.47                 0.62              0.38                       0.33

Table 3: Pearson Product-Moment Correlation (PC). The best result produced by our submitted runs when
compared to all the submitted runs is shown in bold.




                            BIG5                 α          β1        β2         β3         β4
                             Trait         (Intercept)   (MLC)      (SLC)      (NES)       (IS)
                         Neuroticism          55.30        10.82   -331.58     -57.15   -282.14
                         Extroversion         39.58        50.49   261.44       67.38    163.28
                          Openness            46.63        46.07     98.92      28.20     49.48
                        Agreeableness        42.521       -1.103   78.905     90.909    196.740
                       Conscientiousness      -1.708      -1.708   225.988    -67.633   135.353

                                Table 4: The regression coefficients for Run1