Indian Statistical Institute, Kolkata at PR-SOCO 2016: A Simple Linear Regression Based Approach

Kripabandhu Ghosh, Indian Statistical Institute, Kolkata, India (kripa.ghosh@gmail.com)
Swapan Kumar Parui, Indian Statistical Institute, Kolkata, India (swapan.parui@gmail.com)



ABSTRACT
We participated in the PR-SOCO task hosted at FIRE 2016 and tried some basic approaches which we look to improve in the future. We defined some simple features from the source code which, in our opinion, were indicative of the manner in which the code was written and which might give some clues about the personality of the programmer. We built a multiple linear regression model from the training data and applied this model to the test data. The results show that our method produces good prediction performance for Neuroticism, Extroversion and Openness.

CCS Concepts
• Computing methodologies → Supervised learning by regression; • Information systems → Information extraction;

Keywords
BIG5 personality; Source code; Linear regression

1. INTRODUCTION
Much work has been done on predicting user personality from text written in a natural language (e.g., Facebook status updates [2]). The task of predicting the age, gender and personality traits of Twitter users has also been attempted in the author profiling task [3], one of the tasks of PAN/CLEF 2015 [5]. However, the PR-SOCO 2016 task [4] presents a different and possibly more challenging problem. The main challenge lies in the fact that in this task the BIG5 personality traits [1] need to be predicted from source code, which is written within the strict lexical and syntactic bounds of a programming language. This is likely to limit the usual vocabulary which the programmer could have used in a natural language composition. So, we looked to employ simple means of judging the quality of the program code, hoping to gain insights into the personality of the programmer.

Firstly, we tried to evaluate the "readability" of the code by automatically detecting the tendency of the programmer to provide useful comments in the code. By useful comments we mean the ones which describe the functionality and purpose of different segments of the code. However, we considered the presence of commented-out lines of code in the source file to be undesirable. We also considered the judicious use of spaces in the code to be a good programming practice, which was also supposed to improve readability. For the readability aspect, three features (MLC, SLC and NES) are defined in the next section. Secondly, we tried to judge the efficiency of the code. Since we were not provided with the problem statement or the input data for which the source codes were written, we had no way to evaluate the algorithmic efficiency of the code. However, we noticed that a particular feature can be used to understand the efficiency of the code, to some extent. For the efficiency aspect, one feature (IS) is defined in the next section.

We believe that these four features can predict the personality of a person. For example, a person with prominent Neuroticism (https://en.wikipedia.org/wiki/Neuroticism, as seen on 26th October, 2016) exhibits low emotional stability and so is likely to be less methodical in writing code. Persons with high Extroversion (https://en.wikipedia.org/wiki/Extraversion_and_introversion, as seen on 26th October, 2016), on the other hand, are likely to express themselves and possibly provide meaningful comments in their code. We discuss these features in the following section. Next, we use these features for predicting the personality traits. We fit a multiple linear regression (https://en.wikipedia.org/wiki/Linear_regression, as seen on 26th October, 2016) for each BIG5 personality trait. That is, each BIG5 personality value, for a given user, is predicted from the features extracted from her program code. In the multiple linear regression framework, each of the BIG5 traits is the dependent variable and the four features are the explanatory variables.

The rest of the paper is arranged as follows: we describe the proposed methodology in Section 2, present the results in Section 3 and conclude in Section 4.

2. METHODOLOGY

2.1 Feature selection
We used four features (explanatory variables) for multiple linear regression. Here each of the BIG5 traits is the dependent variable. The feature values were extracted from the source code of each program file. The features are as follows (examples are shown in Table 1; a sketch of the extraction procedure is given after the list):
1. Multi-line comments (MLC): This is the number of genuine comment words in multi-line comments, i.e., between /* and */, found in the program code. In Table 1, we see a case of a genuine comment under Positive example. We have not considered the cases where lines of code were commented out, as shown under Negative example. To extract this feature from a source code file, we first read the lines between /* and */. Then we eliminated any instances of program code by searching for a regular expression that matches the symbols ; and = and function calls of the form [a-zA-Z][a-zA-Z]*[ ]*( (e.g., System.out.println("Even");) used in Java code. This feature value was normalized by dividing it by the total number of words in the program file.

2. Single-line comments (SLC): This is the number of genuine comment words in single-line comments, i.e., comments following "//" (as shown in Table 1, under Positive example). Here also, we have not considered the cases where lines of code were commented out (as shown in Table 1, under Negative example). To extract this feature value, we simply determined the number of words following "//" in the code. Then we eliminated the occurrences of program code by the procedure used for the feature MLC. This feature value was normalized by dividing it by the total number of words in the program file.

3. Non-existent spaces (NES): This is the number of lines containing non-existent spaces, e.g., i=1; i<=casos; as shown in Table 1 under Positive example, as opposed to i = 1; i <= casos; as shown in Table 1 under Negative example. We have considered this feature since the presence of spaces is supposed to be a good programming practice. This feature was extracted by identifying the lines of code satisfying the regular expression [a-z][a-z]* [a-z][a-z]*[=<>+] (e.g., int i=1). This feature value was normalized by dividing it by the total number of lines in the program file.

4. Import Specific (IS): This is the number of instances where the programmer imported only the specific libraries needed (e.g., import java.io.FileNotFoundException as opposed to import java.io.*). These examples are also shown in Table 1. We have considered this feature since it is supposed to be a good programming practice to use specific libraries, which reduces compilation time. In addition, the choice of specific libraries may indicate experience and proficiency in programming, because a good programmer is supposed to know which specific libraries will be useful. An inexperienced programmer, on the other hand, is more likely to "import" all the libraries to somehow get the job done. This feature was extracted by considering all the instances of "import" not ending with a "*". This feature value was also normalized by the total number of lines in the program file.
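The following is a minimal sketch of this extraction procedure, re-implemented in Python purely for illustration; it is not our actual tooling. The function name extract_features is ours, and the regular expressions only approximate the patterns described in the list above.

```python
import re

# Approximations of the patterns described above (illustrative only):
# fragments that look like code inside a comment, and lines missing spaces.
CODE_LIKE = re.compile(r'[;=]|[a-zA-Z][a-zA-Z]*\s*\(')
NO_SPACE = re.compile(r'[a-z][a-z]* [a-z][a-z]*[=<>+]')

def extract_features(source):
    """Return the normalized (MLC, SLC, NES, IS) values for one program file."""
    lines = source.splitlines()
    total_words = max(len(source.split()), 1)
    total_lines = max(len(lines), 1)

    # MLC: words inside /* ... */ blocks that do not look like commented-out code.
    mlc = sum(len(block.split())
              for block in re.findall(r'/\*(.*?)\*/', source, re.DOTALL)
              if not CODE_LIKE.search(block))

    slc = nes = imp = 0
    for line in lines:
        # SLC: words after "//", unless the comment looks like commented-out code.
        if '//' in line:
            comment = line.split('//', 1)[1]
            if not CODE_LIKE.search(comment):
                slc += len(comment.split())
        # NES: lines with non-existent spaces, e.g., "int i=1".
        if NO_SPACE.search(line):
            nes += 1
        # IS: import statements naming a specific class rather than ending in "*".
        stripped = line.strip()
        if stripped.startswith('import') and not stripped.rstrip(';').endswith('*'):
            imp += 1

    # MLC and SLC are normalized by word count, NES and IS by line count.
    return (mlc / total_words, slc / total_words,
            nes / total_lines, imp / total_lines)
```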
2.2 Multiple linear regression model
For each BIG5 trait, we define a multiple linear regression model (https://en.wikipedia.org/wiki/Linear_regression, as seen on 11th October, 2016) for a program code p, given as follows:

    score_BIG5(p) = α + β1 MLC(p) + β2 SLC(p) + β3 NES(p) + β4 IS(p)        (1)

We calculate the values of the parameters α and βi, i = 1, 2, 3, 4, from the training data using the linear regression implementation in R (https://www.r-bloggers.com/r-tutorial-series-multiple-linear-regression/). Here score_BIG5 is the dependent variable and MLC, SLC, NES and IS are the explanatory variables.
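As an illustration of this step (and not our actual R script), the same ordinary least-squares fit of equation (1) can be written in Python with scikit-learn; the random arrays below are placeholders standing in for the real feature matrix and trait scores.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Placeholder data standing in for the 49 training programmers:
# one row of (MLC, SLC, NES, IS) per programmer and one trait score per row.
X_train = rng.random((49, 4))
y_train = rng.uniform(20, 80, 49)

# Ordinary least squares fit of equation (1); one such model per BIG5 trait.
model = LinearRegression().fit(X_train, y_train)
alpha = model.intercept_        # α in equation (1)
b1, b2, b3, b4 = model.coef_    # β1, β2, β3, β4 in equation (1)

# Predicted trait scores for unseen programmers (placeholder test data).
X_test = rng.random((10, 4))
y_pred = model.predict(X_test)
```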
3. RESULTS
We submitted two runs, as follows:

1. Run1.txt: The values of the dependent variables were generated on the test data using the regression equation (1) learned from the training data.

2. Run2.txt: For this run, for each BIG5 trait, we calculated the values of the dependent variables given by the linear regression equation (1) on the training set. We then calculated the error between the predicted value and the actual value for each of the 49 training samples. We removed the three samples in the training set with the highest error values, trained the linear regression again on the reduced training set and recalculated the regression coefficients. Finally, the values of the dependent variables were calculated on the test data. The purpose of this run is to remove some outliers from the training set (a sketch of this procedure is given after the list).
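In code, the Run2 outlier removal amounts to one round of fitting, ranking the training samples by absolute residual, and refitting; a minimal sketch, continuing the placeholder names of the previous listing:

```python
# Rank training samples by absolute residual under the first fit and
# drop the three worst, as in Run2.
residuals = np.abs(model.predict(X_train) - y_train)
keep = np.argsort(residuals)[:-3]   # all but the 3 highest-error samples

# Refit on the reduced training set and predict on the test data.
model2 = LinearRegression().fit(X_train[keep], y_train[keep])
y_pred_run2 = model2.predict(X_test)
```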
The performances of these two runs are shown in Tables 2 and 3. Table 2 reports the results in terms of Root Mean Squared Error (RMSE). The table also reports two official baselines (bow and mean) and the best result reported among all the submitted runs (Reported best); these values are reported at http://www.autoritas.es/prsoco/evaluation/. In terms of RMSE, our run Run1.txt produced the best performance for Extroversion. This run also produced good performances for Neuroticism and Openness when compared with the baselines.

Table 3 reports the results in terms of the Pearson Product-Moment Correlation (PC). The table also reports the two official baselines (bow and mean) and the best results reported among all the submitted runs (Reported best). In terms of PC, our run Run1.txt produced the best performance for Neuroticism. This run also produced good performances for Extroversion and Openness when compared with the baselines.
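For reference, the two evaluation measures can be computed as follows (a minimal sketch; y_true and y_pred are placeholders for the gold and predicted trait scores):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error, as reported in Table 2 (lower is better)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def pearson(y_true, y_pred):
    """Pearson Product-Moment Correlation, as reported in Table 3 (higher is better)."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```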
Table 4 shows the regression coefficient values learned from the training data for each BIG5 trait, as used for Run1. Since our predictions for Neuroticism, Extroversion and Openness are promising, we try to draw some inferences from Table 4 for these traits, as follows.

Neuroticism: The negative value of high magnitude of β2 indicates that a person who frequently provides Single Line Comments (SLC) in her code is likely to exhibit a low level of Neuroticism. This agrees with our intuition that a Neurotic person is not organized in her coding. However, the positive value (though of relatively lower magnitude) of β1 indicates that a person who provides Multi Line Comments (MLC) is likely to have a high level of Neuroticism. Also, the negative value of β3 indicates that a person who does not provide necessary spaces in the code is likely to have a low level of Neuroticism. These two coefficient values somewhat contradict our intuition that a Neurotic person is necessarily chaotic in nature while writing code. But the negative value of high magnitude of β4 indicates that a person who tends to import libraries selectively is likely to have a low level of Neuroticism, which again agrees with our intuition.

Extroversion: The positive values of β1, β2 and β4 indicate that a person who tends to provide genuine comments (both Multi Line and Single Line) and to import specific libraries in her code is likely to have high Extroversion. But the positive value of β3 indicates that an Extrovert may not provide appropriate spaces in her code. The value of β2 is much higher than the other coefficients, which implies that a person with a tendency to provide genuine Single Line Comments is likely to have high Extroversion.

Openness: The observations about Openness are similar to those about Extroversion.

However, the prediction results show that our features are possibly not suitable indicators for Agreeableness and Conscientiousness.
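As a concrete illustration of equation (1), the Neuroticism coefficients of Table 4 can be applied to a program file as follows; the feature values below are hypothetical, chosen only to show the arithmetic:

```python
# Equation (1) with the Neuroticism coefficients from Table 4.
alpha, b1, b2, b3, b4 = 55.30, 10.82, -331.58, -57.15, -282.14
mlc, slc, nes, imp = 0.02, 0.05, 0.10, 0.01  # hypothetical normalized features

score = alpha + b1 * mlc + b2 * slc + b3 * nes + b4 * imp
print(round(score, 2))  # 55.30 + 0.22 - 16.58 - 5.72 - 2.82, i.e., about 30.4
```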

4. CONCLUSION
We see that these simple and intuitive features yield promising prediction results for Neuroticism, Extroversion and Openness, as inferred from samples of written program code. We gain some interesting insights into the relationship of these three traits with these features: for example, Neuroticism has a strong negative correlation with the tendency to write genuine Single Line Comments, while Extroversion has a strong positive correlation with the same tendency. However, these features are not adequate for predicting Agreeableness and Conscientiousness. We look to explore other features in the future.

5. REFERENCES
[1] P. Costa and R. McCrae. The Revised NEO
    Personality Inventory (NEO-PI-R). In The SAGE
    Handbook of Personality Theory and Assessment, pages
    179–198, 2008.
[2] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli,
    M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and
    M. De Cock. Computational personality recognition in
    social media. User Modeling and User-Adapted
    Interaction, pages 1–34, 2016.
[3] F. M. R. Pardo, F. Celli, P. Rosso, M. Potthast,
    B. Stein, and W. Daelemans. Overview of the 3rd
    author profiling task at PAN 2015. In Working Notes of
    CLEF 2015 - Conference and Labs of the Evaluation
    forum, Toulouse, France, September 8-11, 2015., 2015.
[4] F. Rangel, F. González, F. Restrepo, M. Montes, and
    P. Rosso. PAN at FIRE: Overview of the PR-SOCO track
    on personality recognition in source code. In Working
    Notes of FIRE 2016 - Forum for Information Retrieval
    Evaluation, Kolkata, India, December 7-10, 2016,
    CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[5] E. Stamatatos, M. Potthast, F. M. R. Pardo, P. Rosso,
    and B. Stein. Overview of the PAN/CLEF 2015
    evaluation lab. In Experimental IR Meets
    Multilinguality, Multimodality, and Interaction - 6th
    International Conference of the CLEF Association,
    CLEF 2015, Toulouse, France, September 8-11, 2015,
    Proceedings, pages 518–538, 2015.
    Feature   Positive example                                Negative example
    MLC       /**                                             /*System.out.println("Even");
                * Make the hash table logically empty.          printQ(qEven);
                */                                              System.out.println("Odd");
                                                                printQ(qOdd);*/
    SLC       // Create a new double-sized, empty table       //String[] ss = linea.readLine().split(" ");
    NES       for (int i=1; i<=casos; i++)                    for (int i = 1; i <= casos; i++)
    IS        import java.io.FileNotFoundException            import java.io.*

Table 1: The table shows positive examples (i.e., conforming to the feature requirement) and negative
examples (i.e., not conforming to the feature requirement) of features. For MLC and SLC, the positive
examples show cases of genuine comments while the negative examples show cases where lines of code are
commented out. For NES, the positive example shows a case where space is absent while the negative example
shows a case where spaces are present. For IS, the positive example shows a case where a specific library is
imported while in the negative example, all the libraries are imported.




     Method        NEUROTICISM     EXTROVERSION          OPENNESS        AGREEABLENESS            CONSCIENTIOUSNESS
    Run1.txt          10.22            8.60                 7.16              9.60                       9.99
    Run2.txt          10.04            10.17                7.36              9.55                       10.16
 Baseline (bow)       10.29            9.06                 7.74              9.00                       8.47
 Baseline (mean)      10.26            9.06                 7.57              9.04                       8.54
  Reported best        9.78            8.60                 6.95              8.79                       8.38

Table 2: Root Mean Squared Error (RMSE). The best result produced by our submitted runs when compared
to all the submitted runs is shown in bold.




     Method        NEUROTICISM     EXTROVERSION          OPENNESS        AGREEABLENESS            CONSCIENTIOUSNESS
    Run1.txt           0.36            0.35                 0.33              0.09                       -0.20
    Run2.txt           0.27            0.04                 0.27              0.11                       -0.13
 Baseline (bow)        0.06            0.12                -0.17              0.20                       0.17
 Baseline (mean)       0.00            0.00                 0.00              0.00                       0.00
  Reported best        0.36            0.47                 0.62              0.38                       0.33

Table 3: Pearson Product-Moment Correlation (PC). The best result produced by our submitted runs when
compared to all the submitted runs is shown in bold.




                            BIG5                 α          β1        β2         β3         β4
                             Trait         (Intercept)   (MLC)      (SLC)      (NES)       (IS)
                         Neuroticism          55.30        10.82   -331.58     -57.15   -282.14
                         Extroversion         39.58        50.49   261.44       67.38    163.28
                          Openness            46.63        46.07     98.92      28.20     49.48
                        Agreeableness        42.521       -1.103   78.905     90.909    196.740
                       Conscientiousness      -1.708      -1.708   225.988    -67.633   135.353

                                Table 4: The regression coefficients for Run1