<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Supervised Approach for Personality Recognition in Source Code using Code Analysis Tool at FIRE 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rehana Delair</string-name>
          <email>rehanad10@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rutal Mahajan</string-name>
          <email>rutal.mahajan@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SNPIT&amp;RC</institution>
          ,
          <addr-line>Bardoli, Gujarat, +91 9426393096</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SNPIT&amp;RC</institution>
          ,
          <addr-line>Bardoli, Gujarat, +91 9904039419</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Personality recognition from an author's source code is a task organized by the PR-SOCO team in conjunction with FIRE 2016, the Forum for Information Retrieval Evaluation. The aim is to identify an author's personality traits from a collection of the programmer's source code. We used several supervised learning approaches to train regression models on different sets of features extracted with the static code analysis tool Checkstyle. The trained regression models are then used to predict scores for the different personality traits. All systems are evaluated using two metrics: Root Mean Squared Error (RMSE) and Pearson Product-Moment Correlation (PC). Our system scored 0.62 and 0.33 PC for the traits Openness and Conscientiousness, respectively, using the M5Rules algorithm as the regression model; these are the best scores among all submitted runs of our system as well as among all participating systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Considerable work is under way in the area of personality
recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Personality traits influence most
human activities, such as the way people write [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], interact
with each other, and make decisions. A
programmer’s personality may affect the type of software project
they choose to participate in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or the way they write and structure
their code.
      </p>
      <p>
        Many projects use written text to identify an author’s
personality. In “Whose thumb is it anyway?” [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], personal weblogs
are analyzed to predict personality traits using the
Support Vector Machine algorithm; the
main features are word-based bigrams and trigrams. In “Finding
relationships between socio-technical aspects and personality
traits by mining developer e-mails” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors use
developers’ e-mails to identify their personality.
      </p>
      <p>Personality recognition from source code differs from
these projects because source code has a limited scope. The
programmer cannot choose words freely and
must follow a set of pre-defined rules. This makes identifying
personality from source code a difficult task.</p>
      <p>
        Personality can be described along five traits using the Big Five
theory [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is the most widely accepted model in psychology. The
five traits are extroversion (E), emotional stability / neuroticism
(S), agreeableness (A), conscientiousness (C), and openness to
experience (O).
      </p>
      <p>
        To collect features from the given source code, we use
Checkstyle [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a code analysis tool that performs
different checks on the source code. We use the Weka [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] tool
to train the regression models.
      </p>
      <p>The rest of this paper is structured as follows: Section 2 outlines
our approach to personality recognition in source code and the features used.
Section 3 describes the training and test
data. Section 4 describes the experiments and Section 5 presents the
official results of this task. Finally, we conclude in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Overview</title>
      <p>The main process of personality recognition consists of the following
six steps, which are shown in Figure 1.</p>
      <sec id="sec-3-1">
        <title>Collect individual corpora</title>
        <p>
          In this step, we collect the training data. In this case,
the training data are the source code of different programmers,
provided by the PR-SOCO committee [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Collect associated personality ratings for each participant</title>
        <p>
          In this step, we collect the personality ratings for
each programmer. We use the Big Five personality
traits [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to describe the personality of an individual.
These data are also provided by the PR-SOCO committee [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Pre-processing</title>
        <p>
          In this step, the given data files are converted into a
format suitable for Checkstyle [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We remove the separating
lines from the source code and write the data to
actual Java files. We have also implemented a function
to isolate each single program from the given training
files of source code.
        </p>
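        <p>The isolation step can be sketched roughly as follows in Python. This is a minimal illustration and not the function used in our system: the name split_programs, the is_separator predicate and the "#####" separator in the usage lines are all assumptions, because the separator format depends on the provided files.</p>
        <preformat>
```python
def split_programs(document_text, is_separator):
    """Split one author's document into individual programs.

    `is_separator` is a caller-supplied predicate (hypothetical) that
    returns True for lines that separate programs; those lines are dropped.
    """
    programs, current = [], []
    for line in document_text.splitlines():
        if is_separator(line):
            # A separator closes the program collected so far, if any.
            if current:
                programs.append("\n".join(current))
                current = []
        else:
            current.append(line)
    # Keep the last program when the document does not end with a separator.
    if current:
        programs.append("\n".join(current))
    return programs


# Hypothetical document with "#####" as the separating line.
doc = "class A {}\n#####\nclass B {\n}\n#####\n"
parts = split_programs(doc, lambda line: line.startswith("#####"))
print(len(parts))
```
        </preformat>
        <p>Each returned string could then be written to its own .java file before Checkstyle is run on it.</p>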
      </sec>
      <sec id="sec-3-4">
        <title>Extract relevant features from the texts</title>
        <p>
          In this step, the main features are extracted from the given
source code. We look for features of good
source code that may reflect the author’s personality. For this
purpose we use the code analysis tool Checkstyle
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. It performs different checks on the source code, such
as how well the code is commented, how it is indented and whether
naming conventions are followed. From these checks we collect
measures for the different features.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Build statistical models of the personality ratings based on the features</title>
        <p>
          We used several regression models to predict the
personality traits: Support Vector Machine
regression, Gaussian Processes, the M5P algorithm, M5Rules
and Random Tree. We used the Java API of
Weka [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] to train the regression models.
        </p>
      </sec>
      <sec id="sec-3-6">
        <title>Test the learned models on unseen individuals</title>
        <p>Using the extracted features and the trained regression models,
we predict the scores for the different personality traits of unseen programmers.</p>
        <p>[Figure 1: Main process of personality recognition. Diagram labels: Training Data, Preprocessing, Individual Programs, checkstyle, Errors and Warnings, Collect Features (Feature Extraction), Features, Build Regression Model, Regression Model, Test Data, Output.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Features</title>
      <p>
        We used a total of 154 source code features, which are extracted
using the static code analysis tool Checkstyle [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To train the regression models, these features are
grouped into two categories: style
based features and content based features. These are shown in
Table 1.
      </p>
      <sec id="sec-4-1">
        <title>Style based Features</title>
        <p>This category covers features related to the style
of the code. The corresponding checks target
code layout and formatting problems. The category includes
indentation, headers, Javadoc comments, white space,
block checks, etc.</p>
        <p>
          Content based Features:
this category covers features related to the
content of the source code. The corresponding checks target class
design problems, method design problems, annotations,
coding, imports, metrics, modifiers, naming
conventions, size violations and other miscellaneous
features.
        </p>
        <p>
          Each program is separated from the author’s collection of source code
and checked using Checkstyle [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The reported errors and warnings are
counted and converted to a per-line-of-code rate.
        </p>
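        <p>The per-line conversion can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the code behind our runs: the function name, the input format (one check name per reported error or warning) and the decision to count every physical line as a line of code are all hypothetical.</p>
        <preformat>
```python
from collections import Counter


def violations_per_line(violations, source_text):
    """Return {check_name: count / lines_of_code} for one program.

    `violations` holds one check name per reported error or warning
    (hypothetical input format). Every physical line of `source_text`
    counts as a line of code (an assumption).
    """
    lines_of_code = len(source_text.splitlines())
    if lines_of_code == 0:
        return {}
    counts = Counter(violations)
    return {check: n / lines_of_code for check, n in counts.items()}


# Hypothetical four-line program with three reported violations.
program = "class A {\n int x;\n  void f() {}\n}\n"
reported = ["Indentation", "Indentation", "JavadocMethod"]
features = violations_per_line(reported, program)
print(features)
```
        </preformat>
        <p>Dividing by the program length keeps long and short programs comparable, since a longer program would otherwise tend to accumulate more violations.</p>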
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. Data set</title>
      <p>
        The training data set was provided by the PR-SOCO committee
and consists of source code written in Java. It comprises
49 documents, each containing a collection of source code by a
different author. Each document is labeled with the personality
trait scores of the programmer, on a continuous scale from 20 to 80.
The test data were also provided by the PR-SOCO committee [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
consist of 21 documents, each a source code collection. We
used these data to evaluate our system.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4. Experiments</title>
      <p>We have a collection of source code written by 49 different
programmers, along with their personality trait scores. We used this
data to train our models and then tested them on 21 unseen source
code collections. Two metrics were used to evaluate the system: the average
Root Mean Squared Error (RMSE) and the Pearson
Product-Moment Correlation (PC) between our predicted scores
and the ground-truth scores. The results on the
test data are discussed in the next section.</p>
      <p>RMSE is the square root of the mean of the squared errors,
where each error is the difference between a predicted and a ground-truth score. PC measures the strength of the
linear association between two variables.</p>
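      <p>For concreteness, both metrics can be computed as in the following Python sketch. The function names and the example scores are hypothetical and serve only to illustrate the definitions; this is not the official evaluation script of the task.</p>
      <preformat>
```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences.

    Raises ZeroDivisionError when either sequence is constant, because the
    correlation is undefined in that case.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted and ground-truth trait scores.
predicted = [50.0, 55.0, 60.0]
actual = [48.0, 57.0, 63.0]
print(rmse(predicted, actual))
print(pearson(predicted, actual))
```
      </preformat>
      <p>A lower RMSE is better, while a PC close to 1 indicates that the predictions rise and fall together with the ground-truth scores even if their absolute values differ.</p>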
      <p>We used several supervised regression models to predict the
personality traits of the different authors: Support Vector
Machine, Gaussian Processes, the M5P algorithm, M5Rules and the
Random Tree algorithm. A Support Vector Machine represents each data
item as a point in n-dimensional space, where n is the number of features. We used the default
kernel settings for the Support Vector Machine. M5P is a
decision-tree-based algorithm and M5Rules is a rule-based algorithm.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Results</title>
      <p>We submitted a total of five runs. Each run uses a different
regression algorithm: Support Vector Machine,
Gaussian Processes, M5P, M5Rules and Random Tree.</p>
      <p>
        The results obtained for the different runs of our system are shown in
Table 2. Two metrics are shown for each personality trait, in the form
RMSE/PC: Root Mean Squared Error / Pearson
Product-Moment Correlation. The bottom of Table 2 also shows the
measures for two baselines: (a) a bag of character 3-grams with
frequency weighting; (b) an approach that always predicts the mean
value observed in the training data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our system
scored 0.62 and 0.33 PC for the traits Openness and
Conscientiousness, respectively, using the M5Rules algorithm as the
regression model; these are the best scores among all submitted
runs of our system as well as among all participating systems.
For the Neuroticism trait, our predicted scores are
positively correlated with the ground-truth scores, but Gaussian Processes and SMO give
an RMSE close to the worst. For the Extroversion
trait, the regression models give differing scores that are only
weakly correlated with the ground-truth scores. For the Openness
trait, the predicted scores are strongly correlated with the ground-truth scores
and the results are good. For Agreeableness, the predicted scores are negatively correlated with the
ground-truth scores and also give the worst RMSE. For
Conscientiousness, the predicted scores are positively correlated with the
ground-truth scores.
      </p>
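      <p>The mean-value baseline mentioned above is simple enough to sketch in a few lines of Python. The function names and the scores are hypothetical; the sketch only illustrates the idea of the baseline and does not reproduce the organizers' implementation.</p>
      <preformat>
```python
import math


def mean_baseline(train_scores, n_test):
    """Predict the mean of the training scores for every test author."""
    mean = sum(train_scores) / len(train_scores)
    return [mean] * n_test


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical trait scores on the 20 to 80 scale.
train = [40.0, 50.0, 60.0]
test = [45.0, 55.0]
preds = mean_baseline(train, len(test))
error = rmse(preds, test)
print(preds, error)
```
      </preformat>
      <p>Because this baseline predicts a constant, its predictions have zero variance, so only its RMSE is informative; a learned model is useful to the extent that it improves on this RMSE or achieves a clearly positive PC.</p>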
      <p>[Table 2: RMSE/PC per personality trait (N, E, O, A, C) for the submitted runs (M5Rules, GP, M5P, Random Tree, SMO), the best and worst results, and the two baselines (bow, mean).]</p>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusion</title>
      <p>Various supervised learning algorithms proved capable
of predicting personality trait scores for different authors from
their source code. At present, our system does not
analyze the effect of the individual extracted features on each
personality trait. Such refinement may yield better prediction
results than the currently submitted runs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Celli</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lepri</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biel</surname>
            <given-names>J. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatica-Perez</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riccardi</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pianesi</surname>
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The workshop on computational personality recognition 2014</article-title>
          .
          <source>Proc. of the ACM Int. Conf. on Multimedia</source>
          , pp.
          <fpage>1245</fpage>
          -
          <lpage>1246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>Checkstyle project</article-title>
          , http://checkstyle.sourceforge.net/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Costa</surname>
            <given-names>P.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCrae</surname>
            <given-names>R.R.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>The revised neo personality inventory (neo-pi-r)</article-title>
          .
          <source>The SAGE handbook of personality theory and assessment 2</source>
          ,
          <fpage>179</fpage>
          -
          <lpage>198</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hall</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            <given-names>I.H.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Oberlander</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nowson</surname>
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Whose thumb is it anyway? Classifying author personality from weblog text</article-title>
          .
          <source>Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions</source>
          , pages
          <fpage>627</fpage>
          -
          <lpage>634</lpage>
          , Sydney, July 2006.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Paruma-Pabón</surname>
            <given-names>O.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>González</surname>
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aponte</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Camargo</surname>
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Restrepo-Calle</surname>
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Finding relationships between socio-technical aspects and personality traits by mining developer e-mails</article-title>
          .
          <source>Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), ICSE</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Rangel</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celli</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Overview of the 3rd Author Profiling Task at PAN 2015</article-title>
          .
          <source>CLEF 2015 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CEUR-WS.org</source>
          , vol.
          <volume>1391</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Rangel</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>González</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Restrepo</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>PAN at FIRE: Overview of the PR-SOCO Track on Personality Recognition in SOurce COde</article-title>
          .
          <source>Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 7-10, 2016. CEUR Workshop Proceedings, CEUR-WS.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>