<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting an author's demographics from text using Topic Modeling approach</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>COMSATS Institute of Information Technology</institution>
          ,
          <addr-line>Lahore</addr-line>
        </aff>
        <contrib contrib-type="author">
          <string-name>Hafiz Rizwan Iqbal</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muhammad Adnan Ashraf</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rao Muhammad Adeel Nawab</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>This paper presents an approach to predicting an author's demographics and personality traits for the author profiling task of PAN at CLEF 2015. The task aimed at predicting authors' demographics from their tweets. These demographics included the traditional authorship attributes of age and gender as well as several personality traits. We applied topic modeling using LDA as the baseline approach and used the generated topics to obtain hierarchical probabilities of the topics. A J48 decision tree was used to train the classification models. The trained models were then used to predict the demographics for the training and test datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Identifying demographic traits such as age, gender, native language and other
personality aspects from an author's writing style is termed author profiling [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Owing to its applications in computer forensics, marketing and online content
recommendation, it has become an active research area in Natural Language
Processing.
      </p>
      <p>
        Twitter has recently been the subject of quantitative studies on a number of aspects
and characteristics. The primary interest of researchers has been to process user
tweets to interpret users’ interests and to correlate social and global events [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
whereas this research focuses on predicting author profiling attributes from a
Twitter dataset.
      </p>
      <p>PAN 15 is a competition held as part of the CLEF conference. It comprises
three tasks: Plagiarism Detection, Author Verification and Author Profiling. Each
task required participants to develop software and to submit it on TIRA, an
evaluation engine.</p>
      <p>The PAN 15 author profiling task was designed to evaluate seven demographic
attributes of an author from his/her tweets: age, gender and five personality
traits (extroverted, stable, agreeable, conscientious and open). The training corpus
was provided by PAN in four languages: English, Spanish, Italian and Dutch. The
goal was to achieve the highest ranking score, which combined the accuracy of
identifying the author's age and gender with the average Root Mean Squared Error
(RMSE) for the personality traits.</p>
      <p>
        To predict a given author’s attributes, we generated LDA-based topic models using
MALLET and used the J48 decision tree in WEKA for training and evaluation of our model.
LDA identifies latent topics in a collection of documents: each document is
represented as a probability distribution over topics, and each topic as a
probability distribution over words [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Topic
modeling using standard LDA has gained attention recently, and work has been conducted on
community detection using LDA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and on author profiling. Topic modeling using
LDA has also provided encouraging results on microblogs and their applications
[12]. MALLET [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a widely used topic modeling and inference toolkit, uses LDA to build
topic models for a given text.
      </p>
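      <p>LDA treats each document as a mixture of topics and each topic as a distribution over words, so the probability of a word in a document is the topic-weighted sum of its per-topic probabilities. The following minimal Python sketch illustrates this. The two topics, the four-word vocabulary and all probability values are hypothetical and chosen only for illustration; they are not taken from the PAN 15 data or from a trained MALLET model.</p>
      <preformat><![CDATA[
```python
# Hypothetical toy model: 2 topics over a 4-word vocabulary.
# theta[k] is P(topic k | document); phi[k][w] is P(word w | topic k).
theta = [0.7, 0.3]
phi = [
    {"fashion": 0.5, "cooking": 0.3, "cricket": 0.1, "politics": 0.1},
    {"fashion": 0.1, "cooking": 0.1, "cricket": 0.4, "politics": 0.4},
]


def word_probability(word, theta, phi):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(t * topic[word] for t, topic in zip(theta, phi))


# Word distribution implied by the document's topic mixture.
word_probs = {w: word_probability(w, theta, phi) for w in phi[0]}
for word, p in word_probs.items():
    print(word, round(p, 2))
```
]]></preformat>
      <p>In the approach described in this paper, it is the document-level topic proportions (theta in the sketch) that serve as features for classification.</p>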
      <p>
        This paper focuses on the English tweets of the PAN 15 dataset for both the
training and testing phases [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The methodology is explained in
Section 2, while the results of the training and testing phases are discussed in Section 3
and Section 4, respectively. Section 5 provides the conclusion and future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <p>
        We used topic modeling [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as the baseline approach to predict an author’s profile on
the basis of his/her tweets. The motivation for using topic modeling is that
different categories of people have been observed to have different topics of interest [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; for example,
women tend to talk more about fashion, dresses and cooking, whereas men tend to
discuss politics, cricket and technology. This observation motivates predicting
a person’s age, gender and other personality traits on the basis of his/her written text.
Our proposed approach has three stages: (1) dataset pre-processing, (2)
fabrication of topic and classification models and (3) prediction of author traits.
      </p>
      <sec id="sec-2-1">
        <title>Pre-processing</title>
        <p>The English-language training dataset provided by PAN 15 was selected for the
author profiling task. The training dataset consisted of the tweets of 152 users. Each
user’s data was stored in a separate XML file. The class labels for all XML files
were stored in a single text file.</p>
        <p>
          During the pre-processing phase, only the tweets were extracted from each XML file
and stored in a separate text file for each user. No further pre-processing, such as
stop word removal, stemming, punctuation removal or lemmatization, was performed
on the dataset, because the topic model disregards it and in order to retain the author’s
original style-based features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
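        <p>The extraction step can be sketched in Python as follows. This is a minimal illustration and not the submitted Java software. It assumes that each user's XML file has a root element whose child elements named document each hold one tweet; the element names, the sample content and the variable names are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET


def extract_tweets(xml_text):
    """Return the text of every <document> element (assumed: one tweet each)."""
    root = ET.fromstring(xml_text)
    return [(doc.text or "").strip() for doc in root.iter("document")]


# Hypothetical content of one user's XML file.
sample = """<author id="user1" lang="en">
  <document>First tweet, kept as-is!</document>
  <document>Second tweet #unchanged</document>
</author>"""

tweets = extract_tweets(sample)

# No stop word removal, stemming or punctuation stripping is applied;
# the tweets are joined verbatim, one per line, as the user's text file.
user_text = "\n".join(tweets)
print(user_text)
```
]]></preformat>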
      </sec>
      <sec id="sec-2-2">
        <title>Fabrication: Topic and Classification Models</title>
        <p>The provided dataset was annotated with three kinds of demographic traits: gender,
age and personality traits. Age and gender were evaluated by classification accuracy,
whereas the five personality traits were evaluated by root mean squared error.</p>
        <p>
          A directory structure was created with subdirectories for the two demographics (age
and gender) and the five personality traits (extroverted, stable, agreeable, conscientious
and open). Table 1 lists the classification details of the dataset provided in PAN
15. The text files extracted in the pre-processing stage were placed in the subdirectory
corresponding to their class. The dataset contained equal numbers of profiles for
male and female authors. An analysis of the dataset showed that the majority of
the authors belonged to the first two age groups (i.e. 18-24 and 25-30),
whereas the age groups 34-50 and 50+ contained relatively few profiles. Each personality
trait was further divided into classes according to the provided trait score, which
ranges between -0.5 and 0.5 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Each subdirectory was imported into MALLET, and the topic modeling routine was run
with a setting of 20 topics for each subdirectory and inference file. For each trait
directory, this routine produced, in sequence, the list of extracted topics, the topic
composition file (which contains, for each document, the proportion of each
topic), the trained topic model and the topic
inference file [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          An ARFF (Attribute-Relation File Format) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] file was created from the topic
composition file. Each topic was treated as one attribute, with its probability as the
value of that attribute. A class attribute was added to each ARFF file for the
corresponding trait. Each ARFF file was loaded into WEKA, and the J48 decision tree
classifier [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] was applied to construct the classification model for the
respective trait.
        </p>
        <p>
          To predict the traits for files in the test dataset, the first two stages of the
proposed approach, with a small variation in stage 2, were applied to each test file to
obtain the topic list, the topic composition file and finally the ARFF file. The trained
classification model was then applied to the test file to predict each trait value.
The predicted results were then written to an XML file as per the task requirement.
        </p>
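        <p>The conversion from a topic composition file to an ARFF file can be sketched in Python as follows. This is an illustrative sketch and not the submitted software. It assumes a simplified topic composition layout with one document per line (index, file name, then one proportion per topic); the actual MALLET output layout depends on the MALLET version. The function names, file names, class labels and values are hypothetical, and three topics are used instead of 20 for brevity.</p>
        <preformat><![CDATA[
```python
def parse_composition(text):
    """Map each document name to its list of topic proportions.

    Assumed layout (one document per line): index, name, then one
    proportion per topic. Lines starting with '#' are treated as headers.
    """
    rows = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        name = fields[1]
        rows[name] = [float(x) for x in fields[2:]]
    return rows


def to_arff(relation, composition, labels, class_values):
    """Build ARFF text with one numeric attribute per topic plus a class."""
    n_topics = len(next(iter(composition.values())))
    lines = ["@RELATION " + relation, ""]
    for k in range(n_topics):
        lines.append("@ATTRIBUTE topic%d NUMERIC" % k)
    lines.append("@ATTRIBUTE class {" + ",".join(class_values) + "}")
    lines += ["", "@DATA"]
    for name, props in composition.items():
        values = ",".join("%.4f" % p for p in props)
        lines.append(values + "," + labels[name])
    return "\n".join(lines)


# Hypothetical topic composition for two users and three topics.
composition_text = """#doc name proportions
0 user1.txt 0.6 0.3 0.1
1 user2.txt 0.1 0.2 0.7
"""
labels = {"user1.txt": "male", "user2.txt": "female"}

composition = parse_composition(composition_text)
arff = to_arff("gender", composition, labels, ["male", "female"])
print(arff)
```
]]></preformat>
        <p>A file of this form can be loaded into WEKA and used with the J48 classifier; for test files, where the class is unknown, ARFF allows a question mark as the missing class value.</p>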
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results for Training Phase</title>
      <p>The final submission consisted of Java-based software that takes an
input directory of XML files and an output directory in which to place the resulting
XML files. The submitted software was first run on the training dataset. Table 2 shows the
results obtained on the PAN 15 training dataset with accuracy as the evaluation measure
for the age and gender attributes, whereas the results for the personality traits, based on
Root Mean Square Error (RMSE), are presented in Table 3. The results show that our
software achieved an accuracy of 54% for age and 81.5% for
gender, whereas both age and gender were predicted correctly for 44.7%
of the users. The results on personality traits are also encouraging.</p>
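      <p>For reference, the evaluation measures used in this section can be computed as in the following Python sketch: accuracy for age and gender, joint accuracy for users with both attributes correct, and RMSE for a personality trait. The labels and scores are hypothetical toy values and are not results from the PAN 15 dataset.</p>
      <preformat><![CDATA[
```python
import math


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def rmse(gold, pred):
    """Root mean squared error between gold and predicted scores."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))


# Hypothetical labels for four users.
gold_age = ["18-24", "25-30", "18-24", "50+"]
pred_age = ["18-24", "18-24", "18-24", "50+"]
gold_gender = ["M", "F", "F", "M"]
pred_gender = ["M", "F", "M", "M"]

age_acc = accuracy(gold_age, pred_age)
gender_acc = accuracy(gold_gender, pred_gender)

# Joint accuracy: a user counts only if both age and gender are correct.
joint_acc = accuracy(list(zip(gold_age, gold_gender)),
                     list(zip(pred_age, pred_gender)))

# Hypothetical trait scores in the range -0.5 to 0.5.
trait_rmse = rmse([0.1, -0.2], [0.3, -0.2])

print(age_acc, gender_acc, joint_acc, round(trait_rmse, 4))
```
]]></preformat>
      <p>Because a user must have both attributes correct to count towards the joint measure, joint accuracy can never exceed the lower of the two individual accuracies.</p>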
      <p>The trained models were then run on the English test dataset 2 provided by PAN 15.
The evaluated test results are presented in Table 4 and Table 5. The test results
for age and gender differed from the results on the training dataset. Age was
predicted more accurately on the test dataset (69.7%) than on the training dataset (54%), but
gender prediction was poorer (55.6%) than on the training dataset
(81.5%). The results for the personality traits on the test dataset were also
encouraging with respect to the training dataset.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>Author profiling requires an efficient and effective system for analyzing data for
security and commercial purposes. In our approach, we developed Java-based software
that applied LDA for topic modeling and the J48 classification algorithm to predict writers’
demographics from the Twitter dataset provided by PAN 15. The results obtained are
encouraging, especially the accuracy measures.</p>
      <p>Future efforts can focus on applying variations of the topic modeling
algorithm, such as hierarchical LDA, and on applying other supervised classification
models to predict the demographic traits more accurately. The code will also be
optimized to minimize the total runtime of the software.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Liangjie</given-names>
            <surname>Hong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Brian D.</given-names>
            <surname>Davison</surname>
          </string-name>
          .:
          <source>Empirical Study of Topic Modeling in Twitter, 1st Workshop on Social Media Analytics (SOMA '10)</source>
          , Washington, DC, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Suraj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Prasha</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Thamar</surname>
          </string-name>
          .
          <article-title>: A Simple Approach to Author Profiling in MapReduce, Notebook for PAN, CLEF (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Blei</surname>
          </string-name>
          , David M.;
          <string-name>
            <surname>Ng</surname>
          </string-name>
          , Andrew Y.:
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>Michael I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>In Lafferty, John. Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <issue>4</issue>
          -5): pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pavan</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mogadala</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Author profiling using LDA and Maximum Entropy, Notebook for PAN at CLEF (</article-title>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Niculescu-Mizil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An Empirical Comparison of Supervised Learning Algorithms</article-title>
          .
          <source>In Proceedings of the International Conference on Machine Learning</source>
          . Pittsburgh, Pennsylvania, pp.
          <fpage>161</fpage>
          -
          <lpage>168</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd author profiling task at PAN 2015</article-title>
          . In: Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gareth</surname>
            <given-names>J.</given-names>
          </string-name>
          and San Juan E. (Eds.)
          <article-title>CLEF 2015 Labs and Workshops, Notebook Papers</article-title>
          .
          <article-title>CEUR-WS</article-title>
          .org, (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>K</given-names>
            <surname>Santosh</surname>
          </string-name>
          , Romil Bansal, Mihir Shekhar, and Vasudeva Varma.: Author Profiling:
          <article-title>Predicting Age and Gender from Blogs, Notebook for PAN at CLEF (</article-title>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Liebling</surname>
          </string-name>
          .:
          <article-title>Characterizing microblogs with topic models</article-title>
          .
          <source>In International AAAI Conference on Weblogs and Social Media</source>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>Andrew</given-names>
          </string-name>
          <string-name>
            <surname>Kachites</surname>
          </string-name>
          .:
          <article-title>MALLET: A Machine Learning for Language Toolkit</article-title>
          . http://mallet.cs.umass.edu (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Reutemann</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ian</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Witten</surname>
          </string-name>
          .:
          <source>The WEKA Data Mining Software: An Update; SIGKDD Explorations</source>
          , Volume
          <volume>11</volume>
          , Issue 1. (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Foley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Yen</surname>
          </string-name>
          .
          <article-title>Probabilistic community discovery using hierarchical latent Gaussian mixture model</article-title>
          .
          <source>In AAAI'07: Proceedings of the 22nd National Conference on Artificial Intelligence</source>
          , pages
          <fpage>663</fpage>
          -
          <lpage>668</lpage>
          , (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>