<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gender, Age, and Dialect Recognition using Tweets in a Deep Learning Framework - Notebook for FIRE 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chanchal Suman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Purushottam Kumar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sriparna Saha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pushpak Bhattacharyya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science &amp; Engineering, Indian Institute of Technology Patna</institution>
          ,
          <addr-line>Patna</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science &amp; Engineering, National Institute of Technology Durgapur</institution>
          ,
          <addr-line>Durgapur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social media sites are a rich source of user-generated text that can be used to identify different aspects of its authors. Age, gender, dialect, and region are aspects of an author that can be identified by properly mining this content. Such profiling provides a way of characterizing anonymous users, and recognizing the profile of an anonymous user helps in indirectly recognizing the user's identity. In this notebook, we describe our author profiling software submitted to FIRE 2019, which recognizes the gender, age, and dialect of Arabic-language Twitter users. We use a long short-term memory (LSTM) neural network together with some hand-crafted features to recognize the age, gender, and dialect of the author of a tweet.</p>
      </abstract>
      <kwd-group>
        <kwd>Dialects</kwd>
        <kwd>LSTM</kwd>
        <kwd>Aravec</kwd>
        <kwd>emoji</kwd>
        <kwd>emoticons</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Author profiling is the task of identifying different aspects of an author
from a given text. It distinguishes among classes of authors by
studying their writing style and the words used in their text. It also shows how
a behavioral viewpoint can be used to recognize these classes, how
different writing skills are used, and how language is shared
among different authors. Textual information based on different
features and styles helps identify the author's profile in terms of aspects
such as gender, age, and dialect [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The focus of this task is to identify the gender,
age, and dialect variety of Arabic Twitter users. Such information about an
anonymous text can be used to detect criminals in cyber-forensics. During
an investigation, it is often very hard to identify the real culprit. This type
of analysis helps the authorities draw conclusions about the traits of the culprit and
helps in tracking down possible suspects. Gender is an important aspect of a user;
if it is detected correctly, it is very helpful in narrowing down the possible
suspects. The same holds for age and language variety. These applications
motivated us to do research in this area. Many researchers have studied
author profiling in text over the years. The text used in these studies is taken
from different sources, for example blogs, hotel reviews, and tweets. In traditional
methods for the author profiling task, researchers mainly use features such as words,
word classes, and part-of-speech (POS) n-grams to train their models. In our setting,
all the tweets are text and carry some stylistic features. We therefore use an LSTM
model with some hand-crafted features to recognize the age, gender, and dialect of
the author of a tweet. We performed experiments on the LSTM model with and
without hand-crafted features. The LSTM-based system with hand-crafted features
achieved higher accuracy than the plain LSTM-based system. This suggests that
style-based features play a crucial role in recognizing author profiles.
Copyright ©2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15
December 2019, Kolkata, India.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Author profiling has attracted many researchers and has been the subject of
several competitions. Researchers have studied the dependence between linguistic
features and the profile of the author. This dependency is of interest to areas such
as linguistics, psychology, and natural language processing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Syntactic, lexical, and structural features have been used to recognize
the gender and age group of authors, with a decision tree identifying the author
profile. This methodology is also helpful for other applications such as security,
criminal detection, and author detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Other researchers focused on improving the representation
of documents, here tweets. They computed high-quality discriminative and descriptive
features built on top of textual features (e.g., content words, function words,
and punctuation marks) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Some researchers used typed
character n-grams, lexical features, and non-textual features (domain names) for the
author profiling task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Researchers have also tried deep learning models to learn
the gender of blog authors directly, using a Windowed Recurrent Convolutional
Neural Network [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Work on language use and author profiling draws on computational
linguistics approaches, author profiling tasks, and neurology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset Description</title>
      <p>
        The dataset was provided by the organizers of the APDA track at FIRE 2019
(https://www.autoritas.net/APDA/). It consists of 5 sets of Arabic tweet data.
Each set has 100 tweets of 450
users [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since we use a deep learning-based model, we needed a large set of tweets.
      </p>
      <sec id="sec-3-1">
        <p>We considered each tweet as a sample, which gives 45000 tweet samples
divided into two classes for gender. Tweets of male users were
labeled male, and tweets of female users were labeled female. In the same way,
45000 samples were created for age and for dialect. After obtaining the
tweet-level results, the final decision is
taken by majority voting. For example, suppose a document contains 450 tweets of one
user. Our method produces a result for each of the 450 tweets. If
the result is male for 200 of the tweets and female for the remaining 250, then the
final result for that document is female. In this way, we built the data
samples and derived the final results.</p>
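        <p>The majority-voting step described above can be sketched as follows. This is a minimal illustration rather than our submitted code: the function name <monospace>majority_vote</monospace> and the label strings are assumptions introduced for the example.</p>
        <preformat>
```python
from collections import Counter


def majority_vote(tweet_labels):
    """Return the most frequent tweet-level label for one user (hypothetical helper)."""
    if not tweet_labels:
        raise ValueError("need at least one tweet-level prediction")
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins.
    return Counter(tweet_labels).most_common(1)[0][0]


# Mirrors the example in the text: 200 tweets predicted male, 250 predicted female.
predictions = ["male"] * 200 + ["female"] * 250
user_label = majority_vote(predictions)
print(user_label)
```
        </preformat>
        <p>The same function applies unchanged to the age and dialect labels, since it only counts label occurrences.</p>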
        <p>These tweets are in Arabic and contain emoticons, emojis, #hashtags,
@mentions, and URLs. Since the provided training corpus consists of tweets
in HTML format, we first extract all tweets from the HTML into
plain text and then apply preprocessing steps to clean them. The
preprocessing stage is useful because it reduces non-textual features to their semantic
classes. We applied the following preprocessing steps before feature extraction.</p>
        <p>URLs: URLs are deleted.
@mentions: @mentions are deleted.
Emoticons: Emoticons provide useful style-based information about
the view of a specific user. We capture only their presence: 1 if an emoticon is
present, otherwise 0.
Furthermore, we apply the following normalization.
Punctuation marks: Punctuation marks are split from adjacent words and
their presence is captured separately.
Stopwords and punctuation are also removed from the tweets.</p>
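        <p>A minimal sketch of these preprocessing steps is given below. It is an illustration under stated assumptions, not the submitted implementation: the regular expressions, the small emoticon pattern, the Arabic punctuation marks added to <monospace>PUNCT</monospace>, and the <monospace>stopwords</monospace> argument are all hypothetical choices, since the exact lists we used are not reproduced here.</p>
        <preformat>
```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Illustrative western-style emoticons only (assumption).
EMOTICON_RE = re.compile(r"[:;=][-^']?[)(\]\[DPp/]")
# ASCII punctuation plus a few Arabic marks (assumption).
PUNCT = string.punctuation + "،؛؟"


def preprocess(tweet, stopwords=frozenset()):
    """Return (tokens, has_emoticon) for one tweet."""
    text = URL_RE.sub(" ", tweet)        # delete URLs
    text = MENTION_RE.sub(" ", text)     # delete @mentions
    has_emoticon = 1 if EMOTICON_RE.search(text) else 0
    text = EMOTICON_RE.sub(" ", text)
    # Split punctuation from adjacent words by replacing it with spaces.
    text = "".join(" " if ch in PUNCT else ch for ch in text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return tokens, has_emoticon


tokens, flag = preprocess("@user hello, world :) https://t.co/abc", stopwords={"hello"})
print(tokens, flag)
```
        </preformat>
        <p>URLs are removed before emoticon detection so that the ":/" inside "https://" is not mistaken for an emoticon.</p>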
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Framework</title>
      <p>In this section, we discuss the proposed architecture, the performance
measures, and the results.</p>
      <sec id="sec-4-1">
        <title>Proposed Architecture</title>
        <p>We applied a deep learning model to the author profiling task. We used a
long short-term memory (LSTM) network and treated the tweets as a classification
problem.</p>
        <p>Model-I Long short-term memory networks (LSTMs) are designed to learn the
long-term dependencies of text-based problems. Remembering information for
long periods is their default behavior
(https://colah.github.io/posts/2015-08-Understanding-LSTMs/). In our LSTM model, we first
create a word embedding matrix, a tokenizer, and a vocabulary for our training corpus
to identify unknown words. We shuffle our training corpus to make
training more efficient. We introduce three layers with different activation functions (ReLU,
sigmoid) to train the LSTM model on the training corpus and test it on the test
corpus. We use binary cross-entropy as the loss function for binary decisions
and Adam as the optimizer when compiling the model. Subfigure 1a
shows the structure of the model.</p>
        <sec id="sec-4-1-1">
          <p>Fig. 1: (a) Model-I, (b) Model-II.</p>
          <p>Model-II We tried a simple variant of Model-I (subfigure 1a) by adding
some extra hand-crafted features. We added the hand-crafted features to the
developed model to check their effect on the performance of the system. Subfigure 1b
shows the structure of the model. The additional hand-crafted
features are:
- Emoji Count: the total number of emojis present in the tweet.
- Word Count: the total number of words present in the tweet.
- Polarity of Sentiment Analysis: 1 for positive, -1 for negative, and
0 for neutral sentiment.
- Mean Word Length: the average length of the words present in
the tweet, i.e., the ratio of the total length of the words in the tweet to
the total number of words in the tweet.
- Sentence Length: the length of the sentence in the tweet.
- Special Character: the total count of punctuation marks present in the
tweet.
- Unrepeated Words: the total number of words that appear
only once in the tweet.
- URL Extractor: 1 if a URL is present in the tweet, otherwise
0.</p>
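          <p>The features listed above can be computed along the following lines. This is a sketch under assumptions, not the submitted code: emojis are approximated by two Unicode code-point ranges, "sentence length" is taken to be the character length of the tweet, and the sentiment polarity is passed in as an argument because the sentiment tool is not specified here. All names are hypothetical.</p>
          <preformat>
```python
import string
from collections import Counter

# Approximate emoji blocks (assumption): pictographs/emoticons and misc symbols/dingbats.
EMOJI_RANGES = (range(0x1F300, 0x1FB00), range(0x2600, 0x27C0))
PUNCT = set(string.punctuation) | set("،؛؟")


def is_emoji(ch):
    return any(ord(ch) in r for r in EMOJI_RANGES)


def handcrafted_features(tweet, polarity=0):
    """Return the eight hand-crafted features of one tweet as a dict."""
    words = tweet.split()
    counts = Counter(words)
    n_words = len(words)
    return {
        "emoji_count": sum(is_emoji(ch) for ch in tweet),
        "word_count": n_words,
        "polarity": polarity,  # 1, -1 or 0, supplied by an external sentiment tool
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "sentence_length": len(tweet),  # character length (assumption)
        "special_char_count": sum(ch in PUNCT for ch in tweet),
        "unrepeated_words": sum(1 for c in counts.values() if c == 1),
        "has_url": int("http://" in tweet or "https://" in tweet),
    }


features = handcrafted_features("good good day! 😀 https://t.co/x", polarity=1)
print(features)
```
          </preformat>
          <p>These features are computed on the raw tweet, before URLs and punctuation are removed, since several of them depend on exactly that material.</p>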
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Evaluation Framework</title>
        <p>
          We used Aravec [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to create vectors of Arabic words. Word embeddings are
vector representations of words that capture their semantic properties [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We split
the words of a tweet on whitespace and applied a Twitter-based embedding to get
the word vectors. After the vectors are created, the tweets are fed to the
LSTM layer, and then a softmax layer is applied to obtain the final class of the
tweet.
        </p>
        <p>For Model-II, we extracted the features from the tweet data and appended
these extra features to the output of the LSTM layer. After that, the final result is
obtained from the softmax layer. The GitHub link for the proposed framework
is given in footnote 6.</p>
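        <p>The fusion step of Model-II, appending the hand-crafted features to the LSTM output before the softmax layer, can be illustrated in plain Python. This is only a sketch: the vectors, weights, and biases below are made-up numbers, and the linear projection in front of the softmax is an assumption about how the softmax layer is realized. In the actual system these parameters are learned during training.</p>
        <preformat>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(lstm_output, features, weights, biases):
    """Concatenate LSTM output and features, project linearly, apply softmax."""
    fused = list(lstm_output) + list(features)
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical values: a 2-dimensional LSTM output and 2 hand-crafted features.
lstm_output = [0.2, -0.1]
features = [1.0, 5.0]
weights = [[0.5, 0.1, 0.3, 0.0], [-0.5, 0.2, 0.0, 0.1]]  # one row per class
biases = [0.0, 0.0]
probabilities = classify(lstm_output, features, weights, biases)
print(probabilities)
```
        </preformat>
        <p>Because the features enter after the LSTM, they influence only the final decision layer and leave the sequence model itself unchanged.</p>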
      </sec>
      <sec id="sec-4-3">
        <title>Results</title>
        <p>The performance of the system is evaluated based on the accuracy of the
developed system. The Accuracy of a system is the ratio of the total number of
instances predicted correctly to the total number of instances present in the
data.</p>
        <p>
          <disp-formula><label>(1)</label><tex-math>\mathrm{Accuracy} = \frac{P}{P + Q}</tex-math></disp-formula>
        </p>
        <p>where P is the number of instances predicted correctly and Q is the number
of instances predicted incorrectly.</p>
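        <p>As a small worked example of Equation (1), the following sketch counts P and Q from two label lists. The function name and the toy labels are illustrative assumptions.</p>
        <preformat>
```python
def accuracy(y_true, y_pred):
    """Accuracy = P / (P + Q), with P correct and Q incorrect predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    p = sum(t == y for t, y in zip(y_true, y_pred))  # predicted correctly
    q = len(y_true) - p                              # predicted incorrectly
    return p / (p + q)


# Toy example: 3 of 4 predictions are correct.
score = accuracy(["male", "female", "female", "male"],
                 ["male", "female", "male", "male"])
print(score)
```
        </preformat>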
        <p>We found that the accuracy of Model-I was lower than that of Model-II, so we
report the performance of Model-II only. Table 1 shows the accuracy
achieved on the training data, and Table 2 shows the cross-validation
accuracy on the training data. We performed 5-fold cross-validation on the training
data to check the performance of the system and to estimate the generalization error.
This was done because we did not have the labels of the test data before submission.</p>
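        <p>The 5-fold split can be sketched as below. This is an assumption-laden illustration: the function name, the shuffling seed, and the choice to split by sample index are hypothetical, and the exact splitting procedure we used is not described here.</p>
        <preformat>
```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(10, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```
        </preformat>
        <p>Since each user contributes many tweets, splitting by user rather than by tweet would keep one user's tweets from appearing in both the training and validation folds. Which of the two was used is not stated above.</p>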
        <p>On the training data, Model-II gives better results than Model-I, and the
same holds on the test data. Model-II achieved an accuracy of 66.25%
for gender, 22.22% for age, and 80.28% for language variety. The joint
accuracy of Model-II is 0.1083, while the joint accuracy of Model-I is 0.0722.
Table 3 shows the results of the proposed systems on the test data.</p>
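        <p>The joint accuracy is not defined above. The sketch below assumes that a user counts as correct only when gender, age, and language variety are all predicted correctly. The function name and the label strings are hypothetical.</p>
        <preformat>
```python
def joint_accuracy(gold, predicted):
    """Fraction of users whose (gender, age, dialect) triple is fully correct."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)


# Toy example with made-up labels: the second user's age group is wrong.
gold = [("male", "25-34", "Egypt"), ("female", "under 25", "Gulf")]
predicted = [("male", "25-34", "Egypt"), ("female", "25-34", "Gulf")]
joint = joint_accuracy(gold, predicted)
print(joint)
```
        </preformat>
        <p>Under this reading, the joint score can never exceed the lowest single-trait accuracy, which is consistent with the joint value lying below the 22.22% reported for age.</p>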
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future work</title>
      <p>In this work, we have addressed the task of automatically classifying an
author's gender, age, and dialect from their writing. This task has several
potential applications, such as security, forensics, and marketing. We have
developed an LSTM-based neural network model for recognizing the age, gender, and
language variety of an author from their written text, and we have used some
style-based features to improve the performance of the LSTM-based system. In
the future, we will try to optimize the neural network architecture to enhance
the efficiency of the system. We will also look into task-specific hand-crafted
features to improve the performance of the system, and we will work on using
the properties of homonyms in our feature detection.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alvarez-Carmona</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez-Monroy</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y Gomez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jair-Escalante</surname>
          </string-name>
          , H.:
          <article-title>INAOE's participation at PAN'15: Author profiling task</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bartle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
          </string-name>
          , J.:
          <article-title>Gender classification with deep learning</article-title>
          .
          <source>In: Technical report. The Stanford NLP Group</source>
          .
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charfi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaghouani</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghanem</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez-Junquera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Overview of the track on author profiling and deception detection in Arabic</article-title>
          . In: Mehta P.,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Majumder</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            <given-names>M</given-names>
          </string-name>
          . (Eds.)
          <article-title>Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019)</article-title>
          . CEUR Workshop Proceedings. In: CEUR-WS.org, Kolkata, India, December 12-15
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mansanet</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Albiol</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paredes</surname>
          </string-name>
          , R.:
          <article-title>Local deep neural networks for gender recognition</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>70</volume>
          ,
          <fpage>80</fpage>
          –
          <lpage>86</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          :
          <article-title>Adapting cross-genre author profiling to language and corpus</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          . pp.
          <fpage>947</fpage>
          –
          <lpage>955</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Patra</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saikh</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatic author profiling based on linguistic and stylistic features</article-title>
          .
          <source>Notebook for PAN at CLEF</source>
          <volume>1179</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rangel Pardo</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd author profiling task at PAN 2015</article-title>
          . In:
          <source>CLEF 2015 Evaluation Labs and Workshop Working Notes Papers</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Soliman</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eissa</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Beltagy</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          :
          <article-title>AraVec: A set of Arabic word embedding models for use in Arabic NLP</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>117</volume>
          ,
          <fpage>256</fpage>
          –
          <lpage>265</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Word embedding:
          <article-title>Word embedding - Wikipedia, the free encyclopedia</article-title>
          (
          <year>2019</year>
          ), https://en.wikipedia.org/wiki/Word_embedding, [Online; accessed 16-March-2019]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>