<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A statistical approach to gender and age range classification in a multilingual corpus</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Autoritas Consulting</institution>
          ,
          <addr-line>Valencia 46011</addr-line>
          ,
          <country country="ES">SPAIN</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Optical Tech and Support</institution>
          ,
          <addr-line>Valencia 46025</addr-line>
          ,
          <country country="ES">SPAIN</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes a statistical approach to the task of gender and age range classification in SMS. We started by developing our own implementation of the Low Dimensional Representation (LDR) method, with the idea of adding statistics that had not been used in the original implementation, such as skewness, kurtosis and central moments. The proposed method calculates term frequencies and uses 3 statistics per class: mean, standard deviation and skewness.</p>
      </abstract>
      <kwd-group>
        <kwd>Gender identification</kwd>
        <kwd>age identification</kwd>
        <kwd>author profiling</kwd>
        <kwd>statistical approach</kwd>
        <kwd>LDR</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction.</title>
      <p>
        The author profiling task is widely studied, and new ideas arise from time
to time [
        <xref ref-type="bibr" rid="ref1 ref3 ref4">4,3,1</xref>
        ]. We pursued a model for classifying SMS messages that fits into a Big Data
environment. We could have considered models based on Deep Learning, but these are
expensive in time and resources to build and train, and also to use for prediction.
Our goal in the MAPonSMS Author Profiling task was to test our algorithm on
a multilingual corpus for gender and age range classification. The presented
approach implements a variation of the Low Dimensional Representation (LDR)
method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] with new statistics: skewness, kurtosis and central moments. While
LDR uses 6 characteristics per class, the proposed algorithm reduces the number
of characteristics to 3 per class, which is attractive in the context of a Big Data
application. Whereas in BoW or similar approaches every word is a
characteristic, our method encodes the words as probabilities per class and then
computes the average, standard deviation and skewness of those probabilities,
which leaves only 3 characteristics per class. This is implemented by retrieving
the associated values from previously built dictionaries, which is very efficient.
In a Big Data environment, velocity tends to be critical, and our method speeds up
the process.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Corpus.</title>
      <p>The MAPonSMS Author Profiling training corpus consists of a single
collection of SMS messages written in a mix of Roman Urdu and English.
The goal of the task is to classify each user by gender (Male/Female) and by
age range (15-19, 20-24, 25-xx). We were provided with SMS messages from 350
different users, and the number of SMS messages varied from user to user.
With regard to gender, the classes were not balanced: 60% of the SMS messages
were labeled as Male and 40% as Female. The age range classes were not balanced
either: 30.86% of the users belong to the 15-19 class, 50.29% to the 20-24 class
and 18.85% to the 25-xx class.</p>
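      <p>As a quick arithmetic check, the age range percentages can be converted into approximate user counts. The short Python sketch below assumes that the percentages are taken over the 350 users and rounds to the nearest whole user; the variable names are illustrative only and do not come from the task materials.</p>
      <preformat>
```python
# Illustrative check: convert the reported age range percentages into user counts.
TOTAL_USERS = 350  # number of users reported for the training corpus

age_share_percent = {"15-19": 30.86, "20-24": 50.29, "25-xx": 18.85}

# Round to the nearest whole user, since the shares are reported to 2 decimals.
age_user_counts = {
    label: round(TOTAL_USERS * share / 100)
    for label, share in age_share_percent.items()
}

for label, count in age_user_counts.items():
    print(label, count)
print("total", sum(age_user_counts.values()))
```
      </preformat>
      <p>Under this assumption the three classes hold roughly 108, 176 and 66 users, which add up to the 350 users in the corpus.</p>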
      <p>Our goal was to develop a method that was not language dependent and that
required no prior knowledge of the language. We started by implementing a TF
(term frequency) representation for each user in the dataset, counting how many
times each word appears for each user and globally for all users. We decided to
use TF because it lets us represent an a priori class-dependent probability for
each term and each class simply by counting the number of times the term occurs
in the class and dividing this count by the total number of times the term occurs
across all classes. For ease of interpretation, we wanted to be sure we were
dealing with probabilities. In addition, calculating TF is less time and resource
consuming than calculating TF-IDF (as in the original LDR). We built a vocabulary
containing every word seen in the training corpus, and we decided to discard
the words that appear fewer than 5 times in the corpus, which greatly reduces
the size of the resulting dictionaries. Then we went over the training corpus
one user at a time, looking up each of the user's words and writing the related
a priori probability for each class into an array. This gave us one vector
per class per user (2 for gender, 3 for age range) holding the a priori
probability that each word belongs to that class. From these a priori probability
arrays, which represent the text used by each user, we can calculate the
different statistics. Once we had the average, standard deviation and skewness
for every user in the training dataset, we used these characteristics to train a
LinearSVC classifier from Python's scikit-learn (Sklearn) library. This is a
Support Vector Machine with a linear kernel, in which multiclass support is
handled according to a one-vs-the-rest scheme.</p>
      <p>It is important to note that when we want the model to include more users or
SMS messages in the a priori probability vectors, we only have to run the
procedure with the new labeled SMS messages. The resulting vectors are what we
use to predict new incoming SMS messages. The whole procedure is simple and
fast, and can be done in parallel. Once the new vectors are built, we only need
to point the algorithm to them. This keeps the vectors up to date easily and,
more importantly, gives a clean, fast and reliable updating procedure. The code
used in this task can be found at
https://github.com/OscarGariboiOrts/MAPonSMS18</p>
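      <p>The feature extraction described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the repository: the function names (class_probabilities, skewness, user_features) are hypothetical, tokenization is assumed to be lowercasing plus whitespace splitting, the standard deviation and skewness are assumed to be the population versions, and a user with no known words is assumed to get all-zero features.</p>
      <preformat>
```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def class_probabilities(labeled_texts, min_count=5):
    """Return ({term: {class: count_in_class / count_in_all_classes}}, classes)."""
    per_class = defaultdict(Counter)
    total = Counter()
    for text, label in labeled_texts:
        tokens = text.lower().split()  # assumed tokenizer
        per_class[label].update(tokens)
        total.update(tokens)
    classes = sorted(per_class)
    probabilities = {
        term: {c: per_class[c][term] / n for c in classes}
        for term, n in total.items()
        if n >= min_count  # discard rare terms
    }
    return probabilities, classes


def skewness(values):
    """Population skewness; defined here as 0.0 when the values do not vary."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0
    return mean([((v - mu) / sigma) ** 3 for v in values])


def user_features(user_text, probabilities, classes):
    """Return [mean, std, skewness] per class, concatenated in class order."""
    tokens = [t for t in user_text.lower().split() if t in probabilities]
    features = []
    for c in classes:
        values = [probabilities[t][c] for t in tokens]
        if not values:  # assumed fallback for users with no known words
            features.extend([0.0, 0.0, 0.0])
        else:
            features.extend([mean(values), pstdev(values), skewness(values)])
    return features


# Toy usage with invented messages and a lowered threshold.
train = [
    ("hello bro match today", "male"),
    ("bro match tonight", "male"),
    ("hello dear shopping today", "female"),
]
probs, classes = class_probabilities(train, min_count=2)
features = user_features("hello bro match", probs, classes)
print(classes, features)
```
      </preformat>
      <p>With 2 gender classes the sketch yields 6 values per user (3 per class), and with 3 age range classes it would yield 9, regardless of the vocabulary size.</p>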
    </sec>
    <sec id="sec-3">
      <title>Evaluation results.</title>
      <p>In Table 3 we present our results versus the Baseline. The Baseline was built
by controlling class. Our results improve on the Baseline by almost 30% for
gender and by almost 12% for age range. We have stated that we removed from the
vocabulary all words appearing in fewer than 5 users. This is a parameter that
can be adjusted depending on the dataset characteristics and on the task
involved. We would have liked to make more than one submission, as had been
stated in the task, since we have seen that removing more or fewer words can
have a big influence on the accuracy of the algorithm. This is even more
important for a dataset as small as the one used in this task. Our
classification method achieves the accuracies shown in Table 3.</p>
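      <p>To get a feeling for how strongly the removal threshold changes the vocabulary, one can count how many terms survive for several candidate values. The Python sketch below is only an illustration: the function name vocabulary_size_by_threshold is hypothetical, the messages are invented, and the threshold is assumed to apply to the total number of occurrences of a term.</p>
      <preformat>
```python
from collections import Counter


def vocabulary_size_by_threshold(texts, thresholds):
    """Return {threshold: number of terms occurring at least that many times}."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {
        k: sum(1 for n in counts.values() if n >= k)
        for k in thresholds
    }


# Invented toy messages, used only to show the shape of the result.
messages = ["ok bro", "ok dear", "ok bro see you", "see you"]
sizes = vocabulary_size_by_threshold(messages, thresholds=[1, 2, 3])
print(sizes)
```
      </preformat>
      <p>On a small corpus even a change of one unit in the threshold can remove a large share of the vocabulary, which is why this parameter deserves tuning.</p>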
      <p>
        The method uses statistical parameters calculated from the a priori
probabilities that each word belongs to one of the classes. The average, standard deviation
and skewness of these distributions should reflect the fact that different classes
use language in different ways. In the end, we rely on the mathematical
idea that the more similar two distributions are, the more similar their statistics will
be. Hence, when fed with these statistics, the Support Vector Machine (SVM)
should be able to find a hyperplane that linearly separates the related classes. In
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we showed the importance of skewness in a two-class classification problem.
Here we have used skewness together with the average and standard deviation to
calibrate the accuracy of the method.
      </p>
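      <p>The classification step can be sketched with scikit-learn's LinearSVC, the classifier named in the Corpus section. The feature rows below are invented for illustration (mean, standard deviation and skewness for one class followed by the same three statistics for the other class), and the default hyperparameters are an assumption, since the settings actually used are not stated here.</p>
      <preformat>
```python
from sklearn.svm import LinearSVC

# Invented feature rows: [mean, std, skewness] for class A, then for class B.
X_train = [
    [0.80, 0.10, -0.7, 0.20, 0.10, 0.7],
    [0.75, 0.12, -0.5, 0.25, 0.12, 0.5],
    [0.20, 0.10, 0.6, 0.80, 0.10, -0.6],
    [0.30, 0.15, 0.4, 0.70, 0.15, -0.4],
]
y_train = ["male", "male", "female", "female"]

# Linear SVM; with more than two classes it trains one-vs-the-rest classifiers.
clf = LinearSVC()
clf.fit(X_train, y_train)

# Predict the label of a new user from the same kind of feature row.
predicted = clf.predict([[0.78, 0.11, -0.6, 0.22, 0.11, 0.6]])
print(predicted[0])
```
      </preformat>
      <p>Because each user is described by only 3 values per class, training and prediction stay cheap even when the number of users grows.</p>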
    </sec>
    <sec id="sec-4">
      <title>Literature review.</title>
      <p>
        Work on multilingual author profiling is rare. "Multilingual author profiling
on Facebook" [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] uses state-of-the-art author profiling techniques,
such as content-based features (word and character n-grams) and 64 different
stylistic features (11 lexical word-based features, 47 lexical character-based
features and 6 vocabulary richness measures), for age and gender identification
on multilingual corpora. These techniques rely on lexical and stylistic features,
whereas the method we presented relies on the fact that different classes use
language in different ways. Men and women use language in different manners [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Age range classification is usually approached using language features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
which basically means that different age ranges use language in different ways. This is
what we try to capture in the proposed method, which relies on the mathematical
distribution of the words used by each class.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work.</title>
      <p>In this paper we presented a method to classify gender and age range in the
MAPonSMS International Author Profiling Shared Task at the Forum for
Information Retrieval Evaluation (FIRE'18). We have shown that encoding the text
with statistical features extracted from a priori class-dependent arrays is an
easy, cheap and language-independent method for classifying texts with different
features. We always rely on the fact that different classes use the
language in different ways, from which our algorithm can extract information. A
bigger dataset should give us a clearer idea of how good the performance can
be. We also need to check how critical the removal of words is for performance.
For age range we could consider not removing any words. In fact, while working
on the task we saw that eliminating words that appeared fewer than 3 times gave
us a test set prediction with the same percentage of Male and Female labels as
in the training set. But since only a single submission was finally allowed, we
decided to stay with the threshold for which we had already run some in-house
tests. We could try to introduce other statistics, such as central moments,
minimums and maximums, to see whether these new features affect the performance
of the algorithm.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Garibo-Orts</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <article-title>A Big Data approach to gender classification in Twitter. Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          ).
          <article-title>In: Patrice Bellot and Chiraz Trabelsi and Josiane Mothe and Fionn Murtagh and Jian Yun Nie and Laure Soulier and Eric Sanjuan and Linda Cappellato</article-title>
          and Nicola Ferro Editors (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franco-Salvador</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In: 17th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          , CICLing, Springer-Verlag, LNCS (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gomez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter</article-title>
          . In: Linda Cappellato and
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <surname>Jian-Yun Nie</surname>
          </string-name>
          and Laure Soulier (Eds.)
          <article-title>CLEF 2018 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</article-title>
          . In: Cappellato L.,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            <given-names>T</given-names>
          </string-name>
          . (Eds.)
          <article-title>CLEF 2017 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          , vol.
          <source>1866</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Mehwish</given-names>
            <surname>Fatima</surname>
          </string-name>
          , Komal Hasan, Saba Anwar, Rao Muhammad Adeel Nawab (
          <year>2017</year>
          ),
          <article-title>"Multilingual author profiling on Facebook"</article-title>
          ,
          <source>Information Processing &amp; Management, Elsevier</source>
          , pp:
          <fpage>886</fpage>
          -
          <lpage>904</lpage>
          , Vol:
          <volume>53</volume>
          , Issue: 4, ISSN: 0306-4573
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Xiufang</given-names>
            <surname>Xia</surname>
          </string-name>
          (
          <year>2013</year>
          ),
          <article-title>"Gender Differences in Using Language"</article-title>
          .
          <source>ISSN 1799-2591. Theory and Practice in Language Studies</source>
          , Vol.
          <volume>3</volume>
          , No.
          <issue>8</issue>
          , pp.
          <fpage>1485</fpage>
          -
          <lpage>1489</lpage>
          ,
          <year>August 2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Morgan-Lopez</surname>
            <given-names>AA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            <given-names>AE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chew</surname>
            <given-names>RF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruddle</surname>
            <given-names>P</given-names>
          </string-name>
          (
          <year>2017</year>
          )
          <article-title>Predicting age groups of Twitter users based on language and metadata features</article-title>
          .
          <source>PLoS ONE</source>
          <volume>12</volume>
          (8): e0183537
          . https://doi.org/10.1371/journal.pone.0183537
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>