<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gender and Age Prediction Multilingual Author Profiles Based on Comments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ali Nemati</string-name>
          <email>anemati@uw.udu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>1900 Commerce St</institution>
          ,
          <addr-line>Tacoma, WA 98402</addr-line>
          ,
          <country country="US">United States of America</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Washington Tacoma</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recently, several approaches have been presented to automatically detect users' age and gender from multilingual documents, texts, and comments on the web or in social media status updates. The purpose of this task is to determine information such as age and gender from multilingual (Roman Urdu and English) author profiles based on texts or documents. Using four machine learning techniques, my system derives an ensemble model for the age and gender categories. The ensemble model is composed of a Multinomial Naive Bayes classifier, a Gradient Boosting classifier, a Logistic Regression CV classifier, and a Multi-Layer Perceptron classifier. The system automatically categorizes text sources by age and gender on unknown testing data. The accuracy is 83 percent for the gender category, 60 percent for the age category, and 49 percent for the joint age and gender category.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The author profiling task aims to classify age and gender by extracting features
from texts, documents, and comments on the web or in social media status updates
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recently, many researchers have investigated multilingual author comments to
detect as much important information about an author as possible, such as gender and
age. For example, businesses gather customers’ age and gender in
order to provide better services in the future [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Furthermore, identifying customers’ gender and age from the style of their
comments on social media helps companies recognize who their customers are, so that
they can make decisions to improve their services in the future [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To develop and evaluate
the automatic author profiling system, a training dataset of 350 separate text files
is used. The training dataset contains documents collected from social media such as
Facebook, Twitter, and other social media websites, and the authors’ comments are
written in multiple languages, namely Roman Urdu and English.
      </p>
      <p>
        The dataset was collected from smartphones, with text typed on QWERTY
keyboards, and is publicly available at the web address “FIRE’18 MAPonSMS” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A
ground-truth CSV file with 350 records was released, giving the age and gender that
correspond to each text file. An ensemble model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which is a combination of four
classifiers, is used in this study: a Logistic Regression CV classifier, a Naïve Bayes
classifier, a Multi-layer Perceptron (MLP) classifier, and a Gradient Boosting classifier.
The goal of this task is to implement a system that recognizes users’ information on
social media. The system is trained on authors’ Short Message Service (SMS) messages
or documents. On unknown testing data, the accuracy is 83 percent for the gender
class, 60 percent for the age class, and 49 percent for the joint age and gender
class. The findings on our dataset are:
1. Even though the dataset is very small, better performance than the baseline result is
achieved.
2. The results improved when the ensemble model was used, because no single model
for analyzing and processing the text data achieved better performance or higher
accuracy.
      </p>
      <p>The Python application can be downloaded from https://goo.gl/D37Qii .</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        Author profile identification is used in several areas such as
psychology and natural language processing. In more recent studies, interest in
data mining has grown, and several papers have explored age and gender prediction
from information collected over social media [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7-10</xref>
        ]. Gender
identification was done by Burger and Henderson in 2006 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another author profiling study
was proposed by Pastor López-Monroy and his colleagues in 2015, introducing a new
document representation for detecting gender and age over social media. Furthermore,
López-Monroy et al. presented a new document representation for author profiling in 2013
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Marquardt et al. published a paper on age and gender
identification in social media at the University of Washington [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Similar work
has addressed predicting attributes such as gender and age from a smaller dataset
consisting of social media comments on Twitter [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>These studies used datasets that differ from the one used in this paper. The
prior work on age and gender prediction has aimed at using ensemble models to obtain
higher performance.</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset Description and Preprocessing</title>
      <p>The dataset has a gender class and an age class. Gender is a binary classification
(male or female), and age is a multi-class classification over the age groups 15-19,
20-24, and 25-xx. The ensemble model is chosen to reduce training time and to obtain
a result that is highly accurate, or at least as close as possible to the real
outcome.</p>
      <p>Figure 1 shows the distribution of age and gender in the ground-truth CSV file
released with the training data. It illustrates that 40 percent of the records are
female and 60 percent are male. 50.28 percent of the authors are aged 20-24, and
30.85 percent are aged 15-19. The rest of the records are 25 years old or
above.</p>
      <p>The baseline result for this dataset is 60 percent for gender (male) and 51 percent
for age (the 20-24 group).</p>
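      <p>As a quick check, the shares quoted above and a majority-class baseline can be recomputed from the per-group record counts reported with Figure 1. The sketch below is illustrative only: the counts are transcribed from the figure, and the helper names marginal and majority_share are hypothetical, not part of the original application.</p>
      <preformat>
```python
# Record counts per (age group, gender), transcribed from Figure 1.
counts = {
    ("15-19", "female"): 38, ("15-19", "male"): 70,
    ("20-24", "female"): 64, ("20-24", "male"): 112,
    ("25-xx", "female"): 38, ("25-xx", "male"): 28,
}


def marginal(counts, index):
    """Sum the counts over one axis (0 = age group, 1 = gender)."""
    totals = {}
    for key, n in counts.items():
        totals[key[index]] = totals.get(key[index], 0) + n
    return totals


def majority_share(totals):
    """Return the most frequent label and its share of all records."""
    label = max(totals, key=totals.get)
    return label, totals[label] / sum(totals.values())


total = sum(counts.values())
gender_label, gender_share = majority_share(marginal(counts, 1))
age_label, age_share = majority_share(marginal(counts, 0))

print(total)
print(gender_label, round(100 * gender_share, 2))
print(age_label, round(100 * age_share, 2))
```
      </preformat>
      <p>Under these counts, always predicting the most frequent class gives 60 percent for gender and roughly 50 percent for age, which is close to the baseline quoted above.</p>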
      <p>[Figure 1: bar chart of the number of records per age group and gender. 15-19: 38 female, 70 male; 20-24: 64 female, 112 male; 25-xx: 38 female, 28 male.]</p>
      <p>
This system detects the age and gender classes from author profiles. For text
processing, the scikit-learn package is applied. This package is widely used for
machine learning and is a free library. The ensemble model is a supervised learning
model built with the scikit-learn package.</p>
      <p>First, the application receives the training file, which has two columns, age
and gender. A third column, called transcripts_test, is created to accumulate all of
the comments associated with each author. Next, the dataset is split into 280 training
instances (80%) and 70 test instances (20%). The dataset is shuffled to reduce
variance and avoid overfitting.</p>
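      <p>The shuffled 280/70 split described above can be sketched with scikit-learn's train_test_split. This is a minimal sketch under stated assumptions: the placeholder documents and labels are invented for illustration, and the fixed random_state is only there to make the example repeatable; the original application may have split the data differently.</p>
      <preformat>
```python
from sklearn.model_selection import train_test_split

# Placeholder data: 350 documents with one gender label each (illustrative only).
documents = [f"message {i}" for i in range(350)]
labels = ["male" if i % 5 >= 2 else "female" for i in range(350)]

# Shuffle the records, then hold out 70 of the 350 (20 percent) for testing.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    documents, labels, test_size=70, shuffle=True, random_state=0
)

print(len(train_docs), len(test_docs))
```
      </preformat>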
      <p>
        Afterward, to convert the transcripts_test column to a vector of integer counts, the
system passes transcripts_test to CountVectorizer with all parameters at their
defaults [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Finally, the application calls the ensemble model discussed
in this paper, fits it, and predicts the results with five-fold cross-validation.
Multiple models could be applied to this task, but these models are well suited to
text classification with binary and Boolean features. The application runs on Google
Colaboratory (Colab), which offers a free cloud CPU service with TensorFlow. The
Google Colab environment consists of a 33 GB hard disk, 13 GB of RAM, and a 2-core
Xeon 2.2 GHz CPU. Python 3.x (3.6) is used for running and testing the system.
      </p>
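      <p>The vectorization and five-fold cross-validation steps can be sketched as follows. The toy corpus and labels are assumptions standing in for the transcripts_test column, and MultinomialNB is used here only as a stand-in for the full ensemble.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the transcripts_test column (illustrative only).
texts = ["kal milte hain", "see you tomorrow", "ok theek hai",
         "thanks a lot", "kya haal hai", "call me later",
         "main aa raha hoon", "good night", "acha theek", "on my way"]
labels = ["female", "male"] * 5

# Convert each document to a vector of integer word counts (default parameters).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Five-fold cross-validation; the mean of the fold accuracies is reported.
scores = cross_val_score(MultinomialNB(), X, labels, cv=5)
print(X.shape[0], len(scores), scores.mean())
```
      </preformat>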
      <p>The system uses an ensemble model that combines four high-accuracy classifiers.
All four models are joined into one model that collects a vote from each of them.
As a result, the ensemble model achieves high accuracy, or results close to the real
values, when five-fold cross-validation is applied. The classifiers in the ensemble
are the Logistic Regression CV classifier, the Naïve Bayes classifier, the Multi-layer
Perceptron (MLP) classifier, and the Gradient Boosting classifier. These four machine
learning classifiers are described in the following subsections.</p>
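      <p>One way to join the four classifiers into a single voting model is scikit-learn's VotingClassifier. The sketch below is an assumption about how such an ensemble could be assembled, not the original code: the toy texts and labels are invented, the hyperparameters follow the values reported in this paper, and the inner cv=3 and random_state values are chosen only so the example runs on a tiny dataset.</p>
      <preformat>
```python
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only): 12 short messages, 6 per class.
texts = ["kal milte hain", "see you tomorrow", "ok theek hai", "thanks a lot",
         "kya haal hai", "call me later", "main aa raha hoon", "good night",
         "acha theek", "on my way", "kahan ho tum", "talk to you soon"]
labels = ["female", "male"] * 6

# Four classifiers combined by majority (hard) voting.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegressionCV(solver="liblinear", cv=3)),
        ("nb", MultinomialNB(alpha=0.13)),
        ("gb", GradientBoostingClassifier(learning_rate=0.2, max_depth=2,
                                          random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(21,), max_iter=1500,
                              tol=0.012, random_state=0)),
    ],
    voting="hard",
)

# Word counts feed all four classifiers.
model = make_pipeline(CountVectorizer(), ensemble)
model.fit(texts, labels)
prediction = model.predict(["see you soon"])
print(prediction[0])
```
      </preformat>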
      <sec id="sec-3-1">
        <title>Logistic Regression CV Classifier</title>
        <p>The system uses the Logistic Regression CV classifier as one of its models, with
Python 3.6. Logistic regression is a machine learning method for analyzing
high-dimensional information and text datasets.</p>
        <p>It uses the logistic sigmoid function to produce its output for the text sources,
and different parameters were tried. In the end, the solver parameter was changed
from the default ‘lbfgs’ to ‘liblinear’, because it is the appropriate solver for a
small dataset. Other parameters keep their defaults. With five-fold
cross-validation, the result is 84.27 percent for the gender class and 64.32 percent
for the age class.</p>
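        <p>A minimal sketch of this classifier, assuming the solver change described above, is given below. The toy data and the inner cv=3 setting are assumptions made so that the example runs on a few messages; in scikit-learn, the CV in LogisticRegressionCV refers to an internal cross-validation that selects the regularization strength.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only): 12 short messages, 6 per class.
texts = ["kal milte hain", "see you tomorrow", "ok theek hai", "thanks a lot",
         "kya haal hai", "call me later", "main aa raha hoon", "good night",
         "acha theek", "on my way", "kahan ho tum", "talk to you soon"]
labels = ["female", "male"] * 6

# Word counts followed by logistic regression with the liblinear solver.
clf = make_pipeline(
    CountVectorizer(),
    LogisticRegressionCV(solver="liblinear", cv=3),
)
clf.fit(texts, labels)

# The sigmoid output is a probability for each class.
proba = clf.predict_proba(["good morning"])[0]
print(dict(zip(clf.classes_, proba)))
```
        </preformat>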
      </sec>
      <sec id="sec-3-2">
        <title>Naïve Bayes Classifier</title>
        <p>The Naïve Bayes classifier is an excellent machine learning technique for text
categorization. It is a very fast method that is used in real-world applications such
as spam filtering, document categorization, and text classification, as in our task.
The Naive Bayes classifier has three variants: multinomial, Gaussian, and Bernoulli.</p>
        <p>
          The Multinomial Naive Bayes (Multinomial NB) classifier is chosen so that
features such as word counts can be used for our task [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This model normally expects
integer feature counts, although in practice fractional counts such as term
frequency-inverse document frequency (tf-idf) weights also work. The system therefore
applies tf-idf to the word counts.
        </p>
        <p>The tf-idf weighting reflects how important each word is in the dataset.
Multinomial NB is used with alpha equal to 0.13, and the rest of the parameters keep
their defaults. The result is 86.08 percent accuracy for gender and 65.01 percent
accuracy for age.</p>
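        <p>The counting, tf-idf weighting, and Multinomial NB steps can be chained as in the sketch below. Only alpha=0.13 comes from the text; the pipeline layout and the toy data are assumptions for illustration.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only): 12 short messages, 6 per class.
texts = ["kal milte hain", "see you tomorrow", "ok theek hai", "thanks a lot",
         "kya haal hai", "call me later", "main aa raha hoon", "good night",
         "acha theek", "on my way", "kahan ho tum", "talk to you soon"]
labels = ["female", "male"] * 6

# Word counts, re-weighted by tf-idf, then Multinomial NB with alpha = 0.13.
nb_model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    MultinomialNB(alpha=0.13),
)
nb_model.fit(texts, labels)
nb_prediction = nb_model.predict(["theek hai"])
print(nb_prediction[0])
```
        </preformat>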
      </sec>
      <sec id="sec-3-3">
        <title>Gradient Boosting Classifier</title>
        <p>To obtain high accuracy on the text source, the Gradient Boosting classifier is
used with the following parameters modified. First, the learning rate is raised from
its default of 0.1 to 0.2. Then, max_depth is changed from its default of 3 to 2 to
achieve better performance. Finally, random_state is set to False because the dataset
has already been shuffled. Other parameters are not changed. The results show 84.29
percent for gender and 64.89 percent for age prediction.</p>
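        <p>A sketch of this configuration is shown below, assuming word-count features as input. The learning_rate and max_depth values follow the text. As an assumption that departs from the text, the sketch passes an integer seed for random_state so that the example is repeatable; the toy data is invented.</p>
        <preformat>
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only): 12 short messages, 6 per class.
texts = ["kal milte hain", "see you tomorrow", "ok theek hai", "thanks a lot",
         "kya haal hai", "call me later", "main aa raha hoon", "good night",
         "acha theek", "on my way", "kahan ho tum", "talk to you soon"]
labels = ["female", "male"] * 6

# learning_rate raised from 0.1 to 0.2, max_depth lowered from 3 to 2.
gb = GradientBoostingClassifier(learning_rate=0.2, max_depth=2, random_state=0)
gb_model = make_pipeline(CountVectorizer(), gb)
gb_model.fit(texts, labels)
gb_prediction = gb_model.predict(["call me tomorrow"])
print(gb_prediction[0])
```
        </preformat>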
      </sec>
      <sec id="sec-3-4">
        <title>Multi-Layer Perceptron Network Classifier (MLP Classifier)</title>
        <p>The Multi-Layer Perceptron Network classifier (MLP classifier) is derived from
the feedforward artificial neural network and is trained with backpropagation. It
obtains an accuracy of 86.76 percent for the gender category and 66.54 percent for
the age category. Achieving this accuracy requires modifying the following
parameters. The hidden_layer_sizes parameter is changed to 21 (in scikit-learn, this
value gives the number of units in a hidden layer) to avoid overfitting the model.
The shuffle parameter is False and random_state is zero (0) because the dataset has
already been shuffled.</p>
        <p>For training, the maximum number of iterations (max_iter) is raised to 1500 from
its default of 200. The tolerance for the optimization (tol) is changed to 0.012 from
its default of 1e-4 (0.0001), and the remaining parameters keep their defaults.</p>
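        <p>The parameter changes above can be sketched as follows, assuming word-count features and an invented toy dataset. Passing hidden_layer_sizes=(21,) is an assumption about how the value 21 was supplied; in scikit-learn it yields a single hidden layer with 21 units.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Toy data (illustrative only): 12 short messages, 6 per class.
texts = ["kal milte hain", "see you tomorrow", "ok theek hai", "thanks a lot",
         "kya haal hai", "call me later", "main aa raha hoon", "good night",
         "acha theek", "on my way", "kahan ho tum", "talk to you soon"]
labels = ["female", "male"] * 6

X = CountVectorizer().fit_transform(texts)

# One hidden layer of 21 units, more iterations, and a looser tolerance.
mlp = MLPClassifier(
    hidden_layer_sizes=(21,),
    max_iter=1500,
    tol=0.012,
    shuffle=False,
    random_state=0,
)
mlp.fit(X, labels)

# n_layers_ counts the input, hidden, and output layers.
print(mlp.n_layers_, mlp.coefs_[0].shape)
```
        </preformat>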
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Result and Analysis</title>
      <p>Table 1 shows the five-fold cross-validation results for each of the classifiers
mentioned above. The training data are divided into 80 percent for training and 20
percent for testing. To achieve higher accuracy, or a result close to the real one,
the system runs five-fold cross-validation on the dataset and then computes the mean
over the folds. The accuracy metric is used to judge how well the model generalizes
on this text data. The ensemble model takes a vote over all of the above classifiers.
Ensemble voting has two types, hard and soft; the default, hard voting, is applied.
The system predicts the gender or age label as the result label, which is the label
predicted most frequently by the four classifiers. Figure 2 shows the proposed
ensemble model.</p>
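      <p>Hard voting as described above amounts to picking the most frequent label among the classifiers' predictions. The following sketch illustrates the idea in plain Python; the function name hard_vote and the tie-breaking rule are assumptions for illustration and may differ from what scikit-learn does internally.</p>
      <preformat>
```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties are broken in favour of the label that appears first in
    `predictions` (an assumption; other tie-breaking rules are possible).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical predictions of the four classifiers for one author.
votes = {"lr": "male", "nb": "female", "gb": "male", "mlp": "male"}
print(hard_vote(list(votes.values())))  # male
```
      </preformat>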
      <p>[Table 1: five-fold cross-validation results per classifier. Rows: Logistic Regression CV, Naïve Bayes, Gradient Boosting, Multi-layer Perceptron, Ensemble, Result of MAPonSMS, Baseline. The cell values were not preserved.]</p>
      <sec id="sec-4-6">
        <title>Result of MAPonSMS: Baseline:</title>
        <p>In addition, table 2 reports the mean square error (MSE) of each model. The
smallest MSE indicates the best fit to the data points. The best MSE for gender is
0.15, obtained with the Gradient Boosting classifier and the Logistic Regression CV
classifier, and the best MSE for age is 0.49, obtained with the ensemble model. The
worst MSE is 0.25 for gender, with the Naïve Bayes classifier, and 0.62 for age, with
the Logistic Regression CV classifier.</p>
        <p>In this task, the model describes text features with predictive power, and it can
be extended across online social media. The task aims to help companies provide
better services. Many classifiers were tried to predict gender as a binary
classification and age as a multi-class classification. Finally, the system applies
the ensemble model, and this machine learning technique predicts with 60 percent
accuracy for the age category and 83 percent accuracy for the gender category on the
hidden text files. The training results are shown in detail in table 1.</p>
        <p>This task shows that the ensemble model leads to better results. For future
work, the researcher needs to work on text prediction that achieves higher accuracy
than using pre-trained machine learning models. In addition, by improving the
ensemble model, the system could offer real-time classification on smartphones or
websites.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.
          <article-title>Discriminative subprofile-specific representations for author profiling in social media</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          , Vol
          <volume>89</volume>
          , Pages
          <fpage>134</fpage>
          -
          <lpage>147</lpage>
          , ISSN 0950-7051, doi:10.1016/j.knosys.2015.06.024 (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Farnadi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sitaraman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sushmita</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          et al.
          <source>User Model User-Adapt Inter</source>
          <volume>26</volume>
          :
          <fpage>109</fpage>
          . doi:10.1007/s11257-016-9171-0 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Marquardt</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          , et al.
          <article-title>Age and Gender Identification in Social Media</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>1180</volume>
          , pp.
          <fpage>1129</fpage>
          -
          <lpage>1136</lpage>
          , doi:10.1145/1871985.1871993 (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. FIRE'18 MAPonSMS, https://lahore.comsats.edu.pk/cs/MAPonSMS/index.html, Accessed 26 Aug.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Ensemble Model - scikit-learn, http://scikit-learn.org/stable/modules/ensemble.html, Accessed 26 Aug.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Random k-labelsets: An ensemble method for multilabel classification</article-title>
          .
          <source>In Machine Learning: ECML 2007</source>
          , Springer, pp
          <fpage>406</fpage>
          -
          <lpage>417</lpage>
          . doi:10.1007/978-3-540-74958-5_38 (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Murray</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Durrell</surname>
          </string-name>
          .
          <article-title>Inferring demographic attributes of anonymous internet users</article-title>
          .
          <source>In Web Usage Analysis and User Profiling</source>
          , Springer, pp
          <fpage>7</fpage>
          -
          <lpage>20</lpage>
          . (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mislove</surname>
            ,
            <given-names>Alan</given-names>
          </string-name>
          , et al.
          <article-title>You Are Who You Know: Inferring User Profiles in Online Social Networks</article-title>
          .
          <source>Proceedings of the Third ACM International Conference on Web Search and Data Mining</source>
          , pp
          <fpage>251</fpage>
          -
          <lpage>260</lpage>
          , doi: 10.1145/1718487.1718519 (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>Delip</given-names>
          </string-name>
          , et al.
          <article-title>Classifying Latent User Attributes in Twitter</article-title>
          .
          <source>Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents</source>
          , pp
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>James</given-names>
          </string-name>
          .
          <article-title>Gender Prediction in Social Media</article-title>
          . (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Burger</surname>
            ,
            <given-names>John D.</given-names>
          </string-name>
          , et al.
          <article-title>Discriminating Gender on Twitter</article-title>
          .
          <source>EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference</source>
          , pp.
          <fpage>1301</fpage>
          -
          <lpage>1309</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>López-Monroy</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villatoro-Tello</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <article-title>INAOE's participation at PAN'13: Author profiling task</article-title>
          . (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Marquardt</surname>
            ,
            <given-names>James F.</given-names>
          </string-name>
          , et al.
          <article-title>Age and Gender Identification in Social Media</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>1180</volume>
          , pp.
          <fpage>1129</fpage>
          -
          <lpage>1136</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>B.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          , et al.
          <article-title>Classifying latent user attributes in twitter</article-title>
          .
          <source>In Proceedings of the 2nd international workshop on Search and mining user generated contents, ACM</source>
          , pp
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          . (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>