Gender and Age Prediction Multilingual Author Profiles
                 Based on Comments

                                        Ali Nemati1

                          University of Washington Tacoma, USA
            11900 Commerce St, Tacoma, WA 98402, United States of America

                                    anemati@uw.udu


       Abstract. Recently, several approaches have been presented to automatically
       classify users’ age and gender across multiple languages based on documents,
       texts, and comments on the web or social media status updates. The purpose of
       this task is to detect information such as age and gender from multilingual
       (Roman Urdu and English) author profiles based on texts or documents. Using
       four machine learning techniques, my system derives an ensemble model for the
       age and gender categories. The ensemble model is composed of a Multinomial
       Naive Bayes classifier, a Gradient Boosting classifier, a Logistic Regression
       CV classifier and a Multi-Layer Perceptron classifier. The system automatically
       categorizes text sources by age and gender on unseen test data. The accuracy is
       83 percent for the gender category, 60 percent for the age category, and 49
       percent for the joint age and gender category.
           Keywords: Age and gender prediction, Multilingual author Profile, Social
       media, Text analysis, Data mining


1      Introduction

Author profiling aims to classify the age and gender of an author by extracting
features from texts, documents and comments on the web or social media status updates
[1]. Recently, many researchers have investigated multilingual author comments to
detect as much important information as possible, such as the gender and age of an
author. For example, businesses gather their customers’ age and gender in order to
provide better services in the future [2].
   Furthermore, identifying the gender and age of customers from the style of their
comments on social media helps companies recognize who their customers are, so that
they can make decisions to improve their services [3]. For developing and evaluating
the automatic author profiling system, the training dataset combines 350 separate text
files. The training dataset contains documents accumulated from social media such as
Facebook, Twitter and other social media websites, and the authors’ comments are
multilingual, written in English and Roman Urdu.
   The dataset was collected from smartphones, with messages typed on QWERTY
keyboards, and is publicly available on the “Fire’18 MAPonSMS” website [4]. A true CSV
file has been released with 350 records giving the age and gender that correspond to
each text file. An ensemble model [5][6], which is a combination of four classifiers,
is used in this study: a Logistic Regression CV classifier, a Naïve Bayes classifier,
a Multi-Layer Perceptron (MLP) classifier and a Gradient Boosting classifier. The goal
of this task is to implement a system that recognizes users’ information on social
media. The system is trained on authors’ Short Message Service (SMS) messages or
documents. On unseen test data, the accuracy is 83 percent for the gender class, 60
percent for the age class and 49 percent for the joint age and gender class. The main
findings are:

1. Even though the dataset is very small, the system achieves better accuracy than the
   baseline.
2. The results improved when the ensemble model was used, because no single model
   reached the same accuracy on this text data.

The Python application can be downloaded from https://goo.gl/D37Qii .


2      Related Works

As already mentioned, author profile identification is used in several areas such as
psychology and natural language processing. In more recent studies, interest in data
mining has grown, and several papers have explored age and gender prediction from
information collected over social media [7-10]. Gender identification was done by
Burger and Henderson in 2006 [11]. Pastor López-Monroy and his colleagues proposed a
new document representation for detecting gender and age over social media in 2015.
Furthermore, López-Monroy et al. presented a document representation for author
profiling in 2013 [12].
    Marquardt et al. published a paper on age and gender identification in social
media at the University of Washington [13]. Similar work has been done on predicting
attributes such as gender and age from a smaller dataset consisting of social media
comments on Twitter [14].
    These works used datasets that differ from the one used in this paper. All of
these prior works on age and gender prediction aimed to use an ensemble model to
obtain higher performance.


3      Dataset Description and Preprocessing

The dataset has a gender class and an age class. Gender is a binary classification
with the labels male and female. Age is a multi-class classification based on age
groups: 15-19, 20-24 and 25-xx. The ensemble model is chosen to reduce training time
and to obtain a result that is highly accurate, or at least as close as possible to
the real outcome.
   Figure 1 shows the distribution of age and gender in the true CSV file that is
released with the training data. It illustrates that 40 percent of the records are
female and 60 percent are male. 50.28 percent of the authors are aged 20-24, and
30.85 percent are aged 15-19. The rest of the records are 25 years old or above.
   The baseline result for this dataset is 60 percent for gender (male) and 51 percent
for age (20-24).
   [Figure 1: bar chart of the number of records (y-axis, 0 to 120) by age group and
   gender. 15-19: female 38, male 70; 20-24: female 64, male 112; 25-xx: female 38,
   male 28.]


                            Fig. 1. Age and Gender on the true file
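
The percentages and the majority-class baseline quoted in this section can be
re-derived from the counts in Fig. 1. The following is a minimal sketch of that
arithmetic; the variable names are illustrative and not taken from the original
application.

```python
# Record counts read off Fig. 1 (age group -> gender -> count).
counts = {
    "15-19": {"female": 38, "male": 70},
    "20-24": {"female": 64, "male": 112},
    "25-xx": {"female": 38, "male": 28},
}

total = sum(sum(by_gender.values()) for by_gender in counts.values())

# Share of each gender across all age groups.
gender_share = {
    g: sum(by_gender[g] for by_gender in counts.values()) / total
    for g in ("female", "male")
}

# Share of each age group across both genders.
age_share = {age: sum(by_gender.values()) / total for age, by_gender in counts.items()}

# A majority-class baseline always predicts the most frequent label.
gender_baseline = max(gender_share.values())
age_baseline = max(age_share.values())

print(total, gender_share, age_share)
print(gender_baseline, age_baseline)
```

The gender baseline comes out at 60 percent (always predicting male). The age
baseline is the share of the 20-24 group, about 50.3 percent, which the text quotes as
51 percent.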


4      Methodology

This system detects the age and gender classes of author profiles. The Scikit-learn
package, a widely used free machine learning library, is applied for text processing.
The ensemble model is a supervised learning model built with Scikit-learn.
   First, the application receives the training file, which has two columns, age and
gender. A third column, called transcripts_test, is created to accumulate all comments
associated with each author. Next, the dataset is split into 280 training instances
(80%) and 70 test instances (20%). The dataset is shuffled to reduce variance and
avoid overfitting.
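
The shuffled 80/20 split described above could be written with Scikit-learn's
train_test_split. This is a sketch under assumptions: the placeholder messages and
labels are hypothetical stand-ins for the 350 records, and the fixed random_state is
there only to make the example repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 350 records; the real messages are not reproduced here.
transcripts_test = [f"sample message {i}" for i in range(350)]
gender = ["male" if i % 5 < 3 else "female" for i in range(350)]

# shuffle=True is the default; random_state is fixed only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    transcripts_test, gender, test_size=0.2, shuffle=True, random_state=0
)

print(len(X_train), len(X_test))
```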
   Afterward, to convert the transcripts_test column into a vector of integer counts,
the system passes transcripts_test to CountVectorizer with all parameters left at
their defaults [7]. Finally, the application calls the ensemble model discussed in
this paper, and fits and predicts the results with five-fold cross-validation. Many
models could be applied to this task, but the chosen models are well suited to text
classification and to binary and Boolean features. The application runs on Google
Colaboratory (Colab), which offers free CPU cloud services with TensorFlow. The Colab
instance provides a 33 GB hard disk, 13 GB of RAM and a 2-core Xeon 2.2 GHz CPU.
Python 3.6 is used for running and testing the system.
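
To illustrate the vectorization step, the sketch below applies CountVectorizer with
default parameters to three short messages. The messages are invented for the example
and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented mixed English / Roman Urdu messages, used only for illustration.
transcripts_test = [
    "kal milte hain see you",
    "see you tomorrow",
    "aap kahan hain",
]

vectorizer = CountVectorizer()  # all parameters left at their defaults
X = vectorizer.fit_transform(transcripts_test)

# Columns follow the alphabetically sorted vocabulary; each cell is a word count.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

With the defaults, text is lowercased and only tokens of two or more characters are
kept, so each row is a vector of integer word counts over the learned vocabulary.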
   The system works with an ensemble model that combines four classifiers with high
accuracy. All four models are joined into one model, which collects a vote from each
of them. As a result, the ensemble model achieves high accuracy, or comes close to
the real result, under five-fold cross-validation. The classifiers in the ensemble
model are the Logistic Regression CV classifier, the Naïve Bayes classifier, the
Multi-Layer Perceptron (MLP) classifier and the Gradient Boosting classifier. These
four machine learning classifiers are described in the sections below.
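
One way to assemble such an ensemble in Scikit-learn is with VotingClassifier. The
sketch below is not the original application: the corpus is synthetic, the
hyperparameters are the ones reported in Sections 4.1 to 4.4, and two assumptions are
made (the solver is taken to be 'liblinear', and hidden_layer_sizes is taken to be a
single layer of 21 units). For brevity, all four classifiers share plain word counts.

```python
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Synthetic, clearly separable toy corpus (hypothetical; not the MAPonSMS data).
texts = [f"alpha beta msg{i}" for i in range(20)] + [f"gamma delta msg{i}" for i in range(20)]
labels = ["female"] * 20 + ["male"] * 20

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegressionCV(solver="liblinear")),
        ("nb", MultinomialNB(alpha=0.13)),
        ("gb", GradientBoostingClassifier(learning_rate=0.2, max_depth=2)),
        # Assumption: "21" is read as one hidden layer of 21 units.
        ("mlp", MLPClassifier(hidden_layer_sizes=(21,), max_iter=1500, tol=0.012,
                              shuffle=False, random_state=0)),
    ],
    voting="hard",  # the Scikit-learn default: majority vote over predicted labels
)

# Vectorize inside the pipeline so each fold learns its own vocabulary.
model = make_pipeline(CountVectorizer(), ensemble)
scores = cross_val_score(model, texts, labels, cv=5)
print(scores.mean())
```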


4.1    Logistic Regression CV Classifier

The system uses the Logistic Regression CV classifier as one of its models, with
Python 3.6. Logistic regression is a machine learning method suited to the analysis of
high-dimensional information and text datasets.
   It uses the logistic sigmoid function to produce its result for text sources.
Different parameters were tried. In the end, the solver was changed from the default
‘lbfgs’ to ‘liblinear’ because it is the appropriate solver for a small dataset. The
other parameters are left at their defaults. With five-fold cross-validation, the
result is 84.27 percent for the gender class and 64.32 percent for the age class.
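
A minimal sketch of this configuration follows, including the mean over five folds
used throughout the paper. The toy corpus is synthetic, and the solver name
'liblinear' is an assumption about the setting described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic toy corpus (hypothetical; not the MAPonSMS data).
texts = [f"alpha beta msg{i}" for i in range(20)] + [f"gamma delta msg{i}" for i in range(20)]
labels = ["female"] * 20 + ["male"] * 20

# Only the solver is changed; every other parameter keeps its default.
lr = LogisticRegressionCV(solver="liblinear")
clf = make_pipeline(CountVectorizer(), lr)

# One accuracy value per fold; the reported figure is their mean.
scores = cross_val_score(clf, texts, labels, cv=5)
print(f"mean 5-fold accuracy: {scores.mean():.2f}")
```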


4.2    Naïve Bayes Classifier

The Naïve Bayes classifier is an excellent machine learning technique for text
categorization. The model is a very fast method used in real-world applications such
as spam filtering, document categorization and text classification, as in our task.
The Naive Bayes classifier has three variants: multinomial, Gaussian and Bernoulli.
   The Multinomial Naive Bayes (Multinomial NB) classifier is chosen because it works
with features such as word counts, which suit our task [8]. This model takes word
counts as its numerical statistic, and the system weights these counts with term
frequency-inverse document frequency (tf-idf).
   The tf-idf weight reflects how important a word is to a document relative to the
whole dataset. Multinomial NB is used with alpha equal to 0.13, and the rest of the
parameters are left at their defaults. The result is 86.08 percent accuracy for gender
and 65.01 percent accuracy for age.
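
The count, tf-idf and Multinomial NB steps can be chained in a Scikit-learn pipeline.
The sketch below assumes this chaining and uses invented toy texts; only alpha=0.13
comes from the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy texts (hypothetical; not the MAPonSMS data).
texts = [
    "alpha beta", "alpha beta beta", "beta alpha alpha", "alpha",
    "gamma delta", "gamma gamma delta", "delta", "delta gamma delta",
]
labels = ["female"] * 4 + ["male"] * 4

# Word counts -> tf-idf weights -> Multinomial NB with the reported smoothing value.
nb = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB(alpha=0.13))
nb.fit(texts, labels)

# Words never seen in training ("unseen") are ignored by the vectorizer.
pred = nb.predict(["alpha beta unseen"])
print(pred[0])
```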


4.3    Gradient Boosting Classifier
To reach high accuracy on the text source, the Gradient Boosting Classifier is used
with the following parameters modified. First, the learning rate is raised from its
default of 0.1 to 0.2. Then, the max_depth parameter is tuned from its default of 3 to
2 to achieve better performance. Lastly, random_state is set to false because the
dataset has already been shuffled. The other parameters are unchanged. The results are
84.29 percent for gender and 64.89 percent for age prediction.
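
The reported settings map onto Scikit-learn's GradientBoostingClassifier as sketched
below. The toy texts are invented, and random_state=0 stands in for the "false" value
mentioned above, which is an assumption.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy texts (hypothetical; not the MAPonSMS data).
texts = [
    "alpha beta", "alpha beta beta", "beta alpha alpha", "alpha",
    "gamma delta", "gamma gamma delta", "delta", "delta gamma delta",
]
labels = ["female"] * 4 + ["male"] * 4

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# learning_rate and max_depth follow the text; random_state=0 is an assumption.
gb = GradientBoostingClassifier(learning_rate=0.2, max_depth=2, random_state=0)
gb.fit(X, labels)

pred = gb.predict(vectorizer.transform(["gamma delta"]))
print(pred[0])
```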


4.4    Multi-Layer Perceptron Network Classifier (MLP Classifier)

The Multi-Layer Perceptron Network Classifier (MLP Classifier) is derived from the
feed-forward artificial neural network and uses backpropagation for training. An
accuracy of 86.76 percent for the gender category and 66.54 percent for the age
category has been obtained. Achieving this accuracy requires modifying the following
parameters. The hidden_layer_sizes parameter is changed to 21 to avoid overfitting the
model. The shuffle parameter is false and random_state is zero (0) because the dataset
has already been shuffled.
   For training, the maximum number of iterations (max_iter) is raised from its
default of 200 to 1500. The tolerance for the optimization (tol) is changed from its
default of 1e-4 (0.0001) to 0.012, and the remaining parameters are left at their
defaults.
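
These settings correspond to the following MLPClassifier configuration. The sketch
assumes that the value 21 denotes a single hidden layer of 21 units, which is how
Scikit-learn interprets a scalar hidden_layer_sizes; the toy texts are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Invented toy texts (hypothetical; not the MAPonSMS data).
texts = [
    "alpha beta", "alpha beta beta", "beta alpha alpha", "alpha",
    "gamma delta", "gamma gamma delta", "delta", "delta gamma delta",
]
labels = ["female"] * 4 + ["male"] * 4

X = CountVectorizer().fit_transform(texts)

mlp = MLPClassifier(
    hidden_layer_sizes=(21,),  # assumption: one hidden layer of 21 units
    shuffle=False,             # the data is shuffled before training
    random_state=0,
    max_iter=1500,             # default is 200
    tol=0.012,                 # default is 1e-4
)
mlp.fit(X, labels)

# n_layers_ counts the input, hidden and output layers; n_iter_ is the epochs run.
print(mlp.n_layers_, mlp.n_iter_)
```

A tolerance this large can stop training well before max_iter is reached, so the
number of epochs actually run is worth checking on the real data.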


5      Result and Analysis

Table 1 shows the five-fold cross-validation accuracy of each classifier mentioned
above. The training data is divided into 80 percent for training and 20 percent for
testing. To obtain an accuracy estimate close to the real result, the system runs
five-fold cross-validation on the dataset and then computes the mean over the folds,
so that the accuracy metric generalizes for this text data. The ensemble model takes
a vote over all the above classifiers. Ensemble voting has two types, hard and soft.
The default, hard voting, is applied. The system predicts the age or gender label as
the result label, which is the most frequent label among the four classifiers.
Figure 2 shows the proposed ensemble model.

                                  Fig. 2.   Ensemble model
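
The hard-voting rule described in this section can be illustrated in a few lines of
plain Python. The per-classifier predictions below are invented for the example.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Invented age predictions from the four classifiers for one author.
votes = {"lr": "20-24", "nb": "15-19", "gb": "20-24", "mlp": "20-24"}
winner = hard_vote(votes.values())
print(winner)
```

In this sketch a tie is resolved in favour of the label seen first, whereas
Scikit-learn's VotingClassifier resolves ties by the ascending sort order of the class
labels.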

   Table 1 reports the mean five-fold cross-validation accuracy for the 350 records of
training and testing data. The best accuracy is 88.54 percent for gender and 67.18
percent for age, obtained with the MLP classifier.
   The worst accuracy for the gender and age categories is obtained with the Logistic
Regression CV and Gradient Boosting classifiers, at 59 percent for age and 84 percent
for gender.
   Moreover, the dataset of 150 hidden text files on the MAPonSMS website does not
disclose age and gender. The system applies the ensemble model to predict age and
gender for this hidden dataset. The results show 83 percent accuracy for the gender
category, 60 percent for the age category and 49 percent for joint age and gender.

          Table 1.   5-fold Cross Validation Accuracy of training and testing dataset

    Classifier                       5-fold CV Acc. Age (%)    5-fold CV Acc. Gender (%)
    Logistic Regression CV                     59                         84
    Naïve Bayes                                62                         86
    Gradient Boosting                          59                         84
    Multi-layer Perceptron                     65                         86
    Ensemble                                  67.18                      88.54
    Result on MAPonSMS                         60                         83
    Baseline                                   51                         60

   In addition, Table 2 gives the mean square error (MSE) of each model. The smallest
MSE indicates the best fit to the data points. The best MSE for gender is 0.15,
obtained with the Gradient Boosting classifier and the Logistic Regression CV
classifier. The best MSE for age is 0.49, obtained with the ensemble model. The worst
MSE is 0.25 for gender, with the Naïve Bayes classifier, and 0.62 for age, with the
Logistic Regression CV classifier.

                             Table 2. Mean Square Error (MSE)

                 Classifier                    MSE Age     MSE Gender
                 Logistic Regression CV        0.62        0.15
                 Naïve Bayes                   0.54        0.25
                 Gradient Boosting             0.50        0.15
                 Multi-layer Perceptron        0.50        0.17
                 Ensemble                      0.49        0.17
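
The paper does not state how class labels were mapped to numbers for the MSE. The
sketch below assumes an integer label encoding and uses invented predictions. Under
that assumption, the MSE of a binary label such as gender equals the fraction of
misclassified records.

```python
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

# Invented true and predicted gender labels, for illustration only.
y_true = ["male", "female", "male", "male", "female"]
y_pred = ["male", "male", "male", "female", "female"]

# Assumption: labels are encoded as integers (female -> 0, male -> 1).
encoder = LabelEncoder().fit(y_true)
mse = mean_squared_error(encoder.transform(y_true), encoder.transform(y_pred))

# For 0/1 labels each error contributes exactly 1, so MSE equals the error rate.
error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
print(mse, error_rate)
```

For the three age groups, the value depends on the chosen encoding, because
confusing 15-19 with 25-xx would count four times as much as confusing adjacent
groups.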


6      Conclusions and Future Work

In this task, the model has described text features with predictive power. It can be
extended across online social media. The task aims to help companies provide better
services. Many classifiers have been tried to predict gender as a binary
classification and age as a multi-class classification. Finally, this system applies
the ensemble model, and this machine learning technique predicts with 60 percent
accuracy for the age category and 83 percent accuracy for the gender category on the
hidden text files. The training results are shown in detail in Table 1.
   This task shows that the ensemble model leads to better results. In future work,
the researcher needs to work on text prediction to achieve higher accuracy than with
pre-trained machine learning models. In addition, by improving the ensemble model, the
system could offer real-time classification on smartphones or websites.

References
 1. López-Monroy, A. P., Montes-y-Gómez, et al. Discriminative subprofile-specific
    representations for author profiling in social media, Knowledge-Based Systems, Vol 89,
    Pages 134-147, ISSN 0950-7051, doi:10.1016/j.knosys.2015.06.024 (2015).
 2. Farnadi, G., Sitaraman, G., Sushmita, S. et al. User Model User-Adapt Inter 26:109.
    doi:10.1007/s11257-016-9171-0 (2016).
 3. Marquardt, J, et al. Age and Gender Identification in Social Media. CEUR Workshop Pro-
    ceedings, vol. 1180 , pp. 1129–1136, doi:10.1145/1871985.1871993 (2014).
 4. FIRE'18 MAPonSMS, https://lahore.comsats.edu.pk/cs/MAPonSMS/index.html, Accessed
    26 Aug. 2018.
 5. Ensemble Model — scikit-learn, http://scikit-learn.org/stable/modules/ensemble.html, Ac-
    cessed 26 Aug. 2018.
 6. Tsoumakas, G., Vlahavas, I. Random k-labelsets: An ensemble method for multilabel clas-
    sification. In Machine Learning: ECML 2007, Springer, pp 406–417. doi:10.1007/978-3-
    540-74958-5_38 (2007)
 7. D. Murray and K. Durrell. Inferring demographic attributes of anonymous internet users.
    In Web Usage Analysis and User Profiling, Springer, pp 7–20. (2000).
 8. Mislove, Alan, et al. You Are Who You Know: Inferring User Profiles in Online Social
    Networks. Proceedings of the Third ACM International Conference on Web Search and
    Data Mining, pp 251–260 , doi: 10.1145/1718487.1718519 (2010).
 9. Rao, Delip, et al. Classifying Latent User Attributes in Twitter. Proceedings of the 2nd In-
    ternational Workshop on Search and Mining User-Generated Contents , pp 37–44 (2010).
10. Smith, James. Gender Prediction in Social Media. (2014).
11. Burger, John D., et al. Discriminating Gender on Twitter. EMNLP 2011 - Conference on
    Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp.
    1301–1309 (2011).
12. López-Monroy, A. P., Montes-y-Gómez, M., Escalante, H. J., Villaseñor-Pineda, L.,
    Villatoro-Tello, E. INAOE’s participation at PAN’13: Author profiling task. (2013).
13. Marquardt, James F, et al. Age and Gender Identification in Social Media. CEUR Work-
    shop Proceedings, vol. 1180, pp. 1129–1136 (2014).
14. B. Rao, D., et al. Classifying latent user attributes in twitter. In Proceedings of the 2nd in-
    ternational workshop on Search and mining user generated contents, ACM, pp 37–44.
    (2010).