Gender, Age, and Dialect Recognition using Tweets in a Deep Learning Framework - Notebook for FIRE 2019

Chanchal Suman1, Purushottam Kumar2, Sriparna Saha1, and Pushpak Bhattacharyya1
1 Dept. of Computer Science & Engineering, Indian Institute of Technology Patna, Patna, India
2 Dept. of Computer Science & Engineering, National Institute of Technology Durgapur, Durgapur, India
{1821cs11, sriparna, pb}@iitp.ac.in, pk.17u10625@btech.nitdgp.ac.in

Abstract. Social media sites are a rich source of user-generated text that can be used to identify different aspects of its authors. Age, gender, dialect, and region are aspects of an author that can be identified by properly mining this content. Such profiling provides a way of characterizing anonymous users: recognizing the profile of an anonymous user helps in indirectly recognizing the user's identity. In this notebook, we describe the working of our author profiling software submitted to FIRE 2019, which recognizes the gender, age, and dialect of Twitter users in the Arabic language. We have used a Long Short Term Memory (LSTM) neural network together with some hand-crafted features for recognizing the age, gender, and dialect of the author of a tweet.3

Keywords: Dialects, LSTM, AraVec, emoji, emoticons

3 Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.

1 Introduction

Author profiling is the technique of identifying different aspects of an author from a given text. It differentiates among different classes of authors by studying their writing style and the words used in their text. It also shows how behavioral cues can be used to recognize different classes of authors, and how writing style and language use vary from author to author. Textual information, through different features and styles, helps identify an author's profile along aspects such as gender, age, and dialect [7]. The focus of this task is to identify the gender, age, and dialect variety of Arabic Twitter users. This information about an anonymous text can be used for detecting criminals in cyber-forensics. During an investigation, it is very difficult to identify the real culprit; this type of analysis helps the authorities draw conclusions about the traits of the culprit and narrow down the possible suspects. Gender is an important attribute of a user: if it is detected correctly, it is very helpful in selecting the possible suspects, and the same holds for age and language variety. These applications motivated us to carry out research in this area.

Over the years, many researchers have studied the topic of author profiling in text. The text used in these studies is taken from different sources, for example, blogs, hotel reviews, and tweets. In traditional methods for the author profiling task, researchers mainly use features such as words, word classes, and part-of-speech (POS) n-grams to train their models. In our approach, all the tweets are plain text from which several features can be extracted, so we use an LSTM model combined with some hand-crafted features to recognize the age, gender, and dialect of the author of a tweet. We performed experiments on the LSTM model with and without handcrafted features.
The LSTM-based system with handcrafted features achieved higher accuracy than the plain LSTM-based system. Thus, the results indicate that style-based features play a crucial role in recognizing author profiles.

2 Related Work

Author profiling has attracted researchers and has been the subject of several shared-task competitions. Researchers have studied the dependence between linguistic features and the profile of the author; this dependency is a subject of interest in different areas such as linguistics, psychology, and natural language processing [6]. Researchers have used syntactic, lexical, and structural features for recognizing the gender and the age group of authors, and a decision tree for identifying the author profile. This research methodology is also helpful for other applications such as security, criminal detection, and author detection [6]. Other researchers focused on the representation of the documents in order to improve the representation of tweets; they computed high-quality discriminative and descriptive features built on top of textual features (e.g., content words, function words, and punctuation marks) [1]. Some researchers used typed character n-grams, lexical features, and non-textual features (domain names) for the author profiling task [5]. Deep learning models have also been applied to directly learn the gender of blog authors, using a Window Recurrent Convolutional Neural Network [2]. Local deep neural networks have likewise been used for gender recognition [4].

3 Dataset Description

The dataset was provided by the APDA track organizers of FIRE 2019.4 It consists of five sets of Arabic tweet data, each containing 100 tweets from each of 450 users [3]. Since we are using a deep learning-based model, we needed a large set of tweets. We considered each tweet as a sample, thus obtaining 45,000 tweet samples divided into two classes: tweets written by male users were labeled as male, and similarly for female users. In the same way, 45,000 samples were created for age and for dialect. Finally, after obtaining the tweet-level results, the decision for a user is taken by majority voting. For example, a document contains the 100 tweets of one user; our method produces a prediction for each of these 100 tweets, and if, say, 40 of them are predicted as male and the remaining 60 as female, then the final label for that document is female. In this way, we prepared the data samples and aggregated the results.

4 https://www.autoritas.net/APDA/

These tweets are in the Arabic language and contain emoticons, emojis, hashtags, @mentions, and URLs. Since the provided training corpus consists of tweets in HTML format, we first extracted all the tweets from the HTML format into plain text and then applied preprocessing steps for cleaning the tweets. The preprocessing stage is useful as it reduces non-textual features to their semantic classes. We applied the following preprocessing steps before feature extraction:
– URLs: URLs are deleted.
– @mentions: @mentions are deleted.
– Emoticons: Emoticons provide useful style-based information, as they tell us about the view of a specific user. We capture their presence only: 1 if an emoticon is present, 0 otherwise.
Furthermore, we apply the following normalization:
– Punctuation marks: Punctuation marks are split from the adjacent word and their presence is captured separately.
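As an illustration, a minimal Python sketch of these cleaning steps is given below; the regular expressions, the emoticon set, and the function name clean_tweet are our own illustrative assumptions, not the exact code of the submitted system.

```python
import re

# Hedged sketch of the cleaning steps listed above: delete URLs and
# @mentions, record emoticon presence as a 0/1 flag, and split
# punctuation marks away from adjacent words.

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Small, non-exhaustive set of common ASCII emoticons (assumption).
EMOTICON_RE = re.compile(r"[:;=]-?[()DPp]")
# Latin punctuation plus the Arabic comma and question mark.
PUNCT_RE = re.compile(r"([!?.,؟،])")

def clean_tweet(text):
    """Return (cleaned text, emoticon-presence flag) for one tweet."""
    has_emoticon = 1 if EMOTICON_RE.search(text) else 0
    text = URL_RE.sub(" ", text)        # URLs are deleted
    text = MENTION_RE.sub(" ", text)    # @mentions are deleted
    text = EMOTICON_RE.sub(" ", text)   # only the presence flag is kept
    text = PUNCT_RE.sub(r" \1 ", text)  # split punctuation from words
    return re.sub(r"\s+", " ", text).strip(), has_emoticon

# Example (hypothetical tweet):
# clean_tweet("مرحبا @user شاهد https://t.co/xyz :)")
# -> ("مرحبا شاهد", 1)
```

The 0/1 emoticon flag can then be carried alongside the cleaned text as a style indicator, while the cleaned text itself is what is later tokenized and embedded.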
Stopwords and punctuation marks were also removed from the tweets.

4 Evaluation Framework

In this section, we discuss the proposed architecture, the performance measure, and the results.

4.1 Proposed Architecture

We applied a deep learning model to the author profiling task. We used a Long Short Term Memory (LSTM) network and treated the task as a tweet-level classification problem.

Model-I: Long Short Term Memory networks (LSTMs) are designed to learn long-term dependencies in text-based problems; remembering information for long periods is their default behavior.5 In our LSTM model, we first create a word embedding matrix, a tokenizer, and a vocabulary for our training corpus in order to handle unknown words. We shuffle the training corpus to make training more effective. We introduce three layers with different activation functions (ReLU, sigmoid), train the model on the training corpus, and test it on the test corpus. We use binary cross-entropy as the loss function for binary decisions and Adam as the optimizer when compiling the model. The structure of the model is shown in Fig. 1a.

5 https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Fig. 1: Proposed architectures: (a) Model-I, (b) Model-II.

Model-II: We tried a simple variant of Model-I (Fig. 1a) by adding some extra hand-crafted features. We added the hand-crafted features to the developed model to check the performance of the system. The structure of the model is shown in Fig. 1b. Below are the additional handcrafted features:
– Emoji Count: the total number of emojis present in the tweet.
– Word Count: the total number of words present in the tweet.
– Sentiment Polarity: 1 for positive, -1 for negative, and 0 for neutral sentiment.
– Mean Word Length: the average length of the words present in the tweet, i.e., the ratio of the total length of the words to the total number of words in the tweet.
– Sentence Length: the length of the sentence in the tweet.
– Special Characters: the total count of punctuation marks present in the tweet.
– Unrepeated Words: the total number of words that appear only once in the tweet.
– URL Extractor: 1 if a URL is present in the tweet, otherwise 0.

4.2 Evaluation Framework

We used AraVec [8] to create vectors for Arabic words. Word embeddings are vector representations of words that satisfy the semantic properties of words [9]. We split the words of a tweet on whitespace and applied a Twitter-based embedding to obtain the word vectors. After the vectors are created, the tweets are passed through the LSTM layer, and a softmax layer is then applied to obtain the final class of the tweet. For Model-II, we extracted the handcrafted features from the tweet data and appended them to the output of the LSTM layer; the final result is then obtained from the softmax layer. The GitHub link for the proposed framework is given below.6

6 https://github.com/chanchalIITP/Author-Profling-FIRE-2019

4.3 Results

The performance of the system is evaluated based on the accuracy of the developed system. The accuracy of a system is the ratio of the number of instances predicted correctly to the total number of instances present in the data:

Accuracy = \frac{P}{P + Q}    (1)

where P is the number of instances predicted correctly and Q is the number of instances predicted incorrectly.
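To make the architecture of Section 4.1 concrete, the following is a minimal Keras sketch of Model-II under stated assumptions: an embedding layer (which could be initialized with AraVec vectors), an LSTM layer, concatenation of the eight handcrafted features with the LSTM output, and dense layers with ReLU and sigmoid activations, compiled with binary cross-entropy and Adam as described above. The layer sizes, sequence length, and variable names are illustrative assumptions, not the exact configuration of the submitted system.

```python
# Hedged Keras sketch of Model-II: LSTM over embedded tweet tokens,
# concatenated with the eight handcrafted features, followed by dense
# layers.  Sizes and names are illustrative assumptions.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

MAX_LEN = 50         # maximum tweet length in tokens (assumption)
VOCAB_SIZE = 50000   # tokenizer vocabulary size (assumption)
EMB_DIM = 300        # word-vector dimension, e.g. from AraVec (assumption)
N_FEATURES = 8       # emoji count, word count, sentiment polarity, mean word
                     # length, sentence length, special characters,
                     # unrepeated words, URL flag

# Tweet branch: token indices -> embeddings -> LSTM
tweet_in = Input(shape=(MAX_LEN,), name="tweet_tokens")
embedded = Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(tweet_in)
lstm_out = LSTM(128)(embedded)

# Handcrafted-feature branch, appended to the LSTM output (Model-II)
feats_in = Input(shape=(N_FEATURES,), name="handcrafted_features")
merged = Concatenate()([lstm_out, feats_in])

hidden = Dense(64, activation="relu")(merged)
output = Dense(1, activation="sigmoid")(hidden)  # binary decision, e.g. gender

model = Model(inputs=[tweet_in, feats_in], outputs=output)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

For the multi-way age and dialect decisions, the final Dense layer would instead use a softmax over the class set with categorical cross-entropy, matching the softmax output layer mentioned in Section 4.2.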
Table 1: Accuracy of the developed system on training data
Data name   Gender   Age     Dialect
15          82.01    87.58   93.24
16          86.79    82.78   95.04
17          78.27    80.73   90.07
18          85.30    84.58   96.06
19          81.77    84.32   89.86

Table 2: Accuracy of the developed system on training data under cross-validation
Data name   Gender   Age     Variety
15          59.16    68.96   74.35
16          63.24    70.68   77.84
17          58.33    65.14   76.95
18          61.35    71.21   80.04
19          62.05    62.56   76.82

Since the accuracy of Model-I was lower than that of Model-II, we report only the performance of Model-II in Tables 1 and 2. Table 1 shows the accuracy achieved on the training data, and Table 2 shows the cross-validation accuracy on the training data. We performed 5-fold cross-validation on the training data to check the performance of the system and to reduce the generalization error; this was done because we did not have the labels of the test data before submission. On the training data, Model-II gives better results than Model-I, and the behavior is similar on the test data: Model-II achieved an accuracy of 66.25% for gender, 22.22% for age, and 80.28% for language variety, with a joint accuracy of 0.1083, while the joint accuracy of Model-I is 0.0722. Table 3 shows the results of the proposed systems on the test data.

Table 3: Performance of the proposed systems on test data
Model      Gender   Age     Variety   Joint
Model-I    57.64    27.50   55.14     0.0722
Model-II   66.25    22.22   80.28     0.1083

5 Conclusion and Future Work

In this work, we have presented the task of automatic classification of an author's gender, age, and dialect from their writing. This task has several potential applications, such as security, forensics, and marketing. We have developed an LSTM-based neural network model for recognizing the age, gender, and language variety of an author from his/her written text, and we have used some style-based features to improve the performance of the LSTM-based system. In the future, we will try to optimize the neural network architecture to enhance the efficiency of the system. We will also look into task-specific handcrafted features to improve the performance of the system, and we will work on using the properties of homonyms in our feature design.

References

1. Alvarez-Carmona, M.A., López-Monroy, A.P., Montes-y-Gómez, M., Villaseñor-Pineda, L., Jair-Escalante, H.: INAOE's participation at PAN'15: Author profiling task. Working Notes Papers of the CLEF (2015)
2. Bartle, A., Zheng, J.: Gender classification with deep learning. Technical report, The Stanford NLP Group (2015)
3. Rangel, F., Rosso, P., Charfi, A., Zaghouani, W., Ghanem, B., Sánchez-Junquera, J.: Overview of the track on author profiling and deception detection in Arabic. In: Mehta, P., Rosso, P., Majumder, P., Mitra, M. (eds.) Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings, CEUR-WS.org, Kolkata, India, December 12-15 (2019)
4. Mansanet, J., Albiol, A., Paredes, R.: Local deep neural networks for gender recognition. Pattern Recognition Letters 70, 80-86 (2016)
5. Markov, I., Gómez-Adorno, H., Sidorov, G., Gelbukh, A.F.: Adapting cross-genre author profiling to language and corpus. In: CLEF (Working Notes), pp. 947-955 (2016)
6. Patra, B.G., Banerjee, S., Das, D., Saikh, T., Bandyopadhyay, S.: Automatic author profiling based on linguistic and stylistic features.
Notebook for PAN at CLEF 1179 (2013)
7. Rangel Pardo, F.M., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, pp. 1-8 (2015)
8. Soliman, A.B., Eissa, K., El-Beltagy, S.R.: AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science 117, 256-265 (2017)
9. Word embedding — Wikipedia, the free encyclopedia (2019), https://en.wikipedia.org/wiki/Word_embedding [Online; accessed 16-March-2019]