Gender Prediction From Tweets With Convolutional
                     Neural Networks
                          Notebook for PAN at CLEF 2018

              Erhan Sezerer, Ozan Polatbilek, Özge Sevgili, and Selma Tekir

                                   Izmir Institute of Technology
                {erhansezerer, ozanpolatbilek, ozgesevgili, selmatekir}@iyte.edu.tr


         Abstract This paper presents a system1 developed for the author profiling task of
         PAN at CLEF 2018 . The system utilizes style-based features to predict the gender
         information from the given tweets of each user. These features are automatically
         extracted by Convolutional Neural Networks (CNN). The system mainly depends
         on the idea that the informativeness of each tweet is not the same in terms of the
         gender of a user. Thus, the attention mechanism is included to the CNN outputs
         in order to discriminate the tweets carrying more information. Our architecture
         was able to obtain competitive results on three languages provided by the PAN
         2018 author profiling challenge with an average accuracy of 75.1% on local runs
         and 70.23% on the submission run.


1      Introduction

Author profiling is the characterization of an author through some key attributes such
as gender, age, and language. It’s an indispensable task especially in security, forensics,
and marketing. In the security world, predictive profiling is a measure for proactive
threat assessment. In forensics; profiling is used to support attribution for an incident,
while in marketing it helps to prepare targeted advertisements.
    In today’s social media-driven environment, automatic user profiling is not the same
as before because what the users write and share in social media provide a great data
source for the potential learning approaches. As a general rule, more data make classi-
fiers more accurate.
    In more technical terms, author profiling is defined as a classification task where
the aim is to predict the attribute of an author out of the given attribute classes. The
traditional machine learning process is followed to fulfill the task. Feature selection is
an important part of the process. Literature categorize the types of features that can be
used for authorship profiling as content-based features and style-based features. Evi-
dence proved that the most effective style-based features for gender discrimination are
determiners and prepositions (markers of male writing) and pronouns (markers of fe-
male writing). As for content-based features, words related to technology (male) and
words related to personal life or relationships (female) are proved to be most useful [1].
 1
     The implementation can be found at: https://github.com/Darg-Iztech/Gender_Classification
    The recent deep learning-based approaches take prominence in this area as they per-
form feature selection automatically. We tackled the problem in a similar way. The pro-
posed approach feeds the characters of a specific user’s tweets into the system, where
the system learns the embeddings character to character and it runs a Convolutional
Neural Network (CNN) for each individual tweet of the user. Then, CNN outputs are
combined and pass through an attention layer to form the user specific vector for pre-
diction.
    In this work, we aim to obtain style-based features from the tweets of users by using
CNNs. CNNs are known to be good at identifying the local patterns from the inputs [5].
They were originally designed to tackle the problems in vision tasks by identifying the
small objects or patterns in images [9], but later, they were introduced into NLP tasks
to extract the syntactic, local features from the text [4].
    In PAN 2018 [16] author profiling task [15], the profiling dimensions are determined
as gender and language, where the selected languages are English, Spanish, and Arabic
respectively. As for training data; in addition to text in the form of tweets, the user
shared images are provided as well. Thus, hybrid solutions that use both text and image-
based features are encouraged.
    Our system uses only text-based features. The basic characteristics of our approach
can be highlighted as follows:

    – The system learns on a user basis iteratively.
    – The input is in the form of characters.
    – A CNN per-tweet is constructed to identify local tweet-wide indicators in larger
      user profile vector.
    – An attention layer is used to combine CNN outputs using normalized weights.

    In the remaining part of the paper, we first present the related work. In Section
3, the proposed method is explained in detail. Then, the performance is tabulated and
evaluated. Finally, in Section 5, the paper is concluded with some remarks and possible
future directions.


2     Related Work
Argamon et al. [1] categorize the types of features that can be used for authorship pro-
filing as content-based features and style-based features. Their experiments show that
the most effective style-based features for gender discrimination are determiners and
prepositions (markers of male writing) and pronouns (markers of female writing). As
for content-based features, words related to technology (male) and words related to
personal life or relationships (female) are proved to be most useful.
     Rangel and Rosso [14] investigate the impact of emotions in age and gender identi-
fication. They process text, create part-of-speech (POS) tag graphs (POS tags as nodes
and their sequence as edges) and expand those graphs by related topic words, polarity la-
bels, and emotion words from the emotion dictionary. Then, they extract features using
graph analytics and feed them into machine learning algorithms to make the classifica-
tion. Their results prove that language use and emotions are effective in discriminating
gender and age.
     In the overview paper of the Author Profiling Task at PAN 2017: Gender and Lan-
guage Variety Identification in Twitter [12], the participant systems are compared with
respect to features and classification approaches. In that edition of the author profil-
ing task, more participants employ deep learning techniques, which perform automatic
feature selection. In the gender and language variety subtasks; the best performances
belong to a logistic regression classifier with combinations of character, word, and POS
n-grams, emojis, sentiments, character flooding, an SVM trained with combinations
of character and tf-idf n-grams, and a deep learning approach combining word and
character embeddings with CNN, RNN, attention mechanism, max-pooling layer, and
fully-connected layer.
     Basile et al. [3] try a Support Vector Machine (SVM) with word unigram and char-
acter n-grams on PAN 2017 author profiling task where they have best results among
other competitors. They use character three to five grams and word uni to bi-grams with
tf-idf weighting and use SVM on this feature space to discriminate both gender and
language variety. They also mention that the hand-crafted features decrease accuracy
rather than helping in this specific task.
     Miura et al. [11] propose two deep-learning based approaches which combine both
context-based and style-based features by taking the word level and the character level
information of the tweet’s text. Their systems use both Recurrent Neural Network
(RNN) (to address context-based features with the given word information) and CNN
(to address style-based features with the given character information). Their archi-
tectures consist of attention mechanism layers, a max-pooling layer, and also fully-
connected layers. The difference between the architectures is that one of them is on a
tweet-basis while the other one is on a user-basis. Additionally, the places of layers lead
to another difference.
     Kodiyan et al. [8] also use a deep learning approach by implementing a bidirectional
RNN with Gated Recurrent Units. They add an attention layer on tweet level to learn the
most important parts of each tweet. In order to move from tweet level to user level they
add the tweet predictions of a user together and use it as a single user level prediction.


3     Method

In this section, the description of the dataset and the details of the proposed model are
given including choice of parameters, preprocessing steps and architectural details.


3.1   Data

PAN 2018 Author Profiling dataset [15] is based on 3 languages (English, Arabic, Span-
ish) with ground-truth gender information. It has 3000 users for English, 3000 users for
Spanish, and 1500 users for Arabic language where each user has 100 tweets and 10
images that they posted on Twitter. In this work, only text data are used in gender clas-
sification.
3.2    Preprocessing
In Twitter, characters are used not only to create words but also to express emotions
like smiling as ’:)’ or blinking as ’;)’, because of this type of usage, punctuations and
stop words did not get eliminated, texts are given as how they are. NLTK [10] is used
to tokenize tweets. To illustrate (example from NLTK):
     Tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <–"
     Tokenized Tweet = [’This’, ’is’, ’a’, ’cooool’, ’#dummysmiley’, ’:’, ’:-)’, ’:-P’, ’<3’,
’and’, ’some’, ’arrows’, ’<’, ’>’, ’->’, ’<–’]
     Each word in the tokenized tweet is applied lowercasing. Then, each character from
the word is taken to be utilized in the input to the system. Thus, the tweet in the above
example is turned into the following input:
     Input = [’t’, ’h’, ’i’, ’s’, ’i’, ’s’, ’a’, ’c’, ’o’, ’o’, ’o’, ’o’, ’l’, ’#’, ’d’, ’u’, ’m’, ’m’,
’y’, ’s’, ’m’, ’i’, ’l’, ’e’, ’y’, ’:’, ’:’, ’-’, ’)’, ’:’, ’-’, ’p’, ’<’, ’3’, ’a’, ’n’, ’d’, ’s’, ’o’, ’m’,
e’, ’a’, ’r’, ’r’, ’o’, ’w’, ’s’, ’<’, ’>’, ’-’, ’>’, ’<’, ’-’, ’-’]
     For each user, the number of characters is set to the highest number that is allowed
for tweets in Twitter. If a tweet has fewer number of characters than the maximum,
padding is applied to the end of the tweet.

3.3    Character Embeddings
Character embeddings with size 25 are initialized by sampling from uniform distri-
bution with 0 mean and trained simultaneously with the neural network. Due to their
smaller size and count, training character embeddings requires fewer text to be trained
than word embeddings. Therefore, the given dataset was sufficient to train them and no
additional data are collected or used.

3.4    Architecture
In this study, each tweet of a user is passed to the CNN simultaneously as a sequence
of characters to assess the style-based features of each particular tweet. CNN outputs a
feature vector for each tweet.
    At this level, using other methods like combining, flattening or averaging the feature
vectors would mean to explicitly assume the equal importance among tweets. However,
the level of information on gender may differ from tweet to tweet. Therefore, A Bah-
danau attention mechanism [2] is combined with the character CNN in order to learn
which tweet holds more information on the gender of its author. Figure 1 shows the
attention mechanism in detail which is calculated by the following formulas:

                                       Ai = tanh(Wα ti + b)                                              (1)
                                              exp(Ai wi )
                                        vi = P                                                           (2)
                                               j exp(Aj wj )

                                               oi = vi t i                                               (3)
                                                   X
                                               K=        oi                                              (4)
                                                        i
    where Wα is a weight matrix used to multiply each output of the CNN, ti is the ith
tweet, b is bias vector, wi is the attention weights, Ai is the attention context vector, vi
is the attention value for ith tweet, oi is attention output vector for the corresponding
tweet, K is the output vector for user.


                                  Figure 1. Attention mechanism.


    A fully connected layer is used on the output of the attention layer to reduce the size
of the feature vector to the number of genders. Predictions are obtained after applying
softmax over the output of the fully connected layer. Proposed model can be seen in
Figure 2.
    CNN [6]2 is implemented with ReLu activation function and [filter size, embedding
size] shaped filters with stride 1 to make all characters visited. Adam optimizer [7] is
used with cross entropy loss. To prevent the model from overfitting L2 regularization
loss is used.


 2
     implementation can be found at: https://github.com/dennybritz/cnn-text-classification-tf
                               Figure 2. The proposed model.


3.5   Parameter Selection


Exhaustive grid search is used to optimize the hyperparameters of the model. Parame-
ters we have tried for each language can be seen in Table 1. Due to differences in each
language and the size of the dataset, different hyperparameters gave best results for each
language (Table 2) .


                      Table 1. Hyperparameters used in optimizations.

         Parameter                                        Values
      Embedding Size                                         25
       Learning Rate                  5x10−3 , 10−4 , 5x10−4 , 10−5 , 5x10−5 , 10−6
 L2 Regularization Coefficient 5x10−4 , 10−5 , 5x10−5 , 10−6 , 5x10−6 , 10−7 , 5x10−7 , 10−8
         Filter sizes                                    3, 5, 6, 9
      Number of Filters                            40, 50, 60, 75, 100
                          Table 2. Parameters with tuned values.

                          Parameter             English Spanish Arabic
                       Embedding Size             25      25      25
                        Learning Rate            10−4    10−4 10−4
                  L2 Regularization Coefficient 10−6 5x10−6 10−6
                          Filter sizes            3, 6    3, 6  3, 6, 9
                       Number of Filters          75      60      50
                            Strides                1       1       1


4   Results
We have selected the model with the best working parameters, shown in Table 2. As
can be seen from Table 3, our best model gives between 70% and 79% accuracy for
different languages in our validation runs. In the submission run over TIRA framework
[13], our best models obtained approximately 4% lower accuracy than the validation
runs for each language.


                  Table 3. Gender prediction accuracy for each language.

                   Language Validation Accuracy(%) Test Accuracy(%)
                    English           79.0               74.95
                    Arabic            75.7               69.20
                    Spanish           70.7               66.55
                    Average           75.1               70.23


           Table 4. Accuracy(%) of models with and without attention mechanism

                   Language CNN without attention CNN with attention
                    English        76.3                79.0
                    Arabic         72.0                75.7
                    Spanish        66.3                70.7
                    Average        71.5                75.1


    We have also observed in our experiments that, instead of averaging the feature
vectors at the output of the CNN or using fully connected layers to combine them,
using attention increases the accuracy of the system by approximately 3 percent on
an average in three aforementioned languages (Table 4). This shows that the attention
layer was able to learn "where to look" and identify the tweets that are more informative
when it comes to gender prediction. Table 5 shows an example of attention values for
three tweets of a particular user for each gender where the attention values correspond
to the probabilities of the respective tweets over the hundred tweets provided for the
user by the PAN author profiling dataset. It can be seen that the attention layer was
able to assign higher values to tweets which have stronger gender indicators such as the
words "bro" for male, "love" for female whereas it assigned low scores to automatically
generated tweets like the third tweet of the male user.

                 Table 5. Example of attention values on tweets for two users

 User Tweets                                                                            Attention values
 Male @******* bro it’s 1 sub                                                             0.04344852
       Recorded 2 videos today and I think they are going to be my best ever videos 0.01365168
       ever! Not sure when I will upload it tho :/
       Welcome to my new 9 followers and goodbye to 3 unfollowers (FREE stats by 0.00081017
       https:*******
Female Love hearing from the women themselves about their own experiences. includ- 0.12932625
       ing from the trans community. https:*******
       I have to declare I love the Great British Sewing New..for reality TV they are 0.01901261
       such kind generous people
       If you caught Prt 2 of "The Oldest Profession" on Night,s hear the other 2 progs: 0.00156761
       https:******* or download all 3 @*******


5   Conclusion
We have described a system submitted to the author profiling task of PAN at CLEF
2018. A CNN architecture is proposed which takes the characters of each tweet’s text
as an input. This input is based on a user in which each tweet of the user given to the sys-
tem. Local style-based features are aimed to be extracted by this system, automatically.
The critical issue related with the proposed system is to recognize that each tweet can
carry different level of information to discriminate the gender of a user. The attention
mechanism is able to catch that difference. This mechanism is added to CNN outputs.
Therefore, the predictions are based on the tweets holding more information about the
gender. As an output, the system gives the prediction of user’s gender in a vector form.
    In the given dataset, in addition to tweets, there are images posted by the users. In
future, we are also planning to make use of those image data along with our current
architecture and we are expecting to get improved results due to that addition.

References
 1. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author
    of an anonymous text. Commun. ACM 52(2), 119–123 (Feb 2009),
    http://doi.acm.org/10.1145/1461928.1461959
 2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align
    and translate. In: Proceedings of the 3rd International Conference on Learning
    Representations (2014), http://arxiv.org/abs/1409.0473
 3. Basile, A., Dwyer, G., Medvedeva, M., Rawee, J., Haagsma, H., Nissim, M.: N-gram: New
    groningen author-profiling model. CoRR abs/1707.03764 (2017),
    http://arxiv.org/abs/1707.03764
 4. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural
    language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (Nov
    2011), http://dl.acm.org/citation.cfm?id=1953048.2078186
 5. Goldberg, Y., Hirst, G.: Neural Network Methods in Natural Language Processing. Morgan
    & Claypool Publishers (2017)
 6. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the
    2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp.
    1746–1751. Association for Computational Linguistics (2014)
 7. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980
    (2014), http://arxiv.org/abs/1412.6980
 8. Kodiyan, D., Hardegger, F., Neuhaus, S., Cieliebak, M.: Author profiling with bidirectional
    rnns using attention with grus. In: CLEF (2017)
 9. Lecun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series. In:
    Arbib, M. (ed.) The handbook of brain theory and neural networks. MIT Press (1995)
10. Loper, E., Bird, S.: Nltk: The natural language toolkit. In: Proceedings of the ACL-02
    Workshop on Effective Tools and Methodologies for Teaching Natural Language
    Processing and Computational Linguistics - Volume 1. pp. 63–70. ETMTNLP ’02,
    Association for Computational Linguistics, Stroudsburg, PA, USA (2002)
11. Miura, Y., Taniguchi, T., Taniguchi, M., Ohkuma, T.: Author Profiling with
    Word+Character Neural Attention Network—Notebook for PAN at CLEF 2017. In:
    Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) CLEF 2017 Evaluation Labs and
    Workshop – Working Notes Papers, 11-14 September, Dublin, Ireland. CEUR-WS.org (Sep
    2017), http://ceur-ws.org/Vol-1866/
12. Pardo, F.M.R., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task
    at pan 2017: Gender and language variety identification in twitter. In: CLEF (2017)
13. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the
    Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and
    Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury,
    A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality,
    and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp.
    268–299. Springer, Berlin Heidelberg New York (Sep 2014)
14. Rangel, F., Rosso, P.: On the impact of emotions on author profiling. Information
    Processing & Management 52(1), 73 – 92 (2016),
    http://www.sciencedirect.com/science/article/pii/S0306457315000783, emotion and
    Sentiment in Social and Expressive Media
15. Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th
    Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In:
    Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF
    2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
    2018)
16. Stamatatos, E., Rangel, F., Tschuggnall, M., Kestemont, M., Rosso, P., Stein, B., Potthast,
    M.: Overview of PAN-2018: Author Identification, Author Profiling, and Author
    Obfuscation. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan,
    E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality,
    and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18). Springer,
    Berlin Heidelberg New York (Sep 2018)