Pooled LSTM for Dutch cross-genre gender classification

Matej Martinc1 and Senja Pollak1,2
matej.martinc@ijs.si, senja.pollak@ijs.si
1 Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
2 Usher Institute, Medical School, University of Edinburgh, UK

Abstract. We present the results of cross-genre and in-genre gender classification performed on the data sets of Dutch tweets, YouTube comments and news prepared for the CLIN 2019 shared task. We propose a recurrent neural network architecture for gender classification, in which the input word and part-of-speech sequences are fed to an LSTM layer, which is followed by average and max pooling layers. The best cross-genre accuracy of 55.20% was achieved by the model trained on YouTube comments and tweets and tested on the balanced news corpus, while the best in-genre accuracy of 61.33% was achieved on YouTube comments. Overall, the proposed approach ranked 2nd in the global cross-genre ranking and 6th in the global in-genre ranking of the CLIN 2019 shared task.

1 Introduction

Author profiling (AP) is a well-established subfield of natural language processing with a thriving community gathering data, organizing shared tasks and publishing on this topic. AP entails the prediction of an author's profile, i.e. demographic and/or psychological characteristics of the author, based on the text that he/she has written. The single most prominent author profiling task is gender classification; other tasks include the prediction of age, personality, region of origin and mental health of an author. Gender prediction became a mainstream research topic with the influential work by Koppel et al. (2002). Based on experiments on a subset of the British National Corpus, they found that women have a more relational writing style (e.g., using more pronouns), while men have a more informational writing style (e.g., using more determiners).
Later gender prediction research remained focused on English, but in the last few years, more languages have received attention in the context of author profiling (Rangel et al., 2015, 2016), with the publication of the TwiSty corpus, containing gender information on Twitter authors for six languages (Verhoeven et al., 2016), as a highlight so far.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

A recent study by van der Goot et al. (2018) calls the cross-genre transferability of machine learning approaches to gender prediction into question by noticing that most of these approaches have typically focused on lexical and specialized social network features, which boosted their performance but also made them highly genre and topic dependent. To address this problem, a fairly recent development in the field of AP is the search for data set independent features and approaches, capable of capturing the most generic differences between male and female writing, which transfer well across different genres and languages (Dell'Orletta and Nissim, 2018). This is also the main focus of the present research, in which we primarily deal with the development and testing of a system for Dutch cross-genre gender classification. In contrast to the majority of the best performing systems in the field of AP, which use hand-crafted features and traditional classifiers such as Support Vector Machines (SVM) and Logistic Regression (Rangel et al., 2017), we opted for a neural classifier and automated feature engineering.

This paper is structured as follows. The findings from related work are presented in Section 2. The data sets and the methodology used are presented in Section 3. Results are presented in Section 4, while in Section 5 we conclude the paper and present plans for future work.
2 Related work

The lively AP community is centered around a series of scientific events and shared tasks on digital text forensics, such as PAN (Uncovering Plagiarism, Authorship, and Social Software Misuse)1 and VarDial (Varieties and Dialects)2 (Zampieri et al., 2014). While VarDial is more focused on the identification of language varieties and dialects, most past PAN AP shared tasks were centered around gender classification. The first PAN event took place in 2011 and the first AP shared task was organized in 2013 (Rangel et al., 2013). From the beginning, the PAN shared task was multilingual (Rangel et al., 2013, 2014, 2015, 2016, 2017, 2018) and two of the past competitions also had a cross-genre setting (Rangel et al., 2014, 2016). Another shared task dedicated to cross-genre gender classification on Italian documents was the EVALITA 2018 cross-genre gender prediction (GxG) task (Dell'Orletta and Nissim, 2018).

The most popular approach to gender classification relies on bag-of-words features and SVM classifiers. For instance, the winners of the PAN 2017 competition (Basile et al., 2017) used an SVM based system with very simple features (just word unigrams, bigrams and character three- to five-grams).

Some quite successful attempts at tackling gender classification with neural networks have also been reported. A system consisting of a recurrent neural network (RNN) layer, a convolutional neural network (CNN) layer, and an attention mechanism proposed by Miura et al. (2017) ranked fourth in the PAN 2017 shared task.

1 http://pan.webis.de/
2 http://corporavm.uni-koeln.de/vardial/sharedtask.html
In the PAN 2018 multimodal gender classification task (Rangel et al., 2018), where the task was to predict the gender of a Twitter user from their tweets and published images, deep learning approaches prevailed and the overall winners used an RNN for texts and a CNN for images (Takahashi et al., 2018).

Another line of related research we looked at was the use of part-of-speech (POS) tags in existing gender classification approaches, since we hypothesized that POS based features would be less topic and genre dependent, and therefore appropriate for the cross-genre task at hand. Mukherjee and Liu (2010) showed that sequences of POS tags can be successfully used for gender prediction as a standalone feature or in combination with other features. POS tag sequences were also successfully used in combination with other features in the PAN 2017 AP shared task by Martinc et al. (2017), who ranked second overall in the competition and also tested their model in a cross-genre setting (Martinc and Pollak, 2018).

3 Experimental setup

This section describes the data sets, the methodology and the conducted experiments.

3.1 Data sets

The CLIN 2019 shared task organizers provided six data sets from three different genres. Altogether, they provided 30,000 tweets, 19,658 YouTube comments and 2,832 news articles, each genre split into a gender labeled train set and an unlabeled test set. All data sets are balanced in terms of the number of documents written by male and female authors. A more detailed description of all the data sets in terms of the number of documents and words is given in Table 1.

Dataset        Documents  Words
Twitter train  20,000     380,074
Twitter test   10,000     192,306
YouTube train  14,744     280,498
YouTube test   4,914      87,038
News train     1,832      336,602
News test      1,000      401,235

Table 1. Data sets used in the experiments

Fig. 1. Architecture of the proposed pooled LSTM network: word and POS sequences are passed through embedding layers (200), bidirectional LSTM layers (256), and max and average pooling layers (256), whose outputs are joined in a concatenation layer (1024), followed by a ReLU activation, dropout (0.4) and a dense output layer (2).

3.2 Methodology

Altogether, six classification models, three in-genre and three cross-genre, were trained and later used for prediction in our experiments. For the in-genre experiments, the train set for a specific genre was randomly split into a train set containing 90% of the documents and a validation set containing 10% of the documents. For the cross-genre experiments, we trained the Twitter cross-genre model on a concatenation of the YouTube and news train sets (the Twitter train set was used as a validation set during training), the YouTube cross-genre model on a concatenation of the Twitter and news train sets (the YouTube train set was used as a validation set during training), and the news cross-genre model on tweets and YouTube comments (the news train set was used as a validation set during training).

Text preprocessing is light: we only replace hashtags in some of the data sets with #HASHTAG tokens, URLs with HTTPURL tokens and mentions with @MENTION tokens. We also limit the text vocabulary to the 30,000 most frequent words and replace the rest with an out-of-vocabulary token.

We decided on a neural approach to the task at hand, mostly because of the relatively large sizes of the available train and test sets (described in Section 3.1). Taking into consideration some of the findings from the related work, we opted for a bidirectional recurrent architecture, which was successfully employed for gender prediction in the past (Miura et al., 2017; Takahashi et al., 2018).
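The light preprocessing and vocabulary truncation can be sketched as follows. The placeholder tokens #HASHTAG, HTTPURL and @MENTION come from the text; the regular expressions and the <UNK> name for the out-of-vocabulary token are our assumptions, not the authors' exact implementation.

```python
import re
from collections import Counter

def preprocess(text):
    """Replace URLs, mentions and hashtags with placeholder tokens.
    The patterns are illustrative approximations."""
    text = re.sub(r"https?://\S+", "HTTPURL", text)
    text = re.sub(r"@\w+", "@MENTION", text)
    text = re.sub(r"#\w+", "#HASHTAG", text)
    return text

def build_vocab(documents, max_size=30000):
    """Keep only the max_size most frequent words."""
    counts = Counter(tok for doc in documents for tok in doc.split())
    return {w for w, _ in counts.most_common(max_size)}

def apply_vocab(doc, vocab):
    """Map words outside the vocabulary to an <UNK> token (assumed name)."""
    return " ".join(tok if tok in vocab else "<UNK>" for tok in doc.split())
```

For example, `preprocess("check https://example.com #cool @user")` yields `"check HTTPURL #HASHTAG @MENTION"`.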
Initial experiments and previous research (Martinc et al., 2017; Martinc and Pollak, 2018) also suggested that adding POS tag information improves the performance of the model (especially in the cross-genre setting), therefore POS sequences are fed to the network together with the preprocessed texts.

                               Twitter  YouTube  News    Average
Validation set in-genre        0.6245   0.6270   0.6477  0.6331
Validation set cross-genre     0.5473   0.5580   0.5573  0.5542
Official test set in-genre     0.6099   0.6133   0.5990  0.6074
Official test set cross-genre  0.5427   0.5507   0.5520  0.5485

Table 2. Results of the in-genre and cross-genre classification

Embedding vectors of size 200 are produced for the input word and POS tag sequences with the help of two randomly initialized embedding layers, and then fed to two distinct bidirectional long short-term memory networks (BiLSTM) with 256 neurons, each of which produces a two-dimensional matrix representation (with a time-step dimension and a feature vector dimension) of the input sequence. In order to find the words/POS tags with the highest predictive power, we use an approach similar to the one proposed by Lai et al. (2015), and employ one-dimensional max pooling and average pooling operations (Collobert et al., 2011) over the time-step dimension to obtain two fixed-length vectors for each of the inputs. The four resulting vectors are concatenated and fed into a rectified linear unit (ReLU) activation function, on the output of which we conduct a dropout operation, in which 40% of the input units are dropped in order to reduce overfitting. The resulting vector is passed on to a fully connected (dense) layer responsible for producing the final binary gender prediction.

We use the PyTorch library for Python (Paszke et al., 2017) for the implementation of the system. For optimization, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001.
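A minimal PyTorch sketch of the architecture in Fig. 1 is given below. Hyperparameters follow the text (embedding size 200, 40% dropout, two output classes); the vocabulary sizes are placeholders. Since the concatenation layer in Fig. 1 is 1024-dimensional, we assume the reported 256 is the total BiLSTM output width, i.e. 128 units per direction; this is our reading, not a detail stated explicitly in the paper.

```python
import torch
import torch.nn as nn

class PooledLSTM(nn.Module):
    def __init__(self, word_vocab=30000, pos_vocab=50,
                 emb_dim=200, hidden=256, dropout=0.4, classes=2):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, emb_dim)
        self.pos_emb = nn.Embedding(pos_vocab, emb_dim)
        # hidden // 2 units per direction -> hidden-dim bidirectional output
        self.word_lstm = nn.LSTM(emb_dim, hidden // 2,
                                 batch_first=True, bidirectional=True)
        self.pos_lstm = nn.LSTM(emb_dim, hidden // 2,
                                batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # four pooled vectors (max + avg for words and POS), hidden each
        self.out = nn.Linear(4 * hidden, classes)

    def _encode(self, seq, emb, lstm):
        h, _ = lstm(emb(seq))        # (batch, time, hidden)
        max_pool, _ = h.max(dim=1)   # pool over the time-step dimension
        avg_pool = h.mean(dim=1)
        return max_pool, avg_pool

    def forward(self, words, pos):
        w_max, w_avg = self._encode(words, self.word_emb, self.word_lstm)
        p_max, p_avg = self._encode(pos, self.pos_emb, self.pos_lstm)
        x = torch.cat([w_max, w_avg, p_max, p_avg], dim=1)  # (batch, 1024)
        x = self.dropout(torch.relu(x))
        return self.out(x)           # logits for the two gender classes
```

Given batched word and POS index tensors of shape (batch, time), the model returns a (batch, 2) logit tensor.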
Each of the models is trained on the train set for one hundred epochs and tested on the validation set after every epoch. The model with the best performance on the validation set is chosen for the test set predictions. For POS tagging, the Perceptron tagger from NLTK (Bird and Loper, 2004) is used, and classifier performance is measured with accuracy.

4 Results

Classification results are presented in Table 2. On the official test sets, the highest cross-genre accuracy (55.20%) was achieved on news. Slightly worse was the accuracy on the data set of YouTube comments (55.07%), while the accuracy on the tweet test set was almost 1% lower. When it comes to the official in-genre results, the highest accuracy was achieved on the test set of YouTube comments (61.33%) and the lowest on news (59.90%).

Results on the validation sets are in all cases better than the results on the official test sets, when the same genres and the same types of classification are compared. This suggests some overfitting, which is generally more pronounced in the in-genre setting, where the training sets were smaller. Overfitting is worst in the news in-genre setting, where the difference in performance between the official test set and the validation set is almost 5%.

When we compare these results to the results of the other teams in the CLIN shared task, our approach yields good performance in the cross-genre part of the competition, where we ranked second as a team, although it should be mentioned that the first ranked team submitted two runs which both performed better than our submitted run. On the other hand, our approach yields worse results in the in-genre setting, where we ranked sixth out of eight teams with the ninth best run.

5 Conclusion

In this paper we presented the results of the CLIN 2019 cross-genre and in-genre gender classification shared task performed on the data sets of Dutch tweets, YouTube comments and news.
A neural network architecture which takes word and POS sequences as input is capable of extracting relatively good features by performing max and average pooling on the output matrix of the LSTM layer. On the official CLIN 2019 test sets, our team ranked second in the cross-genre setting and sixth in the in-genre setting. Not surprisingly, the models trained and tested on the same genre achieve much better performance than the models with train and test sets from different genres, even though the train sets in the cross-genre setting are much larger in all cases. The performance of our classifier is quite consistent across all genres, which is against our expectations, since we expected better performance on the news data set because of its on average much longer documents and therefore more per-instance information for the classifier.

Dutch gender classification is still a tough problem, which becomes clear if we compare the low performance of all the approaches in the shared task with the performance usually achieved on English data sets in PAN shared tasks. In order to narrow this gap, in the short term we plan to test our approach on other languages, to get a better picture of the difficulty of cross-genre and in-genre gender classification across different languages. We will also conduct a comprehensive error analysis, which will help us identify language- and genre-independent features that work well across different genres and languages. In the long term, we will try to improve our approach by testing various state-of-the-art neural architectures and employing transfer learning techniques.

Acknowledgments

The work presented in this paper has been supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).
The authors also acknowledge the financial support from the Slovenian Research Agency core research programme Knowledge Technologies (P2-0103). The Titan Xp used for this research was donated by the NVIDIA Corporation.

Bibliography

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. N-GrAM: New Groningen author-profiling model. arXiv preprint arXiv:1707.03764.

Steven Bird and Edward Loper. 2004. NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, page 31. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Felice Dell'Orletta and Malvina Nissim. 2018. Overview of the EVALITA 2018 cross-genre gender prediction (GxG) task. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy. CEUR.org.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. arXiv preprint arXiv:1805.03122.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273.

Matej Martinc and Senja Pollak. 2018. Reusable workflows for gender prediction. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Matej Martinc, Iza Škrjanec, Katja Zupan, and Senja Pollak. 2017. PAN 2017: Author profiling - gender and language variety prediction. In CLEF (Working Notes).

Yasuhide Miura, Tomoki Taniguchi, Motoki Taniguchi, and Tomoko Ohkuma. 2017. Author profiling with word+character neural attention network. In CLEF (Working Notes).

Arjun Mukherjee and Bing Liu. 2010. Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. 2017. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration. Available at https://pytorch.org/.

Francisco Rangel, Fabio Celli, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd author profiling task at PAN 2015. In CLEF 2015 Working Notes. CEUR.

Francisco Rangel, Paolo Rosso, Irina Chugur, Martin Potthast, Martin Trenkmann, Benno Stein, Ben Verhoeven, and Walter Daelemans. 2014. Overview of the author profiling task at PAN 2014. In CLEF 2014 Evaluation Labs and Workshop Working Notes Papers. CEUR.

Francisco Rangel, Paolo Rosso, Manuel Montes-y-Gómez, Martin Potthast, and Benno Stein. 2018. Overview of the 6th author profiling task at PAN 2018: Multimodal gender identification in Twitter. In Working Notes Papers of the CLEF.

Francisco Rangel, Paolo Rosso, Moshe Koppel, Efstathios Stamatatos, and Giacomo Inches. 2013. Overview of the author profiling task at PAN 2013. In CLEF 2013 Evaluation Labs and Workshop Working Notes Papers. CEUR.

Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. In Working Notes Papers of the CLEF.
Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. In CLEF 2016 Working Notes. CEUR-WS.org.

Takumi Takahashi, Takuji Tahara, Koki Nagatani, Yasuhide Miura, Tomoki Taniguchi, and Tomoko Ohkuma. 2018. Text and image synergy with feature cross technique for gender identification. In Working Notes Papers of the CLEF.

Ben Verhoeven, Walter Daelemans, and Barbara Plank. 2016. TwiSty: A multilingual Twitter stylometry corpus for gender and personality profiling. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC 2016). ELRA, Portorož, Slovenia.

Marcos Zampieri, Liling Tan, Nikola Ljubešić, and Jörg Tiedemann. 2014. A report on the DSL shared task 2014. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 58–67.