Two Attempts to Predict Author Gender in Cross-Genre Settings in Dutch

Eduardo Brito, Rafet Sifa, and Christian Bauckhage

Fraunhofer IAIS, Fraunhofer Center for Machine Learning, Sankt Augustin, Germany

Abstract. This paper describes the systems designed by the Fraunhofer IAIS team for the CLIN29 shared task on cross-genre gender detection in Dutch. We present two alternative classification approaches: a rather standard one consisting of feature engineering and a random forest classifier, and an alternative one involving an LSTM classifier. Both are enhanced by an LDA model trained on stems. We considered various features such as the frequency of function words, parts of speech, and sentiment, among others. We achieved 53.77% average accuracy in the cross-genre settings.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The CLIN29 shared task is defined as predicting the gender of the author of collections of texts in Dutch. The organizers provided a labeled dataset from three different genres according to their source: "Twitter", "Youtube", and "News". For each of these genres, we were asked to predict the gender of the respective author by means of two different models: one trained with the text collection of the same genre (in-genre setting) and one trained with anything but text from the same genre (cross-genre setting). We decided to design a common machine learning pipeline (per presented system) that works for all settings. Thus, all models are learned by applying the same pipeline and differ only in their respective training data. This pipeline is described in Section 3. Since the shared task was motivated by exploring approaches for cross-genre settings, we focused on tuning the pipeline components by evaluating each configuration on the three cross-genre settings exclusively, virtually ignoring in-genre evaluations for hyperparameter tuning.

Finally, our experiments suggested that two different architectures were best suited to tackle this shared task: one based on the term frequency of function words and topic modelling, and another based on a recurrent neural network using features of different nature. Since we were allowed to submit two different runs per setting, we provided predictions from both systems for each of the defined settings. Although both approaches reached similar accuracy values in the cross-validations that we performed on the training dataset, the evaluation on the final test dataset shows that our first approach generalizes better on the evaluated genres.

2 Related Work

There is already a certain research tradition in the broader area of author profiling, which includes the detection of other author characteristics such as age, as can be seen in the PAN shared task series (https://pan.webis.de). However, most of this research has been performed on English text. In particular, we could only find previous work on cross-genre author gender detection in Dutch from the PAN 2016 shared task (Rangel et al., 2016). Most of the features that our models use are inspired by the best performing systems presented for that challenge. Syntactic features such as part-of-speech (POS) tags and syntactic dependency relations can predict author gender (Company, 2016). Due to their independence from the semantic content, we considered them suitable for cross-genre settings. Hence, we incorporated both of them in the second approach that we present.
Chen et al. (2018) showed that emoji usage on Twitter highly depends on the gender of the author. However, emojis appear mostly on social media and hardly exist in the "News" genre. Hence, we decided to remove all emojis and emoticons in order to prevent overfitting in the cross-genre settings: our aim was to test the same set of features in all settings, at the cost of possibly underperforming in the in-genre settings.

3 Experimental Setup

Our two systems consist of three main processes: topic modeling, feature extraction, and classification. An overview of the pipeline is shown in Figure 1.

Fig. 1. Pipeline overview including both presented systems.

3.1 Topic modeling

Data: The topic modeling module takes two external corpora as input:

– NLCOW14 (https://github.com/rsling/cow, https://github.com/rsling/texrex): the Dutch web corpus from the COW initiative (Schäfer and Bildhauer, 2012; Schäfer, 2015).
– Wikipedia: a Dutch Wikipedia dump from 2018-10-01 (https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).

Filtering: In order to comply with the cross-genre rule of not using any text coming from the genre on which a model is evaluated, we filtered out all documents from NLCOW14 whose URL contains the string "twitter" or "youtube". Since the concept of genre is defined by the data source rather than by the text content, we consider that this approach suffices to discard all texts belonging to the "Twitter" and "Youtube" genres: each of these genres comes from a single data source. In contrast to the other genres, "News" may come from diverse data sources, which we cannot reliably filter out by simply checking the document URL against a blacklist as we did for the NLCOW14 corpus. Therefore, we selected the Wikipedia corpus for the topic modelling part of the two "News" settings.

Preprocessing: From each document, a set of words is filtered out, namely: stop words, emoticons, emojis, URLs, and words starting with the characters '#' and '@' (typical for Twitter hashtags and mentions, respectively). Then, each word token is lemmatized and stemmed consecutively. Word tokenization and lemmatization are performed with the help of the Frog parser (Bosch et al., 2007), available via LaMachine (https://proycon.github.io/LaMachine/), while stemming is achieved with NLTK (Bird et al., 2009).

Model training: We train latent Dirichlet allocation (LDA) models (Blei et al., 2003) with 50 topics on a document basis using the Gensim software package (Řehůřek and Sojka, 2010). That is, each document is assigned a 50-dimensional vector at the end of this process.
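As a minimal sketch of this step with Gensim's LdaModel: the preprocess function below is only a stand-in for the actual filtering, lemmatization, and stemming described above, and all hyperparameters other than the number of topics are illustrative assumptions.

```python
# Minimal sketch of the topic modeling step with Gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def preprocess(doc):
    # Stand-in for the actual pipeline (stop word/emoji/URL removal,
    # Frog lemmatization, NLTK stemming); here: naive whitespace split.
    return doc.lower().split()

def train_lda(documents, num_topics=50):
    # documents: iterable of raw texts (NLCOW14 or the Wikipedia dump)
    tokenized = [preprocess(doc) for doc in documents]
    dictionary = Dictionary(tokenized)
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=num_topics)
    return lda, dictionary

def topic_vector(lda, dictionary, tokens, num_topics=50):
    # Infer a dense 50-dimensional topic vector for one document.
    bow = dictionary.doc2bow(tokens)
    dense = [0.0] * num_topics
    for topic_id, weight in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = weight
    return dense
```

The output of topic_vector is the 50-dimensional vector that both systems later concatenate to their feature vectors.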
3.2 Feature extraction

The labeled data from the three genres goes through a similar preprocessing as described for topic modeling. However, instead of being lemmatized and stemmed, non-function words are normalized into special tokens. Namely, words parsed as nouns, verbs, adjectives, and special words are substituted by a token related to their POS. Additionally, tokens recognized as ordinals, hours, dates, and other types of numbers are normalized to respective tokens as well. The feature vector used for classification depends on the system:

System 1 We create feature vectors based on function words. In our preprocessed text, only function words and normalized tokens are left. For each document, we compute the logarithm of the term frequency vector normalized with respect to the L1-norm, as proposed by Diederich et al. (2003). Since we assume that the usage frequency of function words varies depending on the context where they appear, we concatenate a 50-dimensional vector inferred from the trained LDA model in order to provide contextual information to the classifier.
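The following sketch illustrates this construction. The fixed vocabulary of function words and normalized tokens is our assumption, as is the add-one smoothing that keeps the logarithm finite for absent terms (how zero counts are handled is not specified above).

```python
import numpy as np

def system1_features(tokens, vocabulary, topic_vec):
    # tokens: preprocessed document (function words and normalized tokens)
    # vocabulary: fixed list of function words and normalized tokens (assumed)
    # topic_vec: 50-dimensional vector inferred by the LDA model
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros(len(vocabulary))
    for token in tokens:
        if token in index:
            counts[index[token]] += 1
    # L1-normalize the term frequencies and take the logarithm;
    # add-one smoothing (our assumption) avoids log(0) for absent terms.
    relative = (counts + 1.0) / (counts.sum() + len(vocabulary))
    log_tf = np.log(relative)
    # Append the topic vector as contextual information.
    return np.concatenate([log_tf, np.asarray(topic_vec)])
```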
System 2 We also decided to try a neural network architecture based on a long short-term memory (LSTM) classifier (Hochreiter and Schmidhuber, 1997). Since some dependencies across words within a document may also be characteristic of a specific gender (Chen et al., 2018), the LSTM should capture their relevance for classification. For this architecture, we understand a document as a sequence of words and transform each document into a sequence of vectors, one per word token. Per sentence, we also add two additional tokens marking the start and the end of the sentence, respectively. Each of these vectors contains:

– A coarse-grained POS tag (e.g. verb).
– A fine-grained POS tag (e.g. 1st person, singular, present).
– The position of the word within the sentence.
– The relative position of the word (position / sentence length).
– The distance of the word to its parent word (in the syntactic dependency tree) within the sentence.
– A polarity score.
– A subjectivity score.

To each vector, we append a 50-dimensional vector inferred by the respective LDA model from the document in which the word resides, in the same fashion as for System 1. The POS tags are based on the Corpus Gesproken Nederlands (CGN) tags (Van Eynde, 2004) obtained by the Frog parser. For each parsed CGN tag, we split the string into the part before the brackets, which is our coarse-grained POS tag (e.g. "WW" for verbs), and the part within the brackets, which constitutes our fine-grained POS tag (e.g. "pv,tgw,ev" for finite verb, present tense, singular). The polarity and subjectivity scores are obtained from the Pattern software package (Smedt and Daelemans, 2012). In order to have a document representation of fixed size, we restrict each representation to a sequence of 100 vectors. If a document has fewer than 100 vectors, we prepend zero vectors until we reach the fixed size.

3.3 Classification

For both systems, the training data depends on the evaluated setting: for the in-genre settings, only the provided examples from the same genre are used to generate the feature vectors; for the cross-genre settings, the examples from the two other genres are used instead. Additionally, each feature vector includes a topic vector inferred by the respective topic model. For instance, the training set for the in-genre "Twitter" setting consists of the preprocessed "Twitter" examples plus a topic vector derived from a topic model trained on the NLCOW14 corpus, whereas the training set for the cross-genre "Twitter" setting is made out of the preprocessed "News" and "Youtube" examples (plus the respective topic vectors).

System 1 We train a random forest classifier (Breiman, 2001) for each of the defined settings by means of its scikit-learn implementation (Pedregosa et al., 2011) on the vectors generated by the feature extraction module and the labels from the given dataset. All classifiers have identical hyperparameter configurations (namely 2001 estimators, log2 of the number of features as maximum features, and 2 as the minimum number of samples required to split an internal node and to be at a leaf node) but a different maximal tree depth, which is detailed in Table 1. We found no significant improvement from other setting-specific hyperparameters.

Setting               Max depth
Twitter in-genre      5
Twitter cross-genre   4
News in-genre         11
News cross-genre      8
Youtube in-genre      10
Youtube cross-genre   5

Table 1. Maximum tree depth set for the random forest classifiers in each setting.

System 2 Per setting, we train the following recurrent neural network architecture with Keras (https://keras.io/):

– A bidirectional LSTM layer with 50 hidden units and a dropout ratio of 0.5 for the linear transformations of the inputs and the recurrent state, respectively.
– A dropout layer with a 0.5 ratio.
– A softmax layer.

The architecture is trained with a Nesterov Adam optimizer (Dozat, 2016) and a categorical cross-entropy loss function.
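Assuming per-word feature vectors of some fixed dimensionality feature_dim (the exact encoding of the categorical POS features is not specified above), this architecture can be sketched in Keras as follows:

```python
# Sketch of the System 2 classifier; feature_dim is the size of one
# per-word feature vector (POS, position, and sentiment features plus
# the 50-dimensional topic vector) and depends on the chosen encoding.
from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense, Dropout

def build_system2_model(feature_dim, seq_len=100):
    model = Sequential([
        # 50 hidden units; 0.5 dropout on the input and recurrent
        # linear transformations, respectively
        Bidirectional(LSTM(50, dropout=0.5, recurrent_dropout=0.5),
                      input_shape=(seq_len, feature_dim)),
        Dropout(0.5),
        # two classes (female/male) with a softmax output
        Dense(2, activation="softmax"),
    ])
    # Nesterov Adam optimizer with categorical cross-entropy loss
    model.compile(optimizer="nadam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Since documents shorter than 100 tokens are padded with prepended zero vectors as described above, all inputs share the (100, feature_dim) shape.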
Setting       # Training examples   Sys. 1   Sys. 2

In-genre
Twitter       20,000                59.45    59.15
News          1,832                 50.30    49.40
Youtube       14,744                55.66    55.11
Average                             55.14    54.55

Cross-genre
Twitter       16,576                54.25    51.77
News          34,744                54.80    50.40
Youtube       21,832                52.27    50.94
Average                             53.77    51.04

Table 2. Accuracy (in %) of our submitted predictions from the two presented systems.

4 Results and discussion

As we can see from Table 2, System 1 performs consistently better than System 2 in all settings. This is an unexpected result considering that the latter learns from a wider range of features. It is possible that the amount of training examples did not suffice for the second approach to work better than the first one, because LSTM classifiers require considerably more labeled examples than random forest classifiers. Another surprising result is that the accuracy in the cross-genre setting is higher than in the in-genre one in the case of the "News" datasets. This is probably due to the much larger size of the training dataset for the cross-genre setting (34,744 examples from "Twitter" and "Youtube") compared to the in-genre setting (1,832 examples from "News"). Therefore, System 1 might still perform better given a larger training dataset.

Our results show that our models can predict author gender for only a (small) part of the evaluated text collections. Hence, we suspect that most pieces of text do not have any genre-independent characteristics that can be reliably used to predict the gender of the author, although in some of them this may be possible. Therefore, we remain uncertain about whether it is feasible (in general) to predict author gender using features that are independent from the text genre.

Bibliography

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.

Antal van den Bosch, Bertjan Busser, Sander Canisius, and Walter Daelemans. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. LOT Occasional Series 7:191–206.

Leo Breiman. 2001. Random forests. Machine Learning 45(1):5–32.

Zhenpeng Chen, Xuan Lu, Wei Ai, Huoran Li, Qiaozhu Mei, and Xuanzhe Liu. 2018. Through a gender lens: Learning usage patterns of emojis from large-scale Android users. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, pages 763–772.

Juan Soler Company. 2016. Use of discourse and syntactic features for gender identification. In STAIRS, pages 215–220.

Joachim Diederich, Jörg Kindermann, Edda Leopold, and Gerhard Paass. 2003. Authorship attribution with support vector machines. Applied Intelligence 19(1-2):109–123.

Timothy Dozat. 2016. Incorporating Nesterov momentum into Adam.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. In Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings, pages 750–784.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, pages 45–50.

Roland Schäfer. 2015. Processing and querying large web corpora with the COW14 architecture. In Piotr Bański, Hanno Biber, Evelyn Breiteneder, Marc Kupietz, Harald Lüngen, and Andreas Witt, editors, Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3). UCREL, IDS, Lancaster.

Roland Schäfer and Felix Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey, pages 486–493.

Tom De Smedt and Walter Daelemans. 2012. Pattern for Python. Journal of Machine Learning Research 13(Jun):2063–2067.

Frank Van Eynde. 2004. Part of speech tagging and lemmatizing of the Corpus Gesproken Nederlands (Spoken Dutch Corpus). KU Leuven.