Exploring Combining Training Datasets for the CLIN 2019 Shared Task on Cross-genre Gender Detection in Dutch

Gerlof Bouma
Department of Swedish / Språkbanken, University of Gothenburg
gerlof.bouma@gu.se

Abstract. We present our entries to the Shared Task on Cross-genre Gender Detection in Dutch at CLIN 2019. We start from a simple logistic regression model with commonly used features, and consider two ways of combining training data from different sources. Our in-genre models do reasonably well, but the cross-genre models are a lot worse. Post-task experiments show no clear systematic advantage of one way of combining training data sources over the other, but do suggest accuracy can be gained from a better way of setting model hyperparameters.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). The emoji images used in this paper are part of FxEmoji, CC BY 4.0 Mozilla Foundation.

1 Introduction

Detection of binary author gender can be done from text alone with impressive effectiveness. For instance, Van der Goot et al. (2018) report an accuracy of 80% on Dutch tweets for their system, and the top systems for four other languages reported in Rangel et al. (2017) also perform in the low 80% accuracy range. These results concern systems that were trained on and applied to Twitter data, with multiple documents (i.e., tweets) per author. It is to be expected that performance suffers when such systems are applied to another genre than they were initially trained on. The CLIN 2019 Shared Task on Cross-genre Gender Detection in Dutch therefore invites authors to investigate “gender prediction within and across different genres in Dutch.”[1] This paper reports on our participation in this shared task.

[1] www.let.rug.nl/clin29/shared_task.php

The shared task consists of an in-genre setting and a cross-genre setting. In the former, models are trained on and applied to data from the same genre/data source. In the latter, models are trained on data from one or more sources, and applied to data from a genre that is assumed to be completely unknown during training. The genres supplied in the shared task were News, Twitter and YouTube comments.

Having access to training data from multiple sources raises the question of whether we can use that fact to construct models that generalize better and therefore perform better in a cross-genre setting. Our contribution to the shared task is a small investigation of the effect of how the multiple sources are combined. Building upon a basic logistic regression model with features taken from existing research on author profiling in general and gender identification in particular, we compare two ways of combining training data sources.

Section 2 gives a formal description of the model and introduces an alternative objective that combines training datasets in a principled way. Section 3 describes the features we use and gives some implementation details. The results for our systems in the shared task are presented and briefly discussed in Section 4. These results prompt a set of post-task experiments, whose outcome and implications are reflected upon in Section 5.

2 Description of the models

The core of our approach is a logistic regression model.
To fit a logistic regression model with L2 regularization, we need to find an intercept b_0 and feature weights B that minimize the sum of (a) the normalized negative log-likelihood of the model given the data and (b) the squared magnitudes of the feature weights, scaled by a regularization strength. To be precise, we minimize

\[
-\frac{1}{|D|} \log L(b_0, B \mid D) \;+\; \frac{\alpha}{2} \sum_{i=1}^{|B|} b_i^2 ,
\tag{1}
\]

where D is the training data, and α is a hyperparameter that lets us set the strength of the regularization term. (See, e.g., Hastie et al., 2009; Malouf, 2010 for proper introductions.) For the in-genre setting, we simply apply this formulation of the objective.

In the cross-genre setting, we are able to train on a combination of two or more datasets from different genres. Ideally, we would be able to leverage this fact to find models that generalize better and therefore fare better when applied to a new genre. Handling data from different sources is well studied in the domain adaptation literature. However, there one often has access to training data from both source and target domains (as in Daumé’s frustratingly easy method, Daume III, 2007, or, equivalently, multilevel regression, Finkel and Manning, 2009), or to source training data and distributional information about the target domain (e.g., work targeting different kinds of distributional shift). In this shared task, however, we have two or more sources, but are supposed to assume no knowledge of the target genre. Therefore, such domain adaptation methods do not apply directly.

A direct way to combine training data from multiple sources is simply to pool the data. We can then proceed to train according to Equation 1, as we would with a single dataset. When we are dealing with source datasets of unequal size, we can also consider normalizing the respective negative log-likelihoods by the sizes of the datasets, so that the larger dataset does not dominate the model. Just summing/averaging the normalized negative log-likelihoods of the sources could still lead to one source dominating, if that source is much easier to model. We therefore combine the negative log-likelihoods by taking their maximum. The log-sum-exp function gives us the kind of smooth maximum we need in order to be able to use standard optimization algorithms. In the formulation we use, log-sum-exp also takes a scaling parameter:

\[
\mathrm{lse}(x, y; k) = \frac{1}{k} \log\left(e^{kx} + e^{ky}\right).
\tag{2}
\]

With positive k, lse(x, y; k) is always greater than max(x, y). A higher k makes lse less smooth but closer to max. The objective for two datasets would thus be

\[
\mathrm{lse}\!\left(-\frac{1}{|D_1|} \log L(b_0, B \mid D_1),\; -\frac{1}{|D_2|} \log L(b_0, B \mid D_2);\, k\right) \;+\; \frac{\alpha}{2} \sum_{i=1}^{|B|} b_i^2 ,
\tag{3}
\]

where α and k are hyperparameters. This is trivially extended to more datasets.

Figure 1 illustrates the three ways of combining two source datasets graphically.

[Figure 1 plots normalized negative log-likelihood against the parameter setting for source D_1, a ten times larger source D_2, the pooled data model, the average objective, and the LSE objective with k = 1, 4, 16, 64, with the minima marked.]

Fig. 1. Hypothetical normalized negative log-likelihood curves for two source datasets D_1 and D_2, in a model with one parameter, and three ways of combining them. The optimal parameter settings for the two separate source datasets and their combinations are indicated by vertical dotted lines.

The hope is that by combining sources in a balanced way, we find models that generalize better to new genres. This is on the premise that we do not know anything about the target genre. If we had information that the new genre is more like one of the source genres, we might be better off building an ‘unfair’ model.
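For concreteness, the sketch below implements objective (3) for two sources and minimizes it with L-BFGS. It follows the implementation route mentioned in Section 3 (SciPy’s optimizer with gradients from Autograd), but it is only a minimal sketch under our own assumptions: dense feature matrices X1 and X2 with 0/1 labels y1 and y2, an unpenalized intercept, and helper names of our choosing rather than the exact code used for the task.

```python
# Minimal sketch of the LSE-combined objective of Equation 3 for two sources.
import autograd.numpy as np          # numpy wrapper that supports autodiff
from autograd import grad
from scipy.optimize import minimize

def neg_log_lik(params, X, y):
    """Normalized negative log-likelihood of a logistic regression model."""
    b0, B = params[0], params[1:]
    z = b0 + np.dot(X, B)
    # per-document log-likelihood of 0/1 labels, summed and size-normalized
    return -np.sum(y * z - np.log1p(np.exp(z))) / X.shape[0]

def lse(x, y, k):
    """Smooth maximum (log-sum-exp) with scaling parameter k, as in Eq. 2."""
    return np.log(np.exp(k * x) + np.exp(k * y)) / k

def objective(params, X1, y1, X2, y2, alpha, k):
    """Eq. 3: smooth maximum of the two normalized NLLs plus the L2 penalty."""
    nll1 = neg_log_lik(params, X1, y1)
    nll2 = neg_log_lik(params, X2, y2)
    penalty = 0.5 * alpha * np.sum(params[1:] ** 2)   # intercept not penalized
    return lse(nll1, nll2, k) + penalty

def fit(X1, y1, X2, y2, alpha=1.0, k=20.0):
    """Fit intercept and weights with L-BFGS; the gradient comes from autograd."""
    p0 = np.zeros(X1.shape[1] + 1)
    obj = lambda p: objective(p, X1, y1, X2, y2, alpha, k)
    res = minimize(obj, p0, jac=grad(obj), method='L-BFGS-B')
    return res.x
```

Replacing the lse call by the size-normalized sum of the two negative log-likelihoods, or by a single negative log-likelihood over the concatenated data, yields the average and pooled variants discussed above.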
3 Features, data preparation, and implementation

A linear model using surface form-based n-gram features has been shown to be very effective in (in-domain) gender identification (Basile et al., 2018), and we follow this method here too, albeit in simplified form. The results presented in that paper suggest the lion’s share of the accuracy is contributed by simple unigram features, and Bamman et al. (2014) present an investigation of which (classes of) lexical unigrams differentiate male from female authors on Twitter. We therefore only use unigram token occurrences in our model. Van der Goot et al. (2018) show the effectiveness of ‘bleached’ lexical features when doing cross-lingual gender detection. Inspired by their approach, we include word lengths as features. Finally, character n-grams are a common ingredient in author profiling, and Zechner (2017), working on authorship attribution, shows that even character unigram frequencies carry identifying information. Character unigrams therefore constitute our final feature subset. Keeping the feature set small and simple allows us to focus on the effects of model combination.

We follow common praxis in authorship attribution (see, e.g., Smith and Aldridge, 2011) in restricting the set of features to just the most frequently occurring types. The cut-off points were chosen on the basis of a non-systematic trial-and-error investigation of in-genre classification. Feature frequencies are estimated from the training data by (macro-)averaging the frequency distributions of the different training data sources. Also following results from authorship attribution, we use z-scores of frequencies as feature values. All in all, the feature vector for a document is made up of z-scores for the 2500 most frequent words, z-scores for the 50 most frequent characters, and z-scores for the 10 most frequent word lengths.

Texts were tokenized using Cutter (Graën et al., 2018), with some provisions to treat ASCII emoticons (‘:P’), repeated punctuation marks (‘???’) and sequences of Unicode emoji as single words. Other punctuation was included like any other ‘word’. All text was lower-cased before constructing the feature vectors. Fitting the logistic regression models was done with L-BFGS, using the facilities supplied by SciPy[2] and Autograd.[3]

[2] scipy.optimize.minimize, see scipy.org
[3] See github.com/HIPS/autograd
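To illustrate this feature set, the sketch below builds the 2560-dimensional document vectors (2500 words, 50 characters, 10 word lengths) from tokenized, lower-cased documents. It simplifies the description above in ways we make explicit: relative frequencies are assumed as the raw feature values, the vocabularies are taken from the pooled training documents rather than by macro-averaging per-source frequency distributions, and all helper names are our own.

```python
# Minimal sketch of the unigram + character + word-length feature vectors.
from collections import Counter
import numpy as np

def relative_freqs(items, vocab):
    """Relative frequency of each vocabulary item in one document."""
    counts = Counter(items)
    total = max(len(items), 1)
    return np.array([counts[v] / total for v in vocab])

def doc_vector(tokens, word_vocab, char_vocab, len_vocab):
    """Concatenate word, character, and word-length relative frequencies."""
    chars = [c for t in tokens for c in t]
    lengths = [len(t) for t in tokens]
    return np.concatenate([relative_freqs(tokens, word_vocab),
                           relative_freqs(chars, char_vocab),
                           relative_freqs(lengths, len_vocab)])

def build_features(train_docs):
    """Select the most frequent types and return z-scored training vectors."""
    word_vocab = [w for w, _ in Counter(t for d in train_docs for t in d).most_common(2500)]
    char_vocab = [c for c, _ in Counter(c for d in train_docs for t in d for c in t).most_common(50)]
    len_vocab = [l for l, _ in Counter(len(t) for d in train_docs for t in d).most_common(10)]
    X = np.vstack([doc_vector(d, word_vocab, char_vocab, len_vocab) for d in train_docs])
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12   # avoid division by zero
    return (X - mu) / sd, (word_vocab, char_vocab, len_vocab, mu, sd)
```

Test documents would be vectorized with the same vocabularies and standardized with the training means and standard deviations.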
4 Entries to the shared task

For the in-genre setting, we entered two groups of models: full models according to the specifications above (‘in-genre 1’), and models that only use the 2500 lexical features (‘in-genre 2’). We set the regularization hyperparameter α by 5-fold cross-validation. The results are in Table 1.

Table 1. Results for the in-genre models

              log10 α   X-val accuracy   Eval accuracy   Rank
In-genre 1
  News           1          .6386            .639          4
  Twitter        1          .6327            .6316         5
  YouTube        0          .6183            .6294         3
  Average        —          .6299            .6333         4 of 13
In-genre 2
  News           2          .6495            .620          6
  Twitter        1          .6269            .6311         6
  YouTube        2          .6194            .6233         5
  Average        —          .6319            .6248         5 of 13

Cross-validation gives fair estimates of the task evaluation results (except for one overestimate). In the task evaluation, the full models do better than the reduced ones. We speculate that the character and word length features, being less sparse, make the models more robust. Compared to the other entries to the shared task, both kinds of model perform reasonably well, with accuracies consistently in the top half. It should be noted that all entries to the shared task perform well below the 80% mentioned in the introduction. This is probably related to the sizes of the training datasets and the fact that the task requires prediction on the basis of just one document.

For the cross-genre setting, we also entered two groups of models: one group combining the datasets using the lse-based formulation of the objective (‘cross-genre 1’) and one pooling the data (‘cross-genre 2’). The scaling hyperparameter k for lse was kept constant at 20 (no attempts were made to optimize it), and α was set by 5-fold cross-validation; the setting of the model with the highest macro-averaged accuracy across the training data sources was chosen. The results are in Table 2. Rows are labeled by the target genre; the cross-validation columns give accuracy on each training source, so the column for the target genre is empty.

Table 2. Results for the cross-genre models

                         X-val acc
Target      log10 α   News    Twitter  YouTube   Avg     Eval acc   Eval rank
Cross-genre 1
  News         0        –      .6202    .6047   .6125     .510         10
  Twitter      0      .5993      –      .6127   .6060     .5428         7
  YouTube      0      .6183    .6175      –     .6084     .5252         7
  Average      —                                .6090     .5260        11 of 13
Cross-genre 2
  News         0        –      .6225    .6029   .6127     .508         11
  Twitter      0      .5971      –      .6157   .6064     .5494         5
  YouTube      2      .5354    .6298      –     .5826     .5236         8
  Average      —                                .6006     .5270        10 of 13

The cross-validated accuracies are all lower than in the in-genre case: apparently the models suffer more from having to deal with two different sources than they benefit from the larger training datasets. Comparing the accuracies of cross-genre 1 (lse) to cross-genre 2 (pooling), we observe that the pooled data model tends to cater better for the larger source dataset (viz., Twitter or YouTube), although this is only very pronounced in the model combining News and Twitter training data. The average cross-validation accuracies of the two combination methods are very similar, except in this last case, where lse has a slight advantage. The task evaluation accuracies are also very similar between the two model types. As is to be expected from the shift in genre, the cross-validation accuracy here is a very poor indication of evaluation accuracy: the latter is on average almost 8 percentage points lower. Compared to the other entries in the shared task, we now do a lot worse, performing in the bottom segment. The poor results on the News genre – the least similar genre – are to blame for this, as we are in the middle bracket for the other two genres.
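The selection of α for the cross-genre entries amounts to a small grid search: for each candidate value, compute 5-fold cross-validation accuracy per training source and keep the value whose macro-average over the sources is highest. The sketch below illustrates this reading of the procedure; the helper xval_acc(source, alpha), the candidate grid, and all names are our own assumptions, and the exact fold protocol used for the task is not specified here.

```python
# Minimal sketch of selecting alpha by macro-averaged cross-validation accuracy.
import numpy as np

def select_alpha(sources, log10_alphas, xval_acc):
    """Pick the alpha whose accuracy, macro-averaged over sources, is highest.

    sources     -- list of (X, y) training datasets, one per genre
    log10_alphas -- candidate regularization strengths on a log10 scale
    xval_acc    -- callable returning 5-fold cross-validation accuracy for
                   one source under one alpha (assumed, not defined here)
    """
    best_alpha, best_score = None, -np.inf
    for alpha in 10.0 ** np.array(list(log10_alphas), dtype=float):
        score = np.mean([xval_acc(src, alpha) for src in sources])  # macro-average
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha

# e.g. alpha = select_alpha([news, twitter], range(-2, 4), xval_acc)
```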
5 Post-task experiments

As mentioned, the cross-genre models do not fare as well as the in-genre ones in the shared task. In addition, the differences between lse and pooling are small. To see to what extent these results depend on our choice of hyperparameters, we trained models on two genres at different levels of α and k, and evaluated on the third genre. We only used the shared task’s training data for these experiments. The results are in Figure 2. Note that the results of the shared task evaluation are not included here, since we did not use the task evaluation data.

[Figure 2 plots cross-genre accuracy against log10 α for the pooled data model and for the LSE models with k = 1, 2, 4, ..., 128, with the cross-validated hyperparameter settings marked.]

Fig. 2. Cross-genre gender detection accuracy per hyperparameter setting: evaluation on News data (left), Twitter data (middle), and YouTube data (right).

There does not seem to be a clear, systematic difference between the accuracies of the two ways of combining data. However, we can see that the hyperparameter settings chosen by cross-validation are suboptimal for both methods on all three datasets. In addition, choosing the right setting for k can make a real difference in performance, although overall it seems that a higher k is preferable.

6 Conclusions

We have presented our efforts in the Cross-genre Gender Detection shared task, where we aimed to compare two ways of combining data sources: simply pooling the data versus optimizing an objective that combines the respective negative log-likelihoods with log-sum-exp. The two methods perform similarly, and we have not seen evidence of a real advantage of using the more involved method. However, a set of post-task experiments does show that there is performance to be gained from a better way of picking the hyperparameters in both methods.

In this work, we have not focussed on the feature set definition, nor studied the effectiveness of different kinds of features in any depth. In theory, these issues are orthogonal to what we have presented in this report. We thus reserve the investigation of model combination methods in the context of known state-of-the-art feature sets for future work.

Bibliography

David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2018. Simply the best: Minimalist system trumps complex models in author profiling. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 143–156, Cham. Springer International Publishing.

Hal Daume III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic. Association for Computational Linguistics.

Jenny Rose Finkel and Christopher D. Manning. 2009. Hierarchical Bayesian domain adaptation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 602–610, Boulder, Colorado. Association for Computational Linguistics.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 383–389. Association for Computational Linguistics.

Johannes Graën, Mara Bertamini, and Martin Volk. 2018. Cutter – a universal multilingual tokenizer. In Proceedings of the 3rd Swiss Text Analytics Conference – SwissText 2018, pages 75–81, Winterthur.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer-Verlag, New York.

Rob Malouf. 2010. Maximum entropy models.
In Alex Clark, Chris Fox, and Shalom Lappin, editors, Handbook of Computational Linguistics and Natural Language Processing, pages 133–155. Wiley Blackwell.

Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In Working Notes Papers of the CLEF 2017 Evaluation Labs.

Peter Smith and W. Aldridge. 2011. Improving authorship attribution: Optimizing Burrows’ Delta method. Journal of Quantitative Linguistics, 18(1):63–88.

Niklas Zechner. 2017. A Novel Approach to Text Classification. Ph.D. thesis, Umeå University.