Overview of the EVALITA 2018 Cross-Genre Gender Prediction (GxG) Task

Felice Dell'Orletta, ItaliaNLP Lab, ILC-CNR, Pisa, Italy — felice.dellorletta@ilc.cnr.it
Malvina Nissim, CLCG, University of Groningen, The Netherlands — m.nissim@rug.nl

Abstract

English. The Gender Cross-Genre (GxG) task is a shared task on author profiling (in terms of gender) on Italian texts, with a specific focus on cross-genre performance. This task has been proposed for the first time at EVALITA 2018, providing different datasets from different textual genres: Twitter, YouTube, Children writing, Journalism, Personal diaries. Results from a total of 50 different runs show that the task is difficult to learn in itself: while almost all runs beat a 50% baseline, no model reaches an accuracy above 70%. We also observe that cross-genre modelling yields a drop in performance, but not as substantial as one would expect.

Italiano (translated). GxG (Gender Cross-Genre) is the first evaluation campaign for identifying the gender of the author of texts written in Italian, and it belongs to the area of study known as author profiling. In this edition of GxG, particular attention was devoted to evaluating systems in cross-domain settings. The textual domains examined were: Twitter, YouTube, Children writing, Journalism, Personal diaries. The results obtained from a total of 50 different runs (produced by three different research groups) show that the task is complex: while almost all the tested systems beat the 50% baseline, no model reaches an accuracy above 70%. We also observe that the results obtained in the cross-domain setting show a drop in performance that is not as substantial as one might have expected.

1 Introduction

As publishing has become more and more accessible and basically cost-free, virtually anyone can get their words spread, especially online. Such ease of disseminating content doesn't necessarily go together with author identifiability. In other words: it's very simple for anyone to publicly write any text, but it isn't equally simple to always tell who the author of a text is.

In the interest of companies who want to advertise, or of legal institutions, finding out at least some characteristics of an author is of crucial importance. Author profiling is the task of automatically discovering latent user attributes from text. Gender, which we focus on in this paper, and which is traditionally characterised as a binary feature, is one such attribute.

Thanks to a series of tasks introduced at the PAN Labs in the last five years (pan.webis.de), and to the production of a variety of gender-annotated datasets focused on social media (e.g., Verhoeven et al., 2016; Emmery et al., 2017), gender prediction has been addressed quite substantially in NLP. State-of-the-art gender prediction on Twitter for English, the most common platform and language used for this task, is approximately 80–85% accuracy (Rangel et al., 2015; Alvarez-Carmona et al., 2015; Rangel et al., 2017; Basile et al., 2017), as obtained at the yearly PAN evaluation campaigns (pan.webis.de).

However, in the context of the 2016 PAN evaluation campaign, a cross-genre setting was introduced for gender prediction on English, Spanish, and Dutch, and best scores were recorded at an average accuracy of less than 60% (Rangel et al., 2016). This was achieved by training models on tweets, and testing them on datasets from a different source still in the social media domain, namely blogs.
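The cross-genre evaluation just described, which GxG adopts as its central setting (Section 2), amounts to a leave-one-genre-out protocol: a model evaluated on a given genre may be trained on data from any genre except that one. A minimal sketch, using the five GxG genre names but otherwise invented function and data names:

```python
# Leave-one-genre-out split behind the cross-genre setting.
# `datasets` maps each genre to a list of (text, gender_label) pairs;
# the toy data below is purely illustrative.
GENRES = ["twitter", "youtube", "children", "journalism", "diaries"]

def cross_genre_split(datasets, target):
    """Train on every genre except `target`; test on `target` only."""
    train = [pair for genre in GENRES if genre != target
             for pair in datasets[genre]]
    test = list(datasets[target])
    return train, test

datasets = {g: [(f"toy {g} text", "F"), (f"another toy {g} text", "M")]
            for g in GENRES}
train, test = cross_genre_split(datasets, "twitter")
print(len(train), len(test))  # 8 2
```

In the in-genre setting, training and test data instead both come from the target genre, with disjoint portions per genre as described in Section 3.2.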
To further explore the cross-genre issue, Medvedeva et al. (2017) ran additional experiments using PAN data from previous years with the model that had achieved the best results at the cross-genre PAN 2016 challenge (Busger op Vollenbroek et al., 2016). The picture they obtain is mixed in terms of cross-genre accuracy, eventually showing that models are not yet general enough to capture gender accurately across different datasets. This is evidence that we have not yet found the actual dataset-independent features that do indeed capture the way females and males might write differently. To address this issue, we have designed a task specifically focused on cross-genre gender detection, and launched it within the EVALITA 2018 campaign (Caselli et al., 2018). The rationale behind the cross-genre settings is as follows: if we can make gender prediction stable across very different genres, then we are more likely to have captured deeper gender-specific traits rather than dataset characteristics. As a by-product, this task yields a variety of models for gender prediction in Italian, also shedding light on which genres, in a way, favour or discourage gender expression, by looking at whether they are easier or harder to model.

While Italian has featured in multi-lingual gender prediction at PAN (Rangel et al., 2015), this is the first task that addresses author profiling for Italian specifically, within and across genres.

2 Task

GxG (Gender Cross-Genre) is a task on author profiling (in terms of gender) on Italian texts, with a specific focus on cross-genre performance. Given a (collection of) text(s) from a specific genre, the gender of the author has to be predicted. The task is cast as a binary classification task, with gender represented as F (female) or M (male).

Evaluation settings were designed bearing in mind the question at the core of this task: are there indicative traits across genres that can be leveraged to model gender in a rather genre-independent way? We hoped to provide some answers to this question by making participants train and test their models on datasets from different genres. For comparison, participants were also recommended to submit genre-specific models, i.e., models tested on the very same genre they were trained on. In-genre modelling can (i) shed light on which genres might be easier to model, i.e. where gender traits are more prominent; and (ii) make it easier to quantify the loss when modelling gender across genres. Therefore, the gender prediction task must be done in two ways:

• using a model which has been trained on the same genre;
• using a model which has been trained on anything but that genre.

We selected five different genres (Section 3), and asked participants to submit up to ten different models, as per the overview in Table 1. Obviously, if one participant wanted to have one single model for everything, they could submit that one model for all settings. In the cross-genre setting, the only constraint is not using in training any single instance from the genre being tested on. Other than that, participants were free to combine the other datasets as they wished. Participants were also free to use external resources, provided the cross-genre settings were carefully preserved, and everything used was described in detail in their final report.

Table 1: Models to submit.

  Genre        IN-GENRE          CROSS-GENRE
  Twitter      in-genre model    non-Twitter model for Twitter
  YouTube      in-genre model    non-YouTube model for YouTube
  Children     in-genre model    non-Children model for Children
  Journalism   in-genre model    non-Journalism model for Journalism
  Diaries      in-genre model    non-Diaries model for Diaries

Measures. As standardly done in binary classification tasks with balanced classes (see Section 3), we evaluate performance using accuracy. For each of the 10 models (five in the in-genre settings, five in the cross-genre settings), we calculate the accuracy over the two classes, i.e. F and M. In order to derive two final scores, one for the in-genre and one for the cross-genre settings, we simply average over the five accuracies obtained per genre. In-genre:

  Acc_in-genre = (1/5) * sum_{j=1}^{5} Acc_in-genre,j

and cross-genre:

  Acc_cross-genre = (1/5) * sum_{j=1}^{5} Acc_cross-genre,j

We keep the two scorings separate. For determining the official "winner", we consider the cross-genre ranking, as it is more specific to this task.

Baselines. For all settings, given that the datasets are balanced for gender distribution, random assignment yields a 50% accuracy baseline.

3 Data

In order to test the portability and stability of profiling models across genres, we created datasets from five genres. We describe them below, together with the format and train/test split of the materials distributed to participants. In Figure 1 we provide a few samples to illustrate the variety of the data, and the format provided to the participants (Section 3.3).

3.1 Genres

We selected data from the following genres, grounding our choice on both availability and wide variety.

Twitter. Tweets were downloaded using the Twitter API (https://developer.twitter.com/) and a language identification module to restrict the selection to Italian messages. Names extracted from usernames were matched with a list of unambiguous male and female names.

YouTube. YouTube comments were scraped using the YouTube API and an available scraper (https://github.com/philbot9/youtube-comment-scraper). Videos were pre-selected manually with the aim of avoiding gender biases, resulting in a selection from a few general topics: travel, music, documentaries, politics. The names of the comments' authors are visible, and gender was automatically assigned by matching first names to the same list of male and female proper names used for the Twitter dataset.

Children writing. This dataset is a collection of essays written by Italian L1 learners during the first and second year of lower secondary school, called CItA (Corpus Italiano di Apprendenti L1; Barbagli et al., 2016). CItA contains essays written by the same students, chronologically ordered and covering a two-year temporal span. The corpus contains a total of 1,352 essays, written by 153 students the first year and 155 the second year. It was collected during the two school years 2012–2013 and 2013–2014.

News/journalism. This dataset was created by scraping two famous Italian online newspapers (La Repubblica and Corriere della Sera) and selecting only single-authored newspaper articles. Gender assignment was done manually.

Personal diaries. In order to include personal writing which is more distant from social media, we collected personal diaries that are freely available as part of the Fondazione Archivio Diaristico Nazionale della Città di Pieve Santo Stefano (http://archiviodiari.org/index.php/iniziative-e-progetti/brani-di-dirai.html). The documents are of varying but comparable sizes, and the author's name is clearly specified in their metadata. Gender assignment was done manually.

3.2 Train and test sets

For each genre we have a portion of training and a portion of test data. The distribution of gender labels was controlled for in each dataset (50/50). Additionally, we aimed at providing sets of comparable sizes in terms of tokens, so as to avoid including training size as a relevant factor. This was intended for test, too, so as to have the same amount of evaluation samples, but due to limited availability we eventually used a smaller test set for the Diary genre.

Table 2 shows the size of the training and test sets in terms of tokens and authors. The datasets are composed of texts written by multiple users, with possibly multiple documents per user. It is also possible that in the Twitter and YouTube datasets, different texts by the same user ended up both in training and in test. As for the Children writing dataset, training and test contain texts written by the same 155 children. Differently from the Children training set, the Children test set is composed of texts written on the same topic at the same time, at the end of the two school years. For News and Diaries we made sure no author was included in both training and test. We did not balance the number of users per genre, nor the number of documents per user, assuming these to be rather natural conditions.

Table 2: Size of datasets and label distribution.

               TRAINING                  TEST
  Genre        F     M     Tokens       F     M     Tokens
  Children    100   100    65,986      100   100    48,913
  Diaries     100   100    82,989       37    37    22,209
  Journalism  100   100   113,437      100   100   112,036
  Twitter    3000  3000   101,534     3000  3000   129,846
  YouTube    2200  2200    90,639     2200  2200    61,008

Figure 1: Sample instances of all five genres from the training sets, as distributed to participants. Children, Diaries, and Journalism samples are cut due to space constraints.

  Twitter: @edmond644 @ilsussidiario Sarebbe vero se li avessimo eletti ma, non avendolo fatto, "altri" se li meritano.

  Children: Questa estate mi sono divertito molto perché mio padre ha preso casa nella località del Circeo. La casa era a due piani, al piano terra c'era un giardino dove il mio gatto sela spassava. C'era molta ombra nel giardino e io mi ci addormivo sempre. Il mare era poco lontano da casa e ci andavamo ogni giorno e giocavamo a fare i subacquei. Siamo andati a mangiare la pizza fuori ed era molto buona.

  YouTube: alla fine esce sempre il tuo lato gattaro! sei forte! bellissimo video come sempre!

  Journalism: Elogio alla longevità, l'intervista bresciana a Rita Levi Montalcini. Trent'anni fa il Nobel a Rita Levi Montalcini. Ecco l'ultima intervista bresciana a cura di Luisa Monini: «I giovani credano nei valori, i miei collaboratori sono tutte donne» Tra le numerose interviste che Rita Levi Montalcini ha avuto la bontà di concedermi, mi piace ricordare l'ultima, quella dei suoi 100 anni. Eravamo nello studio della sua Fondazione e lei era particolarmente serena, disponibile. Elegante come sempre. [...]

  Diaries: 23.9.80 Sergio, volutamente stai coinvolgendo Alessandro in questa nostra situazione, invece di tenerlo fuori: sai quanto è sensibile, quanto è fragile, quanto è difficile anche - né puoi ignorare che non solo lui in particolare ma nessun ragazzino di 14 anni è in grado di subire o di affrontare o di sostenere una prova così dolorosa. Lo stai distruggendo, impedendogli di riflettere da solo, martellando di parole (o scritti addirittura, come quella tua dichiarazione) per sentirti meno solo o per annullare la sua volontà e imporgli la tua, come volevi fare con me: ma non ti rendi conto che non è amore il tuo, [...]

3.3 Format

The data was distributed as simil-XML. The format can be seen in Figure 1. Although we distributed one file per genre, we still included the genre information in the XML so as to ease the combination of the different files.

4 Participants and Results

Following a call for interest, 15 teams registered for the task and thus obtained the training data. Eventually, three teams submitted their predictions, for a total of 50 runs. Three different runs were allowed per task, and one team experimented with three different models, submitting three different predictions for each of the 10 subtasks. A summary of participants is provided in Table 3, while Tables 4 and 5 report the final results on the test sets of the EVALITA 2018 GxG Task.

CapetownMilanoTirana proposed a classifier based on Support Vector Machines (SVM) as the learning algorithm. They tested different n-gram features extracted at the word level as well as at the character level. In addition, they experimented with feature abstraction, transforming each word into a list of symbols and computing the length of the obtained word and its frequency (Basile et al., 2018).

UniOr tested several binary classifiers based on different learning algorithms. For the official run, they used their two best systems, based on Logistic Regression (LR) and Random Forest (RF) depending on the dataset analysed. As features, they exploited linguistic parameters extracted using stylometric analysis, such as vocabulary richness, use of the first or third person, etc. (This team's participation was not followed by a system description paper.)

ItaliaNLP tested three different classification models: one based on a linear SVM, and two based on Bi-directional Long Short Term Memory (Bi-LSTM) networks. The two deep neural network architectures use two layers of Bi-LSTM: the first Bi-LSTM layer encodes each sentence as a token sequence, and the second layer encodes the sentence sequence. These two architectures differ in the learning approaches they use: Single-Task Learning (STL) and Multi-Task Learning (MTL) (Cimino et al., 2018).

Table 3: Participants to the EVALITA 2018 GxG Task with number of runs.

  Team Name              Research Group                                       # Runs
  CapetownMilanoTirana   Symanto Research, CoGrammar, freelance researcher    10
  UniOr                  Università Orientale di Napoli                       10
  ItaliaNLP Lab          ItaliaNLP Lab, ILC-CNR, Pisa                         30

Table 4: Results in terms of accuracy of the EVALITA 2018 GxG In-Domain Task.

  Team Name-Model            CH      DI      JO      TW      YT      TOT
  CapetownMilanoTirana-SVM   0.615   0.635   0.480   0.545   0.547   0.564
  UniOr-LR-RF                0.550   0.550   0.585   0.490   0.500   0.535
  ItaliaNLP Lab-SVM          0.550   0.649   0.555   0.567   0.555   0.575
  ItaliaNLP Lab-STL          0.545   0.541   0.500   0.595   0.512   0.538
  ItaliaNLP Lab-MTL          0.640   0.676   0.470   0.561   0.546   0.578
  avg-Accuracy               0.580   0.610   0.518   0.552   0.532   0.558

Table 5: Results in terms of accuracy of the EVALITA 2018 GxG Cross-Domain Task.

  Team Name-Model            CH      DI      JO      TW      YT      TOT
  CapetownMilanoTirana-SVM   0.535   0.635   0.515   0.555   0.503   0.549
  UniOr-LR-RF                0.525   0.550   0.415   0.500   0.500   0.498
  ItaliaNLP Lab-SVM          0.540   0.514   0.505   0.586   0.513   0.532
  ItaliaNLP Lab-STL          0.640   0.554   0.495   0.609   0.510   0.562
  ItaliaNLP Lab-MTL          0.535   0.595   0.510   0.500   0.500   0.528
  avg-Accuracy               0.555   0.570   0.488   0.550   0.505   0.534

5 Analysis and Discussion

In this section we provide both a discussion of the approaches and an analysis of the results.

5.1 Approaches

Participants experimented with more classical machine learning approaches as well as with neural networks. Results show that while neural models achieve globally more accurate results, feature-engineered SVMs are just as competitive. This holds both in the in-genre and in the cross-genre settings.

All models suffer to some extent from the shift to cross-genre, though the CapetownMilanoTirana-SVM system appears to be the most robust. This might be due more to the choice of (abstract) features than to the learning algorithm itself. This system also employs bleaching (a technique that fades out the lexicon in favour of more abstract token representations) in this GxG cross-genre setting, after it had shown promise in the cross-lingual profiling task where it was first introduced (van der Goot et al., 2018). However, from their cross-validation results on training data, where they also perform an evaluation of feature contribution, it seems that bleaching in this context does not yield the expected benefits (Basile et al., 2018).

The use of external resources was globally little explored, with the exception of generic word embeddings (ItaliaNLP Lab). While such embeddings do not seem to have contributed much to performance, specialised lexica or embeddings could be something to investigate in the future.

From a learning-settings perspective, teams chose quite straightforward strategies. In-genre, all models were trained using data from the target genre only, with the exception of the ItaliaNLP Lab-MTL model, where the adopted multi-task strategy could use knowledge from other genres even in the in-genre settings. This seems confirmed by comparing the ItaliaNLP Lab-MTL results with those of its twin model ItaliaNLP Lab-STL (the same architecture in a single-task setting). In the cross-genre scenario, all systems used as training data all of the available datasets apart from the dataset of the target genre. It appears that no team tried to exploit functions of (dis)similarity among genres in order to select sub-portions of data for training when testing on a given genre. The only model that differs in how it uses the training data from the other genres is ItaliaNLP Lab-MTL, but its performance in the cross-genre setting indicates that this approach is not robust for this task.

5.2 Results

The in-domain results (Table 4) are useful to identify which genres are overall easier to model in terms of the author's gender, and provide an overview of gender detection in Italian.

As could be expected, Diaries are the easiest genre to model. This might result from the fact that the texts are longer, and are characterised by a more personal and subjective writing style. For example, the collected diaries present an extensive use of first and second person singular verbs and a higher distribution of possessive adjectives.
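The TOT columns in Tables 4 and 5 are plain macro-averages of the five per-genre accuracies, as defined in Section 2. As a quick check, a minimal sketch reproducing the in-genre total of the CapetownMilanoTirana-SVM run from its per-genre scores in Table 4:

```python
# In-genre accuracies of CapetownMilanoTirana-SVM (Table 4),
# in the order CH, DI, JO, TW, YT.
per_genre = [0.615, 0.635, 0.480, 0.545, 0.547]

# Final score: unweighted mean over the five genres.
total = sum(per_genre) / len(per_genre)
print(round(total, 3))  # 0.564, matching the TOT column
```

The same averaging, applied to the cross-genre columns of Table 5, yields the totals used for the official cross-genre ranking.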
Due to availability, the Diaries test set is smaller than the others, thus providing fewer instances for evaluation (see Table 2) and possibly weaker reliability of results. However, from the analysis of results reported by Basile et al. (2018), we see that even in the cross-validated training set (100 samples), accuracy on Diaries is the highest of the five genres.

We also see that Children writings carry a better signal towards gender detection than social media. This might be due to the fact that the Children test set is composed of documents characterised by a common prompt (in the original collection settings, this was meant to provide evidence of how students perceive the different writing instructions received in the considered school years). This feature makes the children's texts a reflexive textual typology, typically characterised by a more subjective writing style, as we observed also for Diaries.

Both Twitter and YouTube score above the 50% baseline, but are clearly harder to model. This could be due to the short texts, which in some cases offer very little evidence to go by. Compared to previous results reported for gender detection on Twitter in Italian, as obtained at the 2015 PAN Lab challenge on author profiling (Rangel et al., 2015) and on the TwiSty dataset (Verhoeven et al., 2016), scores at GxG are substantially lower (PAN 2015's best performance on Italian: .8611; TwiSty's reported F-measure: .7329). The reason for this could be that in the PAN Lab and TwiSty datasets authors are represented by a collection of tweets, while we do not control for this aspect at all, under the assumption that if gender traits emerge, they should not depend on having large evidence from a single author. It could therefore be that PAN's and TwiSty's results are inflated by partially modelling authors rather than gender. Another interesting observation regarding Twitter is that when cross-validating the training set, Basile et al. (2018) report accuracy in the 70s, while their in-genre results on the test data are just above 50% (Table 4).

The most difficult genre is Journalism. While its texts can be as long as diaries, results suggest that the required jargon and writing conventions typical of this genre overshadow the gender signal. Moreover, while we selected only articles written by a single author, there is always the chance that revisions were made by an editor before the piece was published. This is the only genre where some models fail to beat the 50% baseline. However, it is interesting to note that the highest in-genre score is achieved by UniOr, which uses a selection of stylometric features (e.g., Koppel et al., 2002; Schler et al., 2006), which have long been thought to capture unconscious behaviour better than mere lexical choice (on which most of the other models are based, as they mainly use n-grams). The highest Journalism score in the cross-genre settings is achieved by CapetownMilanoTirana.

Cross-genre, we observe that results are on average lower, but only by 2.4 percentage points (55.8 vs 53.4), which is less than one would expect. Some models clearly drop more heavily from in-genre to cross-genre (ItaliaNLP Lab-MTL: .578 vs .528 average accuracy; ItaliaNLP Lab-SVM: .575 vs .532; UniOr: .535 vs .498). However, others appear more stable in both settings (CapetownMilanoTirana-SVM: .564 vs .549), or even better at cross- rather than in-genre prediction (ItaliaNLP Lab-STL: .538 vs .562).

From a genre perspective, the drop is more substantial for some genres, with Diaries losing the most, though with large variation across systems. For example, the model that achieves the best performance on Diaries in-genre (ItaliaNLP Lab-MTL, .676) suffers a drop of almost eight percentage points on the same dataset cross-genre (.595). Conversely, CapetownMilanoTirana preserves a high performance on the Diaries test set in both in- and cross-genre settings (.635 in both), yielding the highest cross-genre performance on this genre. Twitter shows the least variation between in- and cross-genre testing. Not only are the losses minimal for all systems, but in some cases we even observe higher scores in the cross-genre setting. ItaliaNLP Lab-STL obtains the highest score on this test set in both settings, but its performance is higher cross- than in-genre (.609 vs .595).

Finally, some possibly interesting insights also emerge from looking at the precision and recall for the two classes (which we do not report in tables, as there are too many data points to show). For example, we observe that for some genres a single gender ends up being assigned almost exclusively by all classifiers. This happens in the cross-genre settings, while the same test set receives rather balanced assignments of the two classes in the in-genre settings. In the case of Journalism, in cross-genre modelling all systems almost exclusively predict female as the author's gender. At the opposite side of the spectrum we find YouTube, where almost all test instances are classified as male by almost all systems, cross-genre. In this case, though, we also see high recall for male in the in-genre setting, though not as dominant. While more grounded considerations are left to a deeper analysis in the future, we could speculate that some genres are globally seen by classifiers as more characteristic of one gender or the other, as learnt from a large amount of mixed-genre, gender-balanced data.

6 Conclusions

Gender detection was for the first time the focus of a dedicated task at EVALITA. The GxG task specifically focused on comparing performance within and across genres, building on previous observations that high-performing systems were likely to be modelling datasets rather than gender, as their accuracy substantially dropped when tested on different, even related, domains.

Results from 50 different runs were mostly above baseline for most prediction tasks, both in-genre and cross-genre, but not particularly high overall. Also, the drop between in-genre and cross-genre performance is noticeable, but marginal. Neural models appear to perform only slightly better than a more classic SVM which leverages character and word n-grams. The use of the recently introduced text bleaching strategy among the engineered features (aimed at reducing lexical bias; van der Goot et al., 2018) does not seem to yield the desired performance in the cross-genre settings.

In the near future, it will be interesting to compare these results to human performance, as it has been observed that in profiling, humans do not usually perform better than machines (Flekova et al., 2016; van der Goot et al., 2018), to which they provide complementary performance in terms of correctly predicted instances.

Acknowledgments

We would like to thank Eleonora Cocciu and Claudia Zaghi for their help with data collection and annotation. We are also grateful to the EVALITA chairs for hosting this task.

References

Miguel A. Alvarez-Carmona, A. Pastor López-Monroy, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, and Hugo Jair Escalante. 2015. INAOE's participation at PAN'15: Author profiling task. In CLEF Working Notes.

Alessia Barbagli, Pietro Lucisano, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian learners corpus to study the development of writing competence. In Proceedings of the 10th Conference on Language Resources and Evaluation (LREC 2016), pages 88–95.

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. N-GrAM: New Groningen author-profiling model. In Proceedings of the CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11–14 September, Dublin, Ireland.

Angelo Basile, Gareth Dwyer, and Chiara Rubagotti. 2018. CapetownMilanoTirana for GxG at EVALITA 2018. Simple n-gram based models perform well for gender prediction. Sometimes. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

M. Busger op Vollenbroek, T. Carlotto, T. Kreutz, M. Medvedeva, C. Pool, J. Bjerva, H. Haagsma, and M. Nissim. 2016. GronUP: Groningen user profiling notebook for PAN at CLEF. In CLEF 2016 Evaluation Labs and Workshop, Working Notes.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th evaluation campaign of natural language processing and speech tools for Italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Chris Emmery, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple queries as distant labels for predicting gender on Twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55, Copenhagen, Denmark.

Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016. Analyzing biases in human perception of user age and gender from text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 843–854, Berlin, Germany.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17:401–412.

Maria Medvedeva, Hessel Haagsma, and Malvina Nissim. 2017. An analysis of cross-genre and in-genre performance for author profiling in social media. In Experimental IR Meets Multilinguality, Multimodality, and Interaction – 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11–14, 2017, pages 211–223.

Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd author profiling task at PAN 2015. In CLEF.

Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings, pages 750–784.

Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. In CLEF Working Notes.

Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 199–205. AAAI.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 2: Short Papers, pages 383–389.

Ben Verhoeven, Walter Daelemans, and Barbara Plank. 2016. TwiSty: A multilingual Twitter stylometry corpus for gender and personality profiling. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).