Overview of the EVALITA 2018
Cross-Genre Gender Prediction (GxG) Task
Felice Dell’Orletta
ItaliaNLP Lab, ILC-CNR
Pisa, Italy
felice.dellorletta@ilc.cnr.it

Malvina Nissim
CLCG, University of Groningen
The Netherlands
m.nissim@rug.nl
Abstract

English. The Gender Cross-Genre (GxG) task is a shared task on author profiling (in terms of gender) on Italian texts, with a specific focus on cross-genre performance. This task has been proposed for the first time at EVALITA 2018, providing different datasets from different textual genres: Twitter, YouTube, Children writing, Journalism, Personal diaries. Results from a total of 50 different runs show that the task is difficult to learn in itself: while almost all runs beat a 50% baseline, no model reaches an accuracy above 70%. We also observe that cross-genre modelling yields a drop in performance, but not as substantial as one would expect.

Italiano. GxG (Gender Cross-Genre) is the first evaluation campaign for identifying the gender of the author of texts written in Italian, and belongs to the area of study known as author profiling. In this edition of GxG, particular attention was paid to evaluating systems in cross-domain settings. The textual domains examined were: Twitter, YouTube, Children writing, Journalism, Personal diaries. The results obtained from a total of 50 different runs (produced by three different research groups) show that the task is hard: while almost all tested systems beat the 50% baseline, no model reaches an accuracy above 70%. We also observe that the results obtained in the cross-domain setting show a performance drop that is not as substantial as one might have expected.

1 Introduction

As publishing has become more and more accessible and basically cost-free, virtually anyone can get their words spread, especially online. Such ease of disseminating content doesn't necessarily go together with author identifiability. In other words: it's very simple for anyone to publicly write any text, but it isn't equally simple to always tell who the author of a text is.

In the interest of companies who want to advertise, or of legal institutions, finding out at least some characteristics of an author is of crucial importance. Author profiling is the task of automatically discovering latent user attributes from text. Gender, which we focus on in this paper, and which is traditionally characterised as a binary feature, is one such attribute.

Thanks to a series of tasks introduced at the PAN Labs in the last five years (pan.webis.de), and to the production of a variety of gender-annotated datasets focused on social media (e.g. Verhoeven et al., 2016; Emmery et al., 2017), gender prediction has been addressed quite substantially in NLP. State-of-the-art gender prediction on Twitter for English, the most common platform and language used for this task, is approximately 80–85% accuracy (Rangel et al., 2015; Alvarez-Carmona et al., 2015; Rangel et al., 2017; Basile et al., 2017), as obtained at the yearly PAN evaluation campaigns (pan.webis.de).

However, in the context of the 2016 PAN evaluation campaign, a cross-genre setting was introduced for gender prediction on English, Spanish, and Dutch, and the best scores were recorded at an average accuracy of less than 60% (Rangel et al., 2016). This was achieved by training models on tweets, and testing them on datasets from a different source still in the social media domain, namely blogs. To further explore the cross-genre issue,
(Medvedeva et al., 2017) ran additional experiments using PAN data from previous years with the model that had achieved best results at the cross-genre PAN 2016 challenge (Busger op Vollenbroek et al., 2016). The picture they obtain is mixed in terms of accuracy of cross-genre performance, eventually showing that models are not yet general enough to capture gender accurately across different datasets.

This is evidence that we have not yet found the actual dataset-independent features that indeed capture the way females and males might write differently. To address this issue, we designed a task specifically focused on cross-genre gender detection, and launched it within the EVALITA 2018 campaign (Caselli et al., 2018). The rationale behind the cross-genre setting is as follows: if we can make gender prediction stable across very different genres, then we are more likely to have captured deeper gender-specific traits rather than dataset characteristics. As a by-product, this task yields a variety of models for gender prediction in Italian, and also sheds light on which genres favour or hinder gender expression, by looking at whether they are easier or harder to model.

While Italian has featured in multi-lingual gender prediction at PAN (Rangel et al., 2015), this is the first task that addresses author profiling for Italian specifically, within and across genres.

2 Task

GxG (Gender Cross-Genre) is a task on author profiling (in terms of gender) on Italian texts, with a specific focus on cross-genre performance. Given a (collection of) text(s) from a specific genre, the gender of the author has to be predicted. The task is cast as a binary classification task, with gender represented as F (female) or M (male).

Evaluation settings were designed bearing in mind the question at the core of this task: are there indicative traits across genres that can be leveraged to model gender in a rather genre-independent way?

We hoped to provide some answers to this question by making participants train and test their models on datasets from different genres. For comparison, participants were also recommended to submit genre-specific models, i.e., models tested on the very same genre they were trained on. In-genre modelling can (i) shed light on which genres might be easier to model, i.e. where gender traits are more prominent; and (ii) make it easier to quantify the loss when modelling gender across genres. Therefore, the gender prediction task must be done in two ways:

• using a model which has been trained on the same genre

• using a model which has been trained on anything but that genre.

We selected five different genres (Section 3), and asked participants to submit up to ten different models, as per the overview in Table 1. Obviously, if a participant wanted to have one single model for everything, they could submit that one model for all settings. In the cross-genre setting, the only constraint is not to use in training any single instance from the genre being tested on. Other than that, participants were free to combine the other datasets as they wished.

Participants were also free to use external resources, provided the cross-genre settings were carefully preserved and everything used was described in detail in their final report.

Measures As is standard in binary classification tasks with balanced classes (see Section 3), we evaluate performance using accuracy. For each of the 10 models, five in the in-genre settings and five in the cross-genre settings, we calculate the accuracy over the two classes, i.e. F and M. To derive two final scores, one for the in-genre and one for the cross-genre settings, we simply average the five accuracies obtained per genre. In-genre:

\[ Acc_{in\text{-}genre} = \frac{\sum_{j=1}^{5} Acc_{in\text{-}genre}^{j}}{5} \]

and cross-genre:

\[ Acc_{cross\text{-}genre} = \frac{\sum_{j=1}^{5} Acc_{cross\text{-}genre}^{j}}{5} \]

We keep the two scorings separate. For determining the official “winner”, we consider the cross-genre ranking, as it is more specific to this task.

Baselines For all settings, given that the datasets are balanced for gender distribution, random assignment yields 50% accuracy.
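The per-genre accuracies and their averages can be checked directly against the totals reported later in Tables 4 and 5, which are exactly this mean over the five genres. A minimal sketch, with values copied from Table 4:

```python
# Recompute the final scores defined in Section 2: the TOT column in
# Tables 4 and 5 is the plain average of the five per-genre accuracies
# (CH, DI, JO, TW, YT).

def total_score(per_genre_accuracies):
    """Average accuracy over the five genres, as in the task's formula."""
    return sum(per_genre_accuracies) / len(per_genre_accuracies)

# Per-genre in-genre accuracies copied from Table 4.
capetown_svm  = [0.615, 0.635, 0.480, 0.545, 0.547]
italianlp_svm = [0.550, 0.649, 0.555, 0.567, 0.555]

print(round(total_score(capetown_svm), 3))   # 0.564, as reported in Table 4
print(round(total_score(italianlp_svm), 3))  # 0.575, as reported in Table 4
```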
Table 1: Models to submit.
IN - GENRE CROSS - GENRE
Twitter in-genre model non-Twitter model for Twitter
YouTube in-genre model non-YouTube model for YouTube
Children in-genre model non-Children model for Children
Journalism in-genre model non-Journalism model for Journalism
Diaries in-genre model non-Diaries model for Diaries
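The scheme in Table 1 reduces to a simple data-selection rule: an in-genre model may see only the target genre, while a cross-genre model may see everything except it. A minimal sketch, where the dict-of-lists layout of `datasets` is an illustrative assumption rather than the official distribution format:

```python
# Build the two training sets implied by Table 1. The data layout
# (genre -> list of (text, gender) pairs) is an illustrative assumption.

GENRES = ["Twitter", "YouTube", "Children", "Journalism", "Diaries"]

def in_genre_train(datasets, target):
    # In-genre: train only on the genre that will be tested.
    return list(datasets[target])

def cross_genre_train(datasets, target):
    # Cross-genre: not a single instance of the target genre may be used;
    # all the other genres may be combined freely.
    return [ex for genre in GENRES if genre != target for ex in datasets[genre]]
```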
3 Data

In order to test the portability and stability of profiling models across genres, we created datasets from five genres. We describe them below, together with the format and train/test split of the materials distributed to participants. In Figure 1 we provide a few samples to illustrate the variety of the data, and the format provided to the participants (Section 3.3).

3.1 Genres

We selected data from the following genres, grounding our choice on both availability and wide variety.

Twitter Tweets were downloaded using the Twitter API and a language identification module to restrict the selection to Italian messages¹. Names from usernames were matched with a list of unambiguous male and female names.

YouTube YouTube comments were scraped using the YouTube API and an available scraper². Videos were pre-selected manually with the aim of avoiding gender biases, resulting in a selection from a few general topics: travel, music, documentaries, politics. The names of the comments’ authors are visible, and gender was automatically assigned by matching first names against the same list of male and female proper names used for the Twitter dataset.

Children writing This dataset is a collection of essays written by Italian L1 learners collected during the first and second year of lower secondary school, called CItA (Corpus Italiano di Apprendenti L1, (Barbagli et al., 2016)). CItA contains essays written by the same students, chronologically ordered and covering a two-year temporal span. The corpus contains a total of 1,352 essays written by 153 students the first year and 155 the second year. It was collected during the two school years 2012–2013 and 2013–2014.

News/journalism This dataset was created by scraping two famous Italian online newspapers (La Repubblica and Corriere della Sera) and selecting only single-authored newspaper articles. Gender assignment was done manually.

Personal diaries In order to include personal writing that is more distant from social media, we collected personal diaries that are freely available as part of the Fondazione Archivio Diaristico Nazionale della Città di Pieve Santo Stefano³. The documents are of varying but comparable sizes, and the author’s name is clearly specified in their metadata. Gender assignment was done manually.

3.2 Train and test sets

For each genre we have a portion of training and a portion of test data. The distribution of gender labels was controlled for in each dataset (50/50). Additionally, we aimed at providing sets of comparable sizes in terms of tokens, so as to avoid including training size as a relevant factor. This was intended for the test sets too, so as to have the same amount of evaluation samples, but due to limited availability we eventually used a smaller test set for the Diary genre.

Table 2 shows the size of the training and test sets in terms of tokens and authors. The datasets are composed of texts written by multiple users, with possibly multiple documents per user. It is also possible that in the Twitter and YouTube datasets, different texts by the same user ended up both in training and in test. As for the Children writing dataset, training and test contain texts written by the same 155 children. Differently from the Children training set, the Children test set is composed of texts written on the same

¹ https://developer.twitter.com/
² https://github.com/philbot9/youtube-comment-scraper
³ http://archiviodiari.org/index.php/iniziative-e-progetti/brani-di-dirai.html
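For Twitter and YouTube, gender labels were assigned automatically by matching authors’ first names against a list of unambiguous male and female names. A rough sketch of that matching step, with tiny stand-in name sets (the actual list used by the organisers is much larger):

```python
# Sketch of first-name-based gender assignment as described for the
# Twitter and YouTube datasets. The two name sets below are tiny
# illustrative stand-ins for the real list of unambiguous Italian names.

MALE_NAMES = {"marco", "luca", "giuseppe", "paolo"}
FEMALE_NAMES = {"giulia", "francesca", "anna", "chiara"}

def assign_gender(display_name):
    """Return 'M'/'F' if the first token of the display name is an
    unambiguous first name, else None (the author is discarded)."""
    stripped = display_name.strip()
    first = stripped.split()[0].lower() if stripped else ""
    if first in MALE_NAMES and first not in FEMALE_NAMES:
        return "M"
    if first in FEMALE_NAMES and first not in MALE_NAMES:
        return "F"
    return None
```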
Twitter
@edmond644 @ilsussidiario Sarebbe vero se li avessimo eletti ma,
non avendolo fatto, "altri" se li meritano.
Children
Questa estate mi sono divertito molto perché mio padre ha preso
casa nella località del Circeo. La casa era a due piani, al piano
terra c’era un giardino dove il mio gatto sela spassava. C’era
molta ombra nel giardino e io mi ci addormivo sempre. Il mare era
poco lontano da casa e ci andavamo ogni giorno e giocavamo a fare
i subacquei. Siamo andati a mangiare la pizza fuori ed era molto
buona.
YouTube
alla fine esce sempre il tuo lato gattaro! sei forte! bellissimo
video come sempre!
Journalism
Elogio alla longevità, l’intervista bresciana a Rita Levi
Montalcini
Trent’anni fa il Nobel a Rita Levi Montalcini. Ecco l’ultima
intervista bresciana a cura di Luisa Monini: «I giovani credano
nei valori, i miei collaboratori sono tutte donne»
Tra le numerose interviste che Rita Levi Montalcini ha avuto la
bontà di concedermi, mi piace ricordare l’ultima, quella dei suoi
100 anni. Eravamo nello studio della sua Fondazione e lei era
particolarmente serena, disponibile. Elegante come sempre. [...]
Diaries
23.9.80
Sergio, volutamente stai coinvolgendo Alessandro in questa nostra
situazione, invece di tenerlo fuori: sai quanto è sensibile,
quanto è fragile, quanto è difficile anche - né puoi ignorare
che non solo lui in particolare ma nessun ragazzino di 14 anni
è in grado di subire o di affrontare o di sostenere una prova così
dolorosa.
Lo stai distruggendo, impedendogli di riflettere da solo,
martellando di parole (o scritti addirittura, come quella tua
dichiarazione) per sentirti meno solo o per annullare la sua
volontà e imporgli la tua, come volevi fare con me: ma non ti
rendi conto che non è amore il tuo, [...]
Figure 1: Sample instances of all five genres from the training sets, as distributed to participants. Chil-
dren, Diaries, and Journalism samples are cut due to space constraints.
Table 2: Size of datasets and label distribution.
T RAINING T EST
Genre F M Tokens F M Tokens
Children 100 100 65,986 100 100 48,913
Diaries 100 100 82,989 37 37 22,209
Journalism 100 100 113,437 100 100 112,036
Twitter 3000 3000 101,534 3000 3000 129,846
YouTube 2200 2200 90,639 2200 2200 61,008
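Section 3.3 below notes that the data was distributed in a simil-XML format, one file per genre, with the genre repeated inside the markup. Since the exact tags are not reproduced here, the following reader assumes a plausible `<doc>` element with `genre` and `gender` attributes, purely for illustration; a regex is used rather than a strict XML parser, since “simil-XML” need not be well-formed XML:

```python
import re

# Hypothetical reader for a simil-XML distribution file. The tag name
# "doc" and the attributes "genre"/"gender" are assumptions made for
# illustration; the actual GxG markup may differ.

DOC_RE = re.compile(r'<doc\b([^>]*)>(.*?)</doc>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def read_simil_xml(raw):
    """Yield (attributes, text) pairs from a simil-XML string."""
    for match in DOC_RE.finditer(raw):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        yield attrs, match.group(2).strip()
```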
topic at the same time, at the end of the two school years. For News and Diaries we made sure no author was included in both training and test. We did not balance the number of users per genre, nor the number of documents per user, assuming these to be rather natural conditions.

3.3 Format

The data was distributed as simil-XML. The format can be seen in Figure 1. Although we distributed one file per genre, we still included the genre information in the XML so as to ease the combination of the different files.

4 Participants and Results

Following a call for interest, 15 teams registered for the task and thus obtained the training data. Eventually, three teams submitted their predictions, for a total of 50 runs. Three different runs were allowed per task, and one team experimented with three different models, submitting three different predictions for each of the 10 subtasks. A summary of participants is provided in Table 3, while Tables 4 and 5 report the final results on the test sets of the EVALITA 2018 GxG Task.

CapetownMilanoTirana proposed a classifier based on a Support Vector Machine (SVM) as the learning algorithm. They tested different n-gram features extracted at the word level as well as at the character level. In addition, they experimented with feature abstraction, transforming each word into a list of symbols and computing the length of the obtained word and its frequency (Basile et al., 2018).

UniOr tested several binary classifiers based on different learning algorithms. For the official run, they used their two best systems, based on Logistic Regression (LR) and Random Forest (RF) depending on the dataset analysed. As features, they exploited linguistic parameters extracted using stylometric analysis, such as vocabulary richness, use of the first or third person, etc.⁴

⁴ The participation of this team was not followed by a system description paper.

ItaliaNLP tested three different classification models: one based on a linear SVM, and two based on Bi-directional Long Short Term Memory (Bi-LSTM) networks. The two deep neural network architectures use two layers of Bi-LSTMs: the first Bi-LSTM layer encodes each sentence as a token sequence, and the second layer encodes the sentence sequence. These two architectures differ in the learning approaches they use: Single-Task Learning (STL) and Multi-Task Learning (MTL) (Cimino et al., 2018).

5 Analysis and Discussion

In this section we provide both a discussion of the approaches and an analysis of the results.

5.1 Approaches

Participants experimented with more classical machine learning approaches as well as with neural networks. Results show that while neural models achieve globally more accurate results, feature-engineered SVMs are competitive. This holds both in the in-genre and in the cross-genre settings.

All models suffer to some extent from the shift to cross-genre, though the CapetownMilanoTirana-SVM system appears to be the most robust. This might be due more to the choice of (abstract) features than to the learning algorithm itself. This system also employs bleaching (a technique that fades out the lexicon in favour of more abstract token representations) in the GxG cross-genre setting, after it had shown promise in a cross-lingual profiling task, where it was first introduced (van der Goot et al., 2018). However, from their cross-validation
Table 3: Participants to the EVALITA 2018 GxG Task with number of runs.
Team Name Research Group # Runs
CapetownMilanoTirana Symanto Research, CoGrammar, freelance researcher 10
UniOr Università Orientale di Napoli 10
ItaliaNLP Lab ItaliaNLP Lab, ILC-CNR, Pisa 30
Table 4: Results in terms of Accuracy of the EVALITA 2018 GxG In-Domain Task.
Team Name-Model CH DI JO TW YT TOT
CapetownMilanoTirana-SVM 0.615 0.635 0.480 0.545 0.547 0.564
UniOr-LR-RF 0.550 0.550 0.585 0.490 0.500 0.535
ItaliaNLP Lab-SVM 0.550 0.649 0.555 0.567 0.555 0.575
ItaliaNLP Lab-STL 0.545 0.541 0.500 0.595 0.512 0.538
ItaliaNLP Lab-MTL 0.640 0.676 0.470 0.561 0.546 0.578
avg-Accuracy 0.580 0.610 0.518 0.552 0.532 0.558
Table 5: Results in terms of Accuracy of the EVALITA 2018 GxG Cross-Domain Task.
Team Name-Model CH DI JO TW YT TOT
CapetownMilanoTirana-SVM 0.535 0.635 0.515 0.555 0.503 0.549
UniOr-LR-RF 0.525 0.550 0.415 0.500 0.500 0.498
ItaliaNLP Lab-SVM 0.540 0.514 0.505 0.586 0.513 0.532
ItaliaNLP Lab-STL 0.640 0.554 0.495 0.609 0.510 0.562
ItaliaNLP Lab-MTL 0.535 0.595 0.510 0.500 0.500 0.528
avg-Accuracy 0.555 0.570 0.488 0.550 0.505 0.534
results on training data, where they also perform an evaluation of feature contribution, it seems that bleaching in this context does not yield the expected benefits (Basile et al., 2018).

The use of external resources was globally little explored, with the exception of generic word embeddings (ItaliaNLP Lab). While such embeddings do not seem to have contributed much to performance, specialised lexica or embeddings could be something to investigate in the future.

From a learning settings’ perspective, teams chose quite straightforward strategies. In-genre, all models were trained using data from the target genre only, with the exception of the ItaliaNLP Lab-MTL model, where the adopted multi-task strategy could use knowledge from other genres, even in the in-genre settings. This seems confirmed by comparing ItaliaNLP Lab-MTL’s results with those of the twin model ItaliaNLP Lab-STL (the same architecture with a single-task setting).

In the cross-genre scenario, all systems used as training data all of the available datasets apart from the dataset of the target genre. It appears that no team tried to exploit measures of (dis)similarity among genres in order to select sub-portions of data for training when testing on a given genre. The only model that differs in how the training data from the other genres is used is ItaliaNLP Lab-MTL, but its performance in the cross-genre setting indicates that this approach is not robust for this task.

5.2 Results

The in-domain results (Table 4) are useful to identify which genres are overall easier to model in terms of the author’s gender, and provide an overview of gender detection in Italian.

As could be expected, Diaries are the easiest genre to model. This might result from the fact that the texts are longer, and are characterised by a more personal and subjective writing style. For example, the collected diaries present an extensive use of first and second person singular verbs and a higher distribution of possessive adjectives.
Due to availability, the Diaries test set is smaller than the others, thus providing fewer instances for evaluation (see Table 2) and possibly weaker reliability of results. However, from the analysis of results reported by (Basile et al., 2018), we see that even in the cross-validated training set (100 samples), accuracy on Diaries is the highest of the five genres.

We also see that Children writings carry a better signal towards gender detection than social media. This might be due to the fact that the Children test set is composed of documents characterised by a common prompt (in the original collection settings, this was meant to provide evidence of how students perceive the different writing instructions received in the considered school years). This feature makes the children texts a reflexive textual typology, typically characterised by a more subjective writing style, as we observed also for Diaries.

Both Twitter and YouTube score above the 50% baseline, but are clearly harder to model. This could be due to the short texts, which in some cases offer very little evidence to go by. Compared to previous results reported for gender detection on Twitter in Italian, as obtained at the 2015 PAN Lab challenge on author profiling (Rangel et al., 2015) and on the TwiSty dataset (Verhoeven et al., 2016), scores at GxG are substantially lower (PAN 2015’s best performance on Italian: .8611, TwiSty’s reported F-measure: .7329). The reason for this could be the fact that in the PAN Lab and TwiSty datasets authors are represented by a collection of tweets, while we do not control for this aspect at all, under the assumption that if gender traits emerge, they should not depend on having large evidence from a single author. It could therefore be that the PAN and TwiSty results are inflated by partially modelling authors rather than gender. Another interesting observation regarding Twitter is that when cross-validating the training set, (Basile et al., 2018) report accuracy in the 70s, while their in-genre results on the test data are just above 50% (Table 4).

The most difficult genre is Journalism. While texts can be as long as diaries, results suggest that the jargon and writing conventions typical of this genre overshadow the gender signal. Moreover, while we selected only articles written by a single author, there is always the chance that revisions were made by an editor before the piece was published. This is the only genre where some models fail to beat the 50% baseline. However, it is interesting to note that the highest in-genre score is achieved by UniOr, which uses a selection of stylometric features (e.g. Koppel et al., 2002; Schler et al., 2006), which have long been thought to capture unconscious behaviour better than just lexical choice (on which most of the other models are based, as they mainly use n-grams). The highest Journalism score in the cross-genre settings is achieved by CapetownMilanoTirana.

Cross-genre, we observe that results are on average lower, but only by 2.5 percentage points (55.8 vs 53.4), which is less than one would expect. Some models clearly drop more heavily from in-genre to cross-genre (ItaliaNLP Lab-MTL: .578 vs .528 average accuracy, ItaliaNLP Lab-SVM: .575 vs .532, UniOr: .535 vs .498). However, others appear more stable in both settings (CapetownMilanoTirana-SVM: .564 vs .549), or even better at cross- rather than in-genre prediction (ItaliaNLP Lab-STL: .538 vs .562).

From a genre perspective, the drop is more substantial for some genres, with Diaries losing the most, though with large variation across systems. For example, the model that achieves the best performance on Diaries in-genre (ItaliaNLP Lab-MTL, .676) suffers a drop of almost eight percentage points on the same dataset cross-genre (.595). Conversely, CapetownMilanoTirana preserves a high performance on the Diaries test set in both in- and cross-genre settings (.635), yielding the highest cross-genre performance on this genre. Twitter shows the least variation between in- and cross-genre testing. Not only are the losses for all systems minimal, but in some cases we even observe higher scores in the cross-genre setting. ItaliaNLP Lab-STL obtains the highest score on this test set in both settings, but the performance is higher for cross- than in-genre (.609 vs .595).

Finally, some possibly interesting insights also emerge from looking at the precision and recall for the two classes (which we do not report in tables as there are too many data points to show). For example, we observe that for some genres only one gender ends up being assigned almost entirely by all classifiers. This happens in the cross-genre settings, while the same test set has rather balanced assignments of the two classes in the in-genre settings. In the case of Journalism, all systems in cross-genre modelling almost only predict female as the author’s gender. At the opposite side of the
spectrum we find YouTube, where almost all test instances are classified as male by almost all systems, cross-genre. In this case, though, we also see high recall for male in the in-genre setting, though not so dominant. While more grounded considerations are left to a deeper analysis in the future, we could speculate that some genres are globally seen by classifiers as more characteristic of one gender or the other, as learnt from a large amount of mixed-genre, gender-balanced data.

6 Conclusions

Gender detection was for the first time the focus of a dedicated task at EVALITA. The GxG task specifically focussed on comparing performance within and across genres, building on previous observations that high-performing systems were likely to be modelling datasets rather than gender, as their accuracy dropped substantially when tested on different, even related, domains.

Results from 50 different runs were mostly above baseline for most prediction tasks, both in-genre and cross-genre, but not particularly high overall. Also, the drop between in-genre and cross-genre performance is noticeable, but marginal. Neural models appear to perform only slightly better than a more classic SVM which leverages character and word n-grams. The use of the recently introduced text bleaching strategy among the engineered features (aimed at reducing lexicon bias (van der Goot et al., 2018)) does not seem to yield the desired performance in the cross-genre settings.

In the near future, it will be interesting to compare these results to human performance, as it has been observed that for profiling, humans do not usually perform better than machines (Flekova et al., 2016; van der Goot et al., 2018), to which they provide complementary performance in terms of correctly predicted instances.

Acknowledgments

We would like to thank Eleonora Cocciu and Claudia Zaghi for their help with data collection and annotation. We are also grateful to the EVALITA chairs for hosting this task.

References

Miguel A. Alvarez-Carmona, A. Pastor López-Monroy, Manuel Montes-y Gómez, Luis Villasenor-Pineda, and Hugo Jair Escalante. 2015. INAOE’s participation at PAN’15: Author profiling task. In CLEF Working Notes.

Alessia Barbagli, Pietro Lucisano, Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian learners corpus to study the development of writing competence. In Proceedings of the 10th Conference on Language Resources and Evaluation (LREC 2016), pages 88–95.

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. N-gram: New Groningen author-profiling model. In Proceedings of the CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11–14 September, Dublin, Ireland.

Angelo Basile, Gareth Dwyer, and Chiara Rubagotti. 2018. CapetownMilanoTirana for GxG at Evalita2018. Simple n-gram based models perform well for gender prediction. Sometimes. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

M. Busger op Vollenbroek, T. Carlotto, T. Kreutz, M. Medvedeva, C. Pool, J. Bjerva, H. Haagsma, and M. Nissim. 2016. Gron-UP: Groningen user profiling notebook for PAN at CLEF. In CLEF 2016 Evaluation Labs and Workshop, Working Notes.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th evaluation campaign of natural language processing and speech tools for Italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell’Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Chris Emmery, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple queries as distant labels for predicting gender on Twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55, Copenhagen, Denmark.

Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016. Analyzing biases in human perception of user age and gender from text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 843–854, Berlin, Germany, August.
Moshe Koppel, Shlomo Argamon, and Anat Rachel
Shimoni. 2002. Automatically categorizing writ-
ten texts by author gender. Literary and Linguistic
Computing, 17:401–412.
Maria Medvedeva, Hessel Haagsma, and Malvina Nis-
sim. 2017. An analysis of cross-genre and in-genre
performance for author profiling in social media. In
Experimental IR Meets Multilinguality, Multimodal-
ity, and Interaction - 8th International Conference of
the CLEF Association, CLEF 2017, Dublin, Ireland,
September 11-14, 2017, pages 211–223.
Francisco Rangel, Paolo Rosso, Martin Potthast,
Benno Stein, and Walter Daelemans. 2015.
Overview of the 3rd author profiling task at PAN
2015. In CLEF.
Francisco Rangel, Paolo Rosso, Ben Verhoeven, Wal-
ter Daelemans, Martin Potthast, and Benno Stein.
2016. Overview of the 4th author profiling task
at PAN 2016: cross-genre evaluations. In Working
Notes Papers of the CLEF 2016 Evaluation Labs.
CEUR Workshop Proceedings, pages 750–784.
Francisco Rangel, Paolo Rosso, Martin Potthast, and
Benno Stein. 2017. Overview of the 5th author profiling
task at PAN 2017: Gender and language variety
identification in Twitter. CLEF Working Notes.
Jonathan Schler, Moshe Koppel, Shlomo Argamon,
and James W. Pennebaker. 2006. Effects of Age
and Gender on Blogging. In AAAI Spring Sympo-
sium: Computational Approaches to Analyzing We-
blogs, pages 199–205. AAAI.
Rob van der Goot, Nikola Ljubesic, Ian Matroos, Malv-
ina Nissim, and Barbara Plank. 2018. Bleaching
text: Abstract features for cross-lingual gender pre-
diction. In Iryna Gurevych and Yusuke Miyao, ed-
itors, Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics, ACL
2018, Melbourne, Australia, July 15-20, 2018, Vol-
ume 2: Short Papers, pages 383–389.
Ben Verhoeven, Walter Daelemans, and Barbara Plank.
2016. TwiSty: a multilingual Twitter stylometry corpus
for gender and personality profiling. In Pro-
ceedings of the 10th International Conference on
Language Resources and Evaluation (LREC 2016).