Overview of the EVALITA 2018 Cross-Genre Gender Prediction (GxG) Task

Felice Dell'Orletta, ItaliaNLP Lab, ILC-CNR, Pisa, Italy — felice.dellorletta@ilc.cnr.it
Malvina Nissim, CLCG, University of Groningen, The Netherlands — m.nissim@rug.nl

Abstract

English. The Gender Cross-Genre (GxG) task is a shared task on author profiling (in terms of gender) on Italian texts, with a specific focus on cross-genre performance. This task has been proposed for the first time at EVALITA 2018, providing different datasets from different textual genres: Twitter, YouTube, Children writing, Journalism, Personal diaries. Results from a total of 50 different runs show that the task is difficult to learn in itself: while almost all runs beat a 50% baseline, no model reaches an accuracy above 70%. We also observe that cross-genre modelling yields a drop in performance, but not as substantial as one would expect.

Italiano (translated). GxG (Gender Cross-Genre) is the first evaluation campaign for identifying the gender of the author of texts written in Italian, and it belongs to the area of study known as author profiling. In this edition of GxG, particular attention was devoted to evaluating systems in cross-domain settings. The textual domains examined were: Twitter, YouTube, Children writing, Journalism, Personal diaries. The results obtained from a total of 50 different runs (produced by three different research groups) show that the task is complex: while almost all the tested systems beat the 50% baseline, no model reaches an accuracy above 70%. We also observe that the results obtained in the cross-domain setting show a drop in performance that is not as substantial as one might have expected.

1 Introduction

As publishing has become more and more accessible and basically cost-free, virtually anyone can get their words spread, especially online. Such ease of disseminating content doesn't necessarily go together with author identifiability. In other words: it's very simple for anyone to publicly write any text, but it isn't equally simple to always tell who the author of a text is.

In the interest of companies who want to advertise, or of legal institutions, finding out at least some characteristics of an author is of crucial importance. Author profiling is the task of automatically discovering latent user attributes from text. Gender, which we focus on in this paper, and which is traditionally characterised as a binary feature, is one such attribute.

Thanks to a series of tasks introduced at the PAN Labs in the last five years (pan.webis.de), and to the production of a variety of gender-annotated datasets focused on social media (e.g., Verhoeven et al., 2016; Emmery et al., 2017), gender prediction has been addressed quite substantially in NLP. State-of-the-art gender prediction on Twitter for English, the most common platform and language used for this task, is approximately 80–85% accuracy (Rangel et al., 2015; Alvarez-Carmona et al., 2015; Rangel et al., 2017; Basile et al., 2017), as obtained at the yearly PAN evaluation campaigns (pan.webis.de).

However, in the context of the 2016 PAN evaluation campaign, a cross-genre setting was introduced for gender prediction on English, Spanish, and Dutch, and best scores were recorded at an average accuracy of less than 60% (Rangel et al., 2016). This was achieved by training models on tweets, and testing them on datasets from a different source still in the social media domain, namely blogs.
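The cross-genre evaluation just described, which GxG adopts as its central setting (Section 2), amounts to a leave-one-genre-out protocol: a model evaluated on a given genre may be trained on data from any genre except that one. A minimal sketch, using the five GxG genre names but otherwise invented function and data names:

```python
# Leave-one-genre-out split behind the cross-genre setting.
# `datasets` maps each genre to a list of (text, gender_label) pairs;
# the toy data below is purely illustrative.
GENRES = ["twitter", "youtube", "children", "journalism", "diaries"]

def cross_genre_split(datasets, target):
    """Train on every genre except `target`; test on `target` only."""
    train = [pair for genre in GENRES if genre != target
             for pair in datasets[genre]]
    test = list(datasets[target])
    return train, test

datasets = {g: [(f"toy {g} text", "F"), (f"another toy {g} text", "M")]
            for g in GENRES}
train, test = cross_genre_split(datasets, "twitter")
print(len(train), len(test))  # 8 2
```

In the in-genre setting, training and test data instead both come from the target genre, with disjoint portions per genre as described in Section 3.2.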
To further explore the cross-genre issue, Medvedeva et al. (2017) ran additional experiments using PAN data from previous years with the model that had achieved the best results at the cross-genre PAN 2016 challenge (Busger op Vollenbroek et al., 2016). The picture they obtain is mixed in terms of cross-genre accuracy, eventually showing that models are not yet general enough to capture gender accurately across different datasets. This is evidence that we have not yet found the actual dataset-independent features that do indeed capture the way females and males might write differently. To address this issue, we have designed a task specifically focused on cross-genre gender detection, and launched it within the EVALITA 2018 campaign (Caselli et al., 2018). The rationale behind the cross-genre settings is as follows: if we can make gender prediction stable across very different genres, then we are more likely to have captured deeper gender-specific traits rather than dataset characteristics. As a by-product, this task yields a variety of models for gender prediction in Italian, also shedding light on which genres, in a way, favour or discourage gender expression, by looking at whether they are easier or harder to model.

While Italian has featured in multi-lingual gender prediction at PAN (Rangel et al., 2015), this is the first task that addresses author profiling for Italian specifically, within and across genres.

2 Task

GxG (Gender Cross-Genre) is a task on author profiling (in terms of gender) on Italian texts, with a specific focus on cross-genre performance. Given a (collection of) text(s) from a specific genre, the gender of the author has to be predicted. The task is cast as a binary classification task, with gender represented as F (female) or M (male).

Evaluation settings were designed bearing in mind the question at the core of this task: are there indicative traits across genres that can be leveraged to model gender in a rather genre-independent way? We hoped to provide some answers to this question by making participants train and test their models on datasets from different genres. For comparison, participants were also recommended to submit genre-specific models, i.e., models tested on the very same genre they were trained on. In-genre modelling can (i) shed light on which genres might be easier to model, i.e. where gender traits are more prominent; and (ii) make it easier to quantify the loss when modelling gender across genres. Therefore, the gender prediction task must be done in two ways:

• using a model which has been trained on the same genre;
• using a model which has been trained on anything but that genre.

We selected five different genres (Section 3), and asked participants to submit up to ten different models, as per the overview in Table 1. Obviously, if one participant wanted to have one single model for everything, they could submit that one model for all settings. In the cross-genre setting, the only constraint is not using in training any single instance from the genre being tested on. Other than that, participants were free to combine the other datasets as they wished. Participants were also free to use external resources, provided the cross-genre settings were carefully preserved, and everything used was described in detail in their final report.

Table 1: Models to submit.

  Genre        IN-GENRE          CROSS-GENRE
  Twitter      in-genre model    non-Twitter model for Twitter
  YouTube      in-genre model    non-YouTube model for YouTube
  Children     in-genre model    non-Children model for Children
  Journalism   in-genre model    non-Journalism model for Journalism
  Diaries      in-genre model    non-Diaries model for Diaries

Measures. As standardly done in binary classification tasks with balanced classes (see Section 3), we evaluate performance using accuracy. For each of the 10 models (five in the in-genre settings, five in the cross-genre settings), we calculate the accuracy over the two classes, i.e. F and M. In order to derive two final scores, one for the in-genre and one for the cross-genre settings, we simply average over the five accuracies obtained per genre. In-genre:

  Acc_in-genre = (1/5) * sum_{j=1}^{5} Acc_in-genre,j

and cross-genre:

  Acc_cross-genre = (1/5) * sum_{j=1}^{5} Acc_cross-genre,j

We keep the two scorings separate. For determining the official "winner", we consider the cross-genre ranking, as it is more specific to this task.

Baselines. For all settings, given that the datasets are balanced for gender distribution, random assignment yields a 50% accuracy baseline.

3 Data

In order to test the portability and stability of profiling models across genres, we created datasets from five genres. We describe them below, together with the format and train/test split of the materials distributed to participants. In Figure 1 we provide a few samples to illustrate the variety of the data, and the format provided to the participants (Section 3.3).

3.1 Genres

We selected data from the following genres, grounding our choice on both availability and wide variety.

Twitter. Tweets were downloaded using the Twitter API (https://developer.twitter.com/) and a language identification module to restrict the selection to Italian messages. Names extracted from usernames were matched with a list of unambiguous male and female names.

YouTube. YouTube comments were scraped using the YouTube API and an available scraper (https://github.com/philbot9/youtube-comment-scraper). Videos were pre-selected manually with the aim of avoiding gender biases, resulting in a selection from a few general topics: travel, music, documentaries, politics. The names of the comments' authors are visible, and gender was automatically assigned by matching first names to the same list of male and female proper names used for the Twitter dataset.

Children writing. This dataset is a collection of essays written by Italian L1 learners during the first and second year of lower secondary school, called CItA (Corpus Italiano di Apprendenti L1; Barbagli et al., 2016). CItA contains essays written by the same students, chronologically ordered and covering a two-year temporal span. The corpus contains a total of 1,352 essays, written by 153 students the first year and 155 the second year. It was collected during the two school years 2012–2013 and 2013–2014.

News/journalism. This dataset was created by scraping two famous Italian online newspapers (La Repubblica and Corriere della Sera) and selecting only single-authored newspaper articles. Gender assignment was done manually.

Personal diaries. In order to include personal writing which is more distant from social media, we collected personal diaries that are freely available as part of the Fondazione Archivio Diaristico Nazionale della Città di Pieve Santo Stefano (http://archiviodiari.org/index.php/iniziative-e-progetti/brani-di-dirai.html). The documents are of varying but comparable sizes, and the author's name is clearly specified in their metadata. Gender assignment was done manually.

3.2 Train and test sets

For each genre we have a portion of training and a portion of test data. The distribution of gender labels was controlled for in each dataset (50/50). Additionally, we aimed at providing sets of comparable sizes in terms of tokens, so as to avoid including training size as a relevant factor. This was intended for test, too, so as to have the same amount of evaluation samples, but due to limited availability we eventually used a smaller test set for the Diary genre.

Table 2 shows the size of the training and test sets in terms of tokens and authors. The datasets are composed of texts written by multiple users, with possibly multiple documents per user. It is also possible that in the Twitter and YouTube datasets, different texts by the same user ended up both in training and in test. As for the Children writing dataset, training and test contain texts written by the same 155 children. Differently from the Children training set, the Children test set is composed of texts written on the same topic at the same time, at the end of the two school years. For News and Diaries we made sure no author was included in both training and test. We did not balance the number of users per genre, nor the number of documents per user, assuming these to be rather natural conditions.

Table 2: Size of datasets and label distribution.

               TRAINING                  TEST
  Genre        F     M     Tokens       F     M     Tokens
  Children    100   100    65,986      100   100    48,913
  Diaries     100   100    82,989       37    37    22,209
  Journalism  100   100   113,437      100   100   112,036
  Twitter    3000  3000   101,534     3000  3000   129,846
  YouTube    2200  2200    90,639     2200  2200    61,008

Figure 1: Sample instances of all five genres from the training sets, as distributed to participants. Children, Diaries, and Journalism samples are cut due to space constraints.

  Twitter: @edmond644 @ilsussidiario Sarebbe vero se li avessimo eletti ma, non avendolo fatto, "altri" se li meritano.

  Children: Questa estate mi sono divertito molto perché mio padre ha preso casa nella località del Circeo. La casa era a due piani, al piano terra c'era un giardino dove il mio gatto sela spassava. C'era molta ombra nel giardino e io mi ci addormivo sempre. Il mare era poco lontano da casa e ci andavamo ogni giorno e giocavamo a fare i subacquei. Siamo andati a mangiare la pizza fuori ed era molto buona.

  YouTube: alla fine esce sempre il tuo lato gattaro! sei forte! bellissimo video come sempre!

  Journalism: Elogio alla longevità, l'intervista bresciana a Rita Levi Montalcini. Trent'anni fa il Nobel a Rita Levi Montalcini. Ecco l'ultima intervista bresciana a cura di Luisa Monini: «I giovani credano nei valori, i miei collaboratori sono tutte donne» Tra le numerose interviste che Rita Levi Montalcini ha avuto la bontà di concedermi, mi piace ricordare l'ultima, quella dei suoi 100 anni. Eravamo nello studio della sua Fondazione e lei era particolarmente serena, disponibile. Elegante come sempre. [...]

  Diaries: 23.9.80 Sergio, volutamente stai coinvolgendo Alessandro in questa nostra situazione, invece di tenerlo fuori: sai quanto è sensibile, quanto è fragile, quanto è difficile anche - né puoi ignorare che non solo lui in particolare ma nessun ragazzino di 14 anni è in grado di subire o di affrontare o di sostenere una prova così dolorosa. Lo stai distruggendo, impedendogli di riflettere da solo, martellando di parole (o scritti addirittura, come quella tua dichiarazione) per sentirti meno solo o per annullare la sua volontà e imporgli la tua, come volevi fare con me: ma non ti rendi conto che non è amore il tuo, [...]

3.3 Format

The data was distributed as simil-XML. The format can be seen in Figure 1. Although we distributed one file per genre, we still included the genre information in the XML so as to ease the combination of the different files.

4 Participants and Results

Following a call for interest, 15 teams registered for the task and thus obtained the training data. Eventually, three teams submitted their predictions, for a total of 50 runs. Three different runs were allowed per task, and one team experimented with three different models, submitting three different predictions for each of the 10 subtasks. A summary of participants is provided in Table 3, while Tables 4 and 5 report the final results on the test sets of the EVALITA 2018 GxG Task.

CapetownMilanoTirana proposed a classifier based on Support Vector Machines (SVM) as the learning algorithm. They tested different n-gram features extracted at the word level as well as at the character level. In addition, they experimented with feature abstraction, transforming each word into a list of symbols and computing the length of the obtained word and its frequency (Basile et al., 2018).

UniOr tested several binary classifiers based on different learning algorithms. For the official run, they used their two best systems, based on Logistic Regression (LR) and Random Forest (RF) depending on the dataset analysed. As features, they exploited linguistic parameters extracted using stylometric analysis, such as vocabulary richness, use of the first or third person, etc. (This team's participation was not followed by a system description paper.)

ItaliaNLP tested three different classification models: one based on a linear SVM, and two based on Bi-directional Long Short Term Memory (Bi-LSTM) networks. The two deep neural network architectures use two layers of Bi-LSTM: the first Bi-LSTM layer encodes each sentence as a token sequence, and the second layer encodes the sentence sequence. These two architectures differ in the learning approaches they use: Single-Task Learning (STL) and Multi-Task Learning (MTL) (Cimino et al., 2018).

Table 3: Participants to the EVALITA 2018 GxG Task with number of runs.

  Team Name              Research Group                                       # Runs
  CapetownMilanoTirana   Symanto Research, CoGrammar, freelance researcher    10
  UniOr                  Università Orientale di Napoli                       10
  ItaliaNLP Lab          ItaliaNLP Lab, ILC-CNR, Pisa                         30

Table 4: Results in terms of accuracy of the EVALITA 2018 GxG In-Domain Task.

  Team Name-Model            CH      DI      JO      TW      YT      TOT
  CapetownMilanoTirana-SVM   0.615   0.635   0.480   0.545   0.547   0.564
  UniOr-LR-RF                0.550   0.550   0.585   0.490   0.500   0.535
  ItaliaNLP Lab-SVM          0.550   0.649   0.555   0.567   0.555   0.575
  ItaliaNLP Lab-STL          0.545   0.541   0.500   0.595   0.512   0.538
  ItaliaNLP Lab-MTL          0.640   0.676   0.470   0.561   0.546   0.578
  avg-Accuracy               0.580   0.610   0.518   0.552   0.532   0.558

Table 5: Results in terms of accuracy of the EVALITA 2018 GxG Cross-Domain Task.

  Team Name-Model            CH      DI      JO      TW      YT      TOT
  CapetownMilanoTirana-SVM   0.535   0.635   0.515   0.555   0.503   0.549
  UniOr-LR-RF                0.525   0.550   0.415   0.500   0.500   0.498
  ItaliaNLP Lab-SVM          0.540   0.514   0.505   0.586   0.513   0.532
  ItaliaNLP Lab-STL          0.640   0.554   0.495   0.609   0.510   0.562
  ItaliaNLP Lab-MTL          0.535   0.595   0.510   0.500   0.500   0.528
  avg-Accuracy               0.555   0.570   0.488   0.550   0.505   0.534

5 Analysis and Discussion

In this section we provide both a discussion of the approaches and an analysis of the results.

5.1 Approaches

Participants experimented with more classical machine learning approaches as well as with neural networks. Results show that while neural models achieve globally more accurate results, feature-engineered SVMs are just as competitive. This holds both in the in-genre and in the cross-genre settings.

All models suffer to some extent from the shift to cross-genre, though the CapetownMilanoTirana-SVM system appears to be the most robust. This might be due more to the choice of (abstract) features than to the learning algorithm itself. This system also employs bleaching (a technique that fades out the lexicon in favour of more abstract token representations) in this GxG cross-genre setting, after it had shown promise in the cross-lingual profiling task where it was first introduced (van der Goot et al., 2018). However, from their cross-validation results on training data, where they also perform an evaluation of feature contribution, it seems that bleaching in this context does not yield the expected benefits (Basile et al., 2018).

The use of external resources was globally little explored, with the exception of generic word embeddings (ItaliaNLP Lab). While such embeddings do not seem to have contributed much to performance, specialised lexica or embeddings could be something to investigate in the future.

From a learning-settings perspective, teams chose quite straightforward strategies. In-genre, all models were trained using data from the target genre only, with the exception of the ItaliaNLP Lab-MTL model, where the adopted multi-task strategy could use knowledge from other genres even in the in-genre settings. This seems confirmed by comparing the ItaliaNLP Lab-MTL results with those of its twin model ItaliaNLP Lab-STL (the same architecture in a single-task setting). In the cross-genre scenario, all systems used as training data all of the available datasets apart from the dataset of the target genre. It appears that no team tried to exploit functions of (dis)similarity among genres in order to select sub-portions of data for training when testing on a given genre. The only model that differs in how it uses the training data from the other genres is ItaliaNLP Lab-MTL, but its performance in the cross-genre setting indicates that this approach is not robust for this task.

5.2 Results

The in-domain results (Table 4) are useful to identify which genres are overall easier to model in terms of the author's gender, and provide an overview of gender detection in Italian.

As could be expected, Diaries are the easiest genre to model. This might result from the fact that the texts are longer, and are characterised by a more personal and subjective writing style. For example, the collected diaries present an extensive use of first and second person singular verbs and a higher distribution of possessive adjectives.
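The TOT columns in Tables 4 and 5 are plain macro-averages of the five per-genre accuracies, as defined in Section 2. As a quick check, a minimal sketch reproducing the in-genre total of the CapetownMilanoTirana-SVM run from its per-genre scores in Table 4:

```python
# In-genre accuracies of CapetownMilanoTirana-SVM (Table 4),
# in the order CH, DI, JO, TW, YT.
per_genre = [0.615, 0.635, 0.480, 0.545, 0.547]

# Final score: unweighted mean over the five genres.
total = sum(per_genre) / len(per_genre)
print(round(total, 3))  # 0.564, matching the TOT column
```

The same averaging, applied to the cross-genre columns of Table 5, yields the totals used for the official cross-genre ranking.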
Due to availability, the Diaries test set is smaller than the others, thus providing fewer instances for evaluation (see Table 2) and possibly weaker reliability of results. However, from the analysis of results reported by Basile et al. (2018), we see that even in the cross-validated training set (100 samples), accuracy on Diaries is the highest of the five genres.

We also see that Children writings carry a better signal towards gender detection than social media. This might be due to the fact that the Children test set is composed of documents characterised by a common prompt (in the original collection settings, this was meant to provide evidence of how students perceive the different writing instructions received in the considered school years). This feature makes the children's texts a reflexive textual typology, typically characterised by a more subjective writing style, as we observed also for Diaries.

Both Twitter and YouTube score above the 50% baseline, but are clearly harder to model. This could be due to the short texts, which in some cases offer very little evidence to go by. Compared to previous results reported for gender detection on Twitter in Italian, as obtained at the 2015 PAN Lab challenge on author profiling (Rangel et al., 2015) and on the TwiSty dataset (Verhoeven et al., 2016), scores at GxG are substantially lower (PAN 2015's best performance on Italian: .8611; TwiSty's reported F-measure: .7329). The reason for this could be that in the PAN Lab and TwiSty datasets authors are represented by a collection of tweets, while we do not control for this aspect at all, under the assumption that if gender traits emerge, they should not depend on having large evidence from a single author. It could therefore be that PAN's and TwiSty's results are inflated by partially modelling authors rather than gender. Another interesting observation regarding Twitter is that when cross-validating the training set, Basile et al. (2018) report accuracy in the 70s, while their in-genre results on the test data are just above 50% (Table 4).

The most difficult genre is Journalism. While its texts can be as long as diaries, results suggest that the required jargon and writing conventions typical of this genre overshadow the gender signal. Moreover, while we selected only articles written by a single author, there is always the chance that revisions were made by an editor before the piece was published. This is the only genre where some models fail to beat the 50% baseline. However, it is interesting to note that the highest in-genre score is achieved by UniOr, which uses a selection of stylometric features (e.g., Koppel et al., 2002; Schler et al., 2006), which have long been thought to capture unconscious behaviour better than mere lexical choice (on which most of the other models are based, as they mainly use n-grams). The highest Journalism score in the cross-genre settings is achieved by CapetownMilanoTirana.

Cross-genre, we observe that results are on average lower, but only by 2.4 percentage points (55.8 vs 53.4), which is less than one would expect. Some models clearly drop more heavily from in-genre to cross-genre (ItaliaNLP Lab-MTL: .578 vs .528 average accuracy; ItaliaNLP Lab-SVM: .575 vs .532; UniOr: .535 vs .498). However, others appear more stable in both settings (CapetownMilanoTirana-SVM: .564 vs .549), or even better at cross- rather than in-genre prediction (ItaliaNLP Lab-STL: .538 vs .562).

From a genre perspective, the drop is more substantial for some genres, with Diaries losing the most, though with large variation across systems. For example, the model that achieves the best performance on Diaries in-genre (ItaliaNLP Lab-MTL, .676) suffers a drop of almost eight percentage points on the same dataset cross-genre (.595). Conversely, CapetownMilanoTirana preserves a high performance on the Diaries test set in both in- and cross-genre settings (.635 in both), yielding the highest cross-genre performance on this genre. Twitter shows the least variation between in- and cross-genre testing. Not only are the losses minimal for all systems, but in some cases we even observe higher scores in the cross-genre setting. ItaliaNLP Lab-STL obtains the highest score on this test set in both settings, but its performance is higher cross- than in-genre (.609 vs .595).

Finally, some possibly interesting insights also emerge from looking at the precision and recall for the two classes (which we do not report in tables, as there are too many data points to show). For example, we observe that for some genres a single gender ends up being assigned almost exclusively by all classifiers. This happens in the cross-genre settings, while the same test set receives rather balanced assignments of the two classes in the in-genre settings. In the case of Journalism, in cross-genre modelling all systems almost exclusively predict female as the author's gender. At the opposite side of the spectrum we find YouTube, where almost all test instances are classified as male by almost all systems, cross-genre. In this case, though, we also see high recall for male in the in-genre setting, though not as dominant. While more grounded considerations are left to a deeper analysis in the future, we could speculate that some genres are globally seen by classifiers as more characteristic of one gender or the other, as learnt from a large amount of mixed-genre, gender-balanced data.

6 Conclusions

Gender detection was for the first time the focus of a dedicated task at EVALITA. The GxG task specifically focused on comparing performance within and across genres, building on previous observations that high-performing systems were likely to be modelling datasets rather than gender, as their accuracy substantially dropped when tested on different, even related, domains.

Results from 50 different runs were mostly above baseline for most prediction tasks, both in-genre and cross-genre, but not particularly high overall. Also, the drop between in-genre and cross-genre performance is noticeable, but marginal. Neural models appear to perform only slightly better than a more classic SVM which leverages character and word n-grams. The use of the recently introduced text bleaching strategy among the engineered features (aimed at reducing lexical bias; van der Goot et al., 2018) does not seem to yield the desired performance in the cross-genre settings.

In the near future, it will be interesting to compare these results to human performance, as it has been observed that in profiling, humans do not usually perform better than machines (Flekova et al., 2016; van der Goot et al., 2018), to which they provide complementary performance in terms of correctly predicted instances.

Acknowledgments

We would like to thank Eleonora Cocciu and Claudia Zaghi for their help with data collection and annotation. We are also grateful to the EVALITA chairs for hosting this task.

References

Miguel A. Alvarez-Carmona, A. Pastor López-Monroy, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, and Hugo Jair Escalante. 2015. INAOE's participation at PAN'15: Author profiling task. In CLEF Working Notes.

Alessia Barbagli, Pietro Lucisano, Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian learners corpus to study the development of writing competence. In Proceedings of the 10th Conference on Language Resources and Evaluation (LREC 2016), pages 88–95.

Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2017. N-GrAM: New Groningen author-profiling model. In Proceedings of the CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11–14 September, Dublin, Ireland.

Angelo Basile, Gareth Dwyer, and Chiara Rubagotti. 2018. CapetownMilanoTirana for GxG at EVALITA 2018. Simple n-gram based models perform well for gender prediction. Sometimes. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

M. Busger op Vollenbroek, T. Carlotto, T. Kreutz, M. Medvedeva, C. Pool, J. Bjerva, H. Haagsma, and M. Nissim. 2016. GronUP: Groningen user profiling notebook for PAN at CLEF. In CLEF 2016 Evaluation Labs and Workshop, Working Notes.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th evaluation campaign of natural language processing and speech tools for Italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Chris Emmery, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple queries as distant labels for predicting gender on Twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55, Copenhagen, Denmark.

Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Lyle Ungar, and Daniel Preoţiuc-Pietro. 2016. Analyzing biases in human perception of user age and gender from text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 843–854, Berlin, Germany.

Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17:401–412.

Maria Medvedeva, Hessel Haagsma, and Malvina Nissim. 2017. An analysis of cross-genre and in-genre performance for author profiling in social media. In Experimental IR Meets Multilinguality, Multimodality, and Interaction – 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11–14, 2017, pages 211–223.

Francisco Rangel, Paolo Rosso, Martin Potthast, Benno Stein, and Walter Daelemans. 2015. Overview of the 3rd author profiling task at PAN 2015. In CLEF.

Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings, pages 750–784.

Francisco Rangel, Paolo Rosso, Martin Potthast, and Benno Stein. 2017. Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. In CLEF Working Notes.

Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. 2006. Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 199–205. AAAI.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 2: Short Papers, pages 383–389.

Ben Verhoeven, Walter Daelemans, and Barbara Plank. 2016. TwiSty: A multilingual Twitter stylometry corpus for gender and personality profiling. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).