=Paper=
{{Paper
|id=Vol-2765/130
|storemode=property
|title=Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model
|pdfUrl=https://ceur-ws.org/Vol-2765/paper130.pdf
|volume=Vol-2765
|authors=Alyssa Lees,Jeffrey Sorensen,Ian Kivlichan
|dblpUrl=https://dblp.org/rec/conf/evalita/LeesSK20
}}
==Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model==
Alyssa Lees and Jeffrey Sorensen and Ian Kivlichan
Google Jigsaw
New York, NY
(alyssalees|sorenj|kivlichan)@google.com
Abstract

The Google Jigsaw team produced submissions for two of the EVALITA 2020 (Basile et al., 2020) shared tasks, based in part on the technology that powers the publicly available PerspectiveAPI comment evaluation service. We present a basic description of our submitted results and a review of the types of errors that our system made in these shared tasks.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The HaSpeeDe2 shared task consists of Italian social media posts that have been labeled for hate speech and stereotypes. As Jigsaw's participation was limited to the A and B tasks, we will be limiting our analysis to that portion. The full details of the dataset are available in the task guidelines (Bosco et al., 2020).

The AMI task includes both raw (natural Twitter) and synthetic (template-generated) datasets. The raw data consists of Italian tweets manually labelled and balanced according to misogyny and aggressiveness labels, while the synthetic data is labelled only for misogyny and is intended to measure the presence of unintended bias (Fersini et al., 2020).

2 Background

Jigsaw, a team within Google, develops the PerspectiveAPI machine learning comment scoring system, which is used by numerous social media companies and publishers. Our system is based on distillation and uses a convolutional neural network to score individual comments according to several attributes, using supervised training data labeled by crowd workers. Note that PerspectiveAPI actually hosts a number of different models that each score different attributes. The underlying technology and performance of these models have evolved over time.

While Jigsaw has hosted three separate Kaggle competitions relevant to this competition (Jigsaw, 2018; Jigsaw, 2019; Jigsaw, 2020), we have not traditionally participated in academic evaluations.

3 Related Work

The models we build are based on the popular BERT architecture (Devlin et al., 2019) with different pre-training and fine-tuning approaches. In part, our submissions explore the importance of pre-training (Gururangan et al., 2020) in the context of toxicity and the various competition attributes. A core question is to what extent these domains overlap. Jigsaw's customized models (used for the second HaSpeeDe2 submission, and both AMI submissions) are pretrained on a set of one billion user-generated comments: this imparts statistical information to the model about comments and conversations online. This model is further fine-tuned on various toxicity attributes (toxicity, severe toxicity, profanity, insults, identity attacks, and threats), but it is unclear how well these should align with the competition attributes. The descriptions of these attributes and how they were collected from crowd workers can be found in the data descriptions on the Jigsaw Unintended Bias in Toxicity Classification (Jigsaw, 2019) website.

A second question studied in prior work is to what extent training generalizes across languages (Pires et al., 2019; Wu and Dredze, 2019; Pamungkas et al., 2020). The majority of our training data is English comment data from a variety of sources, while this competition is based on Italian Twitter data. Though multilingual transfer has been studied in general contexts, less is known about the specific cases of toxicity, hate speech,
misogyny, and harassment. This was one of the focuses of Jigsaw's recent Kaggle competition (Jigsaw, 2020); i.e., what forms of toxicity are shared across languages (and hence can be learned by multilingual models) and what forms are different.

4 Submission Details

As Jigsaw has already developed toxicity models for the Italian language, we initially hoped that these would provide a preliminary baseline for the competition, despite the independent nature of the development of the annotation guidelines. Our Italian models score comments for toxicity as well as five additional distinct toxicity attributes: severe toxicity, profanity, threats, insults, and identity attacks. We might expect some of these attributes to correlate with the HaSpeeDe2 and AMI attributes, though it is not immediately clear whether any of these correlations should be particularly strong.

The current Jigsaw PerspectiveAPI models are typically trained via distillation from a multilingual teacher model (that is too large to practically serve in production) to a smaller CNN. Using this large teacher model, we initially compared the EVALITA hate speech and stereotype annotations against the teacher model's scores for different attributes. The results are shown in Figure 1 for the training data. Perspective is a reasonable detector for the hate speech attribute, but performs less well for the stereotype attribute, with the identity attack model performing the best.

[Figure 1: ROC curves for the PerspectiveAPI multilingual teacher model attributes compared to the HaSpeeDe2 attributes (hate speech and stereotype). AUC for hate speech: identity_attack 83.6%, severe_toxicity 82.6%, toxicity 82%, insult 80.8%, profanity 79%, threat 68.1%. AUC for stereotype: identity_attack 71.2%, severe_toxicity 70.9%, insult 70.5%, toxicity 70.4%, profanity 69.9%, threat 63.6%.]

Using these same models on the AMI task to detect misogyny, shown in Figure 2, proved even more challenging. In this case, the aggressiveness attribute was evaluated only on the subset of the training data labeled misogynous. Here, the most popular attribute of "toxicity" is actually counter-indicative of the misogyny label. The best detector for both of these attributes appears to be the "threat" model.

[Figure 2: ROC curves for PerspectiveAPI multilingual teacher model attributes compared to the AMI attributes (misogyny and aggressiveness). AUC for misogynous: threat 61.5%, identity_attack 56.5%, insult 45.2%, severe_toxicity 44.4%, toxicity 39.8%, profanity 28.8%. AUC for aggressiveness: threat 66%, identity_attack 61.8%, severe_toxicity 53.9%, insult 53.3%, toxicity 48.4%, profanity 37%.]

As can be seen, the existing classifiers are all poor predictors of both attributes for this shared task. Due to errors in our initial analysis, we did not end up using any of the models used for PerspectiveAPI in our final submissions.

Category  Submission  Hate Speech  Stereotype
news      1           0.68         0.64
news      2           0.64         0.68
tweets    1           0.72         0.67
tweets    2           0.77         0.74

Table 1: Macro-averaged F1 scores for Jigsaw's HaSpeeDe2 submissions.

4.1 HaSpeeDe2

The Jigsaw team submitted two separate submissions that were independently trained for Tasks A and B.

4.1.1 First Submission

Our first submission, one that did not perform very well, was based on a simple multilingual BERT model fine-tuned on 10 random splits of the training data. For each split, 10% of the data was held out to choose an appropriate equal-error-rate threshold for the resulting model.

The BERT fine-tuning system used the 12-layer model (Tensorflow Hub, 2020), a batch size of 64, and a sequence length of 128. A single dense layer is used to connect to the two output sigmoids, which are trained using a binary cross-entropy loss and stochastic gradient descent with early stopping, computed from the AUC metric on the 10% held-out slice. This model is implemented using Keras (Chollet et al., 2015).
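The per-split equal-error-rate thresholding and the voting used to combine the split models can be sketched in plain numpy. This is our own illustration of the logic described above, not code the authors released, and the helper names are hypothetical:

```python
import numpy as np

def eer_threshold(scores, labels):
    """Sweep candidate thresholds on held-out scores and keep the one closest
    to the equal-error-rate point, where |TPR - (1 - FPR)| is minimal."""
    best_t, best_gap = 0.5, float("inf")
    for t in np.unique(scores):
        preds = scores >= t
        tpr = (preds & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        fpr = (preds & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        gap = abs(tpr - (1 - fpr))
        if gap < best_gap:
            best_gap, best_t = gap, float(t)
    return best_t

def majority_vote(score_matrix, thresholds):
    """Combine per-split models: positive when at least half of them
    (5 of 10 in the setup above) fire above their own thresholds."""
    votes = score_matrix >= thresholds[:, None]  # shape: (n_models, n_examples)
    return votes.sum(axis=0) >= (len(thresholds) + 1) // 2
```

Each split's threshold is fit only on that split's held-out 10%, so the ensemble never tunes on the test data.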
To create the final submission, the decisions of the ten separate classifiers were combined in a majority voting scheme (if 5 or more models produced a positive detection, the attribute was assigned true).

4.1.2 Second Submission

Our second submission was based on a similar approach of fine-tuning a BERT-based model, but one based on a more closely matched training set. The underlying technology we used is the same as the Google Cloud AutoML for natural language processing product that had been employed in similar labeling applications (Bisong, 2019).

The remaining models built for this competition and in the subsequent section are based on a customized BERT 768-dimension 12-layer model pretrained on 1B user-generated comments using MLM for 125 steps. This model was then fine-tuned on supervised comments in multiple languages for six attributes: toxicity, severe toxicity, obscene, threat, insult, and identity hate. This model also uses a custom wordpiece model (Wu et al., 2016) comprised of 200K tokens representing tokens from hundreds of languages.

Our hate speech and misogyny models use a fully connected final layer that combines the six output attributes and allows weight propagation through all layers of the network. Fine-tuning continues using the supervised training data provided by the competition hosts, using the ADAM optimizer with a learning rate of 1e-5.

Figure 3 displays the ROC curve for our second submission for each of the news and the tweets datasets, as well as for both the hate speech and stereotype attributes.

[Figure 3: ROC plots for HaSpeeDe2 test set labels. AUC: Tweets Hatespeech 85.5%, Tweets Stereotype 82.6%, News Stereotype 77.3%, News Hatespeech 75.7%.]

Our second submission for HaSpeeDe2 consisted of fine-tuning a single model with the provided training data with a 10% held-out set. The custom BERT model was fine-tuned on TPUs using a relatively small batch size of 32.

4.2 AMI

Our submissions for the AMI task only considered the unconstrained case, due to the use of pretrained models. All AMI models were fine-tuned on TPUs using the customized BERT checkpoint and custom wordpiece vocabulary from Section 4.1.2. However, a larger batch size of 128 was specified. All models were fine-tuned simultaneously on misogynous and aggressive labels using the provided data, where zero aggressiveness weights were assigned to data points with no misogynous labels.

Both submissions were based on ensembles of partitioned models evaluated on a 10% held-out test set. We explored two different ensembling techniques, which we discuss in the next section.

AMI submission 1 does not include synthetic data. AMI submission 2 includes the synthetic data and custom bias mitigation data selected from Wikipedia articles. Table 2 clearly shows that the inclusion of such data significantly improved the performance on Task B for submission 2. Interestingly, the inclusion of synthetic and bias mitigation data slightly improved the performance in Task A as well.

Task  Submission  Score
A     1           0.738
A     2           0.741
B     1           0.649
B     2           0.883

Table 2: Misogynous and aggressiveness macro-averaged F1 scores for Jigsaw's AMI submissions.

The two Jigsaw models ranked in first and second place for Task A. The second submission ranked first among participants for Task B.
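The zero-weight trick above amounts to masking non-misogynous examples out of the aggressiveness loss while keeping them for the misogyny loss. A minimal numpy sketch of that weighting (our own illustration under the assumptions stated, not the actual training code):

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights):
    """Binary cross-entropy in which zero-weight examples contribute nothing."""
    eps = 1e-7
    p = np.clip(y_pred, eps, 1 - eps)
    per_example = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float((weights * per_example).sum() / max(weights.sum(), 1.0))

# Aggressiveness weights are zero wherever the misogynous label is absent,
# so non-misogynous rows train only the misogyny head.
y_mis = np.array([1.0, 0.0, 1.0, 0.0])
agg_weights = (y_mis == 1.0).astype(float)
```

In a Keras-style multi-output setup, the same effect would come from passing a per-output sample-weight vector built this way.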
4.2.1 Ensembling Models

Both the first and second submissions for AMI were ensembles of fine-tuned custom BERT models constructed from partitioned training data. We explored two ensembling techniques (Brownlee, 2020):

• Majority Vote: Each partitioned model was evaluated using a model-specific threshold. The label for each attribute was determined by majority vote among the models.

• Average: The raw model probabilities are averaged together. The combined model calculates the labels via custom thresholds determined by evaluation on a held-out set.

Thresholds for the individual models in the majority vote and average ensembles were calculated to optimize for the point on the held-out data ROC curve where |TPR − (1 − FPR)| is minimized.

The majority voting model performed slightly better for both the misogynous and aggressive tasks on the held-out sets. As such, both submissions use majority vote.

4.2.2 First Submission

Using the same configuration as Section 4.1.2, we partitioned the raw training data into ten randomly chosen partitions and fine-tuned nine of these, using the 10% held-out portion to compute thresholds. No synthetic or de-biasing data was included in this submission.

We include ROC curves for half of these models in Figure 4, to illustrate that they are similar but with some variance when used to score the test data.

[Figure 4: ROC plots for AMI test set labels for models pre-ensemble. AUC: misogynous-4 91.5%, misogynous-0 90.1%, misogynous-1 90%, misogynous-3 89.8%, misogynous-2 89.3%; aggressiveness-1 89.6%, aggressiveness-3 88.5%, aggressiveness-4 87.9%, aggressiveness-2 87.9%, aggressiveness-0 87.3%.]

Our first unconstrained submission using majority vote for AMI achieved scores of 0.738 for Task A and 0.649 for Task B. The poorer score for Task B is not surprising given that no bias mitigating data or constraints were included in training.

4.2.3 Second Submission

In order to mitigate bias, we decided to augment the training data set using sentences sampled from the Italian Wikipedia articles that contain the 17 terms listed in the identity terms file provided with the test set data. These sentences were labeled as both non-misogynous and non-aggressive. 11K sentences were used for this purpose, with the term frequencies summarized in Table 3.

Identity Term  Sentence Count
donna          4306
donne          3100
femmine        1275
femmina         652
fidanzata       538
nonna           378
mamma           269
casalinga       256
casalinghe      187
compagne        132
compagna         34
mamme            24
fidanzate        12
nonne            11
matrone           9
matrona           8
morosa            6

Table 3: Term frequency in Wikipedia-sampled sentences for bias mitigation.

The second submission employed the same partitioning of data with a held-out set. However, the unconstrained data included the raw training data, the provided synthetic data, and our de-biasing term data. As with submission 1, majority vote was used with custom thresholds determined by evaluation on the held-out set.

Our second unconstrained submission for AMI achieved scores of 0.741 for Task A and 0.883 for Task B.

5 Error Analysis

We discuss an informal analysis of the errors we observed with each of these tasks. Aside from the typical questions regarding data annotation quality, and the small sample sizes, we observed some particular instances of avoidable errors.

5.1 HaSpeeDe2 Errors

Looking at the largest incongruities, as shown in Table 4, it is clear that context, which is unavailable to our models (and presumably to the moderators), is important for determining the author's intent. The use of humor and the practice of quoting text from another author are also confounding factors. As this task is known to be hard (Vigna et al., 2017; van Aken et al., 2018), the edge cases display these confounding reasons. Additionally, as evidenced by the last comment, the subtlety of hate speech that is directed toward the designated target for this challenge has not been well captured.

ID     HS  Score    Comment
11355  1   .00001   @user @user @user Giustissimo, non bisogna mai nascondersi nelle ideologie, sopratutto oggi perché non esistono più. Sta di fatto, che le cose più aberranti che leggi oggi sui giornali hanno sempre a che fare con stranieri... o rom URL
10803  1   .00003   #Simone di #CasalBruciato, #Roma: "Certi rom sono cittadini italiani, ma non sono uguali a noi. Uguali non è il termine più giusto da usare". URL
11288  1   .00003   I SOLDI DEI DISABILI AI MIGRANTI La regione Emilia Romagna destina la metà dei fondi destinati alle fasce deboli a progetti per i richiedenti asilo A Reggio Emilia il 69% delle risorse stanziate sono state utilizzate ai richiedenti asilo #PRIMAGLIITALIANI URL
10420  0   0.99996  #MeNeFottoDi questi sfigati #facciamorete che continuano a giustificare ogni crimine commesso da immigrati... non fate rete, FATE SCHIFO... #facciamociFURBI
11189  0   0.99996  @user Naturalmente in questo caso alla faccia dei comunisti e dei migranti stitici!
10483  0   0.99995  @user SCHIFOSA IPOCRITA SPONSORIZZI I MUSSULMANI E POI VOI DARE I DIRITTI ALLE DONNE SI VEDE CHE SEI POSSEDUTA DAL DIAVOLO SEI BUGIARDA BOLDRINA SAI SOLO PROTESTARE POI TI CHIEDI PERCHÉ IL VERO ITALIANO TI ODIA PERCHÉ SEI UNA SPORCA IPOCRITA

Table 4: Largest errors for hate speech classifier on HaSpeeDe2 tweet data.

The BERT model that we fine-tuned for this application is cased, and we see within our errors frequent use of all-caps text. However, lower-casing the text has almost no effect on the scores, suggesting that the BERT pre-training has already linked the various cased versions of the tokens in the vocabulary.

We analyzed the frequency of word piece fragments in the data and saw no correlation between misclassification and the presence of segmented words. This suggests that vocabulary coverage in the test set does not play a significant role in explaining our systems' errors.

Considering the sentence with the high model score for hate speech, several single terms are tagged by the model. For example, the term "sfigati" occurs only once in the training data, in a sentence that is marked as non-hate speech. However, this term is not in our vocabulary and gets split into the pieces "sfiga##ti", and the prefix "sfiga" appears in two out of three training examples that are marked hate speech: exactly the kind of data sparsity that leads to unwanted bias. Using a larger amount of training data, even if it creates an imbalance, is one way to address this, as we did in the case of the AMI challenge.

5.2 AMI

Because we are using ensemble models trained on partitions of the training set, we observe that the highest-scoring test samples that are marked non-misogynous and non-aggressive, as well as the lowest-scoring misogynous and aggressive comments, vary from model to model. However, we display the most frequently occurring mistakes across all ten ensembles in Table 5.

Regarding the false alarms, these comments appear to be mislabeled test instances, and there is ample support for this claim in the training data. The first comment combines both uppercase and a missing space. While it's true that subjunctive mode is not well represented in the training data, lower-casing this sentence produces high scores. This is also the case with the third example. The second error seems more subtle, perhaps an attempt at humor, but one with no salient misogyny terms.

6 Bias

Because the identity terms for AMI are not observed with a high frequency in the test data, we restrict our analysis to the synthetic data set. We find wide variation in the performance of our individual models, with one model exhibiting very poor performance across the subgroups. The summary of the AUC measurements for these models
ID    Error Type        Comment
5466  Missed Detection  Odio Sakura per il semplice motivo che qualunque cosa faccia o dica Naruto lei lo prende a schiaffi o a pugniHA CHIESTO COME STA SAI DIOSANTO BRUTTA STRONZA MA CONTRALLI MADONNA SPERO CHE TI UCCIDANOscusami Sarada
5385  Missed Detection  porca troia la prenderei a schiaffi se non fosse mia madre
5819  Missed Detection  ma in tutto ciò lo scopo di anna qual è? far soffrire il mio protetto? IO TI AMMAZZO COI LANCIAFIAMME OH #TemptationIsland
5471  False Alarm       @danielita8811 Che bel culo tutto da sfondare
5604  False Alarm       @coppiacalda2 Che bel culo da inculare

Table 5: Persistent errors for AMI across different ensembles.
are shown in Figure 5, Figure 6, and Figure 7, using the technique presented in (Borkan et al., 2019). There does not appear to be a systemic problem with bias in these models, but judging based only upon synthetic data is probably unwise. The single term "donna" from the test set shows a subgroup AUC that drops substantially from the background AUC for nearly all of the models, perhaps indicating limitations of judging based on synthetic data.

[Figure 5: Subgroup AUC per identity term (rows) for each of the ten ensemble models (columns 0-9); heatmap values not reproduced here.]

[Figure 6: Background Positive, Subgroup Negative AUC per identity term for each of the ten ensemble models; heatmap values not reproduced here.]
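The three views in Figures 5, 6, and 7 correspond to the subgroup AUC, BPSN, and BNSP metrics of Borkan et al. (2019). A numpy sketch of how each metric restricts the evaluation set (the function names are ours, for illustration):

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: chance a random positive outscores a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def subgroup_auc(scores, labels, in_group):
    """AUC over only the comments mentioning the identity term."""
    return auc(scores[in_group], labels[in_group])

def bpsn_auc(scores, labels, in_group):
    """Background positives vs. subgroup negatives: low values flag
    false alarms concentrated on the identity term."""
    keep = (in_group & (labels == 0)) | (~in_group & (labels == 1))
    return auc(scores[keep], labels[keep])

def bnsp_auc(scores, labels, in_group):
    """Background negatives vs. subgroup positives: low values flag
    missed detections on the identity term."""
    keep = (in_group & (labels == 1)) | (~in_group & (labels == 0))
    return auc(scores[keep], labels[keep])
```

Each heatmap cell is one of these AUCs computed for a single identity term and a single ensemble model.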
[Figure 7: Background Negative, Subgroup Positive AUC per identity term for each of the ten ensemble models; heatmap values not reproduced here.]

7 Conclusions and Future Work

Both of these challenges dealt with issues related to content moderation and evaluation of user-generated content. While early research raised fears of censorship, the ongoing challenges platforms face have made it necessary to consider the potential of machine learning. Advances in natural language understanding have produced models that work surprisingly well, even ones that are able to detect malicious intent that users try to encode in subtle ways.
Our particular approach to the EVALITA challenges represented an unsurprising application of what has now become a textbook technique: leveraging the resources of large pre-trained models. However, many participants achieved nearly similar performance levels in the constrained task. We regard this as a more impressive accomplishment.

Jigsaw continues to apply machine learning to support publishers and to help them host quality online conversations where readers feel safe participating. The kinds of comments these challenges tagged are some of the most concerning and pernicious online behaviors, far outside of the norms that are tolerated in other public spaces. But humans and machines both still mistake profanity for hostility, and tagging humor, quotations, sarcasm, and other legitimate expressions for moderation remains a serious problem.

Challenges like the AMI and HaSpeeDe2 competitions underscore the importance of understanding the relationships between the parties in a conversation, and the participants' intents. We are greatly encouraged that attributes that our systems do not currently capture were somewhat within the reach of our present techniques, but clearly much work remains to be done.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Ekaba Bisong. 2019. Google AutoML: Cloud Natural Language Processing, pages 599–612. Apress, Berkeley, CA.

Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2019. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, pages 491–500.

Cristina Bosco, Tommaso Caselli, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Viviana Patti, Irene Russo, Manuela Sanguinetti, and Marco Stranisci. 2020. Hate speech detection task second edition (HaSpeeDe2) at EVALITA 2020: Task guidelines. https://github.com/msang/haspeede/blob/master/2020/HaSpeeDe2020_Task_guidelines.pdf.

Jason Brownlee. 2020. How to develop voting ensembles with Python. https://machinelearningmastery.com/voting-ensembles-with-python/, September.

Francois Chollet et al. 2015. Keras.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2020. AMI @ EVALITA2020: Automatic misogyny identification. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, July. Association for Computational Linguistics.

Jigsaw. 2018. Jigsaw toxic comment classification challenge. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, March.

Jigsaw. 2019. Jigsaw unintended bias in toxicity classification. https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification, July.

Jigsaw. 2020. Jigsaw multilingual toxic comment classification. https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification, July.

Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2020. Misogyny detection in Twitter: a multilingual and cross-domain study. Information Processing & Management, 57(6):102360.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy, July. Association for Computational Linguistics.

Tensorflow Hub. 2020. Multilingual L12 H768 A12 V2. https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/2, August.
Betty van Aken, Julian Risch, Ralf Krestel, and
Alexander Löser. 2018. Challenges for toxic com-
ment classification: An in-depth error analysis. In
Proceedings of the 2nd Workshop on Abusive Lan-
guage Online (ALW2), pages 33–42, Brussels, Bel-
gium, October. Association for Computational Lin-
guistics.
F. D. Vigna, A. Cimino, Felice Dell’Orletta, M. Petroc-
chi, and M. Tesconi. 2017. Hate me, hate me not:
Hate speech detection on facebook. In ITASEC.
Shijie Wu and Mark Dredze. 2019. Beto, bentz, be-
cas: The surprising cross-lingual effectiveness of
BERT. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
833–844, Hong Kong, China, November. Associa-
tion for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, Jeff Klingner, Apurva Shah, Melvin
Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan
Gouws, Yoshikiyo Kato, Taku Kudo, Hideto
Kazawa, Keith Stevens, George Kurian, Nishant
Patil, Wei Wang, Cliff Young, Jason Smith, Jason
Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
Macduff Hughes, and Jeffrey Dean. 2016. Google’s
neural machine translation system: Bridging the gap
between human and machine translation. CoRR,
abs/1609.08144.