<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alyssa Lees</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeffrey Sorensen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Kivlichan</string-name>
        </contrib>
        <aff>Google Jigsaw, New York, NY. {alyssalees, sorenj, kivlichan}@google.com</aff>
      </contrib-group>
      <abstract>
        <p>The Google Jigsaw team produced submissions for two of the EVALITA 2020 (Basile et al., 2020) shared tasks, based in part on the technology that powers the publicly available PerspectiveAPI comment evaluation service. We present a basic description of our submitted results and a review of the types of errors that our system made in these shared tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The HaSpeeDe2 shared task consists of Italian social media posts that have been labeled for hate speech and stereotypes. As Jigsaw’s participation was limited to Tasks A and B, we restrict our analysis to that portion. The full details of the dataset are available in the task guidelines
        <xref ref-type="bibr" rid="ref4">(Bosco et al., 2020)</xref>
        .
      </p>
      <p>
        The AMI task includes both raw (natural Twitter) and synthetic (template-generated) datasets. The raw data consists of Italian tweets manually labelled and balanced according to misogyny and aggressiveness labels, while the synthetic data is labelled only for misogyny and is intended to measure the presence of unintended bias
        <xref ref-type="bibr" rid="ref8">(Fersini et al., 2020)</xref>
        .
      </p>
      <p>
        Jigsaw, a team within Google, develops the PerspectiveAPI machine learning comment scoring system, which is used by numerous social media companies and publishers. Our system is based on distillation and uses a convolutional neural network to score individual comments according to several attributes, using supervised training data labeled by crowd workers. Note that PerspectiveAPI actually hosts a number of different models that each score different attributes. The underlying technology and performance of these models have evolved over time.
      </p>
      <p>
        While Jigsaw has hosted three separate Kaggle competitions relevant to these shared tasks
        <xref ref-type="bibr" rid="ref10 ref11 ref12">(Jigsaw, 2018; Jigsaw, 2019; Jigsaw, 2020)</xref>
        , we have not traditionally participated in academic evaluations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        The models we build are based on the popular
BERT architecture
        <xref ref-type="bibr" rid="ref7">(Devlin et al., 2019)</xref>
        with
different pre-training and fine-tuning approaches.
      </p>
      <p>
        In part, our submissions explore the importance
of pre-training
        <xref ref-type="bibr" rid="ref9">(Gururangan et al., 2020)</xref>
        in the
context of toxicity and the various competition
attributes. A core question is to what extent these
domains overlap. Jigsaw’s customized models
(used for the second HaSpeeDe2 submission, and
both AMI submissions) are pretrained on a set of
one billion user-generated comments: this imparts
statistical information to the model about
comments and conversations online. This model is
further fine-tuned on various toxicity attributes
(toxicity, severe toxicity, profanity, insults, identity
attacks, and threats), but it is unclear how well these
should align with the competition attributes. The
descriptions of these attributes and how they were
collected from crowd workers can be found in the
data descriptions for the Jigsaw Unintended Bias
in Toxicity Classification
        <xref ref-type="bibr" rid="ref11">(Jigsaw, 2019)</xref>
        website.
      </p>
      <p>
        A second question studied in prior work is to
what extent training generalizes across languages
        <xref ref-type="bibr" rid="ref14 ref18 ref13">(Pires et al., 2019; Wu and Dredze, 2019; Pamungkas et al., 2020)</xref>
        . The majority of our
training data is English comment data from a variety
of sources, while this competition is based on
Italian Twitter data. Though multilingual transfer has
been studied in general contexts, less is known
about the specific cases of toxicity, hate speech,
misogyny, and harassment. This was one of the
focuses of Jigsaw’s recent Kaggle competition
        <xref ref-type="bibr" rid="ref12">(Jigsaw, 2020)</xref>
        ; i.e., what forms of toxicity are shared
across languages (and hence can be learned by
multilingual models) and what forms are different.
      </p>
    </sec>
    <sec id="sec-3">
      <title>4 Submission Details</title>
      <p>As Jigsaw has already developed toxicity models for the Italian language, we initially hoped that these would provide a preliminary baseline for the competition, despite the annotation guidelines having been developed independently. Our Italian models score comments for toxicity as well as five additional distinct toxicity attributes: severe toxicity, profanity, threats, insults, and identity attacks. We might expect some of these attributes to correlate with the HaSpeeDe2 and AMI attributes, though it is not immediately clear whether any of these correlations should be particularly strong.</p>
      <p>The current Jigsaw PerspectiveAPI models are typically trained via distillation from a multilingual teacher model (one too large to practically serve in production) to a smaller CNN. Using this large teacher model, we initially compared the EVALITA hate speech and stereotype annotations against the teacher model’s scores for different attributes. The results are shown in Figure 1 for the training data. Perspective is a reasonable detector for the hate speech attribute, but performs less well for the stereotype attribute, with the identity attack model performing best.</p>
      <p>Using these same models on the AMI task, shown in Figure 2, proved even more challenging for detecting misogyny. Here, the aggressiveness attribute was evaluated only on the subset of the training data labeled misogynous. In this case, the most popular attribute, “toxicity”, is actually counter-indicative of the misogyny label. The best detector for both of these attributes appears to be the “threat” model.</p>
      <p>As can be seen, the existing classifiers are all poor predictors of both attributes for this shared task. Due to errors in our initial analysis, we did not end up using any of the models behind PerspectiveAPI in our final submissions.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>HaSpeeDe2 test scores by category, submission, and attribute.</p></caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Submission</th><th>hatespeech</th><th>stereotype</th></tr>
          </thead>
          <tbody>
            <tr><td>news</td><td>1</td><td>0.68</td><td>0.64</td></tr>
            <tr><td>news</td><td>2</td><td>0.64</td><td>0.68</td></tr>
            <tr><td>tweets</td><td>1</td><td>0.72</td><td>0.67</td></tr>
            <tr><td>tweets</td><td>2</td><td>0.77</td><td>0.74</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-1">
        <title>4.1 HaSpeeDe2</title>
        <p>The Jigsaw team entered two separate submissions that were independently trained for Tasks A and B.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.1.1 First Submission</title>
        <p>Our first submission, which did not perform very well, was based on a simple multilingual BERT model fine-tuned on 10 random splits of the training data. For each split, 10% of the data was held out to choose an appropriate equal-error-rate threshold for the resulting model.</p>
        <p>
          The BERT fine-tuning system used the 12-layer model
          <xref ref-type="bibr" rid="ref15">(Tensorflow Hub, 2020)</xref>
          , a batch size of 64, and a sequence length of 128. A single dense layer connects to the two output sigmoids, which are trained with a binary cross-entropy loss using stochastic gradient descent with early stopping based on the AUC metric computed on the 10% held-out slice. This model is implemented using Keras
          <xref ref-type="bibr" rid="ref6">(Chollet and others, 2015)</xref>
          .
        </p>
        <p>To create the final submission, the decisions of the ten separate classifiers were combined in a majority voting scheme (if 5 or more models produced a positive detection, the attribute was assigned true).</p>
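<p>The classification head described above (a single dense layer feeding two sigmoid outputs trained with a binary cross-entropy loss) can be sketched in plain numpy; the shapes and names below are illustrative assumptions, not the exact implementation.</p>

```python
import numpy as np

def sigmoid(z):
    # logistic activation for each output attribute
    return 1.0 / (1.0 + np.exp(-z))

def two_sigmoid_head(pooled, W, b):
    # pooled: (batch, hidden) encoder output
    # W: (hidden, 2), b: (2,) -- one logit per attribute
    # (hate speech, stereotype)
    return sigmoid(pooled @ W + b)

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # mean BCE over examples and attributes
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))
```

<p>In the submissions this head sits on top of the BERT encoder and is trained with SGD and AUC-based early stopping, as described above.</p>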
      </sec>
      <sec id="sec-3-3">
        <title>4.1.2 Second Submission</title>
        <p>Our second submission was based on a similar
approach of fine-tuning a BERT-based model, but
one based on a more closely matched training set.</p>
        <p>
          The underlying technology we used is the same
as the Google Cloud AutoML for natural language
processing product that had been employed in
similar labeling applications
          <xref ref-type="bibr" rid="ref2">(Bisong, 2019)</xref>
          .
        </p>
        <p>
          The remaining models, built for this competition and described in the subsequent sections, are based on a customized 768-dimension, 12-layer BERT model pretrained on 1B user-generated comments using MLM for 125 steps. This model was then fine-tuned on supervised comments in multiple languages for six attributes: toxicity, severe toxicity, obscene, threat, insult, and identity hate. This model also uses a custom wordpiece model
          <xref ref-type="bibr" rid="ref19">(Wu et al., 2016)</xref>
          comprising 200K tokens from hundreds of languages.
        </p>
        <p>Our hate speech and misogyny models use a fully connected final layer that combines the six output attributes and allows weight propagation through all layers of the network. Fine-tuning continues on the supervised training data provided by the competition hosts, using the Adam optimizer with a learning rate of 1e-5.</p>
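<p>Fine-tuning uses the Adam optimizer with a learning rate of 1e-5; as a reference point, here is a single Adam update in numpy form (a sketch of the standard update rule with default betas, not Jigsaw's training code):</p>

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam update with bias-corrected first and second moments;
    # t is the 1-based step count
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

<p>With lr=1e-5, a unit gradient moves a weight by roughly 1e-5 on the first step, which is typical for gentle fine-tuning of a pre-trained encoder.</p>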
        <p>Our second submission for HaSpeeDe2 consisted of fine-tuning a single model on the provided training data with a 10% held-out set. The custom BERT model was fine-tuned on TPUs using a relatively small batch size of 32.</p>
        <p>Figure 3 displays the ROC curves for our second submission on each of the news and tweets datasets, for both the hate speech and stereotype attributes. [Figure 3 AUC values: Tweets Hatespeech 85.5%, Tweets Stereotype 82.6%, News Stereotype 77.3%, News Hatespeech 75.7%.]</p>
      </sec>
      <sec id="sec-3-2a">
        <title>4.2 AMI</title>
        <p>Our submissions for the AMI task considered only the unconstrained case, due to the use of pretrained models. All AMI models were fine-tuned on TPUs using the customized BERT checkpoint and custom wordpiece vocabulary from Section 4.1.2, but with a larger batch size of 128. All models were fine-tuned simultaneously on the misogynous and aggressive labels using the provided data, with zero aggressiveness weights assigned to data points without misogynous labels.</p>
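<p>The zero aggressiveness weights mentioned above amount to a per-example weighted loss; a minimal numpy sketch (the function names are ours, not from the training code):</p>

```python
import numpy as np

def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    # per-example binary cross-entropy, weighted and renormalized
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y_true, dtype=float)
    w = np.asarray(weights, dtype=float)
    per_example = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(np.sum(w * per_example) / max(np.sum(w), eps))

def aggressiveness_weights(misogynous_labels):
    # zero loss weight on the aggressive head for comments
    # that are not labeled misogynous
    return np.asarray(misogynous_labels, dtype=float)
```

<p>Comments without a misogynous label thus contribute nothing to the aggressiveness loss, while still training the misogyny head.</p>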
        <p>Both submissions were based on ensembles of
partitioned models evaluated on a 10% held-out
test set. We explored two different ensembling
techniques, which we discuss in the next section.</p>
        <p>AMI submission 1 does not include synthetic data. AMI submission 2 includes the synthetic data as well as custom bias-mitigation data selected from Wikipedia articles. Table 2 shows that the inclusion of such data significantly improved performance on Task B for submission 2. Interestingly, the inclusion of synthetic and bias-mitigation data slightly improved performance on Task A as well.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>AMI scores by task and submission.</p></caption>
          <table>
            <thead>
              <tr><th>Task</th><th>Submission</th><th>Score</th></tr>
            </thead>
            <tbody>
              <tr><td>A</td><td>1</td><td>0.738</td></tr>
              <tr><td>A</td><td>2</td><td>0.741</td></tr>
              <tr><td>B</td><td>1</td><td>0.649</td></tr>
              <tr><td>B</td><td>2</td><td>0.883</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The two Jigsaw models ranked in first and
second place for Task A. The second submission
ranked first among participants for Task B.</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2.1 Ensembling Models</title>
        <p>
          Both the first and second submissions for AMI were ensembles of fine-tuned custom BERT models constructed from partitioned training data. We explored two ensembling techniques
          <xref ref-type="bibr" rid="ref5">(Brownlee, 2020)</xref>
          :
        </p>
        <list list-type="bullet">
          <list-item><p>Majority vote: each partitioned model was evaluated using a model-specific threshold, and the label for each attribute was determined by majority vote among the models.</p></list-item>
          <list-item><p>Average: the raw model probabilities are averaged together, and the combined model assigns labels via custom thresholds determined by evaluation on a held-out set.</p></list-item>
        </list>
        <p>Thresholds for the individual models in the majority-vote and average ensembles were calculated to find the point on the held-out ROC curve where |TPR − (1 − FPR)| is minimized.</p>
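<p>The threshold selection and the two ensembling schemes can be sketched as follows, assuming each model produces a per-example probability (a simplified illustration, not the submission code):</p>

```python
import numpy as np

def eer_threshold(labels, scores):
    # threshold whose ROC point minimizes |TPR - (1 - FPR)|
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    cands = np.unique(scores)
    gaps = []
    for t in cands:
        pred = scores >= t
        tpr = np.logical_and(pred, labels).sum() / labels.sum()
        fpr = np.logical_and(pred, np.logical_not(labels)).sum() / np.logical_not(labels).sum()
        gaps.append(abs(tpr - (1.0 - fpr)))
    return float(cands[int(np.argmin(gaps))])

def majority_vote(score_matrix, thresholds):
    # score_matrix: (n_models, n_examples); one threshold per model;
    # positive when at least half of the models fire
    votes = np.asarray(score_matrix) >= np.asarray(thresholds)[:, None]
    return 2 * votes.sum(axis=0) >= np.asarray(score_matrix).shape[0]

def average_ensemble(score_matrix, threshold):
    # average raw probabilities, then apply one tuned threshold
    return np.asarray(score_matrix).mean(axis=0) >= threshold
```

<p>For ten models, the majority-vote rule above fires when 5 or more models produce a positive detection, matching the scheme used for the first HaSpeeDe2 submission.</p>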
        <p>The majority-voting model performed slightly better for both the misogynous and aggressive tasks on the held-out sets, so both submissions use majority vote.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2.2 First Submission</title>
        <p>Using the same configuration as Section 4.1.2, we partitioned the raw training data into ten randomly chosen partitions and fine-tuned on nine of these, using the 10% held-out portion to compute thresholds. No synthetic or de-biasing data was included in this submission.</p>
        <p>We include ROC curves for half of these models in Figure 4, to illustrate that they are similar but show some variance when used to score the test data.</p>
        <p>Our first unconstrained submission, using majority vote, achieved scores of 0.738 for Task A and 0.649 for Task B. The poorer score for Task B is not surprising given that no bias-mitigating data or constraints were included in training.</p>
      </sec>
      <sec id="sec-3-6">
        <title>4.2.3 Second Submission</title>
        <p>In order to mitigate bias, we decided to augment
the training data set using sentences sampled from
the Italian Wikipedia articles that contain the 17
terms listed in the identity terms file provided with
the test set data. These sentences were labeled
as both non-misogynous and non-aggressive. 11K
sentences were used for this purpose, with the term
frequencies summarized in Table 3.</p>
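<p>The sampling step can be sketched as a simple filter over candidate sentences; the term list below is only a subset of the 17 terms in Table 3, and the matching logic is an illustrative assumption:</p>

```python
# subset of the 17 identity terms (see Table 3); illustrative only
IDENTITY_TERMS = {"donna", "donne", "femmina", "femmine", "mamma"}

def debias_examples(sentences):
    # keep sentences mentioning an identity term, labeled negative
    # for both attributes (misogynous=0, aggressive=0)
    out = []
    for s in sentences:
        tokens = set(s.lower().split())
        if tokens.intersection(IDENTITY_TERMS):
            out.append({"text": s, "misogynous": 0, "aggressive": 0})
    return out
```

<p>Adding such negatively labeled sentences counteracts the tendency of the model to associate the identity terms themselves with the positive class.</p>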
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Identity terms used to select de-biasing sentences from Italian Wikipedia.</p></caption>
          <table>
            <thead>
              <tr><th>Identity Term</th></tr>
            </thead>
            <tbody>
              <tr><td>donna</td></tr>
              <tr><td>donne</td></tr>
              <tr><td>femmine</td></tr>
              <tr><td>femmina</td></tr>
              <tr><td>fidanzata</td></tr>
              <tr><td>nonna</td></tr>
              <tr><td>mamma</td></tr>
              <tr><td>casalinga</td></tr>
              <tr><td>casalinghe</td></tr>
              <tr><td>compagne</td></tr>
              <tr><td>compagna</td></tr>
              <tr><td>mamme</td></tr>
              <tr><td>fidanzate</td></tr>
              <tr><td>nonne</td></tr>
              <tr><td>matrone</td></tr>
              <tr><td>matrona</td></tr>
              <tr><td>morosa</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The second submission employed the same partitioning of data with a held-out set. However, the unconstrained training data included the raw training data, the provided synthetic data, and our de-biasing term data. As with submission 1, majority vote was used with custom thresholds determined by evaluation on the held-out set.</p>
        <p>Our second unconstrained submission for AMI achieved scores of 0.741 for Task A and 0.883 for Task B.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 Error Analysis</title>
      <p>We discuss an informal analysis of the errors we observed in each of these tasks. Aside from the typical questions regarding data annotation quality and the small sample sizes, we observed some particular instances of avoidable errors.</p>
      <sec id="sec-4-1">
        <title>5.1 HaSpeeDe2 Errors</title>
        <p>Looking at the largest incongruities, shown in Table 4, it is clear that context, which is unavailable to our models, and presumably to the moderators, is important for determining the author’s intent. The use of humor and the practice of quoting text from another author are also confounding factors. As this task is known to be hard
          <xref ref-type="bibr" rid="ref16 ref17">(Vigna et al., 2017; van Aken et al., 2018)</xref>
          , the edge cases display these confounding factors. Additionally, as evidenced by the last comment, the subtlety of hate speech that is directed toward the designated target for this challenge has not been well captured.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption><p>HaSpeeDe2 comments with the largest incongruities between labels and model scores.</p></caption>
          <table>
            <thead>
              <tr><th>ID</th><th>Comment</th></tr>
            </thead>
            <tbody>
              <tr><td>11355</td><td>@user @user @user Giustissimo, non bisogna mai nascondersi nelle ideologie, sopratutto oggi perché non esistono più. Sta di fatto, che le cose più aberranti che leggi oggi sui giornali hanno sempre a che fare con stranieri... o rom URL</td></tr>
              <tr><td>10803</td><td>#Simone di #CasalBruciato, #Roma: “Certi rom sono cittadini italiani, ma non sono uguali a noi. Uguali non è il termine più giusto da usare”. URL</td></tr>
              <tr><td>11288</td><td>I SOLDI DEI DISABILI AI MIGRANTI La regione Emilia Romagna destina la metà dei fondi destinati alle fasce deboli a progetti per i richiedenti asilo A Reggio Emilia il 69% delle risorse stanziate sono state utilizzate ai richiedenti asilo #PRIMAGLIITALIANI URL</td></tr>
              <tr><td>10420</td><td>#MeNeFottoDi questi sfigati #facciamorete che continuano a giustificare ogni crimine commesso da immigrati... non fate rete, FATE SCHIFO... #facciamociFURBI</td></tr>
              <tr><td>11189</td><td>@user Naturalmente in questo caso alla faccia dei comunisti e dei migranti stitici!</td></tr>
              <tr><td>10483</td><td>@user SCHIFOSA IPOCRITA SPONSORIZZI I MUSSULMANI E POI VOI DARE I DIRITTI ALLE DONNE SI VEDE CHE SEI POSSEDUTA DAL DIAVOLO SEI BUGIARDA BOLDRINA SAI SOLO PROTESTARE POI TI CHIEDI PERCHÉ IL VERO ITALIANO TI ODIA PERCHÉ SEI UNA SPORCA IPOCRITA</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The BERT model that we fine-tuned for this application is cased, and we see frequent use of all-caps text within our errors. However, lower-casing the text has almost no effect on the scores, suggesting that the BERT pre-training has already linked the various cased versions of the tokens in the vocabulary.</p>
        <p>We analyzed the frequency of word piece
fragments in the data and saw no correlation between
misclassification and the presence of segmented
words. This suggests that vocabulary coverage in
the test set does not play a significant role in
explaining our systems’ errors.</p>
        <p>Considering the sentence with the highest model score for hate speech, several single terms are tagged by the model. For example, the term “sfigati” occurs only once in the training data, in a sentence that is marked as non-hate speech. However, this term is not in our vocabulary and gets split into the pieces “sfiga” and “##ti”, and the prefix “sfiga” appears in two out of three training examples that are marked hate speech: exactly the kind of data sparsity that leads to unwanted bias. Using a larger amount of training data, even if it creates an imbalance, is one way to address this, as we did in the case of the AMI challenge.</p>
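<p>The segmentation behavior described above follows from BERT-style greedy longest-match-first wordpiece tokenization, which can be sketched as follows (a simplified version; the real tokenizer also handles punctuation, casing, and a maximum token length):</p>

```python
def wordpiece(token, vocab):
    # greedy longest-match-first segmentation over a wordpiece vocab;
    # continuation pieces carry the "##" prefix
    pieces, start = [], 0
    while start != len(token):
        end = len(token)
        cur = None
        while start != end:
            sub = token[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces
```

<p>With a vocabulary containing “sfiga” and “##ti” but not “sfigati”, the token is split exactly as described in the text.</p>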
      </sec>
      <sec id="sec-4-2">
        <title>5.2 AMI</title>
        <p>Because we are using ensemble models trained on partitions of the training set, we observe that the highest-scoring test samples marked non-misogynous and non-aggressive, as well as the lowest-scoring misogynous and aggressive comments, vary from model to model. However, we display the most frequently occurring mistakes across all ten ensembles in Table 5.</p>
        <p>Regarding the false alarms, these comments
appear to be mislabeled test instances, and there is
ample support for this claim in the training data.</p>
        <p>The first comment combines both uppercase text and a missing space. While it is true that the subjunctive mood is not well represented in the training data, lower-casing this sentence produces high scores.</p>
        <p>This is also the case with the third example. The second error seems more subtle, perhaps an attempt at humor, but one with no salient misogyny terms.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Bias</title>
      <p>
        Because the identity terms for AMI are not observed with high frequency in the test data, we restrict our analysis to the synthetic data set. We find wide variation in the performance of our individual models, with one model exhibiting very poor performance across the subgroups. The summary of the AUC measurements for these models is shown in Figure 5, Figure 6, and Figure 7, using the technique presented in
        <xref ref-type="bibr" rid="ref3">(Borkan et al., 2019)</xref>
        . There does not appear to be a systemic problem with bias in these models, but judging based only upon synthetic data is probably unwise. The single term “donna” from the test set shows a subgroup AUC that drops substantially below the background AUC for nearly all of the models, perhaps indicating limitations of judging based on synthetic data.
      </p>
      <p>Both of these challenges dealt with issues related to content moderation and the evaluation of user-generated content. While early research raised fears of censorship, the ongoing challenges platforms face have made it necessary to consider the potential of machine learning. Advances in natural language understanding have produced models that work surprisingly well, even ones that are able to detect malicious intent that users try to encode in subtle ways.</p>
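<p>The per-identity evaluation in this section uses the subgroup AUC of (Borkan et al., 2019); a minimal numpy sketch, where the subgroup mask marks comments mentioning the identity term (an illustrative simplification of the full nuanced-metrics suite):</p>

```python
import numpy as np

def auc(labels, scores):
    # Mann-Whitney formulation of the ROC AUC
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[np.logical_not(labels)]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def subgroup_auc(labels, scores, in_subgroup):
    # AUC restricted to examples mentioning the identity subgroup
    mask = np.asarray(in_subgroup, dtype=bool)
    return auc(np.asarray(labels)[mask], np.asarray(scores)[mask])
```

<p>A subgroup AUC well below the background AUC, as we observe for “donna”, indicates the model separates the classes less reliably on comments mentioning that identity.</p>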
      <p>Our particular approach to the EVALITA challenges represented an unsurprising application of what has now become a textbook technique: leveraging the resources of large pre-trained models. However, many participants achieved similar performance levels in the constrained task, which we regard as a more impressive accomplishment.</p>
      <p>Jigsaw continues to apply machine learning to support publishers and to help them host quality online conversations where readers feel safe participating. The kinds of comments these challenges tagged are some of the most concerning and pernicious online behaviors, far outside the norms tolerated in other public spaces. But humans and machines alike still mistake profanity for hostility, and tagging humor, quotations, sarcasm, and other legitimate expressions for moderation remains a serious problem.</p>
      <p>Challenges like the AMI and HaSpeeDe2 competitions underscore the importance of understanding the relationships between the parties in a conversation, and the participants’ intents. We are greatly encouraged that attributes our systems do not currently capture were somewhat within the reach of our present techniques, but clearly much work remains to be done.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><given-names>Valerio</given-names> <surname>Basile</surname></string-name>, Danilo Croce, Maria Di Maro, and <string-name><given-names>Lucia C.</given-names> <surname>Passaro</surname></string-name>. <year>2020</year>. <article-title>EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian</article-title>. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, <source>Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)</source>, Online. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Ekaba</given-names>
            <surname>Bisong</surname>
          </string-name>
          ,
          <year>2019</year>
          . Google AutoML:
          <source>Cloud Natural Language Processing</source>
          , pages
          <fpage>599</fpage>
          -
          <lpage>612</lpage>
          . Apress, Berkeley, CA.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Borkan</surname>
          </string-name>
          , Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and
          <string-name>
            <given-names>Lucy</given-names>
            <surname>Vasserman</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Nuanced metrics for measuring unintended bias with real data for text classification</article-title>
          .
          <source>In Companion Proceedings of The 2019 World Wide Web Conference</source>
          , pages
          <fpage>491</fpage>
          -
          <lpage>500</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Tommaso Caselli, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Viviana Patti, Irene Russo, Manuela Sanguinetti, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Stranisci</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Hate speech detection task second edition (haspeede2) at evalita 2020 task guidelines</article-title>
          . https://github.com/msang/haspeede/ blob/master/2020/HaSpeeDe2020_ Task_guidelines.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jason</given-names>
            <surname>Brownlee</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>How to develop voting ensembles with python</article-title>
          . https:// machinelearningmastery.com/votingensembles-with-python/, September.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Francois</given-names>
            <surname>Chollet</surname>
          </string-name>
          et al.
          <year>2015</year>
          . Keras.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><given-names>Jacob</given-names> <surname>Devlin</surname></string-name>, <string-name><given-names>Ming-Wei</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>Kenton</given-names> <surname>Lee</surname></string-name>, and <string-name><given-names>Kristina</given-names> <surname>Toutanova</surname></string-name>. <year>2019</year>. <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>. <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>, Volume <volume>1</volume> (Long and Short Papers), pages <fpage>4171</fpage>-<lpage>4186</lpage>, Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><given-names>Elisabetta</given-names> <surname>Fersini</surname></string-name>, <string-name><given-names>Debora</given-names> <surname>Nozza</surname></string-name>, and <string-name><given-names>Paolo</given-names> <surname>Rosso</surname></string-name>. <year>2020</year>. <article-title>AMI @ EVALITA2020: Automatic misogyny identification</article-title>. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, <source>Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020)</source>, Online. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><given-names>Suchin</given-names> <surname>Gururangan</surname></string-name>, Ana Marasović, <string-name><given-names>Swabha</given-names> <surname>Swayamdipta</surname></string-name>, Kyle Lo, Iz Beltagy, Doug Downey, and <string-name><given-names>Noah A.</given-names> <surname>Smith</surname></string-name>. <year>2020</year>. <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>. <source>In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>, pages <fpage>8342</fpage>-<lpage>8360</lpage>, Online, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Jigsaw</surname></string-name>. <year>2018</year>. <article-title>Jigsaw toxic comment classification challenge</article-title>. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge, March.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Jigsaw</surname></string-name>. <year>2019</year>. <article-title>Jigsaw unintended bias in toxicity classification</article-title>. https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification, July.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name><surname>Jigsaw</surname></string-name>. <year>2020</year>. <article-title>Jigsaw multilingual toxic comment classification</article-title>. https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification, July.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name><given-names>Endang Wahyu</given-names> <surname>Pamungkas</surname></string-name>, Valerio Basile, and <string-name><given-names>Viviana</given-names> <surname>Patti</surname></string-name>. <year>2020</year>. <article-title>Misogyny detection in Twitter: a multilingual and cross-domain study</article-title>. <source>Information Processing &amp; Management</source>, <volume>57</volume>(<issue>6</issue>):<fpage>102360</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Telmo</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eva</given-names>
            <surname>Schlinger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Garrette</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>How multilingual is multilingual BERT?</article-title>
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>4996</fpage>
          -
          <lpage>5001</lpage>
          , Florence, Italy, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>TensorFlow Hub</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Multilingual L12 H768 A12 V2</article-title>
          . https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/2, August.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Betty</given-names>
            <surname>van Aken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Risch</surname>
          </string-name>
          , Ralf Krestel, and Alexander Löser.
          <year>2018</year>
          .
          <article-title>Challenges for toxic comment classification: An in-depth error analysis</article-title>
          .
          <source>In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>42</lpage>
          , Brussels, Belgium, October. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Del Vigna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimino</surname>
          </string-name>
          , Felice Dell'Orletta,
          <string-name>
            <given-names>M.</given-names>
            <surname>Petrocchi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hate me, hate me not: Hate speech detection on Facebook</article-title>
          .
          <source>In ITASEC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Shijie</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dredze</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>833</fpage>
          -
          <lpage>844</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Yonghui</given-names>
            <surname>Wu</surname>
          </string-name>
          , Mike Schuster, Zhifeng Chen,
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          , Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao,
          <string-name>
            <given-names>Qin</given-names>
            <surname>Gao</surname>
          </string-name>
          , Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Wang</surname>
          </string-name>
          , Cliff Young,
          <string-name>
            <given-names>Jason</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jason</given-names>
            <surname>Riesa</surname>
          </string-name>
          , Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          .
          <source>CoRR</source>
          , abs/1609.08144.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>