<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alyssa</forename><surname>Lees</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Google Jigsaw New York</orgName>
								<address>
									<region>NY</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jeffrey</forename><surname>Sorensen</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Google Jigsaw New York</orgName>
								<address>
									<region>NY</region>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ian</forename><surname>Kivlichan</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Google Jigsaw New York</orgName>
								<address>
									<region>NY</region>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Jigsaw @ AMI and HaSpeeDe2: Fine-Tuning a Pre-Trained Comment-Domain BERT Model</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">1960BF4294FAF1254B183FE38B76E827</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:05+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Google Jigsaw team produced submissions for two of the EVALITA 2020 (Basile et al., 2020) shared tasks, based in part on the technology that powers the publicly available PerspectiveAPI comment evaluation service. We present a basic description of our submitted results and a review of the types of errors that our system made in these shared tasks.</p><p>Jigsaw, a team within Google, develops the PerspectiveAPI machine learning comment scoring system, which is used by numerous social media companies and publishers. Our system is based on distillation and uses a convolutional neural network to score individual comments according to several attributes using supervised training data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The HaSpeeDe2 shared task consists of Italian social media posts that have been labeled for hate speech and stereotypes. As Jigsaw's participation was limited to the A and B tasks, we limit our analysis to that portion. The full details of the dataset are available in the task guidelines <ref type="bibr" target="#b3">(Bosco et al., 2020)</ref>.</p><p>The AMI task includes both raw (natural Twitter) and synthetic (template-generated) datasets. The raw data consists of Italian tweets manually labelled and balanced according to misogyny and aggressiveness labels, while the synthetic data is labelled only for misogyny and is intended to measure the presence of unintended bias <ref type="bibr" target="#b6">(Elisabetta Fersini, 2020)</ref>.</p><p>The supervised training data for PerspectiveAPI is labeled by crowd workers. Note that PerspectiveAPI actually hosts a number of different models that each score different attributes. The underlying technology and performance of these models have evolved over time.</p><p>While Jigsaw has hosted three separate Kaggle competitions relevant to this competition <ref type="bibr" target="#b8">(Jigsaw, 2018;</ref><ref type="bibr" target="#b8">Jigsaw, 2019;</ref><ref type="bibr" target="#b8">Jigsaw, 2020)</ref>, we have not traditionally participated in academic evaluations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Related Work</head><p>The models we build are based on the popular BERT architecture <ref type="bibr" target="#b5">(Devlin et al., 2019)</ref> with different pre-training and fine-tuning approaches.</p><p>In part, our submissions explore the importance of pre-training <ref type="bibr" target="#b7">(Gururangan et al., 2020)</ref> in the context of toxicity and the various competition attributes. A core question is to what extent these domains overlap. Jigsaw's customized models (used for the second HaSpeeDe2 submission, and both AMI submissions) are pretrained on a set of one billion user-generated comments: this imparts statistical information to the model about comments and conversations online. This model is further fine-tuned on various toxicity attributes (toxicity, severe toxicity, profanity, insults, identity attacks, and threats), but it is unclear how well these should align with the competition attributes. The descriptions of these attributes and how they were collected from crowd workers can be found in the data descriptions for the Jigsaw Unintended Bias in Toxicity Classification <ref type="bibr" target="#b8">(Jigsaw, 2019)</ref> website.</p><p>A second question studied in prior work is to what extent training generalizes across languages <ref type="bibr" target="#b10">(Pires et al., 2019;</ref><ref type="bibr" target="#b13">Wu and Dredze, 2019;</ref><ref type="bibr" target="#b9">Pamungkas et al., 2020)</ref>. The majority of our training data is English comment data from a variety of sources, while this competition is based on Italian Twitter data. Though multilingual transfer has been studied in general contexts, less is known about the specific cases of toxicity, hate speech, misogyny, and harassment. 
This was one of the focuses of Jigsaw's recent Kaggle competition <ref type="bibr">(Jigsaw, 2020)</ref>; i.e., what forms of toxicity are shared across languages (and hence can be learned by multilingual models) and what forms are different.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Submission Details</head><p>As Jigsaw has already developed toxicity models for the Italian language, we initially hoped that these would provide a preliminary baseline for the competition despite the independent nature of the development of the annotation guidelines. Our Italian models score comments for toxicity as well as five additional distinct toxicity attributes: severe toxicity, profanity, threats, insults, and identity attacks. We might expect some of these attributes to correlate with the HaSpeeDe2 and AMI attributes, though it is not immediately clear whether any of these correlations should be particularly strong.</p><p>The current Jigsaw PerspectiveAPI models are typically trained via distillation from a multilingual teacher model (that is too large to practically serve in production) to a smaller CNN. Using this large teacher model, we initially compared the EVALITA hate speech and stereotype annotations against the teacher model's scores for different attributes. The results are shown in Figure <ref type="figure">1</ref> for the training data. Perspective is a reasonable detector for the hate speech attribute, but performs less well for the stereotype attribute, with the identity attack model performing the best.</p><p>Using these same models on the AMI task to detect misogyny, shown in Figure <ref type="figure" target="#fig_0">2</ref>, proved even more challenging. Here, the aggressiveness attribute was evaluated only on the subset of the training data labeled misogynous. The most popular attribute, "toxicity", is actually counter-indicative of the misogyny label. The best detector for both of these attributes appears to be the "threat" model.</p><p>As can be seen, the existing classifiers are all poor predictors of both attributes for this shared task. 
Due to errors in our initial analysis, we did not end up using any of the models used for PerspectiveAPI in our final submissions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">HaSpeeDe2</head><p>The Jigsaw team submitted two separate submissions that were independently trained for Tasks A and B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">First Submission</head><p>Our first submission, which did not perform very well, was based on a simple multilingual BERT model fine-tuned on 10 random splits of the training data. For each split, 10% of the data was held out to choose an appropriate equal-error-rate threshold for the resulting model. The BERT fine-tuning system used the 12-layer model (Tensorflow Hub, 2020), a batch size of 64, and a sequence length of 128. A single dense layer connects to the two output sigmoids, which are trained with a binary cross-entropy loss using stochastic gradient descent. Early stopping is based on the AUC metric computed on the 10% held-out slice. This model is implemented using Keras <ref type="bibr">(Chollet and others, 2015)</ref>.</p><p>To create the final submission, the decisions of the ten separate classifiers were combined in a majority voting scheme (if 5 or more models produced a positive detection, the attribute was assigned true).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">Second Submission</head><p>Our second submission was based on a similar approach of fine-tuning a BERT-based model, but one based on a more closely matched training set.</p><p>The underlying technology we used is the same as the Google Cloud AutoML for natural language processing product that had been employed in similar labeling applications <ref type="bibr" target="#b1">(Bisong, 2019)</ref>.</p><p>The remaining models built for this competition and in the subsequent section are based on a customized BERT 768-dimension 12-layer model pretrained on 1B user-generated comments using MLM for 125 steps. This model was then fine-tuned on supervised comments in multiple languages for six attributes: toxicity, severe toxicity, obscene, threat, insult, and identity hate. This model also uses a custom wordpiece model <ref type="bibr" target="#b14">(Wu et al., 2016)</ref> comprising 200K tokens drawn from hundreds of languages.</p><p>Our hate speech and misogyny models use a fully connected final layer that combines the six output attributes and allows weight propagation through all layers of the network. Fine-tuning continues on the supervised training data provided by the competition hosts, using the Adam optimizer with a learning rate of 1e-5.</p><p>Figure <ref type="figure">3</ref> displays the ROC curves for our second submission on the news and tweets datasets, for both the hate speech and stereotype attributes.</p><p>Our second submission for HaSpeeDe2 consisted of fine-tuning a single model with the provided training data with a 10% held-out set. The custom BERT model was fine-tuned on TPUs using a relatively small batch size of 32.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">AMI</head><p>Our submissions for the AMI task only considered the unconstrained case, due to the use of pretrained models. All AMI models were fine-tuned on TPUs using the customized BERT checkpoint and custom wordpiece vocabulary from Section 4.1.2. However, a larger batch size of 128 was specified. All models were fine-tuned simultaneously on misogynous and aggressive labels using the provided data, where zero aggressiveness weights were assigned to data points with no misogynous labels.</p><p>Both submissions were based on ensembles of partitioned models evaluated on a 10% held-out test set. We explored two different ensembling techniques, which we discuss in the next section. AMI submission 1 does not include synthetic data. AMI submission 2 includes the synthetic data and custom bias mitigation data selected from Wikipedia articles. Table <ref type="table">2</ref> shows that the inclusion of such data significantly improved the performance on Task B for submission 2. Interestingly, the inclusion of synthetic and bias mitigation data slightly improved the performance in Task A as well. The two Jigsaw models ranked in first and second place for Task A. The second submission ranked first among participants for Task B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Ensembling Models</head><p>Both the first and second submissions for AMI were ensembles of fine-tuned custom BERT models constructed from partitioned training data. We explored two ensembling techniques <ref type="bibr" target="#b4">(Brownlee, 2020)</ref>:</p><p>• Majority Vote: Each partitioned model was evaluated using a model-specific threshold. The label for each attribute was determined by majority vote among the models.</p><p>• Average: The raw model probabilities are averaged together. The combined model calculates the labels via custom thresholds determined by evaluation on a held-out set.</p><p>Thresholds for the individual models in the majority vote and average ensemble were calculated to optimize for the point on the held-out data ROC curve where |TPR − (1 − FPR)| is minimized.</p><p>The majority voting model performed slightly better for both the misogynous and aggressive tasks on the held-out sets. As such, both submissions use majority vote.</p></div>
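<div xmlns="http://www.tei-c.org/ns/1.0"><p>The threshold rule and the two ensembling schemes described above can be illustrated with a minimal Python sketch. This is not the code used for the submissions: the function names (rates, balanced_threshold, majority_vote, average_scores) are hypothetical, thresholds are searched only over the observed held-out scores, and "majority" is assumed to mean at least half of the models, mirroring the 5-of-10 rule stated for HaSpeeDe2.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List, Sequence, Tuple


def rates(scores: Sequence[float], labels: Sequence[int],
          threshold: float) -> Tuple[float, float]:
    """Return (TPR, FPR) when an example is positive if score >= threshold.

    Assumes the held-out labels contain at least one positive and one negative.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def balanced_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the observed score that minimizes |TPR - (1 - FPR)| on held-out data."""
    def gap(threshold: float) -> float:
        tpr, fpr = rates(scores, labels, threshold)
        return abs(tpr - (1.0 - fpr))
    return min(sorted(set(scores)), key=gap)


def majority_vote(model_scores: Sequence[Sequence[float]],
                  thresholds: Sequence[float]) -> List[int]:
    """model_scores[m][i] is model m's score for example i.

    Each model votes with its own threshold; the label is 1 when at least
    half of the models vote positive (an assumption, see the lead-in).
    """
    n_models = len(model_scores)
    n_examples = len(model_scores[0])
    labels = []
    for i in range(n_examples):
        votes = sum(int(model_scores[m][i] >= thresholds[m])
                    for m in range(n_models))
        labels.append(int(2 * votes >= n_models))
    return labels


def average_scores(model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the raw probabilities across models, one value per example."""
    n_models = len(model_scores)
    return [sum(column) / n_models for column in zip(*model_scores)]
```
]]></code></p>
<p>For the average ensemble, the output of average_scores on the held-out set would be passed to balanced_threshold to obtain a single combined threshold, whereas the majority vote uses one threshold per model.</p></div>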
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">First Submission</head><p>Using the same configuration as Section 4.1.2, we partitioned the raw training data into ten randomly chosen partitions and fine-tuned nine of these using the 10% held-out portion to compute thresholds. No synthetic or de-biasing data was included in this submission.</p><p>We include ROC curves for half of these models in Figure <ref type="figure" target="#fig_1">4</ref>, to illustrate that they are similar but with some variance when used to score the test data.</p><p>Our first unconstrained submission using majority vote for AMI achieved scores of 0.738 for Task A and 0.649 for Task B. The poorer score for Task B is not surprising given that no bias mitigating data or constraints were included in training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3">Second Submission</head><p>In order to mitigate bias, we decided to augment the training data set using sentences sampled from the Italian Wikipedia articles that contain the 17 terms listed in the identity terms file provided with the test set data. These sentences were labeled as both non-misogynous and non-aggressive. 11K sentences were used for this purpose, with the term frequencies summarized in Table <ref type="table" target="#tab_5">3</ref>. The second submission employed the same partitioning of data with a held-out set. However, the unconstrained data included the raw training data, the provided synthetic data, and our de-biasing term data. As with submission 1, majority vote was used with custom thresholds determined by evaluation on the held-out set.</p><p>Our second unconstrained submission for AMI achieved scores of 0.741 for Task A and 0.883 for Task B.</p></div>
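<div xmlns="http://www.tei-c.org/ns/1.0"><p>The augmentation step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name build_neutral_examples, the dictionary keys "text", "misogynous" and "aggressiveness", and the whole-word, case-insensitive matching are all hypothetical choices, since the text does not say how sentences were matched or stored.</p>
<p><code lang="python"><![CDATA[
```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def build_neutral_examples(
    sentences: Iterable[str], identity_terms: Sequence[str]
) -> Tuple[List[Dict[str, object]], Counter]:
    """Keep sentences that mention an identity term and label them as negatives.

    Returns the labeled examples plus a per-term sentence count, comparable
    in spirit to the counts reported in Table 3.
    """
    # Whole-word match so that, e.g., "Madonna" does not count as "donna".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in identity_terms) + r")\b",
        re.IGNORECASE,
    )
    examples: List[Dict[str, object]] = []
    term_counts: Counter = Counter()
    for sentence in sentences:
        found = {match.group(1).lower() for match in pattern.finditer(sentence)}
        if not found:
            continue
        # Every sampled sentence is treated as non-misogynous and non-aggressive.
        examples.append({"text": sentence, "misogynous": 0, "aggressiveness": 0})
        term_counts.update(found)  # one count per distinct term in the sentence
    return examples, term_counts
```
]]></code></p>
<p>The resulting examples would simply be appended to the raw and synthetic training data before partitioning.</p></div>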
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Error Analysis</head><p>We discuss an informal analysis of the errors we observed with each of these tasks. Aside from the typical questions regarding data annotation quality, and the small sample sizes, we observed some particular instances of avoidable errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">HaSpeeDe2 Errors</head><p>Looking at the largest incongruities as shown in Table <ref type="table" target="#tab_6">4</ref>, it is clear that context, which is unavailable to our models, and presumably to the moderators, is important for determining the author's intent. The use of humor and the practice of quoting text from another author are also confounding factors. As this task is known to be hard <ref type="bibr" target="#b12">(Vigna et al., 2017;</ref><ref type="bibr">van Aken et al., 2018)</ref>, the edge cases display these confounding reasons. Additionally, as evidenced by the last comment, the subtlety of hate speech that is directed toward the designated target for this challenge has not been well captured.</p><p>Table 4, ID 11355: "@user @user @user Giustissimo, non bisogna mai nascondersi nelle ideologie, sopratutto oggi perché non esistono più. Sta di fatto, che le cose più aberranti che leggi oggi sui giornali hanno sempre a che fare con stranieri."</p><p>The BERT model that we fine-tuned for this application is cased, and we see within our errors frequent use of all-caps text. However, lowercasing the text has almost no effect on the scores, suggesting that the BERT pre-training has already linked the various cased versions of the tokens in the vocabulary.</p><p>We analyzed the frequency of word piece fragments in the data and saw no correlation between misclassification and the presence of segmented words. This suggests that vocabulary coverage in the test set does not play a significant role in explaining our systems' errors.</p><p>Considering the sentence with the high model score for hate speech, several single terms are tagged by the model. For example, the term "sfigati" occurs only once in the training data, in a sentence that is marked as non-hate speech. However, this term is not in our vocabulary and gets split into the pieces "sfiga" and "##ti", and the prefix "sfiga" appears in two out of three training examples that are marked hate speech: exactly the kind of data sparsity that leads to unwanted bias. 
Using a larger amount of training data, even if it creates an imbalance, is one way to address this, as we did in the case of the AMI challenge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">AMI</head><p>Because we are using ensemble models trained on partitions of the training set, we observe that the highest-scoring test samples that are marked non-misogynous and non-aggressive, as well as the lowest-scoring misogynous and aggressive comments, vary from model to model. However, we display the most frequently occurring mistakes across all ten ensembles in Table <ref type="table" target="#tab_8">5</ref>.</p><p>Regarding the false alarms, these comments appear to be mislabeled test instances, and there is ample support for this claim in the training data. The first comment combines both uppercase and a missing space. While it is true that the subjunctive mood is not well represented in the training data, lowercasing this sentence produces high scores. This is also the case with the third example. The second error seems more subtle, perhaps an attempt at humor, but one with no salient misogyny terms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Bias</head><p>Because the identity terms for AMI are not observed with a high frequency in the test data, we restrict our analysis to the synthetic data set. We find wide variation in the performance of our individual models, with one model exhibiting very poor performance across the subgroups. The summary of the AUC measurements for these models is shown in Figure <ref type="figure">5</ref>, Figure <ref type="figure">6</ref>, and Figure <ref type="figure" target="#fig_3">7</ref> using the technique presented in <ref type="bibr" target="#b2">(Borkan et al., 2019)</ref>.</p><p>There does not appear to be a systemic problem with bias in these models, but judging based only upon synthetic data is probably unwise. The single term "donna" from the test set shows a subgroup AUC that drops substantially from the background AUC for all of the models, perhaps indicating limitations of judging based on synthetic data.</p><p>Both of these challenges dealt with issues related to content moderation and evaluation of user-generated content. While early research raised fears of censorship, the ongoing challenges platforms face have made it necessary to consider the potential of machine learning. Advances in natural language understanding have produced models that work surprisingly well, even ones that are able to detect malicious intent that users try to encode in subtle ways. Our particular approach to the EVALITA challenges represented an unsurprising application of what has now become a textbook technique: leveraging the resources of large pre-trained models. However, many participants achieved similar performance levels in the constrained task. We regard this as a more impressive accomplishment.</p><p>Jigsaw continues to apply machine learning to support publishers and to help them host quality online conversations where readers feel safe participating. 
The kinds of comments these challenges tagged are some of the most concerning and pernicious online behaviors, far outside of the norms that are tolerated in other public spaces. But humans and machines both still mistake profanity for hostility, and tagging humor, quotations, sarcasm, and other legitimate expressions for moderation remains a serious problem.</p><p>Challenges like the AMI and HaSpeeDe2 competitions underscore the importance of understanding the relationships between the parties in a conversation, and the participants' intents. We are greatly encouraged that attributes that our systems do not currently capture were somewhat within the reach of our present techniques, but clearly much work remains to be done.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: ROC curves for PerspectiveAPI multilingual teacher model attributes compared to the AMI attributes (misogyny and aggressiveness).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: ROC plots for AMI test set labels for models pre-ensemble.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Subgroup AUC</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 7 :</head><label>7</label><figDesc>Figure 6: Background Positive, Subgroup Negative AUC</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Figure 1: ROC curves for the PerspectiveAPI multilingual teacher model attributes compared to the HaSpeeDe2 attributes (hate speech and stereotype). The table lists the legend percentages recovered from the plots of Figure 1 (hs, stereotype) and Figure 2 (misogynous, aggressiveness); plot axes are false positive rate and true positive rate.</figDesc><table><row><cell>PerspectiveAPI attribute</cell><cell>hs</cell><cell>stereotype</cell><cell>misogynous</cell><cell>aggressiveness</cell></row><row><cell>identity_attack</cell><cell>83.6%</cell><cell>71.2%</cell><cell>56.5%</cell><cell>61.8%</cell></row><row><cell>severe_toxicity</cell><cell>82.6%</cell><cell>70.9%</cell><cell>44.4%</cell><cell>53.9%</cell></row><row><cell>toxicity</cell><cell>82%</cell><cell>70.4%</cell><cell>39.8%</cell><cell>48.4%</cell></row><row><cell>insult</cell><cell>80.8%</cell><cell>70.5%</cell><cell>45.2%</cell><cell>53.3%</cell></row><row><cell>profanity</cell><cell>79%</cell><cell>69.9%</cell><cell>28.8%</cell><cell>37%</cell></row><row><cell>threat</cell><cell>68.1%</cell><cell>63.6%</cell><cell>61.5%</cell><cell>66%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 3 :</head><label>3</label><figDesc>Term frequency in Wikipedia sampled sentences for bias mitigation.</figDesc><table><row><cell>Identity Term</cell><cell>Sentence Count</cell></row><row><cell>donna</cell><cell>4306</cell></row><row><cell>donne</cell><cell>3100</cell></row><row><cell>femmine</cell><cell>1275</cell></row><row><cell>femmina</cell><cell>652</cell></row><row><cell>fidanzata</cell><cell>538</cell></row><row><cell>nonna</cell><cell>378</cell></row><row><cell>mamma</cell><cell>269</cell></row><row><cell>casalinga</cell><cell>256</cell></row><row><cell>casalinghe</cell><cell>187</cell></row><row><cell>compagne</cell><cell>132</cell></row><row><cell>compagna</cell><cell>34</cell></row><row><cell>mamme</cell><cell>24</cell></row><row><cell>fidanzate</cell><cell>12</cell></row><row><cell>nonne</cell><cell>11</cell></row><row><cell>matrone</cell><cell>9</cell></row><row><cell>matrona</cell><cell>8</cell></row><row><cell>morosa</cell><cell>6</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 4 :</head><label>4</label><figDesc>Largest Errors for hate speech classifier on HaSpeeDe2 Tweet data erators, is important for determining the author's intent. The use of humor and the practice of quoting text from another author are also confounding factors.</figDesc><table><row><cell>.. o rom</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head></head><label></label><figDesc></figDesc><table><row><cell>ID</cell><cell>Comment</cell><cell>Error Type</cell></row><row><cell>5466</cell><cell>Odio Sakura per il semplice motivo che qualunque cosa faccia o dica Naruto lei lo prende a schiaffi o a pugniHA CHIESTO COME STA SAI DIOSANTO BRUTTA STRONZA MA CONTRALLI MADONNA SPERO CHE TI UCCIDANOscusami Sarada</cell><cell>Missed Detection</cell></row><row><cell>5385</cell><cell>porca troia la prenderei a schiaffi se non fosse mia madre</cell><cell>Missed Detection</cell></row><row><cell>5819</cell><cell>ma in tutto ciò lo scopo di anna qual è? far soffrire il mio protetto? IO TI AMMAZZO COI LANCIAFIAMME OH #TemptationIsland</cell><cell>Missed Detection</cell></row><row><cell>5471</cell><cell>@danielita8811 Che bel culo tutto da sfondare</cell><cell>False Alarm</cell></row><row><cell>5604</cell><cell>@coppiacalda2 Che bel culo da inculare</cell><cell>False Alarm</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 5 :</head><label>5</label><figDesc>Persistent errors for AMI across different ensembles.</figDesc><table /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian</title>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maria</forename><surname>Di Maro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucia</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
		<ptr target="CEUR.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop</title>
				<editor>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Maria</forename><surname>Di Maro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Lucia</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</editor>
		<meeting>Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop<address><addrLine>EVALITA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Ekaba</forename><surname>Bisong</surname></persName>
		</author>
		<title level="m">Google AutoML: Cloud Natural Language Processing</title>
				<meeting><address><addrLine>Berkeley, CA</addrLine></address></meeting>
		<imprint>
			<publisher>Apress</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="599" to="612" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Nuanced metrics for measuring unintended bias with real data for text classification</title>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Borkan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucas</forename><surname>Dixon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Sorensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nithum</forename><surname>Thain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucy</forename><surname>Vasserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of The 2019 World Wide Web Conference</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="491" to="500" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">Cristina</forename><surname>Bosco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tommaso</forename><surname>Caselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gloria</forename><surname>Comandini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Elisa</forename><surname>Di Nuovo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simona</forename><surname>Frenda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Irene</forename><surname>Russo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manuela</forename><surname>Sanguinetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Stranisci</surname></persName>
		</author>
		<ptr target="https://github.com/msang/haspeede/blob/master/2020/HaSpeeDe2020_Task_guidelines.pdf" />
		<title level="m">Hate speech detection task second edition (haspeede2) at evalita 2020 task guidelines</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">Jason</forename><surname>Brownlee</surname></persName>
		</author>
		<ptr target="https://machinelearningmastery.com/voting-ensembles-with-python/" />
		<title level="m">How to develop voting ensembles with Python</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct>
	<monogr>
		<author>
			<persName><forename type="first">Francois</forename><surname>Chollet</surname></persName>
		</author>
		<title level="m">Keras</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming-Wei</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-06">June 2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">AMI @ EVALITA2020: Automatic misogyny identification</title>
		<author>
			<persName><forename type="first">Elisabetta</forename><surname>Fersini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Debora</forename><surname>Nozza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</author>
		<ptr target="CEUR.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian</title>
		<editor>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Danilo</forename><surname>Croce</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Maria</forename><surname>Di Maro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Lucia</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</editor>
		<meeting>the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian<address><addrLine>EVALITA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Don&apos;t stop pretraining: Adapt language models to domains and tasks</title>
		<author>
			<persName><forename type="first">Suchin</forename><surname>Gururangan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ana</forename><surname>Marasović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Swabha</forename><surname>Swayamdipta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyle</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Iz</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Doug</forename><surname>Downey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noah</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020-07">July 2020</date>
			<biblScope unit="page" from="8342" to="8360" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><surname>Jigsaw</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification" />
		<title level="m">Jigsaw unintended bias in toxicity classification</title>
				<imprint>
			<date type="published" when="2018">2018. 2019. 2020. July</date>
		</imprint>
	</monogr>
	<note>Jigsaw multilingual toxic</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Misogyny detection in Twitter: a multilingual and cross-domain study</title>
		<author>
			<persName><forename type="first">Endang</forename><forename type="middle">Wahyu</forename><surname>Pamungkas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valerio</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page">102360</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">How multilingual is multilingual BERT?</title>
		<author>
			<persName><forename type="first">Telmo</forename><surname>Pires</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eva</forename><surname>Schlinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><surname>Garrette</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="4996" to="5001" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Challenges for toxic comment classification: An in-depth error analysis</title>
		<author>
			<persName><forename type="first">Betty</forename><surname>van Aken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julian</forename><surname>Risch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Krestel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexander</forename><surname>Löser</surname></persName>
		</author>
		<ptr target="https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)</title>
				<meeting>the 2nd Workshop on Abusive Language Online (ALW2)<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018-10">October 2018</date>
			<biblScope unit="page" from="33" to="42" />
		</imprint>
	</monogr>
	<note>Multilingual L12 H768 A12 V2. 2020.</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Hate me, hate me not: Hate speech detection on Facebook</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">D</forename><surname>Vigna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felice</forename><surname>Dell'Orletta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Petrocchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tesconi</surname></persName>
		</author>
		<meeting>ITASEC</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT</title>
		<author>
			<persName><forename type="first">Shijie</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Dredze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-11">November 2019</date>
			<biblScope unit="page" from="833" to="844" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Google&apos;s neural machine translation system: Bridging the gap between human and machine translation</title>
		<author>
			<persName><forename type="first">Yonghui</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhifeng</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohammad</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wolfgang</forename><surname>Macherey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maxim</forename><surname>Krikun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuan</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qin</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Klaus</forename><surname>Macherey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeff</forename><surname>Klingner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Apurva</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Melvin</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaobing</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephan</forename><surname>Gouws</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshikiyo</forename><surname>Kato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Taku</forename><surname>Kudo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hideto</forename><surname>Kazawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Keith</forename><surname>Stevens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><surname>Kurian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nishant</forename><surname>Patil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cliff</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jason</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jason</forename><surname>Riesa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alex</forename><surname>Rudnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oriol</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Greg</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Macduff</forename><surname>Hughes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<idno>CoRR, abs/1609.08144</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
