<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Transformer Ensembles for Sexism Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lily</forename><surname>Davies</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Marta</forename><surname>Baldracchi</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Carlo</forename><forename type="middle">Alessandro</forename><surname>Borella</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Konstantinos</forename><surname>Perifanos</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">National and Kapodistrian University of Athens</orgName>
								<address>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff0">
								<address>
									<addrLine>codec.ai</addrLine>
									<settlement>London</settlement>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Transformer Ensembles for Sexism Detection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">35EFECA10C26FF66C90CE364CF2B2F2F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:22+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Sexism Detection</term>
					<term>Transformers</term>
					<term>Deep Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This document presents in detail our work on the sexism detection task at the EXIST 2021 workshop. Our methodology is built on ensembles of Transformer-based models that are pre-trained on different corpora and fine-tuned on the dataset provided for the EXIST 2021 workshop. We report an accuracy of 0.767 and an F1 score of 0.766 for the binary classification task (task1), and an accuracy of 0.623 and an F1 score of 0.535 for the multi-class task (task2).</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The EXIST workshop (sEXism Identification in Social neTworks) is the first shared task at IberLEF 2021 <ref type="bibr" target="#b0">[1]</ref>. The aim of the workshop is to build classifiers for sexism detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Dataset and Tasks</head><p>As part of the workshop, an annotated dataset <ref type="bibr" target="#b1">[2]</ref> has been provided, consisting of sexist expressions, in both English and Spanish, commonly used on social media. The sources of these texts are the Twitter and Gab social media platforms.</p><p>The training dataset consists of 6977 texts, 3436 in English and 3541 in Spanish. In the training set, all of the texts have Twitter as their source. The training dataset has been labelled with two separate label sets, which correspond to the two workshop tasks: first, a higher-level binary annotation per text, indicating whether the particular text is sexist or not; and a second layer of annotation per text, where a text identified as sexist is also assigned one of the following labels: {IDEOLOGICAL AND INEQUALITY, STEREOTYPING AND DOMINANCE, OBJECTIFICATION, SEXUAL VIOLENCE, MISOGYNY AND NON-SEXUAL VIOLENCE}.</p><p>For the first task, the dataset contains 3377 texts labelled as sexist and 3600 labelled as non-sexist, so it is fairly balanced. For the second task, the distribution of the tweets labelled as sexist across their subcategories is shown in Table <ref type="table" target="#tab_0">1</ref>. The test dataset consists of 4368 texts, 2208 in English and 2160 in Spanish. In this dataset 3386 of the texts are sourced from Twitter and 982 from the social network Gab.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">System Architecture</head><p>We combine two major approaches in our design: fine-tuning separate BERT <ref type="bibr" target="#b2">[3]</ref> models pre-trained on Spanish and on English, and building ensembles of models per language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Task1</head><p>We fine-tune six separate models starting from different weight configurations, three for English and three for Spanish.</p><p>At test time, we report the majority vote of the ensembles, i.e. if at least two of the three classifiers agree on a prediction, we report that prediction as the final decision. Our work follows the reasoning and results of <ref type="bibr" target="#b3">[4]</ref>, where several models are trained on the same dataset with the same loss function but starting from different random weight initialisations. The majority vote of the model ensemble then tends to perform better than each standalone model.</p><p>In the training process, we fine-tune the BERT models <ref type="bibr" target="#b4">[5]</ref> on the training data, using an 80-20 training-test split for each model, per language, where the languages available in the dataset are English and Spanish. This choice follows the empirical observation that language-specific pre-trained BERT architectures tend to capture the subtleties of a language better and to perform better in classification benchmarks.</p><p>For English texts we use the pre-trained model from <ref type="bibr" target="#b5">[6]</ref>, available in the Huggingface transformers library <ref type="bibr" target="#b6">[7]</ref>. For Spanish texts we fine-tune the BETO pre-trained model <ref type="bibr" target="#b7">[8]</ref>.</p><p>We use PyTorch <ref type="bibr" target="#b8">[9]</ref> for the implementation. Before feeding the texts to the classifiers, we pre-process them by replacing user mentions with a mention token and URLs with a URL token. While user mentions and handles can, in theory, implicitly capture social graph structure and potentially increase classifier performance <ref type="bibr" target="#b9">[10]</ref>, we choose to rely only on textual information and discard social-graph and source information for the tasks in question.</p><p>We first filter the input training set based on the language indicator and then seed three neural networks with different initial random weights. We train each network for 10 epochs and select the epoch with the best accuracy as the final model for that run. At test time, we feed the same text to all three neural networks and report the majority vote of the classifiers as the final classification: a text is classified as sexist if at least two of the three classifiers report it as sexist and, similarly, as non-sexist if at least two of the three report it as non-sexist. Based on our experiments, this tends to give a ∼ 2% increase in the overall reported accuracy. We use the Adam optimiser for training <ref type="bibr" target="#b10">[11]</ref>.</p></div>
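The pre-processing and majority-vote steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the placeholder token strings and function names are hypothetical, as the paper does not publish its implementation.

```python
import re
from collections import Counter

# Hypothetical placeholder tokens; the paper does not specify the exact strings used.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def preprocess(text: str) -> str:
    """Replace user mentions and URLs with placeholder tokens."""
    text = MENTION_RE.sub("[MENTION]", text)
    return URL_RE.sub("[URL]", text)

def majority_vote(predictions):
    """Return the label predicted by most classifiers in the ensemble.

    With three classifiers, the most common label is exactly the one
    that at least two out of three agree on.
    """
    return Counter(predictions).most_common(1)[0][0]
```

For a three-model ensemble, `majority_vote([1, 1, 0])` returns `1`, matching the two-out-of-three rule described above.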
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Task2</head><p>For the second task, we train a classifier to distinguish between the categories of sexist tweets. That is, we keep only the texts in the training dataset that have been flagged as sexist and train a multi-class model on these texts.</p><p>Again, we train separate models for Spanish and English texts. Since we have the language labels of the texts, there is no need to determine the language of a text; in the general case, however, it would be straightforward to add one more step to the pipeline for language detection using an off-the-shelf library such as langdetect (https://pypi.org/project/langdetect/).</p><p>At test time, we first apply the task1 classifier to detect whether a tweet is sexist or not. If the text is labelled as non-sexist, we report this label for task2 as well. If the task1 model reports sexist as the label, we feed the tweet to the second classifier to obtain a prediction for the sexism categorisation label as described above.</p></div>
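The two-stage inference just described can be sketched as below. The classifier arguments stand in for the fine-tuned models and the label string is illustrative; this is a sketch of the cascade logic, not the paper's actual code.

```python
NON_SEXIST = "non-sexist"

def predict_task2(text, is_sexist, categorise):
    """Cascade inference: the binary task1 model runs first, and only
    texts it flags as sexist reach the multi-class task2 model."""
    if not is_sexist(text):
        # Task1 says non-sexist: reuse that label for task2 as well.
        return NON_SEXIST
    # Task1 says sexist: ask the second classifier for the category.
    return categorise(text)
```

A design consequence of this cascade is that any false negative or false positive from the task1 model is passed through unchanged, which is the error accumulation discussed in the analysis section.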
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results and Discussion</head><p>For the first task, we achieve an accuracy of 0.767, with a macro F1 score of 0.766 (10th place in the ranking). For the second task, accuracy is 0.623 and the F1 score is 0.535 (19th place in the ranking).</p><p>Breaking down the final results by language, for task1 the accuracy on English texts is 0.7445 and on Spanish texts 0.789. For task2, the accuracy on English texts is 0.583 with an F1 score of 0.493, whereas on Spanish texts accuracy is 0.664 and the F1 score is 0.575. Overall, the system tends to perform better on Spanish, most probably because the underlying BERT model for Spanish (BETO) is trained exclusively on Spanish texts.</p><p>Whereas we use ensembles for the first task, we decided not to adopt this strategy for the multi-class classification of task2, as it would have significantly increased training and testing time as well as computational cost. Instead, we use the output of the first task and train only one classifier per language for the second task. It is worth noting that the observation from <ref type="bibr" target="#b3">[4]</ref> is validated in our case, as the ensemble achieves up to 2% higher accuracy than the standalone models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Analysis</head><p>It is interesting to see that our models fail to correctly identify texts that have been labelled as non-sexist but contain words that commonly appear in sexist contexts. They also fail to correctly categorise sexist tweets that appear in the same context as non-sexist words (for example, the word friend), short texts, and sometimes subtle or contextual uses of sexist language.</p><p>For task2, the predictions inherit the error of the task1 classifier, e.g. tweets falsely reported as sexist, which accumulates with the error of the second model. The confusion matrices for the two tasks are shown in figure <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Improvements and future research</head><p>One question we aim to investigate is the optimal number of models in the ensemble from a classification-accuracy perspective. This is a more general question and direction to investigate for Transformer-based models for text classification.</p><p>Additionally, it has recently been shown that Convolutional Neural Networks can outperform Transformer-based architectures in Natural Language Processing tasks <ref type="bibr" target="#b11">[12]</ref>. The use and fine-tuning of pre-trained convolutions in the domain of abusive and toxic speech would be an interesting direction to investigate.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Confusion matrices</figDesc><graphic coords="4,134.77,370.63,170.08,112.82" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Number of labels per class, training dataset.</figDesc><table><row><cell>Label</cell><cell>Count of examples</cell></row><row><cell>objectification</cell><cell>500</cell></row><row><cell>sexual-violence</cell><cell>517</cell></row><row><cell cols="2">misogyny-non-sexual-violence 685</cell></row><row><cell>stereotyping-dominance</cell><cell>809</cell></row><row><cell>ideological-inequality</cell><cell>866</cell></row><row><cell>non-sexist</cell><cell>3600</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Manuel</forename><surname>Montes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
		</author>
		<title level="m">Proceedings of the Iberian Languages Evaluation Forum</title>
				<meeting>the Iberian Languages Evaluation Forum<address><addrLine>IberLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of EXIST 2021: sEXism Identification in Social neTworks</title>
		<author>
			<persName><forename type="first">Francisco</forename><surname>Rodríguez-Sánchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jorge</forename><surname>Carrillo-De-Albornoz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<idno type="ISSN">1989-7553</idno>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="issue">0</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming-Wei</forename><surname>Chang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://www.aclweb.org/anthology/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-06">June 2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Long and Short Papers</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning</title>
		<author>
			<persName><forename type="first">Zeyuan</forename><surname>Allen-Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yuanzhi</forename><surname>Li</surname></persName>
		</author>
		<idno>arXiv preprint 2012.09816</idno>
		<ptr target="https://www.microsoft.com/en-us/research/publication/towards-understanding-ensemble-knowledge-distillation-and-self-distillation-in-deep-learning/" />
		<imprint>
			<date type="published" when="2020-12">Dec. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Attention is All you Need</title>
		<author>
			<persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noam</forename><surname>Shazeer</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><forename type="middle">V</forename><surname>Luxburg</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020</title>
		<author>
			<persName><forename type="first">Sudhanshu</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shivangi</forename><surname>Prasad</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.trac-1.19" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying</title>
				<meeting>the Second Workshop on Trolling, Aggression and Cyberbullying<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2020-05">May 2020</date>
			<biblScope unit="page" from="120" to="125" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">HuggingFace&apos;s Transformers: State-of-the-art Natural Language Processing</title>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lysandre</forename><surname>Debut</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.03771</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>cs.CL</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Spanish Pre-Trained BERT Model and Evaluation Data</title>
		<author>
			<persName><forename type="first">José</forename><surname>Cañete</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gabriel</forename><surname>Chaperon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PML4DC at ICLR</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Automatic differentiation in PyTorch</title>
		<author>
			<persName><forename type="first">Adam</forename><surname>Paszke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sam</forename><surname>Gross</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>NIPS-W</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Neural Embeddings for Idiolect Identification</title>
		<author>
			<persName><forename type="first">Konstantinos</forename><surname>Perifanos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eirini</forename><surname>Florou</surname></persName>
		</author>
		<idno type="DOI">10.1109/IISA.2018.8633681</idno>
	</analytic>
	<monogr>
		<title level="m">2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="3" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Adam: A Method for Stochastic Optimization</title>
		<author>
			<persName><forename type="first">Diederik</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jimmy</forename><surname>Ba</surname></persName>
		</author>
	<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Are Pre-trained Convolutions Better than Pre-trained Transformers?</title>
		<author>
			<persName><forename type="first">Yi</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mostafa</forename><surname>Dehghani</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.03322</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>cs.CL</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
