<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sexism Prediction in Spanish and English Tweets Using Monolingual and Multilingual BERT and Ensemble Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Angel Felipe Magnossão de Paula</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Fray da Silva</string-name>
          <email>roberto.fray.silva@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Escola Politécnica da Universidade de São Paulo</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The popularity of social media has created problems such as hate speech and sexism. The identification and classification of sexism in social media are very relevant tasks, as they would allow building a healthier social environment. Nevertheless, these tasks are considerably challenging. This work proposes a system that uses multilingual and monolingual BERT, data point translation, and ensemble strategies for sexism identification and classification in English and Spanish. It was conducted in the context of the sEXism Identification in Social neTworks (EXIST 2021) shared task, proposed by the Iberian Languages Evaluation Forum (IberLEF). The proposed system and its main components are described, and an in-depth hyperparameter analysis is conducted. The main results observed were: (i) the system obtained better results than the baseline model (multilingual BERT); (ii) ensemble models obtained better results than monolingual models; and (iii) an ensemble model considering all individual models and the best standardized values obtained the best accuracies and F1-scores for both tasks. This work obtained first place in both tasks at EXIST, with the highest accuracies (0.780 for task 1 and 0.658 for task 2) and F1-scores (F1-binary of 0.780 for task 1 and F1-macro of 0.579 for task 2).</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism identification</kwd>
        <kwd>Sexism classification</kwd>
        <kwd>BERT</kwd>
        <kwd>Deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The emergence of social networks and microblogs has created a new medium for
people to express themselves, providing freedom of speech and the possibility
of quickly spreading opinions, news, and information [17,14]. This has considerably
impacted people's lives by increasing access to all kinds of information.</p>
      <p>
        Nevertheless, a small part of the users employs those media for spreading hate
messages, increasing the impacts of racism, sexism, and other types of prejudice
and hate speech [
        <xref ref-type="bibr" rid="ref10">32,10</xref>
        ].
      </p>
      <p>
        One crucial problem faced by the different stakeholders related to social
media platforms is detecting hate speech [
        <xref ref-type="bibr" rid="ref10 ref5 ref6">10,35,6,5</xref>
        ], both in general and
issue-specific forms. Also, some types of hate speech tend to be more challenging to
identify, as they present characteristics such as irony or sarcasm, among
others [
        <xref ref-type="bibr" rid="ref6">35,6,23</xref>
        ]. Sexism is a type of toxic language that can be used both as hate
speech and, in a much more subtle way, as sarcasm. Sexism comprises all kinds
of behaviors and content that aim to spread prejudice against women, reduce
their importance in society, or treat them aggressively or offensively [
        <xref ref-type="bibr" rid="ref6">27,6,23</xref>
        ]. There
are several forms of sexism, and identifying them in social media messages is
a fundamental challenge among the various natural language processing (NLP)
tasks [
        <xref ref-type="bibr" rid="ref10 ref5 ref6">10,35,27,6,23,5,26</xref>
        ].
      </p>
      <p>
        The detection of sexism can be broken into two main tasks: (i) sexism
identification, which aims to identify whether a message or post contains sexist
content (regardless of the type of sexism contained in it); and (ii) sexism
classification, which aims to classify the type of sexism contained in a given
sexist message or post [
        <xref ref-type="bibr" rid="ref6">6,27,26,11</xref>
        ]. Both are very relevant, and the second task depends on
the first, as it needs posts that are confirmed as sexist as inputs for the
different classification models. Additionally, the difficulty of using data-driven
models may increase for languages that are more complex or that have fewer
resources available, such as high-quality word embeddings, pre-trained
language-specific models, and task-specific lexicons, among others.
      </p>
      <p>To advance the state-of-the-art knowledge in both sexism identification and
classification on social media messages, the Iberian Languages Evaluation
Forum (IberLEF) proposed the sEXism Identification in Social neTworks
(EXIST 2021) shared task. For the rest of this work, this challenge will
be referred to as the EXIST shared task. The main goal of the IberLEF forum is
to promote scientific advances towards innovative solutions for detecting sexism
on social media platforms [19]. For this reason, the 2021 shared task provided
datasets in English and Spanish for both tasks, labeled by experts following
state-of-the-art data collection and labeling procedures [19]. Those datasets are
expected to become benchmarks for state-of-the-art research on sexism
identification and classification on social media messages.</p>
      <p>Therefore, a relevant gap in the literature is to develop data-driven models
that better identify and classify sexist content in social media messages,
considering implementation in different languages. This would: (i) advance both
the knowledge on the use of artificial intelligence models for data-driven sexism
identification and detection; (ii) provide a better methodology for identifying and
classifying sexist content, which is highly relevant for identifying unacceptable
user behavior; and (iii) address the problem of generalizing the model
across different languages. Related to this gap, it is vital to observe that identifying
online sexism can be considerably challenging because posts may take several
forms: they may sound hateful and offensive, or friendly and funny, misleading
the current classification models used for this task [27].</p>
      <p>State-of-the-art systems for addressing those tasks in multiple languages
use the Bidirectional Encoder Representations from Transformers (BERT)
multilingual model, an NLP model that uses transformers and is pre-trained on
comprehensive text corpora [31,25,21,20,16]. This model is trained on datasets
in multiple languages, but it is not language-specific. The pre-trained models are
then fine-tuned on task-specific datasets in the target language.</p>
      <p>The main goal of this work is to propose and evaluate a system to identify and
classify sexist content in social media messages in multiple languages, using the
EXIST 2021 shared task dataset [19] for implementation and evaluation. The
official shared task metrics were used: accuracy for task 1 (sexism detection)
and F1-macro for task 2 (sexism classification). However, we also implemented
other relevant metrics for NLP tasks, precision and recall, to better evaluate the
different models in relation to the state-of-the-art baseline model, the
multilingual BERT.</p>
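      <p>The two task metrics differ only in how per-class F1 scores are aggregated. A minimal sketch in Python (the function names are ours, for illustration, not the official evaluation script):</p>

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1 with one class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def f1_binary(y_true, y_pred):
    # Task 1: F1 of the positive (sexist) class only.
    return precision_recall_f1(y_true, y_pred, positive=1)[2]

def f1_macro(y_true, y_pred):
    # Task 2: unweighted mean of the per-class F1 scores over all classes.
    classes = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)
```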
      <p>The three main research questions addressed in this work are: (i) does the
use of monolingual BERT models provide better results than the multilingual
BERT model for identifying and classifying sexist content in social media
messages in English and Spanish?; (ii) does the use of an ensemble strategy
improve the results of the individual models?; and (iii) do the results differ
between the English and Spanish languages? Besides answering those three
questions, this work also conducts an in-depth analysis of the main
hyperparameters of the implemented models for both languages.</p>
      <p>The main contribution of this work is to propose and evaluate a sexism
identification and classification system for multiple languages considering
different components: monolingual BERT models, multilingual BERT, data point
translation, and different ensemble strategies. We also explore the main
hyperparameters of the implemented models in depth, comparing the final models with
the state-of-the-art multilingual BERT model. This work obtained first place in
both sexism identification and classification tasks at the EXIST shared task [19].</p>
      <p>This work is organized in the following sections: section 2 describes the main
concepts and models used for sexism prediction in social media messages;
section 3 contains the main steps of the methodology used; section 4 describes the
proposed system for addressing both sexism identification and classification;
section 5 contains the main results of the system's implementation on the EXIST
shared task dataset; section 6 contains a discussion of relevant topics on the
system's use, modification, and potential improvements; and section 7 concludes
this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Sexism identification and classification using artificial intelligence models</title>
      <p>
        The works by [
        <xref ref-type="bibr" rid="ref6">23,27,6</xref>
        ] explore in depth the impacts of the different types of
sexism on social media platforms, describing several important classes of sexism.
As sexism is an important type of hate speech, we also refer the reader to the
works by [
        <xref ref-type="bibr" rid="ref5">5,35</xref>
        ] for excellent reviews on identifying and classifying the different
forms of hate speech. The main concepts observed in those works were considered
in our approach.
      </p>
      <p>
        This work addresses two very relevant tasks: (i) sexism identification in
natural language texts; and (ii) classification of types of sexism in natural language
texts. Some examples of works that addressed the first task are [21,18,24].
Examples that addressed the second task are [
        <xref ref-type="bibr" rid="ref10">10,16,29</xref>
        ].
      </p>
      <p>It is essential to observe that the second task is considerably more complex
because different languages can be used in the different classes (as well as the
traditional problems related to social media messages: abbreviations, emojis,
misspellings, memes, among others).</p>
      <p>
        Although there is a variety of different models and strategies used for sexism
detection and classification, the most traditionally used models are: support
vector machines (SVM), convolutional neural networks (CNN), long short-term
memory networks (LSTM), and BiLSTM [
        <xref ref-type="bibr" rid="ref6">23,13,27,11,6,35</xref>
        ]. In recent years,
BERT has been widely used [
        <xref ref-type="bibr" rid="ref6">23,13,27,11,6,35</xref>
        ]. This model (and its variations)
has presented the best results on those tasks, as observed in the works by
[16,31,21,25].
      </p>
      <p>The NLP literature addresses several identification and classification tasks
related to extracting and evaluating opinions from natural language texts. In
general, those tasks are addressed using three main approaches [15,30,22]: (i)
lexical-based, in which specific dictionaries (lists of words with corresponding
values on important dimensions for the task) are used to classify the input text;
(ii) statistical learning or machine learning-based, in which machine learning and
deep learning models are used, generally with word embeddings or bag-of-words
models, to classify the text; and (iii) hybrid, in which both lexicons and machine
learning models are used.</p>
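      <p>Approach (i) can be illustrated with a toy example. The lexicon entries and the decision threshold below are invented purely for illustration; real lexicons such as Hurtlex contain thousands of curated entries:</p>

```python
# Toy illustration of the lexical-based approach: score a text against a
# small word->value dictionary. Entries and threshold are invented.
TOXICITY_LEXICON = {"hate": 1.0, "stupid": 0.8, "great": -0.5, "kind": -0.7}

def lexicon_score(text):
    """Sum the lexicon values of the tokens present in the text."""
    tokens = text.lower().split()
    return sum(TOXICITY_LEXICON.get(tok, 0.0) for tok in tokens)

def classify_with_lexicon(text, threshold=0.5):
    """Flag the text (1) when its accumulated score reaches the threshold."""
    return 1 if lexicon_score(text) >= threshold else 0
```

Note the limitation pointed out in the next paragraph: a fixed dictionary cannot learn from labeled data, which motivates the deep learning models used in this work.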
      <p>
        However, it is essential to note that: (i) lexical-based systems are not able to
learn (and could be improved by using a deep learning model, such as BERT);
(ii) deep learning models, especially the multilingual BERT, are state-of-the-art
on sexism identification and classification [31,25,21,20,16]; (iii) lexicons tend
to be language-specific, making it more challenging to apply the solutions to
multiple languages; and (iv) few works use BERT with domain-specific lexicons
for sexism identification and classification. One lexicon that is highly
relevant in this context is the Hurtlex lexicon [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It was used in the works by
[24,16], among several others.
      </p>
      <p>
        The BERT model was proposed by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and can be described as a language
learning model aimed at providing a general structure that can be further refined
by fine-tuning on specific tasks and domains. Its main objective is to learn the
main features and semantics of a language, based on semi-supervised learning
on vast text corpora (such as the BookCorpus and the Wikipedia database)
[
        <xref ref-type="bibr" rid="ref1 ref9">9,28,1</xref>
        ]. Its architecture and training workflow are composed of three main
components: transformers (an advanced deep learning model), bidirectional
training, and the use of encoder representations [
        <xref ref-type="bibr" rid="ref1 ref9">9,28,1</xref>
      </p>
      <p>
        In this work, we use the multilingual BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the English version of the
model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and the Spanish version of the model, called BETO [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For an
in-depth analysis of how the BERT model works, we refer the reader to [28]. For
an in-depth comparison of multilingual BERT with other models, as well as
an in-depth description of how they work, we refer the reader to [34].
      </p>
      <p>However, very few works in the literature consider datasets with
multiple languages. This work addresses this gap by proposing a system that
contains multiple models and ensemble strategies.</p>
      <p>This paper aims to fill the gap of evaluating monolingual and
multilingual BERT models for identifying and classifying sexism in texts in multiple
languages.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>The methodology used in this work was composed of six steps. Figure 1 illustrates
the strategy used to tackle each of the tasks. For task 1 (sexism identification),
the classification models considered two labels: 0 (non-sexist) and 1 (sexist).
For task 2 (sexism classification), the tweets labeled as non-sexist (from task
1) were eliminated. Then, the classification models were used to predict the
following sexism categories on the remaining tweets: ideological and inequality;
stereotyping and dominance; objectification; sexual violence; and misogyny and
non-sexual violence. For a thorough description of those classes, we refer to
the EXIST shared task at IberLEF 2021 [19], which developed and labeled the
dataset used in this research.</p>
      <p>The steps of the methodology were:
1. Data collection: we used the dataset developed for the EXIST shared
task at IberLEF 2021 [19]. This dataset contained labeled data from two social
media platforms: Twitter and Gab. For an in-depth description of this dataset,
we refer the reader to Section 5 of this work;</p>
      <p>2. Data processing: for both tasks, we used the following processing
techniques: separation of the dataset by language (English and Spanish),
tokenization, lemmatization, and elimination of stop words. These are widely used
in the literature for the implementation of machine learning models on NLP
tasks such as hate speech detection, sexism identification, and sentiment analysis,
among others [31,25,21,20,16,22]. There was no need to eliminate data points
from the datasets, as the shared task organizers had already thoroughly curated
them. The training subset was then divided into training (80%) and validation
(20%) for cross-validation purposes. Additionally, one of the training
strategies used for some of the implemented models involved translating the social
media messages from one language to the other (for example, from English to
Spanish to train a Spanish language model). This strategy doubled the
number of data points available for the single-language models (even if part of the
meaning may have been lost during the translation process). The googletrans
(https://github.com/ssut/py-googletrans) library was used for the
translation process;</p>
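      <p>The per-language 80%/20% split described above can be sketched as follows. This is a simplified illustration: the record layout and function names are ours, and the commented googletrans call (which requires network access) shows roughly how the translation-based augmentation would be invoked:</p>

```python
import random

def train_validation_split(records, train_frac=0.8, seed=42):
    """Shuffle labeled records and split them into training/validation subsets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# The split is applied per language, after separating the dataset.
tweets = [{"text": f"tweet {i}", "lang": "en", "label": i % 2} for i in range(100)]
train, valid = train_validation_split(tweets)

# Translation-based augmentation with googletrans would look roughly like:
#   from googletrans import Translator
#   es_text = Translator().translate(tweet["text"], src="en", dest="es").text
```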
      <p>3. Exploratory data analysis: in this step, an exploratory analysis of the
dataset was conducted to better understand the different class distributions for
both tasks throughout the training dataset. No data imbalance problems were
observed;</p>
      <p>
        4. Model implementation and hyperparameter analysis: in this
research, we implemented the following models: (i) the BERT multilingual model,
or mBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (named M1 in this research); (ii) single-language models (one for
English and one for Spanish, named M2-English and M2-Spanish); (iii) single-language
models with translated data points (one for English and one for Spanish,
named M3-English and M3-Spanish); and (iv) ensemble models (used only for
the test subset). All the implementations were conducted with the Hugging Face
BERT implementation library (https://huggingface.co/transformers/index.html)
[33], with 10-fold cross-validation in the training stage. A thorough
hyperparameter analysis was conducted, considering the following hyperparameters and
values: output BERT type (hidden or pooler), batch size (32 and 64), learning rate
(0.00002, 0.00003, and 0.00005), and number of epochs (1 to 8). Following the
official metrics of the EXIST 2021 shared task, accuracy was used as the quality
metric for model training on task 1, and F1-macro was used on task 2. Besides
this metric, an analysis of model overfitting was conducted for each model, based
on charts that contained the models' accuracies over the different epochs;
5. Final model implementation: the final models and model ensembles
were built using the best hyperparameters identified in Step 4. They were then
trained on the whole training dataset (training plus validation subsets). Table
1 contains all the final models implemented: (i) M1: multilingual model; (ii)
separate single-language models without translation of the training datasets
(M2, composed of M2-English and M2-Spanish) and with translation of the
training datasets (M3, composed of M3-English and M3-Spanish); (iii) English
single-language model with translation only of the test subset (M4) and of the
training and test subsets (M5), both derived from the M3-English model; (iv)
Spanish single-language model with translation only of the test subset (M6)
and of the training and test subsets to Spanish (M7), both derived from the
M3-Spanish model; (v) ensembles considering only the best models: E1 (majority
vote), E2 (highest unstandardized value), and E3 (highest standardized value);
and (vi) ensembles considering all the models: E4 (majority vote), E5 (highest
unstandardized value), and E6 (highest standardized value);
6. Model comparison: the final comparison of all models was then
conducted on the test subsets. The official metrics of the EXIST shared task at
IberLEF 2021 [19] were considered the quality metrics for both the sexism
identification and classification tasks. For task 1, we evaluated the accuracy, precision,
recall, and F1-binary metrics. For task 2, we evaluated the accuracy, precision,
recall, and F1-macro metrics. Additionally, an analysis of a sample of correctly
and incorrectly classified data points in the test set was conducted, aiming to
better understand each model's main strengths, weaknesses, and opportunities
for future improvement. Lastly, the best model was chosen.
      </p>
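      <p>The hyperparameter grid searched in Step 4 can be enumerated exhaustively. A minimal sketch with the values stated above (the dictionary keys are our naming, not the authors' code):</p>

```python
from itertools import product

# Hyperparameter grid described in Step 4 of the methodology.
OUTPUT_TYPES = ["hidden", "pooler"]
BATCH_SIZES = [32, 64]
LEARNING_RATES = [0.00002, 0.00003, 0.00005]
EPOCHS = list(range(1, 9))  # 1 to 8

def hyperparameter_grid():
    """Yield every configuration evaluated with 10-fold cross-validation."""
    for ob, bs, lr, ne in product(OUTPUT_TYPES, BATCH_SIZES, LEARNING_RATES, EPOCHS):
        yield {"output": ob, "batch_size": bs, "learning_rate": lr, "epochs": ne}

configs = list(hyperparameter_grid())  # 2 * 2 * 3 * 8 = 96 configurations per model
```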
      <p>The implementation was done using Python on a Google Colaboratory Pro
(https://colab.research.google.com/) TPU, with the following technical
specifications: Intel(R) Xeon(R) CPU @ 2.30GHz, 26GB of RAM, and TPU
v2. The implemented code is available in an open GitHub repository (https://
github.com/AngelFelipeMP/BERT-tweets-sexims-classification). Section
5 presents the main results of the exploratory data analysis and the model
implementations.</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed system: components and implementation</title>
      <p>The proposed system considered two separate workflows: training and testing.
The objective of the training workflow was to fine-tune the pre-trained BERT
models. It considered three options: (i) using a multilingual BERT model
(illustrated in Figure 2), which was also considered our baseline, since it is the state
of the art for multilingual NLP classification tasks; (ii) using monolingual BERT
models without data point translation (illustrated in Figure 3); and (iii) using
monolingual BERT models with data point translation (illustrated in Figure
3).</p>
      <p>It is essential to observe that those three options considered 10-fold
cross-validation during training to identify the best hyperparameter values for each
model. The result of the first option was the M1 model. The results of the second
option were the models M2-English and M2-Spanish, which would be used as
components of the M2 model. The results of the third option were the models
M3-English and M3-Spanish, which would be used as components of the M3
model.</p>
      <p>Figure 4 illustrates the test workflow. The objective of this workflow was
to use the previously tested models as components of the final models, train
them on the whole training dataset (training plus validation subsets), and test
them on the test subset. This workflow also introduces the six ensemble models
implemented, considering different model configurations and rules for generating
the models. It is vital to observe that this system could easily be expanded to
other languages, quality metrics, and data sources.
Fig. 3. Workflow for training the monolingual models (M2-English and M2-Spanish)
and the translated-language models (M3-English and M3-Spanish), considering the
English models as an example.</p>
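      <p>The six ensembles differ only in their combination rule. The three rules can be sketched as follows; the function signatures, and in particular the z-score form of the standardization, are our reading of the text, not the authors' code:</p>

```python
from collections import Counter
from statistics import mean, stdev

def majority_vote(predictions):
    """E1/E4 rule: the class predicted by the most individual models."""
    return Counter(predictions).most_common(1)[0][0]

def highest_unstandardized(scores):
    """E2/E5 rule: class of the model with the highest raw confidence score.
    `scores` maps model name -> (predicted_class, confidence_score)."""
    return max(scores.values(), key=lambda cs: cs[1])[0]

def highest_standardized(history, current):
    """E3/E6 rule: standardize each model's score against its own past scores
    (our assumption of how standardization is applied), then pick the class
    of the model with the highest standardized value."""
    best_class, best_z = None, float("-inf")
    for model, (cls, score) in current.items():
        mu, sigma = mean(history[model]), stdev(history[model])
        z = (score - mu) / sigma if sigma else 0.0
        if z > best_z:
            best_class, best_z = cls, z
    return best_class
```

Standardization lets models with systematically higher raw confidences be compared fairly, which is consistent with E6 (standardized) outperforming E5 (unstandardized) in the results below.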
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>This section contains the main research results and is divided into three
subsections: 5.1 contains a description of the dataset used; 5.2 contains the main results
and observations related to the hyperparameters analysis; and 5.3 contains the
nal models' comparison on the test subset, considering four metrics: accuracy,
precision, recall, and F1-score (F1-binary for task 1 and F1-macro for task 2).</p>
      <sec id="sec-5-1">
        <title>Description of the EXIST 2021 shared task dataset</title>
        <p>The dataset from EXIST 2021 shared task at IberLEF 2021 [19] was used in
this work. This dataset contained labeled data from two social media platforms:
(i) Twitter, with 6,977 tweets for training and 3,386 tweets for testing (both
subsets equally distributed between English and Spanish); and (ii) Gab, with
492 gabs in English and 490 gabs in Spanish (used only for testing purposes).
It is important to note that Gab is an uncensored social media website with
considerably fewer users than Twitter.</p>
        <p>It is vital to observe that the labeling procedure adopted by the shared task
organizers considered both expert and crowdsourced labeling (following a
specific procedure developed by experts in this domain). The dataset distribution
was balanced between the training and test subsets. For a thorough description of the
dataset, we refer the readers to IberLEF 2021 [19].</p>
        <p>The five classes used in this work for the sexism classification task
(also referred to by the organizers of the dataset as sexism categorization) are
the ones provided by the EXIST challenge dataset [19]. These classes contain, as
described by [19,27]:
- Ideological and inequality: texts that affirm that the feminist movement
deserves no credit, reject the existence of inequality between genders, or
claim that men are the oppressed gender;
- Stereotyping and dominance: texts that claim that women are inappropriate
for specific tasks, suitable only for specific roles, or that men are superior to
women;
- Objectification: texts that claim that women should have certain physical
qualities or that separate women from their dignity and personal aspects;
- Sexual violence: texts that contain sexual suggestions or sexual harassment;
- Misogyny and non-sexual violence: texts that express different forms of
hatred and violence towards women.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Hyperparameters analysis</title>
        <p>Due to the considerable difference between the identification and classification
tasks, this section analyzes each separately and then concludes with a
comparison of the best hyperparameter values for all models for both tasks. The
results observed in this subsection can be used as a guide for further
implementations of BERT models for sexism identification and classification considering
multiple languages.</p>
        <p>Based on the analysis of Table 2, it is important to observe that: (i) the
M2-Eng model (monolingual without translation) presented better results for the English
language; (ii) the M2-Sp model (monolingual without translation) presented better
results for the Spanish language; (iii) the F1-binary of the Spanish language
models (M2-Sp and M3-Sp) was better than that of the English language
models (M2-Eng and M3-Eng); (iv) most of the models presented better results
using the hidden output BERT type; (v) the best configuration of every model
used a learning rate of 0.00005; (vi) most models presented better results with a
batch size of 32; and (vii) most models presented better results with 6 or more
epochs. We focused the analysis and model choice on accuracy, as it was the
official metric of the EXIST shared task.</p>
        <p>Table 3 shows the best hyperparameter values for task 2 for each of
the five models implemented in the training step of the proposed system, along
with their quality metrics on the validation subset. It is possible to observe that: (i)
as in task 1, the M1 model (multilingual) did not present better results than
the monolingual models for either language; (ii) the M3-Eng model (monolingual with
translation) presented the best results for the English language; (iii) the M3-Sp
model (monolingual with translation) presented better results for the Spanish
language; (iv) most of the models presented better results using the hidden output
BERT type; (v) as in task 1, all models presented better results with a learning
rate of 0.00005; (vi) most models presented better results with a batch size of 32;
and (vii) most models presented better results with 7 or 8 epochs.</p>
        <p>Lastly, Table 4 contains a cross-model and cross-language analysis of the
results on the validation subset. It presents the hyperparameter values of the
best models in each category (M1, M2-Eng, M2-Sp, M3-Eng, and M3-Sp), as a
percentage of the total number of models, for each task. For example, in the
first cell, it is possible to observe that, for the output BERT type on task 1, 60%
of the final models contained a hidden output BERT, while 40% used the pooler
type. Based on an analysis of this table, it is possible to conclude that: (i) for
both tasks, the hidden output BERT type provided the best results; (ii) the higher
learning rate (0.00005) presented the best results for both tasks; (iii) the best
batch size for both tasks was 32; and (iv) the best number of epochs differed
between tasks, probably due to their different nature.</p>
        <p>Lang.   Model    Best hyperp. values                     Acc.  Prec. Rec.  F1m
Multi   M1-Multi OB:pooler / Lr:0.00005 / Bs:32 / Ne:8   0.636 0.632 0.624 0.604
English M2-Eng   OB:hidden / Lr:0.00005 / Bs:32 / Ne:5   0.661 0.660 0.652 0.632
English M3-Eng   OB:hidden / Lr:0.00005 / Bs:32 / Ne:8   0.661 0.647 0.633 0.610
Spanish M2-Sp    OB:hidden / Lr:0.00005 / Bs:32 / Ne:8   0.682 0.656 0.670 0.628
Spanish M3-Sp    OB:hidden / Lr:0.00005 / Bs:64 / Ne:7   0.656 0.653 0.650 0.630
Legend: Lang.: model language; OB: output BERT type; Lr: learning rate; Bs: batch
size; Ne: number of epochs; Acc.: accuracy; Prec.: precision; Rec.: recall; F1m:
F1-macro.</p>
        <p>Regarding the final comparison on the test subset, two main observations can
be made: (i) the ensemble models obtained better results than the monolingual
models and the multilingual model; and (ii) the E6 model (the ensemble considering
all individual models and the best standardized values) obtained the best accuracy
and F1-score.</p>
        <p>For task 1, it is essential to observe in Table 5 that: (i) the baseline model
(M1) presented better accuracy than the M4, M6, and M7 models; (ii)
the E4 model presented results comparable to the E6 model in terms
of accuracy, both being considered the best models for this task; (iii) the E1
model presented the best precision; and (iv) the M2 model presented the best
recall. For task 2, it can be observed that: (i) with the exception of M4, all
models presented a better F1-macro than the M1 model, indicating that the use
of monolingual models may provide significantly better results than multilingual
models for sexism classification; and (ii) the E6 model presented the best results
for all metrics, indicating that it outperformed all other models for this task.</p>
        <p>Table 6 presents a comparison of the best individual and ensemble models
for tasks 1 and 2, considering the two official metrics of the EXIST shared
task: accuracy and F1-score. Considering the differences between the F1-scores of
each model and the best model, it is possible to conclude that: (i) the differences
are significantly higher for the sexism classification task; (ii) the baseline model
(M1) obtained the worst F1-score among those models (around 3% lower for
task 1 and 11% lower for task 2 in comparison to the E6 model); (iii) the baseline
model (M1) obtained the worst accuracy for both tasks; and (iv) although the
E4 model presented similar results for task 1, it obtained a 6.39% lower F1-score
in comparison to the E6 model for task 2. The analysis of the models' accuracies
leads to the same conclusions.
Legend: Acc.: accuracy; Prec.: precision; Rec.: recall; F1b: F1-binary; F1m:
F1-macro. M1 is the baseline (multilingual BERT). Diff E6 is the difference between
that model's metric and the same metric for the E6 model (the best overall model for
both tasks).</p>
        <p>Our approach ranked first in both the sexism identification and classification
tasks at EXIST, with the highest accuracies (0.780 for task 1 and 0.658 for task
2) and F1-scores (F1-binary of 0.780 for task 1 and F1-macro of 0.579 for task
2), considering the E6 model. We also observed that ensemble models provide
better generalization.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussions</title>
      <p>This section briefly explores several important aspects of the system
proposed in this work and its results, encompassing the following topics:
implementation aspects, system design, use of ensembles, system adaptation to other
languages, results obtained with respect to the literature, impacts of the different
system components, and the use of the proposed system in real scenarios.</p>
      <p>
        It is vital to note that the system proposed in this work can be extended
with additional components with few adaptations to the code. Some additional
interesting components to explore are lexicons (both generalist, such as Vader
[12], and domain-specific, such as Hurtlex [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), word embeddings, and
transfer learning (via training on multiple datasets). Additional models could also
be implemented to improve feature engineering (such as unsupervised learning
models) or to improve prediction quality (such as different weak models used in
an ensemble strategy).
      </p>
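      <p>To illustrate how a lexicon could be wired into the system, the sketch below combines a model's probability with a count of hits against a hurtful-word list. The word list, fusion rule, and weight are illustrative assumptions for this sketch, not the actual Hurtlex integration:</p>
      <preformat>
```python
def lexicon_hits(text, lexicon):
    # Count how many tokens of the tweet appear in the (hypothetical) lexicon.
    tokens = text.lower().split()
    return sum(tok in lexicon for tok in tokens)

def combined_score(bert_prob, text, lexicon, weight=0.1):
    # Illustrative late fusion: nudge the model probability upward by
    # lexicon evidence, capped at 1.0. The weight is an assumption and
    # would need to be tuned on validation data.
    return min(1.0, bert_prob + weight * lexicon_hits(text, lexicon))
```
      </preformat>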
      <p>
        Although ensemble models are relatively common in other domains, such
as price prediction [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and sentiment analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], they are not yet widespread
in sexism identification and classification, as this is a new task. In general, if
the weak models can capture different aspects of the task, an
ensemble strategy can improve the final prediction results [
        <xref ref-type="bibr" rid="ref2 ref8">2,8</xref>
        ]. In this work,
we evaluated several simple average ensemble strategies. However, an in-depth
analysis of more complex ensemble strategies with the proposed system could be
conducted in future work. As observed in this work, the use of ensembles
can significantly improve the results obtained by the individual models.
      </p>
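      <p>The simple average strategy mentioned above can be sketched as follows, assuming each weak model outputs a probability vector over the same classes:</p>
      <preformat>
```python
def average_ensemble(model_probs):
    # model_probs: one probability vector per model, all over the same classes.
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    # Average each class's probability across models.
    avg = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    # Predict the class with the highest averaged probability.
    return max(range(n_classes), key=avg.__getitem__), avg
```
      </preformat>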
      <p>Another important aspect is adapting the proposed system to
other languages. Concerning this aspect, it is vital to separate languages
into two main groups: (i) languages with individual BERT models already
implemented; and (ii) languages that currently have no widely accepted
individual BERT model. For the first group, the system allows easy
implementation with minimal coding: the only components needed are
the individual BERT model pre-trained on that language and the task-specific
dataset for fine-tuning and testing.</p>
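      <p>With the Hugging Face Transformers library [33], this swap amounts to changing a checkpoint name. A minimal sketch, assuming the transformers library is installed at call time (the Spanish checkpoint name shown follows the BETO release [4]; any Hub-hosted monolingual model would work the same way):</p>
      <preformat>
```python
def build_classifier(checkpoint, num_labels):
    """Sketch: wrap a pretrained monolingual BERT for task-specific fine-tuning.

    `checkpoint` is the Hub name of the target-language model, e.g.
    "dccuchile/bert-base-spanish-wwm-uncased" for Spanish. The import is
    deferred so the sketch stands alone until transformers is available.
    """
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels)
    return tokenizer, model
```
      </preformat>
      <p>Fine-tuning then proceeds exactly as for the English and Spanish models in this work, with only the task-specific dataset changing.</p>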
      <p>For the second group, it is necessary to train a language-specific BERT model
before using the proposed system. This task demands a considerable amount
of computational power and resources, requiring processing clusters and large
text corpora in the target language (such as the Wikipedia text database).
However, the proposed system can be adapted to use different
models that are easier to implement and demand less data, such as recurrent
neural networks or convolutional neural networks with language-specific word
embeddings or lexicons. The ensemble component can then be used,
considering multilingual BERT (if it encompasses the target language) and the
implemented models.</p>
      <p>Lastly, the proposed system can be implemented and used in real-case
scenarios to improve sexism identification on social media platforms. After the
hyperparameters and final models are chosen, as described and explored in this
work, the prediction process is considerably fast. The system has the potential
to be implemented as a separate service on the social media platform, analyzing
the content published by its users and pointing out sexist messages and their
respective classes (considering the five classes studied in this work).</p>
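      <p>A minimal sketch of such a platform-side service wrapper, where the classify callable is a stand-in for the trained ensemble (function and label names are ours, for illustration):</p>
      <preformat>
```python
def moderate(posts, classify, labels=("non-sexist", "sexist")):
    # Flag each post; `classify` stands in for the trained ensemble and
    # must return a class index for a given text. For task 2 the `labels`
    # tuple would instead hold the five sexism classes plus "non-sexist".
    flagged = []
    for post_id, text in posts:
        label = labels[classify(text)]
        if label != "non-sexist":
            flagged.append((post_id, label))
    return flagged
```
      </preformat>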
    </sec>
    <sec id="sec-7">
      <title>Conclusion and future work</title>
      <p>As explored throughout this work, a widespread problem on social networks
and microblogs is the misuse of these tools to spread toxic language and sexist
content. Identifying and classifying sexism in these media is considerably
challenging, especially in a scenario with multiple languages. This paper explored
the fine-tuning of multilingual and monolingual BERT models for English and
Spanish and the use of different ensemble configurations to identify and classify
sexism in tweets and gabs. The dataset used was provided by the EXIST shared
task, which contained two tasks: sexism identification and sexism classification.</p>
      <p>The system proposed in this research combined the fine-tuning of
pretrained BERT models, the translation of the training dataset (to increase
the number of data points available for learning), and ensemble models
with different characteristics. Our central hypothesis was that this system
would provide better results than the traditional use of the multilingual BERT
model. Our results show that the use of ensembles provided better results
for both tasks, primarily the ensemble that considered all trained models and
took the highest standardized label values as the final predictions. This model
obtained significantly better results than the baseline multilingual BERT model,
with an F1-score around 3% higher for the sexism identification task and 11%
higher for the sexism classification task. These results and models, together with
the in-depth hyperparameters analysis conducted, can be used as a guide for
future research on both tasks.</p>
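      <p>One plausible reading of this standardized-value rule is sketched below: z-score each model's class scores so they are comparable, then predict the class holding the single highest standardized value across all models. The exact tie-breaking and score type (logits vs. probabilities) are assumptions of the sketch:</p>
      <preformat>
```python
from statistics import mean, pstdev

def standardize(scores):
    # z-score one model's class scores so models are on a common scale.
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s if s else 0.0 for x in scores]

def max_standardized_ensemble(model_scores):
    # Standardize each model's scores, then predict the class that holds
    # the single highest standardized value among all models and classes.
    best_class, best_val = None, float("-inf")
    for scores in model_scores:
        for c, z in enumerate(standardize(scores)):
            if z > best_val:
                best_class, best_val = c, z
    return best_class
```
      </preformat>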
      <p>Future work includes: (i) conducting an analysis considering additional
datasets; (ii) implementing additional models; (iii) implementing different
ensemble configurations; (iv) implementing unsupervised models for feature
engineering; (v) analyzing the impact on the models' results of using lexicons
(both general and domain-specific) as features; (vi) analyzing the impact on
the models' results of using word embeddings as features; and (vii) implementing
and evaluating the use of deep reinforcement learning to improve the models'
results, especially on the sexism classification problem.
11. Frenda, S., Ghanem, B., Montes-y-Gómez, M., Rosso, P.: Online hate speech
against women: Automatic identification of misogyny and sexism on Twitter.
Journal of Intelligent &amp; Fuzzy Systems 36(5), 4743–4752 (2019)
12. Hutto, C., Gilbert, E.: VADER: A parsimonious rule-based model for sentiment
analysis of social media text. In: Proceedings of the International AAAI Conference on
Web and Social Media. vol. 8 (2014)
13. Istaiteh, O., Al-Omoush, R., Tedmori, S.: Racist and sexist hate speech detection:
Literature review. In: 2020 International Conference on Intelligent Data Science
Technologies and Applications (IDSTA). pp. 95–99. IEEE (2020)
14. Jang, J.W., Park, Y.G., Hur, S.I., An, Y.J.: Study on the impact of activity-based
flexible office characteristics on the employees' innovative behavioral intention. In:
International Conference on Software Engineering, Artificial Intelligence,
Networking and Parallel/Distributed Computing. pp. 87–103. Springer (2021)
15. Johnman, M., Vanstone, B.J., Gepp, A.: Predicting FTSE 100 returns and
volatility using sentiment analysis. Accounting &amp; Finance 58, 253–274 (2018).
https://doi.org/10.1111/acfi.12373
16. Koufakou, A., Pamungkas, E.W., Basile, V., Patti, V.: HurtBERT: Incorporating
lexical features with BERT for the detection of abusive language. In: Proceedings of
the Fourth Workshop on Online Abuse and Harms. pp. 34–43 (2020)
17. Lopes, A.R.: The impact of social media on social movements: The new opportunity
and mobilizing structure. Journal of Political Science Research 4(1), 1–23 (2014)
18. Lynn, T., Endo, P.T., Rosati, P., Silva, I., Santos, G.L., Ging, D.: A comparison of
machine learning approaches for detecting misogynistic speech in Urban Dictionary.
In: 2019 International Conference on Cyber Situational Awareness, Data Analytics
And Assessment (Cyber SA). pp. 1–8. IEEE (2019)
19. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez Carmona,
M., Mellado, E., Carrillo-de-Albornoz, J., Chiruzzo, L., Freitas, L., Adorno, H.G.,
Gutiérrez, Y., Zafra, S.M.J., Lima, S., Plaza-de-Arco, F.M., Taulé, M.:
Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). CEUR
Workshop Proceedings (2021)
20. Mozafari, M., Farahbakhsh, R., Crespi, N.: A BERT-based transfer learning approach
for hate speech detection in online social media. In: International Conference on
Complex Networks and Their Applications. pp. 928–940. Springer (2019)
21. Mozafari, M., Farahbakhsh, R., Crespi, N.: Hate speech detection and racial bias
mitigation in social media based on BERT model. PLoS ONE 15(8), e0237861 (2020)
22. Nassirtoussi, A.K., Aghabozorgi, S., Wah, T.Y., Ngo, D.C.L.: Text mining for
market prediction: a systematic review. Expert Systems with Applications 41(16),
7653–7670 (2014). https://doi.org/10.1016/j.eswa.2014.06.009
23. Pamungkas, E.W., Basile, V., Patti, V.: Misogyny detection in Twitter: a
multilingual and cross-domain study. Information Processing &amp; Management 57(6),
102360 (2020)
24. Pamungkas, E.W., Cignarella, A.T., Basile, V., Patti, V., et al.: Automatic
identification of misogyny in English and Italian tweets at EVALITA 2018 with a
multilingual hate lexicon. In: Sixth Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian (EVALITA 2018). vol. 2263, pp. 1–6.
CEUR-WS (2018)
25. Pavlopoulos, J., Thain, N., Dixon, L., Androutsopoulos, I.: ConvAI at SemEval-2019
task 6: Offensive language identification and categorization with perspective and
BERT. In: Proceedings of the 13th International Workshop on Semantic Evaluation.
pp. 571–576 (2019)
26. Poletto, F., Basile, V., Sanguinetti, M., Bosco, C., Patti, V.: Resources and
benchmark corpora for hate speech detection: a systematic review. Language Resources
and Evaluation pp. 1–47 (2020)
27. Rodríguez-Sánchez, F., Carrillo-de-Albornoz, J., Plaza, L.: Automatic classification
of sexism in social networks: An empirical study on Twitter data. IEEE Access 8,
219563–219576 (2020)
28. Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: What we know
about how BERT works. Transactions of the Association for Computational
Linguistics 8, 842–866 (2020)
29. Sharifirad, S., Jacovi, A., Matwin, S.: Learning and understanding different
categories of sexism using convolutional neural network's filters. In:
Proceedings of the 2019 Workshop on Widening NLP. pp. 21–23 (2019)
30. Sohangir, S., Wang, D., Pomeranets, A., Khoshgoftaar, T.M.: Big Data: Deep
learning for financial sentiment analysis. Journal of Big Data 5(3), 1–25 (2018).
https://doi.org/10.1186/s40537-017-0111-6
31. Sohn, H., Lee, H.: MC-BERT4HATE: Hate speech detection using multi-channel BERT
for different languages and translations. In: 2019 International Conference on Data
Mining Workshops (ICDMW). pp. 551–559. IEEE (2019)
32. Wani, M.A., Agarwal, N., Bours, P.: Impact of unreliable content on social media
users during COVID-19 and stance detection system. Electronics 10(1), 5 (2021)
33. Wolf, T., Chaumond, J., Debut, L., Sanh, V., Delangue, C., Moi, A., Cistac, P.,
Funtowicz, M., Davison, J., Shleifer, S., et al.: Transformers: State-of-the-art
natural language processing. In: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System Demonstrations. pp. 38–45 (2020)
34. Wu, S., Dredze, M.: Beto, Bentz, Becas: The surprising cross-lingual effectiveness
of BERT. arXiv preprint arXiv:1904.09077 (2019)
35. Yin, W., Zubiaga, A.: Towards generalisable hate speech detection: a review on
obstacles and solutions. arXiv preprint arXiv:2102.08886 (2021)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Acheampong</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunoo-Mensah</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Transformer models for text-based emotion detection: a review of BERT-based approaches</article-title>
          .
          <source>Artificial Intelligence Review</source>
          pp.
          <volume>1</volume>
          {
          <issue>41</issue>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ballings</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poel</surname>
            ,
            <given-names>D.V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hespeels</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gryp</surname>
          </string-name>
          , R.:
          <article-title>Evaluating multiple classifiers for stock price direction prediction</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>42</volume>
          (
          <issue>20</issue>
          ),
          <volume>7046</volume>
          {
          <fpage>7056</fpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.eswa.2015.05.013
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bassignana</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Hurtlex: A multilingual lexicon of words to hurt</article-title>
          .
          <source>In: 5th Italian Conference on Computational Linguistics</source>
          , CLiC-it
          <year>2018</year>
          . vol.
          <volume>2253</volume>
          , pp.
          <volume>1</volume>
          {
          <issue>6</issue>
          .
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Cañete, J.,
          <string-name>
            <surname>Chaperon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuentes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
          </string-name>
          , J.:
          <article-title>Spanish pretrained bert model and evaluation data</article-title>
          .
          <source>In: PML4DC at ICLR</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chetty</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alathur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Hate speech review in the context of online social networks</article-title>
          .
          <source>Aggression and violent behavior 40</source>
          ,
          <volume>108</volume>
          {
          <fpage>118</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chiril</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moriceau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benamara</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Origgi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coulomb-Gully</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>An annotated corpus for sexism detection in french tweets</article-title>
          .
          <source>In: Proceedings of The 12th Language Resources and Evaluation Conference</source>
          . pp.
          <volume>1397</volume>
          {
          <issue>1403</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Da</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.F.</given-names>
            ,
            <surname>Hruschka</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.R.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hruschka</given-names>
            <surname>Jr</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.R.</surname>
          </string-name>
          :
          <article-title>Tweet sentiment analysis with classifier ensembles</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>66</volume>
          ,
          <fpage>170</fpage>
          {
          <fpage>179</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>N.E.</given-names>
          </string-name>
          , y Piontti,
          <string-name>
            <given-names>A.P.</given-names>
            ,
            <surname>Madewell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.J.</given-names>
            ,
            <surname>Cummings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.A.T.</given-names>
            ,
            <surname>Hitchings</surname>
          </string-name>
          , M.D.T.,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kahn</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vespignani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halloran</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Longini</surname>
            <given-names>Jr</given-names>
          </string-name>
          ,
          <string-name>
            <surname>I.M.:</surname>
          </string-name>
          <article-title>Ensemble forecast modeling for the design of COVID-19 vaccine efficacy trials</article-title>
          .
          <source>Vaccine</source>
          <volume>38</volume>
          (
          <issue>46</issue>
          ),
          <volume>7213</volume>
          {
          <fpage>7216</fpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Founta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Djouvas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chatzakou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leontiadis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blackburn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stringhini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vakali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sirivianos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kourtellis</surname>
          </string-name>
          , N.:
          <article-title>Large scale crowdsourcing and characterization of twitter abusive behavior</article-title>
          .
          <source>In: Proceedings of the International AAAI Conference on Web and Social Media</source>
          . vol.
          <volume>12</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>