<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>When Multiple Perspectives and an Optimization Process Lead to Better Performance, an Automatic Sexism Identification on Social Media With Pretrained Transformers in a Soft Label Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johan Erbani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Előd Egyed-Zsigmond</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diana Nurbakova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre-Edouard Portier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ Lyon</institution>
          ,
          <addr-line>INSA Lyon, CNRS, UCBL, LIRIS, UMR5205, 20 Avenue Einstein, 69621 Villeurbanne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>21</lpage>
      <abstract>
<p>Even though sexism is widely disapproved of socially today, it remains an omnipresent phenomenon in our society. Faced with huge quantities of data, however, social platforms struggle to identify it. This highlights the need to develop automatic detection tools that can subtly assess the sexist nature of user-generated content. That is what sEXism Identification in Social neTworks (EXIST) is all about. The EXIST 2023 contest consists of three classification tasks: 1. detect sexism, 2. clarify the author’s intention and 3. identify the type of sexism. Thanks to these three tasks, each data point can be seen from three different points of view. This idea, combined with fine-tuned BERTs, model stacking and an optimization process, enabled us to rank 1st in task 2 and 4th in task 3 in a soft label context. This paper describes our approach, our negative results and some possible perspectives.</p>
      </abstract>
      <kwd-group>
<kwd>Sexism Identification</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformer Models</kwd>
        <kwd>BERT</kwd>
        <kwd>Ensemble modeling</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Sexism Detection</kwd>
        <kwd>Twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Identifying sexism automatically remains an open problem in Natural Language Processing
(NLP). To address this issue, a series of scientific events called EXIST has been established with
the objective of comprehending sexism in its widest scope. This includes not only explicit
sexism but also more subtle forms of implicit sexist behavior. These scientific initiatives have
the potential to raise awareness about women’s rights issues and promote social cohesion. This
paper describes the DRIM team’s contribution to the three EXIST 2023 shared tasks.</p>
<p>This work is structured as follows: Section 2 briefly describes several earlier
studies. Section 3 then explains tasks 1, 2 and 3, along with the corpus
provided by the organizers. Following that, Sections 4 and 5 outline the experimental
methodology and the evaluation results, respectively. Finally, in Section 6, we present the key
findings and conclusions of our studies, as well as some potential directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works and Contributions</title>
      <p>
        According to previous EXIST reports, transformer-based models performed better than
other techniques [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Given this and the limited availability of labeled data, we employed a
fine-tuning approach using models that were initially pre-trained in a self-supervised manner
on extensive amounts of unlabeled data. More specifically, we chose to work with
BERT, which is commonly used in state-of-the-art approaches across various NLP problems.
      </p>
      <p>
        As explained in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], previous studies [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">4, 5, 3</xref>
        ] have highlighted that performance depends heavily on the seed value
when fine-tuning BERT for downstream tasks, particularly when
the training data is limited. One way of reducing this undesirable effect is to use ensembles
of models, as in [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ]. Ensemble modeling is an approach that combines
multiple models. It is based on the premise that individual models may possess distinct
strengths and limitations, and that merging them can improve overall performance.
The key benefit of employing model ensembles lies in their capacity to diminish
variance and enhance predictive accuracy.
      </p>
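<p>To illustrate the variance-reduction argument, here is a toy sketch with synthetic softmax outputs (not the contest models): several seed-dependent classifiers score the same tweet, and their predicted distributions are averaged.</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in: softmax outputs of 10 classifiers fine-tuned with different
# seeds, all scoring the same tweet; each row is a probability distribution.
seed_predictions = rng.dirichlet([7, 3], size=10)

# The ensemble prediction simply averages the individual distributions,
# which dampens the seed-induced variance of any single model.
ensemble = seed_predictions.mean(axis=0)
```

Averaging probability distributions yields another valid distribution, so no renormalization is needed.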
      <p>
        The paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used a Multi-Task Learning approach to solve a previous edition of the EXIST challenge.
Multi-Task Learning is a machine learning method where a model is trained to perform
several different tasks simultaneously. Instead of training separate models for each task,
multi-task learning aims to share the knowledge and representations learned between the different
tasks, which can potentially improve the overall performance of the model.
      </p>
      <p>Our contributions relate to the development of a strategy that combines elements of both
Ensemble Modeling and Multi-Task Learning. Our model is built upon three stacked
BERTs. Unlike previous approaches, our proposal focuses on observing the same object from
multiple perspectives. The underlying idea is to provide the model with several views of its input
to enhance its comprehension and interpretation abilities. Experimental results show that
our strategy can outperform single-view models. Furthermore, we propose incorporating a
prediction refinement mechanism on top of our models through an optimization process. This
refinement process does not alter the model’s weights, but it enabled us to outperform other
models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets and Tasks</title>
      <p>
        In 2023, the EXIST event involved the categorization of several thousand tweets written in
English and Spanish. This categorization process encompassed three distinct tasks: detect
sexism, clarify the author’s intention and identify the type of sexism. For each task,
the Non-sexist class accounts for roughly half of the total annotator votes, and the other classes
are evenly distributed among the remaining votes. The data, the tasks and the classes are
summarized in Tables 1, 2 and 3, respectively (see [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] for more details). Each tweet is
annotated by 6 annotators of different ages and genders, with the aim of 1. obtaining less biased
labels and 2. learning to recognize the different categories more subtly. Indeed, the assumption
that natural language expressions have a single and clearly identifiable interpretation in a given
context is a convenient idealization, but it is far from reality, as explained in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To cope with
this, EXIST 2023 proposes to learn directly from the different annotators’ votes.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In this section, we describe our approach and the experimental framework employed. We
present here only the attempts that we consider scientifically interesting, the successful ones as
well as the unsuccessful ones, in order to maximize the usefulness of the paper for the reader.</p>
      <p>We began by separating the dataset according to the language of the tweets. All the
exploratory work was carried out on the English data. Once the best method had been determined,
we replicated it on the Spanish data with multilingual BERT. Since the procedures
are identical, in the following we only describe the processing of the English data.</p>
      <sec id="sec-4-1">
        <title>4.1. Baseline</title>
        <p>In order to compare our different attempts, we began by implementing a BERT base uncased
baseline using a traditional approach. The architecture used is illustrated in Figure 1. The
process involves taking the BERT classifier token [CLS], applying dropout, passing it through a
dense layer, and finally using a softmax activation function. While using softmax for tasks 1 and
2 is appropriate because they are single-label tasks, it may not be suitable for task 3, which is a
multi-label task. However, in the specific context of the competition, as explained in section 4.6,
this issue is not problematic. Although the influence of the dropout hyperparameter is minor,
we empirically determined that, among the values {0.1, 0.3, 0.5}, the optimal rates were 0.5, 0.1
and 0.5 for tasks 1, 2 and 3 respectively. In accordance with the original BERT paper [<xref ref-type="bibr" rid="ref12">12</xref>] and
with our experiments, we chose a batch size of 32. In order to increase the generalization
capacity of the model, we preferred to optimize with AdamW rather than Adam, as proposed in
[<xref ref-type="bibr" rid="ref13">13</xref>], with a learning rate of 1e-5. We used the cross-entropy loss as the cost function. Training of
the model was stopped at the peak of the ICM on the development set, which turned out to be
4, 7 and 9 epochs for tasks 1, 2 and 3 respectively.</p>
        <p>The classes of each task and their approximate shares of the annotator votes are as follows.
Task 1: Not sexist (55%); Sexist or about sexism, e.g. if the author denounces a sexist act or fact (45%).
Task 2: Non-sexist (55%); Direct, i.e. sexist by itself (22%); Reported, i.e. reporting a sexist situation (11%);
Judgemental, i.e. decrying a social injustice against women (12%).
Task 3: Non-sexist (47%); Ideological and inequality, which discredits the feminist movement and rejects
equality between men and women (12%); Stereotyping and dominance, which promotes gender stereotypes,
the superiority of men, and limits women’s abilities (14%); Objectification, where women are presented as
objects apart from their dignity and personal aspects (11%); Sexual violence, i.e. sexual suggestions, requests
for sexual favors or harassment of a sexual nature (7%); and Misogyny and non-sexual violence, which
expresses hatred and violence towards women (9%).</p>
        <p>For more stability in the initial phase of training, we applied a linear learning rate warm-up
during the first update steps, followed by a linear decay. However, we did not observe
any positive effect. We also tried to apply the layer-wise learning rate decay technique. As
explained in [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>], lower layers encode more general information and top layers are more
specific to the training task. Consequently, the higher a layer is, the larger its learning rate
should be; conversely, the lower it is, the smaller its learning rate should be. But again, even though
learning was faster during the first epochs, in the end we saw no improvement. Consequently,
we did not retain these two strategies. However, a more detailed study of the hyperparameters
might have led to further gains.</p>
        <p>In the following, we will refer to ℬ1, ℬ2 and ℬ3 as the three BERTs fine-tuned according to
the protocol described above on tasks 1, 2 and 3 respectively.</p>
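<p>The baseline head (dropout on the [CLS] vector, a dense layer, then softmax) can be sketched as follows; the random weights and the hidden size of 768 are illustrative stand-ins for the fine-tuned BERT base, not the trained parameters.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical head parameters: hidden size 768 -> 2 classes (task 1).
W = rng.normal(scale=0.02, size=(768, 2))
b = np.zeros(2)

def classify(cls_vectors, drop_rate=0.5, training=False):
    """Baseline head: dropout on the [CLS] vector, dense layer, softmax."""
    if training:  # inverted dropout, active only during training
        mask = rng.random(cls_vectors.shape) >= drop_rate
        cls_vectors = cls_vectors * mask / (1.0 - drop_rate)
    return softmax(cls_vectors @ W + b)

probs = classify(rng.normal(size=(4, 768)))  # a batch of 4 pooled [CLS] vectors
```

At inference time dropout is disabled, so the head reduces to a dense layer followed by softmax.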
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Preprocessing</title>
        <p>We tried the different preprocessing steps listed in Table 4. When a step introduces new tokens, we
update the BERT tokenizer to include them, enabling the model to learn from them. In
our study, we found that preprocessing had a minor influence on model performance. The best
results were obtained using the first two approaches. Note that providing the meaning of hashtags
and emojis as we did does not seem to provide more information on average.</p>
        <p>Examples of preprocessed outputs from Table 4 include "&lt; mention &gt; &lt; hashtag &gt; see u &lt; url &gt;",
"I hate you and fuck you !!" (from "I hate U &amp; f*ck you !!"), "&lt; hashtag &gt; me too" (from "#Metoo"),
and "I love U &lt; emoji &gt; happy" (from "I love U :)").</p>
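<p>A minimal sketch of the token-replacement step, assuming simple regular expressions (the paper does not give its exact implementation, and hashtag word-splitting and emoji description are omitted here; the input tweet is invented for illustration):</p>

```python
import re

def preprocess(tweet):
    """Replace links, mentions and hashtags with special tokens."""
    tweet = re.sub(r"https?://\S+|www\.\S+", " <url> ", tweet)
    tweet = re.sub(r"@\w+", " <mention> ", tweet)
    tweet = re.sub(r"#(\w+)", r" <hashtag> \1 ", tweet)
    return re.sub(r"\s+", " ", tweet).strip()

cleaned = preprocess("@user #Metoo see u http://t.co/x")
# cleaned == "<mention> <hashtag> Metoo see u <url>"
```

The special tokens produced here would then be added to the BERT tokenizer vocabulary, as described above.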
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Different data perspectives</title>
        <p>An intuitive idea, illustrated in Figure 2, is that multiple perspectives on an object can lead to
a better understanding of it. We combined the models ℬ1, ℬ2, and ℬ3 introduced in section 4.1
to create a meta-model. ℬ1, ℬ2, and ℬ3 were trained on the same data but on different tasks (1, 2
and 3 respectively). After training the baseline models, we froze them to prevent further training
and then stacked them together. On top of the stacked models, we added some additional layers.
The resulting architecture is illustrated in Figure 3. This strategy worked well for tasks 1 and
2, but it did not work for task 3. Additionally, the decision of whether to keep softmax on
top of the frozen networks depended on the task: it was better to keep softmax for task 1 and
better to exclude it for task 2. In the following, we will use ℳ1 and ℳ2 to designate the
two meta-models described above.</p>
        <p>Based on our submission, it is difficult to pinpoint the exact reasons why the strategy worked
for some tasks but not for others. However, we assume that task 3 is the most difficult one
and requires subtleties that are not adequately captured by the baseline models ℬ1 and ℬ2.
Supporting evidence is that this strategy was most effective on task 1, which
is considered the least complex among the tasks. Consequently, ℬ2 and ℬ3 could presumably
transfer their captured nuances to ℬ1.</p>
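<p>The stacking step can be sketched as follows; the sub-models here are toy linear stand-ins for the frozen ℬ1, ℬ2 and ℬ3, and the feature and head sizes are illustrative assumptions rather than the architecture of Figure 3.</p>

```python
import torch
import torch.nn as nn

class MetaModel(nn.Module):
    """Sketch of the stacking idea: frozen task-specific sub-models each map
    a tweet representation to a feature vector; their outputs are concatenated
    and fed to a small trainable head."""

    def __init__(self, sub_models, feat_dim, num_classes):
        super().__init__()
        self.sub_models = nn.ModuleList(sub_models)
        for p in self.sub_models.parameters():
            p.requires_grad = False  # freeze B1, B2, B3: only the head trains
        self.head = nn.Sequential(
            nn.Linear(feat_dim * len(sub_models), 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        feats = torch.cat([m(x) for m in self.sub_models], dim=-1)
        return torch.softmax(self.head(feats), dim=-1)

# toy stand-ins for the fine-tuned B1, B2, B3
b1, b2, b3 = (nn.Linear(16, 8) for _ in range(3))
meta = MetaModel([b1, b2, b3], feat_dim=8, num_classes=4)
probs = meta(torch.randn(2, 16))
```

Because the sub-model parameters have `requires_grad = False`, an optimizer built over the trainable parameters updates only the head.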
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Manual features</title>
        <p>In our effort to enhance the performance of our models, we sought out manual features that
could uncover valuable information potentially overlooked by our existing models. These
manual features were normalized before being stacked with the frozen networks,
as shown in Figure 4. Here is a list of the specific additional features we incorporated:
• Tokens. The number of &lt; hashtag &gt;, &lt; mention &gt; and &lt; url &gt; tokens;
• Text statistics. The number of characters, upper-case characters, words, sentences, digits,
citations, question marks and exclamation marks;
• Sentiment analysis. With the Python library TextBlob, we provide the subjectivity and
polarity scores;
• Tenses and modals. The number of verbs conjugated in the future, present or past
tense and the number of modal verbs;
• Complexity. The average word size, the average sentence size and a readability
index (Flesch Reading Ease);
• Latent Dirichlet Allocation (LDA). We group the test and train sets into one big corpus to
identify the main topics through the LDA process.</p>
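<p>Some of the text-statistics features above can be sketched with plain Python (a hypothetical helper; the paper does not give its exact feature-extraction code):</p>

```python
def text_statistics(tweet):
    """Count simple surface features of a tweet."""
    return {
        "chars": len(tweet),
        "upper_chars": sum(c.isupper() for c in tweet),
        "words": len(tweet.split()),
        "digits": sum(c.isdigit() for c in tweet),
        "question_marks": tweet.count("?"),
        "exclamation_marks": tweet.count("!"),
    }

feats = text_statistics("WHY do you say that?!!")
```

Before being stacked with the frozen networks, such features are normalized, e.g. to zero mean and unit variance over the training set.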
        <p>Regrettably, utilizing our current methodology, none of the manually engineered features
demonstrated a discernible enhancement in the model’s performance across various tasks. It is
postulated that these features encompass information that has already been assimilated by our
architecture.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Data augmentation</title>
        <p>Three distinct approaches were implemented and evaluated. Unfortunately, none of them
yielded desirable outcomes.</p>
        <p>The first approach involved translating the Spanish data and adding it to the model’s training
set. Counterintuitively, this augmentation did not lead to performance improvements; instead, it
resulted in decreased performance.</p>
        <p>The second attempt aimed to test the assumption that the translated data was noisy. To
counterbalance this noise, we tried to select only tweets with significant informational
content. The methodology entailed training the model on the English data, then testing
it on the translated data and identifying the instances where the model made significant
errors. These error-prone tweets were presumed to carry important information for the model’s
learning. Subsequently, the model’s weights were reset, and the training was repeated on
the base data alongside the translated data with the "most significant" information content.
Unfortunately, this strategy failed to improve the model’s performance.</p>
        <p>Lastly, a third approach applied a strategy similar to the one described in section
4.3, but with other datasets. Two distinct datasets were utilized: the Spanish data translated
from EXIST 2023 and the EXIST 2022 dataset. The methodology involved stacking ℬ1, ℬ2, and
ℬ3 with additional frozen baseline models trained on specific tasks, such as task 1 of 2022 or
the translated Spanish data of task 3. The underlying principle was to harness the collective
insights of diverse people, represented by different models trained on distinct data types, in
order to synthesize their outputs and generate an improved solution. Regrettably, this method
did not produce convincing results.</p>
        <p>These unsuccessful results led us to two hypotheses. Firstly, we posited that tweets
exhibit significant dissimilarities depending on their cultural or temporal origins. Secondly, we
hypothesized that the cultural disparity between Spanish and English annotators could lead to
different evaluations.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Best possible distribution</title>
        <p>The dataset construction results in non-continuous labels, despite their apparent continuity.
This is due to the presence of six annotators for each task, which limits the set of possible label
distributions to a finite number. Specifically, all label values are multiples of 1/6. For instance,
in task 1, there exist only seven feasible outputs, denoted as y = (y1, y2), where y1 = k/6 for
k = 0, 1, ..., 6, and y2 = 1 − y1. Similarly, task 2 encompasses 84 possibilities, while task 3
encompasses a substantial number of 28,546 possibilities due to its multi-label nature.</p>
        <p>Exploiting this knowledge, after training the model it becomes feasible to select, from the
set of possible distributions, the closest one for each prediction. In particular, the multi-label
task 3 could be solved with a softmax on top of our model. The advantage is that the model
benefits from the gradient backpropagation offered by the softmax function during its learning,
and it obtains a good approximation of a label that does not sum to 1 thanks to this optimization
trick. It is important to note that the process described is an optimization problem and not a
deep-learning problem. Consequently, this optimization procedure does not impact the model
during training. This simple trick considerably increased the model’s performance; we refer to
it as the optimization procedure in the following.</p>
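<p>For the single-label tasks, the snapping step can be sketched as follows; the Euclidean distance used to pick the "closest" feasible distribution is an illustrative assumption, as the description above leaves the distance unspecified.</p>

```python
import itertools
import numpy as np

def feasible_distributions(num_classes, annotators=6):
    """All label distributions reachable when each of the 6 annotators casts
    one vote: count vectors summing to 6, divided by 6 (single-label case)."""
    counts = itertools.product(range(annotators + 1), repeat=num_classes)
    return np.array([np.array(c) / annotators
                     for c in counts if sum(c) == annotators])

def snap(prediction, dists):
    """Replace a model output by the closest feasible distribution."""
    return dists[np.argmin(((dists - prediction) ** 2).sum(axis=1))]

task1 = feasible_distributions(2)  # 7 feasible outputs
task2 = feasible_distributions(4)  # 84 feasible outputs
snapped = snap(np.array([0.6, 0.4]), task1)  # -> [4/6, 2/6]
```

The multi-label task 3, where an annotator may select several categories at once, yields the 28,546 possibilities mentioned above and requires a different enumeration.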
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we provide a concise overview of the performance attained by our model
configurations in the EXIST 2023 evaluation. Our rankings are indicated in Table 5. It is
noteworthy that our current performance has improved slightly over that obtained
during the competition. This is due to the extra time we had to refine the hyperparameters of
our models.</p>
      <p>We can see in Table 6 that the optimization process significantly improves model
performance. Our meta-model strategy was also effective, but to a lesser extent.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The submission made to the EXIST 2023 evaluation has yielded a considerable number of
negative or neutral results. These outcomes are valuable as they indicate that specific approaches
or hypotheses within our particular context did not produce favorable results. However, we
have put forth two efective strategies.</p>
      <p>Firstly, we proposed an innovative approach that combines ensemble modeling and
multi-task learning. This approach entails training the same architecture on multiple tasks
using the same data and subsequently stacking the frozen sub-models in a meta-model. Our
findings demonstrate that this architecture has the ability to surpass the limitations of single-view
models, leading to improved performance.</p>
      <p>Furthermore, we suggested incorporating an optimization process into our model. Our
ranking highlights that, in a competition, it is more important to be close to the labels than to the
ground truth. This optimization process plays a key role in refining the model’s predictions and
enhancing its overall performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Johan Erbani would like to sincerely thank his superiors and the entire DRIM team for the trust
they have placed in him. Special thanks are due to Pierre-Yves Genest and Ousmane Touat for
their precious advice and expertise, and to the interns Maud Andruszak and Thanh Lam, who
contributed through their diligent research work on manual features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , L. Plaza,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Comet</surname>
          </string-name>
          , T. Donoso, Overview of exist 2021:
          <article-title>sexism identification in social networks</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>195</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodríguez-Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mendieta-Aragón</surname>
          </string-name>
          , G. MarcoRemón, M. Makeienko,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Overview of exist 2022:
          <article-title>sexism identification in social networks</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>69</volume>
          (
          <year>2022</year>
          )
          <fpage>229</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Villa-Cueva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sanchez-Vega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>López-Monroy</surname>
          </string-name>
          ,
          <article-title>Bi-ensembles of transformer for online bilingual sexism detection</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katiyar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Revisiting few-sample BERT fine-tuning</article-title>
          , arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>05987</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          , G. Ilharco,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping</article-title>
          , arXiv preprint arXiv:
          <year>2002</year>
          .
          <volume>06305</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>A. F. M. de Paula</surname>
          </string-name>
          , R. F. da
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>I. B.</given-names>
          </string-name>
          <string-name>
            <surname>Schlicht</surname>
          </string-name>
          ,
          <article-title>Sexism prediction in spanish and english tweets using monolingual and multilingual bert and ensemble models</article-title>
          ,
          <source>arXiv preprint arXiv:2111.04551</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Plaza-del Arco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Molina-González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          ,
          <article-title>Sexism identification in social networks using a multi-task learning system</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), XXXVII International Conference of the Spanish Society for Natural Language Processing</source>
          , Málaga, Spain, volume
          <volume>2943</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>491</fpage>
          -
          <lpage>499</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roser</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Enrique</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julio</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Damiano</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Overview of EXIST 2023 - Learning with Disagreement for Sexism Identification and Characterization</article-title>
          .
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023)</source>
          . Avi Arampatzis, Evangelos Kanoulas, Theodora Tsikrika, Stefanos Vrochidis, Anastasia Giachanou, Dan Li, Mohammad Aliannejadi, Michalis Vlachos, Guglielmo Faggioli, and Nicola Ferro, Eds. September
          <year>2023</year>
          , Thessaloniki, Greece.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roser</given-names>
            <surname>Morante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Enrique</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julio</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Damiano</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Overview of EXIST 2023 - Learning with Disagreement for Sexism Identification and Characterization (Extended Overview)</article-title>
          .
          <source>Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          . Mohammad Aliannejadi, Guglielmo Faggioli, Nicola Ferro and Michalis Vlachos, Eds.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Uma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fornaciari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dumitrache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simpson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          ,
          <article-title>SemEval-2021 task 12: Learning with disagreements</article-title>
          ,
          <source>in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>338</fpage>
          -
          <lpage>347</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <article-title>Evaluating extreme hierarchical multi-label classification</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>5809</fpage>
          -
          <lpage>5819</lpage>
          . URL: https://aclanthology.org/2022.acl-long.399. doi:10.18653/v1/2022.acl-long.399.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>arXiv preprint arXiv:1711.05101</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <article-title>Universal language model fine-tuning for text classification</article-title>
          ,
          <source>arXiv preprint arXiv:1801.06146</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Mixout: Effective regularization to finetune large-scale pretrained language models</article-title>
          ,
          <source>arXiv preprint arXiv:1909.11299</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>