<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transformer Ensembles for Sexism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lily Davies</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Baldracchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Alessandro Borella</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos P</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National and Kapodistrian University of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>codec.ai</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This document presents in detail our work on the sexism detection task at the EXIST2021 workshop. Our methodology is built on ensembles of Transformer-based models that are pre-trained on different backgrounds and corpora and fine-tuned on the dataset provided by the EXIST2021 workshop. We report an accuracy of 0.767 and an F1 score of 0.766 for the binary classification task (task1), and an accuracy of 0.623 and an F1 score of 0.535 for the multi-class task (task2).</p>
      </abstract>
      <kwd-group>
        <kwd>Sexism Detection</kwd>
        <kwd>Transformers</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The EXIST workshop (sEXism Identification in Social neTworks) is the first shared
task at IberLEF 2021 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The goal of this particular workshop is to build classifiers for sexism detection.
      </p>
      <sec id="sec-1-1">
        <title>Dataset and Tasks</title>
        <p>
          As part of the workshop, an annotated dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] has been provided, which consists of sexist expressions, both in English and
Spanish, commonly used on social media. The texts were sourced from the Twitter
and Gab social media platforms.
        </p>
        <p>The training dataset consists of 6977 texts, 3436 in English and 3541 in
Spanish. In the training set, all of the texts have Twitter as their source. The
training dataset has been labelled with two separate label sets, which correspond
to the two workshop tasks: first, a higher-level binary annotation per text,
indicating whether the particular text is sexist or not, and a second layer of
annotation per text, where, if a particular text is identified as sexist, it is also
assigned one of the following labels: {IDEOLOGICAL AND INEQUALITY,
STEREOTYPING AND DOMINANCE, OBJECTIFICATION, SEXUAL
VIOLENCE, MISOGYNY AND NON-SEXUAL VIOLENCE}.</p>
        <p>For the first task, the dataset contains 3377 texts labelled as sexist and
3600 labelled as non-sexist, so it is rather balanced. For the second task, the
distribution of the tweets labelled as sexist across subcategories is shown in
Table 1:</p>
        <p>The test dataset consists of 4368 texts, 2208 in English and 2160 in Spanish.
In this dataset, 3386 of the texts are sourced from Twitter and 982 from the
social network Gab.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>System Architecture</title>
      <p>
        We combine two major approaches in our design: fine-tuning separate BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
models pre-trained on Spanish and English; and building ensembles of models per
language.
      </p>
      <sec id="sec-2-1">
        <title>Task1</title>
        <p>We fine-tune six separate models starting from different weight configurations:
three for English and three for Spanish.</p>
        <p>
          During testing, we report the majority vote of the ensemble, i.e. if at least
two of the three classifiers agree on a prediction, we report that prediction as the
final decision. Our work follows the reasoning and the results of [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], where
several models are trained on the same dataset with the same loss function but
starting from different random weight initialisations. The majority vote of the
model ensemble then tends to perform better than each standalone model.
        </p>
        <p>
          In the training process, we fine-tune the BERT models [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] on the training
data, using an 80-20 training-test split for each model, per language, where the
languages available in the dataset are English and Spanish. This choice follows
the empirical observation that language-specific pretrained BERT architectures
tend to capture the language subtleties better for the tasks posed and tend to
perform better on classification benchmarks.
        </p>
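        <p>As an illustration, the per-language filtering and 80-20 split described above can be sketched with the standard library alone; the function and field names here are ours, not taken from the paper's codebase.</p>

```python
import random

def split_80_20(rows, seed=0):
    """Shuffle and split (text, label) pairs into 80% train / 20% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(0.8 * len(rows))
    return rows[:cut], rows[cut:]

def per_language_splits(rows, seed=0):
    """rows: iterable of (text, label, language) tuples.
    Returns {language: (train_rows, test_rows)}, one 80-20 split per language."""
    by_lang = {}
    for text, label, lang in rows:
        by_lang.setdefault(lang, []).append((text, label))
    return {lang: split_80_20(subset, seed) for lang, subset in by_lang.items()}
```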
        <p>
          For English texts we use the pre-trained model from [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], available in the
Huggingface transformers library [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For Spanish texts we fine-tune the BETO
pre-trained model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          We use PyTorch [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for the implementation.
        </p>
        <p>
          Before feeding the texts to the classifiers, we pre-process them by
replacing mentions with the mention token and URLs with the URL token. While
theoretically user mentions and handles can implicitly capture social graph
structure and potentially increase classifier performance [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], we choose to rely
only on textual information and discard social graph or source information for the
tasks in question.
        </p>
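        <p>The pre-processing step above amounts to two regular-expression substitutions; the exact placeholder strings ("@USER" and "HTTPURL" below) are our assumption, as the text does not specify the literal tokens used.</p>

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")

def preprocess(text, mention_token="@USER", url_token="HTTPURL"):
    """Replace user mentions and URLs with fixed placeholder tokens."""
    text = MENTION_RE.sub(mention_token, text)
    text = URL_RE.sub(url_token, text)
    return text
```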
        <p>
          We first filter the input training set based on the language indicator and then
seed three neural networks with different initial random weights. We train the
networks for 10 epochs and select the one with the best accuracy as the final
model per training run. In testing, we feed the same text to all three neural networks
and report the majority vote of the classifiers as the final classification. The final
output is classified as sexist if at least two of the three classifiers report it
as sexist and, similarly, a text is classified as non-sexist if at least two out of
three classifiers report it as non-sexist. Based on our experiments, this tends to
give a 2% increase in the overall reported accuracy. We use the Adam optimiser
for training [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
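        <p>The majority vote described above reduces to a few lines; this is a minimal sketch, assuming each classifier's prediction is available as a label string.</p>

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label reported by most members of the ensemble.
    With three classifiers and two labels, this is 'at least two of three'."""
    return Counter(predictions).most_common(1)[0][0]
```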
      </sec>
      <sec id="sec-2-2">
        <title>Task2</title>
        <p>For the second task, we train a classifier to distinguish between the categories of
sexist tweets. That is, we only keep the texts in the training dataset that have been
flagged as sexist and we train a multi-class model on these texts.</p>
        <p>Again, we train different models for Spanish and English texts. Since
we have the language labels of the texts, there is no need to determine
the language of a text; however, in the general case it would be straightforward to
add one more step to the pipeline for language detection using an off-the-shelf
language detection library, such as langdetect (https://pypi.org/project/langdetect/).</p>
        <p>At testing time, we first apply the task1 classifier to detect whether a tweet
is sexist or not. If the text is labelled as non-sexist, we report this label for task2
as well. If the task1 model reports sexist as the label, we feed the tweet to the
second classifier to obtain a prediction for the sexism categorisation label as
described above.</p>
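        <p>The two-stage decision rule above can be sketched as follows; task1_model and task2_model are placeholders for the fine-tuned classifiers, assumed here to be callables returning a label string.</p>

```python
def predict_task2(text, task1_model, task2_model):
    """Cascade: run the task2 categoriser only on texts flagged as sexist."""
    if task1_model(text) == "non-sexist":
        # Task1 says non-sexist: propagate that label to task2 as well.
        return "non-sexist"
    # Otherwise ask the multi-class model for the sexism category.
    return task2_model(text)
```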
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>For the first task, we achieve an accuracy of 0.767, with a macro F1 score of 0.766
(10th place in the ranking). For the second task, accuracy is 0.623 and the
F1 score 0.535 (19th place in the ranking).</p>
      <p>Breaking down the final results by language, for task1 the accuracy on English
texts is 0.7445 and on Spanish texts 0.789. For task2, the accuracy on
English texts is 0.583 with an F1 score of 0.493, whereas on Spanish texts accuracy is
0.664 and the F1 score 0.575.</p>
      <p>Overall, the system tends to perform better in Spanish, most
probably because the underlying BERT model for Spanish (BETO)
is trained exclusively on Spanish texts.</p>
      <p>
        Whereas we use ensembles for the first task, we decided not to adopt
this strategy for the multi-class classification of task2, as it would have
significantly increased training and testing time as well as computational cost. Instead,
we use the output of the first task and only train one classifier per
language for the second task. It is worth noting here that the observation from
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is validated in this case, as the ensemble achieves up to 2% higher accuracy
than standalone models.
      </p>
      <sec id="sec-3-1">
        <title>Analysis</title>
        <p>It is interesting to see that our models fail to correctly identify texts that have
been labelled as non-sexist but contain words that commonly appear in
sexist contexts. They also fail to correctly categorise sexist tweets that appear
in the same context as non-sexist words (for example, the word "friend"), short
texts, or, sometimes, subtle or contextual uses of sexist language.</p>
        <p>For task2, the predictions inherit the error of the task1 classifier, e.g.
falsely reported sexist tweets, which accumulates with the error of the second
model. The confusion matrices for the two tasks are shown in Figure 1.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Improvements and future research</title>
        <p>One question we aim to investigate is the optimal number of models
in the ensemble from a classification-accuracy perspective. This is a more
general question and direction to investigate for Transformer-based models for text
classification.</p>
        <p>
          Additionally, it has recently been shown that Convolutional Neural
Networks can outperform Transformer-based architectures in Natural Language
Processing tasks [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The use and fine-tuning of pre-trained convolutions in
the domain of abusive and toxic speech would be an interesting direction to
investigate.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Manuel</given-names> <surname>Montes</surname></string-name>,
          <string-name><given-names>Paolo</given-names> <surname>Rosso</surname></string-name>,
          et al.
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), CEUR Workshop Proceedings</source>,
          <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Francisco</given-names> <surname>Rodríguez-Sánchez</surname></string-name>,
          <string-name><given-names>Jorge</given-names> <surname>Carrillo-de-Albornoz</surname></string-name>,
          et al.
          <article-title>"Overview of EXIST 2021: sEXism Identification in Social neTworks"</article-title>.
          In: <source>Procesamiento del Lenguaje Natural 67.0</source> (<year>2021</year>). issn: 1989-7553.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Jacob</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>Ming-Wei</given-names> <surname>Chang</surname></string-name>,
          et al.
          <article-title>"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"</article-title>.
          In: <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>.
          Minneapolis, Minnesota: Association for Computational Linguistics, <year>June 2019</year>,
          pp. <fpage>4171</fpage>-<lpage>4186</lpage>. doi: 10.18653/v1/N19-1423. url: https://www.aclweb.org/anthology/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Zeyuan</given-names> <surname>Allen-Zhu</surname></string-name>,
          <string-name><given-names>Yuanzhi</given-names> <surname>Li</surname></string-name>,
          et al.
          <article-title>"Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning"</article-title>.
          In: arXiv preprint 2012.09816 (Dec. <year>2020</year>).
          url: https://www.microsoft.com/en-us/research/publication/towards-understanding-ensemble-knowledge-distillation-and-self-distillation-in-deep-learning/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Ashish</given-names> <surname>Vaswani</surname></string-name>,
          <string-name><given-names>Noam</given-names> <surname>Shazeer</surname></string-name>,
          et al.
          <article-title>"Attention is All you Need"</article-title>.
          In: <source>Advances in Neural Information Processing Systems</source>.
          Ed. by I. Guyon, <string-name><given-names>U. V.</given-names> <surname>Luxburg</surname></string-name>, et al.
          Vol. <volume>30</volume>. Curran Associates, Inc., <year>2017</year>.
          url: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Sudhanshu</given-names> <surname>Mishra</surname></string-name>,
          <string-name><given-names>Shivangi</given-names> <surname>Prasad</surname></string-name>,
          et al.
          <article-title>"Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020"</article-title>.
          English. In: <source>Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying</source>.
          Marseille, France: European Language Resources Association (ELRA), May <year>2020</year>,
          pp. <fpage>120</fpage>-<lpage>125</lpage>. isbn: 979-10-95546-56-6. url: https://www.aclweb.org/anthology/2020.trac1.19.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Thomas</given-names> <surname>Wolf</surname></string-name>,
          <string-name><given-names>Lysandre</given-names> <surname>Debut</surname></string-name>,
          et al.
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>.
          <year>2020</year>. arXiv: 1910.03771 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>José</given-names> <surname>Cañete</surname></string-name>,
          <string-name><given-names>Gabriel</given-names> <surname>Chaperon</surname></string-name>,
          et al.
          <article-title>"Spanish Pre-Trained BERT Model and Evaluation Data"</article-title>.
          In: <source>PML4DC at ICLR 2020</source>. <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Adam</given-names> <surname>Paszke</surname></string-name>,
          <string-name><given-names>Sam</given-names> <surname>Gross</surname></string-name>,
          et al.
          <article-title>"Automatic differentiation in PyTorch"</article-title>.
          In: <source>NIPS-W</source>. <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Konstantinos</given-names> <surname>Perifanos</surname></string-name>,
          <string-name><given-names>Eirini</given-names> <surname>Florou</surname></string-name>,
          et al.
          <article-title>"Neural Embeddings for Idiolect Identification"</article-title>.
          In: <source>2018 9th International Conference on Information, Intelligence, Systems and Applications (IISA)</source>.
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>3</lpage>. doi: 10.1109/IISA.2018.8633681.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Diederik P.</given-names> <surname>Kingma</surname></string-name>
          and
          <string-name><given-names>Jimmy</given-names> <surname>Ba</surname></string-name>.
          <article-title>Adam: A Method for Stochastic Optimization</article-title>.
          <year>2017</year>. arXiv: 1412.6980 [cs.LG].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Yi</given-names> <surname>Tay</surname></string-name>,
          <string-name><given-names>Mostafa</given-names> <surname>Dehghani</surname></string-name>,
          et al.
          <article-title>Are Pre-trained Convolutions Better than Pre-trained Transformers?</article-title>
          <year>2021</year>. arXiv: 2105.03322 [cs.CL].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>