EXIST2021: Detecting Sexism with Transformers and Translation-Augmented Data

Guillem García Subies
Instituto de Ingeniería del Conocimiento, Francisco Tomás y Valiente st., 11 EPS, B Building, 5th floor, UAM Cantoblanco, 28049 Madrid, Spain
guillem.garcia@iic.uam.es

Abstract. This paper describes a system created for the EXIST 2021 shared task, framed within the IberLEF 2021 workshop. We present an approach mainly based on fine-tuned BERT models and Data Augmentation through translation and backtranslation. We show an approach to multilingual problems that augments the data without the overfitting that an aggressive backtranslation can generate. Our models far outperform the baselines and achieve results close to the state of the art.

Keywords: Sexism Detection · BERT · Transformers · Data Augmentation · Backtranslation · Multilingual Corpora

1 Introduction

With the growing demands for social and equal rights, NLP can help in detecting harmful and sexist behaviors. The EXIST (sEXism Identification in Social neTworks) [16] shared task proposes, within the third edition of the IberLEF [11] workshop, a dataset for detecting sexism in its broadest sense, as well as for categorizing types of sexism.

This article summarizes our participation in all the EXIST tasks. Given the success of Transformer-inspired language models [20], both in academia and industry [21], we decided to use already pre-trained BERT [4] models. Specifically, we address the multilingual problem by using a different model for each language. We also conjecture that a good way to augment the data in multilingual problems is to translate the data into the other P languages of the dataset, so that instead of having n_i samples for each language i, every language ends up with the full sum of the n_i samples. As the dataset is not too big, we also explore Data Augmentation with Backtranslation [17].

In the next section, we will briefly review some previous work related to this topic. In Section 3 we will go through a brief description of the tasks and the corpora. Then, in Section 4, we will explain the main ideas behind the proposed models. In Section 5 we will present a summary of the experiments we carried out and the results we obtained. Finally, in Section 6 we will present the main conclusions of our work and propose some ideas for future work.

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Related Work

There is an extensive bibliography on Sentiment Analysis and text classification in social networks; however, not that much work has been done on identifying and classifying sexist behaviors in different languages. For instance, Anzovino et al. [1] proposed a sexism classification dataset in both Spanish and English, together with solutions based on n-grams and classic machine learning models such as SVMs. Following that line, Frenda et al. [5] used the English part of that dataset in combination with others to detect both misogyny and sexism with a similar classic NLP approach. More recent work by Grosz et al. [7] focuses on sexism in the workplace and obtains state-of-the-art results with GloVe embeddings and LSTMs modified with attention mechanisms. Another recent corpus, created by Rodríguez-Sánchez et al. [15], is focused on the detection of a broad range of sexist behaviors in Spanish tweets, from the most explicit abuse to more subtle expressions. They also showed that BERT-based models perform better than classic approaches or bidirectional LSTMs.

3 Tasks Description

The main corpus consists of 6977 tweets in both English and Spanish for the train split, and 3368 tweets plus 982 "gabs" (from gab.com), also in both languages, for the test split.

The first task, Sexism Identification, consists of classifying tweets as sexist or non-sexist. There are 3600 non-sexist tweets and 3377 sexist tweets, so we can consider the problem well balanced. The metric used for this task is accuracy.

For the second task, Sexism Categorization, there are six classes: non-sexist, ideological-inequality, stereotyping-dominance, misogyny-non-sexual-violence, sexual-violence and objectification. As we can see in Table 1, the dataset is unbalanced, so the F-measure is used as the ranking metric. We can see a similar distribution of the classes if we split the corpora into Spanish and English.

Class                           Nº Samples Task 1   Nº Samples Task 2
non-sexist                      3600                3600
sexist                          3377                –
  ideological-inequality        –                   866
  stereotyping-dominance        –                   809
  misogyny-non-sexual-violence  –                   685
  sexual-violence               –                   517
  objectification               –                   500

Table 1. Distribution of samples

Below, we can see some illustrative examples of the data and their labels (tweets quoted verbatim):

non-sexist: I love poetry books, so I'm reading the one i have on this plane flight and one of the flight attendants (black women) goes "it's good to see a brotha reading something that's is so deep"
ideological-inequality: Can the fellas participate or is this just for the ladies/Non binary people because I don't wanna get clowned.
stereotyping-dominance: ive been sooo interesting my whole life and i just want to be a boring trophy wife now
misogyny-non-sexual-violence: Fucking skank
sexual-violence: Bitches be begging me to fw them just to give me a reason not to fw them.
objectification: Lol some women just don't deserve onlyfans, bitches be UGLY as fuck and ask you to pay $20 to see their UGLY FAT BLOTCHED TITTIES, BITCH!

Table 2. Examples of the different classes

4 Models

4.1 Data Preprocessing

We performed a simple preprocessing where we substituted some expressions with a more normalized form, as sketched in the code after this list:

– Every URL was replaced with the string "[URL]", so the tokenizer does not produce strange tokens when it processes a URL. Furthermore, no semantic information about sexism can be inferred from a URL; the only information relevant to the model is that there is a URL in that position.
– The hashtag character ("#") was deleted ("#example" → "example") because the base language models we use are trained on generic text and might not understand its meaning. Furthermore, most hashtags are used in the same way as normal words.
– We replaced every username with the string "[USER]" because the exact name of a user does not really add any information about sexism. The only relevant feature is knowing whether someone was mentioned, not who.
– Finally, we normalized every laugh ("jasjajajajj" → "haha") to minimize the noise from misspellings, which are common in social networks.
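A minimal sketch of this preprocessing, assuming simple regular-expression rules; the exact patterns are ours for illustration, since the paper does not publish them:

```python
import re

def preprocess(tweet: str) -> str:
    """Normalize a tweet following the rules listed above (illustrative patterns)."""
    # 1. Replace every URL with the placeholder token "[URL]".
    tweet = re.sub(r"https?://\S+|www\.\S+", "[URL]", tweet)
    # 2. Replace every @username mention with the placeholder token "[USER]".
    tweet = re.sub(r"@\w+", "[USER]", tweet)
    # 3. Drop the hashtag character but keep the word itself.
    tweet = re.sub(r"#(\w+)", r"\1", tweet)
    # 4. Normalize laughs: Spanish "jaja"-style tokens and English "haha"-style
    #    tokens are both mapped to a canonical "haha".
    tweet = re.sub(r"\b[jas]*(?:ja|aj)[jas]*(?:ja|aj)[jas]*\b", "haha", tweet, flags=re.IGNORECASE)
    tweet = re.sub(r"\b(?:ha){2,}h?\b", "haha", tweet, flags=re.IGNORECASE)
    return tweet

print(preprocess("@user1 check https://t.co/abc #Example jasjajajajj"))
# -> "[USER] check [URL] Example haha"
```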
4.2 Baselines

We created some baselines so we can compare our models properly. We selected a HashingVectorizer + RandomForest model and a multilingual BERT (mBERT). This way, we can compare our models both to a classic feature-extraction model and to a simple BERT-based one.
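A minimal sketch of the classic baseline with scikit-learn; the pipeline settings and the toy data are our own illustrative choices, not the authors' published configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import make_pipeline

# Classic baseline: stateless hashing of word n-grams fed to a random forest.
baseline = make_pipeline(
    HashingVectorizer(ngram_range=(1, 2), alternate_sign=False),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# Toy stand-ins for the preprocessed EXIST training data (illustrative only).
train_texts = ["[USER] great talk today", "women belong in the kitchen"]
train_labels = ["non-sexist", "sexist"]

baseline.fit(train_texts, train_labels)
print(baseline.predict(["[USER] she should stick to cooking"]))
```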
4.3 Language Models

We decided to fine-tune one language model for every language. For the Spanish language, we selected BETO [3], a BERT model trained on the Spanish Unannotated Corpora (SUC) [2] that has proven to be much better than the multilingual BERT model. For the English part of the dataset we used the original BERT model [4].

For the second task, given the imbalance in the classes, we performed a hierarchical classification, where the model from Task 1 classifies tweets as sexist or non-sexist and another model is trained to detect the specific kinds of sexism.
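A minimal sketch of this two-stage inference with the transformers pipeline API; the model paths are placeholders for the two fine-tuned checkpoints (not released artifacts), and the label strings depend on each model's configuration:

```python
from transformers import pipeline

# Stage 1: the binary Task 1 model. Stage 2: a model fine-tuned only on the
# five sexist categories. Both paths are hypothetical placeholders.
binary_clf = pipeline("sentiment-analysis", model="path/to/task1-model")
fine_clf = pipeline("sentiment-analysis", model="path/to/task2-model")

def classify_hierarchical(tweet: str) -> str:
    """Return 'non-sexist' or one of the five fine-grained sexism labels."""
    if binary_clf(tweet)[0]["label"] == "non-sexist":
        return "non-sexist"
    # Only tweets flagged as sexist reach the fine-grained classifier.
    return fine_clf(tweet)[0]["label"]
```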
In addition, for the fine-tuning process, we carried out a grid-search optimization over the main hyperparameters of the neural network: learning rate, batch size and dropout rate. The search was performed with a 5-fold stratified cross-validation over the following grid: learning rate, (1e-6, 1e-5, 3e-5, 5e-5, 1e-4); batch size, (8, 16, 32); and dropout rate, (0.08, 0.1, 0.12). The best parameters for both models were: learning rate, 1e-5; batch size, 16; and dropout rate, 0.1.
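A minimal sketch of that search, assuming a hypothetical fine_tune_and_score helper that fine-tunes a model with the given hyperparameters on one fold and returns its validation accuracy; the paper does not publish the actual search script:

```python
from itertools import product
from statistics import mean
from sklearn.model_selection import StratifiedKFold

# The grid reported in the paper.
GRID = {
    "learning_rate": [1e-6, 1e-5, 3e-5, 5e-5, 1e-4],
    "batch_size": [8, 16, 32],
    "dropout": [0.08, 0.10, 0.12],
}

def grid_search(texts, labels, fine_tune_and_score):
    """Return the hyperparameters with the best mean 5-fold stratified CV score."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    best_score, best_params = -1.0, None
    for lr, bs, dropout in product(*GRID.values()):
        scores = [
            # Hypothetical helper: fine-tunes on the train fold and
            # returns accuracy on the validation fold.
            fine_tune_and_score(
                train_idx=tr, val_idx=va, texts=texts, labels=labels,
                learning_rate=lr, batch_size=bs, dropout=dropout,
            )
            for tr, va in cv.split(texts, labels)
        ]
        if mean(scores) > best_score:
            best_score, best_params = mean(scores), (lr, bs, dropout)
    return best_params, best_score
```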
4.4 Data Augmentation

As the dataset is relatively small, we decided to apply Data Augmentation techniques. We followed two different strategies to increase the amount of data in the corpora: Backtranslation [17] and translation between the languages of the dataset.

Backtranslation. This method consists of translating the samples into a pivot language and then translating them back into the original language. Given that existing translation methods are not perfect, we obtain samples that are written in a slightly different way but keep the original meaning. This technique has previously proven useful for sentiment analysis on Twitter corpora [10]. In this case, we used 30 pivot languages. For the translations we used the Helsinki NLP translation models [19], based on the Marian framework [9], and the Google Translate API for the pairs not available in the Helsinki NLP models. The selected languages are the following (in ISO 639-1): eu, la, zh-cn, hi, bn, pt, ru, ja, pa, mr, te, tr, ko, fr, de, vi, ta, ur, it, ar, fa, ha, kn, id, pl, uk, ro, eo, sv and el. In addition, es and en were used as pivots for the English and Spanish datasets respectively. For every sample in the corpus, we randomly picked one pivot language to perform the backtranslation, so we ended up with a corpus twice the original size.

Multilingual Translation. Following the above reasoning, we can also use labeled data (with the same gold standard) in other languages. We therefore translated every English sample into Spanish and vice versa. This way, we should obtain more robust training and avoid overfitting, because the "new" samples are completely new for that language's model, as opposed to the slightly modified samples from Backtranslation.
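A minimal sketch of both augmentation strategies with the Helsinki NLP models from the Hugging Face hub; the model names follow the real Helsinki-NLP/opus-mt-{src}-{tgt} naming convention, while the fixed pivot choice and the batch handling are simplified for illustration:

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    """Translate a batch of sentences with a Marian model from the hub."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return [
        tokenizer.decode(ids, skip_special_tokens=True)
        for ids in model.generate(**batch)
    ]

# Multilingual Translation: an English sample becomes a new Spanish sample
# with the same gold label.
spanish = translate(["I love poetry books"], "Helsinki-NLP/opus-mt-en-es")

# Backtranslation: source -> pivot -> source yields a paraphrase of the input.
def backtranslate(texts, pivot="es", src="en"):
    pivoted = translate(texts, f"Helsinki-NLP/opus-mt-{src}-{pivot}")
    return translate(pivoted, f"Helsinki-NLP/opus-mt-{pivot}-{src}")
```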
5 Experiments and Results

5.1 Experimental Setup

We trained all the models on an NVIDIA Tesla P100-PCIE-16GB GPU and an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz with 500GB of RAM. The software we used was Python 3.8, transformers 4.5.1 [21], PyTorch 1.8.1 [13] and scikit-learn 0.24.1 [14].

5.2 Results

In Table 3 we can see the results of our models on the test set of the first task. Note that the Tfidf+SVM baseline and the AI-UPV team (winner of the task) are taken from the task overview [16]. The runs we submitted to the contest were 2BERTs+Backtranslation, 2BERTs and 2BERTs+Multilingual translation, where 2BERTs refers to the different models used for each language, explained in Section 4.3. Our results correspond to the GuillemGSubies team in the official leaderboard.

The Backtranslation models had good results in our first training experiments; however, they proved to overfit considerably for this task, with an accuracy of 0.7479, just slightly better than the multilingual BERT baseline (0.7341) but worse than plain fine-tuned BERTs. This shows that Data Augmentation techniques are not always useful. Next, we can see that the Multilingual Translation models obtained an accuracy of 0.7683, which indicates better generalization than the model without any augmentation (0.7603). With this, our model is positioned very close to the best result in the competition, which is only 1.58% better.

Model                              Accuracy
HV+RF                              0.6830
Tfidf+SVM                          0.6845
mBERT                              0.7341
2BERTs+Backtranslation             0.7479
2BERTs                             0.7603
2BERTs+Multilingual translation    0.7683
AI-UPV team                        0.7804

Table 3. Results for Task 1

For the second task, the results were similar to the ones obtained in the first task; Table 4 shows them in more detail. Again, the Tfidf+SVM baseline and the AI-UPV team results come from the task overview [16]. Our models behaved consistently with the first task, although the results were not as good. Despite that, they are still very close to the best ones.

Model                              F-measure
Tfidf+SVM                          0.3950
HV+RF                              0.4131
mBERT                              0.4961
2BERTs+Backtranslation             0.5174
2BERTs                             0.5218
2BERTs+Multilingual translation    0.5295
AI-UPV team                        0.5787

Table 4. Results for Task 2

To sum up the improvements of our models, Table 5 shows an ablation study for our best model (2BERTs+Multilingual translation) on Task 1, where each entry has one feature removed from the best model. It shows that most of the ideas introduced produced some kind of improvement to the system. The most significant improvement came from the selection of good hyperparameters. Finally, it is also remarkable that we get a large improvement from Multilingual Translation, supporting our hypothesis about the ability of this Data Augmentation technique to generalize in multilingual corpora.

Model                              Accuracy
Best model                         0.7683
Default model (no grid search)     0.7451
Uncased                            0.7599
No augmentation                    0.7603
No preprocessing                   0.7678

Table 5. Ablation study for the Task 1 models

6 Conclusions and Future Work

Through this shared task, we have seen that NLP can be of great help in detecting and classifying unwanted toxic and sexist behavior in social networks, and that there is still a long way to go. The results obtained by our systems are very promising given their performance and their simplicity. Furthermore, we proposed a new way of approaching multilingual problems that provides better results. All this is very significant and could lead to much better results when combined with other improvements from the state of the art.

We believe that our results could improve substantially using language models trained specifically on corpora from social networks, such as TWilBert [6] for Spanish and BERTweet [12] for English. Another interesting approach would be to take a general language model and further pre-train it with corpora from the same domain as the final task [18]. These corpora would be easy to obtain, given that the authors of the EXIST 2021 shared task gathered theirs from a list of keywords [16]. Finally, we have seen that good hyperparameters are also key to a good neural network, so a better search, such as Population Based Training [8], could further improve the model.

Acknowledgments

This work has been partially funded by the Instituto de Ingeniería del Conocimiento (IIC), and the hardware used was also provided by the IIC.

References

1. Anzovino, M., Fersini, E., Rosso, P.: Automatic identification and classification of misogynistic language on Twitter. In: NLDB (2018)
2. Cañete, J.: Compilation of large Spanish unannotated corpora (May 2019). https://doi.org/10.5281/zenodo.3247731
3. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: PML4DC at ICLR 2020 (2020)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
5. Frenda, S., Ghanem, B., Montes-y-Gómez, M., Rosso, P.: Online hate speech against women: Automatic identification of misogyny and sexism on Twitter. Journal of Intelligent & Fuzzy Systems 36(5), 4743–4752 (2019)
6. Ángel González, J., Hurtado, L.F., Pla, F.: TWilBert: Pre-trained deep bidirectional transformers for Spanish Twitter. Neurocomputing (2020). https://doi.org/10.1016/j.neucom.2020.09.078
7. Grosz, D., Conde-Cespedes, P.: Automatic detection of sexist statements commonly used at the workplace (2020)
8. Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W.M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., Kavukcuoglu, K.: Population based training of neural networks (2017)
9. Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K., Neckermann, T., Seide, F., Germann, U., Aji, A.F., Bogoychev, N., Martins, A.F.T., Birch, A.: Marian: Fast neural machine translation in C++. In: Proceedings of ACL 2018, System Demonstrations, pp. 116–121. Association for Computational Linguistics, Melbourne, Australia (Jul 2018). https://doi.org/10.18653/v1/P18-4020
10. Luque, F.M.: Atalaya at TASS 2019: Data augmentation and robust embeddings for sentiment analysis (2019)
11. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez Carmona, M.Á., Álvarez Mellado, E., de Albornoz, J.C., Chiruzzo, L., Freitas, L., Adorno, H.G., Gutiérrez, Y., Zafra, S.M.J., Lima, S., de Arco, F.M.P., Taulé, M.: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). CEUR Workshop Proceedings (2021)
12. Nguyen, D.Q., Vu, T., Nguyen, A.T.: BERTweet: A pre-trained language model for English tweets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (2020)
13. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
14. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
15. Rodríguez-Sánchez, F., Carrillo-de Albornoz, J., Plaza, L.: Automatic classification of sexism in social networks: An empirical study on Twitter data. IEEE Access 8, 219563–219576 (2020)
16. Rodríguez-Sánchez, F., de Albornoz, J.C., Plaza, L., Gonzalo, J., Rosso, P., Comet, M., Donoso, T.: Overview of EXIST 2021: Sexism identification in social networks. Procesamiento del Lenguaje Natural 67 (2021)
17. Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1009
18. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? (2020)
19. Tiedemann, J., Thottingal, S.: OPUS-MT – Building open translation services for the World. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT). Lisbon, Portugal (2020)
20. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need (2017)
21. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online (Oct 2020). https://www.aclweb.org/anthology/2020.emnlp-demos.6