GuillemGSubies at IberLEF-2021 DETOXIS task: Detecting Toxicity with Spanish BERT

Guillem García Subies
Instituto de Ingeniería del Conocimiento, Francisco Tomás y Valiente st., 11, EPS, B Building, 5th floor, UAM Cantoblanco, 28049 Madrid, Spain
guillem.garcia@iic.uam.es

Abstract. This paper describes a system created for the DETOXIS 2021 shared task, framed within the IberLEF 2021 workshop. We present an approach mainly based on fine-tuned BERT models, using a Grid-Search and Data Augmentation with MLM substitution. This approach only takes into account the textual data from the dataset, to prove the power of language models alone. Our models far outperform the baselines and achieve results close to the state of the art.

Keywords: Toxicity Detection · BERT · Transformers · Data Augmentation · BETO

1 Introduction

Polarization can be a very problematic issue in society, especially on social media. There are manual mechanisms to report toxic behaviors; however, they can be slow and inefficient. To address this, we can use NLP to detect these undesirable behaviors automatically. The DETOXIS (DEtection of TOxicity in comments In Spanish) [15] shared task, proposed during this third edition of the IberLEF [10] workshop, provides a corpus for detecting the toxicity level of comments in internet forums and newspaper discussions. This article summarizes our participation in all the DETOXIS tasks.

Given the success of Transformer-inspired language models [16], both in academia and industry [17], we decided to use already pre-trained BERT [5] models. Specifically, we use BETO [4] with some extra transfer learning techniques for ordinal classification problems and a hyperparameter Grid-Search. To address the problem of small data, we use Data Augmentation techniques.

In the next section, we briefly review previous work related to this topic. In Section 3 we go through a brief description of the tasks and the corpus. Then, in Section 4, we explain the main ideas behind the proposed models. In Section 5 we present a summary of the experiments we carried out and the results we obtained. Finally, in Section 6 we expose the main conclusions of our work and propose some ideas for future work.

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Related Work

There is an extensive bibliography on Sentiment Analysis and text classification in social networks; however, not much work had been done on identifying and classifying toxic behaviors until 2019. Most toxicity detection datasets focus on classifying what kind of toxicity is present in the text, instead of the level of toxicity. Basile et al. [2] propose a task for the identification of toxicity against women and migrant people in Spanish and English tweets. Struß et al. [13] also focus on Twitter, but with German tweets. Other corpora have a more multilingual emphasis, like Kumar et al. [8] and Zampieri et al. [18].

However, all of them have a common denominator: the best results have been obtained with some kind of Transformer or BERT-based model. For instance, the best models in Struß et al. [13] are fine-tuned BERTs pre-trained either on general German data or specifically on German tweets. The best results in Kumar et al. [8] also point to BERT and Transformers, specifically a bootstrap aggregation of BERT models. Finally, the best results in Zampieri et al. [18] also come from Transformer-based models and ensembles of them. This is a clear indicator of the trends in the state of the art for this topic.

3 Tasks Description

The main corpus consists of 3463 comments posted in Spanish online newspapers and forums for the train split and 890 for the test one. They were collected from August 2017 to July 2020. Furthermore, the articles were selected taking into account their potential toxicity and the number of comments in them (more than 50 comments).

For the first task, the comments are annotated into two categories: toxic and not toxic. The second task consists of further classifying that toxicity into four levels: toxicity level 0 = not toxic, toxicity level 1 = mildly toxic, toxicity level 2 = toxic and toxicity level 3 = very toxic.

In addition to the classification labels, every sample is also annotated for argumentation, constructiveness, stance, target, stereotype, sarcasm, mockery, insult, improper language, aggressiveness and intolerance. However, none of these annotations are public in the test set, so they are ignored in this study. Finally, for every comment, there is also a label indicating whether the comment is a response to another comment. This information is not used either, because we focus only on the textual data.

The metrics used to evaluate the results are the F-measure for the first task and the Closeness Evaluation Metric (CEM) [1] for the second task. This last metric is very useful for ordinal classification problems, given that it takes into account the order of the classes using concepts from Measurement Theory.

Class          Nº of samples
not toxic      2317
mildly toxic   808
toxic          269
very toxic     69

Table 1. Distribution of Samples

In Table 1 we can see the distribution of the samples in the train split. The most notable fact is that 67% of the samples are not toxic, so the dataset is unbalanced. For the second task, the unbalance is even more pronounced. In Table 2 we can see some illustrative examples of the data and their labels (English glosses added in parentheses):

not toxic:    "Los detuvieron en ronda malaga, un saludo" ("They arrested them in Ronda, Malaga, cheers")
mildly toxic: "Loss mas valientes, los que mejor cortan nuestras cabezas, Para vosotros, socialistas, izquierdistas, y no racistas," ("The bravest ones, the ones who best cut off our heads, For you, socialists, leftists, and non-racists,")
toxic:        "Esto es lo que importas cuando los rescatas en lugar de hundirlos. Está claro que vienen los mejores." ("This is what you import when you rescue them instead of sinking them. Clearly, the best ones are coming.")
very toxic:   "Haced que pase putos rojos de mierda." ("Make it happen, you fucking shitty reds.")

Table 2. Examples of the different classes

4 Models

4.1 Data Preprocessing

We performed a simple preprocessing where we substituted some expressions with a more normalized form:

– Every URL was replaced with the token "[URL]" so we do not get strange tokens when the tokenizer tries to process a URL. Furthermore, no semantic information about toxicity can be inferred from a URL; the only information relevant for the model is that there is a URL in that token.
– We also normalized every laugh ("jasjajajajj" → "haha") to minimize the noise of the misspellings that are common in social networks.
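The following minimal sketch illustrates this preprocessing. It is a hypothetical reimplementation: the regular expressions shown here, in particular the laugh-matching heuristic, are our assumptions and may differ from the exact patterns used in the system.

    import re

    # URLs: anything starting with http(s):// or www.
    URL_RE = re.compile(r"https?://\S+|www\.\S+")
    # Laughs: words of four or more characters built only from j/a/s/x
    # that contain at least one "j" and one "a" (a rough heuristic).
    LAUGH_RE = re.compile(r"\b(?=[jasx]*j)(?=[jasx]*a)[jasx]{4,}\b",
                          re.IGNORECASE)

    def preprocess(text: str) -> str:
        """Normalize URLs and laughs in a raw comment."""
        text = URL_RE.sub("[URL]", text)   # every URL becomes the token [URL]
        text = LAUGH_RE.sub("haha", text)  # "jasjajajajj" and variants -> "haha"
        return text

    print(preprocess("jajaja mira esto https://example.com"))
    # -> "haha mira esto [URL]"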
4.2 Baselines

We created some baselines so we can compare our models properly. We selected a HashingVectorizer + RandomForest pipeline (HV+RF). This way, we can compare our models to a classic feature extraction approach.
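As a reference, a pipeline equivalent to this baseline can be written in a few lines of scikit-learn [12]. The concrete values below (number of hash features, number of trees) are illustrative assumptions, not the exact settings of our runs.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.pipeline import make_pipeline

    # HV+RF baseline: a fixed-size hashed bag-of-words representation
    # fed into a random forest classifier.
    baseline = make_pipeline(
        HashingVectorizer(n_features=2**18),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )

    # texts: list of preprocessed comments; labels: 0 (not toxic) / 1 (toxic)
    # baseline.fit(texts, labels)
    # predictions = baseline.predict(test_texts)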
4.3 Language Models

We used BETO [4], a BERT model trained with the Spanish Unannotated Corpora (SUC) [3] that has proven to be much better than the multilingual BERT model. We tried different training strategies, given that the classes are related to each other:

– The simplest approach treats both tasks as a single multiclass classification problem. For the first task, everything different from not toxic is considered toxic.
– For the second approach, we first trained a binary classification model to distinguish between not toxic and toxic for the first task. Then we trained a multiclass model to classify between the three levels of toxicity.
– Similarly, we then tried to transform the second task into three different binary classification problems: classifying between not toxic and the rest of the classes, between mildly toxic and toxic or very toxic, and between toxic and very toxic. With this, we tried to obtain very specific models that can differentiate slight changes in toxicity.
– As there are not enough samples for the last models in the previous approach to learn correctly, we also tried a transfer learning approach similar to the one presented by Sun et al. [14]. Instead of always starting from the same BETO pre-trained model for every fine-tuned model, we used the fine-tuned model from the step before; i.e., the model that classifies mildly toxic and toxic has as its base model the one that classifies not toxic and mildly toxic.

In addition, for the fine-tuning process, we carried out a Grid-Search optimization over the main parameters of the neural network: learning rate, batch size and dropout rate. The search was performed with a 5-fold stratified cross-validation over the following grid: learning rate, (1e-6, 1e-5, 3e-5, 5e-5, 1e-4); batch size, (8, 16, 32); and dropout rate, (0.08, 0.1, 0.12). The best parameters for both models were: learning rate, 1e-5; batch size, 16; and dropout rate, 0.1.
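The following condensed sketch shows the shape of this search. The helper train_and_eval is hypothetical: it stands in for the code that fine-tunes BETO on one fold with the given hyperparameters and returns the validation score.

    from itertools import product
    from statistics import mean

    from sklearn.model_selection import StratifiedKFold

    # Hyperparameter grid described above.
    GRID = {
        "learning_rate": [1e-6, 1e-5, 3e-5, 5e-5, 1e-4],
        "batch_size": [8, 16, 32],
        "dropout": [0.08, 0.10, 0.12],
    }

    def grid_search(texts, labels, train_and_eval):
        """Return the best (learning_rate, batch_size, dropout) triple."""
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        best_score, best_params = float("-inf"), None
        for lr, bs, dp in product(*GRID.values()):
            fold_scores = [
                train_and_eval(texts, labels, train_idx, val_idx,
                               learning_rate=lr, batch_size=bs, dropout=dp)
                for train_idx, val_idx in skf.split(texts, labels)
            ]
            score = mean(fold_scores)
            if score > best_score:
                best_score, best_params = score, (lr, bs, dp)
        return best_params  # in our experiments: (1e-5, 16, 0.1)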
4.4 Data Augmentation

As the dataset is relatively small, we decided to apply Data Augmentation techniques. The selected strategy was Data Augmentation through the masking of words with a Masked Language Model, BETO. For every sample in the dataset, we randomly masked 15% of the tokens and used BETO to predict them, creating a modified sample. With this method, we doubled the number of original samples.
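A minimal sketch of this augmentation with nlpaug [9] follows. The Hugging Face checkpoint name is an assumption for illustration; any Spanish BERT checkpoint can be plugged in the same way.

    import nlpaug.augmenter.word as naw

    # MLM substitution: mask a fraction of the tokens and let the
    # language model fill them in with plausible alternatives.
    aug = naw.ContextualWordEmbsAug(
        model_path="dccuchile/bert-base-spanish-wwm-cased",  # assumed BETO checkpoint
        action="substitute",
        aug_p=0.15,  # mask roughly 15% of the tokens
    )

    augmented = aug.augment("Los detuvieron en ronda malaga, un saludo")

Applying the augmenter once to every comment produces one modified copy per sample, which doubles the size of the training set.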
5 Experiments and Results

5.1 Experimental Setup

We trained all the models on an NVIDIA Tesla P100-PCIE-16GB GPU and an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz with 500 GB of RAM. The software we used was Python 3.8, transformers 4.5.1 [17], pytorch 1.8.1 [11], scikit-learn 0.24.1 [12] and nlpaug 1.1.3 [9].

5.2 Results

In Table 3 we can see the results of our models on the test set of the first task. Note that the results for the ChainBOW baseline, the Word2VecSpacy baseline and the SINAI team (winner of the task) are taken from the task overview [15]. Our runs for this task are BETO-multiclass and BETO-binary (see Section 4.3), each with and without Data Augmentation (Section 4.4). Note that some of the results presented here were obtained after the labeled test set was published, so we could analyze our models in depth.

We can see that there is almost no difference between the binary and multiclass models. This might happen because there is a great difference in the amount of samples and in toxicity between the not toxic comments and the rest, which makes them easy to identify in every situation. Finally, we can see that the Data Augmentation strategy obtained around 0.02 points more than the models without augmentation. Our result was the second best in the competition, which shows that the simple combination of BETO and a Grid-Search can yield really good results.

Model                  F-measure
Word2VecSpacy          0.1523
ChainBOW               0.3747
HV+RF                  0.4159
BETO-multiclass        0.5721
BETO-binary            0.5777
BETO-multiclass-aug    0.5981
BETO-binary-aug        0.6000
SINAI team             0.6461

Table 3. Results for task 1

For the second task, the results were similar to the ones obtained in the first task; Table 4 shows them in more detail. Again, the ChainBOW baseline, Word2VecSpacy baseline and SINAI team results are taken from the task overview [15]. We can see that our transfer learning approach (BETO-transfer) obtains better results than the other approaches, and that there is almost no difference between the simple multiclass approach and the one that first detects the not toxic comments (BETO-2models-aug). These results are in line with the ones in the first task, showing that adding the not toxic class to the models does not make them worse. Our best run placed fifth among the 24 participating teams, which shows that our approach, given its simplicity and the lack of any linguistic analysis, is very competitive.

Model                  CEM
Word2VecSpacy          0.6116
HV+RF                  0.6214
ChainBOW               0.6535
BETO-2models-aug       0.6891
BETO-multiclass-aug    0.6913
BETO-3models-aug       0.704
BETO-transfer          0.7172
BETO-transfer-aug      0.7189
SINAI team             0.7495

Table 4. Results for task 2

6 Conclusions and Future Work

Through this shared task, we have seen that NLP can be of great help in detecting and classifying unwanted toxic behavior in social networks, although there is still a long way to go. The results obtained by our systems are very promising given their performance and their simplicity. These methods are also significant because they could lead to much better results when combined with other improvements from the state of the art.

We believe that our results could improve considerably by using specific language models trained with corpora from social networks, like TWilBert [6]. Another interesting approach would be to take a general language model and further pre-train it with corpora from the same domain as the final task [14]. Finally, we have seen that good hyperparameters are key for a good neural network, so a better search, like Population Based Training [7], could further improve the model.

Acknowledgments

This work has been partially funded by the Instituto de Ingeniería del Conocimiento (IIC), and the hardware used was also provided by the IIC.

References

1. Amigó, E., Gonzalo, J., Mizzaro, S., de Albornoz, J.C.: An effectiveness metric for ordinal classification: Formal properties and experimental results (2020)
2. Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F.M., Rosso, P., Sanguinetti, M.: SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th International Workshop on Semantic Evaluation. pp. 54–63. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019). https://doi.org/10.18653/v1/S19-2007, https://www.aclweb.org/anthology/S19-2007
3. Cañete, J.: Compilation of large Spanish unannotated corpora (May 2019). https://doi.org/10.5281/zenodo.3247731
4. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: PML4DC at ICLR 2020 (2020)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
6. González, J.Á., Hurtado, L.F., Pla, F.: TWilBERT: Pre-trained deep bidirectional transformers for Spanish Twitter. Neurocomputing (2020). https://doi.org/10.1016/j.neucom.2020.09.078, http://www.sciencedirect.com/science/article/pii/S0925231220316180
7. Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W.M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., Kavukcuoglu, K.: Population based training of neural networks (2017)
8. Kumar, R., Ojha, A.K., Malmasi, S., Zampieri, M.: Evaluating aggression identification in social media. In: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying. pp. 1–5. European Language Resources Association (ELRA), Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.trac-1.1
9. Ma, E.: NLP augmentation. https://github.com/makcedward/nlpaug (2019)
10. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez Carmona, M.Á., Álvarez Mellado, E., de Albornoz, J.C., Chiruzzo, L., Freitas, L., Adorno, H.G., Gutiérrez, Y., Zafra, S.M.J., Lima, S., de Arco, F.M.P., Taulé, M.: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). In: CEUR Workshop Proceedings (2021)
11. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
13. Struß, J.M., Siegel, M., Ruppenhofer, J., Wiegand, M., Klenner, M.: Overview of GermEval task 2, 2019 shared task on the identification of offensive language. In: Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019). pp. 352–363. German Society for Computational Linguistics & Language Technology und Friedrich-Alexander-Universität Erlangen-Nürnberg, München (2019), https://nbn-resolving.org/urn:nbn:de:bsz:mh39-93197
14. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? (2020)
15. Taulé, M., Ariza, A., Nofre, M., Amigó, E., Rosso, P.: Overview of the DETOXIS task at IberLEF-2021: Detection of toxicity in comments in Spanish. Procesamiento del Lenguaje Natural 67 (2021)
16. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need (2017)
17. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. Association for Computational Linguistics, Online (Oct 2020), https://www.aclweb.org/anthology/2020.emnlp-demos.6
18. Zampieri, M., Nakov, P., Rosenthal, S., Atanasova, P., Karadzhov, G., Mubarak, H., Derczynski, L., Pitenis, Z., Çöltekin, Ç.: SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). In: Proceedings of the Fourteenth Workshop on Semantic Evaluation. pp. 1425–1447. International Committee for Computational Linguistics, Barcelona (online) (Dec 2020), https://www.aclweb.org/anthology/2020.semeval-1.188