Transformers and Data Augmentation for Aggressiveness Detection in Mexican Spanish

Mario Guzman-Silverio, Ángel Balderas-Paredes and Adrián Pastor López-Monroy

Mathematics Research Center (CIMAT), Jalisco S/N Valenciana, 36023 Guanajuato, GTO, México.

Abstract

In this paper we describe the system designed by the Mathematics Research Center (CIMAT) for participating at MEX-A3T 2020. In this work, we addressed the Aggressiveness Detection (AD) task by exploiting Bidirectional Encoder Representations from Transformers (BERT) and Data Augmentation. BERT fine-tuning has shown outstanding performance in a wide range of language tasks. However, according to recent research, fine-tuning BERT on small datasets (<10K instances) often results in unstable models. In other words, even when only the final layer is randomly initialized, distinct random seeds lead to substantially different results. In this paper, we use two strategies that help produce more stable classification models based on fine-tuned BERTs. The first strategy takes advantage of ensembles, whereas the second relies on data augmentation. The experimental evaluation shows that our proposals outperform all baselines by a wide margin and obtained the overall first place for Aggressiveness Detection in Mexican Spanish tweets.

Keywords

Aggressiveness Detection, Transformers, Data Augmentation, Text Classification

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), September 2020, Málaga, Spain.
email: mario.guzman@cimat.mx (M. Guzman-Silverio); angel.balderas@cimat.mx (Á. Balderas-Paredes); pastor.lopez@cimat.mx (A.P. López-Monroy)
url: https://www.cimat.mx/es/adrián-pastor-lópez-monroy (A.P. López-Monroy)
orcid: 0000-0003-1018-4221 (A.P. López-Monroy)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Nowadays, Internet users can easily share information in a number of social platforms. In this regard, the automatic analysis of textual information has been a popular research topic for the scientific community. This is especially true in applications that could prevent risks. Recently, some academic competitions have emerged as forums where researchers evaluate their approaches for specific tasks by analyzing the way language is used by people, for example, the case of offensive language [1].

The MEX-A3T 2020 [2] forum tackles two tracks focused on digital text forensics: Fake News (FN) and Aggressiveness Detection (AD) in Mexican Spanish. The interest of this paper is only on AD, which aims to determine which tweets attempt to insult, offend, attack, or hurt other people. In this regard, AD could prevent damage and harmful scenarios such as cyberbullying [3]. The models and strategies we propose here are suitable to represent and augment such aggressive tweets by using Transformers and Data Augmentation. For the FN track, due to the different structure of the documents, our approach was a simple baseline, kept only as a reference for future research: a Bag-of-N-grams at word and character level fed into a Support Vector Machine (SVM).
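For illustration, a minimal sketch of such an n-gram baseline is given below; the feature ranges, vectorizer settings, and the C value are assumptions for the example, not the exact configuration of our submission.

```python
# Illustrative Bag-of-N-grams (word + character) baseline with a linear SVM.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

ngram_svm = Pipeline([
    ("features", FeatureUnion([
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 6))),
    ])),
    ("svm", LinearSVC(C=1.0, class_weight="balanced")),
])

# train_texts / train_labels are hypothetical placeholders for the task data.
# ngram_svm.fit(train_texts, train_labels)
# predictions = ngram_svm.predict(test_texts)
```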
The AD task is commonly approached as a supervised classification problem [4, 5, 1]. The problem has been addressed using a number of strategies, including regression models [6], user network attributes [7], and distributional term representations [8]. Recently, with the rise of deep learning, some authors have used Recurrent Neural Networks [9] and Convolutional Neural Networks [10].

One of the most successful approaches is Bidirectional Encoder Representations from Transformers (BERT), first proposed in [11]. BERT has shown outstanding performance in a wide range of Natural Language Processing (NLP) tasks. For offensive language detection and hate speech detection, BERT has been successfully used by several authors in different ways [12, 13]. For general text classification, the most common strategy is to take advantage of fine-tuning. For example, BERT can be pre-trained on general domains (e.g., Wikipedia) to model syntactic and semantic properties of language that are useful in other tasks. In simple words, pre-trained BERT models are fine-tuned to specific domains by substituting the output layers of the model and re-training the rest of the layers at a specific pace.

Notwithstanding the effectiveness of BERT in text classification, several works have pointed out its instability when fine-tuning on small datasets [14, 15, 16]. In simple words, even when only the final layer is randomly initialized, distinct random seeds lead to substantially different results. In this regard, considering that the AD corpus has fewer than 10K samples, we propose a classification strategy based on combining several BERT models trained with different seeds on different augmented datasets. By doing this we aim to obtain a model that has, on average, solid performance but small variance. According to our evaluation, the use of ensemble methods with specific voting schemes and adversarial data augmentation can improve the effectiveness of BERT while maintaining lower variance in performance on the small and unbalanced AD dataset.

The remainder of this document is organized as follows: Section 2 presents the proposed strategies. Section 3 describes the experimental settings. The experiments and results are presented in Section 4. Finally, Section 5 outlines the conclusions and future work.

2. Stable Classification Strategies

This section describes our two proposed strategies to alleviate the instability of fine-tuning BERT on small and unbalanced datasets. These strategies are built on fine-tuned BERT models that only vary the random seed of an extra classification layer for domain adaptation. The first strategy is based on BERT ensembles and two well-known voting schemes. The second strategy uses data augmentation to further improve effectiveness and stability. In our evaluation, we empirically show that both methodologies provide benefits for AD, and both ranked first place in the challenge.

2.1. BERT Ensemble model

There are many ways to combine the information of several models, but we are particularly interested in those based on ensemble theory. The idea behind an ensemble is that several (possibly weak) models can make a strong one [17]. One of the key ideas in successfully building ensembles is that the prediction space should be diverse [18]; that is, individual models can differ in their decisions, and we hypothesize that the unstable individual BERTs provide this diversity.

In general, when these conditions are met, it is possible to obtain a stronger model with simple strategies. In our case we consider the following two straightforward voting schemes (a small sketch of both is given after the list):

• Majority Voting Scheme: we predict the most voted class among the classifiers of the ensemble. In case of a tie, we predict randomly among the tied classes.

• Weighted Voting Scheme: we aggregate the confidence predictions for each class across the models of the ensemble to build a final weighted vote. In our case, this confidence prediction is the output of the last Softmax layer.
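A minimal sketch of the two voting schemes over per-model Softmax outputs; the array shapes, helper names, and tie-breaking rule are illustrative assumptions.

```python
# Illustrative voting schemes for an ensemble of fine-tuned BERTs.
# `probs` is assumed to have shape (n_models, n_examples, n_classes) and to hold
# each model's Softmax output on the same examples.
import numpy as np

def majority_vote(probs: np.ndarray) -> np.ndarray:
    votes = probs.argmax(axis=-1)                       # (n_models, n_examples)
    n_classes = probs.shape[-1]
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes), 0, votes
    )                                                   # (n_classes, n_examples)
    # Ties are broken by the lowest class index here; the paper breaks them randomly.
    return counts.argmax(axis=0)

def weighted_vote(probs: np.ndarray) -> np.ndarray:
    # Sum (equivalently, average) the per-class confidences across models.
    return probs.sum(axis=0).argmax(axis=-1)
```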
These strategies based on ensembles and simple voting schemes have proven very convenient and have been explored with different base models in several domains [17, 9]. It is worth mentioning that there are other popular alternatives for combining models. For example, the Early Fusion strategy consists of feeding a classifier with the concatenation of the weights in the penultimate layer of each model [18]. Another strategy could be an end-to-end network built from the models. All of these strategies are interesting, but definitely much more computationally expensive than using individual models and voting. For that reason we prefer these simple, yet effective, voting strategies over the others. In our experimental evaluation we will show their effectiveness in reducing the variance and improving the performance.

2.2. Data Augmentation

A common and effective technique in deep learning for image-related tasks is data augmentation, in which the goal is to create new training data by means of a transformation: sometimes simple ones such as rotations, reflections and cropping, and sometimes more complex techniques that achieve better results [19].

Data augmentation for text-based tasks is a very different scenario. The information in documents is sequential, and the word, usually taken as the basic unit, has a syntactic and semantic meaning that depends on the context. Thus, changing individual words or their order could result in noisy data that hurts the performance. This is especially true for approaches beyond Bags-of-Words that are inspired by language modeling, such as BERT, where the order, context and structure of the text matter. Fortunately, there have been some advances demonstrating slight improvements in some scenarios [20]. In this work, we have carefully adapted two methods and proposed a new one to perform data augmentation. The explored data augmentation strategies are the following (a simplified sketch of the EDA operations is given after the list):

• Easy Data Augmentation (EDA) [20]: This is the simplest strategy and consists of generating new instances by modifying 20% of the original tweets. To this end, four basic operations are applied to randomly selected words in each tweet:

  1. Replace: select a word and change it for a random synonym.
  2. Insert: randomly choose a synonym of a word and insert it at a random position.
  3. Swap: select two words and swap their positions.
  4. Delete: remove a word from the sentence.

  For Easy Data Augmentation, one extra tweet was created for each tweet.

• Unsupervised Data Augmentation (UDA) [21]: This implies the use of semi-supervised learning: each sentence of the original training set is augmented, and the Kullback-Leibler divergence is used to penalize the difference between the distributions of the logits. For Unsupervised Data Augmentation, four elements were created for each selected element of the input; EDA was used to create those new elements.

• Adversarial Data Augmentation (ADA) [22]: At each epoch of the training, an adversarial method is used to create a new input for the misclassified sentences. For adversarial data augmentation, an implementation of TextFooler [22] for Spanish was used, in which the purpose is to create a well-classified input from an originally misclassified one.
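The following is a simplified sketch of the EDA-style operations; the synonym lookup and the number of edits per tweet are placeholders, since the original English resources had to be adapted to Spanish.

```python
# Simplified EDA-style augmentation (hypothetical helpers, not our exact adaptation).
import random

def get_synonyms(word):
    """Placeholder: in practice this would query a Spanish synonym resource."""
    return []

def eda_augment(tweet: str, n_ops: int = 2) -> str:
    words = tweet.split()
    for _ in range(n_ops):
        if not words:
            break
        op = random.choice(["replace", "insert", "swap", "delete"])
        i = random.randrange(len(words))
        synonyms = get_synonyms(words[i])
        if op == "replace" and synonyms:
            words[i] = random.choice(synonyms)            # swap in a random synonym
        elif op == "insert" and synonyms:
            words.insert(random.randrange(len(words) + 1), random.choice(synonyms))
        elif op == "swap" and len(words) > 1:
            j = random.randrange(len(words))
            words[i], words[j] = words[j], words[i]       # swap two word positions
        elif op == "delete" and len(words) > 1:
            del words[i]
    return " ".join(words)

# One extra tweet per original tweet, as in the EDA setting described above.
# augmented = [eda_augment(t) for t in train_tweets]
```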
It is worth mentioning that the previously described strategies were originally designed for the English language and therefore rely on language-dependent tools. Thus, for each strategy we made several adaptations in order to exploit it in Spanish.

3. Datasets, Baselines and Experimental Settings

The MEX-A3T team provided the dataset, which has 5222 non-aggressive and 2110 aggressive tweets. For the experiments in this paper we split the training data into 80% for training and 20% for test. To compare the strategies proposed in this paper, we trained two baselines that are commonly used in the literature:

• N-grams ensemble with SVM: We used a grid search to find a suitable number of unigrams, bigrams and trigrams at the word level. We also searched for the number of 2- to 6-grams at the character level. These features were fed into an SVM, for which we explored the C hyperparameter.

• Bi-LSTM: This baseline is a neural architecture with a Bi-LSTM and a classification layer. We use the pre-trained word2vec vectors in Spanish from the Caro and Cuervo Institute - Linguistic Research Group (https://www.datos.gov.co). We fix the learning rate at 1e-3 and use the Adam optimizer.

Regarding fine-tuning BERT, we set the hyper-parameters as recommended by the authors in [11]. We use Adam with a learning rate of 1e-5 and a batch size of 32 for three epochs. We use a classification layer and, as loss function, the Cross Entropy Loss weighted by the proportion of each class. In this process, we used a BETO model pre-trained on Spanish text [23], through the widely known default implementation in [24].
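A minimal sketch of this fine-tuning setup with a recent version of the Hugging Face transformers library is shown below; the checkpoint name, the exact weighting formula, and the data handling are assumptions for the example rather than our exact training script.

```python
# Sketch: fine-tuning a Spanish BERT (BETO) with class-weighted cross entropy.
import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "dccuchile/bert-base-spanish-wwm-uncased"   # assumed BETO checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# One reading of "weighted by class proportion": inverse-frequency class weights.
counts = torch.tensor([5222.0, 2110.0])                  # non-aggressive, aggressive
loss_fn = CrossEntropyLoss(weight=counts.sum() / (2.0 * counts))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def train_step(batch_texts, batch_labels):
    enc = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**enc).logits
    loss = loss_fn(logits, torch.tensor(batch_labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Training loop (batch size 32, three epochs, one run per random seed) omitted.
```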
4. Evaluation

This section describes the experimental evaluation that shows the main findings of this work. First, in Section 4.1 we design a set of experiments to observe the benefits of building an ensemble of several BERTs. Second, in Section 4.2 we explore data augmentation strategies to further improve their performance. In order to have comparable results in this section, we partitioned the training dataset in the following way: 80% for training and 20% for test. The metrics we use to compare the methods are the F1 in each class, the macro F1 and the standard deviation.

4.1. BERT Ensembles

The purpose of the experiments in this section is to empirically show that BERT ensembles are helpful and more stable for classification. For these experiments BERT was fully fine-tuned (all 130 million parameters) two hundred times with different seeds initializing the last layer to detect aggressiveness. This pool is then used to compute the averages over one hundred runs for each voting scheme. This means that, for Single BERT, we report the average performance of one hundred individual BERTs together with the standard deviation. For an Ensemble (n), n different BERTs were randomly taken from the pool for each of the one hundred runs; we report the average performance and the standard deviation.

Table 1: Ensemble evaluation using F1-score for the aggressive class and standard deviation over one hundred runs. For the Ensemble rows we report two results: the left one for the majority voting scheme, the right one for the weighted voting scheme.

Model           F1-avg offensive   F1-std offensive   F1-avg non-offensive   F1-avg macro
Ngrams-SVM      74.53              n/a                89.51                  82.02
Bi-LSTM         73.00              n/a                87.33                  80.17
Single BERT     79.31              0.836              91.88                  85.60
Ensemble (5)    79.93 | 79.88      0.637 | 0.617      92.14 | 92.13          86.03 | 86.01
Ensemble (10)   79.68 | 80.01      0.541 | 0.516      92.17 | 92.19          85.93 | 86.10
Ensemble (20)   79.94 | 80.17      0.475 | 0.446      92.23 | 92.26          86.08 | 86.22

In Table 1 we show the performance of BERT ensembles and other reference approaches. First of all, note that Single BERT clearly outperforms the two baselines: 1) N-grams-SVM and 2) Bi-LSTM. Both approaches are strong references, since we used well-known strategies and heuristics to find suitable hyper-parameters (see Section 3). The margin of improvement of Single BERT over the two baselines also shows that, on average, the fine-tuning was done successfully. From Table 1 one can also note that Single BERT has a standard deviation of 0.836 over one hundred runs. However, when several BERTs are combined in an ensemble, the classification performance improves while the standard deviation decreases. In Table 1 each Ensemble (n) row (n BERTs) has two values in each metric column: the left value was obtained with the majority voting scheme, whereas the right value represents the performance with the weighted voting scheme.

In Figure 1, the left plot shows the F1 of the aggressive class as the number of BERTs in the ensemble increases up to one hundred. The red line is for the majority vote, whereas the black line is for the weighted vote. In a similar way, the right plot shows how the variance decreases as the size of the ensemble increases. The ensemble based on the majority vote seems to have stable and higher performance until the number of BERTs is greater than sixty. Nevertheless, the weighted voting scheme seems to be the better choice, as its variance is consistently lower than that of the majority voting scheme. The CIMAT-1 run reported by the organizers in the final ranking of the challenge is a simple Ensemble (20). In the following section, we show how data augmentation can be used to improve the performance further.

Figure 1: F1 average by number of BERTs in each kind of ensemble.

4.2. Data Augmentation

In this section we evaluate the data augmentation strategies described in Section 2.2. The purpose of these experiments is to improve the classification performance of the previously evaluated ensemble approaches. In all experiments of this section we use the weighted voting scheme for the ensemble methods. Furthermore, as data augmentation strategies are computationally expensive in time and storage, the results of this section are based on fifty runs instead of one hundred.

Table 2: Evaluation of trained models on validation data with F1-score for the aggressive class and standard deviation over fifty runs.

Model           vanilla        eda            uda            adv
Single BERT     79.66 ± .67    79.29 ± .78    79.55 ± .57    79.66 ± .74
Ensemble (5)    80.47 ± .43    80.54 ± .48    80.21 ± .47    80.52 ± .50
Ensemble (10)   80.68 ± .34    80.73 ± .42    80.32 ± .38    80.73 ± .42
Ensemble (20)   80.69 ± .25    80.87 ± .29    80.39 ± .24    80.92 ± .30

The experimental results in Table 2 show that augmenting the data with EDA and ADV improves the ensembles' performance, while keeping the standard deviation low compared to Single BERT. The UDA strategy actually seems to hurt the performance, although it still yields lower variance. The overall trend can be seen in the left plot of Figure 2, where the ADV strategy seems to be the best choice for augmenting text.
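All ensemble figures in Tables 1 and 2 (and in Table 3 below) follow the sampling protocol of Section 4.1. A rough sketch of that evaluation loop, with hypothetical names for the model pool and labels, could look like this:

```python
# Sketch of the evaluation protocol: repeatedly sample n models from the pool of
# fine-tuned BERTs, combine them by weighted voting, and average the F1 scores.
# `pool_probs` (one Softmax output array per fine-tuned BERT) and `y_true` are assumed.
import numpy as np
from sklearn.metrics import f1_score

def ensemble_f1(pool_probs, y_true, n_models, n_runs=100, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        idx = rng.choice(len(pool_probs), size=n_models, replace=False)
        preds = np.sum([pool_probs[i] for i in idx], axis=0).argmax(axis=-1)
        scores.append(f1_score(y_true, preds, pos_label=1))   # F1 of the aggressive class
    return float(np.mean(scores)), float(np.std(scores))
```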
In Figure 3 we show the boxplots of fifty runs of the Ensemble (20) for the different data augmentation strategies. Note that the blue boxplot, corresponding to the vanilla setting (no data augmentation), seems to have lower results than the red boxplot of the adversarial data augmentation. Furthermore, the variance remains low for all data augmentation strategies.

Figure 3: Boxplots of fifty runs of the Ensemble (20) using different data augmentation strategies, evaluating the F1 for the aggressive class.

Table 3: Evaluation of models, trained without dropout, on validation data with F1-score on the aggressive category.

Model           vanilla        eda            uda            hard_adv
Single BERT     79.70 ± .74    79.28 ± .81    79.85 ± .71    79.06 ± .73
Ensemble (5)    80.43 ± .37    80.34 ± .64    80.28 ± .40    80.12 ± .45
Ensemble (10)   80.45 ± .33    80.37 ± .52    80.26 ± .30    80.53 ± .35
Ensemble (20)   80.46 ± .25    80.53 ± .37    80.18 ± .24    80.75 ± .25

Figure 2: Histograms of the performance of Ensemble (20) with different data augmentation strategies. The left plot is for fine-tuned BERTs with dropout, whereas the right plot is for BERTs without dropout.

Finally, in Table 3 and the right plot of Figure 2, we show the experimental results of removing the dropout in the last layer of the BERT fine-tuning process. Reportedly, it is better to keep dropout, but if the test data come from a very similar distribution it might not be necessary. For example, note that the vanilla strategy (no data augmentation) has very similar results with and without dropout (see the first row of both tables). However, in Table 3 note that data augmentation does not improve the performance at the same pace, and in some cases it hurts. Also note that the size of the ensemble helps, especially if the ADV strategy is used. Finally, note that the UDA strategy improves the Single BERT model, which results in the best single model. These results suggest that while UDA improves the behavior of the model by reinforcing positive examples every time, the use of adversaries might do so by learning a wider portion of the dataset, specifically the hard-to-learn examples. That would explain why the ensembles behave better as they grow larger when using adversarial augmentation.

In the final ranking of the challenge, CIMAT-2 corresponds to an Ensemble (20) that used EDA data augmentation. As the experiments in Tables 2 and 3 show, this is not the best data augmentation strategy, since its performance is lower than that of ADV with a similar or higher standard deviation. The experiments that use UDA and ADV were carried out once the challenge had finished.

5. Conclusions

We proposed strategies for AD that are based on ensembles of fine-tuned BERTs. The weighted voting scheme seems to be helpful for combining the decisions of several models. This yields competitive performance and low variance of the model. The experiments show that there is room for the data augmentation paradigm in the tool set of the deep learning specialist. However, even if the results suggest that there is some improvement when using data augmentation, they also imply certain trade-offs. The results indicate that there is more to be gained when working with adversarial data augmentation on ensembles, because the models seem to learn in a more heterogeneous way. But the cost of doing so is quite high because of the need for costly methods to generate the adversarial examples.
Acknowledgments

Guzman-Silverio and Balderas-Paredes appreciate CONACYT's support through scholarships 925934 and 928540, respectively.

References

[1] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. R. Pardo, P. Rosso, M. Sanguinetti, SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 54–63.

[2] M. E. Aragón, H. Jarquín, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda, H. Gómez-Adorno, G. Bel-Enguix, J.-P. Posadas-Durán, Overview of MEX-A3T at IberLEF 2020: Fake news and aggressiveness analysis in Mexican Spanish, in: Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Málaga, Spain, September, 2020.

[3] N. Safi Samghabadi, A. P. López Monroy, T. Solorio, Detecting early signs of cyberbullying in social media, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, European Language Resources Association (ELRA), Marseille, France, 2020, pp. 144–149. URL: https://www.aclweb.org/anthology/2020.trac-1.23.

[4] M. E. Aragón, A. P. López-Monroy, Author profiling and aggressiveness detection in Spanish tweets: MEX-A3T 2018, in: IberEval@SEPLN, 2018, pp. 134–139.

[5] M. E. Aragón, M. Á. Álvarez-Carmona, M. Montes-y-Gómez, H. J. Escalante, L. Villaseñor-Pineda, D. Moctezuma, Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in Mexican Spanish tweets, in: Notebook Papers of 1st SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Bilbao, Spain, 2019.

[6] L. P. Del Bosque, S. E. Garza, Aggressive text detection for cyberbullying, in: A. Gelbukh, F. C. Espinoza, S. N. Galicia-Haro (Eds.), Human-Inspired Computing and Its Applications, Springer International Publishing, Cham, 2014, pp. 221–232.

[7] D. Chatzakou, N. Kourtellis, J. Blackburn, E. De Cristofaro, G. Stringhini, A. Vakali, Detecting aggressors and bullies on Twitter, in: Proceedings of the 26th International Conference on World Wide Web Companion, WWW '17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2017, pp. 767–768.

[8] H. J. Escalante, E. Villatoro-Tello, S. E. Garza, A. P. López-Monroy, M. Montes-y-Gómez, L. Villaseñor-Pineda, Early detection of deception and aggressiveness using profile-based representations, Expert Systems with Applications 89 (2017) 99–111.

[9] G. K. Pitsilis, H. Ramampiaro, H. Langseth, Effective hate-speech detection in Twitter data using recurrent neural networks, Applied Intelligence 48 (2018) 4730–4742.

[10] Z. Zhang, D. Robinson, J. Tepper, Detecting hate speech on Twitter using a convolution-GRU based deep neural network, in: European Semantic Web Conference, Springer, 2018, pp. 745–760.

[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.

[12] A. Paraschiv, D.-C. Cercel, UPB at GermEval-2019 Task 2: BERT-based offensive language classification of German tweets, in: Preliminary Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), German Society for Computational Linguistics & Language Technology, Erlangen, Germany, 2019, pp. 396–402.
[13] M. Mozafari, R. Farahbakhsh, N. Crespi, A BERT-based transfer learning approach for hate speech detection in online social media, in: International Conference on Complex Networks and Their Applications, Springer, 2019, pp. 928–940.

[14] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, N. Smith, Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping, arXiv preprint arXiv:2002.06305 (2020).

[15] T. Zhang, F. Wu, A. Katiyar, K. Q. Weinberger, Y. Artzi, Revisiting few-sample BERT fine-tuning, arXiv preprint arXiv:2006.05987 (2020). URL: https://arxiv.org/pdf/2006.05987.pdf.

[16] M. Mosbach, M. Andriushchenko, D. Klakow, On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines, arXiv preprint arXiv:2006.04884 (2020).

[17] S. Vega-Pons, J. Ruiz-Shulcloper, A survey of clustering ensemble algorithms, International Journal of Pattern Recognition and Artificial Intelligence 25 (2011) 337–372.

[18] L. Rokach, Ensemble-based classifiers, Artificial Intelligence Review 33 (2010) 1–39.

[19] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q. V. Le, AutoAugment: Learning augmentation strategies from data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 113–123.

[20] J. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6382–6388. URL: https://www.aclweb.org/anthology/D19-1670. doi:10.18653/v1/D19-1670.

[21] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, Q. V. Le, Unsupervised data augmentation, 2019. URL: http://arxiv.org/abs/1904.12848. arXiv:1904.12848.

[22] D. Jin, Z. Jin, J. T. Zhou, P. Szolovits, Is BERT really robust? A strong baseline for natural language attack on text classification and entailment, 2019. arXiv:1907.11932.

[23] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: to appear in PML4DC at ICLR 2020, 2020.

[24] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's Transformers: State-of-the-art natural language processing, ArXiv abs/1910.03771 (2019).