       GTH-UPM at DETOXIS-IberLEF 2021:
      Automatic Detection of Toxic Comments in
                  Social Networks

        Sergio Esteban Romero, Ricardo Kleinlein, Cristina Luna-Jiménez,
             Juan Manuel Montero[0000−0002−7908−5400], and Fernando
                     Fernández-Martínez[0000−0003−3877−0089]

          Speech Technology Group, Center for Information Processing and
    Telecommunications, E.T.S.I. de Telecomunicación, Universidad Politécnica de
              Madrid, Av. Complutense, No 30, Madrid, 28040, Spain
    sergio.estebanro@alumnos.upm.es, {ricardo.kleinlein, cristina.lunaj,
               juanmanuel.montero, fernando.fernandezm}@upm.es


        Abstract. Sadly, toxic messages are relatively frequent on social networks,
        whether in the form of stereotypes, sarcasm, mockery, insults, inappropriate
        language, aggressiveness, intolerance, or hate speech against immigrants
        and/or women, among others. This presence should not be ignored by the
        scientific community, since it is its responsibility to develop tools and
        systems that allow such messages to be automatically detected and removed.
        In this paper, we present an exploratory analysis in which different deep
        learning (DL) models for the detection of toxic expressions have been
        evaluated on the DETOXIS-IberLEF 2021 challenge using the official release
        of the NewsCom-TOX corpus. In particular, we compare traditional RNN and
        state-of-the-art transformer models. Our experiments confirm that optimum
        performance can be obtained from transformer models. Specifically, top
        performance was achieved by fine-tuning a BETO model (the pre-trained
        BERT model for the Spanish language from the Universidad de Chile) for the
        toxicity detection tasks. Another contribution of this analysis is the
        validation of the proposed method for adding task-specific vocabulary
        (new tokens), which can effectively extend the original vocabulary of the
        pre-trained models.

        Keywords: Classification task · Toxicity detection · Recurrent networks
        · Attention · Transformer models · Transfer learning · Social networks



1     Introduction
The automatic detection of toxic language, especially in online tweets and com-
ments, is a task that has attracted growing interest from the NLP (Natural Lan-
guage Processing) community in recent years and has become a tremendously
popular and active research area because of its impact on modern society.
    In this regard, the DETOXIS challenge is a great opportunity to tackle the
hard task of identifying toxic comments in social media. Detecting toxicity is not
an easy task at all: it involves much more than just identifying specific words or
sentences, since the context must also be taken into account, which makes the
task even more complex. The present work is well aligned with this interest,
and its objective is the design and implementation of computational models for
toxicity assessment and classification of comments in Spanish. Different models
have been proposed and evaluated for both subtasks of the challenge:
 – Subtask 1: Toxicity detection, a binary classification task that consists
   of classifying the content of a comment as toxic (toxic=yes) or not toxic
   (toxic=no).
 – Subtask 2: Toxicity level detection, a more fine-grained classification
   task in which the aim is to identify the level of toxicity of a comment
   (0 = not toxic; 1 = mildly toxic; 2 = toxic; 3 = very toxic).
   In this paper we present the different toxicity recognition models we
have developed as part of our participation in DETOXIS-IberLEF 2021. Our first
model is an extension of the system we previously developed for intent detection
and classification based on word embeddings and recurrent neural networks [6].
Our second model, the one used in our final submissions to the challenge, is a
transformer-based model developed by fine-tuning a BETO model (the pre-trained
BERT model for the Spanish language from the Universidad de Chile) for the
toxicity detection tasks. The lessons learned from our experiments and from
participating in the challenge are reported in the following sections.


2   Related work
Automatic detection of toxic comments, so that they can be deleted, is mandatory
in a world pervaded by social media. Traditional methods such as RNN models
have provided efficient solutions for similar tasks in multiple fields. In particular,
RNN models based on Bi-LSTMs make it possible to exploit the context of the
sentence, improving performance. However, recent advances such as the BERT
model have greatly contributed to enhancing natural language processing [5]. The
BERT family of algorithms is based on the Transformer architecture [22], a
particular type of neural processing unit that outperforms traditional LSTM cells
[9]. Transformers have become state-of-the-art models in many of the most popular
NLP tasks, including automatic toxicity detection in texts [19, 13].


3   The NewsCom-TOX dataset
The data provided for the challenge are grouped in the NewsCom-TOX dataset, which
contains around 4,357 posts. Most comments are responses to articles from different
Spanish newspapers. Each comment is classified as toxic or not toxic, and also into
four different levels of toxicity. In addition, other features such as argumentation,
sarcasm, mockery or insult are included [21]. Furthermore, we must consider that the
classes are not balanced, since the dataset contains a higher proportion of non-toxic
comments.


4     RNN based model

As a first solution for the toxicity analysis we have used a Recurrent Neural
Network (RNN), a type of model widely used in the analysis of Twitter mes-
sages, for example. RNNs have the ability to process their inputs sequentially,
performing the same operation, $h_t = f_W(x_t, h_{t-1})$, on each of the different
elements that constitute our input sequence (i.e. words or, to be more exact, their
corresponding embeddings), where $h_t$ is the hidden state, $t$ the time step, and
$W$ the weights of the network.
    As can be observed, the operation is formulated in such a way that the
hidden state at each time step depends on the previous hidden states. Hence, the
order of the elements in our sequences (i.e. the order of the words) is particularly
important. As an immediate consequence, RNNs allow us to handle inputs (i.e.
sentences) of variable length, which happens to be an essential feature given the
nature of our problem.
    Among the different possible architectures of this type of network, we have
opted for the so-called Long Short-Term Memory (LSTM) networks [2], a special
type of RNN that helps prevent the typical vanishing gradient problem of
standard RNNs by introducing a gating mechanism that ensures proper gradient
flow through the network. The main characteristic of LSTMs is their ability to
learn long-term dependencies. To do this, these networks rely on basic constituent
units called cells, which are provided with mechanisms that decide, for each cell,
what information is preserved from that provided by the previous cells and what
information is passed on to the next ones, both depending on the cell's current
state.


4.1   Embeddings

(Word) embeddings are vector-type representations of words in reduced-dimensional
vector spaces where semantically similar words are close to each other. The fastText
project [8], recently open-sourced by Facebook Research, provides a fast and
effective method to learn word embeddings that are very useful in text classification,
clustering and information retrieval. In this work, the proposed model uses fastText
word embeddings to represent the words given as input to the network.
    fastText trains by sliding a window over the input text and either learning the
target word from the remaining context (continuous bag of words, CBOW) or learning
all the context words from the target word (skip-gram). Learning can be viewed as a
series of updates to a neural network with two layers of weights and three layers of
neurons, in which the output layer has one neuron for each word in the vocabulary and
the hidden layer has as many neurons as there are dimensions in the embedding space.
This approach is very similar to Word2Vec [14]. However, unlike Word2Vec, fastText
also learns vectors for sub-parts of words: so-called character n-grams. This ensures
that, for instance, the words love, loved and beloved all have similar vector
representations, even if they tend to show up in different contexts. This feature
enhances learning on heavily inflected languages [3].
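    As an illustration, the following minimal sketch uses the fasttext Python package
and a pre-trained Spanish binary model; the file name cc.es.300.bin is an assumption
(any fastText model with sub-word information would behave in the same way):

```python
import numpy as np
import fasttext

# Pre-trained Spanish fastText vectors; the file name is an assumption
# (any fastText binary model with sub-word information behaves the same way).
model = fasttext.load_model("cc.es.300.bin")

# Thanks to character n-grams, inflected (or even unseen) forms still get a
# vector and remain close to their base form.
v_base = model.get_word_vector("tóxico")
v_plural = model.get_word_vector("tóxicos")

cosine = np.dot(v_base, v_plural) / (np.linalg.norm(v_base) * np.linalg.norm(v_plural))
print(f"cosine(tóxico, tóxicos) = {cosine:.3f}")
```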


4.2   Model description

Our approach is based on a 2-layer Bidirectional LSTM (Bi-LSTM) model with a deep
self-attention mechanism, which is represented in Figure 1. The model is implemented
in PyTorch [15] and based on the architecture proposed in [2].


Embedding layer The model is designed to work with sequences of words as
inputs, thus allowing us to process any type of sentence. For this, a first embedding
layer is provided that collects the embeddings $x_1, x_2, \ldots, x_N$ corresponding
to each of the words $w_1, w_2, \ldots, w_N$ constituting the sentence we want to
process, where $N$ is the number of words in the sentence. We initialize the weights
of the embedding layer with our pre-trained word embeddings.


Bi-LSTM layer A standard LSTM model behaves in a unidirectional way,
that is, the network takes as input the direct sequence of word embeddings
and produces the outputs $h_1, h_2, \ldots, h_N$, where $h_i$ is the hidden state of
the LSTM cell at time step $i$, summarizing all the information that the network has
accumulated from our sentence up to word $w_i$.
    Instead, we have used a bi-directional LSTM (Bi-LSTM) that allows us to
collect such information in both directions. In particular, a Bi-LSTM consists
of two LSTMs: a forward $\overrightarrow{LSTM}$ that analyses the sentence from
$w_1$ to $w_N$, and an inverse or backward $\overleftarrow{LSTM}$ that carries out a
similar analysis in the opposite direction, from $w_N$ to $w_1$. To obtain the
definitive outputs of our Bi-LSTM layer, we simply concatenate, for each word,
the outputs obtained from the analysis performed in each direction (see
Equation 1, in which $\|$ corresponds to the concatenation operator and $L$ to the
size of each LSTM).

                  $h_i = \overrightarrow{h_i} \,\|\, \overleftarrow{h_i}$, where $h_i \in \mathbb{R}^{2L}$                         (1)


Attention layer In order to identify the most informative words when determining
the polarity of the sentence, the model uses a deep self-attention mechanism. Thus,
the actual importance and contribution of each word is estimated by means of a
multilayer perceptron (MLP) composed of 2 layers with a non-linear activation
function (tanh), similar to that proposed in [16].
   The MLP learns the attention function $g$ as a probability distribution over
the hidden states $h_i$, which allows us to obtain the attention weights $a_i$ that
each word receives. As the output of the attention layer, the model simply computes
the convex combination $r = \sum_i a_i h_i$ of the LSTM outputs $h_i$ with weights
$a_i$, where a convex combination is a linear combination of points in which all the
coefficients are non-negative and add up to 1.

Output layer Finally, we use r as a feature vector which we feed to a final
task-specific layer for classification. In particular, we use a fully-connected layer,
followed by a softmax operation, which outputs the probability distribution over
the classes.




                        Fig. 1. Proposed RNN based model.
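
    For illustration, a minimal PyTorch sketch of this architecture (embedding layer
initialised from pre-trained vectors, a 2-layer Bi-LSTM, an MLP self-attention layer
and a softmax output) is given below; layer sizes, names and the padding/masking
strategy are illustrative assumptions rather than the exact implementation used in
this work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMAttentionClassifier(nn.Module):
    def __init__(self, pretrained_embeddings, hidden_size=200, num_classes=2,
                 attn_size=64, dropout=0.3):
        super().__init__()
        # Embedding layer initialised with the pre-trained fastText vectors.
        self.embedding = nn.Embedding.from_pretrained(
            pretrained_embeddings, freeze=False, padding_idx=0)
        emb_dim = pretrained_embeddings.size(1)
        # 2-layer bidirectional LSTM: each output h_i lives in R^{2L}.
        self.bilstm = nn.LSTM(emb_dim, hidden_size, num_layers=2,
                              bidirectional=True, batch_first=True,
                              dropout=dropout)
        # Deep self-attention: a 2-layer MLP with tanh scoring each h_i.
        self.attention = nn.Sequential(
            nn.Linear(2 * hidden_size, attn_size),
            nn.Tanh(),
            nn.Linear(attn_size, 1),
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids, mask):
        x = self.embedding(token_ids)                 # (B, N, emb_dim)
        h, _ = self.bilstm(x)                         # (B, N, 2L)
        scores = self.attention(h).squeeze(-1)        # (B, N)
        scores = scores.masked_fill(mask == 0, -1e9)  # ignore padding positions
        a = F.softmax(scores, dim=-1)                 # attention weights a_i
        r = torch.bmm(a.unsqueeze(1), h).squeeze(1)   # convex combination r
        return self.classifier(self.dropout(r))       # logits over the classes


# Toy usage with random embeddings and a dummy batch of 8 padded sentences.
emb = torch.randn(5000, 300)
model = BiLSTMAttentionClassifier(emb)
ids = torch.randint(1, 5000, (8, 20))
mask = torch.ones(8, 20, dtype=torch.long)
probs = F.softmax(model(ids, mask), dim=-1)           # (8, 2) class probabilities
```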




5    Transformer based model
Aiming to improve the results obtained with the RNN model, we decided to use
transformers, a type of model that also operates on sequences of tokens. For this
task we used BERT, which stands for Bidirectional Encoder Representations from
Transformers and allows working in both directions. Something that makes these
models so powerful is their attention mechanism, which identifies the keywords in a
sentence. These models usually receive sentences as inputs that are divided into
single tokens, obtaining a sequence of them.
    The way this process is carried out depends on the tokenizer used, but BERT's
tokenizer is based on words and subwords. So, for instance, if a word is not included
in the original vocabulary, it will be divided into a sequence of subtokens that
together form the original word. However, when fine-tuning a model on a specific
domain, this often happens with many common words related to the topic, so we
decided to test whether adding new tokens to the initial vocabulary results in better
toxicity recognition. These new tokens correspond to the most frequent words in our
training dataset that were not already included in the original tokenizer vocabulary.
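    To illustrate the idea, the sketch below uses the Hugging Face transformers API
(which the Simple Transformers library wraps); the checkpoint name is the publicly
available cased BETO model and the example word is hypothetical, both given here as
assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

# Hypothetical domain-specific word assumed to be missing from the vocabulary.
word = "inmigracionismo"
print(tokenizer.tokenize(word))   # split into several subtokens (illustrative)

# After adding it as a new word-level token, it is kept as a single unit.
tokenizer.add_tokens([word])
print(tokenizer.tokenize(word))   # ['inmigracionismo']
```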


5.1   Model adaptation

Pre-trained NLP models have led to breakthrough performance improvements in
different tasks including intent recognition or sentiment analysis, among many
others [17]. However, the adoption of pre-trained language models still must face
two important challenges as their applications expand [20]:

 1. First, the need for large training resources: training requires substantial com-
    putation and data (see, e.g., BERT-large [5], RoBERTa [12]), while the most com-
    mon situation is that the available training resources are significantly constrained
    or limited.
 2. Second, the need for extending pre-trained models with domain-specific vo-
    cabulary: every target domain, such as the social media domain on which
    this work focuses, has its own vocabulary, and sentences in the domain may
    have words from both the original language model’s vocabulary and new
    domain-specific vocabulary. Being able to operate on this mixture of vocab-
    ulary is essential in achieving high performance on downstream tasks in the
    new domain [7].

    Instead of constructing our model with a new vocabulary from scratch, which
would require substantial computational resources and training data, or simply
adapting the existing pre-trained model with its original vocabulary, which would
lead to sub-optimal performance on downstream tasks, we have adopted a simple
but effective approach that addresses both challenges explicitly. In particular, our
method includes only a reduced subset of words from the new domain's vocabulary,
selected in a rational way according to their actual frequency in our training data,
while reusing and adapting the original pre-trained model. This helps reduce the
required computation and training data while enhancing recognition performance.
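    A minimal sketch of this selection step is shown below, assuming the training
comments are available as a list of strings; the value of N, the whitespace-based
word splitting and the helper name select_new_tokens are illustrative choices, not
the paper's exact implementation:

```python
from collections import Counter

from transformers import AutoModelForSequenceClassification, AutoTokenizer


def select_new_tokens(train_texts, tokenizer, top_n=25):
    """Return the top_n most frequent training words missing from the vocabulary."""
    counts = Counter(w for text in train_texts for w in text.split())
    vocab = set(tokenizer.get_vocab())
    candidates = [w for w, _ in counts.most_common() if w not in vocab]
    return candidates[:top_n]


tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-cased", num_labels=2)

train_texts = ["..."]  # the NewsCom-TOX training comments would go here
new_tokens = select_new_tokens(train_texts, tokenizer, top_n=25)
tokenizer.add_tokens(new_tokens)
# The new embeddings are randomly initialised and learnt during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```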
6     Evaluation
6.1   Experimental setup
To prevent overfitting, all experiments have been carried out following a 5-fold
cross-validation scheme. No exhaustive exploration of the hyper-parameters of our
models was conducted. Models were trained for 100 epochs using the Adam optimizer
[11], with an initial learning rate of 0.001, a batch size of 32, and early stopping
after 5 epochs without improvement in the F1 classification score. For the calculation
of F1 we used the weighted version, which takes into account the number of examples
available for each class.
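    The evaluation loop can be sketched as follows with scikit-learn, where
train_and_predict stands for any of the models described in this paper (a hypothetical
helper, not part of our code):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def cross_validate(texts, labels, train_and_predict, n_splits=5, seed=42):
    """5-fold stratified CV reporting the weighted F1 used in this paper."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(texts, labels):
        preds = train_and_predict(texts[train_idx], labels[train_idx], texts[val_idx])
        scores.append(f1_score(labels[val_idx], preds, average="weighted"))
    return float(np.mean(scores)), float(np.std(scores))
```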

RNN specific setup With regards to the RNN model, both the Bi-LSTM and
attention layers had a 0.3 dropout rate. The encoder layers had a size L of
150 or 200. As a way to increase input variability from epoch to epoch and make the
model more robust, white noise was randomly added to the input embeddings with a
0.15 probability rate. Also, given that the classes were not perfectly balanced, to
prevent introducing bias into our models we applied class weights to the loss
function, penalizing more heavily the misclassification of under-represented classes.
These weights were computed as the inverse frequencies of the classes in the training
set.
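    The class-weighting scheme can be written, for instance, as follows (a sketch;
the exact normalisation of the inverse frequencies is not specified in this paper):

```python
import numpy as np
import torch


def inverse_frequency_weights(labels, num_classes=2):
    """Weight each class by the inverse of its frequency in the training set."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    weights = counts.sum() / (num_classes * counts)   # proportional to 1/frequency
    return torch.tensor(weights, dtype=torch.float32)


# Weighted cross-entropy penalises mistakes on the under-represented class more.
class_weights = inverse_frequency_weights(np.array([0, 0, 0, 1, 0, 1]))
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```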

BERT specific setup Our BERT model has been implemented and fine-tuned
for the toxic comment classification tasks using the Simple Transformers library
[18]. Although pre-trained tokenizers work at both word and subword levels, the
top N new tokens to be added to the vocabulary (i.e. those that happen to be
the most frequent in our training data) have been included as word-level units.
The remaining new tokens, those corresponding to less frequent words, simply get
split into smaller units, ensuring that there are no out-of-vocabulary tokens and
that all vocabulary units get updated reasonably frequently during training.
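    Fine-tuning with Simple Transformers reduces to a few lines; the sketch below
follows its documented ClassificationModel interface, with hyper-parameter values,
column names and the toy data frames given purely as illustrative assumptions:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hypothetical data frames with the columns Simple Transformers expects:
# "text" and "labels" (0 = not toxic, 1 = toxic).
train_df = pd.DataFrame({"text": ["comentario de ejemplo"], "labels": [0]})
eval_df = pd.DataFrame({"text": ["otro comentario"], "labels": [1]})

model = ClassificationModel(
    "bert",
    "dccuchile/bert-base-spanish-wwm-cased",   # cased BETO checkpoint
    num_labels=2,
    args={
        "num_train_epochs": 100,
        "train_batch_size": 32,
        "evaluate_during_training": True,
        "use_early_stopping": True,
        "early_stopping_patience": 5,
    },
)
model.train_model(train_df, eval_df=eval_df)
predictions, raw_outputs = model.predict(["Comentario de ejemplo"])
```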

6.2   RNN model results
We have evaluated two different RNN like models whose main difference is the
encoder size. Corresponding results are detailed in Table 1. Other values were
also tested though size 200 yielded best performance. Nonetheless, performance
has demonstrated to be significantly better for even the most simple version of
our BERT based model: the model fine-tuned from the cased version of BETO,
a BERT model trained on a big Spanish corpus that can be found in [4], without
explicitly adding any new domain-specific vocabulary.

6.3   BERT model results
After confirming the superiority of the BERT based approach, we compared
different pre-trained models’ performance after fine-tuning them on the first
  Table 1. RNN vs BERT 5-fold CV results for DETOXIS-Iberlef 2021 Subtask 1.

                   Model      Description     Weighted F1
                   RNN1       Encoder size 150 72.17 %
                   RNN2       Encoder size 200 72.50 %
                   Cased-BETO Frozen tokenizer 75.26 %


downstream subtask: identifying whether a comment is toxic or not, a binary
classification problem. As shown in Figure 2, the cased version of BETO clearly
outperforms the other two models: the uncased BETO version and the standard
BERT multilingual base model, a model pre-trained on the top 104 languages
with the largest Wikipedia using a masked language modeling (MLM) objective
[5].




Fig. 2. BERT models comparison: 5-fold CV results for DETOXIS-Iberlef 2021 Subtask
1.


    Then, besides comparing alternative pre-trained models, we also measured
the impact of our vocabulary extension method. The results of this analysis are
also presented in Figure 2. The results shown there correspond to independent
experiments in which a different number N of new tokens (i.e. the top N words
included as new words) is tested. Results are reported starting from 25 new
tokens and up to 100, increasing the amount by 25 on each run.
    As can be deduced from the figure, our vocabulary extension method proves to
be effective, achieving a top performance of 76.72% and an improvement over the
baseline performance obtained when the model is simply fine-tuned without
explicitly adding any new words (i.e. 75.26%, previously reported in Table 1).
However, results become worse when the amount of new tokens exceeds a certain
small limit, which suggests the importance of finding an adequate balance between
the increased complexity of the target model (i.e. the number of new unit
embeddings to be learnt) and the available training data.
    For the second subtask we have followed exactly the same procedure, obtaining
results similar to those shown above. In this subtask we are facing a multi-class
classification problem, where the goal is to identify the toxicity level of every
comment on a 0 to 3 scale (i.e. 0: not toxic; 1: mildly toxic; 2: toxic; 3: very
toxic). The evaluation results for the adopted experimental setup based on the
5-fold cross-validation scheme are presented in Figure 3. In this case, only the
results obtained for our top-performing approach based on the cased version of the
BETO pre-trained model are reported. Besides, for a proper comparison and analysis,
the result corresponding to the case where we simply fine-tune the model while
freezing the original tokenizer (i.e. without adding any new domain-specific
vocabulary) has also been included at the beginning of the series (i.e. the "0"
column). Again, we demonstrate that our vocabulary extension method consistently
outperforms the prior approach based on the general vocabulary. However, once again
we confirm that new vocabulary can only be added to some extent, because no further
improvement is observed beyond 100 new domain-specific words (the best performance
is again achieved by the 25 new tokens configuration).




Fig. 3. Analysis of the vocabulary extension method for the top-performing approach
based on the cased version of the BETO pre-trained model: 5-fold CV results for
DETOXIS-Iberlef 2021 Subtask 2.
Analyzing the effect of different data pre-processing methods Adopting
the top-performing approach based on the cased version of the BETO pre-trained
model as a reference, we decided to further explore the use of some of the most
popular text pre-processing techniques to find out whether they are actually
useful or not. The evaluated techniques include the following:
 – Removing stop words: stop words are removed with the help of the spaCy
   library [10].
 – Removing punctuation: punctuation marks are removed with the help of the
   spaCy library.
 – Lemmatization: spaCy lemmatization is applied to generate the root form of
   the words.
 – Basic text normalization: special words including emojis, emails, percentages,
   money, phone numbers, times, dates, URLs and/or hashtags are replaced by
   special tokens, such as MAIL, DATE, URL, etc., to prevent information from
   being lost during data representation (see the sketch after this list).
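
    A minimal normalization sketch is given below; only URL, e-mail and user-mention
patterns are shown, and the regular expressions and token names are illustrative
assumptions rather than the exact rules used in this work:

```python
import re

# Each pattern maps a class of special words onto a single placeholder token.
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), " URL "),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), " MAIL "),
    (re.compile(r"@\w+"), " USER "),
]


def normalize(text: str) -> str:
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Escribe a alguien@mail.com o visita https://ejemplo.es"))
# -> "Escribe a MAIL o visita URL"
```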


Table 2. Results obtained for our top-performing approach based on the cased ver-
sion of the BETO pre-trained model when enriched with 25 new tokens and applying
a specific text preprocessing technique: 5-fold CV results for DETOXIS-Iberlef 2021
Subtask 1.

                      Technique                  Weighted F1
                      without basic preprocessing 75.91 %
                      lemmatization               74.11 %
                      removing stopwords          74.03 %
                      removing punctuation        73.82 %


    The obtained results are summarised and sorted by performance in Table 2. As
can be observed, none of the applied techniques aimed at removing tokens was found
to be effective. In short, DL methods that use embedding representations do not
seem to require the removal of anything. Specifically, in such an n-dimensional
vector space, words like "dogs" and "dog" are already close to each other, so
lemmatization becomes unnecessary.
    With regards to stop words, although it could be convenient to remove many
of them, we should notice that stop-word lists may contain words which should not
be removed in certain domains or tasks, as happens to be the case here. Generally
speaking, we should not remove anything (e.g., a word or a punctuation mark) that
could be useful in some way. Again, as with lemmatization, DL models working with
vector embeddings are currently the best option to handle and filter such
irrelevant terms.
    Finally, it is interesting to note that basic text normalization has proven to
be successful (i.e. performance decreases if we omit it). In this case, the process
of transforming some words into a single canonical form still helps our model by
reducing the number of unique words (i.e. reducing the vocabulary size helps reduce
the model complexity and improve its performance).
6.4   DETOXIS-Iberlef 2021 Challenge results

After carefully analyzing the results presented above, we decided to submit runs
for both DETOXIS-IberLEF 2021 challenge subtasks using our top-performing approach
based on the cased version of the BETO pre-trained model, gradually increasing the
amount of new tokens in steps of 25 from 0 to 100, thus resulting in 5 different
runs and submissions for each task. Details about the challenge and its evaluation
are presented in [21].


            Table 3. DETOXIS-Iberlef 2021 SUBTASK 1 Top 5 ranking

                         Ranking Team Name      F1 Toxic
                                 Gold standard  1.000
                         1       SINAI          0.6461
                         2       GuillemGSubies 0.600
                         3       AI-UPV         0.5996
                         4       DCG            0.5734
                         5       GTH-UPM        0.5726



    In both tasks, our team appears as GTH-UPM. These rankings are available on
the DETOXIS-IberLEF 2021 official website [1]. Our result in Table 3 corresponds to
the run with 100 new tokens. In this case, a relatively large deviation can be
observed when comparing the official result from the challenge with our previous
best result. In this regard, it is worth mentioning that, in addition to the inherent
difficulty of the task itself, our model was not optimized for the individual F1
measure over the toxic class but for the F1 weighted over the two classes: toxic and
non-toxic. This result is considered satisfactory since it has been obtained by means
of the proposed vocabulary extension method.


            Table 4. DETOXIS-Iberlef 2021 SUBTASK 2 Top 5 ranking

            Ranking Team Name      CEM RPB Pearson Accuracy
                    Gold standard 1.000 0.8213 1.0000 1.0000
            1       SINAI          0.7495 0.2612 0.4957 0.7654
            2       Team Sabari    0.7428 0.2670 0.5014 0.7464
            3       DCG            0.7300 0.3925 0.4544 0.7329
            4       GTH-UPM        0.7256 0.1545 0.4298 0.7318
            5       GuillemGSubies 0.7189 0.2449 0.4451 0.6835



    Moving on to the second subtask, our result in Table 4 corresponds to the run
with 75 new tokens, again demonstrating the convenience of the proposed extension
method. In this case, in spite of the mismatch between the optimization and
evaluation metrics (i.e. the subtask is evaluated with the CEM metric while our model
was specifically trained targeting weighted F1), our model achieved a significantly
better result than the best result we had previously obtained.

7   Conclusions
In this paper, we have presented an exploratory analysis in which different deep
learning (DL) models for the detection of toxic expressions have been evaluated on
the DETOXIS-IberLEF 2021 challenge using the official release of the NewsCom-TOX
corpus. In particular, we have compared traditional RNN and state-of-the-art
transformer models, including standard BERT [5] and its BETO variant [4]. Our
experiments have confirmed that optimum performance can be obtained from transformer
models. Specifically, better performance has been achieved by simply fine-tuning the
BETO model for the toxicity detection tasks.
    As another important contribution of this work, we have proposed and validated
a simple but effective method for extending pre-trained models with domain-specific
vocabulary. The method uses term frequencies to rank and select specific words to be
added at the word level. As a result, the performance of the extended model can be
significantly improved. This approach could be particularly attractive for ad-hoc,
special-purpose or very specific domains with unique vocabularies where limited
training data is available. Nonetheless, additional work still needs to be done with
regards to automatically finding the optimal amount of new tokens to be added (i.e.
the precise value that achieves a good balance between model complexity and available
training data).
    Furthermore, unlike traditional approaches that use neither DL nor embedding
representations, when using transformer-like models and testing different text
pre-processing methods, we have observed that preserving the raw structure of the
texts, by not removing anything and simply performing a very basic text
normalisation, helps achieve optimal performance.

Funding The work leading to these results has been supported by the Span-
ish Ministry of Economy, Industry and Competitiveness through the CAVIAR
(MINECO, TEC2017-84593-C2-1-R) and AMIC (MINECO, TIN2017-85854-C4-
4-R) projects (AEI/FEDER, UE). Ricardo Kleinlein’s research was supported
by the Spanish Ministry of Education (FPI grant PRE2018-083225).

Acknowledgments We gratefully acknowledge the support of NVIDIA Cor-
poration with the donation of the Titan X Pascal GPU used for part of this
research.

References
 1. DETOXIS-IberLEF 2021 results (2021), https://detoxisiberlef.wixsite.com/website/evaluation-results,
    [Online; accessed 21-June-2021]
 2. Baziotis, C., Nikolaos, A., Chronopoulou, A., Kolovou, A., Paraskevopoulos, G.,
    Ellinas, N., Narayanan, S., Potamianos, A.: NTUA-SLP at SemEval-2018 task 1:
    Predicting affective content in tweets with deep attentive RNNs and transfer
    learning. Proceedings of The 12th International Workshop on Semantic Evaluation
    (2018). https://doi.org/10.18653/v1/S18-1037
 3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors
    with subword information. Transactions of the Association for Computational
    Linguistics 5, 135–146 (2017). https://doi.org/10.1162/tacl_a_00051,
    https://www.aclweb.org/anthology/Q17-1010
 4. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-
    trained bert model and evaluation data. In: PML4DC at ICLR 2020 (2020)
 5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of
    deep bidirectional transformers for language understanding. In: Proceedings of
    the 2019 Conference of the North American Chapter of the Association for
    Computational Linguistics: Human Language Technologies, Volume 1 (Long
    and Short Papers). pp. 4171–4186. Association for Computational Linguis-
    tics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423,
    https://www.aclweb.org/anthology/N19-1423
 6. Fernández-Martínez, F., Griol, D., Callejas, Z., Luna-Jiménez, C.: An approach
    to intent detection and classification based on attentive recurrent neural
    networks. In: Proc. IberSPEECH 2021. pp. 46–50 (2021).
    https://doi.org/10.21437/IberSPEECH.2021-10
 7. Garneau, N., Leboeuf, J., Lamontagne, L.: Predicting and interpreting embeddings
    for out of vocabulary words in downstream tasks. CoRR abs/1903.00724 (2019),
    http://arxiv.org/abs/1903.00724
 8. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning
    word vectors for 157 languages. In: Proceedings of the Eleventh Interna-
    tional Conference on Language Resources and Evaluation (LREC 2018). Euro-
    pean Language Resources Association (ELRA), Miyazaki, Japan (May 2018),
    https://www.aclweb.org/anthology/L18-1550
 9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Com-
    put. 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735,
    https://doi.org/10.1162/neco.1997.9.8.1735
10. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: Industrial-strength
    Natural Language Processing in Python (2020).
    https://doi.org/10.5281/zenodo.1212303
11. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio,
    Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations,
    ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings
    (2015), http://arxiv.org/abs/1412.6980
12. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
    Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized BERT pretraining
    approach. CoRR abs/1907.11692 (2019), http://arxiv.org/abs/1907.11692
13. Maslej-Krešňáková, V., Sarnovský, M., Butka, P., Machová, K.: Comparison of
    deep learning models and various text pre-processing techniques for the toxic
    comments classification. Applied Sciences 10(23) (2020).
    https://doi.org/10.3390/app10238631, https://www.mdpi.com/2076-3417/10/23/8631
14. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word rep-
    resentations in vector space (2013), http://arxiv.org/abs/1301.3781
15. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
    Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch. In: NIPS
    2017 Workshop on Autodiff (2017), https://openreview.net/forum?id=BJJsrmfCZ
16. Pavlopoulos, J., Malakasiotis, P., Androutsopoulos, I.: Deep learning for
    user comment moderation. In: Proceedings of the First Workshop on Abu-
    sive Language Online. pp. 25–35. Association for Computational Linguistics,
    Vancouver, BC, Canada (Aug 2017). https://doi.org/10.18653/v1/W17-3004,
    https://www.aclweb.org/anthology/W17-3004
17. Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., Huang, X.: Pre-trained models for
    natural language processing: A survey. Science in China E: Technological Sciences
    63(10), 1872–1897 (Oct 2020). https://doi.org/10.1007/s11431-020-1647-3
18. Rajapakse, T.C.: Simple transformers. https://github.com/ThilinaRajapakse/simpletransformers
    (2019)
19. Schmidt, A., Wiegand, M.: A survey on hate speech detection using natural lan-
    guage processing. In: Proceedings of the Fifth International Workshop on Natural
    Language Processing for Social Media. pp. 1–10. Association for Computational
    Linguistics, Valencia, Spain (Apr 2017). https://doi.org/10.18653/v1/W17-1101,
    https://www.aclweb.org/anthology/W17-1101
20. Tai, W., Kung, H.T., Dong, X., Comiter, M., Kuo, C.F.: exBERT: Extend-
    ing pre-trained models with domain-specific vocabulary under constrained train-
    ing resources. In: Findings of the Association for Computational Linguis-
    tics: EMNLP 2020. pp. 1433–1439. Association for Computational Linguis-
    tics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.findings-emnlp.129,
    https://www.aclweb.org/anthology/2020.findings-emnlp.129
21. Taulé, M., Ariza, A., Nofre, M., Amigó, E., Rosso, P.: Overview of the DETOXIS
    task at IberLEF-2021: Detection of toxicity in comments in Spanish. Procesamiento
    del Lenguaje Natural 67 (2021)
22. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
    Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V.,
    Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances
    in Neural Information Processing Systems. vol. 30. Curran Associates, Inc. (2017),
    https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf