      SINAI at IberLEF-2021 DETOXIS task:
     Exploring Features as Tasks in a Multi-task
       Learning Approach to Detecting Toxic
                     Comments

      Flor Miriam Plaza-del-Arco, M. Dolores Molina-González, L. Alfonso
                 Ureña-López, and M. Teresa Martı́n-Valdivia

    Department of Computer Science, Advanced Studies Center in ICT (CEATIC)
          Universidad de Jaén, Campus Las Lagunillas, 23071, Jaén, Spain
                {fmplaza, mdmolina, laurena, maite}@ujaen.es



        Abstract. This paper describes the participation of the SINAI research
        group in the DETOXIS (DEtection of TOxicity in comments In Spanish)
        shared task at IberLEF 2021. The proposed system follows a Multi-task
        Learning approach in which multiple tasks related to toxic comment
        identification are learned in parallel using a shared representation.
        Specifically, we use the dataset features provided by the organizers as
        tasks, in combination with polarity classification, emotion classification
        and offensive language detection tasks, to explore whether they help in
        the identification of toxic comments. Our proposal ranked first in both
        DETOXIS subtasks, toxicity detection and toxicity level detection.

        Keywords: Multi-Task Learning · BERT · Toxic Features · Sentiment
        Analysis.




1     Introduction
Toxic comment classification is a field of research that has attracted increas-
ing interest in the Natural Language Processing (NLP) community in recent
years. In the DETOXIS task, the organizers defined a toxic comment as “a comment
that denigrates, hates or vilifies, attacks, threatens, insults, offends or disqualifies
a person or group of people based on characteristics such as race, ethnicity,
nationality, political ideology, religion, gender and sexual orientation, among
others”. Therefore, toxicity will be used as an umbrella term covering different
definitions used in the literature to describe hate speech [5, 4], abusive [20],
aggressive [14], and offensive language [28]. In fact, these different terms address
different aspects of toxic language [24].
    IberLEF 2021, September 2021, Málaga, Spain.
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
    Detecting online toxicity may be difficult, as it can be expressed in different
ways: explicitly (through insult, mockery or inappropriate humor) or implicitly
(through sarcasm). Another aspect to take into account is the presence of
different levels of intensity in toxicity (from rude and offensive comments to
more aggressive ones, the latter being those that incite hatred or even physical
violence).
    In this paper, we present the systems we developed as part of our participa-
tion in both subtasks of the DETOXIS (DEtection of TOxicity in comments In
Spanish) shared task [26] at IberLEF 2021 [19]. The aim of DETOXIS is the
detection of toxicity in comments posted in Spanish in response to different on-
line news articles related to immigration. The DETOXIS task is divided into
two related classification subtasks: (1) toxicity detection and (2) toxicity level
detection. The first subtask consists of detecting whether or not a comment is
toxic while the second one aims to categorize the comment according to four
levels of toxicity (0: not toxic, 1: mildly toxic, 2: toxic, and 3: very toxic).
    The rest of the paper is structured as follows. In Section 2 we explain the
data used in our experiments. In Section 3, we describe our proposal to ad-
dress the task. In Sections 4 and 5, we present the experimental setup and results,
respectively. Finally, the conclusion is presented in Section 6.


2   Corpora

To run our experiments, we used the Spanish dataset provided by the organizers
of the DETOXIS task at IberLEF 2021. The DETOXIS dataset was collected
from the NewsCom-TOX dataset. This dataset consists of approximately 4,357
comments posted in response to different articles extracted from Spanish on-
line newspapers (ABC, elDiario.es, El Mundo, NIUS, etc.) and discussion forums
(such as Menéame) from August 2017 to July 2020. These articles were manu-
ally selected taking into account their controversial subject matter, their poten-
tial toxicity and the number of comments posted (minimum 50 comments). A
keyword-based approach was used to search for articles primarily related to im-
migration. Comments were selected in the same order in which they appear in the
web timeline. The author (anonymous), date and time the comment was posted
are also retrieved. The number of comments ranged from 65 to 359 comments
per article. On average, approximately 30% of the comments are toxic. Each
comment was annotated with one of two categories, “toxic” or “non-toxic”, and
those annotated as “toxic” were subsequently assigned a level of toxicity, yielding
a four-level scale (not toxic, mildly toxic, toxic, and very toxic). In addition, the following char-
acteristics were also annotated: argumentation, constructiveness, stance, target,
stereotype, sarcasm, mockery, insult, improper language, aggressiveness and in-
tolerance. All of these characteristics (or categories) have a binary classification,
except for the level of toxicity. Each comment was annotated by three annota-
tors and, once all comments for each item were annotated, an inter-annotator
agreement test was performed.
    In addition, we used in our experiments other corpora corresponding to tasks
that could be related to the detection of toxicity on social media, including polarity
classification (InterTASS), emotion classification (EmoEvent and Universal Joy),
HS identification (HatEval and HaterNet), and aggressiveness detection (MEX-
A3T). The datasets are described below:

 – International TASS Corpus (InterTASS) was released in 2017 [17] with
   Spanish tweets and updated in 2018 with texts written in three different
   variants of Spanish from Spain, Costa Rica and Peru [16]. In 2019, InterTASS
   was enlarged with new texts written in two new Spanish variants: Uruguayan
   and Mexican [10] and finally, it was completed with Chilean-Spanish Tweets
   in 2020 [13]. The corpus released in 2019 is the one used in this paper. At
   least three annotators annotated each tweet with its level of polarity, which
   could be labeled as positive, negative, neutral and none.
 – EmoEvent [23] is a multilingual emotion dataset based on events that took
   place in April 2019. It focuses on tweets in the areas of entertainment, catas-
   trophes, politics, global commemoration and global strikes. For the creation
   of the corpus, the authors collected Spanish and English tweets from the
   Twitter platform. Then, each tweet was labeled with one of seven emotions,
   six Ekman’s basic emotions plus the “neutral or other emotions” label. Fo-
   cusing on the Spanish language, a total of 8,409 tweets were labeled by three
   Amazon Mechanical Turkers.
 – Universal Joy [15] is a dataset of over 530k anonymized public Face-
   book posts across 18 languages. It was collected in October 2014 by searching
   for public Facebook posts with a Facebook “feelings tag”, and labeled with
   five different emotions: anger, anticipation, fear, joy, and sadness. There is a
   wide variety in the amount of data per language, ranging from 284,265 posts
   for English, the most frequent language, to 869 posts for Bengali. We used
   the 31,326 Spanish posts.
 – HatEval was provided by the organizers of SemEval 2019 Task 5 [5]. The task
   consisted of detecting hateful content in Twitter posts against two targets:
   women and immigrants. For the creation of the corpus, the data were collected
   over different time frames. The majority of tweets against women were derived
   from an earlier collection made in the context of two earlier challenges on
   misogynistic speech identification, whose collection phase began in July 2017
   and ended in November 2017 [12, 11]. The remaining tweets were collected
   from July to September 2018. The dataset contains tweets composed of an
   identifier, the text of the tweet and the HS label, which is 0 if the text is not
   hateful and 1 if it contains hate speech against women or immigrants.
 – HaterNet [22] was built for the intelligent system of the same name, used
   by the National Office against Hate Crimes of the Spanish Secretary of State
   for Security. For the creation of this corpus, over 2 million tweets originating
   in Spain on random dates between February 2017 and December 2017 were
   collected. Subsequently, the tweets were filtered using six HS dictionaries and
   one dictionary containing generic insults. After this, due to time restrictions,
   only 6,000 tweets were selected to be manually labeled by four experts with
   different backgrounds; in case of a tie, a fifth person cast the deciding vote.
   Finally, out of the 6,000 tweets, 1,567 were labeled as hateful and 4,433 as
   non-hateful.
 – MEX-A3T [3] was provided by the organizers of IberEval 2018: Authorship
   and aggressiveness analysis in Mexican Spanish tweets [1]. They built a corpus
   of tweets from Mexican accounts, collected from August to November 2017, to
   detect aggressiveness. In order to extract the tweets, they selected a set of seed
   terms: words in the Dictionary of Mexicanisms marked as non-colloquial and
   classified as vulgar, together with hashtags related to sexism, homophobia,
   politics and discrimination. They used Mexico City as the center and extracted
   all tweets that were within a radius of 500 km. Finally, two people labeled
   the collected tweets. The dataset contains tweets composed of an identifier,
   the text of the tweet, and the aggressiveness label, which is 0 if the tweet is
   non-aggressive and 1 if it is aggressive.


3   System overview

In this section, we describe the systems developed for the DEtection of TOxicity
shared task in Spanish comments at IberLEF 2021.
     We propose a Multi-Task Learning (MTL) system using the well-known
Transformer-based model BERT, which has proven to be very successful in
many natural language processing tasks.
     In the MTL scenario, the goal is to learn multiple tasks simultaneously in-
stead of learning them separately in order to improve performance on each task
[7]. These tasks are usually related, although they may have different data or fea-
tures. By sharing representations across related tasks, we can allow our model to
better generalize to our original task. In this study, we used tasks related to the
toxic comment detection task. These tasks include hate speech detection, of-
fensive language identification, polarity classification, and emotion classification,
all sharing the same type of source: social media platforms (Twitter and Facebook).
Moreover, we consider each of the features provided in the DETOXIS dataset
(constructiveness, argumentation, mockery, sarcasm, positive stance, negative
stance, target person, target group, stereotype, insult, improper language, ag-
gressiveness, intolerance) as a specific task to train our system. From now on, we
will refer to all these specific tasks (the ones provided in the DETOXIS dataset)
as tasks related to toxic comment features.
     To develop the MTL system, we follow the most widely used MTL technique
in neural networks, introduced by [7]: the hard parameter sharing approach. It
consists of a single encoder that is shared and updated across all tasks, while
keeping a few task-specific layers that specialize in each task [25].
     The general architecture of the MTL-BERT model is shown in Figure 1. The
shared layers are based on BERT [9]. Following Devlin et al., 2018, in the first
step all the inputs are converted to WordPieces [27], and two special tokens are
added at the start ([CLS]) and end ([SEP]) of the input sequence, respectively.
In the shared layers, the BERT model first converts the input sequence to a
sequence of embedding vectors. This semantic representation is shared across all
tasks. Then, on top of the shared BERT layers, a task-specific output head is
created for each task and attached to the common sentence encoder. Finally, the
layers are fine-tuned according to the given set of downstream tasks.



    Fig. 1. Proposed MTL system for the DETOXIS task: task-specific linear heads (Task
    1, Task 2, ..., Task N) on top of a shared encoder operating over the WordPiece
    embeddings of the input sequence ([CLS] Token1 ... TokenN [SEP]).

4      Experimental setup

4.1         Dataset preprocessing

We perform social media specific data cleaning on the corpora related to Twitter
and Facebook (InterTASS, EmoEvent, Universal Joy, HatEval, HaterNet, MEX-
A3T) before feeding the texts to the models. The following steps to prepare the
text for the deep learning experiments were carried out using the ekphrasis
module [6] (a sketch of this configuration is shown after the list):

 – URLs, emails, users’ mentions, percentages, monetary amounts, time and
   date expressions, and phone numbers are normalized.
 – Hashtags are unpacked and split to their constituent words.
 – Elongated words and repeated characters in words are annotated and re-
   duced.
 – Emojis are converted to their aliases.
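    A minimal sketch of this preprocessing is shown below. It assumes the
TextPreProcessor API of the ekphrasis module and, as an additional assumption
not listed above, the emoji package for converting emojis to their aliases; the
exact option set is illustrative.

    import emoji
    from ekphrasis.classes.preprocessor import TextPreProcessor
    from ekphrasis.classes.tokenizer import SocialTokenizer

    text_processor = TextPreProcessor(
        # Normalize URLs, emails, user mentions, percentages, monetary amounts,
        # time and date expressions, and phone numbers.
        normalize=["url", "email", "user", "percent", "money",
                   "time", "date", "phone"],
        # Annotate elongated words and repeated characters.
        annotate={"elongated", "repeated"},
        unpack_hashtags=True,          # split hashtags into constituent words
        segmenter="twitter",
        corrector="twitter",
        spell_correct_elong=True,      # reduce elongated words
        tokenizer=SocialTokenizer(lowercase=True).tokenize,
    )

    def clean_social_media(text):
        text = emoji.demojize(text)    # emojis -> textual aliases
        return " ".join(text_processor.pre_process_doc(text))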

    As the DETOXIS dataset provided by the organizers includes responses to
articles extracted from Spanish online newspapers, we performed a different data
cleaning procedure consisting of the following steps (a sketch follows the list):
 – Remove URLs, hashtags and users’ mentions.
 – Reduce words with more than 4 repeated characters to 3 repetitions.
 – Remove multiple spaces.
 – Remove texts with only numbers.
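    A minimal sketch of these cleaning steps using standard Python regular
expressions is shown below; the exact patterns are our own illustration.

    import re

    def clean_detoxis(text):
        text = re.sub(r"https?://\S+|www\.\S+", "", text)   # remove URLs
        text = re.sub(r"[#@]\w+", "", text)                  # remove hashtags and mentions
        text = re.sub(r"(.)\1{4,}", r"\1\1\1", text)         # >4 repeated chars -> 3
        text = re.sub(r"\s{2,}", " ", text).strip()          # collapse multiple spaces
        # Discard texts that contain only numbers (and separators).
        return "" if re.fullmatch(r"[\d\s.,]*", text) else text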


4.2   System settings

All the models were implemented using PyTorch, a high-performance deep learn-
ing library [21] based on the Torch library. The experiments were run on a single
Tesla-V100 32 GB GPU with 192 GB of RAM.
    During the evaluation phase, we train the model on the training set provided
by the organizers, then we evaluate it on the test set.
    Regarding our participation, we submitted five runs using the proposed MTL-
based system. The details of the modules and the differences between the five
settings are described below (a sketch of the corresponding task configurations
follows the list).

 – Run 1. In order to establish a baseline in our study and compare the results
   with the MTL scenario, the first run corresponds to our baseline, a single-
   task learning approach which involves only the DETOXIS dataset. For this
   setting, we use the well-known Transformer BERT.
 – Run 2. In this setting, our goal is to train the MTL system on the tasks
   related to the identification of toxic comments: specifically, HS identification,
   offensive language detection and the toxic comment features (constructiveness,
   argumentation, mockery...) explained in Section 3. Our assumption is that all
   these tasks are related to inappropriate behavior on the web, therefore the
   knowledge shared during training among these tasks may benefit the task of
   toxic comment identification even if the texts correspond to different language
   registers from social media and newspapers.
 – Run 3. This configuration includes run 2 but with the addition of a new task:
   polarity classification. Our goal is to leverage the sentiment expressed in
   the posts to aid in the classification of toxic comments. Our assumption is
   that toxicity is associated with a negative polarity, so the shared knowledge
   can help to detect toxic comments more easily. For the polarity classification
   task, we use the InterTASS dataset.
 – Run 4. This configuration includes run 2 but with the addition of a new
   task: emotion classification. In this setting, our goal is to leverage the identi-
   fication of emotion categories to aid in the classification of toxic comments.
   Our assumption is that negative emotions such as anger, fear, sadness and
   disgust could be related to toxicity while positive emotions are not. For the
   emotion analysis task, we use the EmoEvent and Universal Joy datasets.
 – Run 5. In this setup, we have added both the polarity and emotion classifica-
   tion tasks to run 2. Therefore, in this setting the MTL system is trained on
   all the different tasks explained above. We expect that the combination of all
   the tasks helps to identify toxic comments.
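    The task combinations behind each run can be summarized as follows (a
sketch; the task names are shorthand for the datasets and DETOXIS feature
columns described in Sections 2 and 3).

    # DETOXIS feature columns treated as auxiliary binary tasks (Section 3).
    FEATURE_TASKS = [
        "constructiveness", "argumentation", "mockery", "sarcasm",
        "positive_stance", "negative_stance", "target_person", "target_group",
        "stereotype", "insult", "improper_language", "aggressiveness", "intolerance",
    ]

    RUNS = {
        1: ["toxicity"],                                              # single-task baseline (BETO)
        2: ["toxicity", "hate_speech", "offensive"] + FEATURE_TASKS,  # toxicity-related tasks
        3: ["toxicity", "hate_speech", "offensive", "polarity"] + FEATURE_TASKS,
        4: ["toxicity", "hate_speech", "offensive", "emotion"] + FEATURE_TASKS,
        5: ["toxicity", "hate_speech", "offensive", "polarity", "emotion"] + FEATURE_TASKS,
    }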
    In all the runs, the DETOXIS training set has also been used to train the
MTL system, and then we have evaluated the shared task using the DETOXIS
test set.
    Since the DETOXIS dataset is composed of Spanish texts, while training the
MTL system we use the BETO model [8], a BERT model pre-trained on Spanish.
We employ the following hyperparameters in the five runs: a learning rate of
2e-05, a batch size of 16, a dropout probability of 0.01, the AdamW optimization
algorithm, and a maximum of 3 epochs.
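    A minimal sketch of this fine-tuning setup is given below, reusing the
MultiTaskBert module and RUNS mapping sketched earlier; iterate_task_batches
and dataloaders are hypothetical helpers that yield (task, batch) pairs from the
per-task training sets.

    import torch
    from torch.optim import AdamW

    # Labels per task: binary for the toxicity-related feature tasks,
    # four polarity classes and seven emotion classes (illustrative values).
    task_num_labels = {task: 2 for task in RUNS[5]}
    task_num_labels.update({"polarity": 4, "emotion": 7})

    model = MultiTaskBert("dccuchile/bert-base-spanish-wwm-uncased",
                          task_num_labels, dropout=0.01)
    optimizer = AdamW(model.parameters(), lr=2e-5)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):                                  # maximum of 3 epochs
        for task, batch in iterate_task_batches(dataloaders, batch_size=16):
            logits = model(batch["input_ids"], batch["attention_mask"], task)
            loss = loss_fn(logits, batch["labels"])
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()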



5   Results

In this section we present the results obtained by the different runs we have ex-
plored in both subtasks of the competition. In order to evaluate them we use the
official competition metrics for subtask 1 (F-measure) and subtask 2 (Closeness
Evaluation Metric (CEM) [2]). In addition, for the level detection subtask, the
organizers have provided evaluation results with Rank Biased Precision (RBP)
[18], the Pearson coefficient, and Accuracy (Acc).
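    For reference, the standard metrics among these can be computed with
scikit-learn and SciPy as sketched below, assuming toxic comments are labeled 1;
CEM and RBP require the task-specific implementations cited above and are not
shown.

    from scipy.stats import pearsonr
    from sklearn.metrics import accuracy_score, f1_score

    def evaluate_subtask1(y_true, y_pred):
        # Subtask 1: binary toxicity detection, F-measure of the toxic class.
        return f1_score(y_true, y_pred)

    def evaluate_subtask2(y_true, y_pred):
        # Subtask 2: toxicity level detection (ordinal labels 0-3).
        return {"accuracy": accuracy_score(y_true, y_pred),
                "pearson": pearsonr(y_true, y_pred)[0]}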
    We evaluated our five runs on subtasks 1 and 2 of the DETOXIS shared
task. The results obtained are shown in Tables 1 and 2, respectively. As can be
seen, in subtask 1 the different settings of the MTL system outperformed
our baseline BETO (Run 1). It should be noted that the best setting in both
subtasks is Run 5, in which all tasks related to toxicity detection in comments
are combined. Specifically, in subtask 1, run 5 outperforms our baseline BETO
by a substantial margin (3.77%). For subtask 1 it can also be observed that
Run 3 achieves remarkable results and Run 4 surpasses the baseline BETO;
therefore we can confirm our hypothesis that sentiment analysis helps the task
of detecting toxicity in comments. Regarding subtask 2, runs 4 and 5 surpass
the baseline BETO in terms of CEM score, which means that emotion analysis
along with the combination of HS identification, offensive language detection and
the tasks related to toxic comment features (argumentation, constructiveness,
sarcasm, mockery...) could benefit the detection of toxic comments.
    It should be remarked that although the datasets of some of the tasks ex-
plored (sentiment analysis, hate speech detection and offensive language identi-
fication) include posts from social media, the MTL system seems to be able to
transfer the knowledge to the different language register employed in the comments
posted in response to articles extracted from online newspapers (the DETOXIS
dataset).
    Finally, our results in the competition for both subtasks (Table 3 and Table 4)
show the success of our proposed model, which achieved first place in the ranking
in both subtasks. The representations computed by the encoder embed affective
knowledge and knowledge related to toxicity detection tasks (offensive language,
constructiveness, sarcasm, among others), allowing the MTL model to identify
toxic comments more accurately.
Table 1. Results in subtask 1 on the test set of DETOXIS shared task. Best result is
marked in bold.

                                   Run        F-measure
                                   1          0.6084
                                   2          0.6172
                                   3          0.6406
                                   4          0.6125
                                   5          0.6461

Table 2. Results in subtask 2 on the test set of DETOXIS shared task. Best result is
marked in bold.

                     Run      CEM          RBP       Pearson Acc
                     1        0.7421       0.2722    0.5065     0.7419
                     2        0.7344       0.3425    0.4638     0.7441
                     3        0.7389       0.2499    0.4580     0.7396
                     4        0.7425       0.2361    0.4892     0.7553
                     5        0.7495       0.2612    0.4957     0.7654

    Table 3. Ranking of participants’ systems in subtask 1 of DETOXIS shared task.

                         Ranking       Team                F-measure
                         1             SINAI (run 5)          0.6461
                         2             GuillemGSubies         0.6000
                         3             AI-UPV                 0.5996
                         12            ToxicityAnalizers      0.4562
                         -             BOWClassifier          0.1837
                         31            JOREST                 0.0246

    Table 4. Ranking of participants’ systems in subtask 2 of DETOXIS shared task.

          Ranking Team                      CEM        RBP       Pearson Acc
          1         SINAI (run 5)           0.7495     0.2612    0.4957   0.7654
          2         Team Sabari             0.7428     0.2670    0.5014   0.7464
          3         DCG                     0.7300     0.3925    0.4544   0.7329
          11        ToxicityAnalizers       0.6332     0.0709    0.1805   0.6139
          -         BOWClassifier           0.6318     0.1657    0.1688   0.7329
          24        JosepCarles LNR         0.5376     0.0705    0.0072   0.4949



6      Conclusion

This paper presents the participation of the SINAI research group in the DE-
tection of TOxicity in comments In Spanish shared task at IberLEF 2021. Our
proposal explores how knowledge transferred from tasks related to the identi-
fication of toxic language (polarity classification, emotion classification, hate
speech detection, offensive language detection, constructiveness, argumentation,
sarcasm, mockery, etc.) may help in a text classification task like DETOXIS.
The experiments conducted show the efficacy of our proposed approach, which
achieves convincing performance in both subtasks. Further exploration of how
and which of the features used in our MTL approach (constructiveness, argu-
mentation, sarcasm, mockery, etc.) help in the identification of toxic comments
is left as future work, and we welcome the community to contribute.


Acknowledgement

This work has been partially supported by a grant from the European Regional
Development Fund (FEDER), the LIVING-LANG project [RTI2018-094653-B-C21],
and the Ministry of Science, Innovation and Universities (scholarship [FPI-PRE2019-
089310]) of the Spanish Government.


References

 1. Álvarez-Carmona, M.Á., Guzmán-Falcón, E., Montes-y-Gómez, M., Jair-Escalante,
    H., Villaseñor-Pineda, L., Reyes-Meza, V., Rico-Sulayes, A.: Overview of MEX-
    A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish
    tweets. In: Proceedings of the Third Workshop on Evaluation of Human Language
    Technologies for Iberian Languages (IberEval 2018) co-located with 34th Con-
    ference of the Spanish Society for Natural Language Processing (SEPLN 2018),
    Sevilla, Spain, September 18th, 2018. CEUR Workshop Proceedings, vol. 2150,
    pp. 74–96. CEUR-WS.org (2018)
 2. Amigó, E., Gonzalo, J., Mizzaro, S., Carrillo-de Albornoz, J.: An effectiveness
    metric for ordinal classification: Formal properties and experimental results. arXiv
    preprint arXiv:2006.01245 (2020)
 3. Aragón, M.E., Jarquı́n-Vásquez, H.J., Montes-y-Gómez, M., Escalante, H.J., Vil-
    laseñor-Pineda, L., Gómez-Adorno, H., Posadas-Durán, J.P., Bel-Enguix, G.:
    Overview of MEX-A3T at IberLEF 2020: Fake News and Aggressiveness Analysis
    in Mexican Spanish. In: Proceedings of the Iberian Languages Evaluation Forum
    (IberLEF 2020) co-located with 36th Conference of the Spanish Society for Nat-
    ural Language Processing (SEPLN 2020), Málaga, Spain, September 23th, 2020.
    CEUR Workshop Proceedings, vol. 2664, pp. 222–235. CEUR-WS.org (2020)
 4. Plaza-del Arco, F.M., Molina-González, M.D., Ureña-López, L.A., Martı́n-
    Valdivia, M.T.: Comparing pre-trained language models for Spanish hate speech
    detection. Expert Systems with Applications 166, 114120 (2021)
 5. Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F.M.,
    Rosso, P., Sanguinetti, M.: SemEval-2019 task 5: Multilingual detection of
    hate speech against immigrants and women in Twitter. In: Proceedings of
    the 13th International Workshop on Semantic Evaluation. pp. 54–63. Associa-
    tion for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019).
    https://doi.org/10.18653/v1/S19-2007
 6. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: Deep
    LSTM with attention for message-level and topic-based sentiment analysis. In: Pro-
    ceedings of the 11th International Workshop on Semantic Evaluation (SemEval-
    2017). pp. 747–754. Association for Computational Linguistics, Vancouver, Canada
    (August 2017)
 7. Caruana, R.: Multitask learning. Machine learning 28(1), 41–75 (1997)
 8. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-
    trained bert model and evaluation data. In: PML4DC at ICLR 2020 (2020)
 9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidi-
    rectional transformers for language understanding (2018)
10. Dı́az-Galiano, M.C., Garcı́a-Vega, M., Casasola, E., Chiruzzo, L., Garcı́a-
    Cumbreras, M.Á., Martı́nez-Cámara, E., Moctezuma, D., Montejo-Ráez, A.,
    Sobrevilla-Cabezudo, M.A., Sadit-Tellez, E., et al.: Overview of TASS 2019: One
    More Further for the Global Spanish Sentiment Analysis Corpus. In: IberLEF@
    SEPLN. pp. 550–560 (2019)
11. Fersini, E., Nozza, D., Rosso, P.: Overview of the EVALITA 2018 task on automatic
    misogyny identification (AMI). EVALITA Evaluation of NLP and Speech Tools for
    Italian 12, 59 (2018)
12. Fersini, E., Rosso, P., Anzovino, M.: Overview of the task on automatic misogyny
    identification at ibereval 2018. In: Rosso, P., Gonzalo, J., Martı́nez, R., Montalvo,
    S., de Albornoz, J.C. (eds.) Proceedings of the Third Workshop on Evaluation
    of Human Language Technologies for Iberian Languages (IberEval 2018). CEUR
    Workshop Proceedings, vol. 2150, pp. 214–228. CEUR-WS.org (2018)
13. Garcı́a-Vega, M., Dı́az-Galiano, M.C., Garcı́a-Cumbreras, M.Á., Plaza-del-Arco,
    F.M., Montejo-Ráez, A., Jiménez-Zafra, S.M., Martı́nez-Cámara, E., et al.:
    Overview of TASS 2020: Introducing emotion detection. In: Proceedings of the
    Iberian Languages Evaluation Forum (IberLEF 2020) co-located with 36th Con-
    ference of the Spanish Society for Natural Language Processing (SEPLN 2020),
    Málaga, Spain, September 23th, 2020. CEUR Workshop Proceedings, vol. 2664,
    pp. 163–170. CEUR-WS.org (2020)
14. Kumar, R., Ojha, A.K., Malmasi, S., Zampieri, M.: Evaluating aggression iden-
    tification in social media. In: Proceedings of the Second Workshop on Trolling,
    Aggression and Cyberbullying. pp. 1–5 (2020)
15. Lamprinidis, S., Bianchi, F., Hardt, D., Hovy, D.: Universal Joy: A data set and
    results for classifying emotions across languages. In: Proceedings of the Eleventh
    Workshop on Computational Approaches to Subjectivity, Sentiment and Social
    Media Analysis. pp. 62–75 (2021)
16. Martı́nez-Cámara, E., Almeida-Cruz, Y., Dı́az-Galiano, M.C., Estévez-Velarde, S.,
    Garcı́a-Cumbreras, M.Á., Garcı́a-Vega, M., Gutiérrez, Y., Montejo-Ráez, A., Mon-
    toyo, A., Muñoz, R., Piad-Morffis, A., Villena-Román, J.: Overview of TASS 2018:
    Opinions, health and emotions. In: Proceedings of TASS 2018: Workshop on Se-
    mantic Analysis at SEPLN, TASS@SEPLN 2018. CEUR Workshop Proceedings,
    vol. 2172, pp. 13–27. CEUR-WS.org (2018)
17. Martı́nez-Cámara, E., Dı́az-Galiano, M.C., Garcı́a-Cumbreras, M.A., Garcı́a-Vega,
    M., Villena-Román, J.: Overview of TASS 2017. Proceedings of TASS pp. 13–21
    (2017)
18. Moffat, A., Zobel, J.: Rank-biased precision for measurement of retrieval effective-
    ness. ACM Transactions on Information Systems (TOIS) 27(1), 1–27 (2008)
19. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez Carmona,
    M., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L.,
    Gómez Adorno, H., Gutiérrez, Y., Jiménez-Zafra, S.M., Lima, S., Plaza-del-Arco,
    F.M., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum
    (IberLEF 2021) (2021)
20. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language
    detection in online user content. In: Proceedings of the 25th international confer-
    ence on world wide web. pp. 145–153 (2016)
21. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
    T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-
    performance deep learning library. In: Advances in neural information processing
    systems. pp. 8026–8037 (2019)
22. Pereira-Kohatsu, J.C., Quijano-Sánchez, L., Liberatore, F., Camacho-Collados, M.:
    Detecting and monitoring hate speech in Twitter. Sensors 19(21), 4654 (2019)
23. Plaza-del-Arco, F., Strapparava, C., Ureña-Lopez, L.A., Martin-Valdivia, M.T.:
    EmoEvent: A Multilingual Emotion Corpus based on different Events. In: Pro-
    ceedings of the 12th Language Resources and Evaluation Conference. pp. 1492–
    1498. European Language Resources Association, Marseille, France (May 2020),
    https://www.aclweb.org/anthology/2020.lrec-1.186
24. Poletto, F., Basile, V., Sanguinetti, M., Bosco, C., Patti, V.: Resources and bench-
    mark corpora for hate speech detection: a systematic review. Language Resources
    and Evaluation pp. 1–47 (2020)
25. Ruder, S.: Neural transfer learning for natural language processing. Ph.D. thesis,
    NUI Galway (2019)
26. Taulé, M., Ariza, A., Nofre, M., Amigó, E., Rosso, P.: Overview of the DETOXIS task
    at IberLEF-2021: Detection of toxicity in comments in Spanish. Procesamiento del
    Lenguaje Natural 67 (2021)
27. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
    M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural machine transla-
    tion system: Bridging the gap between human and machine translation. CoRR
    abs/1609.08144 (2016)
28. Zampieri, M., Nakov, P., Rosenthal, S., Atanasova, P., Karadzhov, G., Mubarak,
    H., Derczynski, L., Pitenis, Z., Çöltekin, Ç.: SemEval-2020 task 12: Multilingual
    offensive language identification in social media (OffensEval 2020). arXiv preprint
    arXiv:2006.07235 (2020)