AI-UPV at IberLEF-2021 DETOXIS Task: Toxicity Detection in Immigration-Related Web News Comments Using Transformers and Statistical Models

Angel Felipe Magnossão de Paula [0000-0001-8575-5012] and Ipek Baris Schlicht [0000-0002-5037-2203]
Universitat Politècnica de València, Spain
{adepau, ibarsch}@doctor.upv.es

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This paper describes our participation in the DEtection of TOXicity in comments In Spanish (DETOXIS) shared task 2021 at the 3rd Workshop of the Iberian Languages Evaluation Forum. The shared task is divided into two related classification tasks: (i) Task 1: toxicity detection and (ii) Task 2: toxicity level detection. Both tasks focus on the xenophobia problem exacerbated by the spread of toxic comments posted on different online news articles related to immigration. One of the necessary efforts towards mitigating this problem is to detect toxicity in the comments. Our main objective was to implement an accurate model to detect xenophobia in comments about web news articles within the DETOXIS shared task 2021, based on the competition's official metrics: the F1-score for Task 1 and the Closeness Evaluation Metric (CEM) for Task 2. To solve the tasks, we worked with two types of machine learning models: (i) statistical models and (ii) Deep Bidirectional Transformers for Language Understanding (BERT) models. We obtained our best results in both tasks using BETO, a BERT model trained on a big Spanish corpus. We obtained 3rd place in the Task 1 official ranking with an F1-score of 0.5996, and we achieved 6th place in the Task 2 official ranking with a CEM of 0.7142. Our results suggest that: (i) BERT models obtain better results than statistical models for toxicity detection in text comments; and (ii) monolingual BERT models have an advantage over multilingual BERT models for toxicity detection in their pre-training language.

Keywords: Spanish text classification · Toxicity detection · Deep Learning · Transformers · BERT · Statistical models.

1 Introduction

The increase in the number of news pages where readers can openly discuss the articles has driven the dissemination of internet users' opinions through social media [18, 10]. A survey carried out in the US by The Center for Media Engagement at the University of Texas at Austin states that most of the comments on news articles are posted by internet users whom we call active users or influencers [16]. They are highly active and generate huge amounts of data. The imbalance between the amount of data generated by influencers and by non-active users creates a distorted reality in which influencers' opinions end up representing the opinion of all internet users to society [2]. This distorted reality can aggravate existing social problems, as is the case with xenophobia, a strong sense of aversion to, or dread of, people from other countries [19]. In recent years, the xenophobia problem has been exacerbated by the increase in the spread of toxic comments posted on different online news articles related to immigration [3]. One of the first steps to mitigate the problem is to detect toxic comments on news articles [5]. For this reason, the Iberian Languages Evaluation Forum proposed the DEtection of TOXicity in comments In Spanish (DETOXIS) shared task 2021 [17].
The DETOXIS shared task comprises Task 1 and Task 2, which are, respectively, toxicity detection and toxicity level detection. The two tasks are performed on comments posted in Spanish in response to different online news articles related to immigration. Task 1 is a binary classification problem whose objective is to classify a Spanish text comment as 'toxic' or 'not toxic'. Task 2 aims to classify the same comment among four classes: 'not toxic', 'mildly toxic', 'toxic', or 'very toxic'. Table 1 displays examples of comments from all classes.

Table 1. Comment examples
Toxicity  | Toxicity level | News comment
not toxic | not toxic      | Proximamente en su barrio
toxic     | mildly toxic   | Vienen a pagarnos las pensiones
toxic     | toxic          | asi me gusta, que se maten entre ellos y en alta mar. Mas inmigrantes asi porfavor
toxic     | very toxic     | A esosmoros hay que echarlos pero ya. O los politicos hacen algo o la gente tendra que "actuar"

The detection of toxicity in comments is mostly done with Machine Learning (ML) models, especially deep learning models, which require large amounts of annotated data for robust predictions [8]. However, labeling toxicity is a challenging and time-consuming task that requires many annotators to avoid bias, and the annotators should be aware of social and cultural contexts [15, 11].

Our main goal was to implement an accurate model to detect xenophobic comments on web news articles within the DETOXIS shared task 2021, using the competition's official metrics. We decided to solve the problem by applying models that can learn from only a small amount of data, which is possible with statistical models and with the most advanced pre-trained deep learning models. Roughly speaking, there are two types of statistical models: generative and discriminative [9]. We chose one of each type: a Naive Bayes (generative) model and a Maximum Entropy (discriminative) model. Among the most advanced and highly effective deep learning models is Deep Bidirectional Transformers for Language Understanding (BERT), which comes with its parameters pre-trained in an unsupervised manner on a large corpus [7]. Therefore, it only needs supervised fine-tuning, which can be done with a small amount of data and thus suits our problem. Our source code is publicly available at https://github.com/AngelFelipeMP/Machine-Learning-Tweets-Classification.

The work's main contribution is to help in the effort to improve the results in the identification of toxic comments on news articles related to immigration. Unlike the vast majority of works [14], we use ML models that can tackle the xenophobia detection problem with only a small amount of data available. The second contribution is to build ML models and find their best configuration to deal not only with the classification of comments as 'toxic' or 'not toxic', but also with inferring the toxicity level of the comments as 'not toxic', 'mildly toxic', 'toxic', or 'very toxic'. As far as we know, there are few works in the literature in which the proposed model tries to infer the toxicity level of comments posted on news related to immigration. In the DETOXIS official ranking, we obtained 3rd place in Task 1 with an F1-score of 0.5996, and we achieved 6th place in Task 2 with a CEM of 0.7142.

The article is organized as follows: Section 2 contains the methodology with fundamental concepts; Section 3 describes the experiments; Section 4 contains the results and discussion; and Section 5 draws conclusions and outlines future work.
2 Methodology

This section explains the data structure, the evaluation metrics, and the ML models applied to solve our classification problems. In addition, it describes the text representation used to encode the text comments.

2.1 Dataset

The DETOXIS shared task organization provided its participants with the NewsCom-TOX dataset [17], divided into a train set and a test set, where the text data are in Spanish. The train set consists of 3463 instances, and the test set consists of 891 instances. Both sets have as main fields: (i) 'Comment id' and (ii) 'Comment'; only the train set has the labels (iii) 'Toxicity' and (iv) 'Toxicity level', used respectively for Task 1 and Task 2. The 'Comment id' is a unique reference number assigned to each instance within the NewsCom-TOX dataset. The 'Comment' field is a text message posted in response to a Spanish online news article from sources such as El Mundo, NIUS, ABC, etc., or discussion forums such as Menéame. Moreover, 'Toxicity' labels the comment of a particular instance as 'toxic' or 'not toxic', and 'Toxicity level' classifies the same comment as 'not toxic', 'mildly toxic', 'toxic', or 'very toxic'. Table 2 shows the label distribution for 'Toxicity' and 'Toxicity level'. We can see that the labels are unbalanced in both cases.

Table 2. Data distribution
Toxicity (Task 1)                 Toxicity level (Task 2)
Label     | Number of instances | Label        | Number of instances
not toxic | 2316                | not toxic    | 2317
toxic     | 1147                | mildly toxic | 808
          |                     | toxic        | 269
          |                     | very toxic   | 69

The data annotation process was carried out by four annotators: two expert linguists and two trained linguistics students. Three of them labeled all news article comments in parallel. Once they finished, an inter-annotator agreement test was executed. When a disagreement occurred, the three annotators plus a senior annotator reviewed it in order to reach agreement on the final label [17]. In Table 1, we can see examples of comments from the DETOXIS train set and the labels attributed by the annotators for 'Toxicity' and 'Toxicity level'.

Next, we explain how we used the data during the project development in both tasks. First, we applied 10-fold cross-validation on the train set to find the best ML model. After that, we trained the selected model on the whole train set. Subsequently, we applied the selected model to make predictions on the official test set, as shown in Figure 1. These predictions were submitted to the DETOXIS shared task 2021.

Fig. 1. Workflow.

2.2 Evaluation metrics

Because the train set is imbalanced, as we can see in Table 2, we selected evaluation metrics that are able to fairly evaluate ML models in this circumstance. For Task 1, we adopted Accuracy, Recall, Precision, and the F1-score, which was the DETOXIS official evaluation metric for Task 1. For Task 2, we adopted Accuracy, F1-macro, F1-weighted, Recall, Precision, and CEM [1], the DETOXIS official evaluation metric for Task 2. We used the DETOXIS official metrics as performance measures to rank and select the best ML models during the cross-validation process for Task 1 and Task 2.

2.3 Models

There are two types of statistical models: generative models and discriminative models [9]. We used one model of each kind. We adopted the Naive Bayes (generative) model and the Maximum Entropy (discriminative) model.
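To make the model setup concrete, the sketch below shows how the two statistical models can be trained and compared with 10-fold cross-validation ranked by the Task 1 official metric, using scikit-learn, where Maximum Entropy corresponds to LogisticRegression and Naive Bayes to MultinomialNB. It is a minimal sketch rather than our exact pipeline: the file name and column names are placeholders, and the toxicity label is assumed to be encoded as 0/1. For Task 2, the official metric CEM [1] has no scikit-learn implementation, so a custom scorer would have to be passed instead of "f1".

```python
# Minimal sketch (not our exact pipeline): comparing a generative (Naive Bayes)
# and a discriminative (Maximum Entropy) model with 10-fold cross-validation,
# ranked by the Task 1 official metric (F1-score). The file name and column
# names are placeholders; 'toxicity' is assumed to be encoded as 0/1.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression  # Maximum Entropy model
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = pd.read_csv("newscom_tox_train.csv")      # placeholder path
X, y = train["comment"], train["toxicity"]        # placeholder column names

models = {
    "NB (Multinomial)": MultinomialNB(),
    "ME (liblinear solver)": LogisticRegression(solver="liblinear", max_iter=1000),
}

for name, classifier in models.items():
    pipeline = make_pipeline(CountVectorizer(), classifier)   # BOW features
    scores = cross_val_score(pipeline, X, y, cv=10, scoring="f1")
    print(f"{name}: mean F1-score = {scores.mean():.4f}")
```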
Among the Transformer models, we decided to use BERT, one of the most advanced and highly effective architectures. BERT models come with their parameters pre-trained in an unsupervised manner on a large corpus [7]. Therefore, they only need supervised fine-tuning on the downstream task, which can be run on a small set of data. We adopted: (i) the BETO model, a BERT model trained on a big unannotated Spanish corpus composed of three billion tokens [4]; and (ii) mBERT, a BERT model pre-trained on the 102 languages with the most extensive Wikipedia corpora. However, the balance among the languages in the corpus is not perfect; for example, the English partition of the corpus is about 1000 times bigger than the Icelandic partition [6].

2.4 Text representation

To represent our text data in a way that the statistical models can handle, we used two encoding methods: (i) Bag of Words (BOW) [20]; and (ii) Term Frequency - Inverse Document Frequency (TF-IDF) [13]. BOW represents a text comment by a unidimensional vector whose length is the size of the training vocabulary. Each position of this vector contains the number of times a particular word from the vocabulary appears in the specific comment. The TF-IDF representation of each text comment is also a flat vector with the size of the training vocabulary. However, the value for each word in the vector follows the well-known TF-IDF weighting [13].

3 Experiments

This section explains the environment setup, the data preprocessing, and the feature extraction for the statistical models. Furthermore, the section explains the 10-fold cross-validation process and how we selected the model used to make predictions on the DETOXIS test set, which we submitted as our final results to the competition.

3.1 Environment setup

We used Python 3.7.10. As a code editor and execution environment, we used Google Colaboratory (https://colab.research.google.com/). The main Python libraries that we used were: (i) NumPy 1.19.5 to work with matrices, (ii) Pandas 1.1.5 to handle and visualize data, (iii) spaCy 2.2.4 and (iv) the Natural Language Toolkit (NLTK) 3.2.5 for natural language transformations, and (v) PyTorch 1.6.0 and (vi) Transformers 3.0.0 to implement the BERT models. In addition, we used (vii) scikit-learn 0.22.2 to implement the statistical models.

3.2 Preprocessing

For both tasks, we preprocessed the data only for the statistical models. The preprocessing step was carried out on the text data from the train and test sets. We used the built-in Python module for Regular Expressions (RegEx) and the NLTK Python library. Applying RegEx, we removed stock market tickers, old-style retweet text, hashtags, and hyperlinks, and replaced numbers with a placeholder tag. We employed NLTK on the text comments to remove stopwords and to tokenize and stem the words.

3.3 Feature extraction

The feature extraction process focused on achieving good results with the statistical models, whose performance is susceptible to their input features [12]. Hence, after preprocessing the datasets for the statistical models, we executed the feature extraction process to create good input features. We encoded the text comments in two different manners: (i) BOW [20]; and (ii) TF-IDF [13]. The two encoding methods are based on word occurrences and, unfortunately, they completely ignore the relative position of the words in the comments. Therefore, we lose the information about the local ordering of the words.
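The sketch below illustrates one way the preprocessing of Section 3.2 and the encodings of Section 3.3 can be implemented with the re module, NLTK, and scikit-learn. It is an approximation under stated assumptions rather than our exact code: the regular expressions, the placeholder tag for numbers, and the (1, 3) n-gram range (discussed next) are illustrative choices.

```python
# Illustrative sketch (assumed rules, not our exact ones): preprocessing with
# regular expressions and NLTK, then encoding comments as BOW or TF-IDF vectors.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stemmer = SnowballStemmer("spanish")
spanish_stopwords = set(stopwords.words("spanish"))

def preprocess(comment: str) -> str:
    comment = re.sub(r"\$\w+", "", comment)          # stock market tickers
    comment = re.sub(r"^RT\s+", "", comment)          # old-style retweet text
    comment = re.sub(r"https?://\S+", "", comment)    # hyperlinks
    comment = re.sub(r"#", "", comment)               # hashtag symbol
    comment = re.sub(r"\d+", "NUM", comment)          # numbers -> placeholder tag (assumed name)
    tokens = nltk.word_tokenize(comment, language="spanish")
    tokens = [stemmer.stem(t) for t in tokens if t.lower() not in spanish_stopwords]
    return " ".join(tokens)

comments = ["Proximamente en su barrio", "Vienen a pagarnos las pensiones"]
documents = [preprocess(c) for c in comments]

# BOW and TF-IDF encodings; ngram_range=(1, 3) also extracts 2-grams and 3-grams
bow_matrix = CountVectorizer(ngram_range=(1, 3)).fit_transform(documents)
tfidf_matrix = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(documents)
print(bow_matrix.shape, tfidf_matrix.shape)
```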
In order to mitigate this problem and preserve some of the local word-ordering information, we increased the vocabulary by extracting 2-grams and 3-grams from the text comments, despite the resulting increase in dimensionality.

3.4 Cross-validation

The cross-validation process was performed on the train set, aiming to find the best ML model to make the predictions on the DETOXIS test set. A summary of our cross-validation process is shown in Figure 2. During the cross-validation, each statistical model received different input features, and the BERT models were tried with different hyper-parameters.

Fig. 2. General diagram of the ML models' cross-validation.

In Figure 3, we can see the cross-validation process focused on the BERT models. We tried different combinations of the Output BERT, Learning Rate, Batch Size, and Epochs. The BERT models are composed of a pre-trained model plus a linear layer on top, which receives the output of the pre-trained BERT model. We have two different options for the Output BERT: (i) the sequence of hidden states at the output of the last layer, on which we performed mean pooling and max pooling operations and concatenated the results into a single unidimensional vector, which we call 'hidden'; and (ii) the last layer's hidden state of the first token of the sequence, further processed by a linear layer and a tanh activation function, which we call 'pooler' (both options are illustrated in the sketch shown before Table 3). For the Learning Rate, we tried 1E-5, 3E-5, and 5E-5. For the Batch Size, we tried 8, 16, 32, and 64. The number of Epochs ranged from 1 to 20.

Fig. 3. Cross-validation of the BERT models.

Figure 4 illustrates the cross-validation process for the statistical models. We tried four algorithmic versions of the Naive Bayes (NB) model: Multinomial, Bernoulli, Gaussian, and Complement. On the other hand, we tried only the original version of the Maximum Entropy (ME) model, but with different solvers: liblinear, newton, sag, saga, and lbfgs. We call solvers the algorithms used in the optimization problem. We tried different vocabulary sizes for all statistical models using different n-gram combinations and the two encoding methods: BOW and TF-IDF.

Fig. 4. Cross-validation of the statistical models.

Tables 3, 4, 5, and 6 show the results of the 10-fold cross-validation process for the statistical models. Tables 3 and 4, in this sequence, show the results of the ME model and the NB model for Task 1. Tables 5 and 6 respectively show the results of the ME model and the NB model for Task 2. The first column in Tables 3 and 5 shows the solver, and the first column in Tables 4 and 6 shows the NB algorithm. All the other columns have the same meaning in the four tables. The third column shows the n-grams used as vocabulary, where the numbers within the parentheses are the lower and upper limits of the n-gram word range used. The n-gram range (1,1) means only 1-grams, (1,2) means 1-grams and 2-grams, and (1,3) means 1-grams, 2-grams, and 3-grams. The remaining columns contain the evaluation metrics for each group of selected parameters. For Task 1, the evaluation metrics are Accuracy, F1-score, Recall, and Precision, and for Task 2, the evaluation metrics are Accuracy, F1-macro, F1-weighted, Recall, Precision, and CEM.
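To make the two Output BERT options concrete, the sketch below shows one possible implementation of the classification head on top of a pre-trained model with the Hugging Face Transformers library. The checkpoint names are the public BETO and mBERT checkpoints, but the head itself is an illustrative reconstruction of the description above, not our exact code.

```python
# Sketch of the two "Output BERT" options (illustrative reconstruction):
# 'hidden' = mean pooling + max pooling over the last hidden states, concatenated;
# 'pooler' = the pooled representation of the first token ([CLS]).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # BETO; "bert-base-multilingual-cased" for mBERT

class ToxicityClassifier(nn.Module):
    def __init__(self, n_classes: int, output_bert: str = "pooler"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(MODEL_NAME)
        self.output_bert = output_bert
        hidden_size = self.bert.config.hidden_size
        in_features = 2 * hidden_size if output_bert == "hidden" else hidden_size
        self.classifier = nn.Linear(in_features, n_classes)  # linear layer on top

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_states, pooled_output = outputs[0], outputs[1]
        if self.output_bert == "hidden":
            mean_pool = last_hidden_states.mean(dim=1)
            max_pool, _ = last_hidden_states.max(dim=1)
            features = torch.cat([mean_pool, max_pool], dim=1)
        else:  # 'pooler'
            features = pooled_output
        return self.classifier(features)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
batch = tokenizer(["Proximamente en su barrio"], return_tensors="pt",
                  padding=True, truncation=True)
model = ToxicityClassifier(n_classes=2, output_bert="hidden")
logits = model(batch["input_ids"], batch["attention_mask"])
```

The learning rate, batch size, and number of epochs listed above would then be varied around this architecture during the cross-validation.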
Table 3. Cross-validation ME models for Task 1
Solver    | Encoder | Vocabulary  | Accuracy | F1-score | Recall | Precision
Liblinear | TF-IDF  | (1,1)-grams | 0.7112   | 0.3031   | 0.2021 | 0.6882
Liblinear | TF-IDF  | (1,2)-grams | 0.6976   | 0.1546   | 0.0949 | 0.8786
Liblinear | TF-IDF  | (1,3)-grams | 0.6893   | 0.1036   | 0.0635 | 0.6500
Liblinear | BOW     | (1,1)-grams | 0.6994   | 0.4652   | 0.3993 | 0.5649
Liblinear | BOW     | (1,2)-grams | 0.7118   | 0.4353   | 0.3434 | 0.6060
Liblinear | BOW     | (1,3)-grams | 0.7126   | 0.4101   | 0.3128 | 0.6188
Newton    | TF-IDF  | (1,1)-grams | 0.7115   | 0.3043   | 0.2030 | 0.6890
Newton    | TF-IDF  | (1,2)-grams | 0.6979   | 0.1547   | 0.0949 | 0.8928
Newton    | TF-IDF  | (1,3)-grams | 0.6893   | 0.1036   | 0.0635 | 0.6500
Newton    | BOW     | (1,1)-grams | 0.6991   | 0.4645   | 0.3984 | 0.5647
Newton    | BOW     | (1,2)-grams | 0.7106   | 0.4331   | 0.3416 | 0.6030
Newton    | BOW     | (1,3)-grams | 0.7132   | 0.4106   | 0.3128 | 0.6212
Sag       | TF-IDF  | (1,1)-grams | 0.7115   | 0.3043   | 0.2030 | 0.6890
Sag       | TF-IDF  | (1,2)-grams | 0.6979   | 0.1547   | 0.0949 | 0.8928
Sag       | TF-IDF  | (1,3)-grams | 0.6893   | 0.1036   | 0.0635 | 0.6500
Sag       | BOW     | (1,1)-grams | 0.7002   | 0.4679   | 0.4019 | 0.5670
Sag       | BOW     | (1,2)-grams | 0.7141   | 0.4427   | 0.3512 | 0.6110
Sag       | BOW     | (1,3)-grams | 0.7155   | 0.4168   | 0.3189 | 0.6248
Saga      | TF-IDF  | (1,1)-grams | 0.7112   | 0.3031   | 0.2021 | 0.6882
Saga      | TF-IDF  | (1,2)-grams | 0.6979   | 0.1562   | 0.0957 | 0.8800
Saga      | TF-IDF  | (1,3)-grams | 0.6893   | 0.1036   | 0.0635 | 0.6500
Saga      | BOW     | (1,1)-grams | 0.6985   | 0.4652   | 0.4002 | 0.5629
Saga      | BOW     | (1,2)-grams | 0.7135   | 0.4432   | 0.3521 | 0.6102
Saga      | BOW     | (1,3)-grams | 0.7181   | 0.4246   | 0.3242 | 0.6337
Lbfgs     | TF-IDF  | (1,1)-grams | 0.7115   | 0.3043   | 0.2030 | 0.6890
Lbfgs     | TF-IDF  | (1,2)-grams | 0.6979   | 0.1547   | 0.0949 | 0.8928
Lbfgs     | TF-IDF  | (1,3)-grams | 0.6893   | 0.1036   | 0.0635 | 0.6500
Lbfgs     | BOW     | (1,1)-grams | 0.6991   | 0.4645   | 0.3984 | 0.5647
Lbfgs     | BOW     | (1,2)-grams | 0.7106   | 0.4331   | 0.3416 | 0.6030
Lbfgs     | BOW     | (1,3)-grams | 0.7132   | 0.4106   | 0.3128 | 0.6212

Table 4. Cross-validation NB models for Task 1
NB Algorithm | Encoder | Vocabulary  | Accuracy | F1-score | Recall | Precision
Multinomial  | TF-IDF  | (1,1)-grams | 0.6933   | 0.1703   | 0.1062 | 0.7282
Multinomial  | TF-IDF  | (1,2)-grams | 0.6878   | 0.0995   | 0.0609 | 0.7167
Multinomial  | TF-IDF  | (1,3)-grams | 0.6843   | 0.0805   | 0.0487 | 0.5500
Multinomial  | BOW     | (1,1)-grams | 0.6685   | 0.4868   | 0.4821 | 0.4960
Multinomial  | BOW     | (1,2)-grams | 0.6480   | 0.5232   | 0.5868 | 0.4736
Multinomial  | BOW     | (1,3)-grams | 0.5795   | 0.5355   | 0.7289 | 0.4238
Bernoulli    | TF-IDF  | (1,1)-grams | 0.6674   | 0.3344   | 0.2574 | 0.4979
Bernoulli    | TF-IDF  | (1,2)-grams | 0.6524   | 0.1910   | 0.1274 | 0.4297
Bernoulli    | TF-IDF  | (1,3)-grams | 0.6529   | 0.1765   | 0.1160 | 0.4168
Bernoulli    | BOW     | (1,1)-grams | 0.6674   | 0.3344   | 0.2574 | 0.4979
Bernoulli    | BOW     | (1,2)-grams | 0.6524   | 0.1910   | 0.1274 | 0.4297
Bernoulli    | BOW     | (1,3)-grams | 0.6529   | 0.1765   | 0.1160 | 0.4168
Gaussian     | TF-IDF  | (1,1)-grams | 0.4730   | 0.4192   | 0.5831 | 0.3294
Gaussian     | TF-IDF  | (1,2)-grams | 0.5287   | 0.3918   | 0.4733 | 0.3386
Gaussian     | TF-IDF  | (1,3)-grams | 0.5307   | 0.3961   | 0.4794 | 0.3418
Gaussian     | BOW     | (1,1)-grams | 0.4675   | 0.4282   | 0.6102 | 0.3317
Gaussian     | BOW     | (1,2)-grams | 0.5223   | 0.4068   | 0.5090 | 0.3425
Gaussian     | BOW     | (1,3)-grams | 0.5249   | 0.4084   | 0.5099 | 0.3442
Complement   | TF-IDF  | (1,1)-grams | 0.6604   | 0.4083   | 0.3800 | 0.4633
Complement   | TF-IDF  | (1,2)-grams | 0.6785   | 0.3255   | 0.2648 | 0.4845
Complement   | TF-IDF  | (1,3)-grams | 0.6835   | 0.3378   | 0.2727 | 0.5071
Complement   | BOW     | (1,1)-grams | 0.6234   | 0.5165   | 0.6156 | 0.4472
Complement   | BOW     | (1,2)-grams | 0.5928   | 0.5216   | 0.6749 | 0.4263
Complement   | BOW     | (1,3)-grams | 0.5215   | 0.5256   | 0.8004 | 0.3915
Table 5. Cross-validation ME models for Task 2
Solver    | Encoder | Vocabulary  | Accuracy | F1-macro | F1-weighted | Recall | Precision | CEM
Liblinear | TF-IDF  | (1,1)-grams | 0.6826   | 0.2363   | 0.5753      | 0.2691 | 0.2798    | 0.7070
Liblinear | TF-IDF  | (1,2)-grams | 0.6809   | 0.2233   | 0.5610      | 0.2629 | 0.2693    | 0.6972
Liblinear | TF-IDF  | (1,3)-grams | 0.6797   | 0.2214   | 0.5585      | 0.2616 | 0.2649    | 0.6923
Liblinear | BOW     | (1,1)-grams | 0.6526   | 0.3236   | 0.6034      | 0.3129 | 0.4473    | 0.6831
Liblinear | BOW     | (1,2)-grams | 0.6722   | 0.3070   | 0.6038      | 0.3059 | 0.4539    | 0.7018
Liblinear | BOW     | (1,3)-grams | 0.6780   | 0.2944   | 0.5998      | 0.2995 | 0.4418    | 0.7080
Newton    | TF-IDF  | (1,1)-grams | 0.6826   | 0.2509   | 0.5870      | 0.2767 | 0.3199    | 0.7067
Newton    | TF-IDF  | (1,2)-grams | 0.6846   | 0.2302   | 0.5691      | 0.2675 | 0.2934    | 0.7041
Newton    | TF-IDF  | (1,3)-grams | 0.6829   | 0.2267   | 0.5643      | 0.2652 | 0.2649    | 0.6977
Newton    | BOW     | (1,1)-grams | 0.6465   | 0.3587   | 0.6125      | 0.3367 | 0.4942    | 0.6827
Newton    | BOW     | (1,2)-grams | 0.6682   | 0.3176   | 0.6079      | 0.3117 | 0.4504    | 0.6984
Newton    | BOW     | (1,3)-grams | 0.6740   | 0.2997   | 0.6028      | 0.3019 | 0.4303    | 0.7022
Sag       | TF-IDF  | (1,1)-grams | 0.6826   | 0.2509   | 0.5870      | 0.2767 | 0.3199    | 0.7067
Sag       | TF-IDF  | (1,2)-grams | 0.6846   | 0.2302   | 0.5691      | 0.2675 | 0.2934    | 0.7041
Sag       | TF-IDF  | (1,3)-grams | 0.6829   | 0.2267   | 0.5643      | 0.2652 | 0.2649    | 0.6977
Sag       | BOW     | (1,1)-grams | 0.6460   | 0.3393   | 0.6107      | 0.3251 | 0.4386    | 0.6824
Sag       | BOW     | (1,2)-grams | 0.6690   | 0.3101   | 0.6077      | 0.3073 | 0.4306    | 0.6997
Sag       | BOW     | (1,3)-grams | 0.6742   | 0.2919   | 0.6021      | 0.2977 | 0.3976    | 0.7039
Saga      | TF-IDF  | (1,1)-grams | 0.6826   | 0.2509   | 0.5870      | 0.2767 | 0.3199    | 0.7067
Saga      | TF-IDF  | (1,2)-grams | 0.6846   | 0.2302   | 0.5691      | 0.2675 | 0.2934    | 0.7041
Saga      | TF-IDF  | (1,3)-grams | 0.6829   | 0.2267   | 0.5643      | 0.2652 | 0.2649    | 0.6977
Saga      | BOW     | (1,1)-grams | 0.6459   | 0.3380   | 0.6102      | 0.3241 | 0.4387    | 0.6825
Saga      | BOW     | (1,2)-grams | 0.6699   | 0.3145   | 0.6077      | 0.3101 | 0.4526    | 0.7015
Saga      | BOW     | (1,3)-grams | 0.6763   | 0.2902   | 0.6027      | 0.2974 | 0.4028    | 0.7053
Lbfgs     | TF-IDF  | (1,1)-grams | 0.6826   | 0.2509   | 0.5870      | 0.2767 | 0.3199    | 0.7067
Lbfgs     | TF-IDF  | (1,2)-grams | 0.6846   | 0.2302   | 0.5691      | 0.2675 | 0.2934    | 0.7041
Lbfgs     | TF-IDF  | (1,3)-grams | 0.6829   | 0.2267   | 0.5643      | 0.2652 | 0.2649    | 0.6977
Lbfgs     | BOW     | (1,1)-grams | 0.6465   | 0.3587   | 0.6125      | 0.3367 | 0.4942    | 0.6827
Lbfgs     | BOW     | (1,2)-grams | 0.6682   | 0.3176   | 0.6079      | 0.3117 | 0.4504    | 0.6984
Lbfgs     | BOW     | (1,3)-grams | 0.6740   | 0.2997   | 0.6028      | 0.3019 | 0.4303    | 0.7022

Table 6. Cross-validation NB models for Task 2
NB Algorithm | Encoder | Vocabulary  | Accuracy | F1-macro | F1-weighted | Recall | Precision | CEM
Multinomial  | TF-IDF  | (1,1)-grams | 0.6743   | 0.2160   | 0.5523      | 0.2576 | 0.2393    | 0.6808
Multinomial  | TF-IDF  | (1,2)-grams | 0.6769   | 0.2161   | 0.5528      | 0.2583 | 0.2417    | 0.6882
Multinomial  | TF-IDF  | (1,3)-grams | 0.6766   | 0.2158   | 0.5523      | 0.2580 | 0.2686    | 0.6881
Multinomial  | BOW     | (1,1)-grams | 0.6151   | 0.2736   | 0.5806      | 0.2807 | 0.2796    | 0.6384
Multinomial  | BOW     | (1,2)-grams | 0.6061   | 0.2747   | 0.5798      | 0.2858 | 0.2858    | 0.6473
Multinomial  | BOW     | (1,3)-grams | 0.5137   | 0.2657   | 0.5250      | 0.2878 | 0.2810    | 0.6133
Bernoulli    | TF-IDF  | (1,1)-grams | 0.6220   | 0.2472   | 0.5491      | 0.2631 | 0.2782    | 0.6210
Bernoulli    | TF-IDF  | (1,2)-grams | 0.6396   | 0.2212   | 0.5451      | 0.2504 | 0.2302    | 0.6286
Bernoulli    | TF-IDF  | (1,3)-grams | 0.6396   | 0.2175   | 0.5426      | 0.2485 | 0.2252    | 0.6268
Bernoulli    | BOW     | (1,1)-grams | 0.6220   | 0.2472   | 0.5491      | 0.2631 | 0.2782    | 0.6210
Bernoulli    | BOW     | (1,2)-grams | 0.6396   | 0.2212   | 0.5451      | 0.2504 | 0.2302    | 0.6286
Bernoulli    | BOW     | (1,3)-grams | 0.6396   | 0.2175   | 0.5426      | 0.2485 | 0.2252    | 0.6268
Gaussian     | TF-IDF  | (1,1)-grams | 0.4031   | 0.2311   | 0.4429      | 0.2345 | 0.2471    | 0.5075
Gaussian     | TF-IDF  | (1,2)-grams | 0.4923   | 0.2530   | 0.5056      | 0.2540 | 0.2586    | 0.5376
Gaussian     | TF-IDF  | (1,3)-grams | 0.4915   | 0.2532   | 0.5058      | 0.2541 | 0.2592    | 0.5386
Gaussian     | BOW     | (1,1)-grams | 0.4000   | 0.2333   | 0.4418      | 0.2398 | 0.2504    | 0.5081
Gaussian     | BOW     | (1,2)-grams | 0.4834   | 0.2534   | 0.5009      | 0.2566 | 0.2590    | 0.5361
Gaussian     | BOW     | (1,3)-grams | 0.4834   | 0.2538   | 0.5015      | 0.2570 | 0.2597    | 0.5370
Complement   | TF-IDF  | (1,1)-grams | 0.5911   | 0.2746   | 0.5648      | 0.2837 | 0.2893    | 0.5948
Complement   | TF-IDF  | (1,2)-grams | 0.6483   | 0.2749   | 0.5811      | 0.2880 | 0.3171    | 0.6342
Complement   | TF-IDF  | (1,3)-grams | 0.6497   | 0.2800   | 0.5845      | 0.2936 | 0.3162    | 0.6377
Complement   | BOW     | (1,1)-grams | 0.4932   | 0.2916   | 0.5289      | 0.3211 | 0.3002    | 0.5751
Complement   | BOW     | (1,2)-grams | 0.3647   | 0.2509   | 0.4386      | 0.3365 | 0.3035    | 0.5459
Complement   | BOW     | (1,3)-grams | 0.2010   | 0.1749   | 0.2712      | 0.3261 | 0.3086    | 0.4923

Tables 7, 8, 9, and 10 show the top 5 results obtained in the 10-fold cross-validation process for the BERT models.
We used the DETOXIS official metrics to rank the models: the F1-score for Task 1 and the CEM for Task 2. Tables 7 and 8, in this sequence, show the top 5 results of the mBERT model and the BETO model for Task 1. Tables 9 and 10 respectively show the top 5 results of the mBERT model and the BETO model for Task 2. In all four tables, the first column shows the BERT model, and the second column displays the type of Output BERT. The third column shows the Learning Rate, the fourth column shows the Batch Size, and the fifth column indicates the number of Epochs. The remaining columns contain the evaluation metrics for each group of selected parameters. For Task 1, the evaluation metrics are Accuracy, F1-score, Recall, and Precision, and for Task 2, the evaluation metrics are Accuracy, F1-macro, F1-weighted, Recall, Precision, and CEM.

Table 7. Top 5 mBERT models cross-validation for Task 1
Model | Output BERT | Learning Rate | Batch Size | Epochs | Accuracy | F1-score | Recall | Precision
mBERT | pooler      | 3E-05         | 32         | 11     | 0.6972   | 0.6010   | 0.6842 | 0.5594
mBERT | hidden      | 5E-05         | 32         | 8      | 0.7094   | 0.5865   | 0.6167 | 0.5759
mBERT | hidden      | 5E-05         | 32         | 9      | 0.7102   | 0.5838   | 0.6202 | 0.5713
mBERT | hidden      | 3E-05         | 64         | 16     | 0.7259   | 0.5819   | 0.5778 | 0.6083
mBERT | pooler      | 3E-05         | 16         | 8      | 0.6838   | 0.5798   | 0.6715 | 0.5319

Table 8. Top 5 BETO models cross-validation for Task 1
Model | Output BERT | Learning Rate | Batch Size | Epochs | Accuracy | F1-score | Recall | Precision
BETO  | pooler      | 1E-05         | 32         | 4      | 0.7446   | 0.6314   | 0.6514 | 0.6338
BETO  | pooler      | 1E-05         | 64         | 7      | 0.7415   | 0.6276   | 0.6578 | 0.6184
BETO  | pooler      | 1E-05         | 16         | 7      | 0.7267   | 0.6265   | 0.6829 | 0.6016
BETO  | pooler      | 5E-05         | 64         | 14     | 0.7554   | 0.6245   | 0.6203 | 0.6485
BETO  | pooler      | 3E-05         | 64         | 16     | 0.7565   | 0.6237   | 0.6117 | 0.6568

Table 9. Top 5 mBERT models cross-validation for Task 2
Model | Output BERT | Learning Rate | Batch Size | Epochs | Accuracy | F1-macro | F1-weighted | Recall | Precision | CEM
mBERT | pooler      | 1E-05         | 16         | 12     | 0.7031   | 0.4165   | 0.7477      | 0.4206 | 0.4483    | 0.7599
mBERT | hidden      | 1E-05         | 16         | 14     | 0.6955   | 0.4158   | 0.7486      | 0.4252 | 0.4344    | 0.7588
mBERT | hidden      | 3E-05         | 16         | 4      | 0.7006   | 0.3839   | 0.7475      | 0.3970 | 0.4054    | 0.7581
mBERT | hidden      | 1E-05         | 16         | 10     | 0.7011   | 0.3832   | 0.7515      | 0.3917 | 0.4210    | 0.7580
mBERT | pooler      | 1E-05         | 16         | 4      | 0.6974   | 0.3984   | 0.7496      | 0.4067 | 0.4182    | 0.7580

Table 10. Top 5 BETO models cross-validation for Task 2
Model | Output BERT | Learning Rate | Batch Size | Epochs | Accuracy | F1-macro | F1-weighted | Recall | Precision | CEM
BETO  | hidden      | 1E-05         | 16         | 4      | 0.7170   | 0.4035   | 0.7678      | 0.4091 | 0.4469    | 0.7769
BETO  | hidden      | 1E-05         | 8          | 3      | 0.7165   | 0.4138   | 0.7696      | 0.4151 | 0.4611    | 0.7747
BETO  | hidden      | 3E-05         | 32         | 6      | 0.7188   | 0.4096   | 0.7483      | 0.4173 | 0.4355    | 0.7746
BETO  | hidden      | 3E-05         | 64         | 5      | 0.7148   | 0.4178   | 0.7632      | 0.4235 | 0.4461    | 0.7746
BETO  | hidden      | 1E-05         | 8          | 5      | 0.7153   | 0.4168   | 0.7649      | 0.4219 | 0.4592    | 0.7739

3.5 Best model

At the end of the cross-validation, we selected the best model for each task according to the DETOXIS official metric for that task, as shown in Figure 5.

Fig. 5. Selection of the best ML model.

Table 11 shows the best result of each ML model tried in the cross-validation process for Task 1, namely mBERT, BETO, NB, and ME. Table 12 displays the top 5 models for Task 1 across the whole cross-validation process.

Table 11. The best result of each model in the cross-validation for Task 1
Model | Accuracy | F1-score | Recall | Precision
BETO  | 0.7446   | 0.6314   | 0.6514 | 0.6338
mBERT | 0.6972   | 0.6010   | 0.6842 | 0.5594
NB    | 0.5795   | 0.5355   | 0.7289 | 0.4238
ME    | 0.7002   | 0.4679   | 0.4019 | 0.5670
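The selection illustrated in Figure 5 amounts to ranking the cross-validation results by the official metric of each task. A minimal sketch of this step is shown below, with hypothetical column names and two example configurations per task copied from Tables 8 and 10; it is an illustration rather than our exact selection code.

```python
# Illustrative sketch (hypothetical column names): picking the best configuration
# per task by the official metric used for ranking during cross-validation.
import pandas as pd

# Two example rows per task, copied from Tables 8 and 10
task1_results = pd.DataFrame([
    {"model": "BETO", "output_bert": "pooler", "lr": 1e-5, "batch_size": 32, "epochs": 4, "f1": 0.6314},
    {"model": "BETO", "output_bert": "pooler", "lr": 1e-5, "batch_size": 64, "epochs": 7, "f1": 0.6276},
])
task2_results = pd.DataFrame([
    {"model": "BETO", "output_bert": "hidden", "lr": 1e-5, "batch_size": 16, "epochs": 4, "cem": 0.7769},
    {"model": "BETO", "output_bert": "hidden", "lr": 1e-5, "batch_size": 8, "epochs": 3, "cem": 0.7747},
])

best_task1 = task1_results.sort_values("f1", ascending=False).iloc[0]   # ranked by F1-score
best_task2 = task2_results.sort_values("cem", ascending=False).iloc[0]  # ranked by CEM
print(best_task1.to_dict())
print(best_task2.to_dict())
```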
Table 12. Top 5 models cross-validation for Task 1
Model | Output BERT | Learning Rate | Batch Size | Epochs | Accuracy | F1-score | Recall | Precision
BETO  | pooler      | 1E-05         | 32         | 4      | 0.7446   | 0.6314   | 0.6514 | 0.6338
BETO  | pooler      | 1E-05         | 64         | 7      | 0.7415   | 0.6276   | 0.6578 | 0.6184
BETO  | pooler      | 1E-05         | 16         | 7      | 0.7267   | 0.6265   | 0.6829 | 0.6016
BETO  | pooler      | 5E-05         | 64         | 14     | 0.7554   | 0.6245   | 0.6203 | 0.6485
BETO  | pooler      | 3E-05         | 64         | 16     | 0.7565   | 0.6237   | 0.6117 | 0.6568

Table 13 shows the best performance of each ML model in the cross-validation process for Task 2, namely mBERT, BETO, ME, and NB. Table 14 displays the top 5 models for Task 2 across the whole cross-validation process.

Table 13. The best result of each model in the cross-validation for Task 2
Model | Accuracy | F1-macro | F1-weighted | Recall | Precision | CEM
BETO  | 0.7170   | 0.4035   | 0.7678      | 0.4091 | 0.4469    | 0.7769
mBERT | 0.7031   | 0.4165   | 0.7477      | 0.4206 | 0.4483    | 0.7599
ME    | 0.6780   | 0.2944   | 0.5998      | 0.2995 | 0.4418    | 0.7080
NB    | 0.6769   | 0.2161   | 0.5528      | 0.2583 | 0.2417    | 0.6882

Table 14. Top 5 models cross-validation for Task 2
Model | Output BERT | Learning Rate | Batch Size | Epochs | Accuracy | F1-macro | F1-weighted | Recall | Precision | CEM
BETO  | hidden      | 1E-05         | 16         | 4      | 0.7170   | 0.4035   | 0.7678      | 0.4091 | 0.4469    | 0.7769
BETO  | hidden      | 1E-05         | 8          | 3      | 0.7165   | 0.4138   | 0.7696      | 0.4151 | 0.4611    | 0.7747
BETO  | hidden      | 3E-05         | 32         | 6      | 0.7188   | 0.4096   | 0.7483      | 0.4173 | 0.4355    | 0.7746
BETO  | hidden      | 3E-05         | 64         | 5      | 0.7148   | 0.4178   | 0.7632      | 0.4235 | 0.4461    | 0.7746
BETO  | hidden      | 1E-05         | 8          | 5      | 0.7153   | 0.4168   | 0.7649      | 0.4219 | 0.4592    | 0.7739

After the cross-validation, we chose the best model for Task 1, which, following Table 12, is BETO with the following parameters: (i) pooler as Output BERT; (ii) a Learning Rate of 1E-05; (iii) a Batch Size of 32; and (iv) 4 training Epochs. We also selected the best model for Task 2, which, following Table 14, is BETO with the following parameters: (i) hidden as Output BERT; (ii) a Learning Rate of 1E-05; (iii) a Batch Size of 16; and (iv) 4 training Epochs. Having selected the best models and their parameters, we trained them on the whole train set. Once the best models were trained, we used them to make the predictions on the DETOXIS test set. These predictions were afterwards submitted to the DETOXIS shared task organization as our final results.

4 Results and Discussion

The cross-validation results revealed important information. Looking at Table 3, we can see that, based on the F1-score evaluation metric, the ME model achieves its best result on Task 1 with the BOW encoding, namely 0.4679. The highest Accuracy (0.7126) and Recall (0.4019) are also obtained with the BOW encoding. The only performance metric for which the TF-IDF encoding obtains a higher score is Precision, at 0.8928. Thus, we can conclude that BOW is the best encoding for the ME model on Task 1 in the DETOXIS train set. Moreover, employing the Sag solver, the ME model achieved higher F1-score, Recall, and Precision. Hence, it seems to us that Sag was the best solver for the ME model on Task 1 in the DETOXIS train set. We do not have a definitive conclusion about the vocabulary size because the ME model achieved its highest results with different numbers of n-grams for each metric.

Observing Table 4, we see that, based on the F1-score evaluation metric, the NB model achieves its best result on Task 1 with the BOW encoding, namely 0.5355. The NB model also obtained the highest Recall (0.8004) with the BOW encoding, but its highest Accuracy (0.6933) and Precision (0.7282) were obtained with the TF-IDF encoding. Therefore, we cannot conclude which encoding method is the best for the NB model on Task 1 in the DETOXIS train set.
A similar situation occurs with the vocabulary size: the NB model that employed 1-grams, 2-grams, and 3-grams achieved the highest F1-score and Recall, whereas the NB model with a 1-gram vocabulary obtained the highest Accuracy and Precision. The different NB algorithms obtained similar performance based on the F1-score, and in most cases they achieved their best results with the BOW encoding.

We can see in Table 5 that, based on the CEM evaluation metric, the ME model achieved its best result on Task 2 with the BOW encoding, namely 0.7080. The highest F1-macro (0.3587), F1-weighted (0.6125), Recall (0.3367), and Precision (0.4942) are also obtained with the BOW encoding. The only performance metric for which the TF-IDF encoding obtains a higher score is Accuracy, at 0.6846. Thus, we can conclude that BOW is the best encoding for the ME model on Task 2 in the DETOXIS train set. Moreover, employing the Newton solver, the ME model achieved higher Accuracy, F1-macro, F1-weighted, Recall, and Precision. Hence, it seems to us that Newton was the best solver for the ME model on Task 2 in the DETOXIS train set. We concluded that a vocabulary of 1-grams is the best for the ME model on Task 2 in the DETOXIS train set, because with it the ME model achieved its highest Accuracy, F1-macro, F1-weighted, Recall, and Precision.

Table 6 shows that, based on the CEM evaluation metric, the NB model achieved its best result on Task 2 with the TF-IDF encoding, namely 0.6882. The NB model also obtained its highest Accuracy (0.6769), F1-weighted (0.5845), and Precision (0.3171) with the TF-IDF encoding, but its highest F1-macro (0.2916) and Recall (0.3365) were obtained with the BOW encoding. Therefore, we can conclude that the TF-IDF encoding best suits the NB model on Task 2 in the DETOXIS train set. We see indications in Table 6 that the ideal vocabulary for the NB model on Task 2 in the DETOXIS train set is composed of 1-grams and 2-grams, since with this vocabulary the model obtained its highest Accuracy, Recall, Precision, and CEM results. The different NB algorithms obtained similar performance based on the CEM, which ranged from 0.49 to 0.68.

Based on the F1-score, the mBERT model achieved its best performance on Task 1 with a value of 0.6010, as we can see in Table 7. The model parameters are the following: (i) pooler as Output BERT; (ii) a Learning Rate of 3E-05; (iii) a Batch Size of 32; and (iv) 11 training Epochs. Table 8 shows that the BETO model obtained its best performance on Task 1, also based on the F1-score, with the following parameters: (i) pooler as Output BERT; (ii) a Learning Rate of 1E-05; (iii) a Batch Size of 32; and (iv) 4 training Epochs. The BETO model obtained an F1-score of 0.6314, which was also the highest among all the ML models in the cross-validation process. For this reason, the BETO model with the mentioned parameters was used for our official Task 1 predictions on the DETOXIS test set. These predictions were afterwards submitted as our official Task 1 results.

Observing Table 9, we can conclude that, based on the CEM, the mBERT model achieved its best performance on Task 2 with the following parameters: (i) pooler as Output BERT; (ii) a Learning Rate of 1E-05; (iii) a Batch Size of 16; and (iv) 12 training Epochs. This model achieved a CEM of 0.7599.
Table 10 shows that the BETO model obtained its best performance on Task 2, also based on the CEM, with the following parameters: (i) hidden as Output BERT; (ii) a Learning Rate of 1E-05; (iii) a Batch Size of 16; and (iv) 4 training Epochs. The BETO model obtained a CEM of 0.7769, which was also the highest among all the ML models in the cross-validation process. For this reason, the BETO model with the mentioned parameters was used for our official Task 2 predictions on the DETOXIS test set. These predictions were afterwards submitted as our official Task 2 results.

To sum up the cross-validation results, looking at Tables 12 and 14, we can see that the BETO model with different combinations of parameters occupied the first five positions in the ranking of the best ML models for both Task 1 and Task 2.

The DETOXIS organization provided us with the results on the test set. Table 15 shows our result on Task 1 together with the three official DETOXIS baselines: Random Classifier, Chain BOW, and BOW Classifier. Our model obtained an F1-score around 59% greater than the result obtained by the best baseline on Task 1.

Table 15. Test set results for Task 1
Model             | F1-score
BETO              | 0.5996
Random Classifier | 0.3761
Chain BOW         | 0.3747
BOW Classifier    | 0.1837

Table 16 shows the results of our model and the three DETOXIS baselines on Task 2. Our BETO model was able to achieve a CEM around 9% higher than the best DETOXIS baseline result, obtained by Chain BOW.

Table 16. Test set results for Task 2
Model             | CEM
BETO              | 0.7142
Chain BOW         | 0.6535
BOW Classifier    | 0.6318
Random Classifier | 0.4382

In the DETOXIS official ranking, we obtained 3rd place on Task 1 with an F1-score of 0.5996, and we achieved 6th place on Task 2 with a CEM of 0.7142.

5 Conclusion and Future Work

Xenophobia is a problem aggravated by the increase in the spread of toxic comments posted on different online news articles related to immigration. In this paper, to address this problem within the DETOXIS 2021 shared task, we tried two types of ML models: (i) statistical models and (ii) BERT models. We obtained the best results in both tasks using BETO, a BERT model pre-trained on a big Spanish corpus. Our contributions are as follows: (i) we help in the effort to improve the results in the identification of toxic comments on news articles related to immigration; unlike the vast majority of works, we use ML models that can tackle the xenophobia detection problem with only a small amount of data available; and (ii) we build ML models and find their best configuration to deal not only with the classification of comments as 'toxic' or 'not toxic', but also with inferring the toxicity level of the comments as 'not toxic', 'mildly toxic', 'toxic', or 'very toxic'.

Based on the DETOXIS official metrics, we concluded that our results indicate that: (i) BERT models obtain better results than statistical models for toxicity and toxicity level detection in text comments; and (ii) monolingual BERT models achieve higher results than multilingual BERT models for toxicity and toxicity level detection in their pre-training language. In the end, our BETO model obtained 3rd position in the Task 1 official ranking with an F1-score of 0.5996, and it achieved 6th position in the Task 2 official ranking with a CEM of 0.7142. As future work, we aim to include sentiment lexicons in the model's input to boost its performance.
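As a purely illustrative sketch of this future direction (not something we implemented), one option is to concatenate a lexicon-based score with the BERT features before the final linear layer; the lexicon and its scores below are made-up placeholders, not a real resource.

```python
# Purely illustrative sketch of the stated future work: concatenating a
# hypothetical lexicon-based score with the BERT features before classification.
import torch
import torch.nn as nn

HYPOTHETICAL_LEXICON = {"odio": 0.9, "maten": 0.8, "gracias": -0.5}  # placeholder scores

def lexicon_score(comment: str) -> torch.Tensor:
    # Average lexicon score over the words of the comment (placeholder heuristic)
    words = comment.lower().split()
    score = sum(HYPOTHETICAL_LEXICON.get(w, 0.0) for w in words) / max(len(words), 1)
    return torch.tensor([score], dtype=torch.float)

class LexiconAwareHead(nn.Module):
    """Linear head over the concatenation [BERT features ; lexicon score]."""
    def __init__(self, bert_feature_size: int, n_classes: int):
        super().__init__()
        self.classifier = nn.Linear(bert_feature_size + 1, n_classes)

    def forward(self, bert_features: torch.Tensor, lex_scores: torch.Tensor):
        return self.classifier(torch.cat([bert_features, lex_scores], dim=1))

# Example with dummy BERT features for one comment
head = LexiconAwareHead(bert_feature_size=768, n_classes=4)
features = torch.zeros(1, 768)                            # placeholder BERT output
scores = lexicon_score("asi me gusta, que se maten entre ellos").unsqueeze(0)
logits = head(features, scores)
```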
References

1. Amigó, E., Gonzalo, J., Mizzaro, S., Carrillo-de Albornoz, J.: An effectiveness metric for ordinal classification: Formal properties and experimental results. arXiv preprint arXiv:2006.01245 (2020)
2. Baeza-Yates, R.: Biases on social media data (keynote extended abstract). In: Companion Proceedings of the Web Conference 2020. pp. 782–783. WWW '20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3366424.3383564
3. Blaya, C., Audrin, C.: Toward an understanding of the characteristics of secondary school cyberhate perpetrators. In: Frontiers in Education. vol. 4, p. 46. Frontiers (2019)
4. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. PML4DC at ICLR 2020 (2020)
5. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. In: Proceedings of the International AAAI Conference on Web and Social Media. vol. 11 (2017)
6. Devlin, J.: Multilingual BERT README document. https://github.com/google-research/bert/blob/a9ba4b8d7704c1ae18d1b28c56c0430d41407eb1/multilingual.md (2018)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
8. Gheisari, M., Wang, G., Bhuiyan, M.Z.A.: A survey on deep learning in big data. In: 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). vol. 2, pp. 173–180 (2017). https://doi.org/10.1109/CSE-EUC.2017.215
9. Jebara, T.: Machine Learning: Discriminative and Generative, vol. 755. Springer Science & Business Media (2012)
10. Kim, J.W., Chen, G.M.: Exploring the influence of comment tone and content in response to misinformation in social media news. Journalism Practice 15(4), 456–470 (2021). https://doi.org/10.1080/17512786.2020.1739550
11. Korencic, D., Baris, I., Fernandez, E., Leuschel, K., Sánchez Salido, E.: To block or not to block: Experiments with machine learning for news comment moderation. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation. pp. 127–133. Association for Computational Linguistics, Online (Apr 2021). https://www.aclweb.org/anthology/2021.hackashop-1.18
12. Nelson, W.B.: Accelerated Testing: Statistical Models, Test Plans, and Data Analysis, vol. 344. John Wiley & Sons (2009)
13. Pimpalkar, A.P., Raj, R.J.R.: Influence of pre-processing strategies on the performance of ML classifiers exploiting TF-IDF and BOW features. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal 9(2), 49–68 (2020)
14. Plaza-Del-Arco, F.M., Molina-González, M.D., Ureña López, L.A., Martín-Valdivia, M.T.: Detecting misogyny and xenophobia in Spanish tweets using language technologies. ACM Transactions on Internet Technology 20(2) (Mar 2020). https://doi.org/10.1145/3369869
15. Risch, J., Krestel, R.: Delete or not delete? Semi-automatic comment moderation for the newsroom. In: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). pp. 166–176 (2018)
16. Stroud, N.J., Van Duyn, E., Peacock, C.: News commenters and news comment readers. Engaging News Project, pp. 1–21 (2016)
17. Taulé, M., Ariza, A., Nofre, M., Amigó, E., Rosso, P.: Overview of the DETOXIS task at IberLEF-2021: Detection of toxicity in comments in Spanish. Procesamiento del Lenguaje Natural 67 (2021)
18. Winter, S., Brückner, C., Krämer, N.C.: They came, they liked, they commented: Social influence on Facebook news channels. Cyberpsychology, Behavior, and Social Networking 18(8), 431–436 (2015)
19. Xenophobia. Oxford Online Dictionary (2021), retrieved from https://en.oxforddictionaries.com/definition/money
20. Zhang, Y., Jin, R., Zhou, Z.H.: Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics 1(1-4), 43–52 (2010)