“DEMBO” at IberLEF-2021 DETOXIS task: Toxicity Analysis in comments using Machine Learning Models

Álvaro Mazcuñán & Miquel Marín (DEMBO)
Escuela Técnica Superior de Ingeniería Informática, ETSINF
Universitat Politècnica de València, Spain
{almazher,mimaco1}@inf.upv.es

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. In this paper we briefly explain how the “Dembo” team approached the problem of detecting toxic language. The project covers two tasks: the first is to classify whether a text (a set of comments) is toxic or not, and the second is to classify that text into different levels of toxicity. Two models were submitted to the competition for these subtasks. The first is a hybrid stacking model that combines Support Vector Machine, Decision Tree, Random Forest and Multi-layer Perceptron base models with a logistic regression acting as meta-learner. The second is the BETO model, a Spanish variant of the BERT model. In the DETOXIS ranking the team finished 11th in subtask 1 (toxicity) and 8th in subtask 2 (toxicity_level).

Keywords: Stacking, Support Vector Machines, Multi-layer Perceptron, Logistic Regression, Decision Tree, Random Forest, Transformers, BETO.

1 Introduction

The challenge of dealing with hate speech is ancient, but the speed at which hate speech spreads today poses a uniquely modern quandary. While there is no precise definition of hate speech, it is generally speech intended not only to insult or ridicule, but to cause lasting harm by attacking something that is particularly important to the victim. Hate speech is widespread in online forums and social media, and previous work has addressed its detection [1, 2].

Having introduced the problem of detecting toxic language, it should be noted that all of the models, except the BERT model, were built with the sklearn library. Two datasets were available for this task. The first was the training dataset, with a total of 3958 comments. The second was the test dataset, consisting of 891 comments, used to validate the quality of the models. Overall, the dataset provided a wide variety of variables, such as constructiveness, positive/negative stance, stereotype, sarcasm and aggressiveness, among others. However, for this competition only the comment variable was used, i.e. the variable containing the comments from various Spanish newspapers such as ABC, elDiario.es, El Mundo, etc.

The aim of the competition was twofold. On the one hand, the comments in the test set had to be labelled with 0 or 1, i.e. non-toxic or toxic, respectively. For this purpose, the toxicity variable was available in the training set to evaluate the corresponding machine learning models. The second subtask was somewhat more complicated: the same test set had to be labelled with different levels of toxicity:

• 0 → Not toxic
• 1 → Mildly toxic
• 2 → Toxic
• 3 → Very toxic

As with the first subtask, the toxicity_level variable was available in the set of 3958 training comments to evaluate our models.
2 System

2.1 Preprocessing

Before splitting the data to train the corresponding models, a small amount of data cleaning/preprocessing was carried out. First, emojis, hashtags (#), URLs and other special characters were removed from the messages. Once these parts of the comments had been eliminated, the text was tokenized, i.e. each comment was separated into words, yielding a list of those words. The next step was to remove stopwords, i.e. words that carry no meaning on their own; this group usually consists of articles, pronouns, prepositions, adverbs and some particular verbs.

The next step was to apply the stemming and lemmatization algorithms. The former works by stripping the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This approach can be successful on some occasions, but not always. Below is an example of how the stemming algorithm works:

studies → -es → studi
studying → -ing → study

Lemmatization, on the other hand, takes into account the morphological analysis of words. An example, as in the previous case:

studies → third person, present tense of the verb study → study
studying → gerund of the verb study → study

Once all the preprocessing of the comments has been done, we can move on to the text representation used for this contest. First, however, a training and test partition has to be carried out. Specifically, the training dataset (the one containing 3958 messages) is split into a training and a validation partition (the validation split is around 10-20%, depending on the algorithm used). Once this is done, the 891 comments from the test set can be used to classify the messages. It is fair to say that the test set evaluates the models in a “real world scenario”, with new, unseen posts on the same topics as those present in the training set.

In addition to performing the corresponding partitions, it has to be checked whether the classes are balanced, as imbalance can cause problems when evaluating the subsequent models. In the variable containing the binary classification, toxicity, 2316 comments belong to class 0 (non-toxic) and the rest (1147) to class 1 (toxic). According to the criteria considered by the team, no class balancing was carried out for this variable. The situation is different for the toxicity_level variable, which contains 2317 comments of class 0, 808 of class 1, 269 of class 2 and, finally, 69 of the most toxic class. In this case, it was decided to account for the imbalance while training the models (the solution adopted is briefly explained in the section on model evaluation). One of the future-work proposals mentioned in the conclusions is to perform this balancing before training the models.
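As an illustration, the following is a minimal sketch of the preprocessing pipeline described above. The paper does not name the libraries used for these steps, so NLTK (with its Spanish stopword list and Snowball stemmer) is an assumption:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download("punkt")
nltk.download("stopwords")

SPANISH_STOPWORDS = set(stopwords.words("spanish"))
stemmer = SnowballStemmer("spanish")

def preprocess(comment: str) -> list[str]:
    # Remove URLs, hashtags, emojis and other special characters
    text = re.sub(r"https?://\S+", " ", comment)
    text = re.sub(r"#\w+", " ", text)
    text = re.sub(r"[^a-záéíóúüñ\s]", " ", text.lower())
    # Tokenize the comment into a list of words
    tokens = nltk.word_tokenize(text, language="spanish")
    # Drop stopwords (articles, pronouns, prepositions, ...)
    tokens = [t for t in tokens if t not in SPANISH_STOPWORDS]
    # Stem each remaining token; lemmatization could alternatively
    # be applied here, e.g. with spaCy's Spanish model
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Están estudiando el artículo... https://ejemplo.es #noticia"))
```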
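The training/validation partition itself can be reproduced with sklearn, which the paper uses throughout. A sketch with an assumed 20% validation fraction and toy stand-in data, stratifying on the labels (see the stratify remark in Section 2.3):

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real data: preprocessed comments and their labels.
X = ["comentario uno", "comentario dos", "comentario tres", "comentario cuatro"] * 10
y = [0, 0, 0, 1] * 10  # imbalanced, as in the toxicity variable

# stratify=y keeps the class proportions identical in both partitions,
# which matters especially for the imbalanced toxicity_level variable.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_val))  # 32 8
```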
2.2 Text representation

Having considered the issue of class balancing, we now move on to the text representation techniques used. The first is the bag of words (BOW) [3]. It represents the vocabulary used by our models as a matrix in which each column is a token and each entry counts the number of times that token appears in each sentence.

The problem with this technique is that it only considers unigrams. For this reason, the n-grams technique was also used, since it allows word order to be taken into account. We decided to use 2- and 3-grams (bigrams and trigrams); the procedure is the same as before, but counting n-grams instead of single tokens.

The third and last text representation technique used was Term Frequency - Inverse Document Frequency (TF-IDF) [4]. This technique measures how important a word is within a text, i.e. each word is assigned a weight according to its importance. It should be noted that TF-IDF was used for the Stacking model and was applied to the initial text, whereas the previous techniques (BOW and n-grams) were used for individual models such as Support Vector Machines and Logistic Regression, among others.
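All three representations are available in sklearn; below is a minimal sketch of how they could be instantiated. The exact vectorizer settings are assumptions, since the paper does not report them:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

comments = ["este comentario es tóxico", "este comentario no lo es"]

# Bag of words: one column per token, entries are raw counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(comments)

# N-grams: the same counting procedure over bigrams and trigrams.
ngrams = CountVectorizer(ngram_range=(2, 3))
X_ngrams = ngrams.fit_transform(comments)

# TF-IDF: each term is weighted by its importance in the collection;
# in the paper this representation feeds the stacking model.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(comments)

print(X_bow.shape, X_ngrams.shape, X_tfidf.shape)
```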
2.3 Method

Having explained the techniques used to represent the comments, we now mention the different machine learning models that were used: Support Vector Machine (SVM), Decision Tree, Logistic Regression, Multi-layer Perceptron (MLP), Random Forest, Stacking and BETO. Since the Stacking and BETO models were submitted as runs in the competition, the strategy used in both is briefly explained below.

The Stacking model [5] is a combination of some of the previous models: specifically, Support Vector Machine (SVM), Decision Tree, Random Forest and Multi-layer Perceptron (MLP) were used as base models, with logistic regression as the meta-learner. In addition, for some of these methods the hyperparameters were tuned using the grid search technique [6].

Fig. 1. General structure of the Stacking system

As previously mentioned, class imbalance is a problem to be taken into account when obtaining the labels for the different types of toxicity (toxicity_level). Therefore, for all the previous models except BETO, the stratify parameter of the sklearn library was used when preparing the training data, which preserves the class proportions across the partitions.

Finally, the BERT [7] technique was also used, specifically the dccuchile/bert-base-spanish-wwm-uncased model from Hugging Face for comments in Spanish (BETO, https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased). BERT is a Transformer that uses an attention mechanism to learn the contextual relationships between words in a given text (in this case, comments). A Transformer comprises two structures: an encoder that reads the text input and a decoder that produces a task prediction. Since the goal of BERT is to generate a language model, only the encoder mechanism is needed. In this model, a maximum comment length of 200 and a batch size of 16 were used for training.

Fig. 2. BERT Model
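To make the stacking setup described above concrete, here is a minimal sketch using sklearn's StackingClassifier; the base-model hyperparameters and the grid searched below are assumptions, since the paper does not report the tuned values:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Four base models, with logistic regression as the meta-learner.
base_models = [
    ("svm", SVC(probability=True)),
    ("tree", DecisionTreeClassifier()),
    ("rf", RandomForestClassifier()),
    ("mlp", MLPClassifier(max_iter=500)),
]
stacking = StackingClassifier(estimators=base_models,
                              final_estimator=LogisticRegression())

# TF-IDF applied to the raw text feeds the stacking model (Section 2.2).
model = make_pipeline(TfidfVectorizer(), stacking)

# Hypothetical grid search over one base-model hyperparameter.
grid = GridSearchCV(model,
                    param_grid={"stackingclassifier__svm__C": [0.1, 1, 10]},
                    scoring="f1", cv=3)
# grid.fit(X_train, y_train); grid.best_params_
```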
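Likewise, a minimal sketch of fine-tuning BETO with the Hugging Face transformers library, using the maximum length of 200 and batch size of 16 mentioned above; the remaining training details are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "dccuchile/bert-base-spanish-wwm-uncased"  # BETO

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# num_labels=2 for the binary toxicity subtask (4 for toxicity_level).
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Encode a batch of comments, truncating/padding to the maximum length of 200.
comments = ["un comentario cualquiera", "otro comentario"]
batch = tokenizer(comments, truncation=True, padding="max_length",
                  max_length=200, return_tensors="pt")

# One training step (the paper's runs used batches of 16 comments).
labels = torch.tensor([0, 1])
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```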
3 Results

Having briefly described the models used, we move on to the results. It should first be noted that different evaluation measures were used to assess the quality of the models. In the first subtask, the detection of whether a message is toxic or not, the F1 score was used. In the second subtask, the detection of the level of toxicity of a text, more measures are used: CEM (Closeness Evaluation Metric), which is used for ordinal classification tasks; RBP (Rank-Biased Precision) [8], which is suitable when retrieving highly toxic comments from large texts; Pearson's correlation coefficient; and, finally, accuracy.

For the DETOXIS competition, since a maximum of 5 runs could be submitted, we decided to send the Stacking and BETO models. The results for the first subtask were as follows (compared to a baseline model, specifically a random classifier):

System      F1-Score
BETO        0.4632
Stacking    0.3893
RandomClf   0.3761

Table 1. Model performance on the test set (toxicity task)

The results for the second subtask were the following:

System      CEM     RBP     Pearson   Accuracy
BETO        0.6703  0.1037  0.2677    0.6936
Stacking    0.6258  0.0999  0.1529    0.7160
RandomClf   0.4382  0.0390  -0.0455   0.2278

Table 2. Model performance on the test set (toxicity levels task)

4 Conclusion and Future work

In the DETOXIS [9] ranking we finished 11th in the first subtask (toxicity) and 8th in the second one (toxicity_level). Throughout this project, different approaches could be adopted to obtain the best possible results when labelling comments with their corresponding toxicity values. However, due to lack of time, some of the improvements we had in mind could not be implemented. With this in mind, the possible improvements to this project are as follows:

1.- Since the accuracy of BETO is not entirely good and the classes are unbalanced (the non-toxic class has around 2000 more samples, out of 3958 comments, than the toxic class), we could predict the level of toxicity of a comment only when the model has previously predicted it as toxic. That is, we would first make predictions on the toxicity variable, store them in a new new_predictions column, and then work only with those comments that the model predicted as toxic.

2.- Perform balancing tasks before training the models.

3.- Use more variables, such as sarcasm, aggressiveness, etc., and not just the information in the comment itself.

All our code is available in our GitHub repository: https://github.com/alvaro-mazcu-herreros/DETOXIS_2021

References

1. Basile, V., Bosco, C., Fersini, E., Debora, N., Patti, V., Pardo, F. M. R., et al. (2019). SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation (pp. 54-63). Association for Computational Linguistics; Kumar, R., Ojha, A. K., Malmasi, S., & Zampieri, M. (2018). Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018) (pp. 1-11).
2. Aluru, S. S., Mathew, B., Saha, P., & Mukherjee, A. (2020). Deep learning models for multilingual hate speech detection. Indian Institute of Technology Kharagpur.
3. Zhang, Y., Jin, R., & Zhou, Z.-H. (2010). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics.
4. Gil, J.-M., & Kim, S.-W. (2019). Research paper classification systems based on TF-IDF and LDA schemes. Human-centric Computing and Information Sciences.
5. Alves, A. (2016). Stacking machine learning classifiers to identify Higgs bosons at the LHC. Journal of Instrumentation.
6. George, S., & Sumathi, B. (2020). Grid search tuning of hyperparameters in random forest classifier for customer feedback sentiment prediction. International Journal of Advanced Computer Science and Applications.
7. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv.
8. Moffat, A., & Zobel, J. (2008). Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems (TOIS), 27(1), 1-27.
9. Taulé, M., Ariza, A., Nofre, M., Amigó, E., & Rosso, P. (2021). Overview of the DETOXIS task at IberLEF-2021: DEtection of TOXicity in comments In Spanish. Procesamiento del Lenguaje Natural, Vol. 67.