UniNE at PAN-CLEF 2021
(Notebook for PAN at CLEF 2021)

Catherine Ikae1
1
    University of Neuchâtel, Switzerland, Avenue du 1er-Mars 26, 2000 Neuchâtel, Switzerland


                                         Abstract
                                         The paper describes the work done on the PAN 2021 task about profiling Hate Speech Spreaders in
                                         Spanish and English messages extracted from Twitter. We implement a simple Ensemble Classifier class
                                         that allows us to combine seven different machine-learning classifiers, which predict a class by simply
                                         taking the majority rule of the predictions by the classifiers. We also propose a reduced set of features
                                         that are obtained by considering terms with df > 3 and tf > 1 thereby eliminating terms that only appear
                                         once in the corpus. The features are ranked according to their term difference in each category. Each
                                         category contributes an equal number of features to the classification task. With 800 features from each
                                         class, our model achieves an accuracy of 0.66 for the English dataset and 0.81 for the Spanish dataset
                                         attaining an average score of 0.735.

                                         Keywords
                                         Author profiling, Ensemble classifier, Two-step feature selection, profiling Hate Speech Spreaders


1. Introduction
Hate Speech is defined as any communication that disparages a person or a group based on
some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion,
or other characteristics as defined in the Encyclopedia of the American Constitution [1]. Hate
speech leads to discrimination against particular categories of people and undermines equality,
which is a big issue for each civil society as explained by [2].
   Given the large number of people using social media such as Twitter, Facebook as a means of
communication and sharing ideas, which are of great benefit to humanity since information
shared can reach a big audience in a short time. However, this benefit is not without challenges,
these channels of communication have also been exploited to propagate hate speech and spread
false news resulting in hate crimes [3]. The ubiquity of social networks and the low cost of
using them render the propagation of hate speech a real concern for our society.
   The lack of editorial control over the spread of hate speech has the potential to harm and
damage the targeted members of the society. One can also add that the involved companies
do not propose (or do not want to propose) a real control on the flow of information using
their networks. Hence, a real need for an automatic mechanism to identify the presence of hate
speech spreaders is an important research topic.
   A large number of researchers have been drawn to this area to develop automated methods of
hate speech detection [4]. This has become a major natural language processing (NLP) research
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" catherine.ikae@unine.ch (C. Ikae)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
topic in the recent years as, for example, [5] who suggest an evaluation task focusing on hate
speech. Therefore, this study contributes to this area of research by applying the two-step
feature selection technique [6] and an ensemble of ML classifiers on PAN-CLEF 2021 hate speech
datasets.


2. Corpus: Overall Statistics
The training corpus was available in the English and Spanish languages. The English training
dataset had 100 documents(authors) of label 0 (normal set of tweets) and 100 documents(authors)
of label 1 (tweets containing some form of hate speech). The Spanish training dataset had 100
documents(authors) of label 0 and 100 documents(authors) of label 1. Each of these documents
contained 200 tweets [7]. A document refers to a set of tweets by an author.
   The mean length of tokens per English document with label 0 is 3,270, with the maximum
number of tokens in a document being 4,231 and minimum is 2,057. Those with label 1 mean
length of tokens per document is 3,303, with the maximum number of tokens in a document
being 4,637 and minimum is 2,058.
   For the Spanish documents, the mean length of tokens per document with label 0 is 3,198,
With maximum number of tokens in a document being 4,256 and minimum is 2,134 and with
label 1 is 3,565, with maximum number of tokens in a document being 4,246 and minimum is
2,374.

Table 1
Overall statistics about the training data in both languages
                                        English                               Spanish
                                    0             1                       0             1
       Nb. doc.                    100           100       Nb. doc.      100           100
       Nb tweets                 20,000        20,000     Nb tweets    20,000        20,000
       Mean length                3,270         3,303    Mean length    3,198         3,565
       |Voc|                     14,103        13,338       |Voc|      24,310        23,722

  Example of tweets from each of the classes and languages are as shown in Table 2 and Table 3


3. Ensemble Classifier
An ensemble classifier is one that stacks several performing classifiers to come up with the best
prediction from the combined classifiers [8]. The advantage of combining classifiers is to take
advantage of the good performance from a set of classifiers that will result in better prediction
results as compared to a single classifier on its own.
  For our task, we consider the following classifiers namely:
   1. Linear Discriminant Analysis (LDA) finds a linear combination of features that separates
      two or more classes of objects in order to classify them [9];
   2. Gradient Boosting (GB) which modifies weak learners into strong learners [10];
Table 2
Sample of three tweets in English for each class
                                                   English


 Class 0


 Class 1


Table 3
Sample of three tweets in Spanish for each classes
                                                   Spanish


 Class 0


 Class 1


   3. Extremely Randomized Trees (Extra Trees ET) is an ensemble learning technique which
      aggregates the results of multiple de−correlated decision trees constructed from the
      original training sample to obtain its classification result [11];
   4. Gaussian Naive Bayes (G_NB) a variant of Naive Bayes that follows Gaussian normal
      distribution [12];
   5. Bernoulli Naive Bayes (B_NB) [12] performs classification by assuming each feature to
      be a binary-valued (Bernoulli, Boolean);
   6. Random Forest (RF) an ensemble of decision trees that combines learning models to
      increases classification accuracy [13];
   7. AdaBoost creates a strong classifier from a number of weak classifiers [14].
We used the scikit-learn Python machine learning library that provides an implementation of
stacking for machine learning [8] to integrate these classifiers.


4. Feature Selection
Good classification results come from a good feature set generated from that training dataset.
The features are also used to understand and explain the difference between the hate speech
spreaders and other users. We propose a technique that is capable of reducing the features
space, the two-stage feature selection strategy [6].
   The two-stage feature selection strategy works by considering tokens according to their
document frequency (df) and term frequency difference. A threshold of three (df > 3) and
(tf > 1) was used for this task. With these two constraints, we create a feature set capable of
distinguishing each category. From the reduced number of tokens obtained by applying df > 3,
a term frequency difference is computed but only tokens with a term frequency greater than 1
(tf > 1) are put into consideration to leave out those tokens that appear only once in the text.
   With 70 documents taken from class 0 and 70 from class 1, we create our training set and
the remaining 30 from class 0, 30 from class 1 is used as the test set. The features extracted
from this selection is as described below: We begin our feature selection with 11723 features
from class 0 and 11161 features from class 1. The features are reduced to 5970 for class 0 and
5799 for class 1 by considering only terms with tf >1. The final reduced set is obtained by using
frequency difference between the tokens from the two classes and checking if it has a df > 3.
With the features ranked according to their term frequency difference in each class, the number
of features used in creating the model can be selected in descending order from each category.

Table 4
Two-step feature selection for English and Spanish dataset
                                          English                              Spanish
                             All      tf >1    tf diff and df >3    All    tf >1    tf diff and df >3
 vocubulary_0               11723     5970            2932         19311   8135            3230
 vocubulary_1               11161     5799            2855         19030   8519            3596
 Total number of features                             5787                                 6826

   Using Shift Graphs [15], words are sorted by their absolute contribution to the difference
between classes. The shift graphs are created with the document frequencies of the resulting
features Figure 1 and Figure 2. With sys 2 belonging to class 1 and sys 1 to class 0. Words with
high discriminating power in a class are shown at the top of the chart with longer bars and
those with lower discriminating power have shorter bars. The bars represent the document
frequency difference between classes. The same approach is used in the entire training dataset
to obtain features that will be used to create the model for testing.


5. Evaluation
To train our model, features are extracted from the training documents by taking into account
the steps explained in section 4. Features to be considered must have a tf > 1 and df > 3 from
which the ranking is done according to the difference in term frequencies. The k feature set at
each selection picks equal values from each subset, that is half the feature from class 0 and the
other half from class 1.
  The value of k is increased from 200 (100 from class 0 and 100 from class 1) to 2000. The
accuracy of several classifiers are computed as shown in the table 5 and table 6. It was easy to
analyse the performance of the classifiers where we can see that an increase in the number of
features also had an increase in the accuracy of the classification. Taking an example of the
Figure 1: Shift graph for the                    Figure 2: Shift graph for the
English dataset                                  Spanish dataset


LDA line one of the Table 5 and Table 6 shows increased accuracy from k = 200 to k = 1600
where we get a maximum accuracy of 0.82. An increase in the features at this point decreases
the accuracy to 0.75 at k = 2000. Since not all classifiers produced similar results as the LDA, an
ensemble of two classifiers was built with LDA and G_NB that gave an overall performance
best accuracy at k = 1600 (800 from class 0 and 800 from class 1).
   Table 7 depicts the accuracy rate achieved with our model under different conditions and
for both languages. In the first row, 800 words from class 0 and 800 words from class 1 have
been used to build the document surrogates and an ensemble of only two classifiers is used for
classification giving an average score of 0.655. In the second line, the vocabulary size is kept
the same but an ensemble of seven classifiers are used namely: Linear Discriminant Analysis,
Gradient Boosting, Extremely Randomized Trees, Gaussian Naive Bayes, Bernoulli Naive Bayes,
Random Forest and AdaBoost resulting into an average score on 0.735.
Table 5
Evaluation based on different feature sizes
                                                 English
   Classifiers                200    400      600    800   1000   1200   1400   1600   1800   2000
   LDA                        0.62   0.55     0.57 0.58    0.60   0.60   0.60   0.65   0.68   0.65
   GaussianPro                0.55   0.55     0.55 0.55    0.55   0.55   0.55   0.55   0.55   0.55
   GradientBoost              0.65   0.55     0.60 0.65    0.58   0.53   0.48   0.48   0.47   0.48
   ExtraTrees                 0.68   0.60     0.58 0.58    0.57   0.67   0.55   0.68   0.58   0.63
   KNN                        0.52   0.48     0.50 0.48    0.50   0.48   0.50   0.50   0.52   0.52
   GaussianNB                 0.50   0.50     0.62 0.60    0.63   0.68   0.73   0.68   0.68   0.70
   MultinomialNB              0.50   0.50     0.50 0.50    0.50   0.50   0.50   0.50   0.50   0.50
   BernoulliNB                0.50   0.57     0.58 0.62    0.62   0.60   0.63   0.65   0.62   0.63
   DecisionTree               0.60   0.50     0.57 0.47    0.50   0.57   0.57   0.58   0.55   0.55
   RandomForest               0.65   0.53     0.63 0.55    0.58   0.60   0.50   0.55   0.65   0.60
   LogisticReg                0.53   0.55     0.55 0.55    0.55   0.55   0.55   0.55   0.55   0.55
   MLP                        0.50   0.52     0.52 0.53    0.53   0.55   0.55   0.55   0.55   0.57
   AdaBoost                   0.58   0.53     0.50 0.55    0.60   0.52   0.52   0.57   0.57   0.57
   Bagging                    0.58   0.53     0.58 0.52    0.62   0.67   0.58   0.58   0.58   0.58
   SGD                        0.52   0.52     0.55 0.52    0.45   0.52   0.53   0.55   0.52   0.52
   XGB                        0.62   0.53     0.63 0.57    0.58   0.55   0.53   0.50   0.53   0.53
   SVM                        0.57   0.57     0.57 0.58    0.58   0.58   0.58   0.58   0.58   0.58
   Ensemble(LDA + G_NB)       0.57   0.52     0.62 0.62    0.63   0.68   0.67   0.72   0.70   0.58


6. Conclusion
The paper describes the machine learning ensemble approach for hate speech spreaders detection
task. We proposed an ensemble based on Linear Discriminant Analysis, Gradient Boosting,
Extremely Randomized Trees, Gaussian Naive Bayes, Bernoulli Naive Bayes, Random Forest
and AdaBoost classifiers. The resulting performance gave us an accuracy of about 0.66 for the
English dataset and 0.81 for the Spanish dataset. Our approach is capable of distinguishing
hate/non-hate speech spreaders since the features set used in the classification are drawn
from both classes in equal numbers. A term frequency difference is used to determine the
discriminating power of each feature in the class. These features indicate the difference that
exists between the two classes and are ranked according to their frequency difference. For
future work, the idea is to compare chi2 and mutual information feature ranking with the hope
of boosting the feature selection.
Table 6
Evaluation based on different feature sizes
                                                 Spanish
   Classifiers                200     400     600    800   1000    1200     1400   1600   1800     2000
   LDA                        0.61    0.63    0.63 0.77    0.77    0.82     0.78   0.82   0.80     0.75
   GaussianPro                0.60    0.60    0.60 0.60    0.60    0.60     0.60   0.60   0.60     0.60
   GradientBoost              0.72    0.72    0.68 0.72    0.73    0.72     0.70   0.75   0.73     0.73
   ExtraTrees                 0.77    0.77    0.75 0.72    0.72    0.72     0.73   0.77   0.73     0.70
   KNN                        0.60    0.60    0.60 0.60    0.58    0.58     0.58   0.58   0.58     0.58
   GaussianNB                 0.72    0.73    0.73 0.75    0.78    0.78     0.72   0.77   0.75     0.75
   MultinomialNB              0.50    0.50    0.50 0.50    0.50    0.50     0.50   0.50   0.50     0.50
   BernoulliNB                0.72    0.72    0.72 0.70    0.72    0.72     0.73   0.72   0.70     0.70
   DecisionTree               0.60    0.65    0.65 0.57    0.58    0.58     0.60   0.60   0.70     0.65
   RandomForest               0.73    0.73    0.77 0.68    0.73    0.78     0.70   0.73   0.75     0.77
   LogisticReg                0.58    0.58    0.60 0.60    0.60    0.60     0.60   0.60   0.60     0.60
   MLP                        0.63    0.63    0.65 0.65    0.65    0.65     0.65   0.65   0.67     0.67
   AdaBoost                   0.63    0.70    0.73 0.67    0.73    0.62     0.75   0.70   0.63     0.72
   Bagging                    0.65    0.63    0.65 0.72    0.70    0.68     0.72   0.73   0.65     0.70
   SGD                        0.63    0.47    0.58 0.48    0.48    0.50     0.57   0.57   0.55     0.60
   XGB                        0.68    0.72    0.73 0.75    0.75    0.77     0.75   0.77   0.73     0.72
   SVM                        0.58    0.60    0.60 0.60    0.60    0.60     0.60   0.60   0.60     0.58
   Ensemble(LDA + G_NB)       0.68    0.68    0.70 0.80    0.78    0.82     0.76   0.82   0.78     0.73

Table 7
Official Evaluation with (k = 1600)
                                    TIRA Test Results
                                       ENGLISH                    SPANISH          Average Score
      Ensemble (LDA + G_NB)               0.57                      0.74               0.655
      Ensemble (LDA + G_NB + B_NB +       0.66                      0.81               0.735
      GB + ET + RF + ADB)


References
 [1] Encyclopedia of the American Constitution, 2nd ed. / adam winkler, associate editor for
     the second edition. ed., Macmillan Reference USA, New York, 2000.
 [2] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM
     Computing Surveys (CSUR) 51 (2018) 1 – 30.
 [3] S. Abro, S. Shaikh, Z. H. Khand, Z. Ali, S. Khan, G. Mujtaba, Automatic hate speech detection
     using machine learning: A comparative study, International Journal of Advanced Computer
     Science and Applications 11 (2020). URL: http://dx.doi.org/10.14569/IJACSA.2020.0110861.
     doi:10.14569/IJACSA.2020.0110861.
 [4] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language
     processing, in: Proceedings of the Fifth International Workshop on Natural Language
     Processing for Social Media, Association for Computational Linguistics, Valencia, Spain,
     2017, pp. 1–10. URL: https://www.aclweb.org/anthology/W17-1101. doi:10.18653/v1/
     W17-1101.
 [5] Ò. Garibo i Orts, Multilingual detection of hate speech against immigrants and women
     in Twitter at SemEval-2019 task 5: Frequency analysis interpolation for hate in speech
     detection, in: Proceedings of the 13th International Workshop on Semantic Evaluation,
     Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 460–
     463. URL: https://www.aclweb.org/anthology/S19-2081. doi:10.18653/v1/S19-2081.
 [6] C. Ikae, S. Nath, J. Savoy, Unine at pan-clef 2019: Bots and gender task, in: CLEF, 2019.
 [7] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech
     Spreaders on Twitter Task at PAN 2021, in: CLEF 2021 Labs and Workshops, Notebook
     Papers, CEUR-WS.org, 2021.
 [8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
     M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
     Learning Research 12 (2011) 2825–2830.
 [9] P. Xanthopoulos, P. M. Pardalos, T. B. Trafalis, Linear Discriminant Analysis, Springer New
     York, New York, NY, 2013, pp. 27–33. URL: https://doi.org/10.1007/978-1-4419-9878-1_4.
     doi:10.1007/978-1-4419-9878-1_4.
[10] R. E. Schapire, The strength of weak learnability, in: Machine Learning, 1990.
[11] P. Geurts, Extremely randomized trees, in: MACHINE LEARNING, 2003, p. 2006.
[12] T. Ngo, Data mining: Practical machine learning tools and technique, third edition by
     ian h. witten, eibe frank, mark a. hell, SIGSOFT Softw. Eng. Notes 36 (2011) 51–52. URL:
     https://doi.org/10.1145/2020976.2021004. doi:10.1145/2020976.2021004.
[13] L. Breiman, 1 random forests–random features, 1999.
[14] T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning – data mining,
     inference, and prediction, ????
[15] R. J. Gallagher, M. Frank, L. Mitchell, A. J. Schwartz, A. J. Reagan, C. Danforth, P. Dodds,
     Generalized word shift graphs: a method for visualizing and explaining pairwise compar-
     isons between texts, EPJ Data Science 10 (2021) 1–29.