Profiling Hate Spreaders using word N-grams
Notebook for PAN at CLEF 2021

Jorge Alcañiz¹, José Andrés¹
¹ Universitat Politècnica de València
joralvi1@inf.upv.es (J. Alcañiz); joanmo2@inf.upv.es (J. Andrés)

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract
With the rise of social media over the last decade, the amount of content published on the internet every day has become huge. Unfortunately, as the amount of published content grows, so does the amount of hate speech that can be found on social media. This fact motivates the creation of systems that can automatically detect these undesired behaviors in order to report them to the competent authorities. With this purpose, we have developed a system that detects users who could be considered hate spreaders, employing a TF-IDF vectorizer in combination with an SVM and achieving an accuracy of 81% on the Spanish dataset and 69% on the English dataset.

1. Introduction
Hate speech is commonly defined as the propaganda of ideas based on the superiority of a group of people because of their race, color or ethnic origin. This problem is not novel, as it has been present in our society for centuries, but due to the rise of social media it has reached unprecedented levels. Given the huge amount of content generated by users and the impossibility of manually checking all of it, the automatic detection of hate speech has become a relevant task. With this purpose, the aim of this competition is to automatically detect hate speech, but from an author profiling perspective instead of a per-tweet perspective. Therefore, we are interested in detecting which users could be considered hate spreaders.

This paper presents our participation in the Author Profiling task at PAN [1] for detecting hate spreaders [2]. Our method follows the ideas presented in [3], which focused on employing character and word n-grams as features and an SVM as classifier. Moreover, we have tried different classifiers and compared the accuracies obtained by each of them.

The rest of this paper is structured as follows: Section 2 describes the dataset used for this shared task, Section 3 presents the preprocessing that we have applied for each language, Section 4 presents our approach to the problem, the results obtained per model and a discussion of them, and finally Section 5 summarizes the paper and proposes possible future work.

2. Corpus
The dataset of this competition is composed of 200 authors for each language, where each author is represented by 200 tweets. Of the 200 authors, 100 are hate spreaders and the other 100 are not. Moreover, we would like to remark that all the URLs, links, hashtags and user mentions in the corpus were masked by a unique token for each type.

3. Preprocessing
The objective of this step is to reduce the vocabulary size by merging different token occurrences that refer to the same concept. For the preprocessing of the dataset we have followed these steps. First, as we are interested in detecting hate speech at the author level, we have concatenated all the tweets of each author into one single string. Then, we have converted the text to lowercase and replaced all the emojis and emoticons by the token "emoji" employing a regular expression. After that, we have applied a different linguistic preprocessing for each language: in the English dataset, we have replaced some contractions by their expanded form (for example, the token "you'd" has been replaced by "you would", the token "it's" has been replaced by "it is", etc.), while in the Spanish dataset we have replaced some words by their colloquial counterparts (for instance, the token "por" has been replaced by "x" and the token "que" has been replaced by "k"). Then, we have removed all the punctuation signs such as periods, commas, exclamation marks, etc. Moreover, we have reduced the different forms of expressing laughter, such as "hahahah", "ahahha", "jajajaja", "lol" and "lmao", to the token "haha". Finally, we have removed the stopwords and performed stemming on both datasets.
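To make these steps concrete, the sketch below shows one way this pipeline could be implemented. The paper does not specify a toolkit for this stage, so the use of NLTK for stopword removal and stemming, as well as the exact regular expressions and the (excerpted) replacement tables, are our own illustrative assumptions.

```python
# Minimal sketch of the preprocessing pipeline described in Section 3.
# The use of NLTK, the exact regular expressions and the excerpted
# replacement tables are illustrative assumptions, not the original code.
import re

from nltk.corpus import stopwords          # assumes nltk.download("stopwords")
from nltk.stem import SnowballStemmer

CONTRACTIONS_EN = {"you'd": "you would", "it's": "it is"}  # excerpt of the table
COLLOQUIAL_ES = {"por": "x", "que": "k"}                   # excerpt of the table
EMOJI_RE = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]|[:;]-?[()DPp]")
LAUGH_RE = re.compile(r"\b(?:[ha]{4,}|[ja]{4,}|lol|lmao)\b")

def preprocess(tweets, lang):
    # 1) Concatenate all tweets of an author into one string and lowercase it.
    text = " ".join(tweets).lower()
    # 2) Replace emojis and emoticons by a single token.
    text = EMOJI_RE.sub(" emoji ", text)
    # 3) Language-dependent normalization (contractions / colloquial forms).
    table = CONTRACTIONS_EN if lang == "en" else COLLOQUIAL_ES
    for src, tgt in table.items():
        text = re.sub(rf"\b{re.escape(src)}\b", tgt, text)
    # 4) Remove punctuation signs.
    text = re.sub(r"[^\w\s]", " ", text)
    # 5) Normalize the different ways of expressing laughter.
    text = LAUGH_RE.sub("haha", text)
    # 6) Remove stopwords and apply stemming.
    language = "english" if lang == "en" else "spanish"
    stemmer = SnowballStemmer(language)
    stops = set(stopwords.words(language))
    return " ".join(stemmer.stem(t) for t in text.split() if t not in stops)
```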
4. Our Approach
In this task we have considered the feature extraction process and the machine learning model estimation as a combined optimization process. Therefore, we have performed an extensive grid search to choose the best combination of hyper-parameters for the TF-IDF vectorizer and the different machine learning models employed. To assess the performance of our classifiers, we have performed a 10-fold cross-validation over the training dataset.

For feature extraction, we have employed a TF-IDF [4] vectorizer, which allows us to quantify the importance of every sequence of terms present in the corpus by multiplying the term frequency in the text by the inverse document frequency of the term in the corpus. The hyper-parameters of the vectorizer are the following: "analyzer", which denotes the level at which the feature extraction is performed (either word level or character level); "ngram_range", which denotes the order of the employed language model; and "min_df", which removes those n-grams whose document frequencies are lower than a given threshold. The values tested during the grid search are shown in Table 1.

Table 1
Hyper-parameters tested during grid search for the TF-IDF vectorizer.

Hyper-parameter   Tested values
analyzer          {'word', 'char', 'char_wb'}
ngram_range       {(1,1), (1,2), (2,2)}
min_df            {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}

Before discussing the selected classifiers, we would like to remark that we have decided to avoid employing deep learning for this task. This is due to the fact that we only have 200 samples (one per author) for each language, and given that deep learning is usually data-hungry, such models could easily overfit. Therefore, among all the possible machine learning models, we have chosen the following: logistic regression (LR) [5], Naive Bayes (NB) [6], Support Vector Machine (SVM) [7], Random Forest (RF) [8], multiple linear models trained with Stochastic Gradient Descent (SGD) [9] and K-nearest neighbors (KNN) [10]; a sketch of the combined search is shown after the list:

• Logistic regression: A basic algorithm for binary classification, equivalent to linear regression but passing the output through a logistic function. The hyper-parameters taken for this model are an L2 "penalty" with the liblinear "solver", and the regularization coefficient "C".
• Naive Bayes: A well-known technique which has been employed for tackling many information retrieval problems. For this model, the hyper-parameters used are the smoothing parameter "alpha" and "fit_prior", which denotes whether the model should learn the prior of every class.
• Support Vector Machine: A linear classification model that employs the maximal-margin hyperplane as decision boundary. This fact is relevant to our task, given that due to the moderate size of the dataset and the large number of extracted features, many possible decision boundaries exist. Moreover, this model also allows solving non-linear problems by applying an appropriate kernel function. The tuned hyper-parameters are the regularization coefficient "C" and the employed kernel.
• Random Forest: An ensemble model of multiple decision trees. The tuned hyper-parameters are "criterion", which measures the quality of every split, and "min_samples_leaf", which denotes the number of samples required to be a leaf node.
• Stochastic Gradient Descent classifier: An optimization technique that allows us to fit linear classifiers employing gradient descent. The hyper-parameters selected for this model are the following: the "loss" criterion, where each loss corresponds to a different linear classifier (for example, the "log" loss results in a logistic regression model trained with SGD), an L2 "penalty" and the regularization strength "alpha".
• K-nearest neighbors: A well-known non-parametric classifier where the class for each test sample is computed from a simple majority vote of the K nearest neighbors of that point. The tuned hyper-parameters for this model are the "weights", which can be uniform (each point is weighted equally) or proportional to the distance from the neighbors. Among all the possible distance metrics, we have tried the following: Euclidean distance, Manhattan distance and Minkowski distance. Finally, we have also tuned the number of neighbors "n_neighbors" to consider during the voting and the "leaf_size" of the underlying tree-based search structure.
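As an illustration of this combined search, the following sketch wires the TF-IDF vectorizer and one of the classifiers (the SVM) into a single scikit-learn grid search with 10-fold cross-validation. The pipeline layout and the variable names texts and labels are our assumptions; only the vectorizer grid (Table 1) and the use of scikit-learn are taken from the paper.

```python
# Minimal sketch of the combined vectorizer + classifier grid search,
# shown for the SVM; grids for the other models follow the same pattern.
# The variables texts and labels are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC()),
])

param_grid = {
    # Vectorizer grid from Table 1.
    "tfidf__analyzer": ["word", "char", "char_wb"],
    "tfidf__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__min_df": list(range(1, 12)),
    # Classifier grid (illustrative excerpt).
    "clf__C": [0.1, 1, 10, 100],
    "clf__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=10, scoring="accuracy", n_jobs=-1)
search.fit(texts, labels)  # texts: preprocessed author strings; labels: 0/1
print(search.best_params_, search.best_score_)
```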
Finally, we would like to remark that we have used scikit-learn [11] as the toolkit for all the employed machine learning models. Table 2 reports the best hyper-parameters found for each model and language, together with the accuracy estimated by 10-fold cross-validation.

Table 2
Best hyper-parameters and accuracy obtained performing a 10-fold cross-validation.

Model  Lang  Model hyper-parameters                                            TF-IDF hyper-parameters                   Acc. (%)
LR     EN    C: 100                                                            word, unigrams and bigrams, min_df: 11    68.50
LR     ES    C: 100                                                            word, unigrams and bigrams, min_df: 11    80.50
NB     EN    alpha: 0.25, fit_prior: False                                     word, unigrams and bigrams, min_df: 8     65.55
NB     ES    alpha: 0.25, fit_prior: True                                      word, unigrams and bigrams, min_df: 8     79.00
RF     EN    criterion: "gini", depth: 4, min_samples: 10                      word, unigrams and bigrams, min_df: 8     67.00
RF     ES    criterion: "gini", depth: 4, min_samples: 8                       word, unigrams and bigrams, min_df: 8     80.50
KNN    EN    weights: "distance", metric: "euclidean", neighbors: 5, leaf_size: 20    word, unigrams and bigrams, min_df: 8     65.00
KNN    ES    weights: "distance", metric: "euclidean", neighbors: 10, leaf_size: 20   word, bigrams, min_df: 9                  77.00
SGD    EN    alpha: 0.01, loss: "perceptron"                                   word, unigrams and bigrams, min_df: 9     70.00
SGD    ES    alpha: 0.001, loss: "hinge"                                       word, unigrams and bigrams, min_df: 9     80.00
SVM    EN    C: 0.1, kernel: "linear"                                          word, unigrams, min_df: 4                 70.50
SVM    ES    C: 1, kernel: "linear"                                            word, unigrams, min_df: 10                80.90

If we take a look at Table 2, it can be seen that linear models such as the SVM, logistic regression and the SGD classifier have performed particularly well at this task. This is due to the fact that the number of features is much larger than the number of samples, making it feasible to separate the two classes linearly. Among these models, the best performing one has been the SVM, achieving an accuracy of 80.90% on the Spanish dataset and 70.50% on the English dataset. Again, this is motivated by the fact that the SVM chooses, among all the possible separating hyperplanes, the one with the maximal margin. Other techniques such as Random Forest, Naive Bayes and KNN also performed well, but they did not reach the results obtained by the linear models.
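As a concrete example, the best Spanish configuration from Table 2 (SVM with C=1 and a linear kernel over word unigrams with min_df=10) could be re-evaluated as follows; the variables es_texts and es_labels are illustrative placeholders for the loaded corpus.

```python
# Minimal sketch: 10-fold CV estimate for the best Spanish configuration
# from Table 2 (SVM, C=1, linear kernel, word unigrams, min_df=10).
# es_texts / es_labels are illustrative placeholders for the loaded corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

best_es = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 1), min_df=10)),
    ("clf", SVC(C=1, kernel="linear")),
])
scores = cross_val_score(best_es, es_texts, es_labels, cv=10, scoring="accuracy")
print(scores.mean())  # Table 2 reports 80.90% for this configuration
```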
Figure 1: Accuracies obtained by each model during 10-fold CV.

As final model, we have chosen an SVM for both datasets, given that it is the classifier which has achieved the highest estimated accuracy in both languages. Finally, we have trained an SVM for each language, employing the full training dataset and the hyper-parameters described in Table 2. The results obtained in the competition with these models are the following:

Table 3
Train and test accuracies obtained employing an SVM.

Language   Estimated acc. (train, 10-fold CV)   Test acc.
ES         80.90%                               81%
EN         70.50%                               69%

It can be seen that our estimation of the performance of the system has worked reasonably well, with the test accuracies staying very close to the levels predicted during training. The test accuracy results were provided by the TIRA evaluation platform [12].
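A minimal sketch of this final per-language training step, using the best hyper-parameters from Table 2; the load_authors helper is hypothetical and stands in for whatever corpus-reading code is actually used.

```python
# Minimal sketch of the final training step: fit one SVM per language on the
# full training set with the best hyper-parameters from Table 2, then predict.
# load_authors() is a hypothetical helper, not part of the original system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

CONFIG = {
    "es": {"min_df": 10, "C": 1},
    "en": {"min_df": 4, "C": 0.1},
}

models = {}
for lang, cfg in CONFIG.items():
    train_texts, train_labels = load_authors(lang, split="train")  # hypothetical
    model = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 1),
                                  min_df=cfg["min_df"])),
        ("clf", SVC(C=cfg["C"], kernel="linear")),
    ])
    models[lang] = model.fit(train_texts, train_labels)

# At test time, each unseen author's concatenated tweets are classified, e.g.:
# prediction = models["es"].predict([preprocess(author_tweets, "es")])
```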
5. Conclusion
To sum up, we have described the methods employed for this task. We have detailed how our whole system works, from the preprocessing step to the estimation of the best hyper-parameters for the feature extractor and the machine learning models. We have also seen that our estimation of the accuracy employing a 10-fold CV is consistent with the test results. As future work, we would like to test an ensemble of different classifiers to see if it can beat the performance achieved by our SVM.

References
[1] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021.
[2] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter Task at PAN 2021, in: CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[3] J. Pizarro, Using n-grams to detect fake news spreaders on Twitter, in: CLEF, 2020.
[4] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation (1972).
[5] R. Pearl, L. J. Reed, On the rate of growth of the population of the United States since 1790 and its mathematical representation, Proceedings of the National Academy of Sciences of the United States of America 6 (1920) 275.
[6] M. E. Maron, J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, Journal of the ACM (JACM) 7 (1960) 216–244.
[7] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[8] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[9] L. Bottou, Online learning and stochastic approximations, On-line Learning in Neural Networks 17 (1998).
[10] B. W. Silverman, M. C. Jones, E. Fix and J. L. Hodges (1951): An important contribution to nonparametric discriminant analysis and density estimation: Commentary on Fix and Hodges (1951), International Statistical Review / Revue Internationale de Statistique (1989) 233–238.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[12] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.