HSSD: Hate Speech Spreader Detection using N-grams and Voting Classifier (Notebook for PAN at CLEF 2021)

Fazlourrahman Balouchzahi¹, Hosahalli Lakshmaiah Shashirekha² and Grigori Sidorov¹
¹ Center for Computing Research, Instituto Politécnico Nacional, CDMX, Mexico
² Department of Computer Science, Mangalore University, Mangalore, India

Abstract
Profane or abusive speech with the intention of humiliating and targeting individuals, a specific community, or groups of people is called Hate Speech (HS). Identifying and blocking HS content is only a temporary solution. Instead, developing systems that can detect and profile the content polluters who share HS is a better option. In this paper, we, team MUCIC, present the proposed Voting Classifier (VC) submitted to the Hate Speech Spreader Detection shared task organized by PAN 2021. The task involves profiling HS spreaders for two languages, namely English and Spanish, from text collected from Twitter. This task can be modeled as a binary text classification problem that classifies an author (Twitter user), based on his/her tweets, as 'Hate speech spreader' or 'Not'. The proposed models utilize a combination of traditional char and word n-grams with syntactic n-grams as features extracted from the training set. These features are fed to a VC that employs three Machine Learning (ML) classifiers, namely Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF), with hard and soft voting. With accuracies of 73% and 83% for the English and Spanish languages respectively, the proposed models obtained second rank in the shared task.

Keywords
Hate Speech Spreader, Machine Learning, N-grams, Voting Classifier

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Email: frs1_b@yahoo.com (F. Balouchzahi); hlsrekha@gmail.com (H. L. Shashirekha); sidorov@cic.ipn.mx (G. Sidorov)
Web: https://mangaloreuniversity.ac.in/dr-h-l-shashirekha (H. L. Shashirekha); http://www.cic.ipn.mx/~sidorov/ (G. Sidorov)
ORCID: 0000-0003-1937-3475 (F. Balouchzahi)

1. Introduction

Rapid dissemination, low cost, ease of access, and, more importantly, anonymity are the significant features of social media in the current era [1, 2, 3]. There are many religions, communities, groups of people, and their subdivisions in this world whose thoughts and beliefs vary from one another. Mutual tolerance and respect are essential for co-existence and peaceful living [4] on this earth. However, in some cases, one group's dogma can run counter to another's, creating panic and disturbance in society. With inimical intentions or just for fun, some users share HS and profane content over social media or even offline. Online HS content is all the more fearsome and troublesome due to the rapid dissemination of information on social media [5]. HS content usually originates from prejudiced people or groups who intend to discriminate against and target a race, a religion, or people of a particular sexual orientation, and it is noxious and harmful to society. Hence, the task of detecting HS and profiling its spreaders is becoming indispensable [6, 7] in order to curb the spread of HS and the damage it could cause to society.
Appropriate tools and benchmarked labeled corpora are required to address the challenges of HS detection and of profiling the spreaders [8]. To address these challenges, PAN [9] at the Conference and Labs of the Evaluation Forum (CLEF) 2021 has called for a shared task, Profiling Hate Speech Spreaders on Twitter [10], for two languages, namely English and Spanish. The datasets provided by PAN consist of texts collected from Twitter, and the task can be modeled as a binary Text Classification (TC) problem where a user, based on his/her tweets, is identified as 'HS spreader' or 'Not'. As one of the participating teams in this task, we, team MUCIC, have proposed an ensemble model that utilizes the strengths of three Machine Learning (ML) classifiers, namely Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF), as estimators to build a robust VC.

LR is an effective algorithm for binary and linear classification problems which models the probability of a discrete outcome from the input variables. Ease of implementation and strong performance on binary classification are its major strengths [11]. SVM is a supervised ML algorithm that has been widely used for classification and regression tasks. Its main strength lies in identifying the optimal boundary that most effectively separates the classes in the training data; it uses the kernel trick to transform the data and, based on these transformations, finds an optimal boundary between the possible outputs. While a single Decision Tree (DT) consists of a root and decision nodes built with a top-down greedy approach that splits the dataset into smaller subsets, RF is itself an ensemble learning model which employs a set of DTs and takes a majority vote over their predictions to determine the final prediction for a given input [12, 13].

Traditional n-grams are sets of co-occurring items or elements, such as characters, words, or Part-Of-Speech (POS) tags, as they appear in a text. The idea of syntactic n-grams (sn-grams), in contrast, is to follow a path in the syntactic tree to construct n-grams rather than taking them from the surface representation. In other words, the words that appear along a path in the syntactic tree are considered neighbors, so the real neighbors of words, based on syntactic relations [14, 15, 16], are extracted (a short illustration is given at the end of this section). To obtain the benefits of both n-gram structures, traditional char and word n-grams are extracted, combined with sn-grams into a feature set, and transformed into vectors using CountVectorizer to feed the VC model.

The rest of the paper is organized as follows: related work and methodology are discussed in Sections 2 and 3 respectively, followed by results in Section 4. The paper concludes with future work in Section 5.
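To make the contrast between surface n-grams and sn-grams concrete, the minimal sketch below extracts both kinds of bigrams from a toy sentence. It uses spaCy's dependency parser purely for illustration; spaCy, the model name, and the example sentence are assumptions of ours, while the submitted systems use the SNgramExtractor library described in Section 3.

```python
# Illustration only: surface bigrams vs. syntactic bigrams (sn-grams).
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The angry user posted hateful tweets")

# Surface bigrams: words that are adjacent in the linear order of the text.
surface_bigrams = [(a.text, b.text) for a, b in zip(doc, doc[1:])]

# Syntactic bigrams: head-child pairs, i.e., words that are neighbors
# in the dependency tree rather than in the surface string.
syntactic_bigrams = [(tok.head.text, tok.text) for tok in doc if tok.head is not tok]

print(surface_bigrams)    # [('The', 'angry'), ('angry', 'user'), ...]
print(syntactic_bigrams)  # e.g. [('user', 'The'), ('user', 'angry'), ('posted', 'user'), ...]
```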
2. Related Work

Most HS detection tasks are modeled as short-text TC, and HS detection has rarely been explored as a profiling task. Some of the recent works on HS detection and text profiling are reviewed here. Zimmerman et al. [17] proposed an ensemble of Deep Learning (DL) models for HS detection as well as Sentiment Analysis (SA) of tweets. The authors ensembled 10 Convolutional Neural Network (CNN) models by summing the softmax outputs of the underlying models and averaging them; the class with the highest average softmax score is assigned to the given tweet. Using publicly available embedding models, this approach was evaluated on two datasets, namely abusive speech [18] and SemEval 2013 SA [19], and obtained average F1-scores of 77.83 and 70.36 respectively, with a batch size and a number of epochs of 10 each.

The HASOC 2020 [20] shared task organized by the Forum for Information Retrieval Evaluation (FIRE) 2020 consists of two subtasks: i) a binary TC task where a given text should be categorized as HOF (containing HS content) or NOT (not offensive), and ii) a task where texts identified as HOF should be further classified into one of three categories, namely hate speech (HATE), OFFENSIVE, and PROFANITY. Datasets for this task were provided for three languages, namely English, Hindi, and German, as detailed in [20]. The overall results reported by HASOC show very competitive performance among the teams, with the difference between the best and average F1-scores being less than 0.04. As participants of HASOC 2020, Balouchzahi et al. [1] developed two models, namely an ensemble of ML classifiers (LR, SVM, and RF) and Universal Language Model Fine-Tuning (ULMFiT) based on Transfer Learning (TL) approaches. The authors also employed ULMFiT as an estimator alongside LR and RF. Texts were preprocessed by removing punctuation, stopwords, non-alphabetic and other unnecessary characters. The fast.ai (https://www.fast.ai/) and scikit-learn (https://scikit-learn.org/stable/) libraries were used to build the ULMFiT model and the ML classifiers, using a pre-trained LM and a combination of char and word n-grams respectively. For the first subtask, an ensemble of SVM, LR, and ULMFiT obtained F1-scores of 0.497 and 0.518 for English and Hindi respectively, and an ensemble of LR, SVM, and RF achieved an F1-score of 0.504 for German. The ULMFiT model submitted for the second subtask in English achieved an F1-score of 0.265.

Anusha and Shashirekha [21] ensembled three ML classifiers, namely Gradient Boosting, Random Forest, and eXtreme Gradient Boosting, as a VC with a soft voting configuration for HASOC 2020. After removing punctuation symbols, numeric data, stopwords, uninformative words, and frequently occurring words, features such as the numbers of words, characters, and punctuation marks and the lengths of the words were extracted from the training texts of all languages. Further, for English, the number of upper-case characters, the number of title words, and the frequency distribution of POS tags (i.e., noun, verb, adjective, adverb, and pronoun) were computed and used as additional features. These features were transformed into vectors using CountVectorizer and fed to the proposed model, which obtained F1-scores of 0.5046, 0.5106, and 0.5033 on the first subtask for English, German, and Hindi respectively, and 0.2596, 0.2595, and 0.2488 on the second subtask.

PAN at CLEF has gone further in identifying the content polluters who share HS, fake news, etc., and in distinguishing bots from humans, followed by gender detection and profiling. Examples include PAN 2018: Multimodal Gender Identification in Twitter [22], PAN 2019: Bots and Gender Profiling in Twitter [23], and PAN 2020: Profiling Fake News Spreaders on Twitter [24]. The task of profiling fake news spreaders on Twitter at PAN 2020 provided datasets for Spanish and English which include 100 tweets per user, with a total of 300 users per language as the training set and 100 users per language as the test set. Shashirekha et al. [2, 3] submitted two models, namely ULMFiT and an ensemble of ML classifiers as a VC, for this task. They scraped raw texts from Wikipedia for Spanish and English and applied basic preprocessing steps. The preprocessed texts were used to train a general-domain Language Model (LM), texts from the training set were used to fine-tune the LM, and the LM was finally employed to build the target model for detecting fake news spreaders. Similar to Balouchzahi et al. [1], the fast.ai library was used to build the LM and the target model. For the ML VC model, the training set was first preprocessed by eliminating stopwords and punctuation, converting emojis to text, and lemmatizing the words, followed by feature extraction. Unigram TF-IDF and n-gram TF combined with Doc2Vec were extracted as features and scaled with MaxAbsScaler. A combination of the Chi-square test, Mutual Information, and F-test algorithms was used to select important features, which were in turn used to train the proposed VC. As per the results reported by PAN, the ULMFiT and ML VC models obtained average accuracies of 0.63 and 0.70 respectively.
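As a side note, the filter-based feature selection combination described in [2, 3] can be sketched with scikit-learn as below. The rule for merging the three tests (here, a union of the top-k features of each) and the value of k are illustrative assumptions of ours, not the exact configuration used in that work.

```python
# A minimal sketch of combining chi-square, mutual information, and F-test
# feature selection, assuming sparse non-negative count features and binary labels.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

def select_features(X, y, k=1000):
    """Return the column indices kept by at least one of the three tests."""
    keep = set()
    for score_fn in (chi2, mutual_info_classif, f_classif):
        selector = SelectKBest(score_func=score_fn, k=min(k, X.shape[1]))
        selector.fit(X, y)
        keep.update(np.flatnonzero(selector.get_support()))
    return np.sort(np.array(list(keep)))

# Usage: X_train is an (n_users, n_features) matrix from a vectorizer,
# y_train the binary spreader labels (both assumed to exist).
# cols = select_features(X_train, y_train)
# X_train_selected = X_train[:, cols]
```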
3. Methodology

The significance of ensembling ML models lies in combining the strengths and covering the weaknesses of the individual classifiers. Building on this idea, a VC model with three ML estimators, namely SVM, LR, and RF, is developed, using hard and soft voting configurations for English and Spanish respectively. The ML models used in the proposed VC were chosen because of their strong performance on binary classification, as demonstrated in the available literature and in our own experiments. While RF, which is itself an ensembling method, utilizes 10,000 decision trees as estimators, SVM uses a linear kernel. The remaining parameters of these two models and all parameters of the LR estimator are set to their defaults.

As a preprocessing step for English, texts are stripped, hashtags such as USER, URL, and RT are removed, and all words are converted to lower case. Preprocessing is skipped for Spanish, however, as our experiments performed better without it. A feature extraction module, shown in Figure 1, is used to extract char (2, 3, 4, 5) and word (2, 3) n-grams and sn-grams (2, 3). The SNgramExtractor library (https://pypi.org/project/SNgramExtractor/) is used to extract sn-grams from the English and Spanish texts. The extracted features are transformed into vectors using CountVectorizer. Figure 2 illustrates the structure of the proposed VC model, and a code sketch of the pipeline is given at the end of this section.

Figure 1: Feature Extraction module
Figure 2: Structure of the proposed VC model
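The pipeline just described can be sketched with scikit-learn roughly as follows. The n-gram ranges, the linear-kernel SVM, the 10,000-tree RF, the default LR, and the hard/soft voting choice mirror the description above; the preprocessing regex, the XML reader, and the omission of the SNgramExtractor sn-gram features (which the submitted models combine with the n-gram counts) are simplifications and assumptions of ours.

```python
# A sketch of the proposed pipeline under stated assumptions: char (2-5) and
# word (2-3) n-gram counts feeding a voting ensemble of SVM, LR, and RF.
# The sn-gram features produced by SNgramExtractor are omitted here.
import re
import xml.etree.ElementTree as ET

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

def read_user(xml_path):
    """Concatenate a user's tweets from one PAN XML file (assumed layout:
    <author>...<documents><document>tweet</document>...</documents></author>)."""
    root = ET.parse(xml_path).getroot()
    return " ".join(doc.text or "" for doc in root.iter("document"))

def preprocess_en(text):
    """English-only preprocessing: drop USER/URL/RT placeholders, lowercase, strip."""
    return re.sub(r"#?\b(USER|URL|RT)\b#?", " ", text).lower().strip()

# Char and word n-gram counts, horizontally stacked into one sparse matrix.
features = FeatureUnion([
    ("char_ngrams", CountVectorizer(analyzer="char", ngram_range=(2, 5))),
    ("word_ngrams", CountVectorizer(analyzer="word", ngram_range=(2, 3))),
])

voter = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", probability=True)),  # probability enables soft voting
        ("lr", LogisticRegression()),
        ("rf", RandomForestClassifier(n_estimators=10000)),
    ],
    voting="hard",  # "hard" was used for English, "soft" for Spanish
)

model = Pipeline([("features", features), ("clf", voter)])

# Usage sketch (train_files, train_labels, and test_files are assumed to exist):
# docs = [preprocess_en(read_user(path)) for path in train_files]
# model.fit(docs, train_labels)
# predictions = model.predict([preprocess_en(read_user(p)) for p in test_files])
```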
4. Experimental Results

4.1. Dataset

The datasets provided by PAN consist of a training set of 200 XML files for each language, where each XML file represents a user with 200 tweets. The test set consists of 100 XML files per language, and the proposed models should identify whether a user (represented by an XML file) is a 'HS spreader' or 'Not' based on an analysis of the tweets given in the XML file. The details of the training data along with the label distribution, presented in Figure 3, show that the dataset is completely balanced.

Figure 3: Distribution of labels in the training data provided by PAN

4.2. Results

PAN uses the TIRA Integrated Research Architecture submission system [25], which provides shared task participants with a Virtual Machine (VM) through which they can submit and evaluate their proposed models. As PAN encourages early-bird submission of models, the initial models of the proposed approach were submitted through TIRA; due to technical issues, the final model and its predictions on the test set (labels) were submitted by mail. The performance of the models is evaluated by the task organizers using the accuracy metric, and the results on the shared task website (https://pan.webis.de/clef21/pan21-web/author-profiling.html) show that the VC model obtained accuracies of 83% and 73% for the Spanish and English texts respectively. The performances of the best teams, presented in Table 1, are very competitive, and our proposed models (listed as MUCIC) obtained second rank in the shared task. The highest accuracies reported for Spanish and English are 85% and 74% respectively.

Table 1
Best performing teams in the shared task

Team            English   Spanish   Average
SiinoDiNuovo    73.0      85.0      79.0
MUCIC           73.0      83.0      78.0
tamayo          74.0      82.0      78.0
andujar         72.0      82.0      77.0
anitei          72.0      82.0      77.0
anwar           72.0      82.0      77.0

5. Conclusion and Future Work

Continuing its series of challenges in text processing, PAN 2021 called for a shared task to detect Hate Speech Spreaders in English and Spanish tweets. Team MUCIC tackled this challenge by building a robust VC over ML classifiers, using traditional char and word n-grams along with syntactic n-grams as features to train the VC model. Our team obtained second rank in the shared task with an average accuracy of 78%. As future work, we would like to explore more feature sets with ML models and also experiment with DL and TL approaches.

6. Acknowledgment

Team MUCIC deeply appreciates the efforts, guidance, and support of the shared task organizers, and thanks the reviewers for their valuable comments and suggestions.

References

[1] F. Balouchzahi, H. L. Shashirekha, LAs for HASOC - learning approaches for hate speech and offensive content identification, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 145–151. URL: http://ceur-ws.org/Vol-2826/T2-6.pdf.

[2] H. L. Shashirekha, F. Balouchzahi, ULMFiT for Twitter fake news spreader profiling, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_126.pdf.

[3] H. L. Shashirekha, M. D. Anusha, N. S. Prakash, Ensemble model for profiling fake news spreaders on Twitter, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_136.pdf.

[4] V. Sinha, Theorising 'talk' about 'religious pluralism' and 'religious harmony' in Singapore, Journal of Contemporary Religion 20 (2005) 25–40.

[5] C. Bosco, D. Felice, F. Poletto, M. Sanguinetti, T. Maurizio, Overview of the EVALITA 2018 hate speech detection task, in: EVALITA 2018 - Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, CEUR, 2018, pp. 1–9.
[6] V. Basile, C. Bosco, E. Fersini, N. Debora, V. Patti, F. M. R. Pardo, P. Rosso, M. Sanguinetti, et al., SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter, in: 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, 2019, pp. 54–63.

[7] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM Computing Surveys (CSUR) 51 (2018) 1–30.

[8] F. Poletto, V. Basile, M. Sanguinetti, C. Bosco, V. Patti, Resources and benchmark corpora for hate speech detection: a systematic review, Language Resources and Evaluation (2020) 1–47.

[9] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021.

[10] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter Task at PAN 2021, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.

[11] A. Subasi, Practical Machine Learning for Data Analysis Using Python, Academic Press, 2020.

[12] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Science & Business Media, 2009.

[13] K. Kirasich, T. Smith, B. Sadler, Random forest vs logistic regression: binary classification for heterogeneous datasets, SMU Data Science Review 1 (2018) 9.

[14] G. Sidorov, Continuous and noncontinuous syntactic n-grams, in: Syntactic n-grams in Computational Linguistics, Springer, 2019, pp. 63–67.

[15] G. Sidorov, Syntactic dependency based n-grams in rule based automatic English as second language grammar correction, International Journal of Computational Linguistics and Applications 4 (2013) 169–188.

[16] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, L. Chanona-Hernández, Syntactic n-grams as machine learning features for natural language processing, Expert Systems with Applications 41 (2014) 853–860.

[17] S. Zimmerman, U. Kruschwitz, C. Fox, Improving hate speech detection with deep learning ensembles, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.

[18] Z. Waseem, D. Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter, in: Proceedings of the NAACL Student Research Workshop, 2016, pp. 88–93.

[19] P. Nakov, S. Rosenthal, Z. Kozareva, V. Stoyanov, A. Ritter, T. Wilson, SemEval-2013 task 2: Sentiment analysis in Twitter, in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 312–320. URL: https://www.aclweb.org/anthology/S13-2052.

[20] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: Forum for Information Retrieval Evaluation, 2020, pp. 29–32.
[21] M. D. Anusha, H. L. Shashirekha, An ensemble model for hate speech and offensive content identification in Indo-European languages, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 253–259. URL: http://ceur-ws.org/Vol-2826/T2-20.pdf.

[22] F. Rangel, P. Rosso, M. Montes-y-Gómez, M. Potthast, B. Stein, Overview of the 6th author profiling task at PAN 2018: Multimodal gender identification in Twitter, Working Notes Papers of the CLEF (2018) 1–38.

[23] F. Rangel, P. Rosso, Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling in Twitter, in: Working Notes Papers of the CLEF 2019 Evaluation Labs, volume 2380 of CEUR Workshop Proceedings, 2019.

[24] F. Rangel, A. Giachanou, B. Ghanem, P. Rosso, Overview of the 8th author profiling task at PAN 2020: Profiling fake news spreaders on Twitter, in: CLEF, 2020.

[25] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA integrated research architecture, in: Information Retrieval Evaluation in a Changing World, Springer, 2019, pp. 123–160.