=Paper= {{Paper |id=Vol-3159/T1-27 |storemode=property |title=Hate and Offensive Language Detection using BERT for English Subtask A |pdfUrl=https://ceur-ws.org/Vol-3159/T1-27.pdf |volume=Vol-3159 |authors=Md Saroar Jahan,Djamila Romaissa Beddiar,Mourad Oussalah,Nabil Arhab,Yazid Bounab |dblpUrl=https://dblp.org/rec/conf/fire/JahanB0AB21 }} ==Hate and Offensive Language Detection using BERT for English Subtask A== https://ceur-ws.org/Vol-3159/T1-27.pdf
Hate and Offensive Language Detection using BERT for English Subtask A
Md Saroar Jahan, Djamila Romaissa Beddiar, Mourad Oussalah, Nabil Arhab and Yazid Bounab
University of Oulu, Faculty of Information Tech., CMVS, PO Box 4500, Oulu 90014, FINLAND


Abstract
This paper presents the results and main findings of our participation in the HASOC-2021
Hate/Offensive Language Identification Subtask A. The work consisted of fine-tuning pre-trained
transformer networks such as BERT and building ensembles of different models, including CNN and
BERT. We used the HASOC-2021 English dataset of 3.8k annotated tweets. We compare current
pre-trained transformer networks, with and without Masked-Language-Modelling (MLM) fine-tuning,
on their performance for offensive language detection. Among the different MLM fine-tuned BERT
variants, BERT-base, BERT-large, and ALBERT outperformed the other models; however, an ensemble
of BERT and CNN classifiers that applies majority voting performed best, achieving an 85.1% F1
score over both hate/non-hate labels. Our final submission achieved a 77.0 F1 score in the
HASOC-2021 competition.

Keywords
BERT fine-tuning, Offensive language identification, Hate speech, BERT performance comparison.




1. Introduction
The emergence of Web 2.0 platforms that enabled user-generated content and participatory
culture has been accompanied by a proliferation of online hate speech at an unprecedented level,
increasing the likelihood that people of any age group become subject to online harassment and
abuse through internet forums, message boards, or social network platforms. Hate speech
is a complex phenomenon, intrinsically associated with relationships between groups, and
relies on language nuances. Nobata et al. [1] define hate speech as "Language which attacks
or demeans a group based on race, ethnic origin, religion, gender, age, disability, or sexual
orientation/gender identity".
   In the past few years, the automatic detection of hate speech, cyber-bullying, or aggressive
and offensive language has become an actively studied task in natural language processing (NLP).
Past research has examined various forms of offensive language, such as cyber
aggression [2, 3], abusive language [1, 4], hate speech [5, 6], racism [7], and offensive language
[8, 9].

FIRE 2021: Forum for Information Retrieval Evaluation, 13-17 December, India
mjahan18@edu.oulu.fi (M. S. Jahan); Djamila.Beddiar@oulu.fi (D. R. Beddiar); mourad.oussalah@oulu.fi
(M. Oussalah); Nabil.Arhab@oulu.fi (N. Arhab); yazid.bounab@oulu.fi (Y. Bounab)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

   Several workshops (e.g., SemEval-2019 [10], SemEval-2020 [11], HASOC-2019 [12], and
HASOC-2020 [13]) have been organized to find state-of-the-art practices and new solutions for
efficient offensive text identification.
   For example, in SemEval-2019, Task A (offensive language detection) was the most popular
sub-task, with 104 participating teams. The top-performing team, Liu et al. [14], used
BERT-base-uncased with default parameters, a maximum sentence length of 64, and 2 training
epochs, and achieved an 82.9% F1 score. The top non-BERT model, by Mahata et al. [15], was ranked
fifth; it used an ensemble of CNN and BLSTM+BGRU.
   In SemEval-2020, 145 teams submitted official runs. The best team, Wiedemann et al. [16],
achieved an F1 score of 0.9204 using an ensemble of ALBERT models of different sizes. The
top-10 teams were close to each other and employed BERT, RoBERTa, or XLM-RoBERTa models;
CNNs and LSTMs were sometimes also used, either for comparison or for hybridization.
   Over 40 research groups participated in the HASOC-2020 competition. The
top-ranked submission for Hindi hate speech detection used a CNN with FastText embeddings
as input [17]. The best performance on the German hate speech detection task was achieved using
fine-tuned versions of BERT, DistilBERT, and RoBERTa [18]. Similarly, the top performance in
English-language hate speech detection was based on BERT and another deep learning-based
model.
   In 2021, HASOC [19, 20] offers three different tasks, with a separate dataset for each
subtask. Subtask 1A offers tasks in English and Hindi, each with two problems, and in Marathi,
with one problem. The subtask 1B dataset contains English and Hindi tweets, and subtask 2
contains code-mixed Hindi tweets. We participated in subtask A for English, the identification
of hate/offensive Twitter posts, and used the dataset provided by HASOC-2021 for training and
validation. With regard to the state-of-the-art practice [21] in the field of hate speech text
detection, our contribution in this paper is threefold:
   1. We compare the performance of different pre-trained transformer-based neural network
      models and explain the observed performance.
   2. We study how an additional fine-tuning step with masked language modeling (MLM) of
      the best individual model, conducted on in-domain data, affects the model performance.
   3. We present an ensemble of different models, including CNN+BERT.
   The paper is structured as follows. Section 2 describes our methodology (dataset
annotation schema, preprocessing, and classifier architectures, including the machine learning
models and the associated feature engineering). Section 3 describes the experimental setup, and
Section 4 reports the details of our results. Section 5 presents an error analysis. Finally,
conclusions are drawn in Section 6.


2. Methodology
The overall experimental methodology is a three-stage process: (i) data collection
and preprocessing, (ii) design and evaluation of the machine learning (ML) model architectures,
and (iii) error analysis. The experimental environment is the same for all experiments (e.g.,
data preprocessing, ML architecture, test data, and error analysis). See Figure 1 for a
high-level description of our methodology, whose details are presented in the following
subsections.
Figure 1: A high level system architecture diagram.


2.1. Dataset
To train our models and compare our results with state-of-the-art models, we used the English
Twitter dataset from HASOC-2021. The HASOC task organizers had already annotated the dataset
for English subtask A. Table 1 shows example tweets and their annotations. If a Twitter post
contains any hate or offensive word or conveys an offensive context, it is labeled HOF (hate
and/or offensive); otherwise it is labeled NOT. The dataset contains 3843 tweets,
of which 2501 (65%) are HOF and the remaining 1342 (35%) are NOT.
   1. (NOT) Non Hate-Offensive - This post does not contain any Hate speech, profane, offensive
      content.
   2. (HOF) Hate and Offensive - This post contains Hate, offensive, and profane content.

Table 1
Example tweets and annotation from English datasets.
 Example of Tweet                                                     Task A label   Dataset Size
 "@hemantmkpandya @news24tvchannel @Aloksharmaaicc                    HOF            3843
 @manakgupta You are a donkey that’s why only one is talking."
 "@For 18-18 hours the termite went and hollowed out a 70 year old    NOT
 strong tree in 7 years !!. #ResignModi"
 "Fattu hai bjp wala #CruelMamata #BengalViolence #BengalBurning"     HOF
 "Goodbye Sher-e-Bihar. . . May Allah bless Tala Saheb from high to   NOT
 high in Jannatul Firdous Amen. JusticeForShahabuddin"




2.2. Dataset Preprocessing
During preprocessing, we removed special characters and numbers (e.g., @, 0-9), newlines,
mention tags, and links. We did not remove hashtags, since we found them important for
subsequent reasoning. Table 2 shows an example of preprocessing.

Table 2
Example dataset preprocessing.
 Before preprocessing                                After preprocessing
 @hemantmkpandya @news24tvchannel                    You are a donkey that’s why only one is talking
 @Aloksharmaaicc @manakgupta You are a donkey
 that’s why only one is talking.
 Do not look away. #IndiaCovidCrisis                 Do not look away. #IndiaCovidCrisis
 https://t.co/oHsnIXlEla
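
The cleaning steps described in this subsection can be sketched with a few regular expressions. This is a minimal illustration rather than the exact pipeline used in the experiments: the function name `clean_tweet`, the individual patterns, and the set of punctuation marks that are kept are all assumptions made for the example.

```python
import re

# All patterns are illustrative assumptions, not the authors' exact rules.
URL_RE = re.compile(r"https?://\S+")             # links
MENTION_RE = re.compile(r"@\w+")                 # mention tags such as @user
DIGIT_RE = re.compile(r"\d+")                    # numbers
SPECIAL_RE = re.compile(r"[^A-Za-z#_\s.,!?'’]")  # keep letters, '#', '_' and basic punctuation
SPACE_RE = re.compile(r"\s+")                    # newlines, tabs, repeated spaces


def clean_tweet(text: str) -> str:
    """Remove links, mentions, numbers and special characters; keep hashtags."""
    text = URL_RE.sub(" ", text)      # links first, since they contain digits and symbols
    text = MENTION_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_tweet("Do not look away.   #IndiaCovidCrisis\nhttps://t.co/oHsnIXlEla"))
```

Links are removed before numbers so that a URL is dropped as a whole instead of being broken into fragments. Note that this sketch also strips digits that occur inside hashtags.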


   To quantify the influence of the individual preprocessing steps, we measured the accuracy
of HASOC-2021 HOF detection using a Logistic Regression (LR) classifier with TF-IDF
features; the results are summarized in Table 3. One can see, for instance, that
uppercase-to-lowercase conversion and emoji removal do not affect
the overall result. However, removing newline and tab tokens, mention tags, URLs, and special
characters worked well and together improved accuracy by almost 0.5%. Since removing
hashtags (#) lowered accuracy (78.6% in Table 3), we did not remove hashtags from our dataset.
This provides a basis for the preprocessing pipeline used in the subsequent tasks.

Table 3
Changes in accuracy scores with preprocessing. Results obtained using LR with TF-IDF.
                     Preprocessing Type                       Accuracy scores
                     All removed except Hash (#) tag          80.4
                     URL, Special Ch., Newline, Tab Token     80.1
                     USERNAME (user) mention tag              80.2
                     RT                                       80.1
                     Emoji                                    79.9
                     Stop-word                                79.6
                     Stemming                                 79.4
                     Lowercase conversion                     79.9
                     Hash (#) tag                             78.6
                     No Preprocessing                         79.9
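
To make the comparison in Table 3 easier to read, the effect of each step can be expressed as a difference from the "No Preprocessing" row. The short sketch below only re-uses the accuracy values printed in Table 3; the helper name `accuracy_deltas` is hypothetical.

```python
# Accuracy (%) for each preprocessing type, copied from Table 3.
scores = {
    "All removed except Hash (#) tag": 80.4,
    "URL, Special Ch., Newline, Tab Token": 80.1,
    "USERNAME (user) mention tag": 80.2,
    "RT": 80.1,
    "Emoji": 79.9,
    "Stop-word": 79.6,
    "Stemming": 79.4,
    "Lowercase conversion": 79.9,
    "Hash (#) tag": 78.6,
}
baseline = 79.9  # "No Preprocessing" row of Table 3


def accuracy_deltas(scores, baseline):
    """Difference (in percentage points) between each row and the baseline."""
    return {name: round(acc - baseline, 1) for name, acc in scores.items()}


deltas = accuracy_deltas(scores, baseline)
helpful = sorted(name for name, d in deltas.items() if d > 0)
harmful = sorted(name for name, d in deltas.items() if d < 0)

if __name__ == "__main__":
    for name, d in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
        print(f"{d:+.1f}  {name}")
```

Grouping the rows by the sign of the difference separates the steps that were kept in the pipeline from those that were dropped.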




2.3. Models setup
We used a set of models that are well established in hate speech detection, following previous
HOF competitions. Specifically, three types of classifiers were used: BERT models, a CNN,
and a baseline LR model with TF-IDF features, as follows:

2.3.1. Convolution Neural Network (CNN) Model
We adopted the CNN architecture of [22], where the input layer is a concatenation of
the words forming the post (up to 70 words), with each word represented by its
300-dimensional FastText embedding. A 1D convolution
with a kernel size of 3 was used together with a max-over-time pooling operation over the feature
map, followed by a dense layer of 50 units. Dropout on the penultimate layer with a constraint
on the l2-norms of the weight vectors was used for regularization. The details of the
implementation are available on the GitHub page of this project, together with the datasets and
code (see footnote 1).

2.3.2. Transformer Network Models
BERT – Bidirectional Encoder Representations from Transformers: this seminal transformer-based
language model applies an attention mechanism that enables learning contextual relations
between words in a text sequence [23]. BERT is pre-trained with two
strategies:
   1. MLM: 15% of the tokens in a sequence are replaced (masked), and the model
      learns to predict the original tokens, and
   2. Next sentence prediction (NSP): the model receives two sentences as input and
      learns whether the second sentence is the successor of the first sentence in their original
      document context.
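
The MLM objective in item 1 can be illustrated with a small token-masking routine. The sketch below is a simplified, word-level illustration (real implementations operate on subword IDs); it follows the 80/10/10 replacement rule described in the BERT paper [23], and the names `mask_tokens`, `vocab`, and `seed` are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Select about `mask_prob` of the positions for prediction.

    Returns (masked_tokens, labels): labels[i] holds the original token at
    each selected position and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # position not selected
        labels[i] = tok                    # the model must recover this token
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, labels


if __name__ == "__main__":
    words = "you are right very sad for our country".split()
    print(mask_tokens(words, vocab=words, seed=0))
```

Only the selected positions contribute to the training loss, which is why the labels are None elsewhere.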
  RoBERTa – the Robustly Optimized BERT Pretraining Approach is a replication of BERT
developed by Facebook [24] with the following modifications:
   1. training the model longer, with bigger batches and more and cleaner data, and discarding
      the NSP objective,
   2. training on longer sequences, and
   3. dynamically changing the masking patterns, e.g. taking care of masking complete
      multi-word units.
   XLM-RoBERTa – XLM-R: this is a cross-lingual, or multilingual, version of RoBERTa, which
is trained on more than 100 languages at once [25].
   ALBERT – A Lite BERT is a modification of BERT designed especially to overcome
training time and memory limitations (Lan et al., 2019). The main contributions that
ALBERT makes over BERT are:
   1. decomposing the embedding parameters into smaller matrices that are projected to
      the hidden space separately,
   2. replacing BERT’s simpler NSP objective with sentence order prediction (SOP), and
   3. sharing parameters across layers to improve or stabilize the learned parameters.

2.3.3. Ensemble model
We created ensemble models using the majority voting rule. Specifically, we tested ensembles
of i) all BERT models; ii) BERT-large-uncased + BERT-base-uncased + ALBERT-xxlarge-v2; iii)
CNN + BERT-large-uncased + BERT-base-uncased + ALBERT-xxlarge-v2. In the case of an even
number of models, where the votes can be tied, the ensemble takes into account the decision
weight generated by each classifier to yield the final output (HOF versus NOT).
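
The voting rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `majority_vote` is hypothetical, and the "decision weight" is assumed here to be a per-model confidence score (for example the predicted-class probability), which the text does not specify.

```python
from collections import Counter


def majority_vote(predictions, weights=None):
    """Combine one label per model ("HOF" or "NOT") into a single label.

    `weights` (assumed: one confidence score per model) is used only to
    break ties, which can occur with an even number of models.
    """
    ranked = Counter(predictions).most_common()
    top_count = ranked[0][1]
    tied = [label for label, count in ranked if count == top_count]
    if len(tied) == 1:
        return tied[0]                   # clear majority
    if weights is None:
        raise ValueError("tied vote and no decision weights given")
    totals = {label: 0.0 for label in tied}
    for label, weight in zip(predictions, weights):
        if label in totals:
            totals[label] += weight      # sum the weights of each tied label
    return max(totals, key=totals.get)


if __name__ == "__main__":
    print(majority_vote(["HOF", "HOF", "NOT"]))
    print(majority_vote(["HOF", "NOT", "HOF", "NOT"], [0.6, 0.9, 0.7, 0.9]))
```

With three models the vote is always decided by count alone; the weights matter only for the four-model ensemble (iii), where a 2-2 split is possible.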
1 https://github.com/saroarjahan/HASOC-2021-TASKA (accessed September 08, 2021)

3. Experiment Setup
Initially, we randomly split the original dataset into 80% for training and 20% for
testing and validation, keeping the same proportions for all models.
Three types of classifiers were implemented: logistic regression (LR) with word-level TF-IDF
features, a convolutional neural network (CNN) with FastText word embeddings, and pre-trained
BERT models. For the BERT setup, we fine-tuned different transformer models on the HASOC-2021
training data, using the corresponding test data for validation. The following models
were tested: BERT-base and BERT-large (uncased), RoBERTa-base and RoBERTa-large,
XLM-RoBERTa, BERT-multilingual, and four different ALBERT models (large-v1, large-v2, xxlarge-v1,
and xxlarge-v2). Each model was fine-tuned for 6 epochs with a learning rate of 5e-6, a maximum
sequence length of 128, and a batch size of 4. After each epoch, the model was evaluated on the
validation set, and the best-performing epoch was saved for ensembling. We tested ensembles
that apply majority voting over all models and over selected models such as BERT-base,
BERT-large, and ALBERT.
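
The setup above can be summarized in a short sketch. The hyperparameter values are those stated in the text; everything else is an assumption made for illustration, including the names `FineTuneConfig`, `split_80_20`, and `best_epoch`, the fixed seed, and the reading that every model is trained on the same split.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning hyperparameters as stated in the text."""
    epochs: int = 6
    learning_rate: float = 5e-6
    max_seq_length: int = 128
    batch_size: int = 4


def split_80_20(samples, seed=42):
    """Random 80/20 split; a fixed seed (assumed) gives every model the same split."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(0.8 * len(items))
    return items[:cut], items[cut:]


def best_epoch(validation_scores):
    """1-based index of the epoch with the highest validation score."""
    best_index = max(range(len(validation_scores)), key=validation_scores.__getitem__)
    return best_index + 1


if __name__ == "__main__":
    train, held_out = split_80_20(range(3843))
    print(len(train), len(held_out))
    print(FineTuneConfig())
```

Selecting the checkpoint by validation score, rather than always taking the last epoch, is what "the best-performing epoch was saved" amounts to in practice.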


4. Results
Table 4 shows the results of binary offensive language detection (each tweet is labeled as
either hate or non-hate) for the LR baseline, the CNN with word embeddings, the individual
fine-tuned transformer models, and their ensembles. The CNN, BERT-base-uncased,
BERT-large-uncased, and ALBERT-xxlarge-v2 models outperform the LR
baseline. Comparing the BERT and CNN models, BERT-base-uncased and BERT-large-uncased
slightly (0.1%) outperform the CNN model. The CNN, which achieves an 83.3% F1 score, performs
much better than the baseline and some BERT models.
   Our best individual model is BERT-base-uncased, with an F1 score of 83.4%. The experiment
also showed that BERT-uncased performed 2% better than BERT-cased. Comparing the
different pre-trained transformer models yielded further interesting results. For example,
none of the multilingual BERT models outperformed the BERT-large or BERT-base model.
This was not entirely a surprise: the multilingual pre-trained model was trained on the
104 languages with the largest Wikipedias using a masked language modeling (MLM) objective,
and therefore contains only a small percentage of English tokens. It may thus have fallen
short because the HASOC-2021 dataset is monolingual English.
   We expected dehateBERT to perform better, since it was pre-trained only on English and
included several hate speech corpora; however, it reached only a 79.4% F1 score. One possible
explanation is the diverse pattern of our dataset, which consists only of tweets
collected from recent events (e.g., Indian politics, Covid).
dehateBERT may therefore not generalize well, since it was pre-trained only on some specific
hate speech datasets, which limits the coverage of its pre-training data.
   Regarding the ensembles of model variants, we can see that the gain is very marginal
when all transformers are ensembled together. This is probably because some
transformer models did not perform well in the first place, so mixing in these low-performing
models reduced the overall performance. However, when we ensembled only the
best-performing models (BERT-large, BERT-base, and ALBERT), the F1 score increased by about
1%. Furthermore, when we ensembled the CNN with the best-performing BERT models, the
performance improved further.

Table 4
Accuracy and F1 scores for Hate-Offensive detection Task A, English dataset (best in bold).

 Classifier / Feature Name                                            NOT            HOF            BOTH labels
                                                                      Acc    F1      Acc    F1      Acc      F1
 LR+Word Level TF-IDF                                                 69.3   69      92.4   92.2    80.4     80
 CNN +Word Emb.                                                       72.1   72      95.4   95.2    83.3     83.3
 BERT-large-cased                                                     71.2   71      94.3   94.2    82.45    82.15
 BERT-large-uncased                                                   72.2   72      95.4   95.3    82.7     83.4
 BERT-base-cased                                                      69.3   69.1    92.4   92.3    80.3     80
 BERT-base-uncased                                                    72.3   72.1    95.6   95.4    83.8     83.5
 BERT-base-multilingual                                               65.6   65      88.    88.1    75.7.4   75.3
 BERT-base-multilingual-uncased                                       65.8   65.2    88.7   88.5    76.5     76
 RoBERTa-base                                                         65.9   65.7    89     88.9    78.8     78
 RoBERTa-large                                                        66.5   66      89.6   89.1    79.5     79
 XLM-RoBERTa                                                          65.7   65.3    88.6   88.4    76.6     76.3
 ALBERT-large-v1                                                      69.4   69.3    92.5   92.4    80.5     80.1
 ALBERT-xxlarge-v2                                                    70.4   69.7    92.6   92.8    81.5     81.3
 ALBERT-xxlarge-v2                                                    70     69.3    93     92.7    82.8     82.4
 Dehatebert-mono-english                                              69.5   69.2    92.5   92.2    80       79.4
                          Ensembles of Models (Majority voting applied)
 All BERT model                                                       69.5   69.4    92.9   92.7    81       80.3
 BERT-large-uncased + BERT-base-uncased + ALBERT-xxlarge-v2           74.8   74.7    96.1   96      84.9     84.6
 CNN + BERT-large-uncased + BERT-base-uncased + ALBERT-xxlarge-v2     76.1   75.8    96.6   96.4    85.5     85.1

   We submitted the results of our best ensemble (CNN + BERT-large-uncased + BERT-base-uncased +
ALBERT-xxlarge-v2) to the HASOC-2021 competition; in the official results, it received
a 77% F1 score (Table 5).
Table 5
Official results of our HASOC-2021 test set submission for English Task A.
                   Team name       Task                                  Macro F1
                   TeamOulu        English Task A                        77




Figure 2: Confusion matrix from 2117 test samples.


5. Error Analysis
Although we obtained 85.5% accuracy and an 85.1% F1 score, the model still produces a number
of false detections. To understand this better, we performed an error analysis of
the model’s performance. For this purpose, we randomly sampled a subset of the test data and
manually inspected the classifier output. The error-analysis subset contained 200 samples, of
which 68 were non-hate speech and 132 were hate samples.
   From Figure 2, we see that most of the errors are false positives (FP): our
classifier does not perform well at detecting non-hate samples. In contrast, 1414 hate
samples were correctly detected, and only 47 were false negatives (FN). Since the
majority (65%) of our training dataset consists of hate samples, our model seems to be better
trained on, or biased towards, the hate-offensive class. Table 6 shows FP and FN examples.
Here, we can see that some data were in fact incorrectly annotated (of the 4 false predictions,
3 were wrongly annotated). Since this test data was a split of the original data, this suggests
that our model performs better than the scores indicate: some
false predictions are actually correct predictions, and some of the error comes from the
dataset annotation itself.
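
As a small worked example, the two counts quoted above are enough to compute the recall of the hate-offensive class. The sketch below uses the standard definitions of precision, recall, and F1; the function names are hypothetical, and only the counts 1414 and 47 come from the text. The false-positive count is not stated in the text, so precision and F1 are defined but not evaluated on the paper's data.

```python
def precision(true_positive, false_positive):
    return true_positive / (true_positive + false_positive)


def recall(true_positive, false_negative):
    return true_positive / (true_positive + false_negative)


def f1_score(true_positive, false_positive, false_negative):
    p = precision(true_positive, false_positive)
    r = recall(true_positive, false_negative)
    return 2 * p * r / (p + r)


# Counts quoted in the text: 1414 hate samples detected, 47 missed.
hof_recall = recall(1414, 47)

if __name__ == "__main__":
    print(f"HOF recall: {hof_recall:.3f}")
```

The resulting recall of roughly 0.968 for the HOF class illustrates why the remaining errors are concentrated in the false positives.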
Table 6
False-positive (FP) and false-negative (FN) examples from first 200 test samples.
 Prediction          Tweets                                                          Original   Predict
                                                                                     Label      label
 FP                  nah i mean i’m not trying to be a dick here, i’m just           0          1
                     trying to make the point that it’s not as easy as it ’feels’,
                     you know?
 FP                  sadly enough innocent poor people of are dying but ur           0          1
                     is evolves back from #human to #monky
 FP                  you was right dr and you are right ..very sad for our coun-     0          1
                     try #murderer_modi #resignmodi
 FN                  do i want to die or do i just want to stop feeling empty        1          0
                     everyday



6. Conclusion
After last year’s HASOC-2020 shared task on offensive language detection, BERT models
emerged as the state of the art in HOF detection, although fine-tuning these models is still
challenging and open to debate. This motivated the approach of this paper, which ensembles
state-of-the-art BERT models and a CNN while seeking an optimal preprocessing strategy.
In the 2021 competition, we performed different experiments with tweet preprocessing,
fine-tuned BERT models, and ensembles of different models. Our tweet preprocessing experiments
showed that removing mention tags, special characters, and URLs was useful and increased
accuracy by about 0.5%. However, eliminating hashtags and stemming reduced the performance
of the model. Among the different transformers, BERT-base outperformed the other models,
including BERT-multilingual, RoBERTa, and Dehatebert-mono-english. Between the cased and uncased
BERT versions, BERT-uncased performed better than BERT-cased. Surprisingly,
the CNN performed much better than most BERT models and came close to the best-performing
BERT model. In addition, the ensemble of the best-performing models showed further improvement.
Our best ensemble achieved an 85.1% F1 score on our test split, and the official submission
achieved a 77% F1 score.


7. Acknowledgments
This project was partially funded by EU Project WaterLine (Downscaling Remotely Sensed
Products to Improve Hydrological Modelling Performance), and EU Project YoungRes (#823701),
which are gratefully acknowledged.


References
 [1] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive language detection in
     online user content, in: Proceedings of the 25th international conference on world wide
     web, 2016, pp. 145–153.
 [2] R. Kumar, A. K. Ojha, S. Malmasi, M. Zampieri, Benchmarking aggression identification
     in social media, in: Proceedings of the First Workshop on Trolling, Aggression and
     Cyberbullying (TRAC-2018), 2018, pp. 1–11.
 [3] D. R. Beddiar, M. S. Jahan, M. Oussalah, Data expansion using back translation and
     paraphrasing for hate speech detection, Online Social Networks and Media 24 (2021)
     100153.
 [4] H. Mubarak, K. Darwish, W. Magdy, Abusive language detection on arabic social media,
     in: Proceedings of the first workshop on abusive language online, 2017, pp. 52–56.
 [5] Y. J. Foong, M. Oussalah, Cyberbullying system detection and analysis, in: 2017 European
     Intelligence and Security Informatics Conference (EISIC), IEEE, 2017, pp. 40–46.
 [6] C. Abderrouaf, M. Oussalah, On online hate speech detection. effects of negated data
     construction, in: 2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019,
     pp. 5595–5602.
 [7] I. Kwok, Y. Wang, Locate the hate: Detecting tweets against blacks, in: Proceedings of the
     AAAI Conference on Artificial Intelligence, volume 27, 2013.
 [8] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the germeval 2018 shared task on the
     identification of offensive language (2018).
 [9] M. S. Jahan, Team oulu at semeval-2020 task 12: Multilingual identification of offensive
     language, type and target of twitter post using translated datasets, in: Proceedings of the
     Fourteenth Workshop on Semantic Evaluation, 2020, pp. 1628–1637.
[10] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Semeval-2019 task 6:
     Identifying and categorizing offensive language in social media (offenseval), arXiv preprint
     arXiv:1903.08983 (2019).
[11] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak,
     L. Derczynski, Z. Pitenis, Ç. Çöltekin, Semeval-2020 task 12: Multilingual offensive language
     identification in social media (offenseval 2020), arXiv preprint arXiv:2006.07235 (2020).
[12] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the
     hasoc track at fire 2019: Hate speech and offensive content identification in indo-european
     languages, in: Proceedings of the 11th forum for information retrieval evaluation, 2019,
     pp. 14–17.
[13] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the hasoc track at fire
     2020: Hate speech and offensive language identification in tamil, malayalam, hindi, english
     and german, in: Forum for Information Retrieval Evaluation, 2020, pp. 29–32.
[14] P. Liu, W. Li, L. Zou, Nuli at semeval-2019 task 6: Transfer learning for offensive language
     detection using bidirectional transformers, in: Proceedings of the 13th international
     workshop on semantic evaluation, 2019, pp. 87–91.
[15] D. Mahata, H. Zhang, K. Uppal, Y. Kumar, R. Shah, S. Shahid, L. Mehnaz, S. Anand, Midas
     at semeval-2019 task 6: Identifying offensive posts and targeted offense from twitter,
     in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp.
     683–690.
[16] G. Wiedemann, S. M. Yimam, C. Biemann, Uhh-lt & lt2 at semeval-2020 task 12: Fine-
     tuning of pre-trained transformer networks for offensive language detection, arXiv preprint
     arXiv:2004.11493 (2020).
[17] R. Raj, S. Srivastava, S. Saumya, Nsit & iiitdwd@ hasoc 2020: Deep learning model for
     hate-speech identification in indo-european languages (2021).
[18] B. L. R. Kumar, B. Lahiri, A. K. Ojha, A. Bansal, Comma@ fire 2020: Exploring multilingual
     joint training across different classification tasks, in: Working Notes of FIRE 2020-Forum
     for Information Retrieval Evaluation, Hyderabad, India, 2020.
[19] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Satapara, P. Majumder, J. Schäfer,
     T. Ranasinghe, M. Zampieri, D. Nandini, A. K. Jaiswal, Overview of the HASOC subtrack at FIRE
     2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan
     Languages, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021. URL: http://ceur-ws.org/.
[20] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri,
     Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content
     Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in:
     FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December
     2021, ACM, 2021.
[21] M. S. Jahan, M. Oussalah, A systematic review of hate speech automatic detection using
     natural language processing, arXiv preprint arXiv:2106.00742 (2021).
[22] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint
     arXiv:1408.5882 (2014).
[23] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint
     arXiv:1907.11692 (2019).
[25] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
     M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
     scale, arXiv preprint arXiv:1911.02116 (2019).