A comparison of deep learning models for hate speech detection

Eglė Kankevičiūtė¹,²,*, Milita Songailaitė¹,²,*, Justina Mandravickaitė¹,²,*, Danguolė Kalinauskaitė¹,²,* and Tomas Krilavičius¹,²,*

¹ Vytautas Magnus University, Faculty of Informatics, Vileikos street 8, LT-44404 Kaunas, Lithuania
² Centre for Applied Research and Development, Lithuania

IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
* Corresponding author.
Emails: egle.kankeviciute@stud.vdu.lt (E. Kankevičiūtė); milita.songailaite@stud.vdu.lt (M. Songailaitė); justina.mandravickaite@vdu.lt (J. Mandravickaitė); danguole.kalinauskaite@vdu.lt (D. Kalinauskaitė); tomas.krilavicius@vdu.lt (T. Krilavičius)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Hate speech is a complex and non-trivial phenomenon that is difficult to detect. Existing datasets used for training hate speech detection models are annotated based on different definitions of this phenomenon, and similar instances can be assigned to different annotation categories because of these differences. The goal of our experiment is to evaluate selected hate speech detection models for the English language from the perspective of inter-annotator agreement, i.e. how the selected models "agree" when annotating hate speech instances. For the model comparison we used the English dataset from the HASOC 2019 shared task and 3 models: BERT-HateXplain, HateBERT and BERT. Inter-annotator agreement was measured with pairwise Cohen's kappa and Fleiss' kappa; accuracy was used as an additional control metric. The experiment showed that even when accuracy is high, reliability, measured via inter-annotator agreement, can be low. The best accuracy in hate speech detection was achieved with the BERT-HateXplain model; however, its Cohen's kappa was close to 0, meaning that its results were random and not reliable for real-life use. On the other hand, a comparison of the BERT and HateBERT models revealed that their annotations are quite similar and they have the best Cohen's kappa score, suggesting that similar neural network architectures can deliver not only high accuracy, but also correlated results and reliability. As for Fleiss' kappa, a comparison of expert annotations and the three selected models gave an estimate of only slight agreement, confirming that high accuracy can go together with low reliability of a model.

Keywords
Hate speech, deep learning, model comparison, HASOC 2019 dataset, English language

1. Introduction

Hate speech is a complex and non-trivial phenomenon that is difficult to detect. Online hate speech is assumed to be an important factor in political and ethnic violence such as the Rohingya crisis in Myanmar [1], [2]. Therefore, media platforms are under pressure to detect and remove occurrences of hate speech in a timely manner [3]. This tendency has led to increasing efforts in hate speech detection, and a number of hate speech detection models have been developed.

Existing datasets used for training hate speech detection models are annotated based on different definitions of this phenomenon, and similar instances can be assigned to different annotation categories depending on these differences in the perception of what constitutes hate speech. An analysis of the effect of the definition on annotation reliability led to the conclusion that the hate speech phenomenon requires a stronger and more uniform definition [4]. It was also found that most of the publicly available datasets are incompatible due to different definitions attributed to similar concepts [5]. Moreover, hate speech datasets can have very similar labels, so some studies merge them into one class to reduce class imbalance [10]. However, this practice can have a negative impact on research, as the distinction between classes is necessary: for example, the offensive language and hate speech classes of the [6] dataset were merged in [3] and [12], and the racist language and sexist language classes of the [11] dataset were merged in [13] and [14]. In hate speech research, abusive language or toxic comments can cover several paradigms [10], therefore following the available definitions is very important. Similarly, it was suggested that offensive language is not the same as hate speech and that the two should not be merged [6].

Following other authors, such as [6], [7], [8] and [9], the summarised definition of hate speech is the following: hate speech describes negative attributes or deficiencies of groups of individuals because they are members of a particular group. Hateful comments target groups because of race, political opinion, sexual orientation, gender, social status, health condition, etc. As suggested in [6] and [9], offensive comments can be attributed to a separate class, and offensive language can be defined as an attempt to degrade, dehumanize or insult an individual and / or to threaten them with violent acts.

As one of the reasons why hate speech is difficult to detect is the variety of definitions used across studies [4], [5], a comparison of different hate speech detection models not in terms of performance but in terms of what they mark as hate speech could contribute to a more comprehensive understanding of the phenomenon and its timely identification. Following this notion, the goal of this experiment is to evaluate selected hate speech detection models for the English language from the perspective of inter-annotator agreement, i.e. how the selected models "agree" in terms of annotation of hate speech instances.

Section II presents the methods used as well as the experimental setup, Section III describes the data used in the experiment, Section IV reports the results, and Section V ends the paper with conclusions and future plans.
2. Methods and experimental setup

For our experiment we selected 3 popular hate speech detection models for the English language and tested them on the HASOC 2019 dataset. Our setup consisted of 4 "annotators": the results provided by the aforementioned 3 models and the annotations presented in the HASOC 2019 dataset. The dataset annotations were treated as the "gold standard". In the following sections, methods of data representation are presented (they were important for selecting the hate speech detection models), and the hate speech detection models as well as the inter-annotator agreement metrics used in our experiment are introduced.

2.1. Basic word embeddings

Perception of natural language from textual data is an important area of artificial intelligence. Just as images are perceived by a computer as pixels, language also needs to be represented in a way that can be processed automatically. For example, the sentence "The cat sat on the mat" cannot be directly processed or understood by a computer system. One of the best methods to represent it for a computer is to convert the words into real-valued numeric vectors, i.e. word embeddings [16]. Word embeddings associate each word in the vocabulary (a set of words) with a real-valued vector in a predefined N-dimensional space (Fig. 1). After transforming words or sentences into their embeddings, it is possible to model the semantic importance of a word in numerical form and thus to carry out mathematical operations [35]. This vector mapping can be learned using unsupervised methods such as statistical document analysis, or by using supervised techniques, for example, a neural network model developed for tasks such as sentiment analysis or document classification [38].

Figure 1: Projection of word and phrase embeddings showing that words of similar meaning are adjacent in the embedding space [17]

The simplest way to represent words as numeric values is one-hot encoding [24]. This method is one of the most popular and works well when there are not many different categories (up to 15 works best, although in some cases it may work poorly even with fewer). One-hot encoding creates a new binary column for each value of a categorical variable, where a value of 1 indicates that the original data row belongs to that category. For example, given the original data red, red, red, red, yellow, green, yellow, a separate column is created for each possible value, and where the initial value is red, a 1 is entered in the corresponding column, while 0s are inserted in the other columns (Fig. 2) [36].

Figure 2: Example of one-hot encoding [18]

Although this method is simple and easy to learn, it has major drawbacks. Because we only give the computer ones and zeros, it cannot interpret any meaning from this data (calculating cosine similarity will always result in zero or near-zero values). This is where pre-trained word embeddings and BERT embeddings help, and that is why they have become popular in a variety of natural language processing tasks, including hate speech detection.
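To make the colour example above concrete, the following is a minimal one-hot encoding sketch in Python; it is illustrative only and not part of the experimental pipeline described in this paper:

    # Minimal one-hot encoding of the colour example (illustrative sketch only).
    values = ["red", "red", "red", "red", "yellow", "green", "yellow"]
    categories = sorted(set(values))  # ['green', 'red', 'yellow']

    # One binary column per category; 1 marks the category the row belongs to.
    one_hot = [[1 if value == category else 0 for category in categories]
               for value in values]

    for value, row in zip(values, one_hot):
        print(value, row)  # e.g. "red [0, 1, 0]", "yellow [0, 0, 1]"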
2.2. Pre-trained word embedding models

Using pre-trained models is often an optimal solution for deep learning tasks. A pre-trained model is a model developed and trained by someone else to solve a specific problem on chosen data [37]. Using pre-trained models saves the time otherwise spent on training a model or searching for an efficient neural network architecture. The two main ways to use a pre-trained model are fixed feature extraction and fine-tuning of the model, adapting it to the problem at hand [19].

The fine-tuning of the model is done in one step. Fig. 3 represents the process in which each user-generated comment is classified for hate speech detection by a fine-tuned BERT model [20].

The feature-based approach involves two steps. First, each text, for example a user-generated comment, is represented as a sequence of words or subwords, and the embedding of each word or subword is computed using fastText or BERT models. Second, this sequence of embeddings forms the input to a neural network (NN) classifier, where the final decision regarding the label of the input text is made (Fig. 3) [20]. For this task a variety of deep neural network (DNN) architectures can be used, for example, a deep recurrent neural network (RNN) [31], a deep convolutional neural network (CNN) [33], a gated recurrent unit (GRU) [3], a long short-term memory network (LSTM) [34], etc. The most suitable architecture is usually selected via experiments and by combining more than one architecture for the task.

Figure 3: Illustrative explanation of the feature-based and fine-tuning methodologies [20]
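As a hedged sketch of the fine-tuning approach, the snippet below loads a pre-trained BERT checkpoint with a classification head using the Hugging Face transformers library; the checkpoint name, the three-class label mapping and the omitted training loop are assumptions for illustration and do not reproduce the exact setup of the models compared in this paper.

    # Sketch of the fine-tuning approach: a pre-trained encoder plus a classification
    # head (assumed checkpoint and label mapping; the training loop itself is omitted).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "bert-base-uncased"  # assumed pre-trained checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

    # After fine-tuning on labelled comments, a user-generated comment is
    # classified in a single forward pass:
    comment = "an example user-generated comment"
    inputs = tokenizer(comment, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class = int(torch.argmax(logits, dim=-1))  # e.g. 0 = NOT, 1 = HATE, 2 = OFFN (assumed mapping)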
2.3. BERT embeddings

BERT (Bidirectional Encoder Representations from Transformers) was released in 2018 by Google AI Language researchers and features state-of-the-art performance on most NLP problems [25]. BERT can take one or two sentences as input and uses the special token [SEP] to separate them. The [CLS] token is always placed at the beginning of the text and is characteristic of classification tasks. These tokens are always required, even if there is only one sentence or if the BERT model is not used for a classification task [35], as they help the algorithm to distinguish between different sentences.

Thus, for the BERT model to be able to distinguish between words, there are normally three main steps. First, as mentioned above, the [CLS] and [SEP] tokens are added at the beginning and at the end of the sentence. Next, an index is specified for each word and, finally, zeros are added to sentences that are shorter than the longest sentence, i.e. the lengths of the sentences are made equal; this step is called padding [25]. Word embeddings are then used: each word is assigned one specific vector, and each value of these vectors represents one aspect of the word (Fig. 4).

Figure 4: Three steps before word embeddings [21]

BERT is based on the transformer architecture, therefore it uses the attention mechanism. Attention is a way of looking at the relationships between the words in each sentence, and it allows BERT to take into account a large, fixed-size amount of context, both to the left and to the right of a particular word [20]. Examining the working principle of BERT word embeddings, it can be seen that when an English word with an ambiguous meaning, for example crush, is given as input, the BERT model can recognise that this word has several different meanings (each word is embedded according to the context in which it is used). In contrast, Word2Vec or fastText based models give every word a single representation (one vector for all the different meanings of the word) [36].

In addition, BERT uses tokenization into word parts, or subwords. For example, the English word singing can be represented as two strings: sing and ing. The advantage of this is that when a word is not in the BERT dictionary, it can be split into parts to produce embeddings for rare words [20]. This type of embeddings was used in all 3 chosen hate speech detection models.
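The special tokens, padding and subword splitting described above can be inspected directly with a BERT tokenizer; the following sketch uses the Hugging Face transformers library with an assumed bert-base-uncased vocabulary, so the exact subword splits shown in the comments are indicative rather than guaranteed.

    # Inspecting BERT tokenization: special tokens, padding and subword pieces
    # (assumed `bert-base-uncased` vocabulary; exact splits depend on the vocabulary).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    encoded = tokenizer("The cat sat on the mat", padding="max_length", max_length=12)
    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
    # indicative output: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]', '[PAD]', ...]

    # A word missing from the vocabulary is split into subword pieces, e.g.:
    print(tokenizer.tokenize("unbelievability"))  # indicative: ['un', '##bel', ...] or a similar split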
2.4. Selected hate speech detection models

For our experiment we selected three BERT models which were pre-trained differently for the hate speech recognition task:

• BERT-HateXplain (available at https://github.com/hate-alert/HateXplain)
• HateBERT (available at https://github.com/tommasoc80/HateBERT)
• BERT (available at https://github.com/google-research/bert)

The selected models were trained on different datasets and used for classifying texts as either hate speech, offensive or non-hate. The BERT model was trained using tweets from Twitter [30]. BERT-HateXplain was also trained using Twitter and, additionally, Gab (an American microblogging and social networking service, https://gab.com); moreover, human rationales were included as part of its training data to boost performance [29]. The HateBERT model was trained using RAL-E, the Reddit Abusive Language English dataset [30].

2.5. Inter-annotator agreement

In linguistics, inter-annotator agreement is a formal means of comparing annotator performance in terms of reliability [26]. The annotation guidelines define a correct annotation for each relevant instance. As the actual annotations are created by the annotators, there is no reference dataset against which to check whether the annotations are correct. Therefore, common practice is to check the reliability of the annotation process, assuming that if the annotation process is not reliable, the annotations cannot be expected to be correct.

For our experiment, we chose inter-annotator agreement to evaluate how the selected hate speech detection models "agree" in terms of annotation of hate speech instances. We selected Cohen's kappa, Fleiss' kappa and Accuracy as metrics.

Accuracy is one of the metrics for evaluating classification models. With more than two classes, it is calculated as the number of correctly predicted samples in the test set divided by the number of all predictions made on the test set [39]:

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \quad (1) \]

Cohen's kappa is commonly used for measuring the degree of agreement between two raters on a nominal scale. This coefficient also controls for random agreement [28]. Cohen's kappa has the value 1 for perfect agreement between the raters and the value 0 for random agreement. As we compared more than 2 models ("annotators"), we used pairwise Cohen's kappa (2). Fleiss' kappa (3) is used for analyzing agreement between more than two raters rating nominal categories [27]; its value for perfect agreement is 1, while 0 marks random agreement.

\[ \text{cohen}(\kappa) = \frac{p_0 - p_e}{1 - p_e} \quad (2) \]

\[ \text{fleiss}(\kappa) = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \quad (3) \]
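As a hedged illustration of equations (1)–(3), the sketch below computes accuracy, pairwise Cohen's kappa and Fleiss' kappa over toy annotations using scikit-learn and statsmodels; the toy labels and the use of these libraries (instead of the disagree library used in the experiment) are assumptions for illustration only.

    # Toy illustration of accuracy (eq. 1), pairwise Cohen's kappa (eq. 2) and
    # Fleiss' kappa (eq. 3); labels and libraries are assumptions, not the paper's pipeline.
    from itertools import combinations

    import numpy as np
    from sklearn.metrics import accuracy_score, cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Toy annotations (0 = NOT, 1 = HATE, 2 = OFFN); one list per "annotator".
    annotations = {
        "gold":            [0, 1, 0, 2, 0, 1],
        "BERT":            [0, 1, 0, 0, 0, 1],
        "HateBERT":        [0, 1, 0, 0, 0, 2],
        "BERT-HateXplain": [0, 0, 0, 2, 0, 0],
    }

    for name in ["BERT", "HateBERT", "BERT-HateXplain"]:
        print(name, "accuracy:", accuracy_score(annotations["gold"], annotations[name]))

    # Pairwise Cohen's kappa between all "annotators".
    for a, b in combinations(annotations, 2):
        print(f"Cohen's kappa ({a} vs {b}): {cohen_kappa_score(annotations[a], annotations[b]):.3f}")

    # Fleiss' kappa over all four "annotators".
    ratings = np.array(list(annotations.values())).T  # items x raters
    table, _ = aggregate_raters(ratings)              # items x categories counts
    print("Fleiss' kappa:", round(fleiss_kappa(table), 3))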
3. Data

For the model comparison we used the English dataset from the HASOC 2019 shared task (available at https://hasocfire.github.io/hasoc/2019/dataset.html). The data source is Twitter, and the data was sampled using keywords or hashtags relevant for hate speech [15]. All the tweets were annotated by 2 annotators; when there was a mismatch between the annotators, the tweet was assigned to a third annotator. The dataset has been labelled with 5 classes:

• NOT - Non Hate / Non Offensive Content: posts with no hate, profane or offensive content.
• HOF - Hate Speech and Offensive Language: posts with hate, offensive or profane content.
• HATE - Hate Speech: posts containing hateful content.
• OFFN - Offensive Language: posts containing offensive content.
• PRFN - Profane Language: posts containing profane words but no hate or offensive content.

We have chosen 3 of these classes for evaluation, namely NOT, HATE and OFFN, as these were the classes our selected models were trained to identify. The PRFN (profane language) class was merged with NOT (non hate / non offensive content) as it contains neither HATE (hate speech) nor OFFN (offensive language) content [32]. The number of records assigned to each class is shown in Table 1.

The dataset has 2 subsets: a training subset (5852 posts) and a test subset (1153 posts). We performed the evaluation on these subsets with the different models separately.

Table 1
Distribution of classes

Data subset   NOT posts   HATE posts   OFFN posts
Training      4042        1443         667
Testing       958         124          71

4. Results

We used the disagree library (available at https://github.com/o-P-o/disagree/) developed for the Python programming language. It was used to calculate the number of disagreements between the three models and the expert annotations (a counting sketch is given at the end of this section). This makes it easier to understand how hate speech is treated by each of the selected models. After reviewing the data, it was found that most coincidences occur in comments marked as NOT (non hate speech / non offensive language). For the models and the experts, it is easier to distinguish these types of comments because of the large number of comments with this label in the dataset. The biggest discrepancies are observed where the content contains hate speech (HATE) (Table 2).

Table 2
Disagreements in annotations

Data subset   All    2 do not match   3 do not match
Training      2919   2556             377
Testing       802    298              52

After calculating the Accuracy of the models, it was observed that the BERT-HateXplain model has the highest estimate, reaching almost 68 percent on the training subset. Accuracy becomes even higher on the testing subset, where it stands at nearly 82 percent. However, the models do not differ by a large margin, as the HateBERT model reached 77 percent and the BERT model, with 75 percent, had the lowest Accuracy on the testing subset (Fig. 5 and Fig. 6).

Figure 5: Accuracy on the training subset

Figure 6: Accuracy on the testing subset

From the results obtained, it can be seen that with a larger amount of data the Accuracy percentage drops. It is also important to note that there is a small number of OFFN (offensive) and HATE (hate speech) comments in the test data subset, and for that reason it is easier for a model to achieve higher accuracy there.

Cohen's kappa quantifies how consistently two evaluators (annotators) rate the same items; it is a measure of reliability adjusted for chance agreement. A coefficient value of 0 means that the consensus of the evaluators is random, and 1 means that the evaluators fully agree [26]. The statistic can also be negative, which can occur by chance if there is no relationship between the ratings of the two raters, or it may reflect a real tendency of the raters to give differing ratings [22].

When this metric is applied to the models, the best result was obtained between the BERT and HateBERT models. The BERT-HateXplain model has a coefficient of almost 0 (0.007), indicating that most of its consensus is random and that the model is not reliable, even though its Accuracy is high. However, all models have a relatively low Cohen's kappa coefficient (Fig. 7 and Fig. 8), therefore it would be incorrect to rely on the results of these models for automated hate speech detection without taking their limitations into account.

Figure 7: Cohen's kappa for the training subset

Figure 8: Cohen's kappa for the testing subset

We have also calculated the Fleiss' kappa coefficient, which extends Cohen's kappa to the case where the annotations of more than two evaluators are compared. A comparison of the expert annotations and the 3 selected models gave an estimate of 0.122 on the training subset and 0.163 on the testing subset. According to [23], such a Fleiss' kappa value corresponds to slight agreement.

The results showed that the selected models, namely BERT, HateBERT and BERT-HateXplain, which are trained on English datasets, are not very reliable. Although the selected models are popular in hate speech research, when evaluated with the selected inter-annotator agreement metrics, their performance is not sufficient to solve hate speech detection tasks.
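A minimal sketch of the disagreement counts referred to at the start of this section is given below, assuming four parallel label lists (gold plus three models) and plain Python instead of the disagree library; the grouping into "2 do not match" and "3 do not match" is our reading of Table 2 and may differ from the exact definition used in the experiment.

    # Counting items on which the four "annotators" disagree (hedged sketch;
    # the exact counting rule behind Table 2 may differ).
    from collections import Counter

    def disagreement_profile(gold, bert, hatebert, hatexplain):
        """Return how many items have 2, 3 or 4 distinct labels across annotators."""
        profile = Counter()
        for labels in zip(gold, bert, hatebert, hatexplain):
            distinct = len(set(labels))
            if distinct > 1:
                profile[distinct] += 1
        return profile

    # Toy labels (0 = NOT, 1 = HATE, 2 = OFFN) just to exercise the function.
    profile = disagreement_profile([0, 1, 0, 2], [0, 1, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0])
    print(dict(profile))  # e.g. {3: 1, 2: 2}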
5. Conclusions and future plans

In this paper, we presented an inter-annotator agreement analysis for hate speech detection between three different BERT models using the HASOC 2019 dataset. The experimental results showed that it is not correct to rely only on the Accuracy metric, even if the Accuracy percentage is high, because reliability may still be low. To check whether a model is reliable we chose Cohen's kappa and Fleiss' kappa. Among the selected models, we found that the highest Accuracy was achieved with the BERT-HateXplain model; even so, its Cohen's kappa estimate was almost 0, which means that the model's results were random and not reliable for real-life use. However, comparing the BERT and HateBERT models we saw that their annotations are quite similar, and their Cohen's kappa result suggests that similar neural network architectures can deliver not only high accuracy, but also correlated results and reliability. As for Fleiss' kappa, a comparison of the expert annotations and the three selected models gave an estimate of slight agreement (0.122 for the training subset and 0.163 for the testing subset), confirming that high Accuracy can go together with low reliability of a model.

Our future plans include wider model testing with different annotation schemes (e.g. distinguishing profane language, sexist language, misogyny, etc.) and data sources as well. We also plan to test models for different languages, e.g. Russian, Spanish, German, French, etc. We plan to use the knowledge gained from this experiment for developing a hate speech detection model for the Lithuanian language as well.

References

[1] Reuters, "Why Facebook is losing the war on hate speech in Myanmar," 2018. [Online]. Available: https://www.reuters.com/investigates/special-report/myanmar-facebook-hate/. [Accessed: 10-Mar-2022].
[2] M. A. Rizoiu, T. Wang, G. Ferraro, and H. Suominen, "Transfer learning for hate speech detection in social media," arXiv preprint arXiv:1906.03829, 2019.
[3] Z. Zhang, D. Robinson, and J. Tepper, "Detecting hate speech on Twitter using a convolution-GRU based deep neural network," in European Semantic Web Conference, Springer, pp. 745–760, 2018.
[4] B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, and M. Wojatzki, "Measuring the reliability of hate speech annotations: The case of the European refugee crisis," arXiv e-prints, arXiv-1701, 2017.
[5] P. Fortuna, J. Soler, and L. Wanner, "Toxic, hateful, offensive or abusive? What are we really classifying? An empirical analysis of hate speech datasets," in Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6786–6794, 2020.
[6] T. Davidson, D. Warmsley, M. Macy, and I. Weber, "Automated hate speech detection and the problem of offensive language," in Proceedings of the Eleventh International Conference on Web and Social Media, AAAI, pp. 512–515, 2017.
[7] A. Schmidt and M. Wiegand, "A survey on hate speech detection using natural language processing," in Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics (ACL), pp. 1–10, 2017.
[8] P. Fortuna and S. Nunes, "A survey on automatic detection of hate speech in text," ACM Computing Surveys (CSUR), 51(4), pp. 1–30, 2018.
[9] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, and M. Zampieri, "Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech," in Forum for Information Retrieval Evaluation, pp. 1–3, 2021.
[10] K. Madukwe, X. Gao, and B. Xue, "In data we trust: A critical analysis of hate speech detection datasets," in Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 150–161, 2020.
[11] Z. Waseem and D. Hovy, "Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter," in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (ACL), pp. 88–93, 2016.
[12] Z. Zhang and L. Luo, "Hate speech detection: A solved problem? The challenging case of long tail on Twitter," Semantic Web, 10(5), pp. 925–945, 2019.
[13] H. Watanabe, M. Bouazizi, and T. Ohtsuki, "Hate speech on Twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection," IEEE Access, 6, pp. 13825–13835, 2018.
[14] M. Wiegand, J. Ruppenhofer, and T. Kleinbauer, "Detection of abusive language: The problem of biased datasets," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 602–608, 2019.
[15] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, and A. Patel, "Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages," in Proceedings of the 11th Forum for Information Retrieval Evaluation, pp. 14–17, 2019.
[16] Y. Li and Y. Tao, "Word embedding for understanding natural language: A survey," in Guide to Big Data Applications, Springer, Cham, pp. 83–104, 2018.
[17] A. Pogiatzis, "NLP: Contextualized word embeddings from BERT," Medium, 20-Mar-2019. [Online]. Available: https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b. [Accessed: 25-Mar-2022].
[18] D. Becker, "Using categorical data with one hot encoding," Kaggle, 22-Jan-2018. [Online]. Available: https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding. [Accessed: 25-Mar-2022].
[19] Y. Zhou and V. Srikumar, "A closer look at how fine-tuning changes BERT," 2021. [Online]. Available: https://arxiv.org/pdf/2106.14282.pdf. [Accessed: 25-Mar-2022].
[20] A. G. D'Sa, I. Illina, and D. Fohr, "BERT and fastText embeddings for automatic detection of toxic speech," in 2020 International Multi-Conference on "Organization of Knowledge and Advanced Technologies" (OCTA), 2020.
[21] M. Mirshafiee, "Step by step introduction to word embeddings and BERT embeddings," Medium, 07-Oct-2020. [Online]. Available: https://mitra-mirshafiee.medium.com/step-by-step-introduction-to-word-embeddings-and-bert-embeddings-1779c8cc643e. [Accessed: 25-Mar-2022].
[22] J. Sim and C. C. Wright, "The kappa statistic in reliability studies: Use, interpretation, and sample size requirements," Physical Therapy, 85(3), pp. 257–268, 2005.
[23] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, 33(1), p. 159, 1977.
[24] P. Rodríguez, M. A. Bautista, J. Gonzàlez, and S. Escalera, "Beyond one-hot encoding: Lower dimensional target embedding," Image and Vision Computing, 75, pp. 21–31, 2018.
[25] W. Zhang, W. Wei, W. Wang, L. Jin, and Z. Cao, "Reducing BERT computation by padding removal and curriculum learning," in 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 90–92, 2021.
[26] R. Artstein, "Inter-annotator agreement," in Handbook of Linguistic Annotation, Springer, Dordrecht, pp. 297–313, 2017.
[27] L. Bartok and M. A. Burzler, "How to assess rater rankings? A theoretical and a simulation approach using the sum of the pairwise absolute row differences (PARDs)," Journal of Statistical Theory and Practice, 14(3), pp. 1–16, 2020.
[28] A. De Raadt, M. J. Warrens, R. J. Bosker, and H. A. L. Kiers, "Kappa coefficients for missing data," Educational and Psychological Measurement, 79(3), pp. 558–576, 2019.
[29] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, and A. Mukherjee, "HateXplain: A benchmark dataset for explainable hate speech detection," arXiv preprint arXiv:2012.10289, 2020. [Online]. Available: https://arxiv.org/abs/2012.10289. [Accessed: 29-Mar-2022].
[30] T. Caselli, V. Basile, J. Mitrović, and M. Granitzer, "HateBERT: Retraining BERT for abusive language detection in English," arXiv preprint arXiv:2010.12472, 2021. [Online]. Available: https://arxiv.org/abs/2010.12472. [Accessed: 29-Mar-2022].
[31] R. Alshaalan and H. Al-Khalifa, "Hate speech detection in Saudi Twittersphere: A deep learning approach," in Proceedings of the Fifth Arabic Natural Language Processing Workshop, pp. 12–23, 2020.
[32] S. Malmasi and M. Zampieri, "Challenges in discriminating profanity from hate speech," Journal of Experimental & Theoretical Artificial Intelligence, 30(2), pp. 187–202, 2018.
[33] M. A. Bashar and R. Nayak, "QutNocturnal@HASOC'19: CNN for hate speech and offensive content identification in Hindi language," arXiv preprint arXiv:2008.12448, 2020.
[34] G. L. De la Peña Sarracén, R. G. Pons, C. E. M. Cuza, and P. Rosso, "Hate speech detection using attention-based LSTM," EVALITA Evaluation of NLP and Speech Tools for Italian, 12, pp. 235–238, 2018.
[35] K. Ethayarajh, "How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings," arXiv preprint arXiv:1909.00512, 2019.
[36] F. K. Khattak, S. Jeblee, C. Pou-Prom, M. Abdalla, C. Meaney, and F. Rudzicz, "A survey of word embeddings for clinical text," Journal of Biomedical Informatics, 100, 100057, 2019.
[37] M. Mosbach, M. Andriushchenko, and D. Klakow, "On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines," arXiv preprint arXiv:2006.04884, 2020.
[38] M. K. Dahouda and I. Joe, "A deep-learned embedding technique for categorical features encoding," IEEE Access, 9, pp. 114381–114391, 2021.
[39] Google Developers, "Classification: Accuracy," Google. [Online]. Available: https://developers.google.com/machine-learning/crash-course/classification/accuracy. [Accessed: 14-Jun-2022].