A comparison of deep learning models for hate speech detection

Eglė Kankevičiūtė¹,²,*, Milita Songailaitė¹,²,*, Justina Mandravickaitė¹,²,*, Danguolė Kalinauskaitė¹,²,* and Tomas Krilavičius¹,²,*

¹ Vytautas Magnus University, Faculty of Informatics, Vileikos street 8, LT-44404 Kaunas, Lithuania
² Centre for Applied Research and Development, Lithuania

IVUS 2022: 27th International Conference on Information Technology, May 12, 2022, Kaunas, Lithuania
* Corresponding author.
Emails: egle.kankeviciute@stud.vdu.lt (E. Kankevičiūtė); milita.songailaite@stud.vdu.lt (M. Songailaitė); justina.mandravickaite@vdu.lt (J. Mandravickaitė); danguole.kalinauskaite@vdu.lt (D. Kalinauskaitė); tomas.krilavicius@vdu.lt (T. Krilavičius)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Hate speech is a complex and non-trivial phenomenon that is difficult to detect. Existing datasets used for training hate speech detection models are annotated based on different definitions of this phenomenon, and similar instances can be assigned to different annotation categories because of these differences. The goal of our experiment is to evaluate selected hate speech detection models for the English language from the perspective of inter-annotator agreement, i.e. how the selected models "agree" when annotating hate speech instances. For the model comparison we used the English dataset from the HASOC 2019 shared task and 3 models: BERT-HateXplain, HateBERT and BERT. Inter-annotator agreement was measured with pairwise Cohen's kappa and Fleiss' kappa; accuracy was used as an additional control metric. The experiment showed that even when accuracy is high, reliability, measured via inter-annotator agreement, can be low. The best accuracy in hate speech detection was achieved with the BERT-HateXplain model; however, its Cohen's kappa was close to 0, meaning that its results were random and not reliable for real-life use. On the other hand, a comparison of the BERT and HateBERT models revealed that their annotations are quite similar and they have the best Cohen's kappa score, suggesting that similar neural network architectures can deliver not only high accuracy, but also correlated results and reliability. As for Fleiss' kappa, a comparison of expert annotations and the three selected models gave an estimate of only slight agreement, confirming that high accuracy can go together with low reliability of a model.

Keywords
Hate speech, deep learning, model comparison, HASOC 2019 dataset, English language

1. Introduction

Hate speech is a complex and non-trivial phenomenon that is difficult to detect. Online hate speech is assumed to be an important factor in political and ethnic violence such as the Rohingya crisis in Myanmar [1], [2]. Therefore, media platforms are under pressure to detect and remove occurrences of hate speech in a timely manner [3]. This tendency has led to increasing efforts in hate speech detection, and a number of hate speech detection models have been developed.

Existing datasets used for training hate speech detection models are annotated based on different definitions of this phenomenon, and similar instances can be assigned to different annotation categories depending on these differences in the perception of what constitutes hate speech. An analysis of the effect of the definition on annotation reliability led to the conclusion that the hate speech phenomenon requires a stronger and more uniform definition [4]. It was also found that most of the publicly available datasets are incompatible due to different definitions attributed to similar concepts [5]. Moreover, hate speech datasets can have very similar labels, so some studies merge them into one class to reduce class imbalance [10]. However, this practice can have a negative impact on research, as the distinction between classes is necessary: for example, the offensive language and hate speech classes of the [6] dataset were merged in [3] and [12], and the racist language and sexist language classes of the [11] dataset were merged in [13] and [14]. In hate speech research, abusive language or toxic comments can cover several paradigms [10], therefore following the available definitions is very important. Similarly, it was suggested that offensive language is not the same as hate speech and that the two should not be merged [6].

Following other authors, such as [6], [7], [8] and [9], the summarised definition of hate speech is the following: hate speech describes negative attributes or deficiencies of groups of individuals because they are members of a particular group. Hateful comments target groups because of race, political opinion, sexual orientation, gender, social status, health condition, etc. As suggested in [6] and [9], offensive comments can be attributed to a separate class, and offensive language can be defined as an attempt to degrade, dehumanize or insult an individual and / or to threaten them with violent acts.

As one of the reasons why hate speech is difficult to detect is the variety of definitions used across studies [4], [5], a comparison of different hate speech detection models not in terms of performance but in terms of what they mark as hate speech could contribute to a more comprehensive understanding of the phenomenon and its timely identification. Following this notion, the goal of this experiment is to evaluate selected hate speech detection models for the English language from the perspective of inter-annotator agreement, i.e. how the selected models "agree" in terms of annotation of hate speech instances.

Section II presents the methods used as well as the experimental setup, Section III describes the data used in the experiment, Section IV reports the results, and Section V ends the paper with conclusions and future plans.
2. Methods and experimental setup

For our experiment we selected 3 popular hate speech detection models for the English language and tested them on the HASOC 2019 dataset. Our setup consisted of 4 "annotators": the results provided by the aforementioned 3 models and the annotations presented in the HASOC 2019 dataset. The dataset annotations were treated as the "gold standard". In the following sections, methods of data representation are presented (they were important for selecting the hate speech detection models), and the hate speech detection models as well as the inter-annotator agreement metrics used in our experiment are introduced.

2.1. Basic word embeddings

Perception of natural language from textual data is an important area of artificial intelligence. Just as images are perceived by a computer as pixels, language also needs to be represented in a way that can be processed automatically. For example, the sentence "The cat sat on the mat" cannot be directly processed or understood by a computer system. One of the best methods to represent it for a computer is to convert the words into real-valued numeric vectors, i.e. word embeddings [16]. Word embeddings associate each word in the vocabulary (a set of words) with a real-valued vector in a predefined N-dimensional space (Fig. 1). After transforming words or sentences into their embeddings, it is possible to model the semantic importance of a word in numerical form and thus to carry out mathematical operations [35]. This vector mapping can be learned using unsupervised methods such as statistical document analysis, or by using supervised techniques, for example, a neural network model developed for tasks such as sentiment analysis or document classification [38].

Figure 1: Projection of word and phrase embeddings showing that words of similar meaning are adjacent in the embedding space [17]

The simplest way to represent words as numeric values is one-hot encoding [24]. This method is one of the most popular and works well when there are not many different categories (up to 15 works best, although in some cases it may work poorly even with fewer). One-hot encoding creates a new binary column for each value of a categorical variable, where a value of 1 indicates that the original data row belongs to that category. For example, given the original data red, red, red, red, yellow, green, yellow, a separate column is created for each possible value, and where the initial value is red, a 1 is entered in the corresponding column, while 0s are inserted in the other columns (Fig. 2) [36].

Figure 2: Example of one-hot encoding [18]

Although this method is simple and easy to learn, it has major drawbacks. Because we only give the computer ones and zeros, it cannot interpret any meaning from this data (calculating cosine similarity will always result in zero or near-zero values). This is where pre-trained word embeddings and BERT embeddings help, and that is why they have become popular in a variety of natural language processing tasks, including hate speech detection.
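To make the colour example above concrete, the following is a minimal one-hot encoding sketch in Python; it is illustrative only and not part of the experimental pipeline described in this paper:

    # Minimal one-hot encoding of the colour example (illustrative sketch only).
    values = ["red", "red", "red", "red", "yellow", "green", "yellow"]
    categories = sorted(set(values))  # ['green', 'red', 'yellow']

    # One binary column per category; 1 marks the category the row belongs to.
    one_hot = [[1 if value == category else 0 for category in categories]
               for value in values]

    for value, row in zip(values, one_hot):
        print(value, row)  # e.g. "red [0, 1, 0]", "yellow [0, 0, 1]"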
2.2. Pre-trained word embedding models

Using pre-trained models is often an optimal solution for deep learning tasks. A pre-trained model is a model developed and trained by someone else to solve a specific problem on chosen data [37]. Using pre-trained models saves the time otherwise spent on training a model or searching for an efficient neural network architecture. The two main ways to use a pre-trained model are fixed feature extraction and fine-tuning of the model, adapting it to the problem at hand [19].

The fine-tuning of the model is done in one step. Fig. 3 represents the process in which each user-generated comment is classified for hate speech detection by a fine-tuned BERT model [20].

The feature-based approach involves two steps. First, each text, for example a user-generated comment, is represented as a sequence of words or subwords, and the embedding of each word or subword is computed using fastText or BERT models. Second, this sequence of embeddings forms the input to a neural network (NN) classifier, where the final decision regarding the label of the input text is made (Fig. 3) [20]. For this task a variety of deep neural network (DNN) architectures can be used, for example, a deep recurrent neural network (RNN) [31], a deep convolutional neural network (CNN) [33], a gated recurrent unit (GRU) [3], a long short-term memory network (LSTM) [34], etc. The most suitable architecture is usually selected via experiments and by combining more than one architecture for the task.

Figure 3: Illustrative explanation of the feature-based and fine-tuning methodologies [20]
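As a hedged sketch of the fine-tuning approach, the snippet below loads a pre-trained BERT checkpoint with a classification head using the Hugging Face transformers library; the checkpoint name, the three-class label mapping and the omitted training loop are assumptions for illustration and do not reproduce the exact setup of the models compared in this paper.

    # Sketch of the fine-tuning approach: a pre-trained encoder plus a classification
    # head (assumed checkpoint and label mapping; the training loop itself is omitted).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "bert-base-uncased"  # assumed pre-trained checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

    # After fine-tuning on labelled comments, a user-generated comment is
    # classified in a single forward pass:
    comment = "an example user-generated comment"
    inputs = tokenizer(comment, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class = int(torch.argmax(logits, dim=-1))  # e.g. 0 = NOT, 1 = HATE, 2 = OFFN (assumed mapping)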
2.3. BERT embeddings

BERT (Bidirectional Encoder Representations from Transformers) was released in 2018 by Google AI Language researchers and features state-of-the-art performance on most NLP problems [25]. BERT can take one or two sentences as input and uses the special token [SEP] to separate them. The [CLS] token is always placed at the beginning of the text and is characteristic of classification tasks. These tokens are always required, even if there is only one sentence or if the BERT model is not used for a classification task [35], as they help the algorithm to distinguish between different sentences.

Thus, for the BERT model to be able to distinguish between words, there are normally three main steps. First, as mentioned above, the [CLS] and [SEP] tokens are added at the beginning and at the end of the sentence. Next, an index is specified for each word and, finally, zeros are added to sentences that are shorter than the longest sentence, i.e. the lengths of the sentences are made equal; this step is called padding [25]. Word embeddings are then used: each word is assigned one specific vector, and each value of these vectors represents one aspect of the word (Fig. 4).

Figure 4: Three steps before word embeddings [21]

BERT is based on the transformer architecture, therefore it uses the attention mechanism. Attention is a way of looking at the relationships between the words in each sentence, and it allows BERT to take into account a large, fixed-size amount of context, both to the left and to the right of a particular word [20]. Examining the working principle of BERT word embeddings, it can be seen that when an English word with an ambiguous meaning, for example crush, is given as input, the BERT model can recognise that this word has several different meanings (each word is embedded according to the context in which it is used). In contrast, Word2Vec or fastText based models give every word a single representation (one vector for all the different meanings of the word) [36].

In addition, BERT uses tokenization into word parts, or subwords. For example, the English word singing can be represented as two strings: sing and ing. The advantage of this is that when a word is not in the BERT dictionary, it can be split into parts to produce embeddings for rare words [20]. This type of embeddings was used in all 3 chosen hate speech detection models.
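The special tokens, padding and subword splitting described above can be inspected directly with a BERT tokenizer; the following sketch uses the Hugging Face transformers library with an assumed bert-base-uncased vocabulary, so the exact subword splits shown in the comments are indicative rather than guaranteed.

    # Inspecting BERT tokenization: special tokens, padding and subword pieces
    # (assumed `bert-base-uncased` vocabulary; exact splits depend on the vocabulary).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    encoded = tokenizer("The cat sat on the mat", padding="max_length", max_length=12)
    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
    # indicative output: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]', '[PAD]', ...]

    # A word missing from the vocabulary is split into subword pieces, e.g.:
    print(tokenizer.tokenize("unbelievability"))  # indicative: ['un', '##bel', ...] or a similar split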
2.4. Selected hate speech detection models

For our experiment we selected three BERT models which were pre-trained differently for the hate speech recognition task:

• BERT-HateXplain (available at https://github.com/hate-alert/HateXplain)
• HateBERT (available at https://github.com/tommasoc80/HateBERT)
• BERT (available at https://github.com/google-research/bert)

The selected models were trained on different datasets and used for classifying texts as either hate speech, offensive or non-hate. The BERT model was trained using tweets from Twitter [30]. BERT-HateXplain was also trained using Twitter and, additionally, Gab (an American microblogging and social networking service, https://gab.com); moreover, human rationales were included as part of its training data to boost performance [29]. The HateBERT model was trained using RAL-E, the Reddit Abusive Language English dataset [30].

2.5. Inter-annotator agreement

In linguistics, inter-annotator agreement is a formal means of comparing annotator performance in terms of reliability [26]. The annotation guidelines define a correct annotation for each relevant instance. As the actual annotations are created by the annotators, there is no reference dataset against which to check whether the annotations are correct. Therefore, common practice is to check the reliability of the annotation process, assuming that if the annotation process is not reliable, the annotations cannot be expected to be correct.

For our experiment, we chose inter-annotator agreement to evaluate how the selected hate speech detection models "agree" in terms of annotation of hate speech instances. We selected Cohen's kappa, Fleiss' kappa and Accuracy as metrics.

Accuracy is one of the metrics for evaluating classification models. With more than two classes, it is calculated as the number of correctly predicted samples in the test set divided by the number of all predictions made on the test set [39]:

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \quad (1) \]

Cohen's kappa is commonly used for measuring the degree of agreement between two raters on a nominal scale. This coefficient also controls for random agreement [28]. Cohen's kappa has the value 1 for perfect agreement between the raters and the value 0 for random agreement. As we compared more than 2 models ("annotators"), we used pairwise Cohen's kappa (2). Fleiss' kappa (3) is used for analyzing agreement between more than two raters rating nominal categories [27]; its value for perfect agreement is 1, while 0 marks random agreement.

\[ \text{cohen}(\kappa) = \frac{p_0 - p_e}{1 - p_e} \quad (2) \]

\[ \text{fleiss}(\kappa) = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \quad (3) \]
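As a hedged illustration of equations (1)–(3), the sketch below computes accuracy, pairwise Cohen's kappa and Fleiss' kappa over toy annotations using scikit-learn and statsmodels; the toy labels and the use of these libraries (instead of the disagree library used in the experiment) are assumptions for illustration only.

    # Toy illustration of accuracy (eq. 1), pairwise Cohen's kappa (eq. 2) and
    # Fleiss' kappa (eq. 3); labels and libraries are assumptions, not the paper's pipeline.
    from itertools import combinations

    import numpy as np
    from sklearn.metrics import accuracy_score, cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Toy annotations (0 = NOT, 1 = HATE, 2 = OFFN); one list per "annotator".
    annotations = {
        "gold":            [0, 1, 0, 2, 0, 1],
        "BERT":            [0, 1, 0, 0, 0, 1],
        "HateBERT":        [0, 1, 0, 0, 0, 2],
        "BERT-HateXplain": [0, 0, 0, 2, 0, 0],
    }

    for name in ["BERT", "HateBERT", "BERT-HateXplain"]:
        print(name, "accuracy:", accuracy_score(annotations["gold"], annotations[name]))

    # Pairwise Cohen's kappa between all "annotators".
    for a, b in combinations(annotations, 2):
        print(f"Cohen's kappa ({a} vs {b}): {cohen_kappa_score(annotations[a], annotations[b]):.3f}")

    # Fleiss' kappa over all four "annotators".
    ratings = np.array(list(annotations.values())).T  # items x raters
    table, _ = aggregate_raters(ratings)              # items x categories counts
    print("Fleiss' kappa:", round(fleiss_kappa(table), 3))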
3. Data

For the model comparison we used the English dataset from the HASOC 2019 shared task (available at https://hasocfire.github.io/hasoc/2019/dataset.html). The data source is Twitter, and the data was sampled using keywords or hashtags relevant for hate speech [15]. All the tweets were annotated by 2 annotators; when there was a mismatch between the annotators, the tweet was assigned to a third annotator. The dataset has been labelled with 5 classes:

• NOT - Non Hate / Non Offensive Content: posts with no hate, profane or offensive content.
• HOF - Hate Speech and Offensive Language: posts with hate, offensive or profane content.
• HATE - Hate Speech: posts containing hateful content.
• OFFN - Offensive Language: posts containing offensive content.
• PRFN - Profane Language: posts containing profane words but no hate or offensive content.

We have chosen 3 of these classes for evaluation, namely NOT, HATE and OFFN, as these were the classes our selected models were trained to identify. The PRFN (profane language) class was merged with NOT (non hate / non offensive content) as it contains neither HATE (hate speech) nor OFFN (offensive language) content [32]. The number of records assigned to each class is shown in Table 1.

The dataset has 2 subsets: a training subset (5852 posts) and a test subset (1153 posts). We performed the evaluation on these subsets with the different models separately.

Table 1
Distribution of classes

Data subset   NOT posts   HATE posts   OFFN posts
Training      4042        1443         667
Testing       958         124          71

4. Results

We used the disagree library (available at https://github.com/o-P-o/disagree/) developed for the Python programming language. It was used to calculate the number of disagreements between the three models and the expert annotations (a counting sketch is given at the end of this section). This makes it easier to understand how hate speech is treated by each of the selected models. After reviewing the data, it was found that most coincidences occur in comments marked as NOT (non hate speech / non offensive language). For the models and the experts, it is easier to distinguish these types of comments because of the large number of comments with this label in the dataset. The biggest discrepancies are observed where the content contains hate speech (HATE) (Table 2).

Table 2
Disagreements in annotations

Data subset   All    2 do not match   3 do not match
Training      2919   2556             377
Testing       802    298              52

After calculating the Accuracy of the models, it was observed that the BERT-HateXplain model has the highest estimate, reaching almost 68 percent on the training subset. Accuracy becomes even higher on the testing subset, where it stands at nearly 82 percent. However, the models do not differ by a large margin, as the HateBERT model reached 77 percent and the BERT model, with 75 percent, had the lowest Accuracy on the testing subset (Fig. 5 and Fig. 6).

Figure 5: Accuracy on the training subset

Figure 6: Accuracy on the testing subset

From the results obtained, it can be seen that with a larger amount of data the Accuracy percentage drops. It is also important to note that there is a small number of OFFN (offensive) and HATE (hate speech) comments in the test data subset, and for that reason it is easier for a model to achieve higher accuracy there.

Cohen's kappa quantifies how consistently two evaluators (annotators) rate the same items; it is a measure of reliability adjusted for chance agreement. A coefficient value of 0 means that the consensus of the evaluators is random, and 1 means that the evaluators fully agree [26]. The statistic can also be negative, which can occur by chance if there is no relationship between the ratings of the two raters, or it may reflect a real tendency of the raters to give differing ratings [22].

When this metric is applied to the models, the best result was obtained between the BERT and HateBERT models. The BERT-HateXplain model has a coefficient of almost 0 (0.007), indicating that most of its consensus is random and that the model is not reliable, even though its Accuracy is high. However, all models have a relatively low Cohen's kappa coefficient (Fig. 7 and Fig. 8), therefore it would be incorrect to rely on the results of these models for automated hate speech detection without taking their limitations into account.

Figure 7: Cohen's kappa for the training subset

Figure 8: Cohen's kappa for the testing subset

We have also calculated the Fleiss' kappa coefficient, which extends Cohen's kappa to the case where the annotations of more than two evaluators are compared. A comparison of the expert annotations and the 3 selected models gave an estimate of 0.122 on the training subset and 0.163 on the testing subset. According to [23], such a Fleiss' kappa value corresponds to slight agreement.

The results showed that the selected models, namely BERT, HateBERT and BERT-HateXplain, which are trained on English datasets, are not very reliable. Although the selected models are popular in hate speech research, when evaluated with the selected inter-annotator agreement metrics, their performance is not sufficient to solve hate speech detection tasks.
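A minimal sketch of the disagreement counts referred to at the start of this section is given below, assuming four parallel label lists (gold plus three models) and plain Python instead of the disagree library; the grouping into "2 do not match" and "3 do not match" is our reading of Table 2 and may differ from the exact definition used in the experiment.

    # Counting items on which the four "annotators" disagree (hedged sketch;
    # the exact counting rule behind Table 2 may differ).
    from collections import Counter

    def disagreement_profile(gold, bert, hatebert, hatexplain):
        """Return how many items have 2, 3 or 4 distinct labels across annotators."""
        profile = Counter()
        for labels in zip(gold, bert, hatebert, hatexplain):
            distinct = len(set(labels))
            if distinct > 1:
                profile[distinct] += 1
        return profile

    # Toy labels (0 = NOT, 1 = HATE, 2 = OFFN) just to exercise the function.
    profile = disagreement_profile([0, 1, 0, 2], [0, 1, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0])
    print(dict(profile))  # e.g. {3: 1, 2: 2}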
5. Conclusions and future plans

In this paper, we presented an inter-annotator agreement analysis for hate speech detection between three different BERT models using the HASOC 2019 dataset. The experimental results showed that it is not correct to rely only on the Accuracy metric, even if the Accuracy percentage is high, because reliability may still be low. To check whether a model is reliable we chose Cohen's kappa and Fleiss' kappa. Among the selected models, we found that the highest Accuracy was achieved with the BERT-HateXplain model; even so, its Cohen's kappa estimate was almost 0, which means that the model's results were random and not reliable for real-life use. However, comparing the BERT and HateBERT models we saw that their annotations are quite similar, and their Cohen's kappa result suggests that similar neural network architectures can deliver not only high accuracy, but also correlated results and reliability. As for Fleiss' kappa, a comparison of the expert annotations and the three selected models gave an estimate of slight agreement (0.122 for the training subset and 0.163 for the testing subset), confirming that high Accuracy can go together with low reliability of a model.

Our future plans include wider model testing with different annotation schemes (e.g. distinguishing profane language, sexist language, misogyny, etc.) and data sources as well. We also plan to test models for different languages, e.g. Russian, Spanish, German, French, etc. We plan to use the knowledge gained from this experiment for developing a hate speech detection model for the Lithuanian language as well.

References

[1] Reuters, "Why Facebook is losing the war on hate speech in Myanmar," 2018. [Online]. Available: https://www.reuters.com/investigates/special-report/myanmar-facebook-hate/. [Accessed: 10-Mar-2022].
[2] M. A. Rizoiu, T. Wang, G. Ferraro, and H. Suominen, "Transfer learning for hate speech detection in social media," arXiv preprint arXiv:1906.03829, 2019.
[3] Z. Zhang, D. Robinson, and J. Tepper, "Detecting hate speech on Twitter using a convolution-GRU based deep neural network," in European Semantic Web Conference, Springer, pp. 745–760, 2018.
[4] B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, and M. Wojatzki, "Measuring the reliability of hate speech annotations: The case of the European refugee crisis," arXiv e-prints, arXiv-1701, 2017.
[5] P. Fortuna, J. Soler, and L. Wanner, "Toxic, hateful, offensive or abusive? What are we really classifying? An empirical analysis of hate speech datasets," in Proceedings of the 12th Language Resources and Evaluation Conference, pp. 6786–6794, 2020.
[6] T. Davidson, D. Warmsley, M. Macy, and I. Weber, "Automated hate speech detection and the problem of offensive language," in Proceedings of the Eleventh International Conference on Web and Social Media, AAAI, pp. 512–515, 2017.
[7] A. Schmidt and M. Wiegand, "A survey on hate speech detection using natural language processing," in Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Association for Computational Linguistics (ACL), pp. 1–10, 2017.
[8] P. Fortuna and S. Nunes, "A survey on automatic detection of hate speech in text," ACM Computing Surveys (CSUR), 51(4), pp. 1–30, 2018.
[9] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, and M. Zampieri, "Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech," in Forum for Information Retrieval Evaluation, pp. 1–3, 2021.
[10] K. Madukwe, X. Gao, and B. Xue, "In data we trust: A critical analysis of hate speech detection datasets," in Proceedings of the Fourth Workshop on Online Abuse and Harms, pp. 150–161, 2020.
[11] Z. Waseem and D. Hovy, "Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter," in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics (ACL), pp. 88–93, 2016.
[12] Z. Zhang and L. Luo, "Hate speech detection: A solved problem? The challenging case of long tail on Twitter," Semantic Web, 10(5), pp. 925–945, 2019.
[13] H. Watanabe, M. Bouazizi, and T. Ohtsuki, "Hate speech on Twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection," IEEE Access, 6, pp. 13825–13835, 2018.
[14] M. Wiegand, J. Ruppenhofer, and T. Kleinbauer, "Detection of abusive language: The problem of biased datasets," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 602–608, 2019.
[15] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, and A. Patel, "Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages," in Proceedings of the 11th Forum for Information Retrieval Evaluation, pp. 14–17, 2019.
[16] Y. Li and Y. Tao, "Word embedding for understanding natural language: A survey," in Guide to Big Data Applications, Springer, Cham, pp. 83–104, 2018.
[17] A. Pogiatzis, "NLP: Contextualized word embeddings from BERT," Medium, 20-Mar-2019. [Online]. Available: https://towardsdatascience.com/nlp-extract-contextualized-word-embeddings-from-bert-keras-tf-67ef29f60a7b. [Accessed: 25-Mar-2022].
[18] D. Becker, "Using categorical data with one hot encoding," Kaggle, 22-Jan-2018. [Online]. Available: https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding. [Accessed: 25-Mar-2022].
[19] Y. Zhou and V. Srikumar, "A closer look at how fine-tuning changes BERT," 2021. [Online]. Available: https://arxiv.org/pdf/2106.14282.pdf. [Accessed: 25-Mar-2022].
[20] A. G. D'Sa, I. Illina, and D. Fohr, "BERT and fastText embeddings for automatic detection of toxic speech," in 2020 International Multi-Conference on "Organization of Knowledge and Advanced Technologies" (OCTA), 2020.
[21] M. Mirshafiee, "Step by step introduction to word embeddings and BERT embeddings," Medium, 07-Oct-2020. [Online]. Available: https://mitra-mirshafiee.medium.com/step-by-step-introduction-to-word-embeddings-and-bert-embeddings-1779c8cc643e. [Accessed: 25-Mar-2022].
[22] J. Sim and C. C. Wright, "The kappa statistic in reliability studies: Use, interpretation, and sample size requirements," Physical Therapy, 85(3), pp. 257–268, 2005.
[23] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, 33(1), p. 159, 1977.
[24] P. Rodríguez, M. A. Bautista, J. Gonzàlez, and S. Escalera, "Beyond one-hot encoding: Lower dimensional target embedding," Image and Vision Computing, 75, pp. 21–31, 2018.
[25] W. Zhang, W. Wei, W. Wang, L. Jin, and Z. Cao, "Reducing BERT computation by padding removal and curriculum learning," in 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 90–92, 2021.
[26] R. Artstein, "Inter-annotator agreement," in Handbook of Linguistic Annotation, Springer, Dordrecht, pp. 297–313, 2017.
[27] L. Bartok and M. A. Burzler, "How to assess rater rankings? A theoretical and a simulation approach using the sum of the pairwise absolute row differences (PARDs)," Journal of Statistical Theory and Practice, 14(3), pp. 1–16, 2020.
[28] A. De Raadt, M. J. Warrens, R. J. Bosker, and H. A. L. Kiers, "Kappa coefficients for missing data," Educational and Psychological Measurement, 79(3), pp. 558–576, 2019.
[29] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, and A. Mukherjee, "HateXplain: A benchmark dataset for explainable hate speech detection," arXiv preprint arXiv:2012.10289, 2020. [Online]. Available: https://arxiv.org/abs/2012.10289. [Accessed: 29-Mar-2022].
[30] T. Caselli, V. Basile, J. Mitrović, and M. Granitzer, "HateBERT: Retraining BERT for abusive language detection in English," arXiv preprint arXiv:2010.12472, 2021. [Online]. Available: https://arxiv.org/abs/2010.12472. [Accessed: 29-Mar-2022].
[31] R. Alshaalan and H. Al-Khalifa, "Hate speech detection in Saudi Twittersphere: A deep learning approach," in Proceedings of the Fifth Arabic Natural Language Processing Workshop, pp. 12–23, 2020.
[32] S. Malmasi and M. Zampieri, "Challenges in discriminating profanity from hate speech," Journal of Experimental & Theoretical Artificial Intelligence, 30(2), pp. 187–202, 2018.
[33] M. A. Bashar and R. Nayak, "QutNocturnal@HASOC'19: CNN for hate speech and offensive content identification in Hindi language," arXiv preprint arXiv:2008.12448, 2020.
[34] G. L. De la Peña Sarracén, R. G. Pons, C. E. M. Cuza, and P. Rosso, "Hate speech detection using attention-based LSTM," EVALITA Evaluation of NLP and Speech Tools for Italian, 12, pp. 235–238, 2018.
[35] K. Ethayarajh, "How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings," arXiv preprint arXiv:1909.00512, 2019.
[36] F. K. Khattak, S. Jeblee, C. Pou-Prom, M. Abdalla, C. Meaney, and F. Rudzicz, "A survey of word embeddings for clinical text," Journal of Biomedical Informatics, 100, 100057, 2019.
[37] M. Mosbach, M. Andriushchenko, and D. Klakow, "On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines," arXiv preprint arXiv:2006.04884, 2020.
[38] M. K. Dahouda and I. Joe, "A deep-learned embedding technique for categorical features encoding," IEEE Access, 9, pp. 114381–114391, 2021.
[39] Google Developers, "Classification: Accuracy," Google. [Online]. Available: https://developers.google.com/machine-learning/crash-course/classification/accuracy. [Accessed: 14-Jun-2022].