Detecting Hate Speech on English and Indo-Aryan Languages with BERT and Ensemble Learning

Camilo Caparrós-Laiz1, José Antonio García-Díaz1 and Rafael Valencia-García1

1 Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100 Murcia, Spain

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
camilo.caparrosl@um.es (C. Caparrós-Laiz); joseantonio.garcia8@um.es (J. A. García-Díaz); valencia@um.es (R. Valencia-García)
ORCID: 0000-0002-5191-7500 (C. Caparrós-Laiz); 0000-0002-3651-2660 (J. A. García-Díaz); 0000-0003-2457-1791 (R. Valencia-García)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The increasing use of social media platforms enables communication between people around the world, including those with conflicting ideologies and cultures. However, some people resort to offensive language instead of having a polite conversation, either due to a lack of education or with the intention of poisoning the debate. In this paper we analyze the results achieved by the UMUTeam applying BERT models, either on their own or combined with other popular models, in the HASOC'2021 shared task on identifying offensive language in English, Hindi and Marathi. Our best results are achieved with BERT for the English binary classification subtask (1A), in which we reached a macro F1-score of 80.13%, and with ensemble learning for the rest of the subtasks, reaching macro F1-scores of 62.89% in English (subtask 1B), 75.20% and 51.67% in Hindi (subtasks 1A and 1B, respectively), and 84.23% in Marathi.

Keywords
Hate-speech detection, Feature engineering, Transformers, Low-resource languages, Deep learning

1. Introduction

Social media platforms are places where people can communicate freely. However, anonymity and remote communication make it easier to have impolite, offensive or even hateful conversations. Although these platforms may have their own rules, heated arguments can go unnoticed or be wrongly censored. One way to mitigate this problem is to provide tools that tag offensive messages by means of methods that automatically detect offensive language.

The biggest challenge in offensive-speech detection is the ability to interpret language. Naive methods may rely on the identification of certain offensive keywords, but they do not take into account the context around the keyword. Modern approaches based on transformers are capable of capturing the context of the words, improving the accuracy of this kind of system.

The HASOC 2021 shared task [1] offered two subtasks. The first subtask is divided in two: a binary classification problem to identify whether a post contains hate, offensive or profane content [2, 3] (1A), and a multi-class classification problem to discriminate between hate, profane and offensive posts (1B). The second subtask is the Identification of Conversational Hate-Speech in Code-Mixed Languages (ICHCL). The dataset for these tasks contains tweets that may or may not be offensive in English, Hindi and Marathi. The UMUTeam only participated in subtasks 1A and 1B in all the proposed languages.
Our approach to solve these tasks relies on state-of-the-art models such as BERT [4], which is able to capture language features, combined with other linguistic features (LF) extracted from UMUTextStats [5, 6] and with ensemble learning approaches.

Several works regarding hate-speech and offensive-speech identification can be found in the literature. For example, in [7] the authors evaluated multiple models such as logistic regression, naive Bayes, decision trees and linear support vector machines (SVM). They found that logistic regression with L2 regularization, using 5-fold cross-validation and a parameter grid search, achieved the best results, with a precision of 91% and a recall of 90% for offensive language, and a precision of 44% and a recall of 61% for hate speech. Another related work is [8], in which the authors used a multi-view SVM to classify hate speech. They fit a linear SVM to each set of text features and, finally, combine those views with another linear SVM. This model achieved an accuracy of 80% on the Stormfront dataset and of 61% on the TRAC dataset. One benefit of the multi-view SVM approach is that it allows for some interpretability by identifying which classifier contributes the most. Other works have focused on specific types of hate speech, such as misogyny. For example, in [6] the authors compiled and evaluated a Spanish dataset regarding misogyny. They focused on identifying violence against relevant women, and they also analyzed differences between posts written in European Spanish and in Latin American Spanish. Most of the available datasets are in English, whereas many other languages do not have the same availability. Relevant work has been done to improve accuracy in such languages by using cross-lingual transfer learning, as in [9] for Bengali, Hindi and Spanish, and in [10] for German.

2. Methodology

We participated with three runs in the following subtasks: 1A and 1B (English), 1A and 1B (Hindi), and 1A (Marathi). Due to lack of time, we could not complete the first run for subtasks 1B (Hindi) and 1A (Marathi).

For the first run, we use BERT [4] with the HuggingFace1 transformers library [11]. This library provides an automatic framework to train and predict with a wide variety of models for supervised learning. The tokenizer and BERT model were bert-base-uncased for English and bert-base-multilingual-uncased for Hindi and Marathi. The BERT model is BertForSequenceClassification, which is used to classify sequences such as text. We divided each dataset into training and validation splits to estimate beforehand the performance of the models.

1 HuggingFace: https://huggingface.co/
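For illustration, a minimal sketch of this first run follows, assuming the transformers and datasets libraries. The file name (hasoc2021_train.csv), the text/label column names and the hyperparameter values are placeholders for illustration, not the official HASOC distribution or settings.

```python
# Minimal sketch of the first run: fine-tuning BertForSequenceClassification for
# subtask 1A (binary classification). File name, column names ("text", "label")
# and hyperparameters are placeholders, not the official HASOC settings.
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

MODEL = "bert-base-uncased"  # bert-base-multilingual-uncased for Hindi/Marathi

tokenizer = BertTokenizer.from_pretrained(MODEL)
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Load the task data and hold out a validation split, as described above.
data = load_dataset("csv", data_files={"train": "hasoc2021_train.csv"})["train"]
data = data.train_test_split(test_size=0.2, seed=42)

def tokenize(batch):
    # Fixed-length padding keeps the default data collator simple.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-hasoc", num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")
Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["test"]).train()
```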
For the second run, we combine the same BERT neural network with stylometric linguistic features [5, 6]. To do so, we extract from BERT the encoding of the [CLS] token, as suggested in [12]. Next, we use Keras to combine both feature sets, evaluating a total of 110 neural network models that differ in the number of hidden layers, the number of neurons, the learning rate, the batch size, and the activation function. We rank all the evaluated models by their macro F1-score on the validation split and select the one that achieves the best result.
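The sketch below illustrates this second run, assuming the [CLS] encodings and the linguistic features have already been extracted as NumPy arrays. The random placeholder data, the LF dimensionality (40) and the small grid are illustrative assumptions standing in for the real features and the 110 configurations actually evaluated.

```python
# Sketch of the second run: the BERT [CLS] encoding is concatenated with the
# stylometric linguistic features (LF) and fed to a small Keras network; the
# configurations are ranked by validation macro F1-score.
import numpy as np
from sklearn.metrics import f1_score
from tensorflow import keras
from tensorflow.keras import layers

def build_model(lf_dim, hidden=64, activation="relu", lr=1e-3, num_classes=2):
    cls_in = keras.Input(shape=(768,), name="bert_cls")       # [CLS] encoding
    lf_in = keras.Input(shape=(lf_dim,), name="linguistic_feats")
    x = layers.Concatenate()([cls_in, lf_in])                 # combine both views
    x = layers.Dense(hidden, activation=activation)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    model = keras.Model([cls_in, lf_in], out)
    model.compile(optimizer=keras.optimizers.Adam(lr),
                  loss="sparse_categorical_crossentropy")
    return model

# Placeholder arrays standing in for the real [CLS] encodings and LF vectors.
rng = np.random.default_rng(0)
X_cls_tr, X_cls_va = rng.normal(size=(800, 768)), rng.normal(size=(200, 768))
X_lf_tr, X_lf_va = rng.normal(size=(800, 40)), rng.normal(size=(200, 40))
y_tr, y_va = rng.integers(0, 2, 800), rng.integers(0, 2, 200)

# Small grid standing in for the 110 configurations; keep the best macro F1.
best_model, best_f1 = None, -1.0
for hidden in (32, 64, 128):
    for lr in (1e-3, 1e-4):
        m = build_model(lf_dim=40, hidden=hidden, lr=lr)
        m.fit([X_cls_tr, X_lf_tr], y_tr, epochs=5, batch_size=32, verbose=0)
        pred = m.predict([X_cls_va, X_lf_va], verbose=0).argmax(axis=1)
        macro_f1 = f1_score(y_va, pred, average="macro")
        if macro_f1 > best_f1:
            best_model, best_f1 = m, macro_f1
```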
For the third run, we prepare an ensemble of deep-learning classifiers based on the weighted mode. To calculate the importance of each classifier in the final output, we rank each model by its F1-score on the validation dataset. The neural network models involved are the following: (1) linguistic features, (2) sentence embeddings from fastText2 in English [13] and in Hindi and Marathi [14], (3) word embeddings from GloVe and fastText evaluated with convolutional and recurrent neural networks, and (4) BERT (as described for the first and second runs). For each feature set, we evaluated 110 neural network models in order to decide the best hyperparameters.

2 fastText: https://fasttext.cc/
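A minimal sketch of the weighted-mode voting is given below; the per-model predictions and weights are toy values standing in for the real classifiers and their validation F1-scores.

```python
# Sketch of the third run: a weighted-mode ensemble. Each classifier votes for a
# label, the vote is weighted by that model's validation F1-score, and the label
# with the highest total weight wins.
import numpy as np

def weighted_mode(predictions, weights, num_classes):
    """predictions: (n_models, n_samples) int labels; weights: (n_models,)."""
    n_samples = predictions.shape[1]
    scores = np.zeros((n_samples, num_classes))
    for preds, w in zip(predictions, weights):
        scores[np.arange(n_samples), preds] += w  # add each model's weighted vote
    return scores.argmax(axis=1)

# Toy example: three classifiers (e.g., LF, fastText, BERT) voting on 4 samples.
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 1]])
weights = np.array([0.67, 0.72, 0.75])  # validation F1-score of each model
print(weighted_mode(preds, weights, num_classes=2))  # -> [0 1 0 0]
```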
3. Results

In this section, we show and discuss the results obtained by our methods. Each table presents the three best methods of the overall results and our three best methods (UMUTeam), sorted by the macro F1-score reported in the official HASOC leaderboard. Each table corresponds to a subtask (see Tables 1 and 2 for English, Tables 3 and 4 for Hindi, and Table 5 for Marathi).

Table 1
Results over English subtask 1A

Team               Method            Macro F1
NLP-CIC            -                 0.8305
HUNLP              -                 0.8215
neuro-utmn-thales  -                 0.8199
UMUTeam            Run 1 - BERT      0.8013
UMUTeam            Run 3 - Ensemble  0.7959
UMUTeam            Run 2 - BERT+LF   0.7933

Table 2
Results over English subtask 1B

Team               Method            Macro F1
NLP-CIC            -                 0.6657
neuro-utmn-thales  -                 0.6577
HASOC21rub         -                 0.6482
UMUTeam            Run 3 - Ensemble  0.6289
UMUTeam            Run 2 - BERT+LF   0.6219
UMUTeam            Run 1 - BERT      0.3751

Table 3
Results over Hindi subtask 1A

Team               Method            Macro F1
t1                 -                 0.7825
Super Mario        -                 0.7797
Hasnuhana          -                 0.7797
UMUTeam            Run 3 - Ensemble  0.7520
UMUTeam            Run 1 - BERT      0.7185
UMUTeam            Run 2 - BERT+LF   0.6724

Table 4
Results over Hindi subtask 1B

Team               Method            Macro F1
NeuralSpace        -                 0.5603
SATLab             -                 0.5586
hate-busters       -                 0.5582
UMUTeam            Run 2 - Ensemble  0.5167
UMUTeam            Run 1 - BERT+LF   0.4889

Table 5
Results over Marathi subtask 1A

Team               Method            Macro F1
WLV-RIT            -                 0.9144
neuro-utmn-thales  -                 0.8808
Hasnuhana          -                 0.8756
UMUTeam            Run 2 - Ensemble  0.8423
UMUTeam            Run 1 - BERT+LF   0.8402

Our best result for the English subtask 1A is 80.13%, achieved by the BERT model. This result is very close to the best overall result (83.05%). The ensemble and BERT+LF models also obtained results close to the top, with 79.59% and 79.33%, respectively. For the English subtask 1B we achieved 62.89% with the ensemble model. Again, this result is very close to the top result (66.57%), and our second run achieved a similar score (62.19%). However, BERT alone performed poorly (37.51%). For the Hindi subtask 1A, our best model achieved a macro F1-score of 75.20% with the ensemble, followed by 71.85% for the BERT model and 67.24% for the BERT+LF model. The results for the Hindi subtask 1B are 51.67% for the ensemble model and 48.89% for the BERT+LF model. Lastly, for the Marathi subtask 1A we achieved a macro F1-score of 84.23% with the ensemble model and 84.02% with the BERT+LF model.

We notice that, in general, the ensemble learning model performs best in all tasks except the English subtask 1A, in which BERT performs better. This finding suggests that the feature sets and the neural network models are complementary, as their combination outperforms the results achieved by BERT alone. We can also observe that our results on the English subtask 1A are quite close to the top-performing methods overall. In summary, we achieved near state-of-the-art results in the English subtasks and decent results in Hindi and Marathi. Transformer models such as BERT are thus suitable for English hate-speech and offensive-language detection tasks. However, performance seems worse for other languages such as Hindi, which is quite different from English. This may be due to two reasons: (1) the multilingual model may not fit these languages as well as it fits others [15, 16], and (2) the lack of pre-trained models for the specific language [17].

4. Conclusions and further work

In this work we had the opportunity to take part in the HASOC'2021 shared task on the automatic detection of hate speech in various languages. We achieved relevant results in some of the proposed tasks, reaching a macro F1-score of 84.23% with an ensemble model for hate content detection in Marathi and of 51.67% for discriminating between hate, profane and offensive content in Hindi, also with an ensemble model. Although these are good results, we consider that improvements can be made in order to reach better results. We propose the creation of more pre-trained models for more languages, including Hindi and Marathi. We also consider analyzing other models that could improve our results in these tasks. Besides, we will pay special attention to language models and linguistic features capable of extracting patterns from figurative language [18], in which words and expressions shift from their literal meaning. To this end, we will compile datasets from satirical media and analyze linguistic devices such as sarcasm and irony, similar to the works described in [19].

Acknowledgments

This research paper is part of the research project PID2019-107652RB-I00 funded by MCIN/AEI/10.13039/501100011033. In addition, José Antonio García-Díaz has been supported by Banco Santander and the University of Murcia through the industrial doctorate programme.

References

[1] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri, Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021, ACM, 2021.
[2] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Satapara, P. Majumder, J. Schäfer, T. Ranasinghe, M. Zampieri, D. Nandini, A. K. Jaiswal, Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021. URL: http://ceur-ws.org/.
[3] S. Gaikwad, T. Ranasinghe, M. Zampieri, C. M. Homan, Cross-lingual offensive language identification for low resource languages: The case of Marathi, in: Proceedings of RANLP, 2021.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] J. A. García-Díaz, M. Cánovas-García, R. Valencia-García, Ontology-driven aspect-based sentiment analysis classification: An infodemiological case study regarding infectious diseases in Latin America, Future Generation Computer Systems 112 (2020) 614-657. doi:10.1016/j.future.2020.06.019.
[6] J. A. García-Díaz, M. Cánovas-García, R. Colomo-Palacios, R. Valencia-García, Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings, Future Generation Computer Systems 114 (2021) 506-518. URL: http://www.sciencedirect.com/science/article/pii/S0167739X20301928. doi:10.1016/j.future.2020.08.032.
[7] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, Proceedings of the International AAAI Conference on Web and Social Media 11 (2017) 512-515. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/14955.
[8] S. MacAvaney, H.-R. Yao, E. Yang, K. Russell, N. Goharian, O. Frieder, Hate speech detection: Challenges and solutions, PLOS ONE 14 (2019) 1-16. URL: https://doi.org/10.1371/journal.pone.0221152. doi:10.1371/journal.pone.0221152.
[9] T. Ranasinghe, M. Zampieri, Multilingual offensive language identification with cross-lingual embeddings, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 5838-5844. URL: https://aclanthology.org/2020.emnlp-main.470. doi:10.18653/v1/2020.emnlp-main.470.
[10] I. Bigoulaeva, V. Hangya, A. Fraser, Cross-lingual transfer learning for hate speech detection, in: Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, Association for Computational Linguistics, Kyiv, 2021, pp. 15-25. URL: https://aclanthology.org/2021.ltedi-1.3.
[11] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, HuggingFace's transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019). URL: http://arxiv.org/abs/1910.03771. arXiv:1910.03771.
[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, 2019. arXiv:1908.10084.
[13] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training distributed word representations, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[14] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157 languages, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[15] S. Wu, M. Dredze, Are all languages created equal in multilingual BERT?, CoRR abs/2005.09093 (2020). URL: https://arxiv.org/abs/2005.09093. arXiv:2005.09093.
[16] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, CoRR abs/1906.01502 (2019). URL: http://arxiv.org/abs/1906.01502. arXiv:1906.01502.
[17] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, S. Pyysalo, Multilingual is not enough: BERT for Finnish, CoRR abs/1912.07076 (2019). URL: http://arxiv.org/abs/1912.07076. arXiv:1912.07076.
[18] M. del Pilar Salas-Zárate, G. Alor-Hernández, J. L. Sánchez-Cervantes, M. A. Paredes-Valverde, J. L. García-Alcaraz, R. Valencia-García, Review of English literature on figurative language applied to social networks, Knowl. Inf. Syst. 62 (2020) 2105-2137. URL: https://doi.org/10.1007/s10115-019-01425-3. doi:10.1007/s10115-019-01425-3.
[19] M. del Pilar Salas-Zárate, M. A. Paredes-Valverde, M. Á. Rodríguez-García, R. Valencia-García, G. Alor-Hernández, Automatic detection of satire in Twitter: A psycholinguistic-based approach, Knowl. Based Syst. 128 (2017) 20-33. URL: https://doi.org/10.1016/j.knosys.2017.04.009. doi:10.1016/j.knosys.2017.04.009.