Using Only Character Ngrams for Hate Speech and Offensive Content Identification in Five Low-Resource Languages

Yves Bestgen¹
¹ Laboratoire d'analyse statistique des textes - Statistical Analysis of Text Laboratory (LAST - SATLab), Université catholique de Louvain, 10 place Cardinal Mercier, Louvain-la-Neuve, 1348, Belgium

Abstract
This paper describes the system proposed by the SATLab for hate speech and offensive content identification in five low-resource languages. This language-agnostic system applies classical supervised learning to character n-grams, using no data other than the learning material. After optimizing a series of parameters, it ranked first in the Bodo task and second in the Gujarati task, for which the learning material contained only 200 tweets. It also performed well in the Sinhala and Assamese tasks, but was outperformed by several systems in the Bengali task.

Keywords
Character ngrams, logistic regression, gradient boosting decision tree, low-resource languages

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
yves.bestgen@uclouvain.be (Y. Bestgen)
https://perso.uclouvain.be/yves.bestgen (Y. Bestgen)
ORCID: 0000-0001-7407-7797 (Y. Bestgen)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

This year, the SATLab team took part in five tasks proposed by HASOC 2023, the fifth edition of the challenge on Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages [1]. Identifying offensive content on the Internet is both a crucial and a particularly complex task. It is crucial because an increasingly large proportion of humanity is informed via the Internet, and because these same people have, a priori, the possibility of disseminating any content they wish. It is therefore very easy to disseminate hateful and offensive content that could harm a large number of users. The sheer volume of content disseminated makes monitoring difficult, especially in languages with limited linguistic resources. HASOC aims to promote the development of automatic techniques for such resource-poor languages [2].

In this NLP field, as in many others, deep learning and pre-computed embeddings are the preferred solutions, even for low-resource languages [2, 3]. Despite this, the SATLab presented, at the two previous HASOC editions, a language-agnostic system using only character ngrams as features and no other linguistic resources [4, 5]. This approach achieved excellent results, particularly for languages with few linguistic resources. As HASOC 2023 is dedicated to this type of language, the same system was proposed.

A first important feature of this 2023 edition of HASOC is that it includes languages that have never before been the subject of this type of challenge, such as Sinhala, Bengali, Bodo and Assamese. The proposed system achieved excellent results for Bodo, finishing first, and for Sinhala, finishing fourth but very close to the best teams. A second important feature of HASOC 2023 is that for one of the languages, Gujarati, the learning material contained just 200 tweets. The organizers expected participants to explore various techniques to improve the system in a few-shot setting. The system developed by the SATLab does not use any information sources other than the learning material. Nevertheless, it achieved an excellent second place for this subtask, with a Macro-F1 of 0.8383, 0.0105 points behind the best team.

The remainder of this paper presents the five tasks in which the SATLab took part and the challenge rules. Next, the general characteristics of the proposed systems are described. Finally, the results obtained are discussed.

2. Tasks and Challenge Rules

All the tasks involved identifying hate and offensive content in short messages posted on the Internet, such as tweets or YouTube comments. For each of them, the system was asked to distinguish between messages that included offensive language such as insulting, hurtful, derogatory or obscene content (HOF) and messages that did not (NOT). All the task languages in which the SATLab participated are poorly endowed with linguistic resources. In Task 1A [6], the tweets were written in Sinhala, an official language of Sri Lanka spoken by just under 20 million people [7], while Task 1B focused on Gujarati, an official language of India spoken by approximately 50 million people. The learning material for Task 1A consisted of 7,500 instances and the test material of 2,500 instances. For Task 1B, there were only 200 tweets for learning and 1,196 for testing. Task 4 [8] involved three Indian languages and thus three subtasks: Assamese, Bengali [9] and Bodo. The learning material for Assamese consisted of 4,036 instances and the test material of 1,009 instances. For Bengali, there were 1,281 instances for learning and 320 for testing. For Bodo, there were 1,679 instances for learning and 420 for testing.

During the test phase of the challenge, five runs could be submitted for Tasks 1A and 1B, while for the three subtasks of Task 4, five runs could be submitted every day for more than a fortnight. The measure used to evaluate the systems is the Macro-F1 score.
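Macro-F1 is the unweighted mean of the per-category F1 scores, so the rarer category (usually HOF) counts as much as the more frequent one. As a purely illustrative sketch, not part of the author's code, it can be computed with scikit-learn as follows (the label values below are made up):

```python
from sklearn.metrics import f1_score

# Gold labels and predictions for the two HASOC categories
# (illustrative values, not taken from the task data).
gold = ["HOF", "NOT", "NOT", "HOF", "NOT", "NOT"]
pred = ["HOF", "NOT", "HOF", "HOF", "NOT", "NOT"]

# Macro-F1: compute F1 for HOF and for NOT separately, then average
# the two scores without weighting by category frequency.
macro_f1 = f1_score(gold, pred, average="macro")
print(f"Macro-F1 = {macro_f1:.4f}")
```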
3. Systems

The systems proposed for the five tasks are all derived from the system that achieved excellent results in the 2021 and 2022 editions of HASOC. These are supervised approaches based only on the learning material provided by the task organizers. Two supervised procedures were used: the LIBLinear L2-regularized logistic regression model (dual form, option -s 7) for classification [10] and the LightGBM gradient boosting decision tree approach [11]. Since the only features used to categorize instances are character ngrams, this approach can be applied to any language, including the five in this challenge. It is very simple to deploy, as it requires no language-specific resources. It is also powerful, because a series of parameters can be optimized by cross-validation on the learning material. The remainder of this section first presents the parameters affecting feature extraction and then those affecting the supervised learning procedures.

3.1. Feature Extraction

Of the many parameters evaluated for the 2021 and 2022 editions of HASOC, the following two were retained:

• The maximum length of the ngrams, which could vary from 4 to 7. In all cases, all ngrams shorter than this maximum value were also used.
• The weighting applied to the frequency of each feature in an instance: sublinear TfIdf or BM25 [12].

In all the systems developed, the minimum frequency of a feature in the material analyzed was set at 2, and L2 normalization was applied to all the features of an instance.
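As an illustration, the following minimal sketch reproduces this extraction step with scikit-learn's TfidfVectorizer. It is an approximation of the setup described above, not the author's actual implementation: min_df counts instances rather than raw frequency, the maximum length of 5 is the value later chosen for Sinhala, and BM25 weighting has no TfidfVectorizer equivalent and would require a custom implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character ngrams of length 1 up to the tuned maximum (here 5),
# weighted with sublinear tf-idf and L2-normalized per instance.
vectorizer = TfidfVectorizer(
    analyzer="char",      # character ngrams, no tokenization needed
    ngram_range=(1, 5),   # all ngrams up to the maximum length
    sublinear_tf=True,    # 1 + log(tf) instead of the raw frequency
    min_df=2,             # approximation of the minimum-frequency-of-2 rule
    norm="l2",            # L2 normalization of each instance vector
)

# Tiny placeholder texts; the real input would be the learning material.
train_texts = ["this tweet is fine", "this tweet is not fine"]
X_train = vectorizer.fit_transform(train_texts)
```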
3.2. Learning Procedures

For the LIBLinear L2-regularized logistic regression model (dual form, option -s 7), three parameters were evaluated:

• The regularization parameter C.
• The -w1 option, which adjusts the parameter C for the HOF category.
• The bias parameter (-B), which shifts the separating hyperplane away from the origin.

LightGBM's parameters are far too numerous to present here. They were optimized by the automatic procedure described in Bestgen [13].

3.3. System Optimization

The parameters presented above were first optimized on the training material using a 4-fold cross-validation procedure stratified by category. Then, some of the trial runs allowed by the challenge rules were used to try to optimize these parameters for the test set. For each task, both the LIBLinear and LightGBM procedures were evaluated. For the challenge submissions, the whole training material was used.

4. Results

This section successively presents the results of the systems submitted for each of the five tasks.

Table 1
Macro-F1s for the best teams in Tasks 1A and 1B

Rank   1A: Sinhala      Macro-F1     1B: Gujarati       Macro-F1
1      FiRC-NLP         0.8382       FiRC-NLP           0.8488
2      Krispy Mango     0.8371       SATLab             0.8383
3      AiAlchemists     0.8355       Krispy Mango       0.7956
4      SATLab           0.8351       AiAlchemists       0.7926
5      Z-AGI Labs       0.8349       XAG-TUD            0.7799
6      NAVICK           0.8281       SSN_CSE_ML_TEAM    0.7732

4.1. Task 1A: Sinhala

The cross-validation procedure on the training material led to the choice of the following feature-extraction parameters: a maximum length of 5 characters and sublinear TfIdf. The cross-validation did not reveal any significant difference between the supervised learning procedures, so both approaches were evaluated on the test material. The system that performed best was an ensemble of three other models: a LIBLinear model (C=8, w1=1.8 and B=0.2) and two LightGBM models, the first based on the same features as the LIBLinear model and the second based on ngrams ranging from 1 to 7 characters. These three models obtained cross-validation Macro-F1 scores of 0.8018, 0.8289 and 0.8295 respectively. The best of them obtained 0.8304 on the test material. The ensemble of the three systems (majority vote) came fourth in the challenge with a Macro-F1 of 0.8351, just 0.0031 behind the best team, as shown in Table 1.

4.2. Task 1B: Gujarati

As a reminder, the training set for this task contained only 200 instances. The cross-validation procedure on this set led to the choice of the following feature-extraction parameters: a maximum length of 4 characters and sublinear TfIdf. Cross-validation showed that LightGBM (Macro-F1 = 0.75) was clearly more efficient than LIBLinear (Macro-F1 = 0.68, C = 3.5, B = 1). This observation was confirmed on the test material, albeit to a lesser extent, with Macro-F1 values of 0.8188 and 0.8130 respectively.

However, while precision and recall for the LightGBM version were almost identical, LIBLinear's precision (0.8890) was significantly higher than its recall (0.7840), suggesting that the system was assigning too few instances to the HOF category. On the basis of the probabilities of belonging to this category returned by LIBLinear, the proportion of HOFs in the prediction was increased by assigning to this category all instances with a probability greater than or equal to 0.43 (instead of the default value of 0.50), raising the proportion of HOFs in the predictions on the test material from 0.20 to 0.28. This simple trick, which increased recall to 0.83 while only reducing precision to 0.85, enabled the system to gain 0.02 points and take second place in the challenge with a Macro-F1 of 0.8383, 0.0105 points behind the best team and more than 0.04 ahead of the third-placed team (see Table 1).
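A minimal sketch of this threshold adjustment is given below. It uses scikit-learn's liblinear-backed logistic regression as a stand-in for the LIBLINEAR package itself; the C value and the 0.43 threshold follow the description above, while the feature matrices are random placeholders for the character-ngram features of the earlier sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random placeholders; in practice X_train/X_test would be the
# character-ngram matrices and y_train the gold HOF/NOT labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 50))
y_train = rng.choice(["HOF", "NOT"], 200)
X_test = rng.random((20, 50))

# liblinear-based logistic regression as a stand-in for LIBLINEAR -s 7
# (fit_intercept plays roughly the role of the -B bias option).
clf = LogisticRegression(solver="liblinear", C=3.5, fit_intercept=True)
clf.fit(X_train, y_train)

# Probability of belonging to the HOF category for each test instance.
hof_index = list(clf.classes_).index("HOF")
p_hof = clf.predict_proba(X_test)[:, hof_index]

# Lowering the decision threshold from the default 0.50 to 0.43 assigns
# more instances to HOF, trading a little precision for extra recall.
pred = np.where(p_hof >= 0.43, "HOF", "NOT")
```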
It would be interesting to compare the performance of these systems using bootstrap confidence intervals [14] to determine whether the differences between them are of any practical significance. When making such a comparison, it would be necessary to take into account the resources employed by each system [15]. The proposed approach, which uses only 200 tweets for training, is therefore very effective. However, there is a significant and unexpected difference between the performance on the training material, with a maximum Macro-F1 of 0.75 in cross-validation, and the performance on the test material, with a Macro-F1 of 0.84. This difference may be due to the fact that cross-validation training is carried out on only 150 tweets, whereas the test phase uses all 200.

4.3. Task 4: Assamese, Bengali and Bodo

The results for these three subtasks are presented together because the proposed systems are very similar. The LIBLinear procedure was used in each case. Table 2 shows the parameters derived from the cross-validation and the Macro-F1 achieved in cross-validation (CV) and on the test set. These systems ranked first for Bodo with a 0.006 lead over the second team, fourth for Assamese with a 0.019 difference from the first, and ninth for Bengali with a 0.10 difference from the first.

Table 2
Parameters and Macro-F1s for the three languages in Task 4

                                                   Macro-F1
Language   Ngram length   Weighting   C     w1     CV      Test    Rank
Assamese   6              BM25        6     0.72   0.689   0.715   4
Bengali    5              TfIdf       2.7   1.75   0.656   0.671   9
Bodo       7              TfIdf       10    1.15   0.817   0.857   1

5. Conclusion

The SATLab approach for identifying offensive content in short social network posts proved highly effective for four of the five languages (Bodo, Gujarati, Sinhala and Assamese), but much less so for the last one (Bengali), for which the difference with the best team is almost 0.10 in Macro-F1. The origin of this difference is unknown to me. Only a reading of the organizers' synthesis [1] could reveal whether there are differences between these tasks or between the systems presented to perform them.

The efficiency obtained for Gujarati is quite astonishing and unexpected for an approach that employs no resources other than the learning material, which for this language is limited to 200 instances. This result suggests that it would be interesting to repeat all the HASOC tasks proposed over the last five years and to determine, for each of them, the impact of the number of instances available for learning on performance in the test phase. To be honest, I doubt that such good results could be obtained for all of them. The difference in performance in Task 4 between Bodo and the other two languages also merits further analysis.

Acknowledgments

The author wishes to thank the organizers of this shared task for putting together this valuable event. He is a Research Associate of the Fonds de la Recherche Scientifique - FNRS (Fédération Wallonie Bruxelles de Belgique). Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Equipements de Calcul Intensif en Fédération Wallonie Bruxelles (CECI).
References

[1] T. Ranasinghe, K. Ghosh, A. S. Pal, A. Senapati, A. E. Dmonte, M. Zampieri, S. Modha, S. Satapara, Overview of the HASOC subtracks at FIRE 2023: Hate speech and offensive content identification in Assamese, Bengali, Bodo, Gujarati and Sinhala, in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023, ACM, 2023.

[2] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandalia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: P. Majumder, M. Mitra, S. Gangopadhyay, P. Mehta (Eds.), FIRE '19: Forum for Information Retrieval Evaluation, Kolkata, India, December 2019, ACM, 2019, pp. 14-17. doi:10.1145/3368567.3368584.

[3] T. Mandl, S. Modha, A. Kumar, B. R. Chakravarthi, Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: P. Majumder, M. Mitra, S. Gangopadhyay, P. Mehta (Eds.), FIRE 2020: Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, ACM, 2020, pp. 29-32. doi:10.1145/3441501.3441517.

[4] Y. Bestgen, A simple language-agnostic yet strong baseline system for hate speech and offensive content identification, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 1-10.

[5] Y. Bestgen, Confirming the effectiveness of a simple language-agnostic yet very strong system for hate speech and offensive content identification, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 1-6.

[6] S. Satapara, H. Madhu, T. Ranasinghe, A. E. Dmonte, M. Zampieri, P. Pandya, N. Shah, M. Sandip, P. Majumder, T. Mandl, Overview of the HASOC subtrack at FIRE 2023: Hate-speech identification in Sinhala and Gujarati, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.

[7] T. Ranasinghe, I. Anuradha, D. Premasiri, K. Silva, H. Hettiarachchi, L. Uyangodage, M. Zampieri, SOLD: Sinhala offensive language dataset, arXiv preprint arXiv:2212.00851 (2022).

[8] K. Ghosh, A. Senapati, A. S. Pal, Annihilate Hates (Task 4, HASOC 2023): Hate speech detection in Assamese, Bengali, and Bodo languages, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.

[9] N. Romim, M. Ahmed, H. Talukder, M. Saiful Islam, Hate speech detection in the Bengali language: A dataset and its baseline evaluation, in: M. S. Uddin, J. C. Bansal (Eds.), Proceedings of International Joint Conference on Advances in Computational Intelligence, Springer Singapore, Singapore, 2021, pp. 457-468.

[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871-1874.
[11] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 3146-3154. URL: http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf.

[12] Y. Bestgen, Optimizing a supervised classifier for a difficult language identification problem, in: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2021, pp. 96-101.

[13] Y. Bestgen, LAST at CMCL 2021 shared task: Predicting gaze data during reading with a gradient boosting decision tree approach, in: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Association for Computational Linguistics, Online, 2021, pp. 90-96. URL: https://aclanthology.org/2021.cmcl-1.10. doi:10.18653/v1/2021.cmcl-1.10.

[14] Y. Bestgen, Please, don't forget the difference and the confidence interval when seeking for the state-of-the-art status, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 5956-5962. URL: https://aclanthology.org/2022.lrec-1.640.

[15] J. Dodge, S. Gururangan, D. Card, R. Schwartz, N. A. Smith, Show your work: Improved reporting of experimental results, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2185-2194. URL: https://www.aclweb.org/anthology/D19-1224. doi:10.18653/v1/D19-1224.