Confirming the Effectiveness of a Simple Language-Agnostic Yet Very Strong System for Hate Speech and Offensive Content Identification

Yves Bestgen
Laboratoire d'analyse statistique des textes - Statistical Analysis of Text Laboratory (LAST - SATLab), Université catholique de Louvain, 10 place Cardinal Mercier, 1348 Louvain-la-Neuve, Belgium
yves.bestgen@uclouvain.be - https://perso.uclouvain.be/yves.bestgen - ORCID 0000-0001-7407-7797
Forum for Information Retrieval Evaluation, December 9-13, 2022, India

Abstract
At the 2021 edition of HASOC, the SATLab team proposed a very simple language-agnostic system for hate speech and offensive content identification. This system proved to be extremely effective for the two less-resourced languages of that edition, Hindi and Marathi. The present paper describes the use of the same system for Task 3 of the 2022 edition of HASOC, on hate speech and offensive content identification in Marathi. It consists of a logistic regression applied to character n-grams. It ranked fifth on subtask 3A (macro-F1 = 0.937), quite close to the top systems, second on subtask 3B (macro-F1 = 0.915), very close to the first one, and first on subtask 3C (macro-F1 = 0.961), more than 16 macro-F1 points ahead of the second-ranked system. These results confirm the effectiveness of the approach and suggest that studies evaluating different systems for this kind of problem should employ a character n-gram based approach as a baseline. They also suggest that the task itself may be quite easy, since all macro-F1s are greater than or equal to 0.915.

Keywords
Character n-grams, logistic regression, low-resource languages

1. Introduction

Hate speech and offensive content on the internet is a crucial problem. Insulting or obscene content can hurt many people, but also denigrate entire communities. It is therefore important that web players like Twitter or Facebook are able to identify such content quickly and efficiently. Only the development of automatic detection systems can achieve this. This is the objective of the HASOC evaluation campaigns on "Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages" [1, 2].

During the 2021 edition of HASOC, the SATLab proposed a very simple language-agnostic system for hate speech and offensive content identification [3]. This system proved to be extremely effective for the two less-resourced languages of that challenge, ending seventh on Hindi task 1, second on Hindi task 2, and fourth on Marathi. HASOC 2022 [4] extends this research path by proposing three subtasks for Marathi. It therefore seemed interesting to determine whether the approach proposed last year by the SATLab was as effective for this year's task. The following sections of this paper present the three subtasks and the datasets made available for this shared task, the system developed, and the results obtained, which confirm that the proposed approach is a very strong language-agnostic system for hate speech and offensive content identification.

2. Task and material

Task 3 of the HASOC 2022 shared task, in which the SATLab participated, consists of three subtasks in Marathi [5]. The first subtask (3A) requires discriminating between offensive (OFF) and non-offensive (NOT) tweets.
The second subtask (3B) requires predicting the type of offense, Targeted Insult (TIN) or Untargeted (UNT). Targeted insults explicitly target an individual, a group or something else, whereas untargeted tweets contain offensive language without any specific target. The third subtask (3C) focuses on the target of offenses, asking participants to discriminate between individuals (IND), groups (GRP) and others (OTH).

It should be noted that the material provided by the organizers is identical for all three subtasks. As a result, only tweets categorized as offensive in the gold labels of subtask 3A are categorized in subtask 3B, and only tweets categorized as targeted in the gold labels of subtask 3B are categorized according to target type in subtask 3C. The organizers consider subtasks 3B and 3C to be two- and three-class classification problems, probably because that is the basis on which they evaluate performance. For the participants, however, subtask 3B is a three-class problem: when they predict whether an offensive instance is a targeted (TIN) or an untargeted (UNT) insult, they have to do so on the complete material used for subtask 3A and therefore do not know which instances are offensive and which are not. Since systems are penalized if they assign one of the offensive labels to a "neutral" instance, it is indeed a three-category task. Similarly, subtask 3C is a four-class problem for the participants: when they predict whether a targeted (TIN) instance is aimed at an individual (IND), a group (GRP) or something else (OTH), they again have to do so on the complete material used for subtask 3A, without knowing which instances are offensive and targeted, and are penalized if they get it wrong.

No information about the measure of effectiveness was provided to participants during the learning phase. During the testing phase, it turned out to be macro-F1, but how this score was calculated was unknown. For these reasons, I do not report results on the learning set here.

In total, the organizers provided 3103 instances for learning and 510 for the test phase. Table 1 shows the distribution of the learning material over the different categories after the deletion of three problematic instances (Id 1865 contains no text but is nevertheless labeled offensive, Id 1981 is offensive but has no label for subtask 3B, and Id 2324 is targeted but has no label for subtask 3C). The distribution in the test material is not yet officially known, but a hypothesis is proposed in the results section.

Table 1
Dataset statistics on the learning set for the three subtasks. Note: IRR = irrelevant for that subtask, but present in the material and to be identified as such.

  Subtask 3A      Subtask 3B      Subtask 3C
  NOT  2034       IRR  2034       IRR  2361
  OFF  1066       TIN   739       IND   502
                  UNT   327       GRP   157
                                  OTH    80

3. Proposed system

The proposed system is a very simplified version of the one used for HASOC 2021 and the VarDial challenges [3, 6]. Its features are the following:

• It is based only on character n-grams observed at least twice in the material,
• The n-grams were one to five characters in length,
• The n-grams that start or end a tweet were marked as such,
• Their frequency in the tweet was weighted by means of BM25 [7, 8],
• The feature scores of each instance were normalized with the classical L2 (Euclidean) norm.

The parameters, such as the length of the character n-grams, were not set on the HASOC 2022 learning material, but taken directly from the system used for HASOC 2021.

These features were provided to the L1-regularized logistic regression of the LIBLINEAR package [9], with the -B 1 (bias) option, an approach that is very simple to use because it only requires optimizing the regularization parameter C and the -wi parameters, which adjust this C parameter for the different categories. The optimization of the parameters was performed independently for each subtask by means of an ANSI C program running several successive random grid searches in a 4-fold stratified cross-validation procedure [10]. This produced the parameter values given below; a different and independent model was thus built for each subtask.

• Subtask 3A: C = 4.5, w(NOT) = 1, w(OFF) = 2.1.
• Subtask 3B: C = 6, w(IRR) = 1, w(TIN) = 4, w(UNT) = 15 (IRR as defined in the note to Table 1).
• Subtask 3C: C = 46, w(IRR) = 1, w(IND) = 8.04, w(GRP) = 74, w(OTH) = 115.
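As an illustration of this setup, the following is a minimal sketch in Python with scikit-learn (whose liblinear solver is built on the same LIBLINEAR library), not the ANSI C pipeline actually used: the file and column names are invented, sublinear tf-idf is used as a rough stand-in for the BM25 weighting, and the boundary marking is approximated by padding each tweet with start and end symbols. The C value and class weights are those reported above for subtask 3A.

```python
# Minimal sketch of the pipeline described above (assumptions: Python/scikit-learn
# instead of the ANSI C + LIBLINEAR setup, tf-idf as a stand-in for BM25,
# hypothetical file and column names).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("hasoc2022_mr_train.csv")   # hypothetical file
texts = "<" + train["text"].astype(str) + ">"   # mark tweet start and end
labels = train["task_1"]                        # NOT / OFF labels for subtask 3A

model = make_pipeline(
    # Character n-grams of 1 to 5 characters, kept only if observed in at least
    # two tweets (approximating the "at least twice" threshold); sublinear
    # tf-idf with L2 normalization stands in for the BM25 + L2 scheme.
    TfidfVectorizer(analyzer="char", ngram_range=(1, 5), min_df=2,
                    sublinear_tf=True, norm="l2"),
    # L1-regularized logistic regression (liblinear); class_weight plays the
    # role of -wi, and the default intercept handling corresponds to -B 1.
    LogisticRegression(penalty="l1", solver="liblinear", C=4.5,
                       class_weight={"NOT": 1.0, "OFF": 2.1}),
)

# 4-fold stratified cross-validation with macro-F1, as in the tuning procedure.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")
print("macro-F1 per fold:", scores.round(3), "mean:", round(scores.mean(), 3))
```

For subtasks 3B and 3C, only the label column and the C and class-weight values listed above would change.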
The values of the tuned C and -wi parameters are quite diverse, raising concerns about overfitting when applying them to the test material. In particular, the -wi weights are strongly influenced by the imbalance of the data across categories, and there was no reason to believe that the same distribution would be found in the test material. For this reason, I tried another approach, based on a data augmentation procedure in which synthetic data are added to the less populated categories; such an approach has been shown to reduce the difficulties encountered when analyzing highly imbalanced data [11]. In the present challenge, it proved slightly beneficial in cross-validation on the training material, but useless and even detrimental on the test material. It will therefore not be described in detail here.

4. Official results

The submission site was excellent, as it was last year, but significant problems immediately arose in the scoring procedure. For this reason, I made only two of the five allowed submissions for each subtask. More than a week after the end of the challenge, the macro-F1 scores for subtask 3A were modified; it is these modified scores that are shown in Table 2, which also gives, for each subtask, the scores of two other top-ranked teams for comparison.

Table 2
Macro-F1 on the test set for the three subtasks. Note: R = rank.

  Task   R  Team            Score    R  Team            Score    R  Team              Score
  3A     1  ssncse_nlp      0.975    2  optimize prime  0.959    5  SATLab            0.937
  3B     1  hate-busters    0.921    2  SATLab          0.915    3  ssncse_nlp        0.696
  3C     1  SATLab          0.961    2  ssncse_nlp      0.793    3  ml_ai_iiitranchi  0.742

The performance of the system is excellent overall, confirming that a simple language-agnostic approach can be very effective in identifying offensive content and in determining whether it is targeted and what type of target it is aimed at. Another interpretation of this excellent performance is that the tasks proposed by the organizers are quite easy. Since I have no knowledge of Marathi, I was not able to analyze the most predictive features to try to understand where the effectiveness of the proposed system comes from.

A final observation worth mentioning is that the SATLab predictions contain long runs of identical labels when the test instances are ordered by their official Id, without this affecting effectiveness (macro-F1 > 0.91). Figure 1 illustrates this phenomenon: the predictions for the test material are laid out from Id 0 to Id 509 in rows of 50 instances (10 for the last row), and the five categories are distinguished by colors, as shown in the legend. To create this figure, I started from the predictions for subtask 3C (macro-F1 = 0.961) and supplemented them with the predictions for subtask 3B whenever no category had been assigned to an instance for subtask 3C.
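A minimal sketch of how such a label grid can be assembled is given below; the file names, the column names, the use of an IRR placeholder for instances left uncategorized in subtask 3C, and the assumption that the five displayed categories are IRR, UNT, IND, GRP and OTH are all mine, not taken from the submitted system.

```python
# Sketch of the label grid underlying Figure 1 (hypothetical file and column
# names; instances without a subtask 3C category are assumed to carry "IRR").
import pandas as pd

pred_3c = pd.read_csv("satlab_3c_test_predictions.csv").set_index("id")["label"]
pred_3b = pd.read_csv("satlab_3b_test_predictions.csv").set_index("id")["label"]

# Keep the 3C label when one was assigned, otherwise fall back to the 3B label.
merged = pred_3c.where(pred_3c.ne("IRR"), pred_3b)

# Lay the 510 predictions out by Id in rows of 50 (10 for the last row),
# one symbol per category, as a text analogue of the colored grid.
labels = merged.sort_index().tolist()
symbol = {"IRR": ".", "UNT": "u", "IND": "i", "GRP": "g", "OTH": "o"}
for start in range(0, len(labels), 50):
    print("".join(symbol.get(lab, "?") for lab in labels[start:start + 50]))
```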
Since the proposed system is highly effective, one can assume that almost all of these labels are correct, and therefore conclude that the labels were not randomly ordered in the test material.

Figure 1: Long runs of identical categories in the SATLab predictions when the instances are ordered by their official Id.

5. Conclusion

In the 2021 edition of HASOC, the SATLab ended seventh on Hindi task 1, second on Hindi task 2, and fourth on Marathi. This year, a simplified version of that system ended fifth on subtask 3A, second on subtask 3B and first, by a huge margin, on subtask 3C, all in Marathi. These results confirm that a very simple language-agnostic approach, based only on character n-grams and logistic regression, can be extremely effective (macro-F1 > 0.96 for a four-category classification problem) when the objective is the identification of hate speech and offensive content in Indo-Aryan languages. They also suggest that studies evaluating different approaches to this kind of problem, such as [12], should use a character n-gram based approach as a baseline. It should be noted, however, that this approach, which is also very effective in detecting hyperpartisan news articles [13], is much less effective in identifying passages of text that contain patronizing and condescending language [14].

Acknowledgments

The author is a Research Associate of the Fonds de la Recherche Scientifique - FNRS (Fédération Wallonie Bruxelles de Belgique).

References

[1] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandalia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: P. Majumder, M. Mitra, S. Gangopadhyay, P. Mehta (Eds.), FIRE '19: Forum for Information Retrieval Evaluation, Kolkata, India, December 2019, ACM, 2019, pp. 14–17. URL: https://doi.org/10.1145/3368567.3368584. doi:10.1145/3368567.3368584.
[2] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri, Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in English and Indo-Aryan languages and conversational hate speech, in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021, ACM, 2021.
[3] Y. Bestgen, A simple language-agnostic yet strong baseline system for hate speech and offensive content identification, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 1–10.
[4] S. Satapara, P. Majumder, T. Mandl, S. Modha, H. Madhu, T. Ranasinghe, M. Zampieri, K. North, D. Premasiri, Overview of the HASOC subtrack at FIRE 2022: Hate speech and offensive content identification in English and Indo-Aryan languages, in: FIRE 2022: Forum for Information Retrieval Evaluation, Virtual Event, 9th-13th December 2022, ACM, 2022.
[5] T. Ranasinghe, K. North, D. Premasiri, M. Zampieri, Overview of the HASOC subtrack at FIRE 2022: Offensive language identification in Marathi, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, CEUR, 2022.
[6] Y. Bestgen, Optimizing a supervised classifier for a difficult language identification problem, in: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2021, pp. 96–101.
[7] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009) 333–389.
[8] Y. Bestgen, Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets, in: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, 2017, pp. 115–123.
[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871–1874.
[10] Y. Bestgen, LAST at CMCL 2021 shared task: Predicting gaze data during reading with a gradient boosting decision tree approach, in: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Association for Computational Linguistics, Online, 2021, pp. 90–96.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
[12] M. Zampieri, T. Ranasinghe, M. Chaudhari, S. Gaikwad, P. Krishna, M. Nene, S. Paygude, Predicting the type and target of offensive social media posts in Marathi, Social Network Analysis and Mining 12 (2022) 77. URL: https://doi.org/10.1007/s13278-022-00906-8. doi:10.1007/s13278-022-00906-8.
[13] Y. Bestgen, Tintin at SemEval-2019 task 4: Detecting hyperpartisan news article with only simple tokens, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 1062–1066. URL: https://aclanthology.org/S19-2186. doi:10.18653/v1/S19-2186.
[14] Y. Bestgen, SATLab at SemEval-2022 task 4: Trying to detect patronizing and condescending language with only character and word n-grams, in: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, Seattle, United States, 2022, pp. 490–495. URL: https://aclanthology.org/2022.semeval-1.67. doi:10.18653/v1/2022.semeval-1.67.