A simple language-agnostic yet strong baseline system for hate speech and offensive content identification

Yves Bestgen (yves.bestgen@uclouvain.be), Laboratoire d'analyse statistique des textes - Statistical Analysis of Text Laboratory (LAST - SATLab), Université catholique de Louvain

10 place Cardinal Mercier, Louvain-la-Neuve 1348 Belgium

Forum for Information Retrieval Evaluation

December 13-17 2021 India

Keywords: character n-grams, logistic regression, gradient boosting decision tree, low-resource languages

For automatically identifying hate speech and offensive content in tweets, the SATLab team proposes a system based on a classical supervised algorithm fed only with character n-grams, and thus completely language-agnostic. After optimization of the feature weighting and the classifier parameters, it reached, in the multilingual HASOC 2021 challenge, a medium performance level in English, the language for which it is easy to develop deep learning approaches relying on many external linguistic resources, but a far better level for the two less-resourced languages, Hindi and Marathi. Among the teams that took part in at least two of the three tasks in these languages, it even ranked first when performances are averaged over these tasks. These performances suggest that it is an interesting reference level for evaluating the benefits of using more complex approaches such as deep learning or taking into account complementary resources.

Introduction

The diffusion of hate speech and offensive content in social networks has become a crucial problem. The tremendous number of posts broadcast at any given time prevents their identification by human evaluation. This task is made even more complex by the large number of languages in which this offensive content is spread. Not surprisingly, a lot of research is being done to develop automatic detection systems. As in many NLP domains, deep learning approaches and the use of pre-computed embeddings have proven to be the most efficient, even in languages with few resources [1,2]. However, traditional machine learning systems have sometimes proven to be very competitive [3,4]. One may thus wonder what level of performance can be achieved by a much simpler yet heavily optimized classical supervised approach, completely language-agnostic, based only on a few thousand examples to feed the supervised learner but without any additional resources. If this system is (relatively) successful, it would provide a computationally cheap baseline that could help evaluate the benefits of additional knowledge, complex architectures, deep learning or language expertise. The HASOC 2021 shared task "Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages" [5] is particularly relevant for developing such a system because it covers three languages. Among them, English is obviously the most studied language in automatic language processing and the one for which the largest number of resources is available. Hindi and, even more so, Marathi have been much less studied and are still classified as low-resource languages [6,7,8]. One can thus expect a priori that the approach proposed here will be much more competitive in these two languages.

The remainder of this paper presents the datasets made available for this shared task and the challenge rules, the system developed, and the results obtained, which confirm that the proposed approach is a strong language-agnostic baseline for hate speech and offensive content identification.

Materials and Task

The SATLab participated in subtask 1 of the HASOC 2021 shared task "Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages", which proposes two problems to be solved in three languages [5]. The first problem requires categorizing tweets into two categories: Hate and Offensive (HOF) or not (NOT). It is proposed for English, Hindi and Marathi. The second problem requires categorizing the same tweets into four categories, by dividing the Hate and Offensive category into three subcategories: Hate speech (HATE), Offensive (OFFN) and Profane (PRFN). It is offered for English and Hindi.

For each language, learning and test materials have been provided by the task organizers [9,8]. The frequencies (#) and percentages (%) in each category of each problem for each language are given in Table 1.

This table deserves several comments. First of all, the learning set is much smaller in Marathi (18% of the total) than in the other two languages, between which the difference is much smaller (37% of the total in English and 45% in Hindi). The proportion of tweets in the HOF category is much larger in English than in the other two languages. The difference clearly comes from the PRFN category, which is much more frequent in English (31.1%) than in Hindi, where it represents only a very small percentage (4.6%).

Challenge rules

The rules of the challenge allowed teams to use any additional resources, including materials from previous HASOC tasks, lexical norms such as emotional word lists, precomputed embeddings, syntactic parsers, or even machine translation systems to process texts from other languages in English. The system proposed by the SATLab does not include any of these additional resources.

The official measure chosen by the organizers to rank the teams in the challenge is the Macro-F1 which has the advantage of giving the same weight to all categories, however rare they may be (e.g., less than 5% of PRFN in Hindi).
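To make the equal weighting of categories concrete, here is a minimal sketch of Macro-F1 in plain Python. The function name `macro_f1` and the toy labels are illustrative assumptions, and the organizers' own scoring script may differ in details such as which label set it averages over.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of the per-category F1 scores."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was something else
            fn[g] += 1  # missed an instance of g
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy data (not from the challenge): a rare category with one miss.
gold = ["NONE"] * 8 + ["PRFN"] * 2
pred = ["NONE"] * 8 + ["NONE", "PRFN"]
score = macro_f1(gold, pred)
```

In this toy case nine of the ten predictions are correct, yet the single error on the rare PRFN category lowers that category's F1 to 2/3, and the macro average gives it the same weight as the frequent category.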

Each team was allowed to submit five runs for each subtask between August 20 and 30, 2021, and the team's best performance was displayed in the Leaderboard. Compared to the ten or so other shared tasks I participated in, it is important to underline that the submission system proposed by the challenge organizers (https://hasocfire.github.io/submission/login.html) was particularly ergonomic. Moreover, the fact that teams could not hide their best score, as is often the case in other systems, made the competition, in my opinion, fairer.

Proposed System

In order to meet the requirements presented in the introduction, the proposed system is based only on character n-grams [10], an approach frequently used in automatic language processing when the developed system has to support several languages. These n-grams were extracted from the lowercased tweets, with the sole particularity that n-grams starting or ending a tweet were distinguished from the others by the presence of a specific character. All character n-grams observed at least twice in the material were retained.
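The extraction step described above can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the boundary character (`BOUNDARY`), the function names, and the reading of "observed at least twice" as a total occurrence count (rather than a document count) are all assumptions.

```python
from collections import Counter

# Hypothetical boundary marker; the paper does not say which character was used.
BOUNDARY = "\u2581"

def char_ngrams(tweet, max_n=5, min_n=1):
    """Count character n-grams of a lowercased tweet, marking its start and end."""
    text = BOUNDARY + tweet.lower() + BOUNDARY
    grams = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if gram != BOUNDARY:  # a lone marker carries no information
                grams[gram] += 1
    return grams

def build_vocabulary(tweets, max_n=5, min_count=2):
    """Keep the n-grams seen at least min_count times over all tweets."""
    totals = Counter()
    for tweet in tweets:
        totals.update(char_ngrams(tweet, max_n))
    return {gram for gram, count in totals.items() if count >= min_count}

vocabulary = build_vocabulary(["Hi", "hi there"], max_n=2)
```

With these two toy tweets, the bigram made of the boundary marker followed by "h" is kept because both tweets start with "h", whereas "i" followed by the marker is dropped because only one tweet ends with "i".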

During the n-gram extraction, three parameters had to be set:

• The length of the n-grams in number of characters. The minimum length was systematically set to 1 while the maximum lengths evaluated varied between four and eight characters.

• The weighting applied to the frequency of each feature in each instance. Two well-established weighting schemes were evaluated:

- Sublinear TF-IDF:

$$\text{(sl)TF-IDF} = (1 + \log(tf)) \times \log\frac{N}{df} \tag{1}$$

where $tf$ refers to the frequency of the term in the document, $N$ is the number of documents in the set and $df$ the number of documents that include the term.

- BM25 [11,12], which is considered one of the most efficient weighting schemes [13]. It is a kind of TF-IDF that takes into account the length of the document. The following formula was used:

$$\text{BM25} = \frac{tf}{tf + k_1 * \left(1 - b + b * \frac{dl}{avg_{dl}}\right)} \times \log\frac{N - df + 0.5}{df + 0.5} \tag{2}$$

in which

* $\frac{tf}{tf + k_1}$ is the TF component which, contrary to the usual TF-IDF, has an asymptotic maximum tuned by the $k_1$ parameter.

* $\left(1 - b + b * \frac{dl}{avg_{dl}}\right)$, where $dl$ is the length of the document and $avg_{dl}$ the average length of the documents in the set, is the document length normalization factor whose impact is tuned by parameter $b$ (and by $k_1$).

* The second part of the formula is a variant of the usual IDF, proposed by Robertson and Spärck Jones [11].

In our analyses, $k_1$ was set to 2 and $b$ to 0.75.

• Normalization of the feature scores for each instance:

- The classical L2 normalization.

- A MinMax transformation:

$$\text{MinMax} = \frac{Feature_i\_score - min}{max - min} + 0.01 \tag{3}$$

It is important to note that this transformation is applied independently to each instance and not, as is often the case, to each feature. The value of 0.01 is added to distinguish the lowest scoring feature of an instance from the value of 0, which codes the absence of a feature.
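The weighting and normalization options above can be sketched in a few lines of Python. The function names and the dictionary-based feature representation are illustrative assumptions, as is the handling of an instance whose features all have the same score (a case the paper does not discuss).

```python
import math

def sublinear_tfidf(tf, df, n_docs):
    """Equation (1): (1 + log tf) * log(N / df)."""
    return (1.0 + math.log(tf)) * math.log(n_docs / df)

def bm25(tf, df, n_docs, dl, avg_dl, k1=2.0, b=0.75):
    """Equation (2), with the k1 and b values reported in the paper."""
    tf_part = tf / (tf + k1 * (1.0 - b + b * dl / avg_dl))
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

def l2_normalize(scores):
    """Scale one instance's feature scores to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {f: v / norm for f, v in scores.items()} if norm else dict(scores)

def minmax_per_instance(scores, offset=0.01):
    """Equation (3), applied to the features of a single instance."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        # Assumption: with identical scores, treat the fraction as 1.
        return {f: 1.0 + offset for f in scores}
    return {f: (v - lo) / span + offset for f, v in scores.items()}

weights = minmax_per_instance({"ab": 1.0, "bc": 3.0, "cd": 2.0})
```

After the MinMax step, the lowest scoring feature of the instance receives 0.01 rather than 0, so it remains distinguishable from an absent feature in a sparse representation. Note also that the IDF part of equation (2) becomes negative for an n-gram present in more than half of the documents.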

These character n-grams were the only features provided to the supervised learning procedure. Two well-established procedures were evaluated:

• The (dual) L2-regularized logistic regression as implemented in the LIBLinear package [14], an extremely fast approach that is very simple to use because it only requires the optimization of two parameters: the regularization parameter C and -wi, which adjusts this parameter C for the different categories. This approach was used for the initial submission to each of the five problems.

• A gradient boosting decision tree approach as implemented in the LightGBM free software [16]. It is much slower and more complex to optimize because it requires tuning many parameters, but it recently outperformed all deep-learning based systems participating in the CMCL 2021 shared task on predicting gaze data during reading [15]. This approach was used only in a second step.
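To illustrate the role of the two logistic regression parameters, the sketch below minimizes an L2-regularized logistic loss in which each instance's loss is multiplied by C times the weight of its category, which is how the -wi option is understood here. It uses plain gradient descent on dense toy vectors, not LIBLinear's dual coordinate descent solver, and all names, data and hyperparameter values are illustrative assumptions.

```python
import math

def train_weighted_logreg(X, y, C=1.0, class_weight=None, lr=0.1, epochs=500):
    """Minimize 0.5*||w||^2 + sum_i C_i * log(1 + exp(-y_i * w.x_i)),
    with C_i = C * class_weight[y_i]; labels y_i are +1 or -1."""
    class_weight = class_weight or {}
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(w)  # gradient of the 0.5 * ||w||^2 penalty
        for xi, yi in zip(X, y):
            c_i = C * class_weight.get(yi, 1.0)
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coeff = -c_i * yi / (1.0 + math.exp(margin))
            for j, xj in enumerate(xi):
                grad[j] += coeff * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Probability of the +1 category."""
    return 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))

# Toy data: one informative feature plus a constant bias feature.
X = [[2.0, 1.0], [1.5, 1.0], [-1.0, 1.0], [-2.0, 1.0]]
y = [1, 1, -1, -1]
w_plain = train_weighted_logreg(X, y)
w_boosted = train_weighted_logreg(X, y, class_weight={1: 5.0})
```

Raising the weight of the positive category makes errors on it more costly, so an ambiguous instance such as `[0.0, 1.0]` receives a higher positive probability under `w_boosted` than under `w_plain`. This is the kind of adjustment that per-category weights allow when categories are unbalanced.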

The system was independently optimized for each language during the learning phase using a 3-fold cross-validation procedure, whose folds were stratified according to the four categories of problem 2 for English and Hindi and the two categories of problem 1 for Marathi. This cross-validation step led to setting the parameters described above as shown in Table 2 for the initial SATLab submissions.
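A stratified split of this kind can be sketched as follows. The round-robin assignment, the function name and the random seed are assumptions made for illustration; the paper does not describe how the folds were actually drawn.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=3, seed=0):
    """Return a fold index per instance, keeping category proportions
    as equal as possible across folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled instances of this category out in turn.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

# Toy labels (not the challenge data).
labels = ["NONE"] * 9 + ["HATE"] * 6 + ["OFFN"] * 3 + ["PRFN"] * 3
folds = stratified_folds(labels)
```

Each fold then serves once as validation set while the other two are used for training, so that even a rare category such as PRFN is represented in every validation fold.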

Results

In this section, the performance of the initial system proposed by the SATLab and the various optimization attempts that have been made are first presented. Secondly, these performances are compared to those of other teams in order to determine whether the proposed approach is competitive enough to serve as a baseline for evaluating the benefits of using deep learning approaches and resources supplementary to those provided in the task itself.

SATLab submissions

Table 3 presents the performance of the main versions of the SATLab system submitted for the five problems and thus the benefits brought by the optimization attempts on the test set. The first row reports the performance of the original system for each problem during the cross-validation step. Logically, performance is lower for problems requiring the identification of more than two categories as well as when a category is particularly rare (Hindi-2). We also observe strong differences between the three languages. Since only one split into three folds was used, one can assume that these scores are, at least slightly, overestimated.

The second row shows the performance of the same versions on the test set and thus the initial submissions to the challenge. Four of the five scores are higher on the test set than during the cross-validation step; only the Marathi score is marginally lower (0.8547 vs. 0.8565).

As it was allowed to submit five runs for each problem, I first tried to optimize the classifier based on logistic regression by modifying very slightly the two LIBLinear parameters (i.e., C and -wi). These attempts brought a (very) slight benefit for two of the five problems as shown in the third row of Table 3.

In a second step, a LightGBM classifier was trained for each of the five problems, using a random grid search procedure to optimize its parameters. As shown in the fourth row of Table 3, this step resulted in a stronger performance improvement in two problems: English-1 and Marathi. For the other three problems, LightGBM did not improve on the performance of LIBLinear. The selected parameters for the two successful problems are given in Appendix 1. The number of boosting iterations was determined during cross-validation by using the LightGBM early stopping procedure, which stops training when the performance on the validation fold has not improved in the last 200 rounds. The final system values on the test set for the five problems are bolded in the table. The run names of these solutions in the official leaderboard are respectively: English 1b, English 2, Hindi 1, Hindi B S4 and Marathi 3.
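The early stopping rule mentioned above can be illustrated independently of LightGBM. The sketch below is a simplified stand-in for the library's own procedure: it takes a hypothetical list of validation scores, one per boosting round, assumes that higher is better, and returns the round that would be retained.

```python
def best_iteration(validation_scores, patience=200):
    """Return (round, score) of the best round seen before training
    would stop for lack of improvement during `patience` rounds."""
    best_score = float("-inf")
    best_round = 0
    for round_index, score in enumerate(validation_scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_index
        elif round_index - best_round >= patience:
            break  # no improvement in the last `patience` rounds
    return best_round, best_score

# Toy validation curve with a short patience, for illustration only.
curve = [0.50, 0.60, 0.70, 0.65, 0.66, 0.64, 0.90]
stopped_early = best_iteration(curve, patience=3)
ran_to_end = best_iteration(curve, patience=200)
```

With a patience of 3, training stops at round 6 and round 3 is retained, so the later peak at round 7 is never reached. A large patience such as the 200 rounds used in the paper reduces the risk of stopping on a temporary plateau, at the cost of additional training time.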

Benchmarking the approach

The main objective of SATLab's participation in HASOC 2021 was to propose a competitive system relying only on the training data and employing only classical supervised learning procedures. To determine whether this goal was achieved, Tables 4-6 compare the performance of the approach to that of the other participating systems.

Table 4 shows, for each of the five problems, the number of teams that participated, the scores of the top three teams, the score of the best SATLab version, and the scores of the two teams ranked immediately above and below it. As can be seen, it is clearly in the two less-resourced languages, Hindi and Marathi, that the performance of the approach is among the best: it is even second, very close to the first (and the third), in the Hindi-2 problem. In English, on the other hand, the system ranks in the middle of the pack, 0.048 (English-1) and 0.054 (English-2) below the score of the best team.

The difference in performance between English and the other two languages is particularly evident in Tables 5 and 6, which present the average scores of the teams for the five problems (Table 5), and for the two problems in English and the three problems in the less-resourced languages (Table 6). Before calculating these averages, the scores for each problem were divided by the maximum score obtained for the problem in question. This transformation¹ gives an equivalent weight to the scores of all problems. It is then possible to present in the same table, without distorting the results, all the teams, whatever the number of problems they participated in. Without this transformation, the teams that participated in the most difficult tasks would be penalized compared to those that did not. In these tables, the number of problems each team participated in is given by the column in which the score is found and the total number of teams that participated in a given number of problems is presented in the last row.
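As a worked check of this transformation, the sketch below recomputes the SATLab averages from the Macro-F1 values reported in Table 4 (best team and SATLab for each problem). The dictionary and function names are illustrative.

```python
# Macro-F1 of the best team and of SATLab for each problem (Table 4).
BEST = {"English-1": 0.8305, "English-2": 0.6657,
        "Hindi-1": 0.7825, "Hindi-2": 0.5603, "Marathi": 0.9144}
SATLAB = {"English-1": 0.7823, "English-2": 0.6114,
          "Hindi-1": 0.7718, "Hindi-2": 0.5586, "Marathi": 0.8749}

def transformed_average(team_scores, best_scores, problems=None):
    """Average of score / best score over the problems a team entered."""
    problems = problems or list(team_scores)
    ratios = [team_scores[p] / best_scores[p] for p in problems]
    return sum(ratios) / len(ratios)

overall = transformed_average(SATLAB, BEST)
english = transformed_average(SATLAB, BEST, ["English-1", "English-2"])
low_resource = transformed_average(SATLAB, BEST, ["Hindi-1", "Hindi-2", "Marathi"])
```

Rounded to four decimals, these three values are 0.9601, 0.9302 and 0.9800, which match the SATLab entries of Tables 5 and 6.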

In terms of the overall average (Table 5), SATLab ranks sixth overall and third among the 16 teams that participated in the five tasks. In English (Table 6), on the other hand, it ranks only 20th. In the two less-resourced languages (Table 6), it is second, exceeded only by a team that participated in only one of the five tasks.

Conclusion

A system based exclusively on the character n-grams present in the posts to be categorized, employing no additional linguistic resources and thus completely language-agnostic, is proposed to automatically identify hate speech and offensive content in social network posts. It relies on traditional machine learning procedures such as logistic regression. Used in the HASOC 2021 challenge [5], it reached a medium performance level in English, the language for which it was easy to develop deep learning approaches relying on many external linguistic resources. Its performance, averaged over the two Hindi problems and the Marathi problem, ranks it in first place among the teams that proposed systems for at least two of these problems. These performances suggest that it is an interesting reference level for evaluating the benefits of the more complex approaches that are frequently used to address this type of task, such as deep learning or taking into account complementary resources [1,2,9]. However, it is essential to note that the proposed system never ranked first in any specific task. It is therefore clearly not the best performing system for any of the five tasks.

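Table 1: Dataset statistics of subtask 1 (Problem 1 and Problem 2 columns refer to the learning phase; totals are given for the learning and test phases)

| Language | | NOT | HOF | NONE | HATE | OFFN | PRFN | Learning total | Test total |
|---|---|---|---|---|---|---|---|---|---|
| English | # | 1342 | 2501 | 1342 | 683 | 622 | 1196 | 3843 | 1281 |
| | % | 34.9 | 65.1 | 34.9 | 17.8 | 16.2 | 31.1 | 75.0 | 25.0 |
| Hindi | # | 3161 | 1433 | 3161 | 566 | 654 | 213 | 4594 | 1532 |
| | % | 68.8 | 31.2 | 68.8 | 12.3 | 14.2 | 4.6 | 75.0 | 25.0 |
| Marathi | # | 1205 | 669 | | | | | 1874 | 525 |
| | % | 64.3 | 35.7 | | | | | 78.1 | 21.9 |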

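Table 2: Parameters for the initial submissions

| Parameter | English-1 | English-2 | Hindi-1 | Hindi-2 | Marathi-1 |
|---|---|---|---|---|---|
| N-gram length | 5 | 5 | 5 | 5 | 5 |
| Weighting | TF-IDF | TF-IDF | BM25 | TF-IDF | BM25 |
| Normalization | MinMax | L2 | L2 | MinMax | L2 |
| C | 1.1 | 2.5 | 3.7 | 0.08 | 36 |
| w_HOF | 0.5 | | 2.2 | | 2 |
| w_HATE | | 2.0 | | 1.87 | |
| w_OFFN | | 3.0 | | 0.93 | |
| w_PRFN | | 0.8 | | 5.60 | |

Table 3: Macro-F1 during cross-validation and on the test set

| Version | English-1 | English-2 | Hindi-1 | Hindi-2 | Marathi-1 |
|---|---|---|---|---|---|
| CV | 0.7483 | 0.5876 | 0.7551 | 0.5133 | 0.8565 |
| Initial | 0.7635 | 0.6114 | 0.7718 | 0.5563 | 0.8547 |
| Best LR | | | | 0.5586 | 0.8595 |
| Best LGBM | 0.7823 | | | | 0.8749 |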

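Table 4: Macro-F1 on the test set for the five problems (N = number of participating teams)

| Problem | Rank | Team | Macro-F1 |
|---|---|---|---|
| English-1 (N=56) | 1 | NLP-CIC | 0.8305 |
| | 2 | HUNLP | 0.8215 |
| | 3 | neuro-utmn-thales | 0.8199 |
| | ... | | |
| | 22 | hate-busters | 0.7894 |
| | 23 | SATLab | 0.7823 |
| | 24 | TAD | 0.7776 |
| | ... | | |
| English-2 (N=37) | 1 | NLP-CIC | 0.6657 |
| | 2 | neuro-utmn-thales | 0.6577 |
| | 3 | HASOC21rub | 0.6482 |
| | ... | | |
| | 15 | KuiYongyi | 0.6116 |
| | 16 | SATLab | 0.6114 |
| | 17 | hate-busters | 0.6096 |
| | ... | | |
| Hindi-1 (N=34) | 1 | t1 | 0.7825 |
| | 2 | Super Mario | 0.7797 |
| | 3 | Hasnuhana | 0.7797 |
| | ... | | |
| | 6 | KuiYongyi | 0.7725 |
| | 7 | SATLab | 0.7718 |
| | 8 | neuro-utmn-thales | 0.7682 |
| | ... | | |
| Hindi-2 (N=24) | 1 | NeuralSpace | 0.5603 |
| | 2 | SATLab | 0.5586 |
| | 3 | hate-busters | 0.5582 |
| | ... | | |
| Marathi (N=25) | 1 | WLV-RIT | 0.9144 |
| | 2 | neuro-utmn-thales | 0.8808 |
| | 3 | Hasnuhana | 0.8756 |
| | 4 | SATLab | 0.8749 |
| | 5 | PreCog IIIT | 0.8734 |
| | ... | | |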

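Table 5: Transformed Macro-F1 for the five problems

| Rank | Team | Transformed Macro-F1 |
|---|---|---|
| 1 | WLV-RIT | 1.0000 |
| 2 | NLP-CIC | 0.9814 |
| 3 | neuro-utmn-thales | 0.9800 |
| 4 | HASOC21rub | 0.9693 |
| 5 | NeuralSpace | 0.9666 |
| 6 | SATLab | 0.9601 |
| 7 | KuiYongyi | 0.9596 |
| 8 | CAROLL Passau | 0.9590 |
| 9 | IMS-SINAI | 0.9569 |
| 10 | hate-busters | 0.9517 |
| ... | | |

Number of teams by number of problems the team participated in: 5 problems: 16; 4 problems: 8; 3 problems: 6; 2 problems: 18; 1 problem: 15.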

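Table 6: Transformed Macro-F1 for the two English problems and for the Hindi and Marathi problems

English 1 & 2

| Rank | Team | Transformed Macro-F1 |
|---|---|---|
| 1 | NLP-CIC | 1.0000 |
| 2 | neuro-utmn-thales | 0.9876 |
| 3 | HASOC21rub | 0.9693 |
| 4 | HUNLP | 0.9675 |
| 5 | HNLP | 0.9674 |
| ... | | |
| 19 | hate-busters | 0.9331 |
| 20 | SATLab | 0.9302 |
| 21 | TeamOulu | 0.9272 |
| ... | | |

Number of teams by number of English problems the team participated in: 2 problems: 38; 1 problem: 25.

Hindi 1 & 2 and Marathi

| Rank | Team | Transformed Macro-F1 |
|---|---|---|
| 1 | WLV-RIT | 1.0000 |
| 2 | SATLab | 0.9800 |
| 3 | NeuralSpace | 0.9787 |
| 4 | neuro-utmn-thales | 0.9725 |
| 5 | KuiYongyi | 0.9707 |
| 6 | NLP-CIC | 0.9690 |
| 7 | hate-busters | 0.9640 |
| 8 | CAROLL Passau | 0.9590 |
| 9 | BIU | 0.9484 |
| ... | | |

Number of teams by number of Hindi and Marathi problems the team participated in: 3 problems: 17; 2 problems: 13; 1 problem: 8.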

¹ This transformation of the scores assumes that the minimum score in each task is the same, 0, and that therefore no correction should be made at this level. This seems to me justified by the fact that, even if it is unlikely, a system can be wrong on all instances, but also and especially because it is the deviation from the maximum score that is important.

Acknowledgments

The author wishes to thank the organizers of this shared task for putting together this valuable event and the reviewers for their very constructive comments. He is a Research Associate of the Fonds de la Recherche Scientifique -FNRS (Fédération Wallonie Bruxelles de Belgique). Computational resources have been provided by the supercomputing facilities of the Université Catholique de Louvain (CISM/UCL) and the Consortium des Equipements de Calcul Intensif en Fédération Wallonie Bruxelles (CECI).

References

[1] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandalia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: P. Majumder, M. Mitra, S. Gangopadhyay, P. Mehta (Eds.), FIRE '19: Forum for Information Retrieval Evaluation, Kolkata, India, December, 2019, ACM, 2019. doi:10.1145/3368567.3368584.

[2] T. Mandl, S. Modha, A. K. M, B. R. Chakravarthi, Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: P. Majumder, M. Mitra, S. Gangopadhyay, P. Mehta (Eds.), FIRE 2020: Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, ACM, 2020. doi:10.1145/3441501.3441517.

[3] V. Mujadia, P. Mishra, D. M. Sharma, IIIT-Hyderabad at HASOC 2019: Hate speech detection, in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, 2019.

[4] A. Saroj, R. K. Mundotiya, S., IRLab@IITBHU at HASOC 2019: Traditional machine learning for hate speech and offensive content identification, in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, 2019.

[5] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri, Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, ACM, December 2021.

[6] R. Haffari, C. Cherry, G. Foster, S. Khadivi, B. Salehi (Eds.), Proceedings of the Workshop on Deep Learning Approaches for Low-Resource NLP, Association for Computational Linguistics, Melbourne, 2018.

[7] J. Ortega, A. K. Ojha, K. Kann, C.-H. Liu, Proceedings of the 4th workshop on technologies for machine translation of low-resource languages: Introduction, in: Proceedings of Machine Translation Summit XVIII, 2021, pp. I-VI.

[8] S. Gaikwad, T. Ranasinghe, M. Zampieri, C. M. Homan, Cross-lingual offensive language identification for low resource languages: The case of Marathi, in: Proceedings of RANLP, 2021.

[9] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Satapara, P. Majumder, J. Schäfer, T. Ranasinghe, M. Zampieri, D. Nandini, A. K., Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, 2021.

[10] Y. Bestgen, Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets, in: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, 2017.

[11] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009).

[12] Y. Bestgen, Optimizing a supervised classifier for a difficult language identification problem, in: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 2021.

[13] C. D. Manning, P. Raghavan, H. Schütze, An Introduction to Information Retrieval, Cambridge University Press, 2008.

[14] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008).

[15] Y. Bestgen, LAST at CMCL 2021 shared task: Predicting gaze data during reading with a gradient boosting decision tree approach, in: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Association for Computational Linguistics, 2021. doi:10.18653/v1/2021.cmcl-1.10.

[16] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017.

Appendix 1

LightGBM parameters selected for the two problems in which this classifier improved performance (the two parameter lists are only partially preserved):

- 'num_leaves': 14, 'learning_rate': 0.0095, 'min_data_in_leaf': 6, 'max_depth': 10, 'feature_fraction': 0.78875 [...]

- [...] 'min_data_in_leaf': 3, 'max_depth': 11, 'feature_fraction': 0.12125, 'bagging_freq': 4, 'bagging_fraction': 0.8, 'metric': 'binary', 'objective': 'binary', 'is_unbalance': 'false'.

The threshold used to decide that an instance belongs to the HOF class was set at 0.35.