Overview of the HASOC Subtrack at FIRE 2023: Identification of Conversational Hate-Speech

Hiren Madhu¹, Shrey Satapara², Pavan Pandya³, Nisarg Shah³, Thomas Mandl⁵ and Sandip Modha⁶

¹ Indian Institute of Science, Bangalore, India
² Indian Institute of Technology, Hyderabad, India
³ Indiana University Bloomington, USA
⁵ University of Hildesheim, Germany
⁶ LDRP-ITR, Gandhinagar, India

Abstract
Identifying hate speech based on context is a requirement for real-world content moderation systems. In research, however, the definition and use of context for hate speech recognition has seen a variety of approaches. The 2023 task "Identification of Conversational Hate Speech" provides a further dataset for hate speech detection that includes context. The data was collected from Twitter (called X since 2023) and includes tweets as well as comments and replies to such tweets or comments. This paper reports on the dataset, the experiments, and the results. Six teams submitted results for the binary classification task, and the best submission reached an F1 measure of 0.8. Five runs were submitted for the second task. We also present a baseline that uses the unlabelled data to obtain its predictions.

Keywords: Hate Speech, NLP, Social Media, Language Resource, Deep Learning, Text Classification, Evaluation, Benchmark, Context

Forum for Information Retrieval Evaluation, December 15-18, 2023, Goa, India
hirenmadhu16@gmail.com (H. Madhu); shreysatapara@gmail.com (S. Satapara); pavanpandya1311@gmail.com (P. Pandya); nisarg0606@gmail.com (N. Shah); mandl@uni-hildesheim.de (T. Mandl); sjmodha@gmail.com (S. Modha)
ORCID: 0000-0002-6701-6782 (H. Madhu); 0000-0001-6222-1288 (S. Satapara); 0000-0002-8398-9699 (T. Mandl); 0000-0003-2427-2433 (S. Modha)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Hate speech and offensive language, which include hurtful, insulting, or derogatory remarks exchanged between individuals, are commonly observed on social media platforms such as Facebook, Twitter, and Reddit. The abundance of such content on these platforms fosters offline hate crimes and fuels disorderly actions against communities or political groups, driven by agendas such as racism, misogyny, anti-LGBTQI+ sentiment, anti-Muslim sentiment, anti-government agitation, and other extremist ideologies [1]. To combat these hate crimes, the European Union (EU) and other European nations have implemented laws that classify online hate speech as a criminal offense, leading to the conviction of many individuals involved in such online activities. In contrast, the United States (US) primarily addresses hate speech through non-legal means in order to safeguard the principle of free speech. While freedom of speech is crucial, a recent study [2] shows that Elon Musk's takeover of Twitter and his changes to its content moderation policies have increased hate speech on the platform. Consequently, numerous environmentally conscious users have become inactive on the platform, resulting in a decline in the quality of discourse. In scenarios like this, freedom of speech acts as a double-edged sword. Open societies therefore need to find ways of maintaining civil discourse without resorting to totalitarian control, a balance that the new Digital Services Act (DSA) arguably fails to strike.
The DSA operates on a "delete first, think later" basis, which removes user-generated content excessively and undermines freedom of expression [3]. Content moderation aims to strike a balance between these objectives. However, moderating content requires large numbers of human annotators, which makes manual moderation impractical at scale. This situation has driven research efforts toward automatic systems for identifying harmful online content. Text classification is only one of the components needed to meet legal and practical demands [4], but it is a crucial one.

Most existing research on identifying hate speech or offensive content concentrates on analyzing the text of individual posts. Frequently, however, offensive or hateful content is concealed within a conversation thread and is not immediately evident in isolated comments or replies. Such hate speech can be uncovered by examining the original content together with the context in which it was posted. Additionally, social media content often spans multiple languages, including code-mixed languages like Hinglish (Hindi written in Latin script instead of Devanagari script). This makes it crucial for social media platforms to detect and remove such content before it reaches a wider audience. In the two previous editions of the Identification of Conversational Hate Speech in Code-mixed Languages (ICHCL) track [5, 6], datasets that capture such conversational hate speech were released. The first edition featured binary labels to distinguish between hateful and regular tweets, while the second edition introduced a multiclass task, further categorizing hateful tweets into two subtypes: standalone and contextual hate speech. In this paper, we provide an overview of the third edition of ICHCL, which is centered on promoting the development of semi-supervised algorithms for classifying hateful text. Detailed information regarding the task and dataset is given in Section 3.

2. Related Work

Because a tweet is typically part of a larger discourse and a conversation among certain people, it is frequently difficult to understand it on its own. So far, only a few text classification experiments and datasets have taken context into account, and context has been modeled in different ways. An early approach used LDA and RNNs. Recursive neural networks were used to capture context within sentences [7], but less so for capturing relations between subsequent messages in social media. LSTMs were used in an approach by Gao and Huang [8]. Their dataset is based on comments in discussion threads on news articles and contains 1,500 comments; the context is given by the content of the news articles [8].

The shared task RumourEval reacts to the need to consider evolving conversations and news updates for rumors and to check their veracity [9]. The organizers provided a dataset of misinformation posts and conversations about those posts. The best performing system [10] used word2vec combined with several other dimensions, such as source content analysis, source account credibility, reply account credibility, and the stance of the source message, among others.

One dataset was labeled twice by crowd workers: one group was provided with context and the other was not [11]. The data is extracted from Wikipedia talk pages, and context was given by the parent message and the title of the discussion thread [11]. It needs to be pointed out that the parent message might not be the last message preceding the message to be annotated.
A further dataset was extended with context information for the concept of abusiveness [12]. This data was collected based on an existing dataset without contextual information. For all tweets, the text was used to search for them and, where they were found, the authors tried to extract the previous messages. For all tweets for which this was successful, the preceding messages were downloaded as context. Such a process leads to greatly varying context sizes between items: around 45% of the hateful tweets had one preceding tweet as context and another 45% had between 2 and 5 preceding tweets. Applying this methodology, almost half of the tweets that had been annotated as abusive were labelled as non-abusive once context was available [12].

In a study with 10,000 YouTube comments, annotation quality was measured in terms of inter-rater agreement; context improved the metrics by less than 5% in absolute terms [13]. In a study with Reddit posts, 27,000 posts were annotated [14]. Context was given by providing the entire thread to the annotators. However, diverse uses of context for annotation were reported. Inter-rater agreement was quite low, but experiments showed an overall trend toward improvement when context was modeled [14]. Another dataset based on 6,800 Reddit posts including the context of one preceding comment was also created [15]. A crowd worker annotation process showed low agreement, and low-quality annotations were disregarded. Showing the previous post changed the judgment for over 30% of the items of the Hate class. The best classification results reach F1 scores of 0.7 [15].

Within HASOC, datasets were collected in two previous editions of ICHCL and experiments were carried out [16, 6]. The diversity of the approaches, data sources, and context definitions shows that further experiments are required.

3. ICHCL Task Overview and Dataset

A conversational thread might contain hateful, offensive, or profane language. This kind of content may not be immediately noticeable within individual tweets, comments, or responses to such tweets or comments. Nevertheless, it is possible to detect such hate speech by examining the original content and the context in which it was posted. For two editions of ICHCL, we have focused on detecting such hateful content in conversations. This year, we introduce a variation of the last two editions. We describe the task in the following section and discuss the details of the ICHCL 2023 dataset in the subsequent section.

3.1. Task Overview

Training supervised models for classifying code-mixed text presents substantial challenges due to the limited availability of labeled data and the high cost of annotating large datasets. Semi-supervised learning methods can alleviate these challenges by leveraging unlabeled data to improve model accuracy and reduce the need for extensive labeled data. As a result, the ICHCL task was developed further. Participants received an unlabeled training dataset and a labeled test dataset containing around 1,000 code-mixed Hindi samples. A crucial requirement was that participants had to utilize the new unlabeled data to make predictions on the test dataset. The classification task was divided into two subtasks (an illustrative data-handling sketch follows the list):

• Task 2a: This subtask focuses on the binary classification of conversational tweets with tree-structured data into:
  – (NOT) Non-Hate-Offensive: This tweet, comment, or reply does not contain any hate speech or offensive content.
  – (HOF) Hate and Offensive: This tweet, comment, or reply contains hate speech, offensive, or profane content either on its own or in support of hate expressed in the parent tweet.
• Task 2b: This subtask is centered on classifying conversational tweets with tree-structured data into specific forms of hate, as follows:
  – (SHOF) Standalone Hate: This tweet, comment, or reply contains hate speech, offensive, or profane content on its own.
  – (CHOF) Contextual Hate: This comment or reply supports the hate, offense, or profanity expressed in its parent, for example by affirming the hate with positive sentiment, without expressing apparent hate of its own.
  – (NONE) Non-Hate: This tweet, comment, or reply does not contain any hate speech or offensive or profane content.
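Both subtasks operate on tree-structured conversations (a tweet, its comments, and their replies), so every item has to be classified together with its ancestors. The sketch below illustrates one straightforward way to flatten such a thread into context-augmented samples. It is illustrative only: the nested dictionary layout and the field names (text, label, comments, replies) are assumptions made for this example, not the official format of the released ICHCL data.

```python
# Illustrative sketch: flatten a conversation tree into (context + text, label) samples.
# The nested dict layout and field names are assumptions for this example,
# not necessarily the format of the released ICHCL data.
from typing import Iterator, Tuple

def flatten_thread(post: dict) -> Iterator[Tuple[str, str]]:
    """Yield one (context-augmented text, label) pair per node of a thread."""
    # The parent tweet is judged on its own text.
    yield post["text"], post.get("label", "unlabeled")
    for comment in post.get("comments", []):
        # A comment is judged together with its parent tweet.
        yield post["text"] + " [SEP] " + comment["text"], comment.get("label", "unlabeled")
        for reply in comment.get("replies", []):
            # A reply sees its whole branch: tweet, comment, and reply.
            yield (post["text"] + " [SEP] " + comment["text"] + " [SEP] " + reply["text"],
                   reply.get("label", "unlabeled"))

if __name__ == "__main__":
    thread = {
        "text": "parent tweet ...",
        "comments": [{"text": "supportive comment ...", "label": "CHOF",
                      "replies": [{"text": "hateful reply ...", "label": "SHOF"}]}],
    }
    for text, label in flatten_thread(thread):
        print(f"{label}: {text}")
```

Concatenating a node with its ancestors in this way mirrors how several of the submitted systems incorporate context (see Section 5.2).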
This edition addresses the scarcity of labeled data and reduces annotation costs by providing only unlabeled training data to participants. To demonstrate compliance with the requirement to use this unlabeled data, participants also had to submit a link to a GitHub repository containing their code. Furthermore, to ensure fairness and equal opportunities for all participants, we imposed a condition restricting submissions to transformers with fewer than 200M parameters, preventing groups with extensive computational resources from gaining an unfair advantage.

3.2. Dataset

This section provides an overview of how we gathered the dataset and presents its statistics. To obtain a sample of tweets covering a wide range of topics, we selected controversial news stories. We specifically handpicked contentious stories that were highly likely to attract hateful, offensive, or profane comments. These stories were drawn from the following categories:

• Brahmin Controversy in JNU
• Corruption
• Hinduphobia
• Kali smoking controversy
• Karnataka Election
• Kerala stories
• Modi clean chit
• Nupur Sharma
• Pakistan World Cup loss
• Udaipur murder
• Udhav Thakre government
• Zubair arrest

The participants were encouraged to use the ICHCL 2021 and 2022 datasets as labeled data. Table 1 presents the statistics of the 2021 and 2022 datasets together with the statistics for the 2023 test data and the unlabeled training data. Table 2 presents the inter-annotator agreement for each level of the conversation.

Table 1: Dataset statistics for the ICHCL dataset (the 2021 data uses the label HOF; the 2022 and 2023 data use SHOF)

                      Twitter Posts         Comments on Posts           Replies on Comments
                      HOF/SHOF   NONE       HOF/SHOF   CHOF   NONE      HOF/SHOF   CHOF   NONE
Train (2021) [5]      49         33         1820       -      1958      972        -      908
Train (2022) [6]      75         97         588        171    1166      973        717    1127
Test (2023)           1          5          141        68     523       112        79     69
Total                 125        135        2549       239    3647      2057       736    2104
Unlabeled (2023)      26 total              3928 total                  4571 total

Table 2: Inter-annotator agreement (IAA) for Task 2

Type       IAA after two annotation rounds    IAA after three annotation rounds
Main       0.800                              1.000
Comment    0.85381                            0.80276
Replies    0.93961                            0.90243

4. Results

In this edition of ICHCL, an initiative was put in place to encourage young researchers to develop innovative solutions: we introduced a semi-supervised version of the task to broaden the range of approaches used. Unfortunately, no submissions utilized semi-supervised methods, highlighting a lack of interest in this part of the task. Nevertheless, we present the results and approaches of the participants. As shown in Table 3, for Task 2A, which focuses on binary classification, FiRC-NLP secured the top position with their submission "parfirst2_all_folds", achieving an F1 score of 0.8079. They were closely followed by IRLab@IITBHU, Chetona, and AiAlchemists, who demonstrated competitive results in precision and recall. In Task 2B, a multi-class classification task, FiRC-NLP again led with their submission "parfirst_top3_top7_task2b", achieving a macro F1 score of 0.6541, as presented in Table 4. IRLab@IITBHU and AiAlchemists also showed notable performance. The baseline submissions by HASOC ranked lower in both tasks, underlining the competitiveness of the shared task.

Table 3: ICHCL Task 2A results

Rank  Team Name            Submission Name              F1        Precision  Recall
1     FiRC-NLP [17]        parfirst2_all_folds          0.80791   0.80844    0.80741
2     IRLab@IITBHU [18]    IRLab@IITBHU_Task2A_1        0.70079   0.70255    0.69949
3     Chetona [19]         chetona-2a-def2              0.61551   0.62525    0.61425
4     AiAlchemists [20]    task2_binary_test_pred_2     0.61466   0.63351    0.60820
5     MUCS_3 [21]          MUCs_run_2                   0.43474   0.38456    0.500
6     HASOC                BASELINE                     0.37429   0.29909    0.500

Table 4: ICHCL Task 2B results

Rank  Team Name            Submission Name              Macro F1  Precision  Recall
1     FiRC-NLP [17]        parfirst_top3_top7_task2b    0.65414   0.64334    0.67178
2     IRLab@IITBHU [18]    IRLab@IITBHU_Task_2B_1       0.56316   0.56872    0.56685
3     AiAlchemists [20]    task_multiclass_1            0.38243   0.39198    0.39212
4     HASOC                BASELINE (Multiclass)        0.24952   0.19939    0.33333
5     Chetona [19]         chetona_2b_def2              0.17263   0.20795    0.15883
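Submissions were ranked by F1 together with precision and recall, macro-averaged for Task 2B. The official scoring script is not reproduced here; the following is a minimal sketch, assuming the standard macro-averaged definitions as implemented in scikit-learn, applied to toy labels.

```python
# Minimal scoring sketch (an assumption, not the official evaluation script):
# macro-averaged F1, precision, and recall over toy Task 2B labels.
from sklearn.metrics import f1_score, precision_score, recall_score

gold = ["NONE", "SHOF", "CHOF", "NONE", "SHOF"]   # toy gold labels
pred = ["NONE", "SHOF", "NONE", "NONE", "CHOF"]   # toy system predictions

print("Macro F1:       ", f1_score(gold, pred, average="macro"))
print("Macro precision:", precision_score(gold, pred, average="macro", zero_division=0))
print("Macro recall:   ", recall_score(gold, pred, average="macro"))
```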
5. Methodology

In this section, we first explain the baseline we provided to the participants and then discuss the methodology of the top-ranked teams.

5.1. Baseline Model

In order to keep the threshold for entering the shared task low, a baseline model was provided to the participants. It included a template for steps such as importing data, preprocessing, feature extraction, and classification. The participating teams could modify the code and experiment with various settings. This year, we use a semi-supervised baseline, specifically pseudo-labeling. First, we fine-tune a bert-base-multilingual model on the labeled part of the dataset (the 2021 and 2022 datasets). We then predict labels for the unlabeled training set (the 2023 training data) and fine-tune the model again on the entire dataset (the 2021 and 2022 datasets plus the 2023 data with the predicted labels).
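A condensed sketch of this pseudo-labeling loop is given below. It follows the description above but is not the distributed baseline notebook: the exact checkpoint (bert-base-multilingual-cased), the hyperparameters, and the tiny in-line datasets are assumptions for illustration.

```python
# Sketch of the pseudo-labeling baseline (illustrative assumptions: checkpoint name,
# hyperparameters, and the toy in-line data; not the distributed notebook itself).
import numpy as np
from datasets import Dataset, concatenate_datasets
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-multilingual-cased"  # multilingual BERT, below the 200M-parameter limit
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def tokenize(batch):
    # Conversation items are assumed to be pre-flattened into single strings.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

def fine_tune(train_ds, output_dir):
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds.map(tokenize, batched=True))
    trainer.train()
    return trainer

# Toy stand-ins for the labeled 2021/2022 data and the unlabeled 2023 training data.
labeled = Dataset.from_dict({"text": ["tweet [SEP] hateful reply", "tweet [SEP] benign reply"],
                             "label": [1, 0]})
unlabeled = Dataset.from_dict({"text": ["tweet [SEP] new reply", "tweet [SEP] another reply"]})

# Step 1: fine-tune on the labeled data only.
trainer = fine_tune(labeled, "baseline_stage1")

# Step 2: predict pseudo-labels for the unlabeled training data.
logits = trainer.predict(unlabeled.map(tokenize, batched=True)).predictions
pseudo = unlabeled.add_column("label", np.argmax(logits, axis=-1).tolist())

# Step 3: fine-tune again on labeled + pseudo-labeled data; this model scores the test set.
trainer = fine_tune(concatenate_datasets([labeled, pseudo]), "baseline_stage2")
```

The second-stage model is then used to produce the baseline predictions on the test set.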
5.2. Participant approaches

In this section, we explain and summarise the most successful participant approaches:

• FiRC-NLP: The system uses concatenation to incorporate context and fine-tunes XLM-RoBERTa-large for binary classification. For the multiclass task, the team first applies the same binary classifier to separate hate from non-hate, and then fine-tunes another language model to classify hate as standalone or contextual [17] (a sketch of this two-stage setup follows the list).
• IRLab@IITBHU: The submission uses a contrastive loss function to fine-tune a vanilla mBERT model, which is then used to obtain features for each individual level of the conversation. These features, together with sentence-BERT features, are passed through a two-layer LSTM model to incorporate the context [18].
• Chetona: The submission concatenates the different levels of the given conversational thread. It then applies IndicBERT to encode the text and classifies based on the training data [19].
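The following is a sketch of the two-stage setup described for FiRC-NLP. It is our reading of the write-up, not their released code: the checkpoint paths are placeholders, and the label names assume the models were saved with id2label mappings of {NOT, HOF} and {SHOF, CHOF}.

```python
# Sketch of a two-stage classifier: a binary hate/non-hate model followed by a
# second model that splits hateful items into standalone (SHOF) vs. contextual (CHOF).
# Checkpoint paths are placeholders; label names depend on how the models were saved.
from transformers import pipeline

binary_clf = pipeline("text-classification", model="path/to/binary-hof-model")
subtype_clf = pipeline("text-classification", model="path/to/shof-chof-model")

def classify(context_and_text: str) -> str:
    """Return NONE, SHOF, or CHOF for a context-concatenated conversation item."""
    if binary_clf(context_and_text)[0]["label"] == "NOT":
        return "NONE"
    # Only items flagged as hateful reach the second-stage subtype classifier.
    return subtype_clf(context_and_text)[0]["label"]  # "SHOF" or "CHOF"
```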
6. Conclusion

We reported on experiments with conversational and contextual hate speech detection. The new ICHCL dataset was created with higher inter-annotator agreement. The use of unlabelled data was set as the challenge for the 2023 task; however, the participants did not use the data in that way. Overall, the submissions reached a good level of performance, with deep learning models achieving F1 scores of up to 0.8. In future evaluations, data augmentation with large language models might be a valuable direction; first experiments report positive outcomes [22].

References

[1] I. Kamenova, A. Perliger, Online hate crimes, in: Handbook on Crime and Technology, 2023, p. 278. doi:10.4337/9781800886643.00026.
[2] C. H. Chang, N. R. Deshmukh, P. R. Armsworth, Y. J. Masuda, Environmental users abandoned Twitter after Musk takeover, Trends in Ecology & Evolution (2023). doi:10.1016/j.tree.2023.07.002.
[3] A. Turillazzi, M. Taddeo, L. Floridi, F. Casolari, The Digital Services Act: an analysis of its ethical, legal, and social implications, Law, Innovation and Technology 15 (2023) 83–106. doi:10.1080/17579961.202.
[4] A. Arora, P. Nakov, M. Hardalov, S. M. Sarwar, V. Nayak, Y. Dinkov, D. Zlatkova, K. Dent, A. Bhatawdekar, G. Bouchard, I. Augenstein, Detecting harmful content on online platforms: What platforms need vs. where research efforts go, ACM Computing Surveys (2023). doi:10.1145/3603399, just accepted.
[5] S. Satapara, S. Modha, T. Mandl, H. Madhu, P. Majumder, Overview of the HASOC subtrack at FIRE 2021: Conversational hate speech detection in code-mixed language, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 13-17, 2021, volume 3159 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 20–31. URL: https://ceur-ws.org/Vol-3159/T1-2.pdf.
[6] S. Modha, T. Mandl, P. Majumder, S. Satapara, T. Patel, H. Madhu, Overview of the HASOC subtrack at FIRE 2022: Identification of conversational hate-speech in Hindi-English code-mixed and German language, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India, December 9-13, 2022, 2022, pp. 475–488. URL: https://ceur-ws.org/Vol-3395/T7-1.pdf.
[7] H. Park, S. Cho, J. Park, Word RNN as a baseline for sentence completion, in: 5th IEEE International Congress on Information Science and Technology, CiSt 2018, Marrakech, Morocco, October 21-27, 2018, IEEE, 2018, pp. 183–187. doi:10.1109/CIST.2018.8596572.
[8] L. Gao, R. Huang, Detecting online hate speech using context aware models, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 2017, pp. 260–266. doi:10.26615/978-954-452-049-6_036.
[9] G. Gorrell, E. Kochkina, M. Liakata, A. Aker, A. Zubiaga, K. Bontcheva, L. Derczynski, SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 845–854. doi:10.18653/v1/S19-2147.
[10] Q. Li, Q. Zhang, L. Si, eventAI at SemEval-2019 task 7: Rumor detection on social media by exploiting content, user credibility and propagation information, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 855–859. doi:10.18653/v1/S19-2148.
[11] J. Pavlopoulos, J. Sorensen, L. Dixon, N. Thain, I. Androutsopoulos, Toxicity detection: Does context really matter?, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 4296–4305. doi:10.18653/v1/2020.acl-main.396.
[12] S. Menini, A. P. Aprosio, S. Tonelli, Abuse is contextual, what about NLP? The role of context in abusive language annotation and detection, CoRR abs/2103.14916 (2021). URL: https://arxiv.org/abs/2103.14916. arXiv:2103.14916.
[13] N. Ljubešić, I. Mozetič, P. K. Novak, Quantifying the impact of context on the quality of manual hate speech annotation, Natural Language Engineering (2022) 1–14. doi:10.1017/S1351324922000353.
[14] B. Vidgen, D. Nguyen, H. Margetts, P. Rossini, R. Tromble, Introducing CAD: the Contextual Abuse Dataset, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2289–2303. doi:10.18653/v1/2021.naacl-main.182.
[15] X. Yu, E. Blanco, L. Hong, Hate speech and counter speech detection: Conversational context does matter, arXiv preprint (2022). URL: https://arxiv.org/abs/2206.06423.
[16] S. Satapara, S. Modha, T. Mandl, H. Madhu, P. Majumder, Overview of the HASOC subtrack at FIRE 2021: Conversational hate speech detection in code-mixed language, in: P. Mehta, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 13-17, 2021, volume 3159 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 20–31. URL: https://ceur-ws.org/Vol-3159/T1-2.pdf.
[17] M. S. Jahan, F. Hassan, W. Mohamed, A. M. Bouchekif, Multilingual hate speech detection using ensemble of transformer models, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR-WS.org, 2023.
[18] S. Chandal, A. Dhaka, S. Pal, Crossing borders: Multilingual hate speech detection, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR-WS.org, 2023.
[19] N. Madani, S. Saha, M. Sullivan, R. Srihari, Hate speech detection in low resource Indo-Aryan languages, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR-WS.org, 2023.
[20] C. Muhammad Awais, J. Raj, Breaking barriers: Multilingual toxicity analysis for hate speech and offensive language in low-resource Indo-Aryan languages, in: Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, CEUR-WS.org, 2023.
[21] P. M, R. K, A. Hegde, K. G, S. Coelho, H. L. Shashirekha, Taming toxicity: Learning models for hate speech and offensive language detection in social media text, in: K. Ghosh, T. Mandl, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023, CEUR-WS.org, 2023.
[22] A. Anuchitanukul, J. Ive, L. Specia, Revisiting contextual toxicity detection in conversations, ACM Journal of Data and Information Quality 15 (2023) 6:1–6:22. URL: https://doi.org/10.1145/3561390. doi:10.1145/3561390.