Hate Speech and Offensive Content Identification in Multiple Languages Using Machine Learning Algorithms

Dikshitha Vani V, B. Bharathi
Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Tamil Nadu 603110

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
† These authors contributed equally.
dikshithavani2010541@ssn.edu.in (D. V. V); bharathib@ssn.edu.in (B. Bharathi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract

The freedom of expression on social media sites like Twitter and Facebook gives people the opportunity to voice their opinions and concerns. At the same time, it has become a tool for bullying and hateful comments online. AI tools can identify such comments automatically. These identification tools are evaluated by continuous experimentation with data sets. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents our work on the HASOC task for Offensive Language Identification in Marathi. The data set was assembled from Twitter. The task has 3 subtasks. Subtask A is Offensive Language Detection, where the goal is to discriminate between offensive and non-offensive posts. Subtask B includes only the posts labeled as Offensive (OFF) in subtask A, and the goal is to predict the type of offense as either Targeted Insult (TIN) or Untargeted (UNT). Subtask C, the third layer of annotation, considers only the posts that are targeted insults or threats (TIN) and classifies them by the target of the offense as Individual (IND), Group (GRP), or Other (OTH). In this work, our team ssncse_nlp has applied machine learning classification algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Logistic Regression, and k-nearest neighbors (KNN), along with count vectorized features, to classify the tweets. The results show that Random Forest predicts the labels for subtasks A and C more accurately than the other classifier models, with Macro F1 scores of 0.9745 and 0.7929 respectively, while the Logistic Regression classifier predicts more accurately for subtask B with a Macro F1 of 0.6958.

1. Introduction

Social media is an active tool for people of all ages, who generate and consume a great deal of data. Social media has gained massive support and reach over the years because it helps people express themselves and their ideologies, showcase their talents, meet new people, connect with others' stories, and so on. On the other hand, it also means that people are openly critical of others and "impose" their ideologies on others. When ideologies clash, people resort to insulting, hurtful, derogatory or obscene language. Such objectionable content can even be a threat to democracy. Open societies need to find an acceptable way to react to such content without imposing rigid censorship regimes.

Online hate speech can be produced and distributed easily, at low cost and anonymously, while having the potential to reach a globally diverse audience in real time. The relative permanency of online content is also an issue, since hate speech can reappear and (re)gain popularity over time. As a consequence, many social media websites monitor user posts. This leads to a pressing demand for methods to automatically identify suspicious posts. Online communities, social media enterprises, and technology companies have been investing heavily in technology and processes to identify offensive language and prevent abusive social media behavior. However, there is increasing evidence that social media platforms still struggle to keep up with the demand for technology to identify offensive content, particularly for languages other than English.
This paper aims to detect hate speech in the Marathi language. Marathi is an Indo-Aryan language predominantly spoken by Marathi people in the Indian state of Maharashtra. It is the official language of Maharashtra, and a co-official language in Goa state and the territory of Daman, Diu & Silvassa. It is one of the 22 scheduled languages of India, with more than 90 million speakers. Marathi ranks 10th in the list of languages with the most native speakers in the world and has the third largest number of native speakers in India, after Hindi and Bengali. The language has some of the oldest literature of all modern Indian languages.

In this paper, we aim to classify the data based on a statistical analysis of the text using a count vectorizer. We outline four classifier models used along with the count vectorizer, namely Random Forest (RF), Support Vector Machine (SVM), Logistic Regression, and K-nearest neighbors (KNN).

The rest of the paper is organized as follows. Section 2 describes other related work on hate speech. The data set for the shared task and the machine learning pipeline used for this task are described in Section 3. Section 4 discusses the 4 classifier models used in this paper. Results are presented in Section 5. Section 6 discusses improvements that can be applied to the model. Section 7 concludes the paper.

2. Related Works

Identifying and classifying vulgar comments on social media has become an important field of study for ensuring security, safety, peace, and harmony among people. Classifier models coupled with a count vectorizer are among the most commonly used approaches to detect hate speech. Ensemble models, a machine learning approach that combines multiple other models in the prediction process, are also commonly used to improve classification performance.
[1], [2], [3], [4], and [5] use classical classifier models and ensemble models for classification (coupled with count vectorizers or TF-IDF vectorizers). The following paragraph briefly describes or names the models used in these papers.

In [1], the authors used ensemble learning models such as Random Forest and AdaBoost coupled with a count vectorizer for novel hate speech detection, where Random Forest yielded 95% accuracy. They also used a word cloud to display the most prominent tweets responsible for hateful sentiments. [2] discusses hate speech detection in the Indonesian language. The authors used five stand-alone classification algorithms, namely Naïve Bayes, K-Nearest Neighbours, Maximum Entropy, Random Forest, and Support Vector Machines, and two ensemble methods, namely hard voting and soft voting, on a Twitter hate speech database. They aimed to show that the ensemble method can improve classification performance. In [3], the authors devised a system that uses both machine learning and deep learning techniques to detect offensive comments. [4] describes the use of SVM and Naive Bayes classifiers for hate speech detection; the results showed a classification accuracy of approximately 99% and 50% for SVM and NB respectively over their test set. Similarly, [5] uses Logistic Regression and XGBoost classifiers.

Apart from classifier models combined with a count vectorizer or TF-IDF vectorizer, transformer models are also widely used for hate speech detection. [6] uses multilingual pre-trained models: the MBERT model for Hindi and Marathi, and the BERT model for English.

3. Proposed Methodology

The proposed system for detecting offensive content in the HASOC 2022 task 3 data is described in the following sections. The steps involved in the proposed system are as follows:

1. Data set exploration and preprocessing
2. Feature extraction
3. Model training and testing

Section 3.1 discusses the data set in detail, Section 3.2 discusses the need for feature extraction, and Section 3.3 discusses the training and testing procedures.

3.1. Data Set Exploration and Preprocessing

The data set given by the shared task organizers contains a training set of 3103 instances and a testing set of 510 instances. The training set contains the ID (the identity of the user) and the tweet, and the test set likewise consists of the ID and the tweet. Each instance of the training set also contains up to 3 labels, each corresponding to one of the following levels:

- Level (or subtask) A: Offensive language identification;
- Level (or subtask) B: Automatic categorization of offense types;
- Level (or subtask) C: Offense target identification.

The training set is split into 3 CSV files so that each subtask can be carried out separately. Subtask A contains all 3103 instances, while subtasks B and C contain 1068 and 740 instances respectively. The data set is split so that none of the files contains a row with a null entry. Each of these files contains the ID, the tweet, and the label of the corresponding level/task.

Our team has not done any preprocessing of the data set for the Marathi texts. The texts could be preprocessed by removing stopwords. Stopwords are words that do not provide any information for the classification but nevertheless increase the dimension of the matrix produced by the count vectorizer. This is explained in Section 6. The data set has been retrieved from [7].

3.2. Feature Extraction

Machine learning algorithms cannot operate on raw characters and words. In order to use textual data for predictive modeling, words need to be encoded as integers or floating-point values that can serve as inputs. This process is called feature extraction. We have used Scikit-learn's count vectorizer to convert a collection of text documents to vectors of term/token counts.
The count vectorizer also enables the pre-processing of text data prior to generating the vector representation, which makes it a highly flexible feature representation module for text. The count vectorizer is imported from the sklearn.feature_extraction.text module. The training set is passed to the fit_transform function, which fits the vectorizer on the training data and transforms it; the test set is then passed to transform so that it is encoded according to the vocabulary fitted on the training set.

3.3. Model Training and Testing

The model is trained and tested using 4 classifier models:

1. Random Forest
2. SVM classifier
3. Logistic Regression
4. KNN classifier

Subsection 3.3.1 discusses a major pre-processing step, label encoding. Subsection 3.3.2 briefly explains why the training data set is split in order to study these 4 classifier models.

3.3.1. Label Encoding

Label encoding is a popular technique for handling categorical variables. Each label is assigned a unique integer based on alphabetical ordering, which converts the labels into a numeric, machine-readable form that machine learning algorithms can operate on. It is an important preprocessing step for structured data sets in supervised learning. It can be done using LabelEncoder() from the sklearn.preprocessing module. Since the Scikit-learn library is used, there is no need to perform label encoding separately, as the classifier object takes care of it by itself.

3.3.2. Data Split

The tweet field of the training data (features) and the label of the subtask (labels) are used to perform a train and test data split. This split is necessary to check the correctness of the proposed model. The training portion of the split is used to fit the classifier. The tweets of the held-out test portion are given to the fitted model to predict the labels of the subtask.
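The label encoding and data split steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the toy tweets and labels, the 80/20 ratio, the stratified split, and the values of n_estimators and random_state are all hypothetical choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy tweets with subtask A style labels (OFF / NOT).
tweets = [
    "you are an idiot", "what a lovely day", "shut up you fool",
    "thanks for the help", "nobody likes you", "have a great weekend",
    "you are so stupid", "congratulations on the win",
    "get lost you loser", "see you at the match",
]
labels = ["OFF", "NOT", "OFF", "NOT", "OFF", "NOT", "OFF", "NOT", "OFF", "NOT"]

# LabelEncoder assigns integers in sorted (alphabetical) order: NOT -> 0, OFF -> 1.
# Scikit-learn classifiers also accept string labels directly, so this step
# is shown only to make the encoding visible.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Hold out 20% of the labeled data for testing (assumed ratio).
tweets_train, tweets_test, y_train, y_test = train_test_split(
    tweets, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the vectorizer on the training portion only, then encode the held-out part.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(tweets_train)
X_test = vectorizer.transform(tweets_test)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Map the predicted integers back to the original label strings.
predicted = encoder.inverse_transform(clf.predict(X_test))
print(list(predicted))
```

Fitting the vectorizer only on the training portion keeps the held-out tweets from influencing the vocabulary.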
The predicted results are then compared with the actual labels of the held-out test portion using the following metrics: classification report, confusion matrix, and accuracy score. Empirical studies show that the best results are obtained if 20-30% of the data is used for testing and the remaining 70-80% for training. The paper "Why 70/30 or 80/20 Relation Between Training and Testing Sets: A Pedagogical Explanation" (footnote 1), from the University of Texas at El Paso, discusses this. Each of the classifier models is discussed in the next section.

Footnote 1: Source: https://scholarworks.utep.edu/cgi/viewcontent.cgi?article=2202&context=cs_techrep

4. Classifier Models

This section is divided into four subsections. Section 4.1 describes the Random Forest classifier model, Section 4.2 the Support Vector Machine classifier model, Section 4.3 the Logistic Regression classifier model, and Section 4.4 the K-nearest neighbors classifier model.

4.1. Random Forest Classifier

Random forest is imported from the ensemble module. Ensemble means combining multiple models, i.e. a collection of models is used to make predictions rather than an individual model. Ensemble learning uses two types of methods:

1. Bagging: creates different training subsets from the sample training data with replacement, and the final output is based on majority voting.
2. Boosting: combines weak learners into strong learners by creating sequential models such that the final model has the highest accuracy.

The random forest algorithm is based on the bagging principle. The steps involved in the random forest algorithm are as follows:

• n random records are taken from the data set.
• An individual decision tree is constructed for each sample.
• Each decision tree generates an output.
• The final output is based on majority voting for classification.

Figure 1: Steps involved in Random Forest classification (footnote 2)

The following table describes the hyperparameters used.
| Hyperparameter | Function |
|---|---|
| n_estimators | The number of trees the algorithm builds before averaging the predictions |
| random_state | Controls the randomness of the bootstrapping of the samples used when building trees |

To train the model, the training portion of the split is fit to the classifier. The held-out test portion is then used to predict the labels. Table 1 compares the test predictions with the actual labels using the metrics macro F1, macro precision, and accuracy.

Table 1: Classification report (Random Forest), a comparison of test predictions with the actual labels of the subtasks

| Subtask | Macro F1 | Macro Precision | Accuracy |
|---|---|---|---|
| Subtask A | 0.87 | 0.89 | 0.88 |
| Subtask B | 0.52 | 0.68 | 0.68 |
| Subtask C | 0.36 | 0.50 | 0.71 |

Footnote 2: Source: https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest

4.2. Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space, n being the number of features, and the value of each feature is the value of a particular coordinate. Classification is performed by finding the hyperplane that differentiates the two classes, as shown in Figure 2.

Figure 2: Hyperplane differentiating between 2 classes (footnote 3)

When the classes can be separated linearly by a hyperplane, the cited source refers to the algorithm as SVC; when the data set is separated by a non-linear approach, it refers to the algorithm as SVM, as depicted in Figure 3.

To train the model, the training portion of the split is fit to the SVC classifier. The held-out test portion is then used to predict the labels. Table 2 compares the test predictions with the actual labels using the metrics macro F1, macro precision, and accuracy.
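As a rough illustration of how the metrics reported in these tables can be computed, the sketch below fits a Random Forest and an SVC on the same count-vectorized toy data and scores both on a small held-out set. The texts, the hyperparameter values, and the choice of a linear kernel are assumptions for illustration; the paper does not state them.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score
from sklearn.svm import SVC

# Hypothetical toy data standing in for the split training set.
train_texts = [
    "you are an idiot", "what a lovely day", "shut up you fool",
    "thanks for the help", "nobody likes you idiot", "have a great weekend",
]
train_labels = ["OFF", "NOT", "OFF", "NOT", "OFF", "NOT"]
test_texts = ["you fool", "lovely weekend"]
test_labels = ["OFF", "NOT"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Assumed settings; n_estimators and random_state are the two Random Forest
# hyperparameters named in the table above.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svc_linear": SVC(kernel="linear"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predicted = model.predict(X_test)
    scores[name] = {
        # Macro averaging computes the metric per class, then takes the
        # unweighted mean, so every class counts equally.
        "macro_f1": f1_score(test_labels, predicted, average="macro"),
        "macro_precision": precision_score(
            test_labels, predicted, average="macro", zero_division=0
        ),
        "accuracy": accuracy_score(test_labels, predicted),
    }

for name, metrics in scores.items():
    print(name, metrics)
```

Macro averaging weights every class equally, which is one reason accuracy and macro F1 can diverge on imbalanced subtasks.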
Figure 3: Difference between SVM and SVC classifiers (footnote 4)

Table 2: Classification report (SVM), a comparison of test predictions with the actual labels of the subtasks

| Subtask | Macro F1 | Macro Precision | Accuracy |
|---|---|---|---|
| Subtask A | 0.87 | 0.88 | 0.88 |
| Subtask B | 0.40 | 0.33 | 0.66 |
| Subtask C | 0.33 | 0.50 | 0.71 |

Footnote 3: Source: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
Footnote 4: Source: https://www.analyticsvidhya.com/blog/2021/06/support-vector-machine-better-understanding/

4.3. Logistic Regression

Logistic regression is a supervised classification algorithm in which the target variable can take only discrete values for a given set of features. It builds a regression model to predict the probability that a given data entry belongs to a given category. Logistic regression models the data using the sigmoid function (footnote 5).

Figure 4: Sigmoid function (footnote 6)

Based on the number of categories, logistic regression can be classified as:

1. Binomial: the target variable can have only 2 possible types, "0" or "1".
2. Multinomial: the target variable can have 3 or more possible types which are not ordered, such as "Type A" vs "Type B" vs "Type C".
3. Ordinal: the target variable has ordered categories. Each category can be given a value such as 0, 1, 2. For example, a test result can be classified as "poor", "good", or "better".

Footnote 5: Source: https://www.geeksforgeeks.org/understanding-logistic-regression/
Footnote 6: Source: https://www.researchgate.net/publication/239269767_Artificial_Neural_Networks_in_Multivariate_Calibration/figures?lo=1

To train the model, the training portion of the split is fit to the Logistic Regression model. The held-out test portion is then used to predict the labels. Table 3 compares the test predictions with the actual labels using the metrics macro F1, macro precision, and accuracy.
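To make the role of the sigmoid function in logistic regression concrete, the following sketch fits a binomial model on hypothetical one-dimensional data and checks that the predicted probability of the positive class equals the sigmoid of the model's linear score. The data and the default Scikit-learn settings are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical one-feature data with a binary ("0" or "1") target.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# For a binary problem, decision_function returns the linear score w*x + b,
# and predict_proba applies the sigmoid to that score.
linear_scores = model.decision_function(X)
probabilities = model.predict_proba(X)[:, 1]
manual_probabilities = sigmoid(linear_scores)

print(np.round(probabilities, 3))
print(np.allclose(probabilities, manual_probabilities))
```

A probability above 0.5 corresponds to a positive linear score, which is how the predicted class is chosen in the binomial case.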
Table 3: Classification report (Logistic Regression), a comparison of test predictions with the actual labels of the subtasks

| Subtask | Macro F1 | Macro Precision | Accuracy |
|---|---|---|---|
| Subtask A | 0.87 | 0.88 | 0.88 |
| Subtask B | 0.62 | 0.63 | 0.67 |
| Subtask C | 0.43 | 0.45 | 0.68 |

4.4. K-nearest Neighbors

K-nearest neighbors (KNN) is a supervised learning algorithm used for classification. KNN tries to predict the correct class for a test point by calculating the distance between the test point and all the training points. The K points that are closest to the test point are selected. The algorithm then estimates the probability of the test point belonging to each of the classes of these K training points, and the class with the highest probability is selected. Figure 5 depicts the steps involved in KNN classification.

Figure 5: Steps in KNN classification (footnote 7)

Footnote 7: Source: https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4

| Hyperparameter | Function |
|---|---|
| n_neighbors | The tuning parameter/hyperparameter (k) which indicates the count of the nearest neighbors |

To train the model, the training portion of the split is fit to the KNeighborsClassifier. The held-out test portion is then used to predict the labels. Table 4 compares the test predictions with the actual labels using the metrics macro F1, macro precision, and accuracy.

Across the 4 classifiers on the given data set, the Random Forest classifier with count vectorizer produces the highest (or tied-highest) accuracy and macro precision on every subtask, although Logistic Regression attains a higher macro F1 on subtasks B and C on this held-out split.

5. Results

The entire training data set (without splitting) is used to fit the model, and the test data set is used to predict the labels from the fitted model to get the results. The results were copied to a CSV file and submitted to the HASOC 2022 submission portal.
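The final submission step described above might look roughly like the sketch below: fit on all labeled data, predict on the unlabeled test tweets, and write the predictions out as CSV. The toy data, the hyperparameter values, the helper name write_submission, and the column names "id" and "label" are hypothetical; the actual submission format is not given in the paper.

```python
import csv
import io

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the full training set and the unlabeled test set.
train_tweets = [
    "you are an idiot", "what a lovely day", "shut up you fool",
    "thanks for the help", "nobody likes you", "have a great weekend",
]
train_labels = ["OFF", "NOT", "OFF", "NOT", "OFF", "NOT"]
test_ids = ["t1", "t2", "t3"]
test_tweets = ["you fool", "lovely weekend", "thanks a lot"]

# Fit on the entire training set, without holding anything out.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_tweets)
X_test = vectorizer.transform(test_tweets)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, train_labels)
predictions = clf.predict(X_test)


def write_submission(ids, labels, handle):
    """Write one (id, label) row per test tweet; column names are assumed."""
    writer = csv.writer(handle)
    writer.writerow(["id", "label"])
    for row_id, label in zip(ids, labels):
        writer.writerow([row_id, label])


# An in-memory buffer is used here; a real run would open a .csv file instead.
buffer = io.StringIO()
write_submission(test_ids, predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```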
Table 4: Classification report (K-nearest neighbors), a comparison of test predictions with the actual labels of the subtasks

| Subtask | Macro F1 | Macro Precision | Accuracy |
|---|---|---|---|
| Subtask A | 0.87 | 0.88 | 0.88 |
| Subtask B | 0.55 | 0.58 | 0.64 |
| Subtask C | 0.38 | 0.37 | 0.65 |

Our team ssncse_nlp ranked 1st in subtask A with a macro F1 and macro precision of 0.9745 and 0.9758 respectively using the Random Forest classifier. In subtask B, our team ranked 3rd with a macro F1 and macro precision of 0.6958 and 0.7587 respectively using the Logistic Regression classifier. In subtask C, our team ranked 2nd with a macro F1 and macro precision of 0.7929 and 0.7963 respectively using the Random Forest classifier. The overview of the HASOC 2022 task is given in [8] and [9]. The code is available at the GitHub link given in footnote 8. The overall results (macro F1) are shown in Table 5.

Table 5: Results of HASOC (footnote 9)

| Classifier Model | Subtask | Macro F1 |
|---|---|---|
| Random Forest | A | 0.9745 |
| Random Forest | B | 0.6938 |
| Random Forest | C | 0.7929 |
| Support Vector Machine | A | 0.8647 |
| Support Vector Machine | B | 0.5376 |
| Support Vector Machine | C | 0.3856 |
| Logistic Regression | A | 0.9587 |
| Logistic Regression | B | 0.6958 |
| Logistic Regression | C | 0.7867 |
| K-Nearest Neighbors | A | 0.7316 |
| K-Nearest Neighbors | B | 0.5018 |
| K-Nearest Neighbors | C | 0.3986 |

Footnote 8: Source: https://github.com/dikshu-02/HASOC2022-task3
Footnote 9: Source: https://hasocfire.github.io/submission/leaderboard.html

Inference: The above observations show that Random Forest is a better choice. This is because Random Forest works well even for data points that are randomly distributed.

6. Ablation Study

A count vectorizer consumes a lot of memory because it needs to store a vocabulary dictionary in memory. This problem can be reduced to some extent by preprocessing the data to remove the stopwords from the tweets. A few stopwords in Marathi language are as follows:

An alternative is to use a TF-IDF vectorizer, which can be preferable to a count vectorizer because TF-IDF considers not only the frequency of words in the corpus but also the importance of those words.
With TF-IDF weights available, the words that are less important for the analysis can then be removed, which makes model building less complex by reducing the input dimensions. [10] uses a TF-IDF vectorizer for social network hate speech detection in the Amharic language.

Another issue with these classifier models is that they do not take into account the context of a sentence. Hate speech is a very abstract and broad topic that is difficult to define, and its detection depends on people's subjective understanding of what hate speech is. The model discussed in this paper counts the occurrences of words, finds a pattern while training, and labels the data based on the counts of specific words. This can be misleading, as the same words can appear in sentences with the opposite meaning: both "This is good" and "This is not good" contain equal counts of "good", "this", and "is". To increase accuracy, pre-trained models for Marathi such as Roberta-base-mr can be used for sentiment analysis or context analysis. [11] shows the use of Marathi-based transformer models and compares the results of monolingual and multilingual models.

7. Conclusion

Scrutiny of online comments is becoming increasingly necessary to prevent cyberbullying and maintain the harmony of life. The models discussed in this paper are some of the ways to identify offensive tweets. Once such tweets are identified, there are many ways to deal with them (blocking the users, having a point-based system in social media for reducing hate comments, giving warnings, etc.), which are beyond the scope of this paper. There is, however, further scope for improvement in the discussed models, as seen in the ablation study.

References

[1] T. Turki, S. S. Roy, Novel hate speech detection using word cloud visualization and ensemble learning coupled with count vectorizer, Applied Sciences 12 (2022). URL: https://www.mdpi.com/2076-3417/12/13/6611. doi:10.3390/app12136611.
[2] M. A. Fauzi, A. Yuniarti, Ensemble method for indonesian twitter hate speech detection, Indonesian Journal of Electrical Engineering and Computer Science 11 (2018) 294–299.
[3] A. Anand, J. Golecha, B. Bharathi, B. Jayaraman, T. Mirnalinee, Machine learning based hate speech identification for english and indo-aryan languages (2021).
[4] D. Asogwa, C. Chukwuneke, N. Chigozie, G. Anigbogu, Hate speech classification using svm and naive bayes, 2022.
[5] M. K. A. Aljero, N. Dimililer, A novel stacked ensemble for hate speech recognition, Applied Sciences 11 (2021) 11684.
[6] A. Kalaivani, D. Thenmozhi, Multilingual hate speech and offensive language detection in english, hindi, and marathi languages (2021).
[7] M. Zampieri, T. Ranasinghe, M. Chaudhari, S. Gaikwad, P. Krishna, M. Nene, S. Paygude, Predicting the type and target of offensive social media posts in marathi, Social Network Analysis and Mining 12 (2022) 77. URL: https://doi.org/10.1007/s13278-022-00906-8. doi:10.1007/s13278-022-00906-8.
[8] S. Satapara, P. Majumder, T. Mandl, S. Modha, H. Madhu, T. Ranasinghe, M. Zampieri, K. North, D. Premasiri, Overview of the HASOC Subtrack at FIRE 2022: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages, in: FIRE 2022: Forum for Information Retrieval Evaluation, Virtual Event, 9th-13th December 2022, ACM, 2022.
[9] T. Ranasinghe, K. North, D. Premasiri, M. Zampieri, Overview of the HASOC subtrack at FIRE 2022: Offensive Language Identification in Marathi, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, CEUR, 2022.
[10] Z. Mossie, J.-H. Wang, Social network hate speech detection for amharic language, Computer Science & Information Technology (2018) 41–55.
[11] A. Velankar, H. Patil, R. Joshi, Mono vs multilingual bert for hate speech detection and text classification: A case study in marathi, arXiv preprint arXiv:2204.08669 (2022).