DLRG@HASOC 2019: An Enhanced Ensemble Classifier for Hate and Offensive Content Identification

R. Rajalakshmi* and B. Yashwant Reddy
School of Computing Science and Engineering, Vellore Institute of Technology, Chennai, India
rajalakshmi.r@vit.ac.in, byashwanth.reddy2016@vitstudent.ac.in

Abstract. Recent advances in Internet technologies have changed social media tremendously. Hate speech is an attack directed at a group of people on the basis of their religion, gender, colour, etc. Offensive content in social media poses a threat to democracy. As hate speech and offensive content on the web increase day by day, manually monitoring or controlling such hate crimes is a highly challenging task. Most of the existing methodologies focus on English language tweets, and only limited work has been reported for Hindi and German language posts. Also, the importance of feature selection methods has not been explored much for this problem. In this research work, an enhanced ensemble classifier approach is proposed to identify hate and offensive content posted in Hindi or German. In the proposed approach, a CHI square based feature selection method is combined with a Random Forest classifier to classify the tweets. This work was submitted to the Hate and Offensive Content Identification (HASOC) task@FIRE2019. The experiments conducted on the released HASOC dataset show that accuracies of 81% and 64% were achieved on German and Hindi language tweets, respectively.

Keywords: Hate Speech Identification, Ensemble Classifier, Chi Square Feature Selection, German, Hindi, Social Media

1 Introduction

Thanks to advanced technologies, many people now post their opinions, thoughts and comments on social websites such as Facebook and Twitter. The offensive and hate speech posted in social media increases every day, and companies are investing heavily to identify such offensive tweets.
As these offensive tweets contain different hashtags and emojis and follow various language styles, it is highly challenging to monitor and control such hate crimes manually.

* Corresponding Author
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.

To overcome the above issues, machine learning based methods were proposed in existing works [1, 2], but they focused on detecting hate speech in general rather than offensive speech in particular. Even though the hate speech detection problem for the English language has been studied by various researchers, only a few works have been reported for German and Hindi language tweets. The existing works followed lexicon-based and rule-based approaches, which do not generalize well. Also, traditional tf-idf based methods were used with simple linear classifiers and emphasis was given to other feature weighting methods. In this research work, an attempt is made to study the importance of feature selection methods along with the power of ensemble based classifiers. We propose an enhanced ensemble classifier that uses the CHI square based feature selection method to select the important features.

This research work was submitted to the Hate and Offensive Content Identification (HASOC) task@FIRE2019. As part of the task, the organizers released datasets containing tweets in German and Hindi. The task is to identify the tweets that contain hate and offensive content. To perform this binary classification task, we applied various machine learning techniques after extracting suitable features from the given data. To study the importance of feature selection methods, we conducted experiments with different feature selection methods such as TF-IDF, Mutual Information and the CHI square based approach.
Of the two datasets, the German dataset was highly imbalanced, so we applied the widely used SMOTE analysis. To design a suitable predictive model, we conducted experiments with various machine learning techniques such as Logistic Regression, Support Vector Machine and Random Forest Classifier. From the experimental results, it is observed that the ensemble based approach is better than the individual classifiers. We achieved an accuracy of 81% on the German dataset and 64% on the Hindi dataset by applying the Random Forest classifier with CHI square based feature selection.

The paper is organized as follows: related works are presented in Section 2 and the proposed methodology is detailed in Section 3. The experimental results and discussion are presented in Section 4, followed by the conclusion in Section 5.

2 Related Works

Many studies have been reported on classifying offensive content on the web. Greevy and Smeaton [5] used SVM and bag of words to detect offensive content on web pages. They used the PRINCP corpus of 3 million words with two class labels, namely offensive and not offensive. BOW, n-gram word sequences and POS tagged documents were used to represent the dataset. However, they used only an SVM classifier for detection, and other methods were not explored. A similar approach was suggested by Warner and Hirschberg [4], who used unigrams with SVM to detect offensive content on the web.

Hatebase is an online repository of hate speech words. Davidson et al. [6] built a classifier for Hatebase. They created unigram, bigram and trigram features weighted by TF-IDF and calculated their Part of Speech (POS) tags. They suggested linear classifiers for classifying offensive language. However, the model was biased towards offensive language and failed to differentiate commonplace offensive language from serious hate speech.
Google has developed a tool that rates the toxicity of comments on a scale from 0 to 100. Nobata et al. [7] proposed annotation of hate speech versus clean speech. They collected a news and finance dataset for the binary classification of abusive and clean tweets. They employed Vowpal Wabbit's regression model on features obtained through n-grams, linguistic, syntactic and distributional semantics. They compared the accuracies of all the features, but they worked only on the English language and did not attempt other languages. Gitari et al. [8] further classified the tweets as strong or weak using lexicon based approaches. They used semantic and subjectivity approaches to create a lexicon and used these features for a classifier. However, they used a rule-based classifier instead of a machine learning model, which led to low precision and recall scores. Chawla et al. [11] over-sampled the minority class through SMOTE (Synthetic Minority Over-sampling Technique), which generates new synthetic examples along the line between the minority examples and their selected nearest neighbors.

To handle multilingual queries, code mixing and code borrowing need to be differentiated. The borrowing likeliness of English words in the Hindi language was determined by a novel relevance factor [14]. In that work, both Hindi and English tweets were considered to find the relevant words. Various feature weighting methods have been proposed for URL classification and sentiment analysis problems, and the effectiveness of different classifiers has been studied. The importance of features like tf-idf and mutual information in determining the category of a web page was explored by using URL based features [15]. For sentiment analysis on movie reviews, the tf-idf and word2vec methods were applied and the effectiveness of a deep learning model was studied in [16].
A novel feature weighting method has been proposed for the Naïve Bayes classifier [17], for the problem of categorizing URLs by considering only the features derived from the URLs. In that work, a variant of the CHI square method was suggested to find the goodness of features, and it was embedded into the calculation of the likelihood probability for the Naïve Bayes classifier. Using linear SVM weights as features, URL classification was performed in [18]. These URL features were learnt automatically and a data-set independent dictionary was constructed to classify the URLs. In another work [19], a transfer learning approach was preferred to learn the features from a Convolutional Neural Network, and these features were used as input to an SVM for classifying URLs generated by Domain Generation Algorithms. In all the above mentioned works, the significance of feature weighting methods has been studied for classifying web pages.

GermEval is a shared task focused on offensive language identification in German tweets (8500 tweets). Wiegand et al. (2018) [21] further applied the idea of Waseem et al. to this task. They experimented with detecting offensive vs. non-offensive tweets, and also with a second task on further sub-classifying the offensive tweets as insult, abuse or profanity. The 2018 Workshop on Trolling, Aggression, and Cyberbullying (TRAC) hosted a shared task focused on detecting aggressive text in both English and Hindi [22]. The dataset from this task is available to the public and contains 15,869 Facebook comments labeled as overtly aggressive (OAG), covertly aggressive (CAG), or non-aggressive (NAG). The best-performing scores were obtained using convolutional neural networks (CNN), recurrent neural networks and LSTM. The Offensive Language Identification Dataset (OLID), which was built specifically for this task, was annotated using a hierarchical three-level annotation model introduced in Zampieri et al.
[20]. The three subtasks are Offensive Language Identification (Not Offensive, Offensive), Categorization of Offensive Language (Targeted Insult, Untargeted) and Offensive Language Target Identification (Individual, Group, Other) [23]. In all the above methods, the importance of determining the offensive content is emphasized.

3 Proposed Methodology

The task of identifying hate and offensive content in tweets is treated as a binary classification problem. The performance of any binary classifier depends on suitable features and on the chosen machine learning algorithm. In this work, three different feature selection methods were chosen, viz., i) TF-IDF (Term Frequency – Inverse Document Frequency), ii) Mutual Information and iii) CHI square. Also, the effectiveness of the ensemble method has been studied by applying three classifiers, viz., Logistic Regression, Support Vector Machine and Random Forest Classifier. To identify hate and offensive speech on the two data sets, viz., the German dataset and the Hindi dataset, the following steps are performed:

– Translation of tweets to English
– Pre-processing and Tokenization
– Feature Extraction by applying three variants, viz., TF-IDF, Mutual Information and CHI square
– Performing SMOTE analysis (this step is required only for the German dataset, as it is highly imbalanced)
– Building the model and using it to predict whether a given tweet is offensive or not.

3.1 Translation of Tweets

In this task, we were provided with datasets in two different languages (German and Hindi). As a first step, the tweets are translated to English. For example, the German tweet "Frank Rennicke – Ich bin stolz" was converted by employing MLtranslate, which results in the corresponding English tweet "Frank Rennicke - I am proud". For this translation process, the ML Translator API was used, which is Google's Neural Machine Translation (NMT) system [24].
This translation method is widely used because of its simplicity and zero-shot translation. Johnson et al. [24] proposed a single multilingual Neural Translation model that shares the same encoder, decoder and attention modules across all the languages without increasing the complexity of the model. Also, as the parameters are shared across all the languages, it generalizes well to multiple languages. This NMT model has the advantage of zero-shot translation, as several language pairs are used in a single model and unseen word pairs in different languages are also learnt by the model. We found this translation process suitable for this task and hence applied it to convert the tweets in German / Hindi to English.

3.2 Pre-processing and Tokenization

Hashtags provide insights into a specific ideology held by a group of people. These tags provide vital information for text classification, especially for the identification of offensive language in tweets. So we processed the hashtags, segmented them and obtained tokenized words from them. For example, after applying hashtag segmentation to the pre-processed tweet #everythingisgood, we obtain everything is good. Lemmatization is the process of reducing a word to its root form, which is helpful. We used the NLTK (Natural Language Tool Kit) WordNet Lemmatizer for performing lemmatization. Consider the following example:

Koeln Mohamed recognizes no German right but only the #Scharia. That he wanted to break Cologne Cathedral was just a joke but when he comes out of jail, he has no more pity.

After lemmatising, it becomes:

koeln mohamed recognizes german right scharia wanted break cologne cathedral joke come jail pity.

3.3 Feature Extraction

In any text classification task, feature extraction plays an important role.
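Before turning to the individual feature variants, the hashtag segmentation step of Section 3.2 can be illustrated with a small sketch. The paper does not state which segmentation procedure was used, so the dictionary-based dynamic programme below is only one plausible approach; the function name `segment_hashtag` and the tiny vocabulary `VOCAB` are hypothetical and chosen purely for illustration.

```python
def segment_hashtag(tag, vocabulary):
    """Split a hashtag into known words; return None if no split exists."""
    text = tag.lstrip("#").lower()
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                # Prefer the split that uses the fewest words.
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Hypothetical toy vocabulary (an assumption, not the authors' word list).
VOCAB = {"every", "everything", "thing", "is", "good", "go", "od"}

print(segment_hashtag("#everythingisgood", VOCAB))
```

Preferring the split with the fewest words keeps "everything" whole instead of breaking it into "every" and "thing". In practice the vocabulary would come from a large word list, and hashtags that cannot be segmented (the function returns None) could simply be kept as a single token.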
To extract suitable features from the pre-processed data, we used three variants, namely TF-IDF, Mutual Information and Chi-square.

TF-IDF: TF-IDF (Term Frequency – Inverse Document Frequency) is a well-known weighting scheme. Its score is calculated from the count of a term in each tweet together with the presence of that term in the entire corpus. We chose this feature weighting scheme because it extracts the most descriptive terms from the tweet collection and is simple to implement. In our experiments, the minimum frequency of a word is set to 5 and the maximum number of words is set to 5000.

Mutual Information: Mutual Information (MI) is a measure of dependence between two random variables. In the context of feature selection for text classification problems, it can be used to find the dependency between the input features and the output categories. For the given task of classifying the tweets, we can calculate the amount of information a particular word contributes to the class label (offensive). If the mutual information is high, the feature has high relevance to that target, and if it is zero, there is no relevance. In the HASOC German (and also Hindi) dataset, we calculated the values of a, b, c and d based on the number of training tweets in the positive / negative category that contain / do not contain the term t_i. The mutual information is obtained by using the formula shown below.
$$\mathrm{MI} = \log_2\left(\max\left(\frac{aN}{(a+b)(a+c)},\ \frac{cN}{(c+d)(a+c)}\right)\right) \tag{1}$$

where
'a' denotes the number of positive category tweets in the training data that contain the term t_i,
'b' denotes the number of positive category tweets in the training data that do not contain the term t_i,
'c' denotes the number of negative category tweets in the training data that contain the term t_i,
'd' denotes the number of negative category tweets in the training data that do not contain the term t_i,
and N = a + b + c + d is the total number of training tweets.

Chi Square: The Chi-square test is generally applied to find the relationship between two variables. The effectiveness of the Chi-square based feature selection method has been reported in various text / web page classification problems [15,17]. In Natural Language Processing, identifying the relevant words is important to increase the efficiency of the classification algorithm. The Chi-square statistic is small if the term is uncorrelated with the class and high if the term is correlated. In this task, we calculated the Chi-square statistic using the dataset and selected the terms with high scores, as they are the most informative features. Its formula is given below using the same notation a, b, c, d and N as above.

$$\chi^2 = \frac{N\,(ad - bc)^2}{(a+c)(b+d)(a+b)(c+d)} \tag{2}$$

3.4 Addressing Imbalanced Data and Classification

The German dataset was highly imbalanced: it contains 3412 hate and offensive tweets and 407 non-offensive tweets, so SMOTE analysis is performed. For the Hindi dataset, this step was skipped, as it is a balanced dataset.

Random oversampling and undersampling: Random oversampling works as its name suggests, by adding a set of randomly chosen samples from the minority category to the data set. While oversampling adds data to the original data set, random undersampling removes data from it. Both methods alter the size of the original data set.
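The two random sampling strategies just described can be sketched in a few lines. This is only an illustration under simple assumptions (samples held in plain Python lists, classes balanced to equal size); the function names and the toy data are hypothetical and do not come from the paper.

```python
import random


def random_oversample(minority, majority, seed=0):
    """Duplicate random minority samples until both classes have equal size."""
    rng = random.Random(seed)
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return minority + extra, majority


def random_undersample(minority, majority, seed=0):
    """Keep only a random subset of the majority class, of minority size."""
    rng = random.Random(seed)
    keep = min(len(minority), len(majority))
    return minority, rng.sample(majority, keep)


# Made-up toy data: 3 minority samples and 10 majority samples.
TOY_MINORITY = ["m1", "m2", "m3"]
TOY_MAJORITY = ["M%d" % i for i in range(10)]

over_min, over_maj = random_oversample(TOY_MINORITY, TOY_MAJORITY)
under_min, under_maj = random_undersample(TOY_MINORITY, TOY_MAJORITY)
print(len(over_min), len(over_maj))    # both classes now have 10 samples
print(len(under_min), len(under_maj))  # both classes now have 3 samples
```

Oversampling only repeats existing minority samples, and undersampling discards majority samples, so neither adds new information to the training data.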
Even though training accuracy may increase when such random sampling is applied, model performance on test data tends to be relatively low [13].

SMOTE Analysis: We applied SMOTE (Synthetic Minority Oversampling Technique) from sklearn. With this oversampling technique, the number of minority class tweets is increased to the number of majority class tweets. This method generates synthetic minority examples to over-sample the minority class. For every minority example, its k (which is set to 5 in SMOTE) nearest neighbours of the same class are calculated, then some examples are randomly selected from them according to the over-sampling rate. SMOTE analysis was applied because it gives better performance than other sampling techniques [11].

Classification: After making the dataset suitable for training, two different models were designed, one with Logistic Regression and another with the ensemble classifier Random Forest, by varying the feature weighting methods, viz., TF-IDF, Mutual Information and Chi-square.

4 Experimental Results

To study the performance of the proposed method on the German and Hindi datasets, various experiments were conducted. For implementation, we used Python 3 and the scikit-learn library. All the experiments were carried out on a workstation with an Intel Xeon Quad Core Processor, 32 GB RAM and an NVIDIA Quadro P4000 GPU (8 GB). For the initial experiments, we divided the released training data into a training set and a validation set and conducted the experiments using accuracy as the performance metric. Finally, the performance of the proposed system was tested on the test set released by the organizers. For these experiments, we combined all the training and validation data into a single training set and applied the algorithm. We report the validation accuracy and test accuracy obtained on both the German dataset and the Hindi dataset.

After translation and pre-processing of tweets, tokenization was performed.
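Once the tweets are tokenized, the counts a, b, c and d of Section 3.3 and the Chi-square score of Equation (2) can be computed for every term. The following is a minimal sketch of that computation, not the authors' code; the function names and the toy tweets are hypothetical, and label 1 is assumed to mark the positive (offensive) category.

```python
from collections import Counter


def term_counts(tweets, labels):
    """Map each term to the counts (a, b, c, d) defined in Section 3.3."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(tweets, labels):
        # set() ensures a term is counted at most once per tweet.
        (pos_df if y == 1 else neg_df).update(set(tokens))
    counts = {}
    for term in set(pos_df) | set(neg_df):
        a, c = pos_df[term], neg_df[term]
        counts[term] = (a, n_pos - a, c, n_neg - c)
    return counts


def chi_square(a, b, c, d):
    """Chi-square statistic of Equation (2) for one term."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or the class is constant: score 0.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_top_terms(tweets, labels, k):
    """Return the k terms with the highest Chi-square score."""
    counts = term_counts(tweets, labels)
    scores = {term: chi_square(*abcd) for term, abcd in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up toy data: label 1 = offensive, 0 = not offensive.
TOY_TWEETS = [["you", "idiot"], ["nice", "day"], ["idiot", "again"], ["good", "day"]]
TOY_LABELS = [1, 0, 1, 0]
print(select_top_terms(TOY_TWEETS, TOY_LABELS, k=2))
```

Because the statistic squares (ad - bc), a term that occurs only in non-offensive tweets scores as high as one that occurs only in offensive tweets; both are informative for a binary classifier.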
Then, to extract suitable features, we applied three variants, viz., TF-IDF, Mutual Information and Chi-square. First, the TF-IDF vectorizer (using sklearn) was used to get a maximum of 10,000 features with a minimum occurrence frequency of 2 for the German dataset and 5000 features for the Hindi dataset. We then tried the count vectorizer (using sklearn) and calculated Mutual Information and Chi-square values for every word token using the above mentioned formulas. In this way, a total of 12,717 features were extracted for the German dataset and 15,111 features for the Hindi dataset. We used the above features with Logistic Regression (LR) and Random Forest (RF) classifiers for the three variants, viz., TF-IDF, Mutual Information and Chi-square values. The accuracy of the simple and ensemble classifiers on the validation set and test set is presented in Table 1 and Table 2.

It is observed from Table 1 that, on the German dataset, among the three feature weighting schemes, the CHI square based feature weighting method performs better than the other two methods, viz., TF-IDF and MI, with the Random Forest classifier. A validation accuracy of 90% was achieved when combining CHI square based features with the ensemble classifier Random Forest. It is also to be noted that MI performs better than TF-IDF and resulted in 88% and 89% validation accuracy on the German dataset with the single Logistic Regression classifier and with the Random Forest classifier, respectively. On the Hindi dataset, a validation accuracy of 79% was achieved with the Random Forest classifier for the CHI square based feature selection.

Table 1. Performance - Single and Ensemble classifier - Validation Accuracy (%)

Dataset   Logistic Regression        Random Forest
          TF-IDF   MI    CHI         TF-IDF   MI    CHI
German    86       88    83          88       89    90
Hindi     77       69    78          75       72    79

Based on the inference from the validation set, we applied CHI square with the Random Forest classifier on the released test set, and the results are reported in Table 2.
We obtained an accuracy of 81% and 64% on the German dataset and the Hindi dataset, respectively.

Table 2. Performance of proposed approach - Test Accuracy (%)

Dataset   Random Forest with CHI
German    81
Hindi     64

5 Conclusion

This work was submitted to the FIRE2019 task, Identification of Hate and Offensive Speech in Indo-European Languages. In this research work, the problem of identifying hate and offensive content in tweets has been experimentally studied on datasets in two different languages, German and Hindi. The importance of feature weighting methods was analysed by using three different variants, viz., TF-IDF, Mutual Information and CHI square based feature selection. After choosing the suitable feature selection method, we studied the significance of an ensemble classifier over an individual classifier. Among the released datasets, the German dataset was highly imbalanced, so we applied SMOTE analysis and then performed classification. The experimental results show that the Random Forest classifier with the CHI square based feature selection method performs better than the other methods, and test accuracies of 81% and 64% were achieved on the German and Hindi datasets, respectively. In this work, we have restricted ourselves to machine learning approaches with a suitable feature selection method; deep learning techniques will be explored in future.

6 Acknowledgment

The authors would like to thank the management of Vellore Institute of Technology, Chennai for providing the support to carry out this work. The first author would like to thank the Department of Science and Engineering Research Board (SERB), Government of India for their financial grant (Award Number: ECR/2016/00484) for this research work.

References

1. Burnap, P., Williams, M.L.: Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making. In: Policy and Internet, vol. 7.2, pp. 223-242 (2015).
2.
Kwok, I., Wang, Y.: Locate the hate: Detecting tweets against blacks. In: Twenty-Seventh AAAI Conference on Artificial Intelligence, pp. 1621-1622 (2013).
3. de Gibert, O., Perez, N., García-Pablos, A., Cuadros, M.: Hate Speech Dataset from a White Supremacy Forum. In: 2nd Workshop on Abusive Language Online, pp. 11-20 (2018).
4. Warner, W., Hirschberg, J.: Detecting Hate Speech on the World Wide Web. In: Proceedings of the Second Workshop on Language in Social Media, pp. 19-26 (2012).
5. Greevy, E., Smeaton, A.F.: Classifying racist texts using a support vector machine. In: Proceedings of the 27th annual international conference on Research and development in information retrieval - SIGIR '04, pp. 468-469 (2004).
6. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated Hate Speech Detection and the Problem of Offensive Language. In: Proceedings of the Eleventh International AAAI Conference on Web and Social Media (ICWSM 2017), pp. 512-515 (2017).
7. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive Language Detection in Online User Content. In: Proceedings of the 25th International Conference on World Wide Web (WWW 2016), pp. 145-153 (2016).
8. Gitari, D., Zuping, Z., Damien, H., Long, J.: A Lexicon-based Approach for Hate Speech Detection. In: International Journal of Multimedia and Ubiquitous Engineering, vol. 10.4, pp. 215-230 (2015).
9. Hall, M., Smith, L.: Practical feature subset selection for machine learning. In: Proceedings of the 21st Australasian Conference on Computer Science, pp. 181-191 (1998).
10. Wu, H., Gu, X.: Balancing Between Over-Weighting and Under-Weighting in Supervised Term Weighting. In: International Journal of Information Processing and Management, vol. 53, pp. 547-557 (2017).
11. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-Sampling Technique. In: Journal of Artificial Intelligence Research, vol. 16, pp. 321-357 (2002).
12.
Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: International Conference on Intelligent Computing, pp. 878-887 (2005).
13. He, H., Garcia, E.A.: Learning from imbalanced data. In: IEEE Transactions on Knowledge and Data Engineering, vol. 21, pp. 1263-1284 (2009).
14. Rajalakshmi, R., Agrawal, R.: Borrowing Likeliness Ranking based on Relevance Factor. In: Proceedings of the Fourth ACM IKDD Conferences on Data Sciences, CODS 2017, India, pp. 12:1-12:2 (2017).
15. Rajalakshmi, R., Xaviar, S.: Experimental Study of Feature Weighting Techniques for URL Based Webpage Classification. In: Procedia Computer Science, vol. 115, pp. 218-225 (2017).
16. Sivakumar, S., Rajalakshmi, R.: Comparative evaluation of various feature weighting methods on movie reviews. In: Advances in Intelligent Systems and Computing, vol. 711, pp. 721-730 (2019).
17. Rajalakshmi, R., Aravindan, C.: Naive Bayes approach for URL classification with supervised feature selection and rejection framework. In: Computational Intelligence, vol. 34(1), pp. 363-396 (2018).
18. Rajalakshmi, R., Aravindan, C.: An Effective and Discriminative Feature Learning for URL Based Web Page Classification. In: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, pp. 1374-1379 (2018).
19. Rajalakshmi, R., Ramraj, S., Ramesh Kannan, R.: Transfer learning approach for identification of malicious domain names. In: Communications in Computer and Information Science, vol. 969, pp. 656-666 (2019).
20. Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., Kumar, R.: SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). In: Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 75-86 (2019).
21. Wiegand, M., Siegel, M., Ruppenhofer, J.: Overview of the GermEval 2018 shared task on the identification of offensive language (2018).
22.
Kumar, R., Ojha, A.K., Malmasi, S., Zampieri, M.: Benchmarking Aggression Identification in Social Media. In: Proceedings of TRAC (2018).
23. Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., Kumar, R.: Predicting the Type and Target of Offensive Posts in Social Media. In: Proceedings of NAACL (2019).
24. Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., Dean, J.: Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Vol. 5, pp. 339-351 (2017).
25. Modha, S., Mandl, T., Majumder, P., Patel, D.: Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages. In: Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation (December 2019).