QMUL-NLP at HASOC 2019: Offensive Content Detection and Classification in Social Media

Aiqi Jiang
Queen Mary University of London, London E1 4NS, UK
aiqi.jiang@yahoo.com

Abstract. With the development of the Internet, the Web has become an information dissemination platform, an information amplifier, and a new form of social media. The information load and level of participation on the Internet far exceed those of traditional media, and various problems have emerged. There has been significant work on offensive content detection in several languages, in particular English, but research on this recent and relevant topic is still lacking for most other languages. The HASOC track intends to develop data and evaluation resources for several languages; its objectives are to stimulate research on these languages and to assess the quality of hate speech detection technology beyond English. This paper first describes the organization of the HASOC 2019 shared task on Hate Speech and Offensive Content Identification in Indo-European Languages. The task is organized in three related classification subtasks: subtask A is a coarse-grained binary classification to identify hate speech and offensive language, subtask B is a fine-grained classification that further divides the data from subtask A into three categories, and subtask C considers the type of offense. This paper focuses on English offensive language detection and reports experimental results for subtask A and subtask B.

Keywords: Hate speech detection · Offensive language · Word embedding · Text classification · LSTM · HASOC

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.

1 Introduction

With the popularity of Internet applications and the ease of free expression, a large amount of hate speech and other offensive content on the Internet poses a threat to the stability of society. Online communication platforms apply little scrutiny to what is posted, so offensive language of all kinds, whether insulting, harmful, derogatory or obscene, spreads freely and quickly from person to person and can influence people's views and social trends [17]. Much of this language contains critical and misleading statements that lack factual grounding, which may lead to excessive behaviour in society and pose a threat to democracy. Therefore, in order to prevent the abuse and spread of hate speech on social media, accurate detection of hate speech is urgent. At present, many online communities, social media companies and technology companies pay great attention to this line of research and invest considerable money and technical support in it [16].

The rest of this paper is organized as follows. Section 2 reviews recent research on hate speech detection. Section 3 introduces the datasets released by HASOC to the participants for training and testing the systems. Section 4 presents the two subtasks and the measures used in the evaluation. Section 5 reports the approaches used in our experiments and the results of the system. Finally, Section 6 concludes the paper.
2 Related Work

Existing work on hate speech detection has been rather limited, largely due to the lack of a general definition of hate speech, the lack of analysis of its demographic impact, and the lack of surveys of the most effective features [11]. Generally speaking, hate speech consists of attacks on individuals or groups based on characteristics such as gender, race, religion, ethnicity, disability or sexual orientation. It refers to statements that deliberately suppress, intimidate, or incite violence and prejudice against individuals or groups [15]. Research on hate speech detection has developed only in the last few years. The techniques used for hate speech detection in social media mainly involve dictionaries and lexicons [4], bag-of-words (BOW) [3], TF-IDF [1], part-of-speech (POS) tags [3], and word embeddings [2]. Many recent studies have shown that deep learning techniques combined with word embeddings achieve higher accuracy in text categorization [2]. Among them, Word2Vec has found many applications [7]; it is an unsupervised word embedding method that captures the semantic and syntactic relationships of words and thereby exploits more of the attributes and contextual hints of human language.

The most common approach found in the work of [6] is to build a machine learning model for hate speech classification. In terms of how frequently they appear, the most commonly used algorithms are SVM, Random Forests, Decision Trees, Logistic Regression and Naive Bayes, where Random Forests and Logistic Regression show good performance. As for deep learning methods, existing ones are largely based on Convolutional Neural Networks (CNN) or Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN) [14]. Intuitively, traditional machine learning methods learn features similar to n-gram sequences, while deep learning methods learn sequence order, which seems more useful for classification tasks [9].

In this paper, we use the English dataset to address mainly Subtask A and Subtask B, where different feature extraction methods (n-grams and word embeddings) and classification algorithms (Logistic Regression and LSTM) are implemented in the experiments.

3 Data

3.1 Datasets

The training dataset provided by HASOC is drawn mainly from Twitter and Facebook posts in English. It is raw data containing a text ID, the post content and class labels for the three subtasks. The external dataset was collected with the public Twitter search API, filtering out tweets not written in English [13]. The English training corpus released by HASOC contains 5852 posts, the external dataset has 39292 texts, and the test dataset has 1153 posts. Table 1 shows the sizes of the three datasets we used.

Table 1. Size of data sets for English

  Dataset               Number of texts
  HASOC training data   5852
  Annotated data        39292
  HASOC test data       1153

3.2 Training and Test Data

– Training data
  The training set randomly combines the training dataset released by HASOC with the external dataset. The combined dataset is then divided into a training set and a test set at a ratio of 4:1, as sketched below.
– Test data
  The test set is the one released by HASOC, with approximately 1100 posts.
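To make the data preparation concrete, the following is a minimal sketch of how the combined corpus can be built and split at a 4:1 ratio. The file names, column names and random seed are illustrative assumptions, not necessarily the exact ones used in our runs.

```python
# A minimal sketch of building the combined corpus and splitting it 4:1.
# File and column names below are placeholders (assumptions), not the exact
# ones in the released HASOC files.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the HASOC training file and the external annotated corpus
hasoc_df = pd.read_csv("english_dataset.tsv", sep="\t")      # assumed file name
external_df = pd.read_csv("external_annotated.csv")          # assumed file name

# Keep only the columns needed for subtask A: the post text and the label
hasoc_df = hasoc_df[["text", "task_1"]].rename(columns={"task_1": "label"})
external_df = external_df[["text", "label"]]

# Shuffle the two sources together into one corpus
combined = pd.concat([hasoc_df, external_df], ignore_index=True)
combined = combined.sample(frac=1.0, random_state=42).reset_index(drop=True)

# 4:1 split into a training set and a held-out test set
train_df, test_df = train_test_split(
    combined, test_size=0.2, random_state=42, stratify=combined["label"])
print(len(train_df), len(test_df))
```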
4 Task Description

The format of an annotated text in the training and development set follows the pattern:

  ID, text, task 1, task 2, task 3

where ID is a progressive number identifying the text within the dataset, text is the given post, and the remaining three fields are the class labels of the text. The test set only includes ID and text. An example of one post is as follows:

  hasoc_en_2, @politico No. We should remember very clearly that Individual1 just admitted to treason. TrumpIsATraitor McCainsAHero JohnMcCainDay, HOF, HATE, TIN

where the text has been classified by the annotators as hateful-offensive, hateful, and insulting to an individual, group, or others.

4.1 Subtask A

Subtask A is a coarse-grained binary classification task for hate speech and offensive language identification [5]. The system has to predict whether a text in English contains hate speech or offensive content. There are two labels for this subtask: HOF and NOT. The label HOF means the post contains some form of non-acceptable language, such as hate speech, aggression or profanity; otherwise the label is NOT.

4.2 Subtask B

Subtask B is a fine-grained multi-class classification task that further distinguishes three classes: HATE, OFFN and PRFN. There are four annotations in total, where most posts are classified as OTHER, some as HATE, and the other two categories are relatively rare. Dubious cases, which are difficult to decide even for humans, are left out [5].

4.3 Evaluation Measures and Baseline

In a binary classification result there are four possible outcomes, namely true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). Based on the manual annotations, several indicators are commonly used to measure the performance of a classifier, namely precision, recall and the F1 score [12].

– Precision (positive predictive value): the proportion of automatically assigned positive labels that agree with the manual classification.

  \[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} \]

– Recall (sensitivity): the proportion of all positive cases that are recovered, i.e. a measure of the ability of the classifier to identify positive samples.

  \[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]

In this report, the evaluation measures are the same for both subtask A and subtask B. To provide a metric that is independent of class size, the classification results are mainly evaluated by the macro-averaged F1 score and the weighted F1 score [10], which are based on the metrics above.

– Macro-averaged F1 score
  Precision and recall are first calculated for each of the n categories and then averaged arithmetically over all categories:

  \[ \mathrm{MacroP} = \frac{\sum_{i=1}^{n} \mathrm{Precision}_i}{n} \qquad \mathrm{MacroR} = \frac{\sum_{i=1}^{n} \mathrm{Recall}_i}{n} \tag{3} \]

  \[ \mathrm{MacroF} = \frac{2 \times \mathrm{MacroP} \times \mathrm{MacroR}}{\mathrm{MacroP} + \mathrm{MacroR}} \tag{4} \]

– Weighted F1 score
  The scores are first calculated for each label and then averaged with support weighting, i.e. by the actual number of instances per label:

  \[ \mathrm{WeightedP} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)} \qquad \mathrm{WeightedR} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)} \tag{5} \]

  \[ \mathrm{WeightedF} = \frac{2 \times \mathrm{WeightedP} \times \mathrm{WeightedR}}{\mathrm{WeightedP} + \mathrm{WeightedR}} \tag{6} \]
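In practice both averaged scores can be computed directly with scikit-learn, as in the short sketch below; the toy label lists are for illustration only.

```python
# A minimal sketch of the two evaluation measures, assuming gold and predicted
# labels are held in two lists. scikit-learn covers both averaging modes.
from sklearn.metrics import f1_score

y_true = ["HOF", "NOT", "HOF", "NOT", "NOT"]   # toy labels for illustration
y_pred = ["HOF", "NOT", "NOT", "NOT", "HOF"]

# Macro average: compute F1 per class, then take the unweighted mean,
# so every class counts equally regardless of its size.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted average: per-class F1 scores are averaged using the support
# (number of true instances) of each class as the weight.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
```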
5 Participant Systems and Results

The hate speech detection system is implemented in four stages, namely text preprocessing, feature extraction, classifier building, and classification. The classification results of the different models are then analysed.

5.1 Experiments

Text preprocessing. The raw text usually contains a lot of information that is meaningless for the task and may affect the results at later stages, such as punctuation, common words, links, and numbers. In this step, regular expressions are used to eliminate noise, including non-alphabetic characters and numbers. We also remove stopwords, which are likely to be of little value for hate speech detection; the stopword list from the NLTK corpus is used to delete such words from the texts. Besides, posts on social media commonly include content that carries little semantic weight, such as user mentions, hashtags and URL links, so these are replaced by the placeholder tokens USER, TOPIC and URL respectively.

Feature extraction. Before training a model, the text must be converted into feature vectors, because the preprocessed text cannot be consumed directly by the model. In this step we consider two common feature types, n-grams and word embeddings, which can be compared through the final results produced by the classifiers (a sketch of both representations is given at the end of this section).

– N-gram features
  We focus mainly on unigram features and adopt the bag-of-words (BOW) model for the n-gram representation. It is fairly straightforward: each element records how often a term appears in a sentence. Since low-frequency words tend to carry more information, we use the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to convert raw frequencies into word weights, which is a robust and accurate weighting scheme. Feature processing is done with CountVectorizer() and TfidfTransformer() from the scikit-learn package.

– Word embedding features
  Word embedding vectors are low-dimensional and dense, with a high information density. We use the Word2Vec model, in which the similarity between words is directly reflected in the geometry of the word vectors. To create the Word2Vec features, we use the pre-trained Google Word2Vec model, which encodes reliable information about word similarity. The feature vector of a post is then obtained by averaging the Word2Vec embeddings of its words.

Classifier building. The whole experimental pipeline is implemented with NLTK, scikit-learn and Keras. The same two kinds of features serve as input to the different classifiers for subtask A and subtask B (a sketch of the classifiers is also given at the end of this section).

– Subtask A
  For both the n-gram features and the word embedding features, the traditional machine learning algorithm Logistic Regression (LR) and the deep learning sequence model Long Short-Term Memory (LSTM) are implemented as binary classifiers.

– Subtask B
  One-vs-all classifiers are built with LR and LSTM for both of the extracted feature types.

Classification. There are four different classification experiments. Three of them are run on the HASOC test data initially released by the organizers: the Logistic Regression classifier takes TF-IDF and Word2Vec features respectively, and the LSTM model uses the Word2Vec features. The fourth experiment consists of submitting our LSTM model with Word2Vec features to the HASOC organizers, whose final score is computed on a test dataset held privately by the organizers; the column "HASOC test" in the result tables shows the final F1 scores reported by the organizers.
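The sketch below illustrates the preprocessing and the two feature representations described above. The regular expressions, the gensim loading call and the local path of the GoogleNews vectors are assumptions for illustration; the exact patterns and paths used in our runs may differ.

```python
# A sketch of the preprocessing step and the two feature representations.
import re
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from gensim.models import KeyedVectors

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    """Replace mentions/hashtags/links with placeholders, drop noise and stopwords."""
    text = re.sub(r"@\w+", "USER", text)
    text = re.sub(r"#\w+", "TOPIC", text)
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"[^A-Za-z ]+", " ", text)          # strip non-alphabetic noise
    return [t for t in text.lower().split() if t not in STOPWORDS]

# --- TF-IDF features (bag-of-words reweighted by inverse document frequency) --
def tfidf_features(train_texts, test_texts):
    vectorizer = CountVectorizer(tokenizer=preprocess)
    counts_train = vectorizer.fit_transform(train_texts)
    counts_test = vectorizer.transform(test_texts)
    tfidf = TfidfTransformer()
    return tfidf.fit_transform(counts_train), tfidf.transform(counts_test)

# --- Averaged Word2Vec features (mean of the pre-trained word vectors) --------
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)   # assumed local path

def w2v_features(texts, dim=300):
    vectors = []
    for text in texts:
        words = [w for w in preprocess(text) if w in w2v]
        vectors.append(np.mean([w2v[w] for w in words], axis=0)
                       if words else np.zeros(dim))
    return np.vstack(vectors)
```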
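For the classifiers, the LR model can take either of the feature matrices above. The exact LSTM configuration is not fully specified in this paper, so the sketch below follows one common setup: padded token-index sequences are fed through an Embedding layer initialised with the pre-trained Word2Vec vectors. Layer sizes, epoch counts and the use of tensorflow.keras are illustrative assumptions rather than the precise configuration of our submitted runs. For subtask B, the same models can be wrapped in a one-vs-all scheme (e.g. sklearn.multiclass.OneVsRestClassifier for LR).

```python
# A sketch of the two classifiers for subtask A (binary HOF vs NOT).
import numpy as np
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# --- Logistic Regression on TF-IDF or averaged Word2Vec features -------------
def train_lr(X_train, y_train):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)            # y_train: 0 = NOT, 1 = HOF
    return clf

# --- LSTM over token sequences with pre-trained Word2Vec embeddings ----------
def train_lstm(train_texts, y_train, w2v, max_len=50, dim=300):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(train_texts)
    seqs = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=max_len)

    vocab_size = len(tokenizer.word_index) + 1
    emb_matrix = np.zeros((vocab_size, dim))
    for word, idx in tokenizer.word_index.items():
        if word in w2v:                   # copy pre-trained vectors where available
            emb_matrix[idx] = w2v[word]

    model = Sequential([
        Embedding(vocab_size, dim, weights=[emb_matrix],
                  input_length=max_len, trainable=False),
        LSTM(64),                         # illustrative hidden size
        Dense(1, activation="sigmoid"),   # binary HOF vs NOT output
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.fit(seqs, np.asarray(y_train), epochs=3, batch_size=32, verbose=1)
    return model, tokenizer
```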
5.2 Result Analysis

The dimensionality of the one-hot (bag-of-words) representation is rather high, which easily leads to a poorly trained model, so the LR classifier with TF-IDF features is taken as the baseline for comparison.

– Subtask A
  Our submission is ranked in 9th position, which is a reasonably good score. There is no large difference between the two F1 scores, and the LSTM classifier with word embedding features performs best.

Table 2. The results of Subtask A

               TF-IDF + LR  Word2Vec + LR  Word2Vec + LSTM  HASOC test
  macro F1     0.7991       0.7793         0.8104           0.7431
  weighted F1  0.8738       0.8435         0.8661           0.8163

– Subtask B
  The HASOC submission is ranked 32nd, which is not a very good result. The weighted F1 values indicate good performance, but they differ substantially from the macro-averaged F1 values.

Table 3. The results of Subtask B

               TF-IDF + LR  Word2Vec + LR  Word2Vec + LSTM  HASOC test
  macro F1     0.3029       0.2598         0.3083           0.2740
  weighted F1  0.6738       0.6032         0.6955           0.6807

It can be seen that the LR classifier using the pre-trained word embedding model does not work much better than the classifier using TF-IDF features, which may be because the Google pre-trained word embedding model was trained on news text rather than Twitter. For the same reason, the performance of the LSTM model is not particularly good in the final result.

6 Conclusion

The spread of hate speech on social media has increased significantly in recent years and could have a serious effect on society. Our work therefore makes several contributions towards this problem. First, we try several methods for classifying hate speech, using both a traditional machine learning model (LR) and a deep learning model (LSTM), to empirically improve classification accuracy. Second, we create a new hate speech dataset by combining an external dataset with the one originally released by the HASOC organizers. Third, a pre-trained model is used for word embedding feature extraction to improve the accuracy of hate speech classification. Our results show good performance on both F1 scores in Subtask A and on the weighted F1 score in Subtask B, while Subtask B needs further fine-grained experiments on the specific classes.

References

1. Agarwal, S., Sureka, A. (2017). Characterizing linguistic attributes for automatic classification of intent based racist/radicalized posts on Tumblr micro-blogging website. arXiv preprint arXiv:1701.04931.
2. Badjatiya, P., Gupta, S., Gupta, M., Varma, V. (2017, April). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion (pp. 759-760). International World Wide Web Conferences Steering Committee.
3. Davidson, T., Warmsley, D., Macy, M., Weber, I. (2017, May). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.
4. Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., Bhamidipati, N. (2015, May). Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web (pp. 29-30). ACM.
5. Modha, S., Mandl, T., Majumder, P., Patel, D. (2019, December). Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages. In Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation.
6. Mehdad, Y., Tetreault, J. (2016, September). Do characters abuse more than words?. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (pp. 299-303).
7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
8. Mossie, Z., Wang, J. H. (2018). Social network hate speech detection for Amharic language. Computer Science Information Technology, 41.
9. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y. (2016, April). Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web (pp. 145-153). International World Wide Web Conferences Steering Committee.
10. Ozgur, A., Ozgur, L., Gungor, T. (2005, October). Text categorization with class-based and corpus-based keyword selection. In International Symposium on Computer and Information Sciences (pp. 606-615). Springer, Berlin, Heidelberg.
11. Schmidt, A., Wiegand, M. (2017, April). A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media (pp. 1-10).
12. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), pp. 1-47.
13. Waseem, Z., Hovy, D. (2016, June). Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop (pp. 88-93).
14. Wei, X., Lin, H., Yang, L., Yu, Y. (2017). A convolution-LSTM-based deep neural network for cross-domain MOOC forum post classification. Information, 8(3), 92.
15. Wikipedia page about hate speech, https://en.wikipedia.org/wiki/Hate_speech. Last accessed 12 Oct 2019.
16. Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., Kumar, R. (2019). Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666.
17. Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., Kumar, R. (2019). SemEval-2019 Task 6: Identifying and categorizing offensive language in social media (OffensEval). arXiv preprint arXiv:1903.08983.