<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Amrita CEN at HASOC 2019: Hate Speech Detection in Roman and Devanagiri Scripted Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sreelakshmi.K</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Premjith.B</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman K.P</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering &amp; Networking (CEN) Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays the usage of social media sites like Facebook and Twitter has increased rapidly, which has led to a huge flood of data on these sites. Though these social media sites give people free opportunities to express and share their thoughts, they also end up spreading a huge amount of hate content. In this paper we present a domain-specific word embedding model for the classification of English tweets into Non Hate-Offensive and Hate-Offensive, and a fastText model for Hindi text classification. The classification is done using the dataset obtained from the HASOC 2019 shared task. A deep learning algorithm is used as the classifier.</p>
      </abstract>
      <kwd-group>
        <kwd>fastText</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Hate speech</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Hate speech is a form of expressing aggression or profanity in a verbal or non-verbal
way. It can take the form of discriminating against or using filthy language about a person or
group merely on grounds of their age, gender, sex, caste, economic status, etc. This
can even lead to serious violence or conflict between individuals or communities.
So it is very important to detect such content before it reaches a huge mass [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In a country like India people tend to use a regional language for texting or
tweeting. Around half of the population speaks Hindi, so the need to find hate
speech in Hindi is very high. It can corrupt not only humans but even chatbots: since
chatbots learn from conversations with humans, a chatbot that is not able to differentiate hate
from non-hate content also starts to use it. So it has become a huge
responsibility for the government, as well as for Twitter and Facebook, to detect this
hate speech content.</p>
      <p>
        To this end, in this paper we developed two separate models to classify tweets
in Hindi and English as hate or not. The English data is in Roman script and
the Hindi data in Devanagari script. The dataset is from the HASOC 2019 shared
task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Two samples of the English data are given below.
HATE:
"I love this bill, I think they should start printing them FuckTrump https://t.co/NY9CuyivGl"
Non-HATE:
"All Indian spectators shd hv BalidanBadge in ground, DhoniKeepsTheGlove
DhoniKeepBalidaanBadgeGlove DhoniKeepsTheGlove DhoniKeSathDesh"
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>A considerable amount of work has been done in the area of hate speech detection; a few
representative studies are given below.</p>
      <p>
        Shervin et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed a model using character n-grams,
word n-grams and word skip-grams for the classification of English tweets into
hate speech (HATE), offensive and non-offensive content. The system used an SVM
as the classifier and achieved an accuracy of 78%.
      </p>
      <p>
        Georgios et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] present a model to detect hateful content in
social media. They made use of Recurrent Neural Network (RNN) classifiers
and fed in various features associated with user-related information, such as the
users' tendency towards racism or sexism. They made use of a publicly available
corpus of 16,000 tweets.
      </p>
      <p>
        Satyajith et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] collected around 250,000 tweets using the Twitter API,
trained a word2vec model and obtained domain-specific word embeddings.
Using these embeddings they extracted features for 4,500 Hindi-English
code-mixed samples and classified them as hate or non-hate. They used CNN, LSTM and
BiLSTM as classifiers.
    </sec>
    <sec id="sec-3">
      <title>Proposed Methodology</title>
      <p>The steps of our proposed methodology are as follows:
- Pre-processing: The data consists of usernames, hashtags, URLs and
unwanted characters. The first step was to remove these usernames, hashtags,
URLs, unwanted characters and punctuation. Then the whole text was converted
to lower case.
- Retraining the model: Once the text data was cleaned we tokenized the
data and segmented it to the level of words. Each tokenized sentence was
given to a bilingual model which was already trained on 250K code-mixed
sentences. We retrained that model using gensim's word2vec with our data
and generated word embeddings as feature vectors from the retrained model.
- Feature Extraction: For the Hindi corpus fastText features were extracted.</p>
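The pre-processing and tokenization steps above can be sketched as follows. This is a minimal illustration in pure Python; the exact regular expressions and the helper names are assumptions, not the authors' code:

```python
import re

def clean_tweet(text: str) -> str:
    """Remove URLs, usernames, hashtags, punctuation and unwanted
    characters, then convert the text to lower case."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # usernames
    text = re.sub(r"#\w+", " ", text)          # hashtags
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation / unwanted characters
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text: str) -> list:
    """Segment a cleaned sentence to the level of words."""
    return clean_tweet(text).split()
```

The cleaned, tokenized sentences would then be passed to the bilingual word2vec model for retraining.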
      <p>fastText provides a pre-trained model for Hindi. Each sentence was tokenized,
the word vector of each word was taken from the fastText model, and the
average over the words of a sentence was computed. The vector size for fastText
was specified as 300. For the English data the vector representation of each word
was taken from the bilingual word embedding and the average over the words of
a sentence was computed. For this, word2vec was used and the vector size was
specified to be 300.
- Classification: A deep learning model consisting of CNN and LSTM
layers was used for classification. The extracted feature matrix was fed to
an embedding layer, then to the CNN and then to the LSTM. The flow diagram is given
in Fig. 2.</p>
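The feature extraction described above, averaging the word vectors of a sentence into a single sentence vector, can be sketched as follows. The vectors are assumed to have already been loaded from the fastText or word2vec model into a plain dictionary; the function name and the tiny dimension used in the example are illustrative (the paper uses 300):

```python
def sentence_vector(tokens, word_vectors, dim=300):
    """Average the word vectors of a tokenized sentence.
    Out-of-vocabulary tokens are skipped; if no token is in the
    vocabulary a zero vector is returned."""
    acc = [0.0] * dim
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary words
        acc = [a + v for a, v in zip(acc, vec)]
        count += 1
    if count == 0:
        return acc
    return [a / count for a in acc]
```

The resulting fixed-length sentence vectors form the feature matrix that is fed to the embedding, CNN and LSTM layers of the classifier.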
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In many applications like chatbot building, content recommendation and
sentiment analysis, the need for hate speech detection is high. Especially for a country
like India, with its diverse cultures and languages, the usage of Hindi on Twitter is
high. This paper therefore presented a deep learning model which makes use of two
different feature sets to classify tweets in English and Hindi as hate and non-hate.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval),</article-title>
          <source>in Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <volume>75</volume>
          -
          <fpage>86</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          ,
          <article-title>Overview of the GermEval 2018 shared task on the identification of offensive language</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Benchmarking aggression identification in social media,</article-title>
          <source>in Proceedings of TRAC</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Predicting the Type and Target of Offensive Posts in Social Media,</article-title>
          <source>in Proceedings of NAACL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages,</article-title>
          <source>in Proceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Pitsilis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ramampiaro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Langseth</surname>
          </string-name>
          ,
          <article-title>Detecting offensive language in tweets using deep learning,</article-title>
          arXiv preprint arXiv:1801.04433,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Detecting hate speech in social media,</article-title>
          <source>arXiv preprint arXiv:1712.06427</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamble</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>Hate speech detection from code-mixed Hindi-English tweets using deep learning models,</article-title>
          arXiv preprint arXiv:1811.05145,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>