<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Astralis@HASOC 2020: Analysis on Identification of Hate Speech in Indo-European Languages with Fine-Tuned Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hiren Madhu</string-name>
          <email>hirenmadhu16@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shrey Satapara</string-name>
          <email>shreysatapara@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harsh Rathod</string-name>
          <email>harshrathod6874@gmail.com</email>
        </contrib>
        <aff>LDRP-ITR, Gandhinagar, India</aff>
      </contrib-group>
      <abstract>
        <p>The detection of hate speech on online social media platforms is an important text classification problem, and there is a need for research on languages other than English. In this paper, we describe our team Astralis' combined effort in the shared task HASOC. We analyzed various models such as Naive Bayes, SVM, ANN, and CNN, and embeddings such as TF-IDF, Multilingual BERT, and OpenAI GPT-2. Our relative performance was better in Subtask B for all languages, with our best-performing system ranked second in the German Subtask B.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The HASOC 2020 track [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] offers two sub-tasks on hate speech and offensive content identification in English, German, and Hindi.
Sub-task A is a coarse-grained binary classification in which posts are classified into two categories.</p>
      <p>● (NOT) Non Hate-Offensive: This division consists of posts that contain neither hate speech nor offensive content.</p>
      <p>● (HOF) Hate and Offensive: This division consists of hate and offensive content.</p>
      <p>Sub-task B is a fine-grained classification offered for English, German, and Hindi. Hate speech and
offensive posts from Sub-task A are further classified into three categories.</p>
      <p>● (HATE) Hate speech: This class contains hate speech content.</p>
      <p>● (OFFN) Offensive: Posts under this class contain offensive content.</p>
      <p>● (PRFN) Profane: This subcategory contains profane content.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Hate speech detection is a vast field of research and attracts many researchers. Here we briefly describe some of
the work done in this area.</p>
      <p>
        GermEval 2018 presented a shared task on identifying offensive language. Performance was measured by
F1 score, precision, and recall [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Twenty teams participated in the shared task. The best results were achieved by using five disjoint
sets to train three different classifiers and then combining them into a meta-level classifier [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
SemEval 2019 focused on studying the type and target of offensive language. They presented a
shared task called OffensEval [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A schema was defined to take the class and target into account. The
dataset used was the Offensive Language Identification Dataset (OLID). Three sub-tasks were given according to this
annotation schema, which the participating teams had to use: Sub-task A was offensive language
identification, Sub-task B was automatic categorization of offence types, and Sub-task C was offence target
identification.
      </p>
      <p>
        Work on text classification using CNNs was done by Yoon Kim [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The CNN models discussed therein
improve upon the state of the art on 4 out of 7 tasks, including sentiment analysis and question
classification. The word vectors used were trained by Mikolov et al. (2013) on 100 billion words of Google
News and are publicly available [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Nobata et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] came up with a model that uses regression techniques to detect hate and offensive
content in online user text. Djuric et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] presented a model that used logistic regression classifiers to identify hate
content. Besides conventional techniques, this research also used comment embeddings as one of
its features.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>Below we briefly describe the dataset used for HASOC 2020. Given below is the class-wise
distribution of the dataset provided to us during this task.</p>
      <table-wrap id="tab-1">
        <table>
          <thead>
            <tr>
              <th>Class</th>
              <th>English</th>
              <th>Hindi</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>HOF</td>
              <td>1856</td>
              <td>847</td>
            </tr>
            <tr>
              <td>NOT</td>
              <td>1852</td>
              <td>2116</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>Here we describe the various methodologies we used in different steps of the experiment.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Preprocessing</title>
      <p>
        For preprocessing, we followed a few simple traditional steps. First, all of the Twitter handles
were removed. After that, the links in the tweets were removed. All the retweets in the data had
“RT” at the start, so we removed that as well. Then we removed all the residual blanks and kept the emojis,
since the transformer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] models we used had emoji support.
      </p>
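      <p>As an illustration, a minimal sketch of these preprocessing steps in Python is given below. This is a hypothetical reconstruction, not our exact code; the regular expressions are assumptions matching the steps described above.</p>
      <preformat>
# Hypothetical sketch of the preprocessing steps described above.
import re

def preprocess(tweet):
    tweet = re.sub(r"@\w+", "", tweet)          # remove Twitter handles
    tweet = re.sub(r"https?://\S+", "", tweet)  # remove links
    tweet = re.sub(r"^\s*RT\b", "", tweet)      # drop the leading retweet marker
    tweet = re.sub(r"\s+", " ", tweet).strip()  # remove residual blanks
    return tweet                                # emojis are left intact

print(preprocess("RT @user Check this out https://t.co/xyz 😊"))
# -> "Check this out 😊"
      </preformat>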
    </sec>
    <sec id="sec-6">
      <title>4.2. Embeddings</title>
      <p>Here we describe the methodology that we used to analyze the two subtasks that were given
to us. We used various models to obtain the best possible results.</p>
      <p>4.2.1. TF-IDF</p>
      <p>
        TF-IDF is short for term frequency–inverse document frequency. Unigrams and bigrams were
used to create this vectorizer, with a minimum of 5 occurrences, for English. The length of the
derived vectors was 1281.
      </p>
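      <p>A sketch of this vectorizer, assuming scikit-learn, is given below; the ngram_range and min_df settings follow the text, while the input variable english_tweets is hypothetical.</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams, with a minimum of 5 occurrences per term.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X = vectorizer.fit_transform(english_tweets)  # list of preprocessed tweets
# On this dataset the resulting vocabulary gave vectors of length 1281.
      </preformat>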
      <p>4.2.2. BERT[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
      </p>
      <p>
        BERT stands for Bidirectional Encoder Representations from Transformers. The Hugging
Face [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] transformers library has made many transformer models available for use. From BERT,
we used two models: bert-base-uncased for English and bert-base-multilingual-uncased for English,
Hindi, and German. Matrices of shape 768 * MAX_LEN were received from this. MAX_LEN is a
parameter that denotes the maximum length of the tokenized tweets. Post-padding with ‘0’ was used.
      </p>
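      <p>A minimal sketch of obtaining these matrices with the Hugging Face transformers library is given below. The value of MAX_LEN and the variable tweet are assumptions; the post-padding behaviour follows the text.</p>
      <preformat>
import torch
from transformers import BertModel, BertTokenizer

MAX_LEN = 64  # assumed value; the text only names the parameter
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize one tweet, post-padding with 0 up to MAX_LEN.
enc = tokenizer(tweet, max_length=MAX_LEN, padding="max_length",
                truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
embeddings = out.last_hidden_state[0]  # shape: (MAX_LEN, 768)
      </preformat>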
      <p>4.2.3. OPENAI GPT-2[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
      </p>
      <p>Matrices shaped similarly to those from BERT were received from this transformer model for Hindi,
English, and German.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Models</title>
      <p>
        We have used various machine learning algorithms and deep neural networks, and here we
describe them in detail. We used TensorFlow[18] with Keras[19] to build all the neural networks.
      </p>
      <p>4.3.1. SVM[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
      </p>
      <p>SVM performs exceptionally well in specific NLP scenarios. To implement it, we used the
scikit-learn[14] library. We used SVC with an RBF kernel. To use it with the matrices that we
got from the transformers, we applied the Continuous Bag Of Words (CBOW) method, in which the mean of the
embeddings of all words in a tweet is taken. So the input to the SVM was a vector of length 768.</p>
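      <p>A sketch of this CBOW-style pooling plus the SVM, assuming scikit-learn and NumPy, is given below; tweet_matrices (one MAX_LEN x 768 embedding matrix per tweet) and labels are hypothetical inputs.</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC

# Mean of the word embeddings of each tweet -> one vector of length 768.
X = np.stack([m.mean(axis=0) for m in tweet_matrices])

clf = SVC(kernel="rbf")  # SVC with RBF kernel
clf.fit(X, labels)
      </preformat>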
      <p>4.3.2. KNN</p>
      <p>This was also implemented using the scikit-learn library, with the number of neighbours set to 3. Here again, the
CBOW method was used.</p>
      <p>4.3.3. Naive Bayes[15]</p>
      <p>Naive Bayes also works well for NLP tasks. Here, the CBOW method was used to get the
embeddings.</p>
      <p>4.3.4. ANN</p>
      <p>The input to this ANN was the CBOW embeddings, and the TF-IDF vectors for English. It was a five-layer
network with 128, 128, 256, and 512 neurons in the hidden layers, and the last layer had 1 or 4 neurons depending upon the
subtask.</p>
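      <p>A sketch of this network in tf.keras is given below; the layer widths and output sizes follow the text, while the activations and input dimension are assumptions.</p>
      <preformat>
import tensorflow as tf

def build_ann(input_dim, n_out):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        # 1 neuron (Subtask A) or 4 neurons (Subtask B).
        tf.keras.layers.Dense(n_out,
                              activation="sigmoid" if n_out == 1 else "softmax"),
    ])

model = build_ann(input_dim=768, n_out=1)  # e.g. CBOW embeddings, Subtask A
      </preformat>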
      <p>4.3.5. CNN[16]</p>
      <p>Here we have used an approach similar to the work of Yoon Kim. The architecture of the
CNN has layers in the following order.</p>
      <p>1. Input: For initializing the input tensors.
2. BatchNormalization[20]: To normalize each batch that is being processed.
3. Dropout[17]: Dropout helps to reduce the possibility of overfitting.
4. After this, the network is divided into various branches. Each branch computes x-gram features.</p>
      <p>a. Conv1D: The kernel_size is set to x to compute x-grams.
b. GlobalMaxPool1D: To get a vector whose length is the number of filters.
c. BatchNormalization.
5. The above three layers produce one x-gram branch. After this, all of the x-gram branches are
merged into one tensor.
6. The merged tensor is then passed into a Dense layer of 128 neurons.
7. Finally, there is a dense classifier layer with 1 or 4 neurons based on the
subtask.</p>
      <p>The architecture described above can be seen in Figure 1.</p>
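      <p>A minimal sketch of this multi-branch architecture in tf.keras follows; the layer ordering matches the list above, while the filter count, dropout rate, and activations are assumptions.</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(max_len, emb_dim, n_out, ngram_sizes=(2, 3)):
    inp = layers.Input(shape=(max_len, emb_dim))          # 1. Input
    x = layers.BatchNormalization()(inp)                  # 2. BatchNormalization
    x = layers.Dropout(0.3)(x)                            # 3. Dropout (rate assumed)
    branches = []
    for n in ngram_sizes:                                 # 4. one branch per x-gram
        b = layers.Conv1D(128, kernel_size=n,             # 4a. kernel_size = x
                          activation="relu")(x)
        b = layers.GlobalMaxPool1D()(b)                   # 4b. vector of length = filters
        b = layers.BatchNormalization()(b)                # 4c. BatchNormalization
        branches.append(b)
    merged = layers.Concatenate()(branches)               # 5. merge the x-gram branches
    dense = layers.Dense(128, activation="relu")(merged)  # 6. Dense, 128 neurons
    out = layers.Dense(n_out,                             # 7. classifier: 1 or 4 neurons
                       activation="sigmoid" if n_out == 1 else "softmax")(dense)
    return tf.keras.Model(inp, out)
      </preformat>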
      <p>First, we experimented with a CNN using Bigram, Trigram, and Four-gram branches, and then with Bigram
and Trigram branches only. In most cases, the Bigram + Trigram model worked better than the Bigram + Trigram
+ Four-gram model. So, for further analysis, we considered only the Bigram + Trigram model.</p>
      <p>For training the above models, the Adam optimizer[21] was used. We also used mild kernel
regularization[22] in the English Subtask B. We also used class weights[23] in Subtask B for all
languages because there was an imbalance of classes in the dataset.</p>
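      <p>A sketch of this training setup, assuming tf.keras and scikit-learn, is given below; the optimizer and class weights follow the text, while the loss, epoch count, and the variables X_train and y_train are assumptions.</p>
      <preformat>
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency to counter the imbalance
# (assumes integer labels 0..K-1).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_train), y=y_train)
class_weight = dict(enumerate(weights))

model.compile(optimizer="adam",  # Adam optimizer
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, class_weight=class_weight)
      </preformat>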
    </sec>
    <sec id="sec-8">
      <title>5. Analysis</title>
      <p>Experimental results on the available test set show that the CNN (Bigram + Trigram) outperforms all
other models. The CNN can exceed most of the baseline models precisely because of the nature of tweets:
tweets can be indirect (e.g., sarcasm), full of noise, and may not follow proper
grammatical structure.</p>
      <p>
        A CNN can identify many small and large patterns in a tweet; if some are impacted by noise [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], it
can still use other patterns to determine the class. This can be seen in Tables 4 and 5, which display
the embedding vs. model macro F1 scores, the metric used for scoring in HASOC 2020. The
TF-IDF vectors for English Subtask A work well enough, but the Subtask B performance is lower,
which is caused by the imbalance in the dataset's distribution. The bert-base-uncased Bigram +
Trigram model gives the best performance in English. Models for which CBOW pooling was used do not
perform on par with the CNN model for either subtask. The bert-base-multilingual-cased and the
gpt2 transformer embeddings give a reasonably similar performance in some situations, and for
some, BERT performs a little better. bert-base-uncased performs better than
bert-base-multilingual-cased in both of the English subtasks.
      </p>
      <p>From the above work, we concluded to submit the CNN Bigram + Trigram model with the
bert-base-uncased transformer for the English subtasks and the bert-base-multilingual-cased transformer for
the Hindi and German subtasks. The label-wise F1 scores for the submitted models are shown in Table 6.
The OFFN and HATE categories from Subtask B have relatively lower F1 scores due to their relatively lower
occurrence in the dataset. The final results on the private dataset, on which the final ranks were given,
are displayed in Table 7.</p>
    </sec>
    <sec id="sec-9">
      <title>6. Conclusion</title>
      <p>This paper describes offensive text identification in three Indo-European languages. We have shown
our methodology for classifying tweets and posts from social media using multiple models in the given
three languages, categorizing hate and offensive speech. After analyzing different models, we
observed that the bert-base-uncased Bigram + Trigram model gives the best performance in English,
and the bert-base-multilingual-cased transformer for the Hindi and German subtasks. The results indicate that
categorizing profane and hate content is a strenuous task. In future work, we hope that better models
and methods can be used to improve hate speech identification.</p>
    </sec>
    <sec id="sec-10">
      <title>7. References</title>
      <p>2011.
[15]
[18]
[16]
[17]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Mandl</surname>
          </string-name>
          , Thomas and Modha, Sandip and Shahi, Gautam Kishore and Jaiswal, Amit Kumar and Nandini, Durgesh and Patel, Daksh and Majumder, Prasenjit and Schäfer, Johannes.
          <year>2020</year>
          .
          <article-title>Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          .
          <source>In Proceedings of the 12th annual meeting of the Forum for Information Retrieval Evaluation. Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wiegand</surname>
          </string-name>
          , Melanie Siegel, and
          <string-name>
            <given-names>Josef</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language</article-title>
          .
          <source>In Proceedings of GermEval.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>Joaquın</given-names>
          </string-name>
          <string-name>
            <surname>Padilla</surname>
          </string-name>
          .
          <year>2018</year>
          . Tuwienkbs at germeval 2018:
          <article-title>German abusive tweet detection</article-title>
          .
          <source>In14thConference on Natural Language Processing KONVENS</source>
          <year>2018</year>
          , page
          <issue>45</issue>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and
          <string-name>
            <given-names>Ritesh</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <year>2019b</year>
          . SemEval
          <article-title>-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)</article-title>
          .
          <source>In Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>CoRR abs/1408</source>
          .5882 (
          <year>2014</year>
          ), http://arxiv.org/abs/1408.5882
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>Tomas</given-names>
          </string-name>
          &amp; Chen, Kai &amp; Corrado,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>s &amp; Dean</article-title>
          ,
          <string-name>
            <surname>Jeffrey.</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>Proceedings of Workshop at ICLR</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Nobata</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tetreault</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehdad</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Abusive Language Detection in Online User Content</article-title>
          .
          <source>In: WWW 2016</source>
          . pp.
          <fpage>145</fpage>
          -
          <lpage>153</lpage>
          . Montreal (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Djuric</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morris</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grbovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radosavljevic</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhamidipati</surname>
          </string-name>
          , N.:
          <article-title>Hate Speech Detection with Comment Embeddings</article-title>
          .
          <source>In: WWW 2015</source>
          . pp.
          <fpage>29</fpage>
          -
          <lpage>30</lpage>
          . Florence,
          <string-name>
            <surname>Italy</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , I. Polosukhin, “
          <article-title>Attention Is All You Need”</article-title>
          .
          <source>arXiv:1706</source>
          .03762 https://arxiv.org/pdf/1706.03762.pdf
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Association for Computational Linguistics, Minneapolis,
          <source>Minnesota (Jun</source>
          <year>2019</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N19</fpage>
          -1423
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Wolf</surname>
            , Thomas &amp; Debut, Lysandre &amp; Sanh, Victor &amp; Chaumond, Julien &amp; Delangue, Clement &amp; Moi, Anthony &amp; Cistac, Pierric &amp; Rault, Tim &amp; Louf, Rémi &amp; Funtowicz, Morgan &amp; Brew,
            <given-names>Jamie.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Transformers: State-of-the-art Natural Language Processing</article-title>
          . arXiv:
          <year>1910</year>
          .
          <article-title>03771v5 [cs</article-title>
          .CL] https://arxiv.org/pdf/
          <year>1910</year>
          .03771.pdf
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Radford</surname>
          </string-name>
          , Alec, and
          <string-name>
            <surname>Wu</surname>
          </string-name>
          , Jeff, and
          <string-name>
            <surname>Child</surname>
            , Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya, Language Models are
            <given-names>Unsupervised</given-names>
          </string-name>
          <string-name>
            <surname>Multi-Task</surname>
            <given-names>Learners</given-names>
          </string-name>
          , (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Marti</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hearst</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Support Vector Machines</article-title>
          .
          <source>IEEE Intelligent Systems 13, 4 (July</source>
          <year>1998</year>
          ),
          <fpage>18</fpage>
          -
          <lpage>28</lpage>
          . DOI:https://doi.org/10.1109/5254.708428
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>