
model for identification of offensive content in south Indian languages

Shankar Biradar

shankar@iiitdwd.ac.in

Sunil Saumya

sunil.saumya@iiitdwd.ac.in

Arun Chauhan

aruntakhur@gmail.com

Indian Institute of Information Technology Dharwad

2021

In recent years, offensive content has received considerable attention. The amount of offensive content generated on social media is increasing at an alarming rate, creating a greater need than ever to address the issue. To this end, the organizers of “Dravidian-CodeMixed HASOC-2021” created two challenges. Task 1 involves identifying offensive content in Malayalam data, whereas Task 2 covers Malayalam and Tamil code-mixed sentences. Our team participated in Task 2. In our proposed model, we used multilingual BERT (mBERT) to extract features and applied two different classifiers, Support Vector Machine (SVM) and Deep Neural Network (DNN), to the extracted features. In addition, we used the provided data to evaluate the performance of a monolingual BERT classifier. Our best-performing model, monolingual BERT, received a weighted F1 score of 0.70 for Malayalam data, ranking fifth; it also received a weighted F1 score of 0.573 for Tamil code-mixed data, ranking twelfth.

1. Introduction

The availability of smartphones and the internet has created a lot of interest in social media among today’s youth. These applications give users a huge platform to connect with the outside world and share their ideas and opinions with others. With these benefits comes a disadvantage: many people misuse the platform, in the name of freedom of expression, to publish inflammatory content on social media. This inflammatory content typically targets a single person, a group of people, a particular faith, or a community [1]. People generate objectionable content and aggressively propagate it on social media. This type of material is produced for a variety of reasons, including commercial and political gain [2]. Such content can disrupt social harmony and cause riots in society. It can also have a detrimental psychological influence on readers, harming their emotions and conduct. Identifying this type of content is therefore critical; as a result, researchers, policymakers, and investors (stakeholders) are attempting to develop a dependable technique to identify offensive content on social media.

Various studies on hate speech, harmful content, and abusive language identification in social media have been conducted during the previous decade. The majority of these studies focused on monolingual English content, and a large number of English-language corpora have been created [3]. However, people in countries with a complex culture and history, such as India, frequently use regional languages to generate inappropriate social media posts. Users typically mix their regional languages with English while creating such content. This type of text is known as code-mixed text. Hence, we require an efficient method to classify offensive content in code-mixed Indian languages. In this context, the organizers of the “Dravidian-CodeMixed HASOC-2021” shared task set up two tasks for detecting hate speech in code-mixed data in Dravidian languages such as Malayalam and Tamil. Our team took part in Task 2, and this paper presents the working notes for our proposed model.

The rest of the article is arranged as follows: Section 2 provides a brief summary of previous work, Section 3 describes the task and data set, and Section 4 describes the proposed model in full. Section 5 reports the experimental results, and Section 6 concludes the paper.

2. Literature review

Many researchers and practitioners from industry and academia have been attracted to the subject of automatic identification of hostile and harmful speech. The authors of [4] provided a high-level review of the current state-of-the-art techniques in offensive language identification and related issues, such as hate speech recognition. The authors of [5] developed a publicly accessible dataset for identifying offensive language in tweets by categorizing them as hate speech, offensive but not hate speech, or neither. For this purpose, various machine learning models, such as Support Vector Machine (SVM) and logistic regression, were built using various features of the data, such as n-grams, TF-IDF, and readability. The authors of [6] built a model combining deep neural networks with SVM for the detection of offensive content, achieving an F1 score of 90%.

Offensive content detection in tweets is also the subject of several conference and competition tasks. OffensEval 2020 was organized by SemEval in 2020 as a task covering five languages: English, Arabic, Danish, Greek, and Turkish [7]. At FIRE 2019, a similar task was organized for Indo-European languages such as English, Hindi, and German. The data set was created using samples obtained from Twitter and Facebook in all three languages. Various models, including LSTM with attention, Word2vec embeddings with CNN, and BERT, were used for this task. In several cases, traditional learning models outperformed deep learning methods for languages other than English [8]. Shared tasks on offensive language detection in Dravidian languages were organized by [9] and [10].

3. Task and Data set description

We took the data set from the HASOC subtask on offensive language identification in Dravidian CodeMix [11]. The challenges provided by the organizers are as follows.

Task 1: A binary classification problem with message-level labeling of offensive and non-offensive content in Malayalam code-mixed YouTube comments.

Task 2: Given Romanized Tanglish and Manglish Twitter or YouTube comments, the system must classify them as offensive or non-offensive.

Our team took part in Task 2, identifying offensive content in the Tanglish and Manglish data sets. According to the organizers, the Tanglish data is collected from Twitter tweets and comments on the Helo app, whereas the Manglish data is taken from YouTube comments [11]. A detailed description of the data set is provided in Table 1. Both the Tanglish and Manglish data contain ID, Tweet, and Label fields.

4. Methodology

Our team made three submissions based on three different models. In the first two models, mBERT embeddings are passed through SVM and DNN classifiers, while in the third model, monolingual BERT is employed as a classifier. Each of them follows the general architecture shown in Figure 1. Our model thus consists of three stages, each of which is discussed in the following subsections.

4.1. Data processing

The data set provided by the organizers contains a lot of unwanted information. A few preprocessing steps were applied to both the text and label fields to make the data suitable for model building. Digits, special characters, hyperlinks, and Twitter user handles were removed from the data set because they did not help improve the performance of our model. Furthermore, the social media data provided by the organizers did not follow grammatical norms; hence, lemmatization was performed to convert words to their base form. For example, the words ate, eaten, and eating were converted to their base form eat. The text was also converted to lower case to eliminate redundant terms. All of this preprocessing was done with the NLTK toolkit for Python [12]. The preprocessed data is then fed into a tokenizer, which divides each tweet into several tokens. The mBERT tokenizer 1 is used for this purpose. Padding and masking were also used to handle variable-length sentences.
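The cleaning, padding, and masking steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the exact code behind our submissions: the function names `clean_comment` and `pad_and_mask`, the regular expressions, and the padding id of 0 are assumptions made for this example. The lemmatization step is left out because it depends on downloaded NLTK resources, and the character filter assumes Romanized text, as in Task 2.

```python
import re

# Hypothetical patterns for the removals described in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
HANDLE_RE = re.compile(r"@\w+")                 # Twitter user handles
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # digits and special characters
SPACE_RE = re.compile(r"\s+")


def clean_comment(text):
    """Lower-case a comment and strip links, handles, digits and symbols."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)  # assumes Romanized (ASCII) text
    return SPACE_RE.sub(" ", text).strip()


def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad a list of token ids to max_len.

    Returns (padded_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding. pad_id=0 is an assumption for this sketch.
    """
    ids = list(token_ids[:max_len])
    n_pad = max_len - len(ids)
    mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, mask


if __name__ == "__main__":
    print(clean_comment("@user123 Padam SUPER!!! 100% https://t.co/abc"))
    print(pad_and_mask([101, 7, 8, 102], 6))
```

In practice the tokenizer library can perform padding and masking itself; the helper above only makes the idea explicit.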

1 https://huggingface.co/bert-base-multilingual-cased

4.2. Feature extraction

To obtain contextual embeddings from code-mixed data, we used the multilingual Bidirectional Encoder Representations from Transformers (mBERT) model [13] in models 1 and 2, and monolingual BERT in model 3. The architecture of the mBERT model is largely based on the original monolingual BERT architecture [14], which has 12 transformer blocks, 12 attention heads, and a hidden size of 768. Consequently, the vector dimension of mBERT embeddings is 768. This model was trained using the same pre-training objectives as BERT, namely Masked Language Modeling (MLM) and Next Sentence Prediction. The only distinction is that multilingual BERT is trained on Wikipedia data from 104 different languages to handle languages other than English. For classification, we only take the embedding of the CLS token at the beginning of the sequence because it represents the whole sentence.
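As a rough illustration of the CLS-based feature extraction described above, the following sketch uses the Hugging Face `transformers` library with PyTorch. The function name `mbert_cls_embeddings` and the `max_length` value are assumptions made for this example and are not taken from our implementation; the checkpoint name is the one given in footnote 1. The heavy dependencies are imported inside the function so that the module can be loaded without them.

```python
MODEL_NAME = "bert-base-multilingual-cased"  # checkpoint from footnote 1


def mbert_cls_embeddings(texts, model_name=MODEL_NAME, max_length=128):
    """Return one 768-dimensional CLS vector per input text.

    max_length=128 is an assumed value, not a reported setting.
    Requires the `torch` and `transformers` packages and downloads the
    checkpoint on first use.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # The tokenizer pads to the longest sequence in the batch and builds
    # the attention mask that tells the model which positions are padding.
    batch = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)

    # Position 0 of every sequence holds the [CLS] token.
    return outputs.last_hidden_state[:, 0, :].numpy()
```

Calling `mbert_cls_embeddings(["padam super"])` would be expected to return an array of shape `(1, 768)`, which can then be passed to a downstream classifier.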

4.3. Classification

We experimented with three different classifiers: SVM and DNN classifiers on mBERT embeddings in models 1 and 2, and the pre-trained language model BERT in model 3. These models are described in the subsections that follow. We selected them because they outperformed other models such as Logistic Regression (LR), Random Forest (RF), and Naive Bayes (NB) in our preliminary trials.

4.3.1. Traditional machine learning based classifier

We experimented with traditional machine learning algorithms such as Support Vector Machine (SVM) with ten-fold cross-validation. Experimental results for the proposed model show that kernel value "1" and solver "lbfgs" produce the best results. These hyper-parameter values were determined through experimental trials. This model accepts mBERT embeddings as input and produces labels that are either offensive or non-offensive. The model was developed using Python’s scikit-learn library [15].
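A ten-fold cross-validation of an SVM on precomputed sentence embeddings could look roughly like the sketch below, written with scikit-learn. It is an illustration under stated assumptions rather than our exact configuration: the function name `cross_validate_svm`, the linear kernel with `C=1.0`, the stratified shuffled folds, and the fixed random seed are all choices made for this example, and the reported "kernel value" and "solver" settings are not mapped onto specific arguments here.

```python
N_FOLDS = 10  # ten-fold cross-validation, as described in the text


def cross_validate_svm(embeddings, labels, folds=N_FOLDS):
    """Return the weighted F1 score of an SVM on each cross-validation fold.

    embeddings: array-like of shape (n_samples, n_features), for example
                768-dimensional sentence vectors.
    labels:     array-like of shape (n_samples,).
    Requires scikit-learn; imported lazily so the module loads without it.
    """
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    # Kernel and C are assumptions for this sketch, not reported values.
    clf = SVC(kernel="linear", C=1.0)
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    return cross_val_score(clf, embeddings, labels, cv=cv, scoring="f1_weighted")
```

The mean of the returned array gives a single cross-validated estimate that can be compared across candidate hyper-parameter values.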

4.3.2. Deep neural network based model

Next, we experimented with a Deep Neural Network (DNN), the second model in our proposed methodology. The DNN model comprises several dense layers that are designed to extract more significant features from the input embeddings. We used dense layers of 1000, 500, 100, and 50 neurons in our model. Each dense layer is followed by dropout with a rate of 0.4 to prevent overfitting. The dropout rate of 0.4 was selected by grid search and remains constant throughout the experiment. To normalize the activations, we additionally employed a batch normalization layer. The output from these layers is then classified using a sigmoid layer.
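The layer stack described above can be sketched in PyTorch as follows. This is only an approximation under assumptions: the text does not name a framework, an activation function, or the position of the batch normalization layer, so the use of PyTorch, ReLU activations, and a single batch normalization layer after the first dense layer are choices made for this example, as are the names `layer_shapes` and `build_dnn`.

```python
HIDDEN_SIZES = (1000, 500, 100, 50)  # dense layer widths from the text
DROPOUT_RATE = 0.4                   # dropout rate from the text


def layer_shapes(input_dim=768, hidden=HIDDEN_SIZES):
    """Return the (inputs, outputs) pair of every hidden dense layer."""
    sizes = (input_dim, *hidden)
    return list(zip(sizes[:-1], sizes[1:]))


def build_dnn(input_dim=768):
    """Build the classifier as a torch.nn.Sequential (requires PyTorch)."""
    from torch import nn

    layers = []
    for i, (n_in, n_out) in enumerate(layer_shapes(input_dim)):
        layers.append(nn.Linear(n_in, n_out))
        if i == 0:
            # Assumed placement of the single batch normalization layer.
            layers.append(nn.BatchNorm1d(n_out))
        layers.append(nn.ReLU())  # activation is an assumption
        layers.append(nn.Dropout(DROPOUT_RATE))
    # Single sigmoid unit for the offensive / non-offensive decision.
    layers.append(nn.Linear(HIDDEN_SIZES[-1], 1))
    layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)
```

With a 768-dimensional input, `layer_shapes()` yields the pairs (768, 1000), (1000, 500), (500, 100), and (100, 50), followed by the final 50-to-1 output layer.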

4.3.3. Transformer model

In our last model, we experimented with transformer-based language models such as BERT. Transformer architectures are trained on generic tasks such as language modeling and then fine-tuned for classification. The underlying model for our classification is bert-base-uncased 2, which is provided by the BERT developers. We did not use ten-fold cross-validation to evaluate monolingual BERT since it is more computationally expensive. Implementation details of all three proposed models are provided in our GitHub repository 3.

5. Experimental Results

The organizers evaluated the submitted models using the weighted F1 score. Among the proposed models, our top-performing monolingual BERT placed sixth for offensive content recognition on the Manglish data set and eleventh on the Tanglish data set. Table 2 and Table 3 list the top-performing models with their weighted F1 scores for the Manglish and Tanglish data sets, respectively (the result of our proposed model is shown in bold). Among our proposed models, BERT outperformed the other classifiers, reaching a weighted F1 score of 0.70 for the Manglish data set and 0.573 for the Tanglish data set. Table 4 compares the results of our proposed models. We trained our proposed models on a Tanglish data set comprising 4000 comments from the training set and tested them on 940 comments from the test set. For the Manglish data set, 4000 training comments and 1000 test comments were used.

2 https://huggingface.co/transformers/model_doc/bert.html

3 https://github.com/shankarb14/dravidian-codemix
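For reference, the weighted F1 score averages the per-class F1 scores, weighting each class by its share of the true labels. The sketch below shows the computation in plain Python; the function name `weighted_f1` and the label strings `"OFF"` and `"NOT"` are hypothetical and chosen only for this example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


if __name__ == "__main__":
    gold = ["OFF", "OFF", "NOT", "NOT", "NOT"]
    pred = ["OFF", "NOT", "NOT", "NOT", "OFF"]
    print(weighted_f1(gold, pred))
```

In this toy example the per-class F1 scores are 0.5 for `"OFF"` and 2/3 for `"NOT"`, so the weighted score is 0.4 × 0.5 + 0.6 × 2/3 = 0.6.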

5.1. Error analysis

We investigated the behavior of the proposed models on sample test sentences to evaluate their performance. Based on these observations, our best-performing model, the monolingual BERT classifier, correctly classified all of the sample sentences. However, the multilingual BERT models, mBERT+SVM and mBERT+DNN, misclassified test samples 3 and 2, respectively. Table 5 summarises the findings.

6. Conclusion and future enhancement

In this work, we presented the model submitted by our team IIITD-ShankarB for offensive content identification in the shared task “Dravidian-CodeMixed HASOC-2021”. We experimented with three distinct models: a machine learning-based model, a Deep Neural Network model, and a transformer-based language model. Our model is one of the top-performing models, ranking sixth on the Manglish data set and eleventh on the Tanglish data set. In future work, we can improve the performance of the proposed model by including domain-specific embeddings.

[1] S. A. Chowdhury, Mubarak, Abdelali, S.-g. Jung, B. J. Jansen, Salminen, A multiplatform Arabic news comment dataset for offensive language detection, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 6203-6212.

[2] C. N. d. Santos, I. Melnyk, I. Padhi, Fighting offensive language on social media with unsupervised text style transfer, arXiv preprint arXiv:1805.07685 (2018).

[3] Mubarak, Darwish, W. Magdy, Abusive language detection on Arabic social media, in: Proceedings of the First Workshop on Abusive Language Online, 2017, pp. 52-56.

[4] Fortuna, Nunes, A survey on automatic detection of hate speech in text, ACM Computing Surveys (CSUR) 51 (2018) 1-30.

[5] Davidson, Warmsley, Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 11, 2017.

[6] Al-Khalifa, Magdy, Darwish, Elsayed, Mubarak, Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, in: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, 2020.

[7] Zampieri, Nakov, Rosenthal, Atanasova, G. Karadzhov, Mubarak, Derczynski, Pitenis, Ç. Çöltekin, SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020), arXiv preprint arXiv:2006.07235 (2020).

[8] Mandl, Modha, Majumder, Patel, Dave, Mandlia, Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019, pp. 14-17.

[9] B. R. Chakravarthi, Priyadharshini, Jose, A. Kumar, T. Mandl, P. K. Kumaresan, Ponnusamy, H. R L, J. P. McCrae, Sherly, Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 133-145. URL: https://aclanthology.org/2021.dravidianlangtech-1.17.

[10] Mandl, Modha, A. Kumar, B. R. Chakravarthi, Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: Forum for Information Retrieval Evaluation, 2020, pp. 29-32.

[11] B. R. Chakravarthi, P. K. Kumaresan, Sakuntharaj, A. K. Madasamy, Thavareesan, P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, Mandl, Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.

[12] Bird, Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc., 2009.

[13] Pires, Schlinger, Garrette, How multilingual is multilingual BERT?, arXiv preprint arXiv:1906.01502 (2019).

[14] Devlin, M.-W. Chang, Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[15] Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, et al., Scikit-learn: Machine learning in Python, The Journal of Machine Learning Research 12 (2011).