<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bharathi Raja Chakravarthi</string-name>
          <email>bharathi.raja@insight-centre.org</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasanna Kumar Kumaresan</string-name>
          <email>prasanna.mi20@iiitmk.ac.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ratnasingam Sakuntharaj</string-name>
          <email>sakuntharaj@esn.ac.lk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar Madasamy</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sajeetha Thavareesan</string-name>
          <email>sajeethas@esn.ac.lk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B Premjith</string-name>
          <email>b_premjith@cb.amrita.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>K Sreelakshmi</string-name>
          <email>k_sreelakshmi@cb.students.amrita.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Subalalitha Chinnaudayar Navaneethakrishnan</string-name>
          <email>subalalitha@gmail.com</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John P. McCrae</string-name>
          <email>john.mccrae@insight-centre.org</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Mandl</string-name>
          <email>mandl@uni-hildesheim.de</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore, Amrita Vishwa Vidyapeetham</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Eastern University</institution>
          ,
          <country country="LK">Sri Lanka</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Indian Institute of Information Technology and Management-Kerala</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>National Institute of Technology Karnataka Surathkal</institution>
          ,
          <addr-line>Karnataka</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>SRM Institute of Science and Technology</institution>
          ,
          <addr-line>Chennai, Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the results of the HASOC-Dravidian-CodeMix shared task held at FIRE 2021, a track on offensive language identification for Dravidian languages in code-mixed text. We detail the task, its organisation, and the submitted systems. The identification of offensive language was viewed as a classification task. Sixteen teams participated in identifying offensive language in Tamil-English code-mixed data, 11 teams in Malayalam-English code-mixed data, and 14 teams in Tamil data. The teams detected offensive language using various machine learning and deep learning classification models. This paper analyses those benchmark systems to find out how well they accommodate a code-mixed scenario in Dravidian languages, focusing on Tamil and Malayalam.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment analysis</kwd>
        <kwd>Dravidian languages</kwd>
        <kwd>Tamil</kwd>
        <kwd>Malayalam</kwd>
        <kwd>Kannada</kwd>
        <kwd>Code-mixing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Advancements in technology aim to ease people's lives and have attracted many users towards digitization, particularly the younger generations [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. As a result, the number of people using social media to express their opinions and beliefs has increased dramatically [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, the lack of regulation gives individuals the freedom to post offensive content. There is also no mechanism to regulate the posting of hateful content in under-resourced languages [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ].
      </p>
      <p>
        Tamil is a Dravidian language spoken primarily in Sri Lanka, India, Malaysia, and Singapore [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. It is an agglutinative language with a rich morphological structure [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Tamil has 247 letters, comprising 12 vowels, 18 consonants, 216 composite letters combining each consonant with each vowel, and one special letter known as "Ayutha eluththu". Malayalam is also a Dravidian language, spoken in Kerala, India [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. Malayalam also has its own script for writing; however, social media users use the Latin script or mix languages when commenting or posting online [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ].
      </p>
      <p>
        The HASOC-DravidianCodeMix shared task 2021 aims to provide a new gold standard corpus for offensive language identification of code-mixed text in Dravidian languages (Tamil-English and Malayalam-English). Code-mixed content online results from people mixing multiple languages, especially their native language and another commonly spoken language, while expressing their views [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Offensive language often comprises hate speech, such as racism, ageism, homophobia, transphobia, ableism and any hate-promoting content against an individual or group [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. It has been an active area of research in both academia and industry for the past two decades [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. There is an increasing demand for the identification of offensive language in code-mixed social media texts [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>Sixteen teams were involved in identifying offensive language in Tamil-English code-mixed data, 11 teams in Malayalam-English code-mixed data, and 14 teams in Tamil data. The teams used a variety of machine learning and deep learning classification models to identify offensive language. The purpose of this study is to examine these benchmark systems in order to determine how well they fit a code-mixed scenario in Dravidian languages, with a particular emphasis on Tamil and Malayalam.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The task aims to identify offensive content in code-mixed comments/posts in Dravidian languages (Tamil, Tamil-English and Malayalam-English) collected from social media. A comment/post may contain more than one sentence, but the average sentence length in the corpora is one. Each comment/post is annotated at the comment/post level. The dataset also exhibits a class imbalance problem that mirrors real-world scenarios.</p>
      <p>• Task 1</p>
      <p>Task 1 focuses on offensive language identification from Tamil text. It is a coarse-grained binary classification task in which each participating system has to classify YouTube comments in Tamil into two classes: Offensive and Not-offensive.</p>
      <p>– Not-Offensive – The comments do not contain offensive language. Example:</p>
      <sec id="sec-2-1">
        <title>Text: ேபரைவ சார்பாக படம் ெவற்ற ெபற வாழ்துககள்</title>
        <p>– Offensive – The comments contain hate, offensive or profane content. Example:</p>
      </sec>
      <sec id="sec-2-2">
        <title>Text: ேபாடா ெவங்காயம் ஒனன்யலாம் அடுச் ெகாளள்மு்</title>
        <p>ெவைண்ண .</p>
        <p>Translation: You onion, we should beat you to death, butter – butter and onion are offensive words in Tamil.</p>
        <p>• Task 2</p>
        <p>Task 2 focuses on offensive language identification in code-mixed Malayalam-English and Tamil-English comments. Example: code-mixed Tamil</p>
        <p>– Not-Offensive – The comments do not contain offensive language.</p>
        <sec id="sec-2-2-1">
          <title>Text: iantha padam rumba nalla iruku</title>
          <p>Translation of code-mixed Tamil: This movie is very good</p>
          <p>– Offensive – The comments contain offensive language. Example: code-mixed Malayalam</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Text: i ammaye bhegikku</title>
          <p>Translation of code-mixed Malayalam: f..k this mother f..kers</p>
          <p>2.1. Dataset description</p>
          <p>The datasets for both Task 1 and Task 2 were prepared by collecting comments from YouTube. Table 1 shows the number of comments in each dataset.</p>
          <p>2.1.1. Task 1: Tamil Dataset</p>
          <p>We collected the data for Task 1 from YouTube comments, using the YouTube comment scraper1 to download the comments from particular videos. The comments were collected from movie trailers. We removed all comments that were not in Tamil. These comments were then used to create a dataset for the offensive language classification task. The dataset contains a total of 6,534 comments and is split into train and test sets: the training set consists of 5,880 comments and the test set of 654 comments.</p>
          <p>1https://pypi.org/project/youtube-comment-scraper-python/</p>
          <p>The teams that participated in the Task 1: Tamil track were: 1. SSN_NLP; 2. MUCIC [<xref ref-type="bibr" rid="ref20">20</xref>]; 3. SSN_NLP_MLRG [<xref ref-type="bibr" rid="ref21">21</xref>]; 4. IRLab [<xref ref-type="bibr" rid="ref22">22</xref>]; 5. BITS Pilani [<xref ref-type="bibr" rid="ref23">23</xref>]; 6. AIML [<xref ref-type="bibr" rid="ref24">24</xref>]; 7. Pegasus [<xref ref-type="bibr" rid="ref25">25</xref>]; 8. KonguCSE; 9. Jusgowithurs; 10. Gothainayaki.A; 11. MUM; 12. SSNCSE_NLP [<xref ref-type="bibr" rid="ref26">26</xref>]; 13. AI_ML NIT Patna; 14. Saahil Raj.</p>
          <p>2.1.2. Tamil and Malayalam Dataset</p>
          <p>Task 2 data was also taken from YouTube comments and posts. These comments were used to create datasets for offensive language classification in both languages. The datasets include different types of code-mixing, such as mixing Tamil and Latin characters in the Tamil dataset, code-mixed data in the Malayalam dataset, and mixing at the word level. The Tamil dataset contains a total of 5,941 comments, split into training, development and test sets: the training set consists of 4,000 comments, the development set of 940 comments, and the test set of 1,001 comments. The Malayalam dataset contains a total of 5,951 comments, split into training, development and test sets: the training set consists of 4,000 comments, the development set of 951 comments, and the test set of 1,000 comments. These datasets were also published in the same competition, HASOC-Dravidian-CodeMix, hosted on Codalab.</p>
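          <p>The train/development/test partitioning reported above can be reproduced, for example, with scikit-learn. This is only an illustrative sketch: the comments and labels below are placeholders, while the split sizes are those reported for the Task 2 Tamil dataset (4,000 / 940 / 1,001 of 5,941 comments).</p>

```python
# Sketch of a stratified train/dev/test split matching the reported sizes.
# The comments and labels are invented stand-ins, not the actual corpus.
from sklearn.model_selection import train_test_split

comments = [f"comment {i}" for i in range(5941)]
labels = ["OFF" if i % 4 == 0 else "NOT" for i in range(5941)]  # invented labels

# First carve out the 4,000-comment training set, then split the remainder
# into the 940-comment development set and the 1,001-comment test set.
train_x, rest_x, train_y, rest_y = train_test_split(
    comments, labels, train_size=4000, random_state=42, stratify=labels)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, train_size=940, random_state=42, stratify=rest_y)

print(len(train_x), len(dev_x), len(test_x))  # → 4000 940 1001
```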
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We have received fourteen, sixteen and eleven submissions for Task 1: Tamil track, Task 2:
Tamil track and Task 2: Malayalam track, respectively. The submissions were evaluated based
on weighted average F1-score, and rank lists were prepared accordingly. Table 2 shows the rank
list of teams that participated in Task 1: Tamil track. Tables 3 and 4 show the rank lists of the
teams that competed in Task 2: Tamil track and Task 2: Malayalam track, respectively. Tables 2,
3 and 4 show the precision, recall and weighted average F1-score of all the participating teams
on test data. In this section, we briefly describe the methodologies of teams that participated
in the three tasks.</p>
      <p>The teams that participated in the Task 2: Tamil track were: 1. MUCIC [<xref ref-type="bibr" rid="ref20">20</xref>]; 2. AIML [<xref ref-type="bibr" rid="ref24">24</xref>]; 3. SSN_IT_NLP [<xref ref-type="bibr" rid="ref27">27</xref>]; 4. ZYBank AI; 5. IRLab [<xref ref-type="bibr" rid="ref22">22</xref>]; 6. HSU [<xref ref-type="bibr" rid="ref28">28</xref>]; 7. IIITSurat [<xref ref-type="bibr" rid="ref29">29</xref>]; 8. Team Pegasus [<xref ref-type="bibr" rid="ref25">25</xref>]; 9. PSG [<xref ref-type="bibr" rid="ref30">30</xref>]; 10. SSNCSE_NLP [<xref ref-type="bibr" rid="ref26">26</xref>]; 11. IIITD-shanker [<xref ref-type="bibr" rid="ref31">31</xref>]; 12. CEN_NLP; 13. RameshKannan; 14. MUM; 15. AI_ML_NIT_Patna; 16. JBTTM. The teams that participated in the Task 2: Malayalam track were: 1. AIML [<xref ref-type="bibr" rid="ref24">24</xref>]; 2. MUCIC [<xref ref-type="bibr" rid="ref20">20</xref>]; 3. HSU [<xref ref-type="bibr" rid="ref28">28</xref>]; 4. IIIT Surat [<xref ref-type="bibr" rid="ref29">29</xref>]; 5. IRLab [<xref ref-type="bibr" rid="ref22">22</xref>]; 6. IIITD-ShankarB [<xref ref-type="bibr" rid="ref31">31</xref>]; 7. SSNCSE_NLP [<xref ref-type="bibr" rid="ref26">26</xref>]; 8. Pegasus [<xref ref-type="bibr" rid="ref25">25</xref>]; 9. CEN_NLP; 10. MUM; 11. JBTTM.</p>
      <p>• SSN_NLP_MLRG [<xref ref-type="bibr" rid="ref21">21</xref>]: Team SSN_NLP_MLRG participated in the Tamil-English subtask. The authors implemented both traditional machine learning and deep learning models for the classification. They experimented with Support Vector Machine (SVM) [<xref ref-type="bibr" rid="ref32">32</xref>], naive Bayes, random forest and extreme gradient boosting ensemble classifiers for categorizing the offensive content, using N-gram, character-level and word-level Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words (BoW) features. The deep learning models used for the classification include a shallow Neural Network (NN), a Long Short-Term Memory (LSTM) network [<xref ref-type="bibr" rid="ref33">33</xref>] and a Convolutional Neural Network (CNN). The embeddings in the NN were initialized using fastText [<xref ref-type="bibr" rid="ref34">34</xref>] pre-trained word embeddings. The authors also followed a transfer learning approach using multilingual BERT (mBERT) [<xref ref-type="bibr" rid="ref35">35</xref>], ALBERT [<xref ref-type="bibr" rid="ref36">36</xref>] (A Lite BERT for self-supervised learning of language representations) and DistilBERT [37] (a distilled version of BERT [<xref ref-type="bibr" rid="ref35">35</xref>]) with ktrain, and ULMFiT [38] with fastai [39], to build the classification model.</p>
      <p>• HSU_TransEmb [<xref ref-type="bibr" rid="ref28">28</xref>]: Team HSU_TransEmb used a Transformer ensemble system to identify offensive content in Tamil-English and Malayalam-English code-mixed data. The ensemble consists of mBERT, DistilBERT and MuRIL [40] models. The preprocessed data were fed to the three BERT models, and the class probabilities were computed. The class label was identified from the sum of the class probabilities obtained from the three models.</p>
      <p>• MUCIC [<xref ref-type="bibr" rid="ref20">20</xref>]: Team MUCIC took part in both the Tamil-English and Malayalam-English shared tasks. They used word-level as well as character-level N-gram-based TF-IDF for extracting features from the texts. Furthermore, they identified the 40,000 most frequent features in each case and constructed a combined set of 80,000 features. They employed linear SVM, random forest, logistic regression and an ensemble of these three classifiers to train the model. The logistic regression model obtained the highest F1-score of 0.881 in the Tamil-English task, whereas random forest exhibited the best performance, with an F1-score of 0.783, in the Malayalam-English task.</p>
      <p>• IIITSurat [<xref ref-type="bibr" rid="ref29">29</xref>]: Team IIITSurat took part in both shared tasks and employed machine learning and deep learning models for classification. Machine learning classifiers such as logistic regression, random forest, naive Bayes, XGBoost and SVM were trained on TF-IDF features. In addition to the machine learning models, the authors applied a Deep Neural Network (DNN), a CNN, a BiLSTM and Transformer-based models such as BERT [<xref ref-type="bibr" rid="ref35">35</xref>], IndicBERT [41] and MuRIL [40] for classification. Among all the models, MuRIL achieved the highest F1-scores of 0.78 and 0.91 in the Malayalam-English and Tamil-English tasks, respectively.</p>
      <p>• Pegasus [<xref ref-type="bibr" rid="ref25">25</xref>]: Team Pegasus submitted results for Task 1 and Task 2. They utilized XLM-RoBERTa [42] and DistilBERT models for identifying offensive language in social media text. In Task 1, the authors concatenated the embeddings obtained from both BERT models and passed them to a BiLSTM network; this model attained an F1-score of 0.810. For Task 2, the authors performed transliteration and translation on the data and applied the XLM-RoBERTa model to extract the embeddings, which obtained F1-scores of 0.612 and 0.670 in the Tamil-English and Malayalam-English tasks, respectively.</p>
      <p>• IRLab [<xref ref-type="bibr" rid="ref22">22</xref>]: Team IRLab implemented a Deep Neural Network (DNN) with TF-IDF features for Tasks 1 and 2. The authors extracted unigram to six-gram TF-IDF features and retained the top 30,000 features. A DNN with four dense layers read these features and predicted the class label for each instance. They also performed hyperparameter tuning for each model to select the best model. Their models achieved F1-scores of 0.84, 0.65 and 0.71 in the Task 1, Tamil-English and Malayalam-English shared tasks, respectively.</p>
      <p>• AIML [<xref ref-type="bibr" rid="ref24">24</xref>]: Team AIML proposed an ensemble model which used character-N-gram-based TF-IDF features for the identification of offensive texts. The authors considered one- to six-character N-gram features and trained an ensemble of SVM, logistic regression and random forest. Their model attained an F1-score of 0.83 in Task 1, whereas it achieved F1-scores of 0.67 and 0.77 in the Tamil-English and Malayalam-English tasks of Task 2, respectively.</p>
      <p>• SSN_IT_NLP [<xref ref-type="bibr" rid="ref27">27</xref>]: Team SSN_IT_NLP presented an offensive language identification model for Tamil-English data. mBERT generates embeddings from the data, which are then fed to an ensemble of SVM, XGBoost and Linear Discriminant Analysis (LDA). The label predicted by the majority of the models was selected as the final output.</p>
      <p>• NLP_CSE: Team NLP_CSE employed machine learning and deep learning models for predicting the offensive data. A logistic regression classifier was trained on TF-IDF features. Furthermore, the authors used random oversampling algorithms to deal with the class imbalance problem in the data. The model obtained an F1-score of 0.5243. In addition to the logistic regression model, the authors implemented an LSTM-based encoder-decoder architecture and a transformer-based model. The encoder-decoder model was a deep multi-layer network that also incorporated an attention mechanism; it consisted of stacks of four encoders and four decoders. The transformer model, mBERT, was used to generate sentence embeddings, and the cosine similarity between sentences was used for classification.</p>
      <p>• BITS_Pilani [<xref ref-type="bibr" rid="ref23">23</xref>]: Team BITS_Pilani used a DNN which contains an embedding layer, a pooling layer, a dropout layer, a fully connected layer and an output layer for classifying the text into Offensive and Not-offensive in the Tamil-English subtask. The model achieved an F1-score of 0.835 in the competition.</p>
      <p>• M Subramanian et al.: Team M Subramanian et al. employed the multinomial naive Bayes model, KNN, logistic regression and an SVM classifier with BoW features for classifying the social media text into offensive or not-offensive categories. This team participated in the shared task for Tamil data only. The logistic regression model attained the highest performance among the classifiers.</p>
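      <p>The N-gram TF-IDF plus classifier-ensemble recipe described by several teams (e.g. MUCIC and AIML) can be sketched as follows. This is a minimal illustration, not any team's actual system: the toy comments, labels and hyperparameters are invented, and hard majority voting is used in place of whatever voting scheme each team chose.</p>

```python
# Sketch: character 1-6 gram TF-IDF features feeding a hard-voting ensemble
# of SVM, logistic regression and random forest. Toy data for illustration.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

comments = ["padam nalla iruku", "poda venkayam", "super movie", "poda loosu"]
labels = ["NOT", "OFF", "NOT", "OFF"]  # invented labels

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 6)),
    VotingClassifier(
        estimators=[
            ("svm", SVC()),
            ("lr", LogisticRegression()),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="hard",  # each classifier votes; the majority label wins
    ),
)
model.fit(comments, labels)
print(model.predict(["intha padam nalla iruku"])[0])
```

      <p>Hard voting takes the majority of the three predicted labels; soft voting (averaging class probabilities) is an alternative when all estimators expose calibrated probabilities.</p>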
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>The distribution of the offensive language classes is imbalanced in both datasets. We therefore report weighted-average metrics, which take into account the varying degrees of importance of each class in the dataset. We used the classification report tool from scikit-learn2.</p>
      <p>Precision = TP / (TP + FP) (1)</p>
      <p>Recall = TP / (TP + FN) (2)</p>
      <p>F-score = 2 × (Precision × Recall) / (Precision + Recall) (3)</p>
      <p>Weighted Precision = Σ (Precision of class i × Weight of class i), summed over the classes i = 1, …, N (4)</p>
      <p>Weighted Recall = Σ (Recall of class i × Weight of class i), summed over the classes i = 1, …, N (5)</p>
      <p>Weighted F-score = Σ (F-score of class i × Weight of class i), summed over the classes i = 1, …, N (6)</p>
      <p>Here TP, FP and FN denote true positives, false positives and false negatives, and the weight of a class is its share of the true samples (its support).</p>
      <p>2https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html</p>
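      <p>The weighted averaging above can be reproduced directly with scikit-learn's metrics; the labels below are invented for illustration, with "OFF" and "NOT" standing in for the Offensive and Not-offensive classes.</p>

```python
# Weighted-average precision, recall and F1: each class's score is weighted
# by its support, i.e. its share of the true samples. Toy labels below.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "OFF", "NOT", "NOT"]

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.889 0.833 0.838
```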
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>Shared tasks on offensive language detection in code-mixed Tamil and Malayalam data were organized as part of HASOC 2021. Fourteen submissions were received for Track 1: Tamil, sixteen for Track 2: Tamil, and eleven for Track 2: Malayalam. Table 5 shows the number of teams that participated in each shared task. Participating teams explored N-gram-based TF-IDF, BoW and different variants of BERT for representing the input text. None of the teams used language-specific features. They used various conventional machine learning classifiers such as SVM, naive Bayes, random forest, logistic regression, XGBoost and KNN, as well as ensembles of machine learning classifiers, for the identification of offensive language text. In addition, DNNs, LSTMs and their variants, and transformer-based classifiers were also studied for the classification. Team HSU_TransEmb explored an ensemble of mBERT, DistilBERT and MuRIL for detecting offensive texts in code-mixed Tamil and Malayalam data. NLP_CSE investigated the performance of oversampling algorithms to address the class imbalance problem in the data. Tables 2, 3 and 4 show the rank lists for the Task 1: Tamil, Task 2: Tamil and Task 2: Malayalam tracks, respectively. Figures 1, 2 and 3 show the precision, recall and F1-scores of submissions in Track 1: Tamil, Track 2: Tamil and Track 2: Malayalam. Figure 4 shows box-plots of the performance of the teams that participated in Track 1: Tamil, Track 2: Tamil and Track 2: Malayalam.</p>
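      <p>The probability-summing step of the HSU_TransEmb ensemble can be sketched as follows; the probability rows below are made-up stand-ins for the outputs of the three transformer models.</p>

```python
# Ensemble step: class probabilities from three models are summed per
# comment, and the argmax of the sums gives the final label.
import numpy as np

labels = ["NOT", "OFF"]
# One row per comment: [P(NOT), P(OFF)] from each model (invented values).
p_mbert = np.array([[0.70, 0.30], [0.40, 0.60]])
p_distilbert = np.array([[0.55, 0.45], [0.35, 0.65]])
p_muril = np.array([[0.60, 0.40], [0.55, 0.45]])

summed = p_mbert + p_distilbert + p_muril
final = [labels[i] for i in summed.argmax(axis=1)]
print(final)  # → ['NOT', 'OFF']
```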
      <p>Team SSN_NLP obtained the first rank in Track 1 with an F1-score of 0.859. MUCIC and SSN_NLP_MLRG took the second and third positions with F1-scores of 0.852 and 0.844, respectively. Among the 14 teams, seven scored F1-scores greater than 0.8. Looking at the models used by the teams, one can see that the teams that finished at the top used different kinds of feature extraction models and classifiers.</p>
      <p>Team MUCIC attained the first position in the Track 2: Tamil shared task with an F1-score of 0.678. MUCIC used word-level as well as character-level N-gram-based TF-IDF features for classification. They performed the predictions using SVM, random forest, logistic regression, and an ensemble of these three. The second-placed team, AIML, and the third-placed team, SSN_IT_NLP, scored F1-scores of 0.670 and 0.668, respectively. AIML also utilized N-gram-based TF-IDF features with SVM, logistic regression and random forest; they considered unigram to six-gram features for this analysis. SSN_IT_NLP made use of mBERT embeddings with SVM, XGBoost and LDA to identify the offensive language texts in the data. Among the 16 teams that participated, ten recorded F1-scores greater than 0.6.</p>
      <p>In Track 2: Malayalam, AIML reached the top position with an F1-score of 0.766. MUCIC and HSU were placed second and third with F1-scores of 0.762 and 0.735, respectively. AIML used unigram to six-gram TF-IDF features with SVM, logistic regression and random forest classifiers for the identification of offensive language texts. MUCIC followed a similar methodology, but used only the 40,000 most frequent N-gram-based TF-IDF features from each class for classification. Team HSU utilized an ensemble of mBERT, DistilBERT and MuRIL for the detection of offensive language content. In this task, 6 out of 11 teams obtained an F1-score greater than 0.7, and one team scored an F1-score of less than 0.6.</p>
      <p>It is interesting to note that teams that used TF-IDF features attained the top position in
both tasks in Track 2. A similar trend was visible in HASOC 2020 [43]. The teams that won
the HASOC 2020 shared tasks in CodeMix data used TF-IDF features with machine learning
classifiers.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper gives an overview of the HASOC-Dravidian-CodeMix shared task at FIRE 2021. The shared task consisted of three subtasks covering Tamil, code-mixed Tamil and Malayalam. Sixteen teams participated in the Tamil-English code-mixed task, 11 teams in the Malayalam-English code-mixed task and 14 teams in the Tamil task. Teams used methods ranging from Bag-of-Words and TF-IDF to BERT-based models to represent the data, and applied conventional machine learning algorithms, deep neural networks and transformer networks for prediction. One team employed oversampling algorithms to deal with the imbalance in the data by synthetically generating data points in the minority classes. The analysis of the teams' methods showed that both conventional and deep learning/transformer-based methods exhibit similar performance in terms of the evaluation metrics used for assessing the models.</p>
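      <p>The random-oversampling idea mentioned above can be sketched as follows. This is a generic sketch, not the participating team's actual pipeline: minority-class comments are simply duplicated at random until all classes reach the majority-class count.</p>

```python
# Random oversampling: resample minority-class rows (with replacement)
# until every class matches the majority-class count. Toy data below.
import random
from collections import Counter

random.seed(0)
data = [("comment a", "NOT"), ("comment b", "NOT"), ("comment c", "NOT"),
        ("comment d", "OFF")]

counts = Counter(label for _, label in data)
target = max(counts.values())
balanced = list(data)
for label, n in counts.items():
    pool = [row for row in data if row[1] == label]
    balanced.extend(random.choice(pool) for _ in range(target - n))

print(Counter(label for _, label in balanced))
```

      <p>Duplicating rows this way only rebalances the loss; libraries such as imbalanced-learn offer the same operation alongside synthetic alternatives like SMOTE.</p>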
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This publication is the outcome of research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2 (Insight_2), and Irish Research Council grant IRCLA/2017/129 (CARDAMOM - Comparative Deep Models of Language for Minority and Historical Languages). We also thank Ciara Oloughlin for her help with proofreading.</p>
      <p>[37] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).</p>
      <p>[38] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 328-339. URL: https://aclanthology.org/P18-1031. doi:10.18653/v1/P18-1031.</p>
      <p>[39] J. Howard, S. Gugger, fastai: a layered API for deep learning, Information 11 (2020) 108.</p>
      <p>[40] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal, R. T. Nagipogu, S. Dave, et al., MuRIL: Multilingual representations for Indian languages, arXiv preprint arXiv:2103.10730 (2021).</p>
      <p>[41] D. Kakwani, A. Kunchukuttan, S. Golla, N. Gokul, A. Bhattacharyya, M. M. Khapra, P. Kumar, IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, 2020, pp. 4948-4961.</p>
      <p>[42] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).</p>
      <p>[43] B. R. Chakravarthi, A. K. M, J. P. McCrae, B. Premjith, K. Soman, T. Mandl, Overview of the track on HASOC-Offensive Language Identification-DravidianCodeMix, in: FIRE (Working Notes), 2020, pp. 112-120.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Computational Modeling of People's Opinions</source>
          , Personality, and
          <article-title>Emotion's in Social Media, Association for Computational Linguistics</article-title>
          , Barcelona,
          <source>Spain (Online)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>53</lpage>
          . URL: https://aclanthology.org/2020.peoples-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>B. R.</given-names> <surname>Chakravarthi</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Muralidaran</surname></string-name>,
          <article-title>Findings of the shared task on hope speech detection for equality, diversity, and inclusion</article-title>,
          in: <source>Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion</source>,
          Association for Computational Linguistics, Kyiv,
          <year>2021</year>, pp. <fpage>61</fpage>-<lpage>72</lpage>.
          URL: https://aclanthology.org/2021.ltedi-1.8.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Little</surname>
          </string-name>
          , <string-name><given-names>P.</given-names> <surname>Buitelaar</surname></string-name>,
          <article-title>TrollsWithOpinion: A Dataset for Predicting Domain-specific Opinion Manipulation in Troll Memes</article-title>,
          <source>arXiv preprint arXiv:2109.03571</source> (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Andrew</surname>
          </string-name>
          , <article-title>JudithJeyafreedaAndrew@DravidianLangTech-EACL2021: Offensive language detection for Dravidian code-mixed YouTube comments</article-title>,
          in: <source>Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>,
          Association for Computational Linguistics, Kyiv,
          <year>2021</year>, pp. <fpage>169</fpage>-<lpage>174</lpage>.
          URL: https://aclanthology.org/2021.dravidianlangtech-1.22.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>B.</given-names> <surname>Bharathi</surname></string-name>,
          <string-name>A. S. A</string-name>,
          <article-title>SSNCSE_NLP@DravidianLangTech-EACL2021: Offensive language identification on multilingual code mixing text</article-title>,
          in: <source>Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages</source>,
          Association for Computational Linguistics, Kyiv,
          <year>2021</year>, pp. <fpage>313</fpage>-<lpage>318</lpage>.
          URL: https://aclanthology.org/2021.dravidianlangtech-1.45.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sampath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thangasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallathambi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Dataset for Identification of Homophobia and Transophobia in Multilingual YouTube Comments</article-title>,
          <source>arXiv preprint arXiv:2109.00227</source> (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sakuntharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahesan</surname>
          </string-name>
          ,
          <article-title>A novel hybrid approach to detect and correct spelling in Tamil text</article-title>,
          in: <source>2016 IEEE International Conference on Information and Automation for Sustainability (ICIAfS)</source>,
          IEEE, <year>2016</year>, pp. <fpage>1</fpage>-<lpage>6</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sakuntharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahesan</surname>
          </string-name>
          ,
          <article-title>Use of a novel hash-table for speeding-up suggestions for misspelt Tamil words</article-title>
          ,
          <source>in: 2017 IEEE International Conference on Industrial and Information Systems (ICIIS)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sakuntharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahesan</surname>
          </string-name>
          ,
          <article-title>Detecting and correcting real-word errors in Tamil sentences</article-title>
          ,
          <source>Ruhuna Journal of Science</source>
          <volume>9</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nuhman</surname>
          </string-name>
          , Basic Tamil Grammar, Readers Association, Kalmunai, Department of Tamil, University of Peradeniya,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahesan</surname>
          </string-name>
          ,
          <article-title>Word embedding-based Part of Speech tagging in Tamil texts</article-title>
          ,
          <source>in: 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>482</lpage>
          .
          doi:10.1109/ICIIS51140.2020.9342640.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahesan</surname>
          </string-name>
          ,
          <article-title>Sentiment Lexicon Expansion using Word2vec and fastText for Sentiment Prediction in Tamil texts</article-title>
          , in: <source>2020 Moratuwa Engineering Research Conference (MERCon)</source>,
          <year>2020</year>, pp. <fpage>272</fpage>-<lpage>276</lpage>.
          doi:10.1109/MERCon50084.2020.9185369.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thavareesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahesan</surname>
          </string-name>
          ,
          <article-title>Sentiment Analysis in Tamil Texts: A Study on Machine Learning Techniques and Feature Representation</article-title>,
          in: <source>2019 14th Conference on Industrial and Information Systems (ICIIS)</source>,
          <year>2019</year>, pp. <fpage>320</fpage>-<lpage>325</lpage>.
          doi:10.1109/ICIIS47346.2019.9063341.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          , <string-name><given-names>N.</given-names> <surname>Jose</surname></string-name>,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text</article-title>
          ,
          <source>arXiv preprint arXiv:2106.09460</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Thamburaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          , et al.,
          <article-title>DravidianMultiModality: A Dataset for Multi-modal Sentiment Analysis in Tamil and Malayalam</article-title>,
          <source>arXiv preprint arXiv:2106.04853</source> (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A survey of current datasets for code-switching research</article-title>
          ,
          <source>in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          , <string-name><given-names>R.</given-names> <surname>Kumar</surname></string-name>,
          <article-title>SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)</article-title>,
          in: <source>Proceedings of the 13th International Workshop on Semantic Evaluation</source>,
          Association for Computational Linguistics, Minneapolis, Minnesota, USA,
          <year>2019</year>, pp. <fpage>75</fpage>-<lpage>86</lpage>.
          URL: https://aclanthology.org/S19-2010. doi:10.18653/v1/S19-2010.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jose</surname>
          </string-name>
          , <string-name><given-names>E.</given-names> <surname>Sherly</surname></string-name>,
          <string-name><given-names>J. P.</given-names> <surname>McCrae</surname></string-name>,
          <article-title>Overview of the track on sentiment analysis for Dravidian languages in code-mixed text</article-title>,
          <source>Forum for Information Retrieval Evaluation</source> (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Predicting the type and target of offensive posts in social media</article-title>,
          in: <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>,
          Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>, pp. <fpage>1415</fpage>-<lpage>1420</lpage>.
          URL: https://aclanthology.org/N19-1144. doi:10.18653/v1/N19-1144.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>F.</given-names>
            <surname>Balouchzahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bashang</surname>
          </string-name>
          , <string-name><given-names>G.</given-names> <surname>Sidorov</surname></string-name>,
          <string-name><given-names>H. L.</given-names> <surname>Shashirekha</surname></string-name>,
          <article-title>CoMaTa OLI - Code-mixed Malayalam and Tamil Offensive Language Identification</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalaivani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Thenmozhi</surname>
          </string-name>
          , <article-title>SSN_NLP_MLRG@Dravidian-CodeMix-FIRE2020: Sentiment Code-Mixed Text Classification in Tamil and Malayalam using ULMFiT</article-title>,
          in: <source>FIRE (Working Notes)</source>, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saroj</surname>
          </string-name>
          , <string-name><given-names>S.</given-names> <surname>Pal</surname></string-name>,
          <article-title>IRLab@IIT-BHU@Dravidian-CodeMix-FIRE2020: Sentiment Analysis on Multilingual Code Mixing Text Using BERT-BASE</article-title>,
          in: <source>FIRE (Working Notes)</source>, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tripathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Offensive Language Classification of Code-Mixed Tamil with Keras</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kumari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Offensive Language Identification on Multilingual Code Mixing Text</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kalyan Jada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yasaswini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Puranik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sampath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thangasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pal Thamburaj</surname>
          </string-name>
          ,
          <article-title>Analyzing Social Media Content for Detection of Offensive Text</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name><given-names>N. N. Appiah</given-names> <surname>Balaji</surname></string-name>,
          <string-name>B. B</string-name>,
          <string-name>B. J</string-name>,
          <article-title>SSNCSE_NLP@Dravidian-CodeMix-FIRE2020: Sentiment Analysis for Dravidian Languages in Code-Mixed Text</article-title>,
          in: <source>FIRE (Working Notes)</source>, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Divya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sripriya</surname>
          </string-name>
          , <article-title>Offensive Content Recognition</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S. N. V. C.</given-names>
            <surname>Basava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Karri</surname>
          </string-name>
          ,
          <article-title>Transformer Ensemble System for Detection of Offensive Content in Dravidian Languages</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Offensive Language Identification on Multilingual Code Mixed Text using BERT</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name><given-names>S.</given-names> <surname>Benhur J</surname></string-name>,
          <string-name>K. S</string-name>,
          <article-title>Pretrained Transformers for Offensive Language Identification in Tanglish</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biradar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chauhan</surname>
          </string-name>
          ,
          <article-title>mBERT based model for identification of offensive content in south Indian languages</article-title>,
          in: <source>Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>, CEUR, <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Support-vector networks</article-title>
          ,
          <source>Machine Learning</source> <volume>20</volume>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural Computation</source> <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , <string-name><given-names>E.</given-names> <surname>Grave</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Bojanowski</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mikolov</surname></string-name>,
          <article-title>Bag of tricks for efficient text classification</article-title>,
          in: <source>Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>,
          Association for Computational Linguistics, <year>2017</year>, pp. <fpage>427</fpage>-<lpage>431</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>,
          in: <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>,
          Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>, pp. <fpage>4171</fpage>-<lpage>4186</lpage>.
          URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , <string-name><given-names>R.</given-names> <surname>Soricut</surname></string-name>,
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>,
          <source>arXiv preprint arXiv:1909.11942</source> (<year>2019</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>