Overview of the HASOC-DravidianCodeMix Shared
Task on Offensive Language Detection in Tamil and
Malayalam
Bharathi Raja Chakravarthi (a), Prasanna Kumar Kumaresan (b),
Ratnasingam Sakuntharaj (c), Anand Kumar Madasamy (d), Sajeetha Thavareesan (c),
B Premjith (e), K Sreelakshmi (e), Subalalitha Chinnaudayar Navaneethakrishnan (f),
John P. McCrae (a) and Thomas Mandl (g)

(a) Insight Centre for Data Analytics, National University of Ireland, Galway
(b) Indian Institute of Information Technology and Management-Kerala, India
(c) Eastern University, Sri Lanka
(d) National Institute of Technology Karnataka Surathkal, Karnataka, India
(e) Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India
(f) SRM Institute of Science and Technology, Chennai, Tamil Nadu, India
(g) University of Hildesheim, Germany


Abstract
We present the results of the HASOC-Dravidian-CodeMix shared task¹ held at FIRE 2021, a track on
offensive language identification for Dravidian languages in code-mixed text. This paper details
the task, its organisation, and the submitted systems. The identification of offensive language
was treated as a classification task. Sixteen teams participated in identifying offensive language
in Tamil-English code-mixed data, 11 teams in Malayalam-English code-mixed data and 14 teams
in Tamil data. The teams detected offensive language using various machine learning and deep
learning classification models. We analyse these benchmark systems to find out how well they
accommodate a code-mixed scenario in Dravidian languages, focusing on Tamil and Malayalam.

Keywords
Sentiment analysis, Dravidian languages, Tamil, Malayalam, Kannada, Code-mixing


1. Introduction
Advancements in technology have aimed to ease people's lives and have attracted many users
towards digitization, particularly younger generations [1, 2]. As a result, the number of people
using social media to express their opinions and beliefs has increased dramatically [3]. However,
the lack of regulation gives individuals the freedom to post offensive content. There is
also no mechanism to regulate the posting of hateful content in under-resourced languages
[4, 5, 6].

¹ https://dravidian-codemix.github.io/HASOC-2021/index.html
FIRE 2021: Forum for Information Retrieval Evaluation, December 17-21, 2020, Hyderabad, India
Email: bharathi.raja@insight-centre.org (B.R. Chakravarthi); prasanna.mi20@iiitmk.ac.in (P.K. Kumaresan);
sakuntharaj@esn.ac.lk (R. Sakuntharaj); m_anandkumar@nitk.edu.in (A.K. Madasamy); sajeethas@esn.ac.lk (S.
Thavareesan); b_premjith@cb.amrita.edu (B. Premjith); k_sreelakshmi@cb.students.amrita.edu (K. Sreelakshmi);
subalalitha@gmail.com (S.C. Navaneethakrishnan); john.mccrae@insight-centre.org (J.P. McCrae);
mandl@uni-hildesheim.de (T. Mandl)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

   Tamil is a Dravidian language spoken primarily in Sri Lanka, India, Malaysia, and Singapore
[7, 8, 9]. It is an agglutinative language with a rich morphological structure [10]. Tamil has 247
letters: 12 vowels, 18 consonants, 216 composite letters formed by combining each of the 18
consonants with each of the 12 vowels, and one special letter known as "Ayutha eluththu".
Malayalam is also a Dravidian language, spoken in Kerala, India [11, 12, 13]. Malayalam has its
own script; however, social media users often write in Latin script or mix languages when
commenting or posting online [14, 15].
   The HASOC-DravidianCodeMix shared task 2021 aims to provide a new gold standard corpus
for offensive language identification of code-mixed text in Dravidian languages (Tamil-English
and Malayalam-English). Code-mixed content online results from people mixing multiple
languages, especially their native language and another commonly spoken language, while
expressing their views [16]. Offensive language often comprises hate speech, such as racism,
ageism, homophobia, transphobia, ableism and any hate-promoting content against an individual
or group [17]. Offensive language detection has been an active area of research in both academia
and industry for the past two decades [18]. There is an increasing demand for the identification
of offensive language in code-mixed social media texts [19].
   Sixteen teams took part in identifying offensive language in Tamil-English code-mixed
data, 11 teams in Malayalam-English code-mixed data, and 14 teams in Tamil data. The teams
used a variety of machine learning and deep learning classification models to identify offensive
language. The purpose of this study is to examine these benchmark systems in order to determine
how well they fit a code-mixed scenario in Dravidian languages, with a particular emphasis on
Tamil and Malayalam.


2. Task Description
The task aims to identify offensive language in code-mixed comments/posts in Dravidian
languages (Tamil, Tamil-English and Malayalam-English) collected from social media. A
comment/post may contain more than one sentence, but the average number of sentences per
comment/post in the corpora is one. Each comment/post is annotated at the comment/post level.
The datasets also exhibit class imbalance, which mirrors real-world scenarios.

    • Task 1
      Task 1 focuses on offensive language identification in Tamil text. It is a coarse-grained
      binary classification task in which each participating system has to classify YouTube
      comments in Tamil into two classes: Offensive and Not-offensive.
         – Not-Offensive – The comment does not contain offensive language. Example:
            Text: பேரவை சார்பாக படம் வெற்றி பெற வாழ்த்துக்கள்
            Translation: Congratulations on the success of the film on behalf of the Assembly
         – Offensive – The comment contains hate, offensive or profane content. Example:
            Text: போடா வெங்காயம் ஒன்னயலாம் அடுச்சு கொள்ளமும் வெண்ணை.
            Translation: You onion, we should beat you to death, butter. ("Onion" and "butter"
            are used as insults in Tamil.)

    • Task 2
      Task 2 focuses on offensive language identification in code-mixed Malayalam-English and
      Tamil-English comments. Examples:
         – Not-Offensive – The comment does not contain offensive language.
            Text: iantha padam rumba nalla iruku
            Translation of code-mixed Tamil: This movie is very good
         – Offensive – The comment contains hate, offensive or profane content.
            Text: i ammaye bhegikku
            Translation of code-mixed Malayalam: f..k this mother f..kers

            Task                 Train set   Development set   Test set   Total data points
            Task 1: Tamil            5,880                 -        654               6,534
            Task 2: Tamil            4,000               940      1,001               5,941
            Task 2: Malayalam        4,000               951      1,000               5,951
Table 1
Number of comments in the datasets used for Task 1 and Task 2 and their split into train, development and
test sets


2.1. Dataset description
The datasets for both Task 1 and Task 2 were prepared by collecting comments from YouTube.
Table 1 shows the number of comments in each dataset.

2.1.1. Task 1: Tamil Dataset
We collected the data for Task 1 from YouTube, using the YouTube comment scraper¹
to download the comments of particular videos. The comments were collected from movie
trailers. We removed all comments that were not in Tamil. The remaining comments were then
used to create a dataset for the offensive language classification task. This dataset contains a
total of 6,534 comments and is split into a training set of 5,880 comments and a test set of
654 comments.
¹ https://pypi.org/project/youtube-comment-scraper-python/
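
The paper does not state how non-Tamil comments were removed. One plausible approach, sketched below purely as an illustration, is to keep a comment only if most of its letters fall in the Unicode Tamil block (U+0B80 to U+0BFF). The function names and the 0.5 threshold are assumptions made for this example and are not taken from the organisers' pipeline.

```python
# Illustrative script-based filter; names and threshold are hypothetical.
TAMIL_BLOCK = range(0x0B80, 0x0C00)  # Unicode Tamil block, U+0B80 to U+0BFF


def tamil_ratio(text: str) -> float:
    """Fraction of script characters in `text` that belong to the Tamil block."""
    tamil = sum(1 for ch in text if ord(ch) in TAMIL_BLOCK)
    # Letters from any other script (for example Latin) count against the ratio.
    other = sum(1 for ch in text if ch.isalpha() and ord(ch) not in TAMIL_BLOCK)
    total = tamil + other
    return tamil / total if total else 0.0


def keep_tamil(comments, min_ratio=0.5):
    """Keep only the comments written predominantly in Tamil script."""
    return [c for c in comments if tamil_ratio(c) >= min_ratio]


sample = ["படம் நல்லா இருக்கு", "This movie is very good", "super படம்"]
kept = keep_tamil(sample)
```

A ratio-based rule like this one tolerates occasional Latin tokens such as names or emoticons; a stricter or looser threshold would change which mixed-script comments survive.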
                No.   TeamName                  Precision   Recall   F1-Score    Rank
                1     SSN_NLP                       0.856    0.864      0.859       1
                2     MUCIC [20]                    0.850    0.861      0.852       2
                3     SSN_NLP_MLRG [21]             0.841    0.847      0.844       3
                4     IRLab [22]                    0.839    0.835      0.837       4
                5     BITS Pilani [23]              0.831    0.846      0.835       5
                6     AIML [24]                     0.823    0.843      0.825       6
                7     Pegasus [25]                  0.812    0.807      0.810       7
                8     KonguCSE                      0.749    0.797      0.764       8
                9     Jusgowithurs                  0.750    0.817      0.750       9
                10    Gothainayaki.A                0.855    0.824      0.749      10
                11    MUM                           0.853    0.821      0.742      11
                12    SSNCSE_NLP [26]               0.747    0.725      0.735      12
                13    AI_ML NIT Patna               0.710    0.717      0.714      13
                14    Saahil Raj                    0.706    0.547      0.599      14
Table 2
Rank list based on weighted average F1-score along with other evaluation metrics (Precision and Recall)
for Task 1: Tamil track


2.1.2. Task 2: Tamil and Malayalam Datasets
The Task 2 data were also taken from YouTube comments and posts. These comments were used to
create a dataset for offensive language classification in both languages. The datasets include
different types of code-mixing, such as mixing of Tamil and Latin scripts in the Tamil dataset,
code-mixed text in the Malayalam dataset, and mixing at the word level. The Tamil dataset
contains a total of 5,941 comments, split into training, development and test sets. The
training set consists of 4,000 comments, the development set of 940 comments,
and the test set of 1,001 comments. The Malayalam dataset contains a total of 5,951
comments, split in the same way: the training set consists of 4,000 comments, the
development set of 951 comments, and the test set of 1,000 comments. These datasets are
also published as part of the same competition, HASOC-Dravidian CodeMixed, on Codalab.


3. Methodology
We received fourteen, sixteen and eleven submissions for Task 1: Tamil track, Task 2:
Tamil track and Task 2: Malayalam track, respectively. The submissions were evaluated based
on the weighted average F1-score, and rank lists were prepared accordingly. Table 2 shows the
rank list of teams that participated in Task 1: Tamil track. Tables 3 and 4 show the rank lists of
the teams that competed in Task 2: Tamil track and Task 2: Malayalam track, respectively. Each
of these tables reports the precision, recall and weighted average F1-score of all the participating
teams on the test data. In this section, we briefly describe the methodologies of the teams that
participated in the three tasks.
                 No.    Team name             Precision    Recall   F1-Score   Rank
                 1      MUCIC [20]                0.679     0.685      0.678      1
                 2      AIML [24]                 0.670     0.670      0.670      2
                 3      SSN_IT_NLP [27]           0.685     0.688      0.668      3
                 4      ZYBank AI                 0.671     0.676      0.654      4
                 5      IRLab [22]                0.654     0.662      0.650      5
                 6      HSU [28]                  0.655     0.664      0.649      6
                 7      IIITSurat [29]            0.679     0.673      0.636      7
                 8      Team Pegasus [25]         0.633     0.644      0.612      8
                 9      PSG [30]                  0.614     0.609      0.611      9
                 10     SSNCSE_NLP [26]           0.615     0.607      0.610     10
                 11     IIITD-shanker [31]        0.599     0.568      0.573     11
                 12     CEN_NLP                   0.596     0.540      0.539     12
                 13     RameshKannan              0.524     0.526      0.525     13
                 14     MUM                       0.591     0.527      0.522     14
                 15     AI_ML_NIT_Patna           0.539     0.509      0.515     15
                 16     JBTTM                     0.537     0.483      0.503     16
Table 3
Rank list based on weighted average F1-score along with other evaluation metrics (Precision and Recall)
for Task 2: Tamil track


                 No.   Team name               Precision   Recall   F1-Score    Rank
                 1     AIML [24]                   0.776    0.762      0.766       1
                 2     MUCIC [20]                  0.764    0.760      0.762       2
                 3     HSU [28]                    0.744    0.730      0.735       3
                 4     IIIT Surat [29]             0.752    0.727      0.734       4
                 5     IRLab [22]                  0.754    0.705      0.714       5
                 6     IIITD-ShankarB [31]         0.715    0.693      0.700       6
                 7     SSNCSE_NLP [26]             0.692    0.678      0.683       7
                 8     Pegasus [25]                0.708    0.660      0.670       8
                 9     CEN_NLP                     0.652    0.635      0.641       9
                 10    MUM                         0.628    0.637      0.632      10
                 11    JBTTM                       0.577    0.584      0.580      11
Table 4
Rank list based on weighted average F1-score along with other evaluation metrics (Precision and Recall)
for Task 2: Malayalam track


    • SSN_NLP_MLRG [21]: Team SSN_NLP_MLRG participated in the Tamil-English subtask.
      The authors implemented both traditional machine learning and deep learning
      models for the classification. They experimented with Support Vector Machine (SVM)
      [32], naive Bayes, random forest and extreme gradient boosting ensemble classifiers
      for categorizing offensive content, using N-gram, character-level and word-level Term
      Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words (BoW) features.
      The deep learning models used for the classification include a shallow Neural Network
      (NN), a Long Short-Term Memory (LSTM) network [33] and a Convolutional Neural Network
      (CNN). The embeddings in the NN were initialized using the fastText [34] pre-trained
      word embeddings. The authors also followed a transfer learning approach, using
      multilingual Bidirectional Encoder Representations from Transformers (mBERT) [35],
      ALBERT [36] (A Lite BERT for self-supervised learning of language representations) and
      DistilBERT [37] (a distilled version of BERT [35]) with the ktrain library, and ULMFiT [38]
      with fastai [39], to build the classification model.
    • HSU_TransEmb [28]: Team HSU_TransEmb used a Transformer ensemble system to
      identify offensive content in Tamil-English and Malayalam-English code-mixed
      data. The ensemble consists of the mBERT, DistilBERT and MuRIL [40] models. The
      preprocessed data were fed to the three models, and the class probabilities were
      computed. The class label was then chosen from the sum of the class probabilities
      obtained from the three models.
    • MUCIC [20]: Team MUCIC took part in both the Tamil-English and Malayalam-English
      shared tasks. They used word-level as well as character-level N-gram based TF-IDF for
      extracting features from the texts. Furthermore, they identified the 40,000 most frequent
      features in each case and constructed a combined set containing 80,000 features.
      They employed linear SVM, random forest, logistic regression and an ensemble of these
      three classifiers to train the model. The logistic regression model obtained the highest
      F1-score of 0.881 in the Tamil-English task, whereas random forest exhibited the best
      performance in the Malayalam-English task with an F1-score of 0.783.
    • IIITSurat [29]: Team IIITSurat took part in both shared tasks and employed machine
      learning and deep learning models for classification. Machine learning classifiers such
      as logistic regression, random forest, naive Bayes, XGBoost, and SVM were trained on
      TF-IDF features. In addition to the machine learning models, the authors trained a Deep
      Neural Network (DNN), a CNN, a BiLSTM and Transformer-based models such as BERT
      [35], IndicBERT [41] and MuRIL [40] for classification. Among all the models, MuRIL
      achieved the highest F1-scores of 0.78 and 0.91 in the Malayalam-English and Tamil-English
      tasks, respectively.
    • Pegasus [25]: Team Pegasus submitted results for Task 1 and Task 2. They utilized the
      XLM-RoBERTa [42] and DistilBERT models for identifying offensive language in social
      media text. The authors fed the embeddings generated by these BERT-based models into a
      BiLSTM network. In Task 1, they concatenated the embeddings obtained from both
      models and passed them to a BiLSTM network. This model attained an F1-score of 0.810.
      For Task 2, the authors performed transliteration and translation on the data and applied
      the XLM-RoBERTa model to extract the embeddings, which obtained F1-scores of 0.612 and
      0.670 in the Tamil-English and Malayalam-English tasks, respectively.
    • IRLab [22]: Team IRLab implemented a Deep Neural Network (DNN) with TF-IDF
      features for Tasks 1 and 2. The authors extracted unigram to six-gram TF-IDF features and
      selected the top 30,000 features. A DNN with four dense layers read these features and
      predicted the class label for each comment. They also performed hyperparameter tuning for
      each model to select the best one. Their model achieved F1-scores of 0.84, 0.65 and 0.71
      in the Task 1, Tamil-English and Malayalam-English shared tasks, respectively.
    • AIML [24]: Team AIML proposed an ensemble model that used character N-gram based
      TF-IDF features for the identification of offensive texts. The authors considered
      character N-grams of length one to six and trained an ensemble of SVM, logistic
      regression and random forest classifiers. Their model attained an F1-score of 0.83 in
      Task 1, whereas it achieved F1-scores of 0.67 and 0.77 in the Tamil-English and
      Malayalam-English tasks, respectively.
    • SSN_IT_NLP [27]: Team SSN_IT_NLP presented an offensive language identification model
      for Tamil-English data. mBERT generates embeddings from the data, which are then
      fed to an ensemble of SVM, XGBoost and Linear Discriminant Analysis (LDA) classifiers.
      The label predicted by the majority of the classifiers was selected as the final output.
    • NLP_CSE: Team NLP_CSE employed machine learning and deep learning models for
      detecting offensive content. A logistic regression classifier was trained on TF-IDF
      features. Furthermore, the authors used random oversampling to deal with the class
      imbalance problem in the data. The model obtained an F1-score of 0.5243. In addition
      to the logistic regression model, the authors implemented an LSTM-based
      encoder-decoder architecture and a transformer-based model. The encoder-decoder
      model was a deep multi-layer network that also incorporated an attention mechanism.
      It consisted of stacks of four encoders and four decoders. The transformer model,
      mBERT, was used to generate embeddings for the sentences, and the cosine similarity
      between sentences was used for classification.
    • BITS_Pilani [23]: Team BITS_Pilani used a DNN containing an embedding layer, a
      pooling layer, a dropout layer, a fully connected layer and an output layer for classifying
      the text as Offensive or Not-offensive in the Tamil-English subtask. The model
      achieved an F1-score of 0.835 in the competition.
    • M Subramanian et al.: Team M Subramanian et al. employed multinomial naive Bayes,
      K-Nearest Neighbours (KNN), logistic regression and SVM classifiers with BoW features
      for classifying the social media text into offensive or not-offensive categories. This team
      participated in the shared task for Tamil data only. The logistic regression model attained
      the highest performance among the classifiers.
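
Two of the ensembling strategies described above can be illustrated with a minimal sketch: summing class probabilities across models (as reported for HSU_TransEmb) and majority voting over predicted labels (as reported for SSN_IT_NLP). The function names and the toy probabilities are assumptions made for this illustration; the teams' actual implementations may differ.

```python
from collections import Counter


def soft_vote(model_probs, labels):
    """Pick the label whose class probabilities, summed over models, are largest.

    model_probs: one probability list per model, each ordered like `labels`.
    """
    totals = [sum(column) for column in zip(*model_probs)]
    best = max(range(len(labels)), key=totals.__getitem__)
    return labels[best]


def hard_vote(predicted_labels):
    """Pick the label predicted by the majority of the models."""
    return Counter(predicted_labels).most_common(1)[0][0]


labels = ["Not-offensive", "Offensive"]
# Toy outputs of three hypothetical models for a single comment.
probs = [[0.60, 0.40], [0.30, 0.70], [0.45, 0.55]]
soft_label = soft_vote(probs, labels)
hard_label = hard_vote(["Offensive", "Not-offensive", "Offensive"])
```

Summing probabilities lets a confident model outweigh two uncertain ones, whereas majority voting treats every model's decision equally.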


4. Evaluation
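The distribution of the offensive language classes is imbalanced in the datasets. We therefore
ranked the systems by the weighted average F1-score, which takes into account the varying
degrees of importance of each class in the dataset. We used the classification report tool from
scikit-learn² to compute the following metrics:

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]

\[ \text{F-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]

\[ P_{\text{weighted}} = \sum_{i=1}^{L} \left( P_i \times w_i \right) \tag{4} \]

\[ R_{\text{weighted}} = \sum_{i=1}^{L} \left( R_i \times w_i \right) \tag{5} \]

\[ \text{F-Score}_{\text{weighted}} = \sum_{i=1}^{L} \left( \text{F-Score}_i \times w_i \right) \tag{6} \]

Here TP, FP and FN are the numbers of true positives, false positives and false negatives of a
class, L is the number of classes, P_i, R_i and F-Score_i are the precision, recall and F-score of
class i, and w_i is the weight of class i. In the weighted average computed by scikit-learn, w_i is
the proportion of the true instances that belong to class i.

² https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html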
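
As a worked illustration of Equations (1) to (6), the following sketch computes the weighted precision, recall and F-score in plain Python. It assumes that the weight of a class is its share of the true labels, which is how scikit-learn's weighted average behaves; the function name and the toy labels are hypothetical.

```python
def weighted_scores(y_true, y_pred):
    """Return (weighted precision, weighted recall, weighted F-score)."""
    n = len(y_true)
    p_w = r_w = f_w = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0          # Eq. (1)
        recall = tp / (tp + fn) if tp + fn else 0.0             # Eq. (2)
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)              # Eq. (3)
        weight = (tp + fn) / n  # assumed weight: share of true instances
        p_w += precision * weight                               # Eq. (4)
        r_w += recall * weight                                  # Eq. (5)
        f_w += f_score * weight                                 # Eq. (6)
    return p_w, r_w, f_w


# Toy example with four Not-offensive and two Offensive comments.
y_true = ["NOT", "NOT", "NOT", "NOT", "OFF", "OFF"]
y_pred = ["NOT", "NOT", "NOT", "OFF", "OFF", "NOT"]
scores = weighted_scores(y_true, y_pred)
```

In this toy example the majority class has precision, recall and F-score of 0.75 with weight 4/6, and the minority class has 0.5 with weight 2/6, so each weighted metric equals 2/3.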


5. Results and Discussion
Shared tasks on offensive language detection in code-mixed Tamil and Malayalam data were
organized as part of HASOC 2021. We received fourteen submissions for Track 1: Tamil and
sixteen submissions for Track 2: Tamil. For Malayalam, eleven teams submitted their results in
Track 2. Table 5 shows the number of teams that participated in each shared task. Participating
teams explored N-gram based TF-IDF, BoW and different variants of BERT for representing the
input text. None of the teams used language-specific features. They used various conventional
machine learning classifiers such as SVM, naive Bayes, random forest, logistic regression,
XGBoost and KNN, as well as ensembles of machine learning classifiers, for the identification of
offensive language text. In addition, DNNs, LSTMs and their variants, and transformer-based
classifiers were also studied for the classification. Team HSU_TransEmb explored an ensemble of
mBERT, DistilBERT and MuRIL for detecting offensive texts in code-mixed Tamil and Malayalam
data. NLP_CSE investigated the performance of oversampling algorithms to address the class
imbalance problem in the data. Tables 2, 3 and 4 show the rank lists for Task 1: Tamil track,
Task 2: Tamil track and Task 2: Malayalam track, respectively. Figures 1, 2 and 3 show the
precision, recall and F1-scores of the submissions in Track 1: Tamil, Track 2: Tamil and
Track 2: Malayalam. Figure 4 shows box-plots of the performance of the teams that participated
in Track 1: Tamil, Track 2: Tamil and Track 2: Malayalam.
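
Random oversampling, the technique reported for NLP_CSE, can be sketched as follows: examples of the smaller classes are duplicated at random until every class is as large as the largest one. This is an illustrative sketch only; the function name, the fixed seed and the toy data are assumptions and do not reproduce the team's code.

```python
import random


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until classes are balanced."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing examples with replacement from the class itself.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


texts = ["c1", "c2", "c3", "c4", "c5"]
labels = ["NOT", "NOT", "NOT", "NOT", "OFF"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
```

Oversampling should be applied to the training split only, since duplicating examples in the development or test data would distort the reported scores.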
   Team SSN_NLP obtained the first rank in Track 1 with an F1-score of 0.859. MUCIC and
SSN_NLP_MLRG took the second and third positions with F1-scores of 0.852 and 0.844,
respectively. Among the 14 teams, seven achieved F1-scores greater than 0.8. Looking at the
models used by the teams, one can see that the top-ranked teams used different kinds of feature
extraction models and classifiers.
                    Competition                         No. of participating teams
                    All three tasks                                             6
                    Track 1: Tamil alone                                        7
                    Both tasks in Track 2                                       5
                    Track 2: Tamil alone                                        4
                    Track 2: Malayalam alone                                    0
                    Track 1: Tamil and Track 2: Tamil                           1
Table 5
Number of teams that participated in each combination of shared tasks


Figure 1: Bar plot describing Precision, Recall and F1-scores of the submissions for Track 1: Tamil


Figure 2: Bar plot describing Precision, Recall and F1-scores of the submissions for Track 2: Tamil


   Team MUCIC attained the first position in the Track 2: Tamil shared task with an F1-score of
0.678. MUCIC used word-level as well as character-level N-gram based TF-IDF features for
classification. They performed the predictions using SVM, random forest, logistic regression,
and an ensemble of these three. The second-placed team, AIML, and the third-placed team,
SSN_IT_NLP, scored F1-scores of 0.670 and 0.668, respectively. AIML also utilized N-gram based
TF-IDF features with SVM, logistic regression and random forest. They considered unigram to
six-gram features for this analysis. SSN_IT_NLP made use of mBERT embeddings with SVM,
XGBoost and LDA to identify the offensive language texts in the data. Among the 16 teams that
participated, ten recorded F1-scores greater than 0.6.
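
The character N-gram TF-IDF representation used by the top-ranked systems can be sketched in plain Python as follows. The sketch uses the textbook weighting tf × log(N / df) over character N-grams of length one to six; this is an assumption made for illustration, and library implementations such as the one in scikit-learn apply additional smoothing and normalisation. The function names and example comments are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=6):
    """All character n-grams of `text` with n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf(docs, n_min=1, n_max=6):
    """Map each document to a dict of n-gram -> tf * log(N / df)."""
    counts = [Counter(char_ngrams(doc, n_min, n_max)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({gram: (k / total) * math.log(n_docs / df[gram])
                        for gram, k in c.items()})
    return vectors


vectors = tfidf(["padam nalla", "padam mosam"], n_min=1, n_max=3)
```

Character N-grams do not depend on word boundaries or a fixed spelling, which may explain why such features cope well with the inconsistent romanisation found in code-mixed comments.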
   In Track 2: Malayalam, AIML reached the top position with an F1-score of 0.766. MUCIC and
HSU were placed in the second and third positions with F1-scores of 0.762 and 0.735,
respectively. AIML used unigram to six-gram based TF-IDF features with SVM, logistic regression
and random forest classifiers for the identification of offensive language texts. MUCIC
followed a similar methodology, but they used only the 40,000 most frequent N-gram based
TF-IDF features of each type (word-level and character-level) for classification. Team HSU
utilized an ensemble of mBERT, DistilBERT and MuRIL for the detection of offensive language
content. In this task, 6 out of 11 teams obtained an F1-score of 0.7 or higher, and one team
scored an F1-score less than 0.6.

Figure 3: Bar plot describing Precision, Recall and F1-scores of the submissions for Track 2: Malayalam


Figure 4: Box-plot for the submissions for Track 1: Tamil, Track 2: Tamil and Track 2: Malayalam

   It is interesting to note that teams that used TF-IDF features attained the top position in
both tasks in Track 2. A similar trend was visible in HASOC 2020 [43]. The teams that won
the HASOC 2020 shared tasks in CodeMix data used TF-IDF features with machine learning
classifiers.


6. Conclusion
This paper gives an overview of the HASOC-Dravidian-CodeMix shared task at FIRE 2021.
The shared task consisted of three subtasks, for Tamil, code-mixed Tamil-English and code-mixed
Malayalam-English data. Sixteen teams participated in the Tamil-English code-mixed subtask,
11 teams in the Malayalam-English code-mixed subtask and 14 teams in the Tamil subtask. Teams
used methods ranging from Bag of Words and TF-IDF to BERT-based models to represent the data
and applied conventional machine learning algorithms, deep neural networks and transformer
networks for prediction. One team employed oversampling algorithms to deal with the imbalance
in the data by synthetically generating the data points in minority classes. The analysis of the
methods of the teams showed that conventional and deep learning/transformer-based methods
exhibit similar performance in terms of the evaluation metrics used for assessing the models.


Acknowledgments
This publication is the outcome of research supported in part by a research grant from
Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2 (Insight_2), and Irish
Research Council grant IRCLA/2017/129 (CARDAMOM-Comparative Deep Models of Language
for Minority and Historical Languages). We also thank Ciara Oloughlin for her help
with proofreading.


References
 [1] B. R. Chakravarthi, HopeEDI: A multilingual hope speech detection dataset for equal-
     ity, diversity, and inclusion, in: Proceedings of the Third Workshop on Computational
     Modeling of People’s Opinions, Personality, and Emotion’s in Social Media, Associa-
     tion for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 41–53. URL:
     https://aclanthology.org/2020.peoples-1.5.
 [2] B. R. Chakravarthi, V. Muralidaran, Findings of the shared task on hope speech detection
     for equality, diversity, and inclusion, in: Proceedings of the First Workshop on Lan-
     guage Technology for Equality, Diversity and Inclusion, Association for Computational
     Linguistics, Kyiv, 2021, pp. 61–72. URL: https://aclanthology.org/2021.ltedi-1.8.
 [3] S. Suryawanshi, B. R. Chakravarthi, M. Arcan, S. Little, P. Buitelaar, TrollsWithOpinion:
     A Dataset for Predicting Domain-specific Opinion Manipulation in Troll Memes, arXiv
     preprint arXiv:2109.03571 (2021).
 [4] J. J. Andrew, JudithJeyafreedaAndrew@DravidianLangTech-EACL2021:offensive lan-
     guage detection for Dravidian code-mixed YouTube comments, in: Proceedings of the
     First Workshop on Speech and Language Technologies for Dravidian Languages, Associ-
     ation for Computational Linguistics, Kyiv, 2021, pp. 169–174. URL: https://aclanthology.
     org/2021.dravidianlangtech-1.22.
 [5] B. Bharathi, A. S. A, SSNCSE_NLP@DravidianLangTech-EACL2021: Offensive language
     identification on multilingual code mixing text, in: Proceedings of the First Work-
     shop on Speech and Language Technologies for Dravidian Languages, Association for
     Computational Linguistics, Kyiv, 2021, pp. 313–318. URL: https://aclanthology.org/2021.
     dravidianlangtech-1.45.
 [6] B. R. Chakravarthi, R. Priyadharshini, R. Ponnusamy, P. K. Kumaresan, K. Sampath,
     D. Thenmozhi, S. Thangasamy, R. Nallathambi, J. P. McCrae, Dataset for Identification
     of Homophobia and Transophobia in Multilingual YouTube Comments, arXiv preprint
     arXiv:2109.00227 (2021).
 [7] R. Sakuntharaj, S. Mahesan, A novel hybrid approach to detect and correct spelling in
     Tamil text, in: 2016 IEEE International Conference on Information and Automation for
     Sustainability (ICIAfS), IEEE, 2016, pp. 1–6.
 [8] R. Sakuntharaj, S. Mahesan, Use of a novel hash-table for speeding-up suggestions for
     misspelt Tamil words, in: 2017 IEEE International Conference on Industrial and Informa-
     tion Systems (ICIIS), IEEE, 2017, pp. 1–5.
 [9] R. Sakuntharaj, S. Mahesan, Detecting and correcting real-word errors in Tamil sentences,
     Ruhuna Journal of Science 9 (2018).
[10] Nuhman, Basic Tamil Grammar, Readers Association, Kalmunai, Department of Tamil,
     University of Peradeniya, 2013.
[11] S. Thavareesan, S. Mahesan, Word embedding-based Part of Speech tagging in Tamil
     texts, in: 2020 IEEE 15th International Conference on Industrial and Information Systems
     (ICIIS), 2020, pp. 478–482. doi:10.1109/ICIIS51140.2020.9342640.
[12] S. Thavareesan, S. Mahesan, Sentiment Lexicon Expansion using Word2vec and fastText
     for Sentiment Prediction in Tamil texts, in: 2020 Moratuwa Engineering Research
     Conference (MERCon), 2020, pp. 272–276. doi:10.1109/MERCon50084.2020.9185369.
[13] S. Thavareesan, S. Mahesan, Sentiment Analysis in Tamil Texts: A Study on Machine
     Learning Techniques and Feature Representation, in: 2019 14th Conference on Industrial
     and Information Systems (ICIIS), 2019, pp. 320–325.
     doi:10.1109/ICIIS47346.2019.9063341.
[14] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly,
     J. P. McCrae, DravidianCodeMix: Sentiment Analysis and Offensive Language
     Identification Dataset for Dravidian Languages in Code-Mixed Text, arXiv preprint
     arXiv:2106.09460 (2021).
[15] B. R. Chakravarthi, K. Soman, R. Ponnusamy, P. K. Kumaresan, K. P. Thamburaj, J. P.
     McCrae, et al., DravidianMultiModality: A Dataset for Multi-modal Sentiment Analysis
     in Tamil and Malayalam, arXiv preprint arXiv:2106.04853 (2021).
[16] N. Jose, B. R. Chakravarthi, S. Suryawanshi, E. Sherly, J. P. McCrae, A survey of current
     datasets for code-switching research, in: 2020 6th International Conference on Advanced
     Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 136–141.
[17] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019
     task 6: Identifying and categorizing offensive language in social media (OffensEval),
     in: Proceedings of the 13th International Workshop on Semantic Evaluation, Associa-
     tion for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 75–86. URL:
     https://aclanthology.org/S19-2010. doi:10.18653/v1/S19-2010.
[18] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on sentiment analysis for dravidian languages in
     code-mixed text, Forum for Information Retrieval Evaluation (2020).
[19] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the
     type and target of offensive posts in social media, in: Proceedings of the 2019 Con-
     ference of the North American Chapter of the Association for Computational Linguis-
     tics: Human Language Technologies, Volume 1 (Long and Short Papers), Association
     for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1415–1420. URL:
     https://aclanthology.org/N19-1144. doi:10.18653/v1/N19-1144.
[20] F. Balouchzahi, S. Bashang, G. Sidorov, H. L. Shashirekha, CoMaTa OLI- Code-mixed
     Malayalam and Tamil Offensive Language Identification, in: Working Notes of FIRE 2021
     - Forum for Information Retrieval Evaluation, CEUR, 2021.
[21] A. Kalaivani, D. Thenmozhi, SSN_NLP_MLRG@Dravidian-CodeMix-FIRE2020: Senti-
     ment Code-Mixed Text Classification in Tamil and Malayalam using ULMFiT, in: FIRE
     (Working Notes), 2020.
[22] A. Saroj, S. Pal, IRLab@IIT-BHU@Dravidian-CodeMix-FIRE2020: Sentiment Analysis on
     Multilingual Code Mixing Text Using BERT-BASE, in: FIRE (Working Notes), 2020.
[23] S. Tripathy, A. Pathak, Y. Sharma, Offensive Language Classification of Code-Mixed Tamil
     with Keras, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021.
[24] J. Kumari, A. Kumar, Offensive Language Identification on Multilingual Code Mixing
     Text, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021.
[25] P. Kalyan Jada, K. Yasaswini, K. Puranik, A. Sampath, S. Thangasamy, K. Pal Thamburaj,
     Analyzing Social Media Content for Detection of Offensive Text, in: Working Notes of
     FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[26] N. N. Appiah Balaji, B. B, B. J, SSNCSE_NLP@Dravidian-CodeMix-FIRE2020: Sentiment
     Analysis for Dravidian Languages in Code-Mixed Text, in: FIRE (Working Notes), 2020.
[27] S. Divya, N. Sripriya, Offensive Content Recognition, in: Working Notes of FIRE 2021 -
     Forum for Information Retrieval Evaluation, CEUR, 2021.
[28] S. N. V. C. Basava, A. P. Karri, Transformer Ensemble System for Detection of Offensive
     Content in Dravidian Languages, in: Working Notes of FIRE 2021 - Forum for Information
     Retrieval Evaluation, CEUR, 2021.
[29] S. Bhawal, P. Roy, A. Kumar, Offensive Language Identification on Multilingual Code
     Mixed Text using BERT, in: Working Notes of FIRE 2021 - Forum for Information Retrieval
     Evaluation, CEUR, 2021.
[30] S. Benhur J, K. S, Pretrained Transformers for Offensive Language Identification in Tan-
     glish, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021.
[31] S. Biradar, S. Saumya, A. Chauhan, mBERT based model for identification of offensive
     content in south Indian languages, in: Working Notes of FIRE 2021 - Forum for Informa-
     tion Retrieval Evaluation, CEUR, 2021.
[32] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (1995) 273–297.
[33] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997)
     1735–1780.
[34] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classifica-
     tion, in: Proceedings of the 15th Conference of the European Chapter of the Association
     for Computational Linguistics: Volume 2, Short Papers, Association for Computational
     Linguistics, 2017, pp. 427–431.
[35] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of the
     North American Chapter of the Association for Computational Linguistics: Human Lan-
     guage Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL:
     https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[36] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for
     self-supervised learning of language representations, arXiv preprint arXiv:1909.11942
     (2019).
[37] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
     faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[38] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, in: Pro-
     ceedings of the 56th Annual Meeting of the Association for Computational Linguistics
     (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne,
     Australia, 2018, pp. 328–339. URL: https://aclanthology.org/P18-1031.
     doi:10.18653/v1/P18-1031.
[39] J. Howard, S. Gugger, Fastai: a layered api for deep learning, Information 11 (2020) 108.
[40] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Ag-
     garwal, R. T. Nagipogu, S. Dave, et al., Muril: Multilingual representations for indian
     languages, arXiv preprint arXiv:2103.10730 (2021).
[41] D. Kakwani, A. Kunchukuttan, S. Golla, N. Gokul, A. Bhattacharyya, M. M. Khapra,
     P. Kumar, IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained
     multilingual language models for Indian languages, in: Proceedings of the 2020 Conference on
     Empirical Methods in Natural Language Processing: Findings, 2020, pp. 4948–4961.
[42] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
     M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning
     at scale, arXiv preprint arXiv:1911.02116 (2019).
[43] B. R. Chakravarthi, A. K. M, J. P. McCrae, B. Premjith, K. Soman, T. Mandl, Overview
     of the track on HASOC-Offensive Language Identification-DravidianCodeMix, in: FIRE
     (Working Notes), 2020, pp. 112–120.