mBERT based model for identification of offensive content in south Indian languages

Shankar Biradar (Indian Institute of Information Technology Dharwad), Sunil Saumya (Indian Institute of Information Technology Dharwad), and Arun Chauhan (Graphic Era University, Dehradun)

Abstract
The amount of offensive content generated on social media is increasing at an alarming rate, creating a greater need than ever to address the problem. To this end, the organizers of "Dravidian-CodeMixed HASOC-2021" created two challenges: Task 1 involves identifying offensive content in Malayalam data, whereas Task 2 covers Malayalam and Tamil code-mixed sentences. Our team participated in Task 2. In our proposed models, we used multilingual BERT (mBERT) to extract features and trained two different classifiers, a Support Vector Machine (SVM) and a Deep Neural Network (DNN), on the extracted features. In addition, we evaluated a monolingual BERT classifier on the same data. Our best performing model, monolingual BERT, achieved a weighted F1 score of 0.70 on the Malayalam code-mixed data, ranking sixth, and a weighted F1 score of 0.573 on the Tamil code-mixed data, ranking eleventh.

Keywords
Offensive, mBERT, Code-Mixed, SVM

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
shankar@iiitdwd.ac.in (S. Biradar); sunil.saumya@iiitdwd.ac.in (S. Saumya); aruntakhur@gmail.com (A. Chauhan)

1. Introduction
The availability of smartphones and the internet has created great interest in social media among today's youth. These applications give users a huge platform to connect with the outside world and share their ideas and opinions. With these benefits comes a disadvantage: many people misuse the platform, in the name of freedom of expression, to publish inflammatory content on social media. Such content typically targets a single person, a group of people, a particular faith, or a community [1]. People generate objectionable content and aggressively propagate it for a variety of reasons, including commercial and political gain [2]. This type of content can disrupt social harmony and cause riots in society; it also has the potential to exert a detrimental psychological influence on readers, harming their emotions and conduct. Identifying such content is therefore critical, and researchers, policymakers, and other stakeholders are attempting to develop reliable techniques to identify offensive content on social media.

Various studies on hate speech, harmful content, and abusive language identification in social media have been conducted during the previous decade. The majority of these studies focus on monolingual English content, and a large number of English-language corpora have been created [3]. However, people in countries with a complex culture and history, such as India, frequently use regional languages to generate inappropriate social media posts, typically mixing their regional language with English. Such text is known as code-mixed text. Hence, we require an efficient method to classify offensive content in code-mixed Indian languages. In this context, the "Dravidian-CodeMixed HASOC-2021" shared task organizers set up two tasks for detecting hate speech in Dravidian languages, namely Malayalam and Tamil code-mixed data. Our team took part in Task 2, and this paper presents the working notes for our proposed model.

The rest of the article is arranged as follows: Section 2 provides a brief summary of previous work, Section 3 describes the task and data set, Section 4 presents the proposed models in full, Section 5 reports the results, and Section 6 concludes.

2. Literature review
The automatic identification of hostile and harmful speech has attracted many researchers and practitioners from industry and academia. [4] provided a high-level review of the current state-of-the-art techniques in offensive language identification and related issues such as hate speech recognition. [5] developed a publicly accessible dataset for identifying offensive language in tweets by categorizing them as hate speech, offensive but not hate speech, or neither; various machine learning models, such as Support Vector Machine (SVM) and logistic regression, were built on data properties such as n-grams, TF-IDF, and readability. [6] built a model combining deep neural networks with SVM for the detection of offensive content, achieving an F1 score of 90%.

Offensive content detection from tweets also features in conference and competition tasks. OffensEval 2020 was organized at SemEval-2020 as a task in five languages: English, Arabic, Danish, Greek, and Turkish [7]. At FIRE 2019, a similar task targeted Indo-European languages, namely English, Hindi, and German; the data set was created from samples collected on Twitter and Facebook in all three languages, and various models, including LSTM with attention, Word2vec embeddings with CNN, and BERT, were applied. In several cases, traditional learning models outperformed deep learning methods for languages other than English [8]. Shared tasks on offensive language detection in Dravidian languages were provided by [9] and [10].

3. Task and Data set description
We have taken the data set from the HASOC subtask on offensive language identification in Dravidian code-mixed text [11]. The challenges provided by the organizers are as follows.

Task 1: A binary classification problem with message-level labeling of offensive and non-offensive content in Malayalam code-mixed YouTube comments.

Task 2: Given Romanized Tanglish or Manglish Twitter or YouTube comments, the system must classify them as offensive or non-offensive.

Our team took part in Task 2, identifying offensive content in the Tanglish and Manglish data sets. According to the organizers, the Tanglish data is collected from Twitter tweets and comments on the Helo app, whereas the Manglish data is taken from YouTube comments [11]. A detailed description of the data set is provided in Table 1; both Tanglish and Manglish data contain ID, Tweet, and Label fields.

Table 1
Data set description

Data set   Split   Offensive   Not-offensive   Total
Tanglish   Train   1980        2020            4000
Tanglish   Test    475         465             940
Manglish   Train   1953        2047            4000
Manglish   Test    512         488             1000

4. Methodology
Our team made three submissions based on three different models. In the first two models, mBERT embeddings are passed through SVM and DNN classifiers, while in the third model, monolingual BERT is employed as a classifier. Each model follows the general architecture shown in Figure 1, consisting of three stages, each of which is discussed in the following subsections.

Figure 1: General architecture of the classifier model.

4.1. Data processing
The data set provided by the organizers contains much unwanted information. A few preprocessing steps were applied to both the text and label fields to make the data suitable for model building. Digits, special characters, hyperlinks, and Twitter user handles were removed from the data set because they did not help improve model performance. Furthermore, since the social media data provided by the organizers does not follow grammatical norms, lemmatization was performed to convert words to a usable base form; for example, ate, eaten, and eating were all converted to their base form eat. Text was also converted to lower case to eliminate redundant terms. All of this preprocessing was done with the NLTK toolkit [12]. The preprocessed data is then fed into a tokenizer, which splits each tweet into tokens; the mBERT tokenizer (https://huggingface.co/bert-base-multilingual-cased) is used for this purpose. Padding and masking were applied to handle variable-length sentences.
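To make this pipeline concrete, the following is a minimal sketch of the cleaning and tokenization steps described above, under stated assumptions: the 128-token maximum length, the `pos="v"` lemmatization setting, and the toy comment are illustrative choices, not taken from the shared-task code.

```python
import re

import nltk
from nltk.stem import WordNetLemmatizer
from transformers import AutoTokenizer

nltk.download("wordnet", quiet=True)  # lemma lookup data used by NLTK [12]

lemmatizer = WordNetLemmatizer()
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def clean_comment(text: str) -> str:
    """Drop handles, hyperlinks, digits, special characters; lower-case; lemmatize."""
    text = re.sub(r"@\w+|https?://\S+", " ", text)   # Twitter handles, hyperlinks
    text = re.sub(r"[^A-Za-z\s]", " ", text)         # digits and special characters
    tokens = [lemmatizer.lemmatize(w, pos="v") for w in text.lower().split()]
    return " ".join(tokens)                          # e.g. "eaten" -> "eat"

comments = ["@user Indha movie ku award tharlanA 10/10 https://t.co/x"]  # toy example
encoded = tokenizer(
    [clean_comment(c) for c in comments],
    padding="max_length",   # pad shorter comments
    truncation=True,        # clip longer ones
    max_length=128,         # assumed maximum sequence length
    return_tensors="pt",
)
# encoded["input_ids"] and encoded["attention_mask"] implement the padding
# and masking mentioned above.
```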
4.2. Feature extraction
To obtain contextual embeddings from code-mixed data, we used the multilingual Bidirectional Encoder Representations from Transformers (mBERT) model [13] in models 1 and 2, and monolingual BERT in model 3. The architecture of mBERT largely follows the original monolingual BERT [14]: 12 transformer blocks, 12 attention heads, and a hidden size of 768, so mBERT embeddings are 768-dimensional vectors. The model was trained with the same pre-training objectives as BERT, namely Masked Language Modeling (MLM) and Next Sentence Prediction; the only distinction is that multilingual BERT is trained on Wikipedia data from 104 different languages so that it can handle languages other than English. For our classification purposes, we draw embeddings only from the [CLS] token at the beginning of each sequence, because it represents the whole sentence.
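As an illustration of this feature-extraction stage, the snippet below pulls the 768-dimensional [CLS] vector from mBERT for a batch of preprocessed comments. It is a sketch assuming the plain last-layer hidden state is used as the sentence embedding; the example sentence is taken from Table 5.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")
mbert.eval()  # inference only: mBERT acts as a frozen feature extractor here

sentences = ["ithu vallathum nadakkumo shajan sir kettittu kothiyakunnu"]
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

with torch.no_grad():
    output = mbert(**batch)

# last_hidden_state has shape (batch, seq_len, 768); index 0 along the
# sequence axis is the [CLS] token, used as the whole-sentence embedding.
cls_embeddings = output.last_hidden_state[:, 0, :]   # (batch, 768)
features = cls_embeddings.numpy()                    # input to the SVM/DNN below
```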
4.3. Classification
Our proposed work experimented with three different classifiers: SVM and DNN classifiers on mBERT embeddings in models 1 and 2, and the pre-trained language model BERT in model 3. The intuition behind selecting these models is that they outperformed others, such as Logistic Regression (LR), Random Forest (RF), and Naive Bayes (NB), in our preliminary trials. The descriptions of these models are presented in the subsections that follow.

4.3.1. Traditional machine learning based classifier
We first experimented with a traditional machine learning algorithm, the Support Vector Machine (SVM), evaluated with ten-fold cross-validation. The model accepts mBERT embeddings as input and predicts each comment as offensive or non-offensive. Hyper-parameter values were determined through experimental trials; a parameter value of "1" together with the "lbfgs" solver produced the best results. The model was developed using Python's scikit-learn library [15].
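A minimal scikit-learn sketch of this classifier follows. The exact kernel is not stated above, and "lbfgs" is a solver name scikit-learn uses for logistic regression rather than for SVC, so a linear-kernel SVC with C = 1 is assumed here; the embedding and label files are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical placeholders: 768-d mBERT [CLS] embeddings and binary labels
# (1 = offensive, 0 = not offensive) produced by the earlier stages.
X = np.load("mbert_cls_train.npy")
y = np.load("train_labels.npy")

svm = SVC(kernel="linear", C=1.0)   # kernel assumed; C = 1 per the text above
scores = cross_val_score(svm, X, y, cv=10, scoring="f1_weighted")  # ten-fold CV
print(f"weighted F1: {scores.mean():.3f} +/- {scores.std():.3f}")

svm.fit(X, y)   # refit on the full training set for test-time predictions
```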
4.3.2. Deep neural network based model
We next experimented with a Deep Neural Network (DNN), the second model in our proposed methodology. The DNN comprises several dense layers designed to extract more significant features from the input embeddings; we used dense layers of 1000, 500, 100, and 50 neurons. Each dense layer is followed by a dropout of 0.4 to prevent overfitting; this rate was determined by grid search and kept constant throughout the experiments. To normalize the activations, we additionally employed batch normalization layers. The output of these layers is finally passed to a sigmoid layer for classification.
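The description above maps naturally onto a Keras stack. The sketch below reproduces the stated layer sizes, dropout rate, and batch normalization; the ReLU activation, Adam optimizer, loss, and training schedule are assumptions not specified in the text.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(input_dim: int = 768) -> keras.Model:
    """Dense 1000-500-100-50 stack with batch norm, dropout 0.4, sigmoid output."""
    model = keras.Sequential([layers.Input(shape=(input_dim,))])
    for units in (1000, 500, 100, 50):
        model.add(layers.Dense(units, activation="relu"))  # activation assumed
        model.add(layers.BatchNormalization())             # normalize activations
        model.add(layers.Dropout(0.4))                     # grid-searched rate
    model.add(layers.Dense(1, activation="sigmoid"))       # offensive vs. not
    model.compile(optimizer="adam",                        # assumed optimizer
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

dnn = build_dnn()
# dnn.fit(features, labels, validation_split=0.1, epochs=10, batch_size=32)
```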
4.3.3. Transformer model
In our last model, we experimented with a transformer-based language model, BERT. Transformer architectures are pre-trained on generic tasks such as language modeling and then fine-tuned for classification. The underlying model for our classification is bert-base-uncased (https://huggingface.co/transformers/model_doc/bert.html), provided by the BERT developers. We did not evaluate monolingual BERT with ten-fold cross-validation since it is more computationally expensive. Implementation details of all three proposed models are provided in our GitHub repository (https://github.com/shankarb14/dravidian-codemix).
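For completeness, below is a generic fine-tuning sketch of the kind this model entails. It is not the shared-task implementation (see the repository above); the learning rate, batch size, epoch count, and toy data are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # offensive vs. non-offensive head

texts = ["toy offensive comment", "toy harmless comment"]   # placeholder data
labels = torch.tensor([1, 0])
enc = tokenizer(texts, padding=True, truncation=True,
                max_length=128, return_tensors="pt")
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
    batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # assumed LR
model.train()
for _ in range(3):                                           # assumed epochs
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids,
                    attention_mask=attention_mask, labels=y)
        out.loss.backward()    # cross-entropy over the two labels
        optimizer.step()
```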
5. Experimental Results
The organizers evaluated the submitted models using the weighted F1 score. Among our proposed models, the best-performing monolingual BERT placed sixth for offensive content recognition on the Manglish data set and eleventh on the Tanglish data set. Tables 2 and 3 list the top-performing models with their weighted F1 scores for the Manglish and Tanglish data sets, respectively (the result of our proposed model is shown in bold).

Table 2
Top performing models on the Manglish data set

Team name        Precision   Recall   F1 score   Rank
AIML             0.776       0.762    0.766      1
MUCIC            0.764       0.760    0.762      2
HSU              0.744       0.730    0.735      3
IIIT Surat       0.752       0.727    0.734      4
IRLab            0.754       0.705    0.714      5
IIITD-ShankarB   0.715       0.693    0.700      6

Table 3
Top performing models on the Tanglish data set

Team name        Precision   Recall   F1 score   Rank
MUCIC            0.679       0.685    0.678      1
AIML 2           0.670       0.670    0.670      2
SSN_IT_NLP       0.685       0.688    0.668      3
ZYBank AI        0.671       0.676    0.654      4
IRLab            0.654       0.662    0.650      5
IIITD-ShankarB   0.599       0.568    0.573      11

Among our proposed models, BERT outperformed the other classifiers, reaching 70% accuracy on the Manglish data set and 57% accuracy on the Tanglish data set. Table 4 compares the results of our three models. We trained the proposed models on the Tanglish training set of 4000 comments and tested them on 940 test comments; for the Manglish data set, 4000 training comments and 1000 test comments were used.

Table 4
Comparative results of proposed models

Model name        Data set    Accuracy (%)   F1-score OFF (%)   F1-score NOT (%)
mBERT+SVM         Malayalam   53             41                 61
mBERT+DNN         Malayalam   55             43                 63
BERT classifier   Malayalam   70             58                 77
mBERT+SVM         Tamil       50             43                 54
mBERT+DNN         Tamil       51             44                 55
BERT classifier   Tamil       57             53                 60

5.1. Error analysis
We investigated the behavior of the proposed models on sample test sentences to evaluate their performance. Our experimental observations show that the best-performing model, the monolingual BERT classifier, accurately classified all of the sample test sentences, whereas the multilingual models mBERT+SVM and mBERT+DNN could not correctly classify test samples 3 and 2, respectively. Table 5 summarizes these findings.

Table 5
Test case results for sample test sentences

Tweet                                              Data set    mBERT+SVM   mBERT+DNN   BERT   Target
athe beharyku deputationil pokam pinarai           Malayalam   NOT         NOT         NOT    NOT
vijayanu chinayilum pokam.
ithu vallathum nadakkumo shajan sir kettittu       Malayalam   NOT         OFF         NOT    NOT
kothiyakunnu.
Indha movie ku award tharlanA avanga               Tamil       OFF         NOT         OFF    OFF
mansanay illa bro.
kritheeck Kookaburra en unaku enachu? Cbsc ah??    Tamil       NOT         NOT         NOT    NOT

6. Conclusion and future enhancement
In this work, we presented the models submitted by our team IIITD-ShankarB for offensive content identification in the shared task "Dravidian-CodeMixed HASOC-2021". Our proposed work experimented with three distinct models: a machine learning based model, a Deep Neural Network model, and a transformer-based language model. Our submission was among the top-performing models, ranking sixth on the Manglish data set and eleventh on the Tanglish data set. In future work, the efficiency of the proposed models could be improved by including domain-specific embeddings.

References
[1] S. A. Chowdhury, H. Mubarak, A. Abdelali, S.-g. Jung, B. J. Jansen, J. Salminen, A multi-platform Arabic news comment dataset for offensive language detection, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 6203–6212.
[2] C. N. d. Santos, I. Melnyk, I. Padhi, Fighting offensive language on social media with unsupervised text style transfer, arXiv preprint arXiv:1805.07685 (2018).
[3] H. Mubarak, K. Darwish, W. Magdy, Abusive language detection on Arabic social media, in: Proceedings of the First Workshop on Abusive Language Online, 2017, pp. 52–56.
[4] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM Computing Surveys (CSUR) 51 (2018) 1–30.
[5] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 11, 2017.
[6] H. Al-Khalifa, W. Magdy, K. Darwish, T. Elsayed, H. Mubarak (Eds.), Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, 2020.
[7] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 Task 12: Multilingual offensive language identification in social media (OffensEval 2020), arXiv preprint arXiv:2006.07235 (2020).
[8] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages, in: Proceedings of the 11th Forum for Information Retrieval Evaluation, 2019, pp. 14–17.
[9] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. Kumar M, T. Mandl, P. K. Kumaresan, R. Ponnusamy, H. R L, J. P. McCrae, E. Sherly, Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 133–145. URL: https://aclanthology.org/2021.dravidianlangtech-1.17.
[10] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the HASOC track at FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi, English and German, in: Forum for Information Retrieval Evaluation, 2020, pp. 29–32.
[11] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan, P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the HASOC-DravidianCodeMix shared task on offensive language detection in Tamil and Malayalam, in: Working Notes of FIRE 2021 – Forum for Information Retrieval Evaluation, CEUR, 2021.
[12] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc., 2009.
[13] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, arXiv preprint arXiv:1906.01502 (2019).
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, The Journal of Machine Learning Research 12 (2011).