COMPARATIVE ANALYSIS FOR OFFENSIVE LANGUAGE IDENTIFICATION OF TAMIL TEXT USING SVM AND LOGISTIC CLASSIFIER

Prabhu Ram N, Meeradevi T, Vibin Mammen Vinod, Gothainayaki A, Anusha S and Agalya T
Electronics and Communication Engineering, Kongu Engineering College, Erode, Tamil Nadu, India

FIRE 2021, Forum for Information Retrieval Evaluation, December 13-17, 2021, India
prabhuramnphd@gmail.com (P. Ram N); meeradevi.ece@kongu.edu (Meeradevi T); vibin.ece@kongu.edu (V. M. Vinod); gothainayakia.18ece@kongu.edu (Gothainayaki A); anushas.18ece@kongu.edu (Anusha S)
ORCID: 0000-0003-2769-9790 (P. Ram N)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Social media platforms such as Twitter, Facebook and YouTube enable fast communication between people. The posts, comments and reactions on these platforms are largely code-mixed, and their content may be offensive or non-offensive. It is therefore necessary to classify YouTube comments and posts into offensive and non-offensive labels. Because offensive posts can provoke strong reactions against individuals or groups in society, governments have a responsibility to identify such content before it reaches a larger audience. In India, multilingual users frequently write code-mixed posts on social media, which makes automatic offensive-text classification difficult. In this work, the Dravidian code-mixed dataset is used to train machine learning models that label a comment as offensive or non-offensive. The text is transformed into numerical data based on relative term occurrence in the training and test datasets using the TF-IDF method. Since an imbalanced dataset can bias the model towards one class, the training data is turned into a balanced dataset using the SMOTE method. SVM and logistic classifiers are trained, and their F1-scores are analysed. The results show that predictions on the balanced dataset are better than predictions on the imbalanced dataset.

Keywords
Multilingual, SMOTE, TFIDF, SVM, Logistic classifier, NLP, Machine Learning

1. Introduction
As of 2021 there are about 3.78 billion social media users worldwide. Social media makes communication faster and easier and connects people across the world. Platforms such as Facebook, YouTube and Twitter give everyone the freedom to express opinions in public, but they also allow bad actors to spread fake news and offensive content. Offensive language on social platforms is one of the most harmful activities, and people need to be protected from such hateful behaviour. The main challenges for social media platforms are identifying offensive text content and deleting the problematic posts. Research on safety and security in social media has grown substantially in the last decade, and in many countries, such as the United Kingdom, Canada and France, these activities are punishable by law [1].

Social networks have introduced policies to restrict offensive speech targeting people on the basis of race, gender and similar attributes. A subtle hate-speech sentence may be judged as hateful or not depending on the person who interprets it. Social media text is often multilingual or code-mixed.
Code-mixing is the phenomenon of embedding a second language into the first language, or of fitting foreign-language words into the structure of the native language; for example, Tamil words written in English (Roman) script. Multilingual text, in contrast, combines multiple native languages in a single sentence, each written in its own script, for example Tamil and English words appearing together in their native scripts. Natural Language Processing (NLP) provides the techniques to address this problem. NLP is a field of artificial intelligence concerned with understanding and analysing the context of human language.

2. Related Works
Hate speech identification through sentiment analysis is an active research area in Natural Language Processing. Solutions follow either a machine learning approach or a lexicon-based approach. The machine learning approach involves collecting annotated data, pre-processing the text, transforming it into input vectors through a vectorisation technique and training a model to classify it. The lexicon-based approach is widely used in sentiment analysis, where sentiment scores are collected from resources such as WordNet and SentiWordNet and used for classification; it requires no labelling, which is otherwise a time-consuming process.

For hate speech identification on monolingual English datasets [2, 3] and on code-mixed Tamil and Malayalam datasets, feature extraction has been carried out with methods such as the hashing vectoriser [2], count vectoriser, TF-IDF (Term Frequency-Inverse Document Frequency) [3, 4, 5, 6], word embeddings, customised word embeddings, CBOW, skip-gram, word2vec, doc2vec and fastText [7]. The TF-IDF and count vectorisers are the most commonly used transformations that are not neural-network based. TF-IDF performs well on a smaller vocabulary, and on larger datasets more features can be retained by modifying the IDF (Inverse Document Frequency) feature size with minimal computation time [8]. Neural-network-based vectorisation methods such as word2vec, doc2vec and fastText have been used on code-mixed datasets, and among them fastText performs better than the other neural vectorisation methods [9]. Neural classification architectures such as sub-word-level LSTM, hierarchical LSTM, BERT, XLM-RoBERTa, LSTM, GRU and XLNet [10, 11, 12] have also been applied. Among machine learning classifiers, Support Vector Machine (SVM), Logistic Regression (LR), Random Forest Classifier (RFC) [3, 4, 13, 14, 15, 16] and K-Nearest Neighbour (KNN) [17] have been used; SVM performs better on code-mixed Tamil data than the other machine learning models. Deep learning models such as RNN [11, 18] and MLP have also been used to enhance classification performance [19, 20]. The predictive models are evaluated using accuracy, F1-score, precision and recall [14, 15, 17]. For hate speech identification on code-mixed data, trained models show reduced prediction accuracy because the dataset is imbalanced.

Section 3 describes the methodology. Section 4 describes the experimental setup for training the SVM and logistic classifiers under different hyper-parameter configurations, as well as the conversion of the imbalanced dataset into a balanced dataset using the SMOTE method. Section 5 presents the results and discussion, and Section 6 concludes the paper.

3. Methodology
The flow of the methodology is described in detail in the following subsections.
3.1. Text Pre-processing
Pre-processing involves the removal of special characters such as reaction smileys and punctuation using a standard package. The vocabulary size is reduced after the removal of special characters. In English, tokens can be reduced to their base forms by stemming and lemmatisation; however, such processes are not readily available for Dravidian languages. The stream of text is then converted into word tokens (unigrams, bigrams and, more generally, n-grams) by the process of tokenisation.

3.2. Vectorisation
The pre-processed text is vectorised. The vectorisation method used is TF-IDF (Term Frequency-Inverse Document Frequency), which represents the text as equivalent numerical data and gives higher weight to words that are unique to a document.

3.3. Training Model
The logistic regression and SVM models are trained on the tri-gram TF-IDF vectors of the training dataset. The aim of the task is to classify a text as offensive or not-offensive. Logistic Regression (LR) and Support Vector Machine (SVM) are supervised machine learning algorithms used for classification and regression, and they are well suited to binary classification.

3.4. Making a Balanced Dataset
A dataset may be balanced or imbalanced. A balanced dataset contains an equal number of offensive and not-offensive labels, whereas an imbalanced dataset has a surplus of one of the labels; for example, a dataset with 1153 offensive and 4724 not-offensive samples is imbalanced. Such imbalance can cause the model to fit the majority class and give poor predictions for the minority class. There are several methods to turn an imbalanced dataset into a balanced one:

• Oversampling
• Undersampling
• SMOTE (Synthetic Minority Oversampling Technique)

Oversampling duplicates actual minority-class samples, while undersampling removes actual majority-class samples; neither approach adds new information to the dataset. SMOTE synthetically generates new minority-class feature vectors [21, 22, 23]. The balanced dataset is generated according to Algorithm 1.

Algorithm 1: SMOTE
 1: procedure SMOTE(X, y)                            ◁ X: data array, y: target array
 2:     k_neighbors ← 5                              ◁ number of nearest neighbours
 3:     n_jobs ← 4                                   ◁ number of cores used during execution
 4:     n_sample ← Count(X)                          ◁ number of input samples
 5:     min ← Count(y_majority) − Count(y_minority)  ◁ number of synthetic samples required
 6:     step ← random(0, 1)                          ◁ scalar interpolation factor
 7:     i ← 1
 8:     while i ≤ n_sample and X_i ∈ minority class and min ≠ 0 do
 9:         X_nn ← a randomly chosen sample among the k_neighbors nearest minority-class neighbours of X_i
10:         X_new,i ← X_i + step × (X_nn − X_i)      ◁ synthetic sample; X_i, X_nn ∈ minority class
11:         min ← min − 1
12:         i ← i + 1
13:     end while
14:     return X_new                                 ◁ augmented data
15: end procedure

3.5. Evaluating Model
The trained model is evaluated on the test dataset. The metrics used are accuracy, F1-score, precision and recall. Accuracy alone is insufficient to judge whether a model is well fitted, because the model may be biased towards certain classes; such bias can be identified using the F1-score.
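To make Sections 3.1 and 3.2 concrete, the following is a minimal sketch of the pre-processing and tri-gram TF-IDF steps using the open-source scikit-learn package mentioned in Section 4. It is an illustration only: the file name, the column names ("text", "label") and the exact cleaning rules are assumptions, not the code used to produce the reported results. In its textbook form, the TF-IDF weight of a term t in a document d over a corpus of N documents is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where df(t) is the number of documents containing t; scikit-learn applies a smoothed variant of this formula.

import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_text(text: str) -> str:
    """Remove smileys, punctuation and extra whitespace (Section 3.1)."""
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation and symbol characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text.lower()

# Hypothetical file and column names for the code-mixed Tamil training split.
train_df = pd.read_csv("tamil_offensive_train.tsv", sep="\t", names=["text", "label"])
train_df["text"] = train_df["text"].apply(clean_text)

# Tri-gram TF-IDF vectorisation (Section 3.2): unigram, bigram and trigram tokens.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_df["text"])
y_train = train_df["label"]
print(X_train.shape)  # (number of samples, vocabulary size)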
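The SMOTE step of Algorithm 1 in Section 3.4 corresponds to the implementation in the imblearn package [21] that is used in Section 4. A minimal sketch of balancing a TF-IDF training matrix X_train with labels y_train (the variable names and the random_state value are assumptions for illustration) could look like this; k_neighbors and n_jobs follow the values listed in Algorithm 1.

from collections import Counter

from imblearn.over_sampling import SMOTE

# Class counts before balancing, roughly 4724 not-offensive vs. 1153 offensive for the
# training split in Table 1 (the exact label strings depend on the dataset files).
print("Before SMOTE:", Counter(y_train))

# k_neighbors = 5 nearest neighbours and n_jobs = 4 cores, as listed in Algorithm 1.
smote = SMOTE(k_neighbors=5, n_jobs=4, random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_train, y_train)

# After resampling, both classes have the majority-class count (4724 each in Table 1).
print("After SMOTE:", Counter(y_balanced))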
4. Experimental Setup
The dataset provided for the HASOC-Dravidian CodeMix FIRE 2021 offensive language detection task [24] is split into training and testing samples and is described in Table 1. Table 2 describes the number of known vocabulary items in the training set and the number of unknown vocabulary items in the cross-validation and test datasets with respect to the training vocabulary. The samples are labelled as offensive, not-offensive or not-Tamil. Because the "not-Tamil" label occurs only a few times in the given dataset, those samples are dropped in the text pre-processing stage. 30% of the training samples are set aside as the cross-validation dataset. Since the samples are imbalanced, they are converted into balanced training samples using the SMOTE method; the imblearn package for Python is used to perform SMOTE [21].

Table 1
HASOC-Dravidian-CodeMix-FIRE 2021 dataset

                               Training set              Testing set
                               Actual     Regenerated
Not Offensive class            4724       4724           536
Offensive class                1153       4724           118
Not Tamil                      3          -              -
Average length of sentence     16         -              17
Maximum length of sentence     113        -              164
Minimum length of sentence     1          -              2

Table 2
Description of the actual dataset based on vocabulary size

Vectorisation                                                Uni-gram        Bi-gram         Tri-gram
Number of vocabulary items in the training set               19208           44604           48345
                                                             CV*    Test     CV*    Test     CV*    Test
Unknown vocabulary items with respect to the training set    14883  16722    41491  43033    46842  47638
Not-offensive samples with at least one unknown item         537    338      120    62       72     31
Offensive samples with at least one unknown item             13     4        6      0        6      0
* Cross-validation dataset

The training samples are used to train a logistic classifier and an SVM classifier, and both models are trained with the open-source Python package sklearn. The parameter values used for the logistic classifier and the SVM classifier are tabulated in Table 3. The parameter C is the inverse of the regularisation strength: a larger C makes the SVM classifier minimise the number of misclassified samples, thereby producing a smaller margin at the decision boundary.¹

Table 3
Parameters used in the logistic classifier and SVM

                        C    Max. iterations    Kernel
Logistic classifier     1    500                Sigmoid
SVM                     1    No limit           Linear

¹ https://github.com/GothainayakiA/Hatesppech.git

5. Results and Discussions
The trained logistic and SVM classifier models are evaluated on the labelled test samples using accuracy, precision, recall and the F1-score averaged with support weighting, as reported in Table 4 and Table 5. The macro-average metrics are computed per class and averaged without taking class imbalance into account, whereas the weighted-average metrics weight each class by its number of true instances.

Table 4
Classification report of the logistic classifier model using TF-IDF vectors

                        Imbalanced dataset                Balanced dataset
                        Precision  Recall   F1-score      Precision  Recall   F1-score
Not Offensive class     0.820      1.000    0.901         0.833      0.910    0.873
Offensive class         0.000      0.000    0.000         0.333      0.203    0.253
Accuracy                                    0.820                             0.783
Macro average           0.410      0.500    0.450         0.586      0.557    0.563
Weighted average        0.672      0.820    0.738         0.747      0.783    0.761

Table 5
Classification report of the SVM classifier model using TF-IDF vectors

                        Imbalanced dataset                Balanced dataset
                        Precision  Recall   F1-score      Precision  Recall   F1-score
Not Offensive class     0.823      1.000    0.903         0.837      0.950    0.890
Offensive class         1.000      0.025    0.050         0.413      0.161    0.232
Accuracy                                    0.824                             0.807
Macro average           0.912      0.513    0.476         0.625      0.555    0.561
Weighted average        0.855      0.824    0.749         0.761      0.807    0.771
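As a reference for how figures of the kind shown in Tables 4 and 5 can be produced, the sketch below continues the earlier sketches: it configures the two classifiers with the Table 3 settings and prints a scikit-learn classification report, whose per-class, accuracy, macro-average and weighted-average rows mirror the layout of Tables 4 and 5. The test-set handling (a test_df loaded like train_df, and reuse of the fitted vectoriser) is an assumption for illustration, not the exact code behind the reported numbers.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Continuing the earlier sketches: X_balanced/y_balanced come from SMOTE, and the test
# texts are vectorised with the already-fitted TF-IDF vectoriser (assumed names).
X_test = vectorizer.transform(test_df["text"].apply(clean_text))
y_test = test_df["label"]

# Parameters from Table 3: C = 1 for both models, 500 iterations for the logistic
# classifier, and a linear kernel with no iteration limit for the SVM.
logistic_clf = LogisticRegression(C=1.0, max_iter=500)
svm_clf = SVC(C=1.0, kernel="linear")  # max_iter=-1 (no limit) is the sklearn default

for name, clf in [("Logistic classifier", logistic_clf), ("SVM classifier", svm_clf)]:
    clf.fit(X_balanced, y_balanced)
    y_pred = clf.predict(X_test)
    # classification_report prints per-class precision/recall/F1 together with the
    # accuracy, macro-average and weighted-average rows used in Tables 4 and 5.
    print(name)
    print(classification_report(y_test, y_pred, digits=3))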
In Table 4, the F1-score of the offensive class improves from 0.000 to 0.253, and the overall weighted-average F1-score of the logistic classifier rises from 73.8% to 76.1% when the dataset is balanced with SMOTE. Similarly, for the SVM classifier shown in Table 5, the F1-score of the offensive class improves from 0.050 to 0.232 and the overall weighted-average F1-score rises from 74.9% to 77.1%. As Table 2 shows, the number of unknown vocabulary items in the cross-validation and test sets is large relative to the training vocabulary; this leads to misclassification and limits the models to average accuracy.

6. Conclusion
The task of identifying offensive language in the dataset provided by HASOC-Dravidian CodeMix FIRE 2021 [24] is addressed using TF-IDF vectorisation together with logistic classifier and SVM classifier models. It is observed that models trained on imbalanced samples give predictions biased towards one specific class. To reduce this bias, the SMOTE oversampling technique is used to generate a new balanced labelled dataset from the existing one, and the logistic and SVM classifiers are trained on it. The weighted-average F1-score improves by 2.3 and 2.2 percentage points for the logistic classifier and the SVM classifier respectively. However, unknown vocabulary items still occur in the cross-validation and test sets, so context-based word representations could be applied to handle unknown vocabulary. In future work, SMOTE can be combined with pre-trained embeddings such as word2vec and fastText, as well as custom-trained word vectorisation models, and the classifiers can be replaced with sequential neural networks such as RNN, LSTM and GRU.

References
[1] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 11, 2017.
[2] S. Kaur, P. Kumar, P. Kumaraguru, Automating fake news detection system using multi-level voting model, Soft Computing 24 (2020) 9049–9069. URL: https://doi.org/10.1007/s00500-019-04436-y. doi:10.1007/s00500-019-04436-y.
[3] A. Muneer, S. M. Fati, A comparative analysis of machine learning techniques for cyberbullying detection on Twitter, Future Internet 12 (2020). URL: https://www.mdpi.com/1999-5903/12/11/187. doi:10.3390/fi12110187.
[4] V. Pathak, M. Joshi, P. Joshi, M. Mundada, T. Joshi, KBCNMUJAL@HASOC-Dravidian-CodeMixFIRE2020: Using machine learning for detection of hate speech and offensive code-mixed social media text, CEUR Workshop Proceedings 2826 (2020) 351–361.
[5] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text (2020) 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[6] S. Swaminathan, H. K. Ganesan, R. Pandiyarajan, HRS-TECHIE@Dravidian-CodeMix and HASOC-FIRE2020: Sentiment analysis and hate speech identification using machine learning, deep learning and ensemble models, CEUR Workshop Proceedings 2826 (2020) 241–252.
[7] A. V. Mandalam, Y. Sharma, Sentiment analysis of Dravidian code mixed data, Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages (2021) 46–54. URL: https://www.aclweb.org/anthology/2021.dravidianlangtech-1.6.
[8] S. Manochandar, M. Punniyamoorthy, Scaling feature selection method for enhancing the classification performance of Support Vector Machines in text mining, Computers and Industrial Engineering 124 (2018) 139–156. URL: https://doi.org/10.1016/j.cie.2018.07.008. doi:10.1016/j.cie.2018.07.008.
[9] K. Sreelakshmi, B. Premjith, K. P. Soman, Detection of hate speech text in Hindi-English code-mixed data, Procedia Computer Science 171 (2020) 737–744. URL: https://doi.org/10.1016/j.procs.2020.04.080. doi:10.1016/j.procs.2020.04.080.
[10] T. Y. Santosh, K. V. Aravind, Hate speech detection in Hindi-English code-mixed social media text, ACM International Conference Proceeding Series (2019) 310–313. doi:10.1145/3297001.3297048.
[11] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. K. M, T. Mandl, P. K. Kumaresan, R. Ponnusamy, R. L. Hariharan, J. P. McCrae, E. Sherly, Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada (2021) 133–145.
[12] S. Banerjee, A. Jayapal, S. Thavareesan, NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment analysis of code-mixed Dravidian text using XLNet, 2020.
[13] N. P. Ram, V. M. Vinod, V. Mekala, M. Manimegalai, A fast and energy efficient path planning algorithm for offline navigation using SVM classifier, International Journal of Scientific and Technology Research 9 (2020) 2082–2086.
[14] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed dataset for sentiment analysis and offensive language detection, Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media (2020) 54–63. URL: https://www.aclweb.org/anthology/2020.peoples-1.6.
[15] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English (2020) 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
[16] N. P. Ram, K. Sandhiya, V. M. Vinod, V. Mekala, Offline navigation: GPS based assisting system in Sathuragiri forests using machine learning, in: 2018 International Conference on Intelligent Computing and Communication for Smart World (I2C2SW), 2018, pp. 326–331. doi:10.1109/I2C2SW45816.2018.8997523.
[17] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text (2020) 202–210. URL: http://arxiv.org/abs/2006.00206. doi:10.5281/zenodo.4015253. arXiv:2006.00206.
[18] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on sentiment analysis for Dravidian languages in code-mixed text, ACM International Conference Proceeding Series (2020) 21–24. doi:10.1145/3441501.3441515.
[19] Z. Al-Makhadmeh, A. Tolba, Automatic hate speech detection using killer natural language processing optimizing ensemble deep learning approach, Computing 102 (2020) 501–522. URL: https://doi.org/10.1007/s00607-019-00745-0. doi:10.1007/s00607-019-00745-0.
[20] A. Al-Hassan, H. Al-Dossari, Detection of hate speech in social networks: a survey on multilingual corpus (2019) 83–100. doi:10.5121/csit.2019.90208.
[21] G. Lemaître, F. Nogueira, C. K. Aridas, Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning, Journal of Machine Learning Research 18 (2017) 1–5. URL: http://jmlr.org/papers/v18/16-365.html.
[22] G. Douzas, F. Bacao, Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE, Information Sciences 501 (2019) 118–135. URL: https://doi.org/10.1016/j.ins.2019.06.007. doi:10.1016/j.ins.2019.06.007.
[23] S. Kiyohara, T. Miyata, T. Mizoguchi, Prediction of grain boundary structure and energy by machine learning 18 (2015) 1–5. URL: http://arxiv.org/abs/1512.03502. doi:10.1126/sciadv.1600746. arXiv:1512.03502.
[24] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan, P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the HASOC-DravidianCodeMix shared task on offensive language detection in Tamil and Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.