<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine Learning Models for Hate Speech Identification in Marathi Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Disha Gajbhiye</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Swapnil Deshpande</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prerna Ghante</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhijeet Kale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deptii Chaudhari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hope Foundation's International Institute of Information Technology</institution>
          ,
          <addr-line>Hinjawadi, Pune</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Hate speech content has become a significant issue in today's world. Hate speech detection is the automated task of detecting textual content that contains discriminatory language directed at a person or group based on who they are, such as their race, gender, or caste. In this paper, we discuss the models submitted by our team, Mind Benders, for Marathi subtask A of "Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC)" at the Forum for Information Retrieval Evaluation. Training and test datasets in the Marathi language, containing 1874 and 625 tweets, respectively, were shared by the HASOC organizers. Using these datasets, we propose an approach to automatically classify the tweets into two categories: "NOT" (Non-Hate-Offensive) and "HOF" (Hate and Offensive). The classification models we developed are applied to the test dataset to predict the category of each test tweet.</p>
      </abstract>
      <kwd-group>
        <kwd>Logistic Regression</kwd>
        <kwd>Random Forest Classifier</kwd>
        <kwd>TF-IDF Vectorizer</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The use of social media has increased in recent years. It plays a significant role in forming and
shaping people's views on various issues. Users tend to send hateful and offensive messages to
a person or community on social media platforms, leading to heated debates.</p>
      <p>To make social networking sites a friendly knowledge-sharing environment, there is an acute
need for an automated hate speech detection system that decides whether a post is hateful without manual review.</p>
      <p>Hate speech detection classifies tweets into two categories: hate speech or non-hate speech. The
number of hate and non-hate tweets had to be balanced as the initial stage in developing our
model. Our classification step involved two approaches: Random Forest and Logistic
Regression.</p>
      <p>Random forest is a supervised learning technique used for both classification and regression
problems in ML. It builds decision trees on different samples and takes their majority vote for
classification and their average for regression.</p>
      <p>A machine learning approach called logistic regression is used to forecast the likelihood of
a target variable. It’s a method for predicting a categorical dependent variable from a set of
independent variables.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Several studies on the automatic detection of hate speech and offensive
content have been published. Kulkarni et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] presented a Marathi tweet-based sentiment analysis dataset and obtained their best accuracy using IndicBERT
and CNN with Indic fastText word embeddings. This dataset is expected to play a crucial role in advancing
NLP research for the Marathi language.
      </p>
      <p>
        Aluru et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] applied classification techniques such as CNN-GRU, BERT, mBERT, and translation-based
approaches to hate speech detection. Pathak et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] applied a Support Vector Classifier, Multinomial
Naive Bayes, Logistic Regression, and a Random Forest Classifier with n-gram features for text classification. Founta et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] worked
on a deep learning architecture that combines a text classification network with a metadata network
and trains the two classification paths as a combined network. The related studies show that
significant work has been done on detecting hate speech in many Indian languages.
      </p>
      <p>
        The system developed by Khandelwal et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is based on n-grams, CBOW, and
reference tokens. This system detects abusive language in English from social media. Another
work by Lakshmi et al. [6] detects offensive content in English and Kannada
social media text. Sutejo et al. [7] used word n-grams and Long Short-Term Memory (LSTM) networks
to determine sentiment in Indonesian-language text. Jiang et al. [8] used two
hate speech datasets published on Kaggle that contain 1000 unique labeled tweets.
They used multiple classifiers, such as Logistic Regression and Support
Vector Machines (SVMs), for classification.
      </p>
      <p>Kovács et al. [9] worked on text preprocessing methods and the cross-validation method
used to train and evaluate models. Chaitanya et al. [10] worked with the Natural Language Toolkit and
Word2vec, combining the Continuous Bag-of-Words (CBOW) and Skip-Gram algorithms.
Gaydhani et al. [11] employed several techniques, such as SVM, Logistic Regression, and Naive
Bayes, to classify tweets as offensive or non-offensive. Mandl et al. [12] presented an overview
of the tasks and the results of the HASOC track at FIRE 2020.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Definition</title>
      <p>We propose a coarse-grained binary classification to classify tweets into two classes: Hate and
Offensive (HOF) and Non-Hate-Offensive (NOT).</p>
      <p>Non-Hate-Offensive (NOT): the post does not contain any hate speech, profane, or offensive
content. Hate and Offensive (HOF): the post contains hate, offensive, or profane content.</p>
      <p>The best-performing features are selected by extracting language-specific and language-independent
characteristics of the given dataset. The approach applied for the classification of this text data
is explained below.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>We have chosen the task of identifying offensive and non-offensive content in the Marathi dataset
released for the HASOC shared task, as discussed above. The dataset consists of CSV files of comments, and all
comments are in Marathi. The training dataset has three columns: text
ID, text, and label. The label column takes the value HOF, indicating offensive
text, or NOT, indicating non-offensive text. The training file contains
1874 comments and is used for the experimental work of training the
machine learning algorithms. The test dataset has only two columns, text
ID and text; the label column is absent. After training in the first phase, the
machine learning algorithms have to predict the labels of the respective test tweets. The test dataset
contains 625 comments. Gaikwad et al. [13] described the task's data in their work on
a model for the Marathi language. Modha et al. [14] have given an
overview of the results and findings of HASOC 2021. Table 1 presents the statistics
of the training and test datasets for the Marathi language.</p>
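        <p>As a minimal sketch of the file layout described above, the snippet below reads a small in-memory CSV and counts the labels. The header names (text_id, text, label) and the three placeholder rows are assumptions made for illustration; they are not taken from the HASOC files.</p>
        <preformat>
```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the training CSV. The header names and the
# rows are illustrative assumptions, not the actual HASOC data.
SAMPLE_CSV = """text_id,text,label
1,पहिले उदाहरण,NOT
2,दुसरे उदाहरण,HOF
3,तिसरे उदाहरण,NOT
"""


def load_rows(handle):
    """Read every CSV row into a dict keyed by the header names."""
    return list(csv.DictReader(handle))


rows = load_rows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(row["label"] for row in rows)
print(len(rows), dict(label_counts))
```
        </preformat>
        <p>For a real file, the same function could be called on a handle opened with encoding="utf-8" so that the Devanagari text is decoded correctly.</p>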
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Methodology</title>
        <p>A supervised machine learning approach is used in the experimental work. While building the
model, data preprocessing is a vital step. In NLP, the first step is to preprocess the data, i.e.,
to remove unnecessary noise from the textual content. This is followed by encoding the text
into numeric vectors, as machine learning algorithms need data in numeric form. This is done using
encoding techniques such as Bag-of-Words, n-grams, TF-IDF, Word2Vec, etc. In our analysis, we
have implemented the TF-IDF feature extraction technique.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Data preprocessing</title>
          <p>The data is usually in natural human format, that is, in sentences or paragraphs. Hence,
before analyzing it, the text needs to be transformed and cleaned so that it can be
processed by the machine. The steps of the data preprocessing
phase are as follows:
• Removal of leading and trailing spaces: These are unnecessary whitespace characters located at
both ends of the line, removed using the Python strip() method.
• Removal of irrelevant characters (numbers and punctuation): In our analysis, the English
alphabet and numbers, Marathi numbers, and punctuation are irrelevant. Thus, they are
removed to simplify the text content.
• Removal of URLs and emojis: URLs and emojis are also not needed in our analysis; hence
they are removed from the text using regular expressions.
• Removal of stopwords: A custom-made Marathi stopword list is defined for removing
stopwords, which are commonly used words that have no real value in the analysis.</p>
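          <p>The four steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the exact code used in our experiments: the stopword set is a short placeholder rather than the custom list, and the character filter keeps only Devanagari letters and signs, which is one possible way to drop English characters, digits, punctuation, and emojis in a single pass.</p>
          <preformat>
```python
import re

# Placeholder stopwords for illustration only; the custom Marathi
# stopword list used in the paper is not reproduced here.
MARATHI_STOPWORDS = {"आणि", "आहे", "हे", "तो", "ती"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Keep Devanagari letters and signs (U+0900-U+0963, U+0971-U+097F) and
# whitespace. Everything else is dropped: ASCII letters and digits,
# punctuation, emojis, danda marks and Devanagari digits (U+0964-U+096F).
NON_MARATHI_PATTERN = re.compile(r"[^\u0900-\u0963\u0971-\u097F\s]")


def preprocess(text, stopwords=MARATHI_STOPWORDS):
    """Clean one tweet and return the remaining tokens joined by spaces."""
    text = text.strip()                        # leading and trailing spaces
    text = URL_PATTERN.sub(" ", text)          # URLs
    text = NON_MARATHI_PATTERN.sub(" ", text)  # irrelevant characters, emojis
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("  हे खूप छान आहे!! 123 https://example.com 😀  "))
```
          </preformat>
          <p>URLs are removed before the character filter so that fragments of a link are not left behind as stray tokens.</p>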
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Features Extraction</title>
          <p>For feature extraction, we applied the TF-IDF technique, which is used to identify the most important
words. TF measures the frequency of a word in a document, and IDF measures how unique
the word is across documents. To convert the sentences into vectors, the term frequency is multiplied by
the inverse document frequency. This is done with scikit-learn, whose TF-IDF vectorizer
is used to extract features from the collection of documents. It produces a matrix of
numeric values with one row per document.</p>
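          <p>To make the multiplication concrete, the sketch below computes the textbook form of TF-IDF in plain Python, with term frequency as the relative count in a document and IDF as log(N / df). It is an illustration only. The scikit-learn vectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ from these. The placeholder documents are hypothetical, and every document is assumed to be non-empty.</p>
          <preformat>
```python
import math
from collections import Counter


def tf_idf_matrix(documents):
    """Return (vocabulary, matrix) with one TF-IDF row per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({token for doc in tokenized for token in doc})
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocabulary}
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # Term frequency (relative count) multiplied by the IDF weight.
        matrix.append([counts[token] / len(doc) * idf[token] for token in vocabulary])
    return vocabulary, matrix


# Hypothetical, already preprocessed placeholder documents.
docs = ["film good", "film bad", "film good good"]
vocabulary, matrix = tf_idf_matrix(docs)
for doc, row in zip(docs, matrix):
    print(doc, [round(value, 3) for value in row])
```
          </preformat>
          <p>A token that occurs in every document, such as "film" here, receives an IDF of zero and therefore contributes nothing to any row.</p>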
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Classifier Models</title>
          <p>Two classifier models were implemented, namely Logistic Regression and
Random Forest Classifier. The extracted feature set is used in the training phase. Around 70%
of the observations from the training dataset are used for fitting the model, and the
remaining portion is held out to make predictions and measure the model's accuracy. We have
used classification accuracy to evaluate the performance of these classifiers.
For this purpose, the parameter values with the best performance are found by varying the
parameters of each classifier.</p>
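          <p>A minimal scikit-learn sketch of this setup is given below. It is not the exact experimental code: the ten English placeholder texts and their labels are hypothetical, and the values of random_state, max_iter, and n_estimators, the stratified split, and fitting the vectorizer on the training portion only are all assumptions made for the example.</p>
          <preformat>
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical placeholder corpus; the real input is the preprocessed tweets.
texts = [
    "kind friendly helpful", "nice pleasant day", "happy good news",
    "lovely kind words", "good helpful people",
    "stupid ugly idiot", "hate awful people", "idiot hate you",
    "awful stupid words", "ugly hate news",
]
labels = ["NOT"] * 5 + ["HOF"] * 5

# Hold out 30% of the labelled data, keeping the class ratio in both parts.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Fit TF-IDF on the training portion only, then reuse it on the held-out part.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
accuracies = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    accuracies[name] = accuracy_score(test_labels, predictions)
    print(name, accuracies[name])
```
          </preformat>
          <p>On such a tiny corpus the printed accuracies carry no meaning; the point is only the order of the steps: split, vectorize, fit, predict, score.</p>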
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Work</title>
      <p>We have developed a model for the HASOC Marathi subtask A using a Machine Learning
approach. Experiments were done with various classifier algorithms using the extracted feature set.
The classifier algorithms that we used for our experiments are as follows:
• Logistic Regression (LR): Linear regression uses a linear function to
map input values to continuous values. The data is modeled using a straight line to predict
the output of a variable. Logistic regression is similar to linear regression, except that logistic
regression predicts whether something is true or false instead of predicting continuous
values. It is a supervised machine learning algorithm used to predict the probability of
a target variable. The log-odds of the event are modeled as a linear combination
of the predictor variables. It is used when the data is linearly separable and
the output is binary or dichotomous in nature. So, it is used for binary classification problems.
The target variable is divided into two classes: '1' for success/YES and '0' for failure/NO.
Logistic regression's ability to provide probabilities and classify new samples using
continuous and discrete measurements makes it a popular machine learning method. One
big difference between linear regression and logistic regression is how the line is fit to the
data. With linear regression, we fit the line using the least-squares method, i.e., we find
the line that minimizes the sum of the squares of the residuals. We also use the residuals
to calculate R<sup>2</sup> and to compare simple models to complicated models. Logistic regression
doesn't have the same concept of a residual, so it can't use the least-squares method.
Instead, it uses the concept of maximum likelihood. The goal of maximum likelihood is
to find the parameters under which the observed data is most probable.</p>
      <p>Instead of fitting a line to the data, logistic regression fits an "S"-shaped logistic function
called the sigmoid function, which is used for classification. It maps any
predicted value into a value between 0 and 1. The concept of a threshold value is used
in LR: if the predicted value is above the threshold, the sample is assigned to class 1, and if it is below the
threshold, to class 0. There are two hypotheses in logistic regression: a null hypothesis and
an alternative hypothesis. We used the alternative hypothesis, under which the model predicts
accurate values and differs significantly from the null (zero) model. Under this
hypothesis, the output depends on the estimated probability.
<inline-formula><tex-math>\log\frac{p}{1-p}</tex-math></inline-formula> is the link function used in logistic regression, where p is the probability of success
and 1-p is the probability of failure. Here p must lie strictly between 0 and 1 for the link function to be defined.
<inline-formula><tex-math>\frac{p}{1-p}</tex-math></inline-formula> is the odds. If the log-odds come out positive, the probability of success
is more than 50%. If they come out negative, the probability of failure is more than 50%.</p>
      <p>• Random Forest (RF): It is a Machine Learning algorithm used for classification and
regression problems. Random forests are made out of decision trees. Decision trees
work well with the data used to create them, but they are not flexible when it comes to
classifying new samples. Random forests combine the simplicity of decision trees with
flexibility, resulting in a vast improvement in accuracy. A random forest contains several
decision trees built on various subsets of a given training dataset. It provides output based on
a majority vote. A decision tree consists of three components: a root node, decision nodes, and leaf
nodes. The tree divides the training dataset into branches and further
separates these into other branches.</p>
      <p>The difference between a decision tree and a random forest is that the former combines
a sequence of decisions into a single tree, whereas the latter combines
many decision trees. We have used a bagging method for prediction known as Bootstrap
aggregation, the ensemble technique used in random forests. It involves using different
samples of the data rather than one. The training dataset consists of observations and features
that are used for prediction. Each tree produces a different output depending upon
the training data it sees. The final output is based on majority voting, and combining
the outputs in this way is called aggregation.</p>
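      <p>The bootstrap-and-vote idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a random forest implementation: each base learner is a one-feature decision stump instead of a full decision tree, no random feature selection is performed, and the eight labelled points are hypothetical.</p>
      <preformat>
```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a one-feature stump: predict `above` if x > threshold else `below`."""
    best = None
    values = sorted({x for x, _ in samples})
    labels = sorted({y for _, y in samples})
    for threshold in values:
        for above in labels:
            for below in labels:
                errors = sum(
                    1 for x, y in samples
                    if (above if x > threshold else below) != y
                )
                if best is None or best[0] > errors:
                    best = (errors, threshold, above, below)
    _, threshold, above, below = best
    return lambda x: above if x > threshold else below


def fit_bagged_stumps(samples, n_estimators, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    estimators = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        estimators.append(fit_stump(bootstrap))
    return estimators


def predict_majority(estimators, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(estimator(x) for estimator in estimators)
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional training data: (feature value, label).
data = [(0.0, "NOT"), (0.1, "NOT"), (0.2, "NOT"), (0.3, "NOT"),
        (0.7, "HOF"), (0.8, "HOF"), (0.9, "HOF"), (1.0, "HOF")]
forest = fit_bagged_stumps(data, n_estimators=25)
print(predict_majority(forest, 0.0), predict_majority(forest, 1.0))
```
      </preformat>
      <p>Because every stump sees a different bootstrap sample, the individual thresholds vary, and the vote smooths out those differences.</p>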
      <p>We also tuned hyper-parameters, which help to increase the predictive power of
the model. n_estimators is one of these hyper-parameters: it is the number of trees that
the algorithm builds before taking the majority vote or averaging the predictions. Increasing the number
of trees generally improves the model's performance and makes its predictions more stable.</p>
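      <p>For the Logistic Regression description earlier in this section, the sigmoid function, the threshold rule, and the log-odds link can be made concrete with a short sketch. The function names and the default threshold of 0.5 are assumptions chosen for the example.</p>
      <preformat>
```python
import math


def sigmoid(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # this branch avoids overflow for large negative z
    return exp_z / (1.0 + exp_z)


def logit(p):
    """Log-odds link log(p / (1 - p)), defined for p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def classify(z, threshold=0.5):
    """Return 1 if the predicted probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


for score in (-2.0, 0.0, 2.0):
    probability = sigmoid(score)
    print(score, round(probability, 3), classify(score))
```
      </preformat>
      <p>The logit is the inverse of the sigmoid, so a positive log-odds value corresponds to a probability above 0.5 and a negative one to a probability below 0.5.</p>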
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>For better model performance, we have used 70 percent of the training data for training the
model, and the remaining is used for testing. Table 2 and Table 3 show the Precision, Recall, F1,
and Accuracy scores for Logistic Regression and Random Forest.</p>
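      <p>Precision is defined as the ratio <inline-formula><tex-math>\frac{tp}{tp + fp}</tex-math></inline-formula>, where <inline-formula><tex-math>tp</tex-math></inline-formula> is the number of true positives and <inline-formula><tex-math>fp</tex-math></inline-formula> the
number of false positives. Precision is the ability of the classifier not to label as positive a sample
that is negative. The F1 score is also known as the balanced F-score or F-measure. It is the harmonic
mean of Precision and Recall, and it reaches its best value at 1 and its worst value
at 0. The relative contribution of Precision and Recall to the F1 score is equal. The formula
for the F1 score is <inline-formula><tex-math>F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}</tex-math></inline-formula>. The Recall score helps when the cost of false
negatives is high. Recall is the ratio <inline-formula><tex-math>\frac{tp}{tp + fn}</tex-math></inline-formula>, where <inline-formula><tex-math>tp</tex-math></inline-formula> is the number of true positives and <inline-formula><tex-math>fn</tex-math></inline-formula> the
number of false negatives. Recall is the ability of the classifier to find all the positive samples.
The Accuracy score gives a quick indication of whether a model is being trained correctly and how
it may perform in general. It is simply the ratio of correctly predicted observations to the total
number of observations.</p>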
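      <p>The four scores defined above can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below does this for one class treated as positive; the function name and the eight example labels are hypothetical and are not results from our experiments.</p>
      <preformat>
```python
def binary_metrics(y_true, y_pred, positive="HOF"):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": correct / len(pairs),
    }


# Hypothetical gold labels and predictions for eight tweets.
y_true = ["HOF", "HOF", "HOF", "NOT", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["HOF", "HOF", "NOT", "HOF", "NOT", "NOT", "NOT", "NOT"]
scores = binary_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```
      </preformat>
      <p>Calling the function once with positive="HOF" and once with positive="NOT" gives the per-class scores of the kind reported for each classifier.</p>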
      <p>Logistic Regression has an F1 score of 0.84 for the non-offensive text and 0.54 for the
hate-offensive text. An accuracy of 0.7595 is obtained from Logistic Regression. Random Forest has an F1
score of 0.83 for the non-offensive text and 0.67 for the hate-offensive text. This classifier gives
an accuracy of 0.7770.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Hate speech continues to be a social media problem. This paper presents the experimental
work and results of HASOC Marathi subtask A by Team Mind Benders. This paper proposes a
solution for detecting Marathi hate speech and offensive content on the Twitter dataset through
supervised machine learning approaches like Logistic Regression (LR) and Random Forest (RF).</p>
      <p>[Table: Logistic Regression, Accuracy Score: 75.955%; columns: Precision, Recall, F1 score, Support]</p>
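      <p>We performed an analysis of LR and RF on various sets of feature values and model parameters.
For the identification of critical features from data, we used the TF-IDF feature extraction
technique. The results showed that Random Forest performs comparatively better than the
Logistic Regression approach. We achieved a reasonable accuracy of 0.77 using the Random
Forest classifier. Given all the challenges that remain, there is a need for more research on this
problem statement.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>A.</given-names> <surname>Kulkarni</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Mandhane</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Likhitkar</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Kshirsagar</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Joshi</surname></string-name>,
          <article-title>L3cubemahasent: A marathi tweet-based sentiment analysis dataset</article-title>,
          <source>arXiv preprint arXiv:2103.11408</source>
          (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>S. S.</given-names> <surname>Aluru</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Mathew</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Saha</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mukherjee</surname></string-name>,
          <article-title>Deep learning models for multilingual hate speech detection</article-title>,
          <source>arXiv preprint arXiv:2004.06465</source>
          (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>V.</given-names> <surname>Pathak</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Joshi</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Joshi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Mundada</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Joshi</surname></string-name>,
          <article-title>Kbcnmujal@hasoc-dravidian-codemix-fire2020: Using machine learning for detection of hate speech and offensive code-mixed social media text</article-title>,
          <source>arXiv preprint arXiv:2102.09866</source>
          (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>A. M.</given-names> <surname>Founta</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Chatzakou</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Kourtellis</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Blackburn</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Vakali</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Leontiadis</surname></string-name>,
          <article-title>A unified deep learning architecture for abuse detection</article-title>,
          <source>in: Proceedings of the 10th ACM conference on web science</source>,
          <year>2019</year>, pp.
          <fpage>105</fpage>-<lpage>114</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>A.</given-names> <surname>Khandelwal</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Swami</surname></string-name>,
          <string-name><given-names>S. S.</given-names> <surname>Akhtar</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Shrivastava</surname></string-name>,
          <article-title>Gender prediction in english-hindi code-mixed social media content: Corpus and baseline system</article-title>,
          <source>Computación y Sistemas</source>
          <volume>22</volume>
          (<year>2018</year>)
          <fpage>1241</fpage>-<lpage>1247</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>B. S.</given-names> <surname>Lakshmi</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Shambhavi</surname></string-name>,
          <article-title>An automatic language identification system for code-mixed english-kannada social media text</article-title>,
          <source>in: 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS)</source>,
          IEEE, <year>2017</year>, pp.
          <fpage>1</fpage>-<lpage>5</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>T. L.</given-names> <surname>Sutejo</surname></string-name>,
          <string-name><given-names>D. P.</given-names> <surname>Lestari</surname></string-name>,
          <article-title>Indonesia hate speech detection using deep learning</article-title>,
          <source>in: 2018 International Conference on Asian Language Processing (IALP)</source>,
          IEEE, <year>2018</year>, pp.
          <fpage>39</fpage>-<lpage>43</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>L.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Suzuki</surname></string-name>,
          <article-title>Detecting hate speech from tweets for sentiment analysis</article-title>,
          <source>in: 2019 6th International Conference on Systems and Informatics (ICSAI)</source>,
          IEEE, <year>2019</year>, pp.
          <fpage>671</fpage>-<lpage>676</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>G.</given-names> <surname>Kovács</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Alonso</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Saini</surname></string-name>,
          <article-title>Challenges of hate speech detection in social media</article-title>,
          <source>SN Computer Science</source>
          <volume>2</volume>
          (<year>2021</year>)
          <fpage>1</fpage>-<lpage>15</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>I.</given-names> <surname>Chaitanya</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Madapakula</surname></string-name>,
          <string-name><given-names>S. K.</given-names> <surname>Gupta</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Thara</surname></string-name>,
          <article-title>Word level language identification in code-mixed data using word embedding methods for indian languages</article-title>,
          <source>in: 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI)</source>,
          IEEE, <year>2018</year>, pp.
          <fpage>1137</fpage>-<lpage>1141</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>A.</given-names> <surname>Gaydhani</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Doma</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kendre</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bhagwat</surname></string-name>,
          <article-title>Detecting hate speech and offensive language on twitter using machine learning: An n-gram and tfidf based approach</article-title>,
          <source>arXiv preprint arXiv:1809.08651</source>
          (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kumar M</surname></string-name>,
          <string-name><given-names>B. R.</given-names> <surname>Chakravarthi</surname></string-name>,
          <article-title>Overview of the hasoc track at fire 2020: Hate speech and offensive language identification in tamil, malayalam, hindi, english and german</article-title>,
          <source>in: Forum for Information Retrieval Evaluation</source>,
          <year>2020</year>, pp.
          <fpage>29</fpage>-<lpage>32</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>S.</given-names> <surname>Gaikwad</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ranasinghe</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zampieri</surname></string-name>,
          <string-name><given-names>C. M.</given-names> <surname>Homan</surname></string-name>,
          <article-title>Cross-lingual offensive language identification for low resource languages: The case of marathi</article-title>,
          <source>in: Proceedings of RANLP</source>,
          <year>2021</year>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <string-name><given-names>G. K.</given-names> <surname>Shahi</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Madhu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Satapara</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ranasinghe</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zampieri</surname></string-name>,
          <article-title>Overview of the HASOC subtrack at FIRE 2021: Hate speech and offensive content identification in english and indo-aryan languages and conversational hate speech</article-title>,
          <source>in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021</source>,
          ACM, <year>2021</year>.
        </mixed-citation>
      </ref>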
    </ref-list>
  </back>
</article>