<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hate Speech and Offensive Content Identification in Multiple Languages using machine learning algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dikshitha Vani V</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B. Bharathi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Kalavakkam, Tamil Nadu 603110</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The freedom of expression on social media sites like Twitter and Facebook gives people opportunities to voice their opinions and concerns. At the same time, these sites have become a vehicle for bullying and hateful comments. AI tools can identify such comments automatically, and they are evaluated through continuous experimentation with data sets. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper addresses the HASOC task for Offensive Language Identification in Marathi, whose data set was assembled from Twitter. The task has three subtasks. Subtask A is offensive language detection, where the goal is to discriminate between offensive and non-offensive posts. Subtask B includes only the posts labeled as Offensive (OFF) in subtask A, and the goal is to predict the type of offense as either Targeted Insult (TIN) or Untargeted (UNT). Subtask C considers only posts that are insults or threats (TIN) and classifies the target of the offense as Individual (IND), Group (GRP), or Other (OTH). In this work, our team ssncse_nlp applied four machine learning classifiers, Random Forest (RF), Support Vector Machine (SVM), Logistic Regression, and k-nearest neighbors (KNN), to count-vectorized features of the tweets. Random Forest predicted the labels for subtasks A and C more accurately than the other classifiers, with macro F1 scores of 0.9745 and 0.7929 respectively, while Logistic Regression was the most accurate for subtask B, with a macro F1 of 0.6958.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social media is an active tool for people of all ages. They generate and consume a great deal
of data. Social media has gained massive support and reach over the years because it helps
people express themselves and their ideologies, showcase their talents, meet new people,
connect themselves to others’ stories, etc. On the other hand, it also means that people are
openly critical of others and "impose" their ideologies on others. When ideologies clash, they
introduce insulting, hurtful, derogatory or obscene language. Such objectionable content can
even be a threat to democracy. Open societies need to find an acceptable way to react to such
content without imposing rigid censorship regimes.
Online hate speech can be produced and distributed easily, at low cost and
anonymously, while having the potential to reach a globally diverse audience in real time. The
relative permanency of online content is also an issue when hate speech can reappear and
(re)gain popularity over time.</p>
      <p>As a consequence, many social media websites monitor user posts. This leads to a
pressing demand for methods to automatically identify suspicious posts. Online communities,
social media enterprises, and technology companies have been investing heavily in technology
and processes to identify offensive language and prevent abusive social media behavior.
However, there is increasing evidence that social media platforms still struggle to keep up with
the demand for technology to identify offensive content, particularly for languages other than
English. This paper aims to detect hate speech in the Marathi language.</p>
      <p>Marathi is an Indo-Aryan language predominantly spoken by Marathi people in the Indian
state of Maharashtra. It is the official language of Maharashtra, and a co-official language in
Goa state and the territory of Daman, Diu &amp; Silvassa. It is one of the 22 scheduled languages of
India, with more than 90 million speakers. Marathi ranks 10th in the list of languages with the
most native speakers in the world. Marathi has the third largest number of native speakers in
India, after Hindi and Bengali. The language has some of the oldest literature of all modern
Indian languages.</p>
      <p>In this paper, we classify tweets using count-vectorized features, that is, a statistical
representation of the text built from token counts. We evaluate four classifier models on these
features: random forest (RF), support vector machine (SVM), logistic regression,
and k-nearest neighbors (KNN).</p>
      <p>The rest of the paper is organized as follows. Section 2 describes related work
on hate speech detection. The data set for the shared task and the machine learning pipeline used for this
task are described in Section 3. Section 4 discusses the four classifier models used in this paper.
Results are presented in Section 5. Section 6 discusses improvements that can be applied to
the model. Section 7 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Identifying vulgar comments on social media and classifying them has become
an important field of study for ensuring security, safety, peace, and harmony among
people. Classifier models coupled with a count vectorizer are among the most commonly
used approaches to detecting hate speech. Ensemble models, a machine learning approach that
combines multiple models in the prediction process, are also commonly used to improve
classification performance.
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [3], [4], and [5] use classical classifier models and ensemble models for
classification, coupled with count vectorizers or TF-IDF vectorizers. The models used in
these papers are summarized below.
In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors used ensemble learning models such as Random Forest and
AdaBoost coupled with a count vectorizer for hate speech detection, where Random
Forest yielded 95% accuracy. They also used a word cloud to display the most prominent
tweets responsible for hateful sentiments. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] discusses hate speech detection in the Indonesian
language. The authors used five stand-alone classification algorithms, namely Naive
Bayes, K-Nearest Neighbours, Maximum Entropy, Random Forest, and Support Vector
Machines, and two ensemble methods, namely hard voting and soft voting, on a Twitter hate
speech data set. They aimed to show that the ensemble method can improve classification
performance. In [3], the authors devised a system that uses both machine learning and
deep learning techniques to detect offensive comments. [4] describes the use of SVM
and Naive Bayes classifiers for hate speech detection, and the results showed a classification
accuracy of approximately 99% and 50% for SVM and NB respectively on their test set.
Similarly, [5] uses Logistic Regression and XGBoost classifiers.
      </p>
      <p>Apart from the classifier models that are combined with a count vectorizer or TF-IDF
vectorizer, transformer models are also widely used for hate speech detection. [6] uses
multilingual pre-trained models. In that work, the MBERT model was used for the Hindi and Marathi
languages, and the BERT model was used for the English language.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <p>The proposed system of detecting offensive content from the HASOC 2022 task 3 data is
described in the following sections. The steps involved in the proposed system are as follows:</p>
      <sec id="sec-3-1">
        <p>1. Data set exploration and preprocessing
2. Feature extraction
3. Model training and testing</p>
        <p>Section 3.1 discusses the data set in detail, Section 3.2 discusses the need for feature extraction, and Section 3.3
discusses the training and testing procedures.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Data Set exploration and Preprocessing</title>
          <p>The data set given by the shared task organizers contains a training set of 3103
instances and a test set of 510 instances. Each instance in both sets consists of an ID
(the identity of the user) and the tweet.</p>
          <p>Each instance of the training set also contains up to 3 labels each corresponding to
one of the following levels:
- Level (or sub-task) A: Offensive language identification;
- Level (or sub-task) B: Automatic categorization of offense types;
- Level (or sub-task) C: Offense target identification.</p>
          <p>The training set is split into 3 CSV files so that each subtask can be carried out
separately. The file for subtask A contains all 3103 instances, while the files for subtasks B and C contain 1068 and 740
instances respectively. The data set is split so that none of the files contains a null row entry.
Each of these files contains the ID, the tweet, and the label of the corresponding level/task.</p>
          <p>Our team has not done any preprocessing of the Marathi texts.
The texts could be preprocessed by removing stopwords. Stopwords are words that
provide no information for the classification but increase the dimension of
the matrix produced by the count vectorizer. This is explained in Section 6.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>The data set was retrieved from [7].</p>
        <sec id="sec-3-2-1">
          <title>3.2. Feature extraction</title>
          <p>Machine learning algorithms cannot operate directly on characters and words. To use textual data for predictive
modeling, words need to be encoded as integers or floating-point values that serve as inputs to
the algorithms. This process is called feature extraction.</p>
          <p>We have used Scikit-learn’s count vectorizer to convert a collection of text documents
to a vector of term/token counts. It enables the pre-processing of text data prior to generating
the vector representation. This functionality makes it a highly flexible feature representation
module for text.</p>
          <p>The count vectorizer is imported from the sklearn.feature_extraction.text module. The
training set is passed to the fit_transform function, which learns the vocabulary and produces the count matrix. The test set is
then passed to transform, which encodes it using the vocabulary fitted on the training set.</p>
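          <p>The fit and transform steps described above can be sketched as follows. This is a minimal illustration, not the submitted code: the English example tweets are hypothetical stand-ins for the Marathi data, and the default CountVectorizer settings are an assumption.</p>
          <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the training and test tweets.
train_tweets = ["this is good", "this is not good", "bad bad tweet"]
test_tweets = ["good tweet", "unseen words only"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training tweets and counts tokens.
X_train = vectorizer.fit_transform(train_tweets)
# transform reuses that vocabulary; tokens unseen during fitting are ignored.
X_test = vectorizer.transform(test_tweets)

# Columns of the count matrix follow the alphabetically sorted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_train.toarray())
print(X_test.toarray())
```
          </preformat>
          <p>Because the second test tweet shares no token with the training vocabulary, its row in the test matrix contains only zeros.</p>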
        </sec>
        <sec id="sec-3-2-2">
          <title>3.3. Model Training and Testing</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <p>The model is trained and tested using 4 classifier models:</p>
      </sec>
      <sec id="sec-3-4">
        <p>1. Random Forest
2. SVM classifier
3. Logistic Regression
4. KNN classifier</p>
        <p>Subsection 3.3.1 discusses a major preprocessing step, label encoding.
Subsection 3.3.2 explains the importance of splitting the training data set in order to study the
4 classifier models.</p>
        <p>3.3.1. Label Encoding</p>
        <p>Label encoding is a popular technique for handling categorical variables. In this
technique, each label is assigned a unique integer based on alphabetical ordering. This
converts the labels into a numeric, machine-readable form that machine learning
algorithms can operate on. It is an important preprocessing step for structured data sets in supervised
learning. It can be performed using LabelEncoder() from the sklearn.preprocessing module. Since
the Scikit-learn library is used, there is no need to perform label encoding separately, as the
classifier object takes care of it.</p>
        <p>3.3.2. Data Split</p>
        <p>The tweet field of the training data (features) and the label of the subtask (labels) are used to
perform a train and test data split.</p>
        <p>This split is necessary to check the correctness of the proposed model. The
classifier is fitted on the training portion of the split. The tweets of the held-out test portion
are then given to the fitted model to predict the labels of the subtask. The predicted results are
compared with the actual labels of the held-out portion using the classification
report, confusion matrix, and accuracy score.</p>
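        <p>The label encoding of Section 3.3.1 and the split and evaluation of Section 3.3.2 can be sketched together as follows. The tweets, the 80/20 ratio, the random_state value, the stratified split, and the choice of Logistic Regression as the example classifier are all assumptions made for illustration.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the tweet and label columns of a subtask file.
tweets = [
    "have a nice day", "what a lovely morning", "nice lovely weather",
    "great match today", "thank you friend",
    "you are an idiot", "stupid idiot people", "you stupid fool",
    "shut up loser", "ugly stupid loser",
]
labels = ["NOT"] * 5 + ["OFF"] * 5

# Optional step: LabelEncoder maps labels to integers in alphabetical order.
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # NOT becomes 0, OFF becomes 1

# Hold out 20% of the labeled data; stratify keeps the class ratio in both parts.
train_txt, val_txt, y_train, y_val = train_test_split(
    tweets, labels, test_size=0.2, random_state=42, stratify=labels
)

vectorizer = CountVectorizer()
model = LogisticRegression()
# String labels are accepted directly, so the encoded labels are not required.
model.fit(vectorizer.fit_transform(train_txt), y_train)
y_pred = model.predict(vectorizer.transform(val_txt))

accuracy = accuracy_score(y_val, y_pred)
matrix = confusion_matrix(y_val, y_pred, labels=["NOT", "OFF"])
report = classification_report(y_val, y_pred, zero_division=0)
print(accuracy)
print(matrix)
print(report)
```
        </preformat>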
        <p>Empirical studies show that the best results are obtained if 20-30% of the data is used for
testing and the remaining 70-80% for training. The paper "Why 70/30 or 80/20
Relation Between Training and Testing Sets: A Pedagogical Explanation"<sup>1</sup> from the University of
Texas at El Paso discusses this.</p>
      </sec>
      <sec id="sec-3-5">
        <p>Each of these classifier models is discussed in the next section.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Classifier Models</title>
      <p>This section is divided into four subsections. Section 4.1 describes the Random Forest classifier
model, Section 4.2 describes the Support Vector Machine classifier model, Section 4.3 describes the
Logistic Regression classifier model, and Section 4.4 describes the K-nearest Neighbors classifier
model.</p>
      <sec id="sec-4-1">
        <title>4.1. Random Forest classifier</title>
        <p>The random forest classifier is imported from the ensemble module of Scikit-learn. Ensemble means combining multiple
models, i.e., a collection of models is used to make predictions rather than an individual model.
Ensemble learning uses two types of methods:
1. Bagging: it creates different training subsets from the sample training data with
replacement, and the final output is based on majority voting.
2. Boosting: it combines weak learners into strong learners by creating sequential models
such that the final model has the highest accuracy.</p>
        <p>The random forest algorithm is based on the bagging principle.</p>
        <p><sup>1</sup> Source: https://scholarworks.utep.edu/cgi/viewcontent.cgi?article=2202&amp;context=cs_techrep</p>
        <p>The steps involved in the random forest algorithm are as follows:
• n random records are taken from the data set.
• An individual decision tree is constructed for each sample.
• Each decision tree generates an output.
• The final output is based on majority voting for classification.</p>
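        <p>The bagging steps listed above can be sketched by hand with a few decision trees. This is an illustrative simplification under stated assumptions: the two-dimensional feature vectors, the number of trees, and the seed are hypothetical, and a real random forest additionally samples a random subset of features at each split.</p>
        <preformat>
```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical feature vectors and labels standing in for vectorized tweets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]])
y = np.array(["NOT", "NOT", "NOT", "NOT", "OFF", "OFF", "OFF", "OFF"])

trees = []
for _ in range(5):
    # Draw n records with replacement (a bootstrap sample).
    idx = rng.integers(0, len(X), size=len(X))
    # Build an individual decision tree on that sample.
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))


def majority_vote(sample):
    """Collect one prediction per tree and return the most common label."""
    votes = [tree.predict([sample])[0] for tree in trees]
    return Counter(votes).most_common(1)[0][0]


print(majority_vote([0, 0]))
print(majority_vote([3, 3]))
```
        </preformat>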
        <sec id="sec-4-1-1">
          <p>Table 4.1 describes the hyperparameters used.</p>
          <table-wrap>
            <table>
              <thead>
                <tr><th>Hyperparameter</th><th>Function</th></tr>
              </thead>
              <tbody>
                <tr><td>n_estimators</td><td>The number of trees the algorithm builds before averaging the predictions</td></tr>
                <tr><td>random_state</td><td>Controls the randomness of the bootstrapping of the samples used when building trees</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>To train the model, the classifier is fitted on the training portion of the split. The held-out test portion is
then used to predict the labels.</p>
          <p>Table 1 compares the predictions on the held-out portion with the actual labels using the metrics macro F1,
precision, and accuracy.</p>
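          <p>A minimal sketch of training the random forest with the two hyperparameters named above is given below. The values n_estimators=100 and random_state=42 and the example tweets are assumptions; the paper does not state the values that were used.</p>
          <preformat>
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and held-out tweets.
train_txt = [
    "have a nice day", "what a lovely morning", "nice lovely weather",
    "you are an idiot", "stupid idiot people", "you stupid fool",
]
y_train = ["NOT", "NOT", "NOT", "OFF", "OFF", "OFF"]
val_txt = ["nice weather today", "stupid people"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_txt)
X_val = vectorizer.transform(val_txt)

# n_estimators sets the number of trees; random_state fixes the bootstrapping.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
predicted = forest.predict(X_val)

# Refitting with the same random_state reproduces the same predictions.
repeat = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
same = list(repeat.predict(X_val)) == list(predicted)
print(predicted, same)
```
          </preformat>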
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Support Vector Machine</title>
        <p>Support Vector Machine (SVM) is a supervised machine learning algorithm that can be
used for classification problems. In this algorithm, each data item is plotted as a point in
n-dimensional space, n being the number of features, with the value of each feature being the
value of a particular coordinate. Classification is performed by finding the hyperplane that
differentiates the two classes, as shown in figure 2.</p>
        <p><sup>2</sup> Source: https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest</p>
        <p>If the hyperplane separates the data set linearly, the algorithm is called a support vector classifier (SVC); if the
algorithm separates the data set with a non-linear approach, it is called an SVM, as depicted in figure 3.</p>
        <p>To train the model, the SVC classifier is fitted on the training portion of the split. The held-out test
portion is then used to predict the labels.</p>
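        <p>The linear and non-linear variants can be sketched with Scikit-learn's SVC class by changing the kernel. The kernels shown, the default regularization, and the example tweets are assumptions; the paper does not state which kernel was used.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical training tweets with disjoint vocabularies per class.
train_txt = [
    "have a nice day", "what a lovely morning", "nice lovely weather",
    "you are an idiot", "stupid idiot people", "you stupid fool",
]
y_train = ["NOT", "NOT", "NOT", "OFF", "OFF", "OFF"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_txt)

# A linear kernel looks for a separating hyperplane in the count-vector space.
linear_svc = SVC(kernel="linear").fit(X_train, y_train)
# An RBF kernel allows a non-linear decision boundary.
rbf_svc = SVC(kernel="rbf").fit(X_train, y_train)

linear_pred = list(linear_svc.predict(X_train))
rbf_pred = list(rbf_svc.predict(vectorizer.transform(["nice day", "stupid fool"])))
print(linear_pred)
print(rbf_pred)
```
        </preformat>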
        <p>Table 2 compares the predictions on the held-out portion with the actual labels using the
metrics macro F1, precision, and accuracy.</p>
        <p><sup>3</sup> Source: https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/</p>
        <p><sup>4</sup> Source: https://www.analyticsvidhya.com/blog/2021/06/support-vector-machine-better-understanding/</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Logistic Regression</title>
        <p>Logistic regression is a supervised classification algorithm in which the target variable takes
only discrete values for a given set of features. It builds a regression model to predict
the probability that a given data entry belongs to a category, and it models the
data using the sigmoid function.<sup>5</sup></p>
        <p>Based on the number of categories, Logistic regression can be classified as:</p>
        <sec id="sec-4-3-1">
          <p>1. Binomial: the target variable can have only 2 possible types, "0" or "1".
2. Multinomial: the target variable can have 3 or more possible types that are not ordered,
such as "Type A" vs "Type B" vs "Type C".
3. Ordinal: the target variable has ordered categories. Each category can
be given a value such as 0, 1, 2. For example, a test result can be classified as
"poor", "good", "better".</p>
          <p>To train the model, the Logistic Regression model is fitted on the training portion of the split. The
held-out test portion is then used to predict the labels.</p>
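          <p>The role of the sigmoid function can be checked with a small binomial example. In this sketch, the example tweets and the default LogisticRegression settings are assumptions, and sigmoid is a helper defined here for illustration. For a two-class model, applying the sigmoid to the linear score should reproduce the predicted probability of the second class.</p>
          <preformat>
```python
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical training tweets for a binomial (two-class) problem.
train_txt = [
    "have a nice day", "what a lovely morning", "nice lovely weather",
    "you are an idiot", "stupid idiot people", "you stupid fool",
]
y_train = ["NOT", "NOT", "NOT", "OFF", "OFF", "OFF"]

vectorizer = CountVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_txt), y_train)

X_val = vectorizer.transform(["you are stupid"])
score = model.decision_function(X_val)[0]    # linear score w.x + b
prob_off = model.predict_proba(X_val)[0][1]  # column 1 is classes_[1], here OFF
print(sigmoid(score), prob_off)
```
          </preformat>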
        </sec>
        <sec id="sec-4-3-2">
          <p><sup>5</sup> Source: https://www.geeksforgeeks.org/understanding-logistic-regression/</p>
          <p><sup>6</sup> Source: https://www.researchgate.net/publication/239269767_Artificial_Neural_Networks_in_Multivariate_Calibration/figures?lo=1</p>
          <p>Table 3 compares the predictions on the held-out portion with the actual labels using the
metrics macro F1, precision, and accuracy.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. K-nearest Neighbors</title>
        <p>K-nearest neighbors (KNN) is a supervised learning algorithm used for classification. KNN
predicts the class of a test point by calculating the distance between the test
point and all the training points. The K points that are closest to the test point are selected.
The algorithm calculates the probability of the test point belonging to each class of the K training
points, and the class with the highest probability is selected.</p>
        <p>Figure 5 depicts the steps involved in KNN classification.</p>
        <p>To train the model, the KNeighborsClassifier is fitted on the training portion of the split. The
held-out test portion is then used to predict the labels.</p>
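        <p>The neighbor selection and class probabilities described above can be sketched on a tiny numeric example. The one-dimensional feature values and K=3 are hypothetical choices made so that the neighbors are easy to verify by hand; in the paper the features are count vectors.</p>
        <preformat>
```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-dimensional features with their class labels.
X_train = [[0.0], [1.0], [2.0], [5.0], [10.0], [11.0], [12.0]]
y_train = ["NOT", "NOT", "NOT", "OFF", "OFF", "OFF", "OFF"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

queries = [[1.5], [3.4], [10.5]]
predicted = knn.predict(queries)
# Each row holds the fraction of the 3 nearest neighbors belonging to NOT and OFF.
probabilities = knn.predict_proba(queries)
print(predicted)
print(probabilities)
```
        </preformat>
        <p>For the query 3.4, the three nearest training points are 2.0 (NOT), 5.0 (OFF), and 1.0 (NOT), so two of the three votes go to NOT.</p>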
        <p>Table 4 compares the predictions on the held-out portion with the actual labels using the
metrics macro F1, precision, and accuracy.</p>
        <p>Out of the 4 classifiers, it is evident that, for the given data set, the Random Forest classifier
with the count vectorizer produces the highest accuracy, macro-averaged F1 score, and precision.</p>
        <p><sup>7</sup> Source: https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The entire training data set (without splitting) is used to fit the model, and the test data set is
used to predict the labels from the fitted model. The results were written to a
CSV file and submitted to the HASOC 2022 submission portal.</p>
      <p>Our team ssncse_nlp ranked 1st in subtask A with a macro F1 and macro precision
of 0.9745 and 0.9758 respectively using the Random Forest classifier. In subtask B, our team
ranked 3rd with a macro F1 and macro precision of 0.6958 and 0.7587 respectively using the
Logistic Regression classifier. In subtask C, our team ranked 2nd with a macro F1 and
macro precision of 0.7929 and 0.7963 respectively using the Random Forest classifier.</p>
      <p>The overview of the HASOC 2022 task is given in [8] and [9]. The code is available on
GitHub.<sup>8</sup></p>
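      <p>The final fit and the creation of the submission file can be sketched as follows. This is an assumption-laden illustration rather than the submitted code: the tweets, the IDs, the column names ID and label, and the hyperparameter values are hypothetical, and an in-memory buffer stands in for the CSV file.</p>
      <preformat>
```python
import csv
import io

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical full training set and unlabeled test set.
train_tweets = [
    "have a nice day", "what a lovely morning", "nice lovely weather",
    "you are an idiot", "stupid idiot people", "you stupid fool",
]
train_labels = ["NOT", "NOT", "NOT", "OFF", "OFF", "OFF"]
test_ids = ["t1", "t2"]
test_tweets = ["lovely day", "stupid idiot"]

# Fit on the entire training set, without holding anything out.
vectorizer = CountVectorizer()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(vectorizer.fit_transform(train_tweets), train_labels)
predictions = model.predict(vectorizer.transform(test_tweets))

# Write one row per test tweet; a real run would open a file instead of a buffer.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ID", "label"])
writer.writerows(zip(test_ids, predictions))
rows = buffer.getvalue().splitlines()
print(rows)
```
      </preformat>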
      <sec id="sec-5-1">
        <p>The overall results (macro F1) are depicted in the table below:</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Ablation Study</title>
      <p>The count vectorizer consumes considerable memory because it needs to store a vocabulary dictionary in
memory. This problem can be reduced to some extent by preprocessing the data to
remove the stopwords from the tweets. A few stopwords in the Marathi language are as follows:</p>
      <sec id="sec-6-1">
        <p><sup>8</sup> Source: https://github.com/dikshu-02/HASOC2022-task3</p>
        <p><sup>9</sup> Source: https://hasocfire.github.io/submission/leaderboard.html</p>
        <p>An alternative is to use a TF-IDF vectorizer. TF-IDF can be preferable to a count vectorizer
because it not only accounts for the frequency of words in the corpus but also
weights the importance of the words. Words that are less important
for the analysis can then be removed, which reduces the input dimensions and makes model building less complex.
[10] uses a TF-IDF vectorizer for social network hate speech detection in the Amharic language.</p>
        <p>Another issue with these classifier models is that they do not take into account the
context of a sentence. Hate speech is an abstract and broad topic that is difficult to define,
and its detection depends on people's subjective understanding of what hate
speech is. The model discussed in this paper counts the occurrences of words, finds
a pattern while training, and labels the data based on the counts of specific words. This can be
misleading, as the same words can appear in sentences with opposite meanings: both "This
is good" and "This is not good" contain equal counts of "good", "this", and "is".</p>
        <p>To increase accuracy, pre-trained models for Marathi such as Roberta-base-mr can be
used for sentiment analysis or context analysis. [11] shows the use of Marathi-based
transformer models and compares the results of a monolingual and a multilingual model.</p>
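        <p>The "This is good" example above can be reproduced directly. The sketch below, which assumes default Scikit-learn vectorizer settings, shows that the two sentences differ only in the count of "not", and that TF-IDF gives this rarer word a larger weight than the words shared by both sentences.</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["This is good", "This is not good"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(sentences).toarray()
# Columns follow the alphabetically sorted, lowercased vocabulary.
columns = sorted(count_vectorizer.vocabulary_)

tfidf_vectorizer = TfidfVectorizer()
weights = tfidf_vectorizer.fit_transform(sentences).toarray()

not_col = columns.index("not")
good_col = columns.index("good")
print(columns)
print(counts)
print(weights.round(3))
```
        </preformat>
        <p>Neither representation captures word order, so the negation is still visible only as one extra feature, which is the limitation that motivates the transformer models mentioned above.</p>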
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Scrutiny of online comments is becoming increasingly necessary to prevent
cyberbullying and maintain the harmony of life. The models discussed in this paper are
some of the ways to identify offensive tweets. Once such tweets are identified, there are many ways to
prevent them (blocking the users, having a point-based system in social media for reducing
hate comments, giving warnings, etc.), which are outside the scope of this paper. There is, however,
further scope for improvement in the discussed models, as seen in the ablation study.</p>
      <p>[3] A. Anand, J. Golecha, B. Bharathi, B. Jayaraman, T. Mirnalinee, Machine learning based
hate speech identification for english and indo-aryan languages (2021).</p>
      <p>[4] D. Asogwa, C. Chukwuneke, N. Chigozie, G. Anigbogu, Hate speech classification using
svm and naive bayes, 2022.</p>
      <p>[5] M. K. A. Aljero, N. Dimililer, A novel stacked ensemble for hate speech recognition,
Applied Sciences 11 (2021) 11684.</p>
      <p>[6] A. Kalaivani, D. Thenmozhi, Multilingual hate speech and offensive language detection
in english, hindi, and marathi languages (2021).</p>
      <p>[7] M. Zampieri, T. Ranasinghe, M. Chaudhari, S. Gaikwad, P. Krishna, M. Nene, S. Paygude,
Predicting the type and target of offensive social media posts in marathi, Social Network
Analysis and Mining 12 (2022) 77. URL: https://doi.org/10.1007/s13278-022-00906-8.
doi:10.1007/s13278-022-00906-8.</p>
      <p>[8] S. Satapara, P. Majumder, T. Mandl, S. Modha, H. Madhu, T. Ranasinghe, M. Zampieri, K. North,
D. Premasiri, Overview of the HASOC Subtrack at FIRE 2022: Hate Speech and
Offensive Content Identification in English and Indo-Aryan Languages, in: FIRE 2022:
Forum for Information Retrieval Evaluation, Virtual Event, 9th-13th December 2022,
ACM, 2022.</p>
      <p>[9] T. Ranasinghe, K. North, D. Premasiri, M. Zampieri, Overview of the HASOC subtrack at
FIRE 2022: Offensive Language Identification in Marathi, in: Working Notes of FIRE 2022
- Forum for Information Retrieval Evaluation, CEUR, 2022.</p>
      <p>[10] Z. Mossie, J.-H. Wang, Social network hate speech detection for amharic language,
Computer Science &amp; Information Technology (2018) 41–55.</p>
      <p>[11] A. Velankar, H. Patil, R. Joshi, Mono vs multilingual bert for hate speech detection and
text classification: A case study in marathi, arXiv preprint arXiv:2204.08669 (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Turki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Novel hate speech detection using word cloud visualization and ensemble learning coupled with count vectorizer</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>12</volume>
          (
          <year>2022</year>
          ). URL: https://www.mdpi.com/2076-3417/12/13/6611. doi:10.3390/app12136611.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Fauzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yuniarti</surname>
          </string-name>
          ,
          <article-title>Ensemble method for indonesian twitter hate speech detection</article-title>
          ,
          <source>Indonesian Journal of Electrical Engineering and Computer Science</source>
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>294</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>