<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hate Speech and Offensive Content Identification in Hindi and Marathi Language Tweets using Ensemble Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ratnavel Rajalakshmi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faerie Mattins</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S Srivarshan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L Preethi Reddy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Technology, National Institute of Technology Karnataka (NITK)</institution>
          ,
          <addr-line>Surathkal</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Engineering, Vellore Institute of Technology</institution>
          ,
          <addr-line>Chennai</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Hate speech is any form of speech in which speakers attempt to ridicule, humiliate, or instil hatred in other people's minds based on characteristics such as religion, skin colour, race, or sexual preference. In recent years, social networking sites have become a major source of hate speech. If unaddressed, it might cause anxiety and despair in the affected individuals or groups. As a result, social networks utilize an assortment of algorithms to identify such hate speech. Detecting hate speech in English text has been an active research topic in recent years, with many studies published. However, in regional and indigenous languages, hate speech detection is a recent area with little research so far. It is difficult to perform hate speech detection in regional languages due to the lack of sufficiently large training data and of resources for the domain. The HASOC [1] 2021 Hate Speech Detection Task addresses one of these problems by providing a dataset containing tweets in English, Hindi [2] and Marathi [3]. There were two subtasks as part of the main task. The first subtask was to classify tweets in the Hindi and Marathi datasets as containing hate and offensive content or not. The second subtask was to classify the hate speech and offensive texts in the Hindi tweet dataset as Hate Speech (HATE), Offensive (OFFN) or Profane (PRF). This work compares the performance of different models on both subtasks and identifies the best performing model for each. The Random Forest classifier achieves the best performance on the first subtask, with macro F1 scores of 75.19% and 73.12% on the Marathi and Hindi tweet datasets respectively. XGBoost is the best performing algorithm on the second subtask, with a macro F1 score of 46.58%. Overall, these models obtain satisfactory results for hate speech detection in regional languages.
This work has been submitted to the FIRE2021 shared task, Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC-2021) by team DLRG.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate speech detection</kwd>
        <kwd>Multilingual tweets</kwd>
        <kwd>Machine learning</kwd>
        <kwd>XGBoost</kwd>
        <kwd>Majority voting</kwd>
        <kwd>Random forest</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Hate speech is defined as any communication, whether spoken, written, or physical, that
criticises or uses discriminating or disparaging terminology about a person (or group) based on
who they are, such as colour, religion, nationality, ethnicity, race, sexuality, descent, or another
identifying feature. Nowadays, social media has evolved into a popular forum for ordinary
people to share their feelings on any topic. This has its own set of advantages and disadvantages.
When people express themselves, others come to understand their point of view and perceive
the world in a new light. However, this might also imply that people have the right to share whatever
they wish, even offensive content that is harsh and upsetting to others. Such content impacts not only the
feelings of the targeted group but also their self-esteem and dignity. As a result, maintaining
a courteous demeanour while expressing one’s viewpoint on any social media site is essential.
Unfortunately, preventing people from posting or tweeting hate speech on social media is quite
difficult. Hence, social media companies globally are attempting to build new
approaches for detecting and preventing hate speech. This has been done to some extent in English,
as English is a widely known and spoken language worldwide. However, this is not the case in
Indian regional languages such as Hindi, Marathi, Tamil, and others. Therefore, it is critical to
have a solution in place.</p>
      <p>Machine learning is an application of data analysis that facilitates the creation of analytical
models. Artificial intelligence is predicated on the idea that computers can recognise patterns,
learn from data, and make decisions without the need for human intervention. Machine learning has been
utilised to solve problems in many science and technology domains, including social media.
Many researchers are now working on machine learning models that detect hate speech
accurately and efficiently. However, there has not been much advancement in
detecting hate speech in regional languages, because it is challenging to train
a model in a language that isn’t widely spoken and hence lacks the datasets and
procedures needed to train the model effectively. In this study, data embedding and multiple classifier
models were employed to determine whether or not a tweet contains hate and offensive content
in two Indian regional languages, namely Hindi and Marathi. This research is split into two
parts: subtask A and subtask B. In subtask A, binary classification models were used to determine if a
tweet contained hate speech in Hindi and Marathi. In subtask B, multiclass classification was used
to further classify Hindi tweets as hate speech, profane or offensive content.
Our team DLRG participated in Subtask A for Marathi and in Subtasks A and B for Hindi.
In section 2 of this paper, the related works are analysed and discussed. In section
3, the dataset utilised in both subtasks is discussed. Section 4 includes an explanation of the
proposed methodology. In section 5, the findings are presented and discussed, and in section 6,
the conclusion and future work are presented.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In earlier works, different term weighting methods have been applied for web page categorization
using TF-IDF based methods [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors used unigrams and patterns automatically collected from the training
set to detect whether a tweet is offensive or not (binary classification).
These patterns and unigrams are used, among others, as features to train a machine
learning algorithm. This method does not consider platforms like Facebook. In
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], traditional machine learning techniques were applied with SVM for URL based web page
classification. In many recent works, deep learning algorithms such as Convolutional Neural
Networks and Recurrent Neural Networks are explored for various applications [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Sentiment
analysis in English is a very common problem, and many approaches have been suggested by various
researchers applying both machine learning and deep learning techniques [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. An overview
of the various works submitted as part of the HASOC FIRE’20 track is given in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Ensemble based
approaches were also reported in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for offensive speech detection. A novel relevance factor
has been proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to address the challenges in code-mixed Hindi-English tweets. In
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Hindi-English code-mixed hate speech detection is performed using character-level
embedding. The authors used different deep learning models, of which the Attention Model
with GRU hybridisation performed the best. Instead of word embeddings, the authors
used character-level embeddings, which made the model adaptable to the common defect. In
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the authors applied a Multi Input Multi Channel Transfer Learning based model to detect
hate speech or abusive Hinglish tweets from the Hinglish Offensive Tweet (HOT) dataset. The
authors used transfer learning with multiple feature inputs. They achieved peak performance
using the SVM model. The work in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] reports an improvement over the state of the art in hate
speech detection for Hindi-English code-mixed tweets. The authors compared three typical deep
learning models using domain-specific embeddings. However, the proposed method does not
take code-mixed data into consideration. In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the authors performed hate speech detection
on datasets of two Dravidian languages, Tamil and Malayalam. They obtained an F1
score of 0.77, and similarly, for the Tamil language, the best performing model obtained an F1
score of 0.87.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset was obtained as part of the HASOC ’21 competition at FIRE 2021. It
consists of tweets sampled from the Twitter social media platform in
English, Hindi and Marathi. For this work, the English tweets were disregarded. The
hashtags, URLs and keywords prevalent in regular tweets are also present in this data. For
Subtask A, the dataset was broadly divided into two categories: NOT tweets, which contained
non-offensive content, and HOF tweets, which contained offensive or profane content. The HOF
tweets in the Hindi tweet data were further divided into Hate Speech (HATE), Offensive (OFFN)
and Profane (PRF) as part of Subtask B. The dataset was provided as a CSV file with each entry containing
the ID, Tweet ID, Text and corresponding labels. The Hindi tweet data consisted of 4594 entries,
while the Marathi data consisted of 1874 entries.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>The proposed methodology for all the subtasks is detailed in the following sections.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>In preprocessing, the first step was noise removal. URL links
and Twitter handle names were the primary sources of noise in the dataset. Twitter handle
names were removed by using regular expressions to delete every word that followed an @ symbol,
and URL links were removed by using regular expressions to delete
anything that followed HTTP/HTTPS. Then stop words such as a, an, the and is, followed by
punctuation marks and special characters, were removed. Stop words are not necessary to
find a statement’s sentiment as they do not carry any significant meaning. Likewise, punctuation
and special characters cannot be used to identify the sentiment.</p>
        <p>We removed these as they are unnecessary, so that we can work solely on the essential characters
and words for better performance. Further, stemming is performed to reduce inflection and
bring words to their respective base forms. Stemming helps deal with
misspelt words and removes unwanted suffixes, which are very common in social media text. We
used the Snowball stemmer from the NLTK package, which is based on a string processing
language called "Snowball". Finally, we removed high-frequency words: based on
term frequency, the 1000 most frequently occurring words were removed.</p>
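        <p>The noise-removal steps above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the authors' code: the name clean_tweet, the regular expressions, the four-word stop list and the added Devanagari danda characters are hypothetical choices, and stemming and high-frequency word removal are left out.</p>
        <preformat>
```python
import re
import string

# Illustrative stop list only; a real list would be language specific.
STOP_WORDS = {"a", "an", "the", "is"}

# ASCII punctuation plus the two Devanagari danda marks (an assumption).
PUNCTUATION = string.punctuation + "\u0964\u0965"
PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def clean_tweet(text):
    """Remove handles, URLs, punctuation and stop words from one tweet."""
    text = re.sub(r"@\S+", " ", text)  # every word that follows an @ symbol
    text = re.sub(r"https?://\S+", " ", text, flags=re.IGNORECASE)  # URL links
    text = text.translate(PUNCT_TABLE)  # punctuation and special characters
    tokens = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(tokens)


print(clean_tweet("@user This is the BEST day! http://t.co/x #fun"))
```
        </preformat>
        <p>A translation table is used for punctuation rather than a character class built on \w, because Python's \w does not match Devanagari vowel signs and would therefore damage Hindi and Marathi words.</p>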
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature Extraction</title>
        <p>
          The Scikit-Learn [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] library has several vectorizers that tokenize and process text in a single
function, converting words into integer counts. After the preprocessing
of data, the clean tweets are obtained. By treating these as a sequence of words, the features
were extracted. We used a count vectorizer to extract features for the machine learning models.
CountVectorizer: We looped over the n-gram size to create unigrams, bigrams, trigrams and so on,
up to n = 5. The CountVectorizer tokenizes the text by breaking each sentence into words and builds
a vocabulary, which is then used to encode new texts. For n-grams of size 1 to 5, each text is broken
down into word sequences, and the number of times each one appears is returned in
the encoded vector.
        </p>
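        <p>What a count vectorizer computes for word n-grams can be illustrated in plain Python. This is only a sketch, assuming whitespace tokenization and a single vector that pools all n-gram sizes from 1 to max_n; the function name ngram_counts is hypothetical, and Scikit-learn's own tokenization rules differ in detail.</p>
        <preformat>
```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count every word n-gram of size 1..max_n in a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the text.
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


counts = ngram_counts("good day good day", max_n=2)
print(counts)
```
        </preformat>
        <p>Each distinct n-gram becomes one feature, so the vocabulary grows quickly as the maximum n-gram size increases.</p>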
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Classifiers</title>
        <p>The process by which a group of data is divided into a set of categories is known as classification.
This process can be performed on both unstructured and structured data. The most crucial step
in this process is predicting the category by performing some calculations on the data points
provided. This proposed model uses the following classifiers: Logistic Regression, Support
Vector Machine, Stochastic Gradient Descent, Random Forest, Ensemble, and XGBoost.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Binary Classification</title>
          <p>Classifying the labels of a collection into two groups using a classification model is known as
binary classification. In most binary classification problems, one class represents the normal
state, and the other represents the aberrant state. The regular state class is assigned the class
label 0, while the aberrant state class is labelled 1. Logistic Regression, k-Nearest Neighbours,
Decision Trees, Support Vector Machines, and Naive Bayes are the most prominent binary
classification techniques. In this research, subtask A consists of binary classification with the
labels Hate and offensive (HOF) speech or Not hate speech (NOT).</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Multiclass Classification</title>
          <p>Classifying the labels of a collection into three or more groups using a classification algorithm
is known as multiclass classification. In most multiclass classification problems, the different
classes represent the various labels that are to be assigned to each set of input variables.
Generally, when label encoding is done for multiclass problems, each class is converted into a
binary indicator. Having multiple binary output variables instead of a single multi-valued output
simplifies the process of training. Logistic Regression, Support Vector Machines, K Nearest
Neighbours, Naive Bayes and Decision Trees are some of the most prominent classification
techniques that can also be extended to multiclass problems. In this research, subtask B consists
of multiclass classification with the labels Hate Speech (HATE), Offensive (OFFN) and Profane
(PRF).</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Stochastic Gradient Descent</title>
        <p>Gradient descent is an iterative procedure that begins at a random position on the slope of a
function and gradually descends until it reaches its lowest point. Stochastic Gradient Descent (SGD) is
an optimisation approach that takes one data point at random from the entire data set at each
iteration, which drastically reduces computation. SGD is easy to implement and
an efficient way to find the minimum of the error. Because of the randomness of its
descent, SGD usually takes more iterations to reach the minimum than normal gradient descent,
but it is still computationally much less
expensive. Stochastic Gradient Descent has been used in both subtasks A and B. This model
was chosen because it takes very little time to train. That, together with
the fact that this model is more likely to reach a local minimum of the error function than
other models, makes it an interesting prospect. The SGD model is loaded from Python’s
Scikit-learn package. The model is then trained on the training data, and the results are verified and
documented.</p>
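        <p>The idea of updating on one randomly chosen data point per iteration can be shown on a toy problem. The sketch below fits a single slope by SGD on squared error; it is not the classifier used in this work, and the function name sgd_fit_slope, the learning rate, the iteration count and the data are all assumptions chosen for illustration.</p>
        <preformat>
```python
import random


def sgd_fit_slope(points, learning_rate=0.01, n_iterations=2000, seed=0):
    """Fit y = w * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_iterations):
        x, y = rng.choice(points)  # one random data point per iteration
        gradient = 2.0 * (w * x - y) * x  # derivative of (w * x - y) ** 2
        w -= learning_rate * gradient
    return w


# Noise-free toy data generated from y = 2 * x.
points = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w = sgd_fit_slope(points)
print(w)
```
        </preformat>
        <p>Each step costs one gradient evaluation instead of a pass over the whole data set, which is where the reduction in computation comes from.</p>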
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Logistic Regression</title>
        <p>Logistic Regression is a popular machine learning algorithm that is widely used as a classifier.
Its origins can be traced to statistical mathematics.
The model is trained on data where the dependent variables are categorical. It is derived from
the Linear Regression model. The output obtained from the Linear Regression model is passed
through a logistic function, which predicts the target variable. This model makes use of a decision
boundary: it sets a threshold that differentiates one class from another. The linearity
or non-linearity of the decision boundary depends on the input variables. The activation
function used is a sigmoid function, which constrains the output to lie between
0 and 1. Hence, the output becomes an estimated probability, which gives the final
class prediction when subjected to the decision boundary.</p>
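        <p>The mapping from a linear score to a class label can be sketched as follows. This is an illustration only: the function names, the example weights and the threshold of 0.5 are assumptions and are not taken from the trained models.</p>
        <preformat>
```python
import math


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1) without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, features, threshold=0.5):
    """Return the estimated probability of class 1 and the resulting label."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear output
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0  # apply the decision boundary
    return probability, label


print(predict([1.0, -1.0], 0.0, [2.0, 2.0]))
```
        </preformat>
        <p>Moving the threshold away from 0.5 trades one kind of error for the other without retraining the model.</p>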
        <p>
          This model was chosen because it is one of the simplest and most basic classifiers
that can be implemented, and it provides a good baseline for comparison against the other
approaches. The Scikit-Learn [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] library for Python provides a Class for initializing a Logistic
Regression Model. It also provides functions for training the model and obtaining predictions
from it. The vectorized input variables and the corresponding labels constitute the training data
on which the Logistic Regression model is loaded and trained. After training the model, it is
validated using the validation data set, and the results are recorded.
        </p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Support Vector Machine</title>
        <p>A support vector machine (SVM) is a machine learning technique used for
classification and regression analysis. It is a supervised learning algorithm that
sorts data into two groups; its job is to determine which category a
new data point belongs to. As a result, SVM is classified as a non-probabilistic binary linear classifier. An SVM’s
output is a map of the sorted data with a separation between both sides. The SVM method
represents each piece of data as a point in n-dimensional space, where the value of each feature is the
value of the corresponding coordinate. The classification is then completed by selecting
the hyper-plane that best differentiates the two classes. When the classes are linearly separable,
creating a linear hyper-plane between them is easy. The SVM kernel is a function that transforms
a low-dimensional input space into a higher-dimensional space to convert a non-separable
problem into a separable one. SVM is efficient in high-dimensional spaces and works well with
a clear separation margin. In this research, SVM is used in both subtasks. This model was
chosen because it works extremely well in cases where the number
of features is large compared to the size of the training set. This comes in handy when dealing with data
vectorized with TF-IDF, where a large number of features are extracted. It is also highly memory
efficient. Python’s Scikit-learn package is used to load the SVM model. After that, the model is
trained on the training data, and the results are verified and documented.</p>
      </sec>
      <sec id="sec-4-7">
        <title>4.7. Ensemble - Majority Voting</title>
        <p>When individual classifiers do not perform well on a given dataset, it is common practice to
combine the models in a process called Ensembling. In most cases, this combined model
performs as well as or even better than the individual classifiers. One such Ensembling
technique explored for this task was Majority Voting. It is an ensemble technique where the
predictions from various weak classifiers for a given set of input variables are considered votes.
These votes are tallied together, and the final outputs are obtained from the tallied votes. In the
current Binary Classification problem, the same process occurs. The predictions of all the weak
learners are considered as votes. The number of votes obtained by every output class is tallied.
The output class that gathers the highest number of votes is considered the output for that
particular set of input variables.</p>
        <p>The combining classifiers used for this task are Support Vector Machine, Logistic Regression,
Stochastic Gradient Descent, K Nearest neighbours, Naive Bayes, Random Forest and Decision
Tree. Instances of each of the above models are initially trained on the Vectorized input data
and their corresponding labels. Then the validation data is passed to all seven models, and the
predictions are obtained. Majority voting is performed on these predicted labels, and the final
predicted labels from Ensembling are obtained. These predicted labels are compared with the
actual labels to obtain the performance metrics of the model. A similar process is also carried
out for the test dataset.</p>
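        <p>The vote tallying described above can be sketched in a few lines of Python. The function name majority_vote and the three small prediction lists are hypothetical; an odd number of voters is assumed so that a binary vote cannot tie.</p>
        <preformat>
```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label lists (one per classifier) into one voted label list."""
    voted = []
    # zip(*...) groups the predictions of all models for the same tweet.
    for votes in zip(*predictions_per_model):
        winner, _ = Counter(votes).most_common(1)[0]
        voted.append(winner)
    return voted


model_predictions = [
    ["HOF", "NOT", "HOF"],  # classifier 1
    ["HOF", "HOF", "NOT"],  # classifier 2
    ["NOT", "NOT", "HOF"],  # classifier 3
]
final_labels = majority_vote(model_predictions)
print(final_labels)
```
        </preformat>
        <p>With seven constituent classifiers and two classes, as in subtask A, at least four votes always agree, so no tie-breaking rule is needed.</p>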
      </sec>
      <sec id="sec-4-8">
        <title>4.8. XGBoost</title>
        <p>XGBoost is a popular machine learning algorithm that works as a decision tree-based ensemble
using a gradient boosting framework. In this work, the XGBoost model takes the text as input
and is trained to predict its label. Bagging and boosting are the most
commonly used ensemble learning techniques. The XGBoost classifier is robust and yields
efficient results even in distributed environments. XGBoost provides a wrapper class for the
Scikit-learn framework that allows its models to be treated like Scikit-learn classifiers or regressors,
so the entire Scikit-learn library can be used with XGBoost models. Ensemble tree
methods such as XGBoost and Gradient Boosting Machines apply the principle of boosting
weak learners using a gradient descent architecture. This model was chosen because it works well with a large
number of texts in the training dataset. As a further advantage, XGBoost
has a built-in cross-validation method that runs at each boosting iteration, so the exact
count of boosting iterations does not need to be specified for a single run.</p>
      </sec>
      <sec id="sec-4-9">
        <title>4.9. Random Forest</title>
        <p>Random Forest is an advanced machine learning technique that may be applied to various tasks,
including regression and classification. It is an ensemble method: a random forest
is a combination of many small decision trees, called estimators, each of which makes
its own predictions. To provide a more accurate prediction, the random forest model integrates the
estimators’ predictions. The problem with standard decision tree classifiers is that they are prone
to overfitting the training set. The ensemble design of the random forest compensates for this
and allows it to generalise effectively to unseen data, including data with
missing values. Random forests can also handle enormous datasets with high dimensionality
and several feature types. While growing the trees, the random forest adds more randomness
to the model. When splitting a node, it looks for the best feature within a random subset of
features rather than the best feature overall. As a result, there is much diversity among the trees, leading to a better
model. Random forest is used in both subtasks in this study. The Random forest model is loaded
using Python’s Scikit-learn package. The model is then trained using the training data, with the
results being validated and reported.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Marathi - Subtask - A</title>
        <p>A comparison was performed on various models that were trained on the Marathi dataset. The
different classifiers used after cleaning and vectorising the Marathi tweet dataset are Logistic
Regression, XGBoost, Support Vector Machine, Stochastic Gradient Descent, Majority Voting
and Random Forest. A comparison of the performance metrics of all the different classifiers
used can be viewed in Table 2. As the table shows, the highest performance was obtained with
the Random Forest classifier model, which attained an F1 score of 0.7519. Out of all the
classification models implemented, the Logistic Regression model achieved the lowest performance,
with an F1 score of 0.6924. This can be explained by the fact that the Logistic Regression algorithm does
not work well with non-linear data. It also does not cope well with outliers in the training data.
Since the data considered here is mostly text, it is reasonable to assume that it contains multiple
outliers.</p>
        <p>The Logistic Regression algorithm also finds it challenging to accommodate the distributed data,
which might explain its diminished performance. As expected, the Majority Voting technique
performs better than its weak constituent learners due to reasons explained in section 4.7.
However, it can be observed that the usage of Random Forests improves upon this technique. This
can be explained by the fact that Random Forest is by itself an Ensembling algorithm. Tree type
algorithms are generally better than probabilistic models at handling outliers and noise in the
input data. They are also less impacted by the noise. A unique feature of the Tree type models,
and especially Random Forests, is the innate ability to handle missing data independently. The
embedded data is also non-linear. This negatively affects the probabilistic models, which
thrive on linear data. Random Forests are not impacted by non-linear data as much as the
probabilistic models. The comparison between all the different models used and the F1 score
procured by them is illustrated in Figure 1.</p>
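        <p>For reference, the macro F1 score used throughout this section is the unweighted mean of the per-class F1 scores. The sketch below shows the computation on four made-up labels; the function name macro_f1 and the example data are hypothetical and are not drawn from the HASOC dataset.</p>
        <preformat>
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


y_true = ["HOF", "HOF", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```
        </preformat>
        <p>Because every class contributes equally to the mean, a model that ignores a minority class is penalised even if its overall accuracy is high.</p>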
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Hindi - Subtask - A</title>
        <p>A comparison of several models trained on the Hindi dataset was carried out. After cleaning
and vectorising the Hindi tweet dataset, the following classifiers were used: XGBoost,
Logistic Regression, Stochastic Gradient Descent, Support Vector Machine, Majority Voting, and
Random Forest. Table 2 contains a comparison of the performance metrics of all the
classifiers utilised. The Random Forest classifier model produced the best results, as seen in
the table, with an F1 score of 0.7312. The XGBoost model performed the
worst out of all the classification models tested, with an F1 score of 0.6628. This
low performance could be explained by the fact that the boosting algorithm does not work well with
distributed data. It also copes poorly with anomalies in the training data. Given that the data
in question is largely text, it is reasonable to presume that it contains several outliers.
As predicted, the Majority Voting approach outperforms its weak learners, for reasons stated in
section 4.7. However, it can be seen that using Random Forests improves on this method. This
may be explained by the fact that Random Forest is an Ensembling algorithm in and of itself.
In general, tree-type algorithms outperform probabilistic models in dealing with outliers and
noise in input data. They are also less influenced by noise. The capacity to manage missing
data on its own is a specific property of Tree type models, particularly Random Forests. The
embedded data is also non-linear, which harms the probabilistic models because they
thrive on linear data. Random Forests are less affected by non-linear data than probabilistic
models. Figure 2 depicts a comparison of all the different models employed and the F1 scores
obtained with those classifiers.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Hindi - Subtask - B</title>
        <p>Several machine learning models were trained on the Hindi dataset and then compared. The
input data was obtained after cleaning the tweets and vectorising them. The machine
learning algorithms used are XGBoost, Logistic Regression, Stochastic Gradient Descent,
Support Vector Machine, Majority Voting and Random Forest. A comparison of performance metrics
for all the above models is shown in Table 2. For this subtask, XGBoost performed best,
with the highest macro F1 score of 0.4658, followed by the Ensemble model with a macro F1 score of 0.4425.
The model with the lowest performance was the Random Forest, with a macro F1 score
of 0.405. This can be attributed to the fact that as the number of features increases, the Random Forest
algorithm becomes more prone to overfitting. After Random Forest, the next lowest performance
was shown by the Support Vector Machine, with a macro F1 score of 0.4225. This
can be explained by the fact that the Support Vector Machine does not perform well on
large datasets or datasets with more noise. This is especially true of text data, as it is prone to
the presence of noise.</p>
        <p>For multiclass classification, XGBoost performs well. At each iteration of the boosting
process, the XGBoost algorithm runs cross validation, and the optimum number
of boosting iterations is obtained in a single run. This leads to good results. The Majority Voting
model performed next best after XGBoost due to the reasons explained in section 4.7. Therefore, for the
Hindi subtask B dataset, XGBoost produced the best performance and Random Forest exhibited
the lowest performance. Figure 3 illustrates the macro F1 scores obtained by the different models
in subtask B. Table 1 shows the various submissions made to the FIRE2021 shared task, Hate
Speech and Offensive Content Identification in English and Indo-Aryan Languages
(HASOC-2021), along with their subtask names, classifiers and macro F1-scores. We have extended the
experiments by applying other classifiers to try to improve the performance; the final results
are tabulated in Table 2.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Scope</title>
      <p>This work was submitted to the FIRE2021 shared task, Hate Speech and Offensive Content
Identification in English and Indo-Aryan Languages (HASOC-2021). The challenge of recognising
conversational hate speech in Indian regional languages has been empirically studied in this
research using binary and multiclass classifiers for Marathi and Hindi tweets. This research
has presented a complete method, from analysing the tweets to interpreting the results. In
subtask A, Random Forest showed the best macro F1-scores of 75.19% and 73.12% for
Marathi and Hindi tweets respectively. In subtask B for Hindi tweets, XGBoost produced the
best result, with a macro F1-score of 46.58%. Overall, it can be inferred
that Random Forest works well for binary classification, whereas XGBoost is a better classifier
model for multiclass classification. In this work, only one type of feature extraction method has
been implemented. In the future, other types of embeddings can be implemented and
combined with deep learning algorithms to obtain a better model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech</article-title>
          ,
          <source>in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021</source>
          , ACM,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages</article-title>
          ,
          <source>in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>
          , CEUR,
          <year>2021</year>
          . URL: http://ceur-ws.org/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gaikwad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Homan</surname>
          </string-name>
          ,
          <article-title>Cross-lingual offensive language identification for low resource languages: The case of marathi</article-title>
          ,
          <source>in: Proceedings of RANLP</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <article-title>Supervised term weighting methods for url classification</article-title>
          ,
          <source>Journal of Computer Science</source>
          <volume>10</volume>
          (
          <year>2014</year>
          )
          <fpage>1969</fpage>
          -
          <lpage>1976</lpage>
          . doi:10.3844/jcssp.2014.1969.1976.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouazizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohtsuki</surname>
          </string-name>
          ,
          <article-title>Hate speech on twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection</article-title>
          ,
          <source>IEEE Access</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>13825</fpage>
          -
          <lpage>13835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aravindan</surname>
          </string-name>
          ,
          <article-title>An effective and discriminative feature learning for url based web page classification</article-title>
          ,
          <year>2018</year>
          . doi:10.1109/SMC.2018.00240.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Karthik</surname>
          </string-name>
          ,
          <article-title>Design of kids-specific url classifier using recurrent convolutional neural network</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>167</volume>
          (
          <year>2020</year>
          )
          <fpage>2124</fpage>
          -
          <lpage>2131</lpage>
          . doi:https://doi.org/10.1016/j.procs.2020.03.260.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <article-title>Autonomous driving system with road sign recognition using convolutional neural networks</article-title>
          ,
          <source>in: 2019 International Conference on Computational Intelligence in Data Science (ICCIDS)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi:10.1109/ICCIDS.2019.8862152.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganganwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <article-title>Implicit aspect extraction for sentiment analysis: A survey of recent approaches</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>165</volume>
          (
          <year>2019</year>
          )
          <fpage>485</fpage>
          -
          <lpage>491</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1877050920300181. doi:https://doi.org/10.1016/j.procs.2020.01.010.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sivakumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. R</surname>
          </string-name>
          ,
          <article-title>Analysis of sentiment on movie reviews using word embedding self-attentive lstm</article-title>
          ,
          <year>2021</year>
          . doi:10.4018/IJACI.2021040103.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Kumar</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Overview of the hasoc track at fire 2020: Hate speech and offensive language identification in tamil, malayalam, hindi, english and german</article-title>
          ,
          <source>in: Forum for Information Retrieval Evaluation, FIRE 2020</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>29</fpage>
          -
          <lpage>32</lpage>
          . URL: https://doi.org/10.1145/3441501.3441517. doi:10.1145/3441501.3441517.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>Dlrg@hasoc 2019: An enhanced ensemble classifier for hate and offensive content identification</article-title>
          ,
          <source>in: FIRE</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <article-title>Borrowing likeliness ranking based on relevance factor</article-title>
          ,
          <source>in: Proceedings of the Fourth ACM IKDD Conferences on Data Sciences, CODS '17</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          . URL: https://doi.org/10.1145/3041823.3067694. doi:10.1145/3041823.3067694.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Rahul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sehra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. R.</given-names>
            <surname>Vardhan</surname>
          </string-name>
          ,
          <article-title>Hindi-english code mixed hate speech detection using character level embeddings</article-title>
          ,
          <source>in: 2021 5th International Conference on Computing Methodologies and Communication (ICCMC)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1112</fpage>
          -
          <lpage>1118</lpage>
          . doi:10.1109/ICCMC51019.2021.9418261.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sawhney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ayyar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Did you offend me? classification of offensive tweets in hinglish language</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>138</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>Hate speech detection from code-mixed hindi-english tweets using deep learning models</article-title>
          ,
          <source>arXiv preprint arXiv:1811.05145</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mundada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>Kbcnmujal@hasoc-dravidian-codemix-fire2020: Using machine learning for detection of hate speech and offensive code-mixed social media text</article-title>
          ,
          <source>arXiv preprint arXiv:2102.09866</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>VanderPlas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , E. Duchesnay,
          <article-title>Scikit-learn: Machine learning in python</article-title>
          ,
          <source>CoRR abs/1201.0490</source>
          (
          <year>2012</year>
          ). URL: http://arxiv.org/abs/1201.0490. arXiv:1201.0490
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , E. Duchesnay,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>