<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>One to Rule Them All: Towards Joint Indic Language Hate Speech Detection.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mehar Bhatia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tenzin Singhay Bhotia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshat Agarwal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prakash Ramesh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shubham Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kumar Shridhar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Laumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ayushman Dash</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NeuralSpace</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2021 shared task. Social media today is a hotbed of toxic and hateful conversations in various languages. Recent news reports have shown that current models struggle to automatically identify hate posted in minority languages. Efficiently curbing hate speech is therefore a critical challenge and problem of interest. Our team, 'NeuralSpace', presents a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection across three languages, namely English, Hindi, and Marathi. On the provided test corpora, we achieve Macro F1 scores of 0.7996, 0.7748, and 0.8651 for sub-task 1A, and 0.6268 and 0.5603 for the fine-grained classification of sub-task 1B. These results show the efficacy of exploiting a multilingual training scheme.</p>
      </abstract>
      <kwd-group>
<kwd>Hate Speech</kwd>
        <kwd>Social Media</kwd>
        <kwd>Indic Languages</kwd>
        <kwd>Low Resource</kwd>
        <kwd>Multilingual Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Since the proliferation of social media users worldwide, platforms like Facebook, Twitter, and
Instagram have suffered from a rise in hate speech by individuals and groups. A large-scale
study on Twitter and Whisper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] empirically shows the prevalence of abusive comments and
toxic language on these platforms, targeting users mostly based on race, physical features, and
gender.
      </p>
      <p>
        A Bloomberg article [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] reports that users have even found new ways of bullying others
online using euphemistic emojis. Widespread use of such abusive language on social media
platforms often causes public embarrassment to victims leading to major repercussions. Recently,
Twitch filed a lawsuit against two users who targeted LGBTQ+ and Black streamers with hate
speech [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One week later, content creators boycotted the game-streaming platform over its
inability to control the hateful content. Online hate speech is growing, and is often anonymous,
overwhelming, and unmanageable for human moderators. It is therefore essential for
social media platforms to control abuses of users’ freedom of expression and maintain an
inclusive and respectful society. To enforce such supervision, online platforms must
develop monitoring systems that can identify hate speech amongst the billions of text comments
posted by users.
      </p>
      <p>
        There have been research contributions in solving the problem of identifying abusive
comments or other forms of toxic language [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ]. However, most of them have mainly focused
on high-resource languages, predominantly English. As social media connects people from all
over the world, communicating in different languages, much of the potentially hateful content
appears in a multilingual setting. The failure to pay attention to non-English languages has
allowed such offensive speech to flourish. The lack of datasets and models for various
low-resource languages has made the task of hate speech identification extremely difficult. In this
paper, we present our findings on a subset of Indic low-resource languages.
      </p>
      <p>The HASOC (Hate Speech and Offensive Content) 2021 challenge has been organized as a
step in this direction in three languages: English, Hindi, and Marathi. Figure 1
illustrates the HASOC 2021 problem statement. We focus on sub-tasks 1A and 1B of this competition,
which we describe in the following paragraphs.</p>
      <p>Sub-task 1A focuses on hate speech and offensive language identification in English, Hindi,
and Marathi. It is a simple binary classification task in which participating systems are required
to classify tweets into one of two classes, namely:
• (HOF) Hate and Offensive: Posts of this category contain hate, offense, profanity,
or a combination of them.
• (NOT) Non-Hate and Offensive: Posts of this category do not contain any hate speech,
profane or offensive content.</p>
      <p>Sub-task 1B is a multi-class classification task in English and Hindi. In this task, hate speech
and offensive posts from sub-task 1A are further classified into the following three categories.
• (HATE) Hate speech: Posts of this category contain hate speech content.
• (OFFN) Offensive: Posts of this category contain offensive content.</p>
      <p>• (PRFN) Profane: Posts of this category contain profane words.</p>
      <p>In this paper, we make the following contributions:
• A pre-processing pipeline for modeling hate speech in tweet-domain text.
• A joint fine-tuning procedure that empirically outperforms other approaches in
hate speech detection.
• A summary of different approaches that did not work as expected.
• The implementation and idea behind our winning approach for one of the HASOC 2021
sub-tasks.</p>
      <p>In the forthcoming sections, we give a brief overview of past approaches as related work
in section 2. Then, we present a detailed description of the statistics of datasets used in
section 3. We present our approach in section 4, delineating our pre-processing steps and
model architecture. We highlight our model hyperparameters and other experimental details
in section 5. Later, in section 6, we display our final results and elaborate on various other
approaches that did not work well in section 7. We end with our conclusion and point to future
work in section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In the past, there have been many approaches to tackle the problem of hate speech
identification. Kwok and Wang [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] experimented with a simple bag-of-words (BOW) approach
to identify hate speech. While lightweight, these models performed poorly, with high
false-positive rates. Including core natural language processing (NLP) features such as
part-of-speech tags [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and N-gram graphs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] has helped improve performance.
Lexical methods using TF-IDF with an SVM as the classification model have achieved surprisingly good
performance [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        With the rise of distributed word representations, researchers have leveraged
word embeddings like GloVe [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and FastText [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to embed discrete text into a latent
space, improving performance over standard BOW and lexical approaches.
      </p>
      <p>
        For many years, recurrent neural networks (RNNs) were the de facto approach for
tackling natural language problems. The winning approach at the 2020 HASOC competition
for Hindi [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] used a one-layer BiLSTM with FastText embeddings to identify hate speech.
Similarly for English, the most accurate model [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] used an LSTM with GloVe embeddings to
represent text inputs. Mohtaj et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] also used a character-based LSTM following a similar
trend.
      </p>
      <p>
        In recent times, self-attention-based transformer [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] models, and language models built on encoders
pre-trained on huge corpora, such as BERT [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] have shown more promise than standard
RNNs for most NLP tasks. Many researchers have found BERT-like models to perform
much better than other approaches, largely due to their transfer-learning prowess [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
While there has been extensive research on hate speech in general, experiments focusing
specifically on low-resource languages remain less common. Simple logistic regression using LASER
embeddings has been shown to perform better than BERT-based models [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], indicating the
need for more accurate multilingual base language models. Since then, we have witnessed the
rise of multilingual language models like XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. In the following sections, we
delineate our approach to building a hate speech identification solution with XLM-RoBERTa,
along with an exhaustive comparison to other approaches.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Description</title>
      <p>
        Datasets for HASOC 2021 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] for English [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], Hindi [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and Marathi [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] languages were
collected from social media platforms and comprise two sub-tasks. We focus only on the
first task (named subtask1 on the HASOC website) on the Hindi, English, and Marathi datasets,
which is further divided into sub-tasks A and B. As shown in Table 1, each dataset instance
consists of a unique hasoc_id, a tweet_id, the full text of the tweet, and target variables task_1 and
task_2 for sub-tasks 1A and 1B respectively. Sub-task 1A is a binary classification
problem with two target classes, namely HOF (Hate and Offensive) and NOT (Non-Hate-Offensive),
whereas sub-task 1B is a further fine-grained classification. There, the data is classified into
four classes, namely OFFN (Offensive), PRFN (Profane), HATE, and NONE. Sub-task
1A requires us to work with datasets in the English, Hindi, and Marathi languages, whereas only
English and Hindi datasets are available for sub-task 1B. The statistics of both the train and
test data are shown in Table 2 and Table 3.
      </p>
      <p>It can be seen that the datasets are highly imbalanced. For sub-task 1A, we notice that the
number of hate and offensive tweets is almost double that of non-hate-offensive tweets
for English and Marathi. On the other hand, the number of non-hate-offensive tweets is 55%
higher than that of hate and offensive tweets for the Hindi dataset. Similarly, sub-task 1B, which
deals with English and Hindi, also has highly imbalanced datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>In this section, we demonstrate our approach to solving HASOC 2021 sub-tasks 1A and 1B.</p>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <p>For preprocessing the tweet data and hashtags, we use two Python libraries: tweet-preprocessor1
and ekphrasis2, a segmenter built on a Twitter corpus. For the English data, tweet-preprocessor's
clean functionality extracts, cleans, parses, and tokenizes the tweet text. For the Hindi and Marathi
data, the tweets are first tokenized on whitespace and on symbols including colons, commas,
semicolons, dashes, and underscores. Second, we use the tweet-preprocessor library
to remove URLs, hashtags, mentions, emojis, smileys, numbers, and reserved words
(such as RT, which stands for Retweet). We also notice English and Arabic words in the
Hindi and Marathi datasets. We first transliterate this text to the desired language
using NeuralSpace's transliteration tool3. If English or Arabic occurrences remain, we
use the Python library langdetect4 (a re-implementation of Google's language-detection library5
from Java to Python) to extract the pure Hindi and Marathi text within the tweet.</p>
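        <p>The cleaning steps above can be sketched with standard-library regular expressions. This is an illustrative approximation of the behavior we delegate to tweet-preprocessor, not that library's implementation; the helper name and exact patterns are our own.

```python
import re

def clean_tweet(text: str) -> str:
    """Rough stdlib approximation of the tweet cleaning described above:
    strip URLs, @-mentions, the RT marker, '#' symbols, and numbers,
    then collapse whitespace. Hashtag segmentation is handled separately."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # mentions
    text = re.sub(r"\bRT\b", " ", text)        # retweet marker
    text = re.sub(r"#", " ", text)             # keep hashtag words, drop '#'
    text = re.sub(r"\d+", " ", text)           # numbers
    return re.sub(r"\s+", " ", text).strip()
```

In practice the library also handles emojis, smileys, and reserved words, which a few regexes cannot fully cover.
</p>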
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature Extraction</title>
        <p>To extract features for our classifier, we use tweet-preprocessor to supply various information
fields in addition to the cleaned content. The first feature is obtained from the hashtag text,
which is segmented into constituent, meaningful tokens using the ekphrasis segmenter.
Ekphrasis tokenizes the text based on a list of regular expressions. For example, the hashtags
‘#JitegaModiJitegaBharat’, ‘#IPL2019Final’, and ‘#hogicongresskijeet’ are segmented into ‘Jitega Modi
Jitega Bharat’, ‘IPL 2019 Final’, and ‘hogi congress ki jeet‘. Other features are acquired from URLs
within the text, name mentions such as ‘BJP4Punjab’, ‘aajtak’, ‘PMOIndia’, and ‘narendramodi’,
and smileys and emojis. The extracted emojis were processed in two ways.
1https://github.com/s/preprocessor
2https://github.com/cbaziotis/ekphrasis
3https://docs.neuralspace.ai/transliteration/overview
4https://pypi.org/project/langdetect/
5https://github.com/shuyo/language-detection</p>
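        <p>The segmentation behavior described above can be approximated with a single regular expression over case and digit boundaries. This sketch is our own, not the ekphrasis implementation; unlike the corpus-driven ekphrasis segmenter, it cannot split all-lowercase hashtags such as ‘#hogicongresskijeet’.

```python
import re

def segment_hashtag(tag: str) -> str:
    """Split a camel-cased hashtag body on case and digit boundaries,
    e.g. '#IPL2019Final' -> 'IPL 2019 Final'. A crude stand-in for the
    ekphrasis segmenter used in the paper."""
    body = tag.lstrip("#")
    # uppercase acronym before a capitalized word | capitalized/lower word |
    # uppercase run | digit run
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", body)
    return " ".join(parts)
```
</p>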
        <p>
          First, we use the emot6 Python library to obtain the textual description of a particular emoji in the
text. Emot uses advanced dynamic pattern generation. For example, ‘rofl’ maps to
‘rolling-on-the-floor-and-laughing face’ and the ‘speak-no-evil emoji’ to ‘speak-no-evil monkey’.
However, this mapping is not sufficient, as it does not capture what the emoji genuinely
represents. Given that such emojis are so prevalent and that most of them inherently carry
emotion, emojis can give a lot of insight into the sentiment of online text.
For this reason, we also consider emoji2vec [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] embeddings for
1661 emoji Unicode symbols learned from a total of 6088 descriptions in the Unicode emoji
standard. Previous work has demonstrated their usefulness on various tasks,
such as Twitter sentiment analysis [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. For example, consider the ‘pray’ and
‘tipping-hand woman’ emojis, which map to the ‘folded-hands’ symbol and the ‘woman-tipping-hand’ emoji.
The textual representation will not showcase the folded-hands emoji’s real-world associations with
showing gratitude, expressing an apology, or sentiments such as hope, respect, or even a high five.
The person-tipping-hand symbol, on the other hand, is commonly used to express
‘sassiness’ or sarcasm. We expect emoji2vec to capture these kinds of associations.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Proposed Architecture</title>
        <p>
          We leverage Transformer-based [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] masked language models to generate semantic
embeddings for the cleaned tweet text.
        </p>
        <p>
          We use the available training corpora and fine-tune the transformer layers in a multilingual
fashion for our downstream task. We experimented with various multilingual transformer
models, i.e., XLM-RoBERTa (XLMR), mBERT (multilingual BERT), and DistilmBERT
(multilingual DistilBERT). A summary of each model follows:
• XLM-RoBERTa: XLM-RoBERTa is pre-trained on 100 languages, using
around 2.5TB of preprocessed CommonCrawl data to learn cross-lingual
representations in a self-supervised manner. XLM-RoBERTa [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] shows that the use of large-scale
multilingual pre-training models can significantly improve the performance of
cross-lingual transfer tasks.
• mBERT: Multilingual BERT [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] uses Wikipedia data from 102 languages, totaling 177M
parameters, and is trained using two objectives: 1) masked language modeling
(MLM), where 15% of the input is randomly masked, and 2) next sentence prediction.
• DistilmBERT: Distilled multilingual BERT [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] is a distilled version of the above mBERT
model. It is also trained on the concatenation of Wikipedia in 102 different languages and
has a total of 134M parameters. On average, DistilmBERT is twice as fast as mBERT-base.
        </p>
        <p>To solve sub-task 1A in three languages (English, Hindi, and Marathi) and sub-task 1B
in two languages (English and Hindi) at the same time, we adopt these multilingual models.</p>
        <p>As mentioned in Section 4.2, we generate semantic vector representations for all the emojis
and smileys, their respective texts, and the segmented hashtags within a tweet. We encode the
emoji and smiley text embeddings and the hashtag embeddings in the same latent space. To create the
emojis’ semantic embeddings, emoji2vec is utilized. An important point to note is that the
segmented hashtags and the text descriptions of emojis can be of variable length. Hence, we
generate a centralized emoji or hashtag representation by averaging the vector representations,
a simple approach proposed by [27] to produce a comprehensive vector representation
for sentences.
6https://github.com/NeelShah18/emot</p>
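        <p>The averaging step is straightforward mean pooling over the token vectors; a minimal sketch (the function name is our own):

```python
def mean_pool(vectors):
    """Average a variable-length list of equal-dimension token vectors
    (e.g. emoji2vec or hashtag-token embeddings) into one fixed-size
    representation, the simple sentence-embedding baseline of [27]."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```
</p>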
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Details</title>
      <p>We use Hugging Face’s implementations and corresponding pre-trained models of XLM-RoBERTa7,
multilingual BERT8, and multilingual DistilBERT9 in our proposed architecture. Our
architectures, Transformer models with custom classification heads, are implemented in
PyTorch. We use the Adam optimizer for training with an initial learning rate of 2e-4 and a dropout
probability of 0.2, with other hyper-parameters set to their default values. We use a
cross-entropy loss to update the weights. We also use UKPLab’s sentence-transformers library10 to
encode the hashtags and the textual descriptions of the emojis.</p>
      <p>All the fine-tuned language models broadly fall into the following two categories.
• Monolingual: Models fine-tuned on only the respective target language. For instance,
we use only the English train dataset to fine-tune the
model and then infer on the English test set only.
• Multilingual: Models fine-tuned on a combination
of all available languages, irrespective of the target language. For instance, to train a
model for the English target language on sub-task 1A, we combine the train datasets for
all languages (English, Hindi, and Marathi) and then fine-tune the model once. Such a
model may then be used for inference on any given target language. Intuitively, such a
training scheme provides three benefits.</p>
      <p>– It enforces joint modeling of the training distribution for all the given languages.</p>
      <p>Empirically we find this to perform better than individually modeling on respective
language.
– During inference, we only rely on one model to infer instead of a unique model for
each language. An approach that can be extremely compute-eficient for
production.
– We combine naturally occurring human-annotated data to form a larger dataset
of multiple languages. It becomes a promising approach towards resolving poor
model performance due to the data scarcity issue for low-resource languages.
As shown in Table 4 and 5, we empirically observe that a multilingual setting clearly
outperforms the monolingual setting across both the tasks in all three languages irrespective of the
base model. For English sub-task 1, only mBERT and DistilmBERT score below the
monolingual setting, but the diference is not as significant. This experiment suggests that multilingual
training can be a preferred approach in obtaining better-performing models, especially as it
provides a step towards resolving the data scarcity issue for low-resource languages. It will
be interesting to validate the generalizability of this hypothesis on diferent NLP tasks in the
future.</p>
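      <p>The two settings differ only in how the fine-tuning data is assembled; a minimal sketch, with a function name and data layout of our own choosing:

```python
def build_training_set(datasets, target_lang=None, multilingual=True):
    """Assemble fine-tuning data as described above. `datasets` maps a
    language code to a list of (text, label) pairs. Monolingual: keep only
    the target language. Multilingual: concatenate every language and
    fine-tune one shared model, usable for any target language."""
    if multilingual:
        return [ex for examples in datasets.values() for ex in examples]
    return list(datasets[target_lang])
```
</p>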
      <p>All the experiments were carried out on a workstation with one NVIDIA A100-SXM4-40GB
GPU and 12 CPU cores. We use a batch size of 64 throughout. For the initial experiments,
we divided the released training data into a training set and a validation set and conducted
the experiments using accuracy as the performance metric. Finally, we test the performance
of the proposed system on the test set released by the organizers. For these experiments, we
combine all the training and validation data into a single training set and apply our algorithm.
In the multilingual setting, our experiment takes 3.5 hours to train until convergence; in the
monolingual setup, our model takes 1.2 hours.
7
8https://huggingface.co/bert-base-multilingual-uncased
9https://huggingface.co/distilbert-base-multilingual-cased
10https://github.com/UKPLab/sentence-transformers</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>It is observed from Tables 4 and 5 that, for all three languages, XLM-RoBERTa has
outperformed similar multilingual Transformer models such as mBERT (multilingual BERT) and
DistilmBERT (multilingual DistilBERT) on our hate speech detection task. Via the multilingual
approach with XLM-RoBERTa, we observe a minimum absolute gain of 1.63 F1 and 1.20 F1 for
sub-task 1A and sub-task 1B respectively, and a maximum absolute gain of 2.1 F1 and 2.31 F1.
Empirically, such significant improvements suggest the importance of multilingual training over
monolingual training. Notably, the multilingually trained XLM-RoBERTa secured 1st position
among 24 participants on sub-task 1A and 5th position among 34 participants on sub-task 1B of
the HASOC 2021 leaderboard. Securing such high ranks indicates the value of the multilingual
approach and calls for a detailed investigation of this approach on other tasks in future work.</p>
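      <p>For reference, the Macro F1 metric reported throughout can be computed as follows; this is a plain re-implementation of the standard definition, not code from our system:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 for each class, then take the
    unweighted mean, so minority classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```
</p>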
    </sec>
    <sec id="sec-7">
      <title>7. Key Takeaways</title>
      <p>In this section, we provide a checklist of approaches and techniques that we
implemented but that failed to secure competitive positions on the leaderboard. We believe
readers will benefit from this checklist in future work.</p>
      <p>
        To begin with, as the dataset was highly imbalanced across all languages, we performed
SOUP (Similarity-based Oversampling and Undersampling processing), a technique in which
the number of minority-class samples is increased and the number of majority-class samples
is decreased to obtain a balanced dataset. This technique was suggested by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and we
use the balanced data for the classification task. However, compared to our
best-performing model, we see a drop of 5% in accuracy.
      </p>
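      <p>A minimal sketch of the over/undersampling idea (SOUP proper selects samples by similarity; this illustrative version of ours resamples at random and only oversamples minority classes up to the majority count):

```python
import random

def balance(dataset, seed=0):
    """Naive balanced resampling: oversample each minority class (with
    replacement) up to the majority-class count. Only illustrates the
    resampling idea, not similarity-based SOUP."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in dataset:
        by_class.setdefault(label, []).append((text, label))
    target = max(len(v) for v in by_class.values())
    balanced = []
    for samples in by_class.values():
        balanced.extend(samples)
        balanced.extend(rng.choices(samples, k=target - len(samples)))
    return balanced
```
</p>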
      <p>Second, to add more training samples to our multilingual dataset, we use data augmentation
techniques such as back-translation to generate synthetic data. We adopt the ML Translator
API, based on Google's Neural Machine Translation (NMT) system. This translation method
has been widely used because of its simplicity and zero-shot translation. With this method, we
increase our dataset size threefold; however, we do not see any performance gains from
this augmented dataset with our proposed architecture. Moreover, we observe a reduction in
toxicity when using back-translation, possibly resulting in false labels for many
instances.</p>
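      <p>The back-translation step can be sketched as follows; `translate` stands in for any caller-supplied NMT function (e.g. wrapping a translation API), so the helper and its signature are illustrative rather than our exact implementation:

```python
def back_translate(text, translate, pivot="en", src="hi"):
    """Paraphrase a tweet by translating it into a pivot language and back.
    `translate` is a caller-supplied function with signature
    translate(text, source_lang, target_lang)."""
    return translate(translate(text, src, pivot), pivot, src)
```
</p>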
      <p>
        Based on the winning approaches from [28] and [29], we applied different machine learning
algorithms, i.e., random forest and LightGBM, a gradient-boosting framework based on decision
trees. These techniques showed an average drop of 5.3%. We also looked into two different
deep neural network approaches and tested them for all three languages. For the English
model, we used GloVe11 embeddings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for both sub-tasks. This embedding layer is fed into
a CNN model. The architecture comprises two convolutional, two dropout, and two
max-pooling layers, accompanied by a flatten layer and a dense layer. We achieved macro F1 scores
of 0.75 and 0.56 on the HASOC 2021 sub-task 1A and sub-task 1B test sets, respectively. For the Hindi
and Marathi models, we use fastText12 embeddings [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for both sub-tasks. Here, the
embeddings are passed through a bi-directional LSTM and a dropout layer, followed
by a dense layer. We achieve macro F1 scores of 0.74, 0.54, and 0.84 on Hindi sub-tasks 1A
and 1B and Marathi sub-task 1A, respectively. Overall, we conclude that our final proposed
architecture performs best compared to the other approaches across all sub-tasks.
11https://nlp.stanford.edu/projects/glove/
12https://fasttext.cc/docs/en/crawl-vectors.html
      </p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>This work has been submitted to the CEUR 2021 Workshop Proceedings for the task
Identification of Hate and Offensive Speech in Indo-European Languages (HASOC 2021). In this research,
the problem of identifying hate and offensive content in tweets has been experimentally
studied on three different language datasets, namely English, Hindi, and Marathi. We propose a
joint language training approach based on recent advances in large-scale transformer-based
language models and demonstrate our best results. We plan to further explore other novel
methods of capturing social media text semantics as part of future work. We also aim to look
at more accurate data augmentation techniques to handle the data imbalance and enhance
hate and offensive speech detection in social media posts.</p>
      <p>[26] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller,
faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[27] S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings
(2016).
[28] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the HASOC track at
FIRE 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi,
English and German, in: Forum for Information Retrieval Evaluation, 2020, pp. 29–32.
[29] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview
of the HASOC track at FIRE 2019: Hate speech and offensive content identification in
Indo-European languages, in: Proceedings of the 11th forum for information retrieval
evaluation, 2019, pp. 14–17.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mondal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Correa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Benevenuto</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Weber</surname>
          </string-name>
          ,
          <article-title>Analyzing the targets of hate in online social media</article-title>
          ,
          <source>in: Tenth international AAAI conference on web and social media</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>I. Levingston</surname>
          </string-name>
          ,
          <article-title>Racist Emojis Are the Latest Test for Facebook</article-title>
          ,
          <source>Twitter Moderators</source>
          ,
          <year>2021</year>
          . URL: https://www.bloomberg.com/news/articles/2021-09-13/racist-emojis-are-the-latest-test-for-facebook-twitter-moderators.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Speakman</surname>
          </string-name>
          , Twitch Sues Users Who It Alleges Conducted 'Hate Raids',
          <year>2021</year>
          . URL: https://www.forbes.com/sites/kimberleespeakman/2021/09/10/twitch-sues-users-who-it-alleges-conducted-hate-raids/?sh=36407fe87822.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Waseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Hateful symbols or hateful people? predictive features for hate speech detection on twitter</article-title>
          ,
          <source>in: Proceedings of the NAACL student research workshop</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>88</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouazizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohtsuki</surname>
          </string-name>
          ,
          <article-title>Hate speech on twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection</article-title>
          ,
          <source>IEEE Access</source>
          <volume>6</volume>
          (
          <year>2018</year>
          )
          <fpage>13825</fpage>
          -
          <lpage>13835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Al-Dossari</surname>
          </string-name>
          ,
          <article-title>Detection of hate speech in social networks: a survey on multilingual corpus</article-title>
          ,
          <source>in: 6th International Conference on Computer Science and Information Technology</source>
          , volume
          <volume>10</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Hate speech detection: A solved problem? the challenging case of long tail on twitter</article-title>
          ,
          <source>Semantic Web</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>925</fpage>
          -
          <lpage>945</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kwok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Locate the hate: Detecting tweets against blacks</article-title>
          ,
          <source>in: Twenty-seventh AAAI conference on artificial intelligence</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Detecting offensive language in social media to protect adolescent online safety</article-title>
          ,
          <source>in: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, IEEE</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Themeli</surname>
          </string-name>
          ,
          <article-title>Hate Speech Detection using different text representations in online user comments</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rajalakshmi</surname>
          </string-name>
          ,
          <article-title>DLRG@HASOC 2020: A Hybrid Approach for Hate and Offensive Content Identification in Multilingual Tweets</article-title>
          ,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>304</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , GloVe:
          <article-title>Global vectors for word representation</article-title>
          ,
          <source>in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          (
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <article-title>NSIT &amp; IIITDWD@HASOC 2020: Deep learning model for hate-speech identification in Indo-European languages</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>IIIT_DWD@HASOC 2020: Identifying offensive content in Indo-European languages</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohtaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Woloszyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <article-title>TUB at HASOC 2020: Character based LSTM for Hate Speech Detection in Indo-European Languages</article-title>
          ,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>298</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mozafari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Farahbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Crespi</surname>
          </string-name>
          ,
          <article-title>A BERT-based transfer learning approach for hate speech detection in online social media</article-title>
          ,
          <source>in: International Conference on Complex Networks and Their Applications</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>928</fpage>
          -
          <lpage>940</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Aluru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>Deep learning models for multilingual hate speech detection</article-title>
          ,
          <source>arXiv preprint arXiv:2004.06465</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>arXiv preprint arXiv:1911.02116</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages and Conversational Hate Speech</article-title>
          ,
          <source>in: FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December 2021</source>
          , ACM,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nandini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages</article-title>
          ,
          <source>in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation</source>
          , CEUR,
          <year>2021</year>
          . URL: http://ceur-ws.org/.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gaikwad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Homan</surname>
          </string-name>
          ,
          <article-title>Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi</article-title>
          ,
          <source>in: Proceedings of RANLP</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Eisner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bošnjak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>emoji2vec: Learning emoji representations from their description</article-title>
          ,
          <source>arXiv preprint arXiv:1609.08359</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>