Leveraging Transformers for Hate Speech Detection
in Conversational Code-Mixed Tweets
Zaki Mustafa Farooqi, Sreyan Ghosh and Rajiv Ratn Shah
Multimodal Digital Media Analysis Lab, Indraprastha Institute of Information Technology Delhi, India


                                      Abstract
In the current era of the internet, where social media platforms are easily accessible to everyone, people often face threats, identity attacks, hate, and bullying on account of their caste, creed, gender, religion, or even their acceptance or rejection of a notion. Existing work on hate speech detection primarily treats individual comment classification as a sequence classification task and often fails to consider the context of the conversation, which plays a substantial role in determining the author’s intent and the sentiment behind a tweet. This paper describes the
                                      system proposed by team MIDAS-IIITD for HASOC 2021 subtask 2, one of the first shared tasks focusing
                                      on detecting hate speech from Hindi-English code-mixed conversations on Twitter. We approach this
                                      problem using neural networks, leveraging the transformer’s cross-lingual embeddings and further fine-
                                      tuning them for low-resource hate-speech classification in transliterated Hindi text. Our best performing
                                      system, a hard voting ensemble of Indic-BERT, XLM-RoBERTa, and Multilingual BERT, achieved a macro
F1 score of 0.7253, placing us 1st on the overall leaderboard standings.

                                      Keywords
                                      Code-Mixed Languages, Hindi-English, Hate Speech, Transformers, Offensive Tweets




1. Introduction
In today’s world, hate speech is one of the major issues plaguing online social media websites. Platforms like Twitter and Gab make it easier than ever for a person to reach a large audience quickly, which increases the temptation toward inappropriate behavior such as hate speech, damaging the social fabric and posing serious threats that have already led to different types of crimes [1]. Human moderators who manually detect hate speech online have been reported to suffer trauma and mental-health issues. This makes automated hate speech detection a crucial task.
   A majority of the work on hate speech classification is constrained to the English language. The inability of monolingual hate speech classifiers to detect the semantic cues in code-mixed languages necessitates an efficient classifier that can automatically detect offensive content in such languages. Hinglish (Hindi words written in the Roman script instead of the Devanagari script) inherits its grammatical structure from native Hindi, accompanied by many slurs, slang terms, and phonetic variations due to regional

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
zaki19048@iiitd.ac.in (Z. M. Farooqi); gsreyan@gmail.com (S. Ghosh); rajivratn@iiitd.ac.in (R. R. Shah)
https://github.com/zmf0507 (Z. M. Farooqi); https://github.com/Sreyan88 (S. Ghosh); http://midas.iiitd.edu.in/ (R. R. Shah)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
influence. Randomized spelling variations and the multiple possible interpretations of Hinglish words in different contexts make it extremely difficult to handle with automated classification. Another challenge in dealing with Hinglish is the demographic divide between Hinglish users and the total number of active users globally: tweets in Hinglish form only a small fraction of the large pool of tweets generated, necessitating selective methods to process such tweets in an automated fashion.

[Figure 1: a conversation tree consisting of a parent tweet in Devanagari Hindi (an Indian Railways update stating that the 200th train of its Oxygen Express campaign has completed its journey, with more than 12,630 metric tonnes of medical oxygen delivered across the country through over 775 tankers), a Hinglish comment (“@user Aaap sach me doctor ho ki aaise hi farji degree li hai?”), and a Devanagari reply (“फ़र्ज़ी ही होगा”).]
Figure 1: An example of a conversation from the dataset where the parent tweet is not hateful but the comment and reply express implicit hate towards the user who posted the parent tweet. The post describes the amount of oxygen that has been supplied across the whole country. The hateful comment says “Are you a genuine doctor or have you just acquired a fake degree?” while the hateful reply gives an affirmative response by saying “it must be fake”. The green tick represents “NOT” comments while the red cross represents “HOF” or hateful/offensive comments.


   The context of the conversation plays a crucial role in hate speech identification. We acknowledge that a comment categorized as hate speech may not always contain the subject of hate on its own. As shown in Figure 1, agreement or disagreement with a previous comment, or with the overall ideology in the chain of the conversation, might also convey hate towards a particular target group. In addition, systems that can efficiently utilize the entire context of the comment chain, with a holistic understanding of the entire discussion, may also help in detecting “trigger comments”, i.e., non-toxic comments in online discussions that lead to toxic replies, and may implicitly help in mitigating bias in hate speech identification, which remains a long-standing problem in this domain [2, 3].
   Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC) 2021 [4] proposes two subtasks: subtask 1 aims at identifying and discriminating hateful and profane tweets in English, Hindi, and Marathi, while subtask 2 aims at detecting hateful tweets in conversations, primarily in Hindi-English code-mixed text. This paper illustrates our contribution to subtask 2 [5] of HASOC 2021 [4]. We present a system based on an ensemble of transformers to detect hate speech in code-mixed Hindi-English conversations, which achieved first place in the HASOC subtask 2 final leaderboard standings.


2. Related Work
Hate speech detection is a challenging task, with literature spanning techniques such as dictionary-based methods [6], distributional semantics [7], and, more recently, neural network architectures [8]. However, a majority of the work on hate speech detection is constrained to the English language [9, 10, 11, 12, 13, 14, 15, 16], with very limited work on other languages [17, 18, 19, 20, 21] and code-switched text [22, 23, 24, 25, 26]. Although Hinglish has been a major contributor to online hate speech, this area has seen very little work, with recent efforts exploring transformers [27] and author profiling using graph neural networks [25]. Works like [22] use a Convolutional Neural Network (CNN) architecture together with GloVe embeddings and transfer learning, in one of the first attempts to detect online hate speech in Hinglish. Similar to our work, [28] explored XLM-RoBERTa and achieved competitive results on detecting hate speech in Dravidian languages. In the HASOC 2020 shared task, [29] also used XLM-RoBERTa to achieve third rank on the overall task of detecting offensive content in code-mixed Dravidian text. However, we note that XLM-RoBERTa was pre-trained on native-script Hindi text only, and our work differs from theirs in that we also transliterate Romanized Hindi words to Devanagari before tackling hate speech detection.
   Hate speech detection has branched into several sub-tasks such as toxic span extraction [30, 31], rationale identification [32], and hate target identification [20]. Recent advances in NLP have pushed the limits of hate speech identification, with transformers [25] and graph neural networks [33, 25, 34], and with attempts to induce external knowledge by leveraging author profiling [25] or ideology [35]; however, using the context of the conversation remains a challenge, with very little work exploring this problem. Context plays a huge role in hate speech identification, with recent literature exploring both the structure and the effect of context [36, 37]. One interesting and related direction of work, described in [38], builds systems that generate text acting as hate speech interventions in online discussions. Context can both help in detecting trigger comments [39] and implicitly handle bias in hate speech identification, which remains a long-standing problem in this domain [2, 3]. To the best of our knowledge, all of these works are constrained to the English language, with very little work on code-mixed Hindi-English and Hinglish text that considers the context of the conversation.


3. Dataset
The dataset provided for this task contains code-switched Hindi-English as well as Hinglish conversation chains taken from Twitter, as shown in Figure 1. It is a binary classification dataset with two classes, HOF and NOT. HOF denotes a tweet, comment, or reply that contains hateful, offensive, or profane content or supports hate, whereas NOT denotes one that contains no hateful, profane, or offensive content. More details about the dataset can be found in Table 1. We also report Avg. Comments, the average number of comments per parent tweet; every parent tweet has at least one comment. We do not report the average number of replies per comment, since not all comments have replies.

Table 1
Original Dataset Statistics
Data    Total Conversations   HOF    NOT    Parent Tweets   Total Comments   Total Replies   Avg. Comments
Train          5740           2841   2899        82              3778            1880             46
Test           1348            695    653        16               849             483             53




Table 2
Train-Validation-Test Distribution
Data     Total Conversations         HOF(Hateful/Offensive)       NOT (neither Hateful nor Offensive)
Train             4592                         2273                                  2319
 Val              1148                         568                                   580
Test              1348                          695                                   653

Note: Val refers to validation data.


   To feed the data into our model, we flatten the given conversations into unique parent-comment-reply conversation chains. As mentioned earlier, all parent tweets have at least one comment, but not all comments have replies, so a conversation chain might end with a comment. The label assigned to each instance is the label of the final message in its conversation chain (see the sketch below). Table 2 describes the dataset distribution with the validation split used in our experiments in Section 5.2.
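A minimal sketch of this flattening step is shown below, assuming a hypothetical nested layout for each conversation; the field names are illustrative and not the actual format of the HASOC release.

    def flatten_conversation(conv):
        """Expand one conversation tree into parent-comment-reply chains.

        The label of each chain is the label of its final message: the
        reply if one exists, otherwise the comment itself.
        """
        instances = []
        parent = conv["tweet"]  # every parent tweet has at least one comment
        for comment in conv["comments"]:
            replies = comment.get("replies", [])
            if not replies:  # chain ends at the comment
                instances.append(([parent["text"], comment["text"]],
                                  comment["label"]))  # "HOF" or "NOT"
            for reply in replies:  # one chain per reply
                instances.append(([parent["text"], comment["text"],
                                   reply["text"]], reply["label"]))
        return instances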


4. Methodology
Our methodology primarily involves fine-tuning transformer models pre-trained on massive multilingual corpora. The following sections describe the end-to-end approach used in our experiments.

4.1. Data Pre-processing
We first concatenate the tweet with its comments and replies, if present, to form the final text sequence. Our intuition is that this concatenation helps the model understand the context better, especially in cases where the comment or reply is not itself hateful but shows support for a hateful parent tweet. While concatenating the tweets, we insert a new separator token “[SENSEP]” between the tweet, comment, and reply to differentiate one message from the next. After concatenation, we clean the data by removing hashtags, emojis, URLs, and mentions from the tweets. However, we do not remove punctuation or numbers, in order to preserve the syntactic and semantic coherence of the tweets.
   Although the data is code-mixed, containing both Hindi and English text, there are several instances where Hindi text appears in the Roman script. Since our models are pre-trained on corpora in which Hindi appears in the Devanagari script, it is essential to handle Romanized Hindi so that the whole dataset is consistent for training. We therefore transliterate Hindi text from the Roman script to the Devanagari script using the AI4Bharat transliteration library (https://pypi.org/project/ai4bharat-transliteration/). A sketch of this pipeline follows.
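Below is a sketch of the cleaning and transliteration pipeline in Python. It assumes the XlitEngine interface of the ai4bharat-transliteration package (check the package documentation for the exact API) and, for brevity, transliterates every purely Roman-alphabetic token, whereas a complete pipeline would first identify which tokens are Romanized Hindi.

    import re

    # Assumes the XlitEngine interface of ai4bharat-transliteration.
    from ai4bharat.transliteration import XlitEngine

    SEP = " [SENSEP] "
    engine = XlitEngine("hi")  # Roman script -> Devanagari Hindi

    def clean(text):
        text = re.sub(r"https?://\S+", " ", text)                          # URLs
        text = re.sub(r"[@#]\w+", " ", text)                               # mentions, hashtags
        text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)  # emojis
        return re.sub(r"\s+", " ", text).strip()   # punctuation and numbers are kept

    def transliterate(text):
        out = []
        for tok in text.split():
            if re.fullmatch(r"[A-Za-z]+", tok):
                # Simplification: every Roman-script token is transliterated;
                # the real pipeline should skip tokens that are actually English.
                out.append(engine.translit_word(tok, topk=1)["hi"][0])
            else:
                out.append(tok)  # Devanagari (and other) tokens pass through
        return " ".join(out)

    def build_input(chain):  # chain = [parent, comment] or [parent, comment, reply]
        return SEP.join(transliterate(clean(t)) for t in chain)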

[Figure 2: the tweet, comment, and reply are concatenated; Romanized Hindi is transliterated to Devanagari, and hashtags, emojis, and mentions are removed. The cleaned sequence is fed to the Indic-BERT, XLM-RoBERTa, and Multilingual BERT classifiers, whose logits are normalized by a softmax and combined by majority voting (soft/hard voting) into the final HOF/NOT prediction.]

Figure 2: Overview of Ensemble Model. ’Concat’ refers to the concatenation operation. ’Scores’ refers to normalized logits/probability scores.



4.2. Baseline Approach: Fine-tuning Transformer Models
We perform experiments with three different types of transformer models, which are described
below.

    • Indic-BERT [40] is an ALBERT model pre-trained on a massive multilingual corpus covering 12 Indic languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu [40]. The corpus contains about 9 billion tokens.
    • Multilingual BERT (mBERT) is a BERT [41] model pre-trained with a masked language modeling objective on Wikipedia data covering over 100 languages.
    • XLM-RoBERTa (XLM-R) [42] is a transformer-based masked language model pre-trained on Common Crawl data covering about 100 languages. Proposed by Facebook, it is one of the best-performing transformer models for multilingual tasks.

   On top of each pre-trained transformer model, we add a dropout layer followed by a fully connected layer of size two, which takes the transformer’s CLS token representation of size 768 as input. The fully connected layer returns logits for the two classes, i.e., HOF and NOT, which are then passed to a softmax layer to predict the class of the input text. This is shown as part of the ensemble model in Figure 2; a minimal sketch of the head follows.
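A minimal sketch of this classification head, using PyTorch and Hugging Face Transformers; the dropout rate is illustrative, as the paper does not specify it.

    import torch.nn as nn
    from transformers import AutoModel

    class HateSpeechClassifier(nn.Module):
        """Pre-trained encoder + dropout + two-way fully connected layer."""

        def __init__(self, model_name="xlm-roberta-base", num_labels=2):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.dropout = nn.Dropout(0.1)        # illustrative dropout rate
            self.fc = nn.Linear(768, num_labels)  # 768 = hidden size, 2 classes

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            cls = out.last_hidden_state[:, 0]     # CLS token representation
            return self.fc(self.dropout(cls))     # logits for HOF / NOT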


4.3. Our Approach: Model Ensembling
We perform model ensembling on top of the three fine-tuned models described in Section 4.2. Since we have three different transformer models fine-tuned for our task, we combine their outputs using majority voting in two variants: Hard Voting and Soft Voting. We denote the two ensemble models as the Hard Voting Ensemble and the Soft Voting Ensemble. The Hard Voting Ensemble takes the class predictions from each fine-tuned transformer model and selects the class with the maximum number of votes, while the Soft Voting Ensemble takes the class probabilities from each fine-tuned transformer model, sums the probabilities per class, and selects the class with the higher probability sum. Figure 2 shows the end-to-end model pipeline used in our experiments; a sketch of the two voting schemes follows.
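A sketch of the two voting schemes, assuming the per-model softmax scores have been stacked into a single NumPy array:

    import numpy as np

    # probs: shape (n_models, n_samples, 2), softmax scores per model.
    def soft_vote(probs):
        # Sum the class probabilities across models, pick the larger sum.
        return probs.sum(axis=0).argmax(axis=-1)

    def hard_vote(probs):
        # Each model casts one vote (its argmax class); the majority wins.
        votes = probs.argmax(axis=-1)            # shape (n_models, n_samples)
        return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)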


5. Experiments and Results
5.1. Experimental Setup
For fine-tuning our transformer models, we set a learning rate of 2e-5 in all our experiments, with the AdamW [43] optimizer and a linear learning rate scheduler. We train our models with a batch size of 8. Our experiments use the Hugging Face Transformers library [44] for fine-tuning all the pre-trained transformer models. A sketch of this setup follows.
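A sketch of this optimization setup using PyTorch and Hugging Face Transformers; num_warmup_steps=0 is an assumption, since the paper only specifies the scheduler type.

    import torch
    from transformers import get_linear_schedule_with_warmup

    def make_optimizer(model, steps_per_epoch, epochs=10, lr=2e-5):
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=0,                        # assumption
            num_training_steps=steps_per_epoch * epochs,
        )
        return optimizer, scheduler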

5.2. Evaluation with Validation data
To evaluate our models, we split the train dataset in an 80:20 ratio, where 80% of the data becomes the new train data and the remaining 20% becomes the validation data, as shown in Table 2. The split is done in random order while maintaining the same class ratio in both the train and validation data (see the sketch below). We train our models for ten epochs and track the validation loss at each epoch. For reporting results on validation data, we use the model checkpoint from the epoch with the minimum validation loss. We also note this epoch, since we later train on the whole train dataset for that many epochs in Section 5.3.
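A sketch of this stratified split using scikit-learn (a tooling assumption; `chains` and `labels` denote the flattened instances and their labels from Section 3; the random seed is illustrative):

    from sklearn.model_selection import train_test_split

    # chains, labels: flattened conversation chains and their HOF/NOT labels.
    train_x, val_x, train_y, val_y = train_test_split(
        chains, labels,
        test_size=0.2,        # 80:20 split
        stratify=labels,      # preserve the class ratio in both splits
        random_state=42,      # illustrative seed
    )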

Table 3
Results obtained on Validation split
                Model                            F1     Precision     Recall     Accuracy(%)
              Indic-BERT                       0.7150     0.7159      0.7154        71.51
           Multilingual BERT                   0.7438     0.7438      0.7439        74.39
            XLM-RoBERTa                        0.7262     0.7277      0.7268        72.64
 Soft Voting Ensemble (Our approach)           0.7682     0.7687      0.7684        76.82
 Hard Voting Ensemble (Our approach)           0.7621     0.7628      0.7624        76.21
Note: F1, Precision, and Recall are macro scores.


   From Table 3, we can see that our model ensembling approaches, the Soft Voting Ensemble and the Hard Voting Ensemble, outperform the rest of the baseline transformer models, with macro F1 scores of 0.7682 and 0.7621 respectively.
Table 4
Results obtained on Test Data
 Submission                     Model                                                    F1           Precision          Recall   Accuracy(%)
 NA                           Indic-BERT                                               0.6811           0.6881           0.6821      68.47
 NA                        Multilingual BERT                                           0.7031           0.7031           0.7033      70.33
 submit-1                   XLM-RoBERTa                                                0.6970           0.6970           0.6970      69.73
 submit-2        Soft Voting Ensemble (Our approach)                                   0.7223           0.7236           0.7222      72.32
 submit-3       Hard Voting Ensemble (Our approach)                                    0.7253           0.7267           0.7251      72.62

Note: F1, Precision, and Recall are macro scores.



5.3. Evaluation with Test data
Following the results obtained in Section 5.2, we train our models on the whole train dataset for the best number of epochs identified through the minimum validation loss in Section 5.2. These epochs vary from 3 to 6 across the three transformer models discussed in Section 4.2. We submitted test results for three of our models, as marked in Table 4, which reports the results obtained on the test data. The test results follow the same trend as the validation results, except that the Hard Voting Ensemble, with a macro F1 of 0.7253, is slightly better than the Soft Voting Ensemble, with a macro F1 of 0.7223. Overall, the model ensembling technique outperforms Indic-BERT, XLM-RoBERTa, and Multilingual BERT by a significant margin: the highest macro F1 score among the baseline models is 0.7031.


6. Analysis


Figure 3: Macro F1 Scores on Test Data


   Figure 3 compares the results from Table 4. The Indic-BERT model has the lowest F1 score among all the baseline and ensemble models, while the Soft Voting Ensemble and Hard Voting Ensemble models yield better results than all the baseline transformer models. However, merely having a better F1 score is not enough; it is crucial to understand where our approach fails and where it performs better than the baselines.
   We start our error analysis with Table 5, which shows the total number of misclassified samples from each class along with the percentage of samples misclassified per class; this helps us understand in which direction the models make more mistakes. Figure 4 shows the detailed per-class performance of each model. A closer look reveals that Multilingual BERT and XLM-RoBERTa misclassify an almost equal number of samples from the HOF and NOT classes, with percentages ranging from 29.35% to 31.24%. The case is quite different for Indic-BERT, whose misclassification rate for the NOT class is very high at 40.27%, compared to 23.30% for the HOF class. This could be because Indic-BERT is trained on 12 Indian languages and has a less diverse pre-training corpus than the other two transformer models.

Table 5
Total Number and Percentage of Misclassified samples for HOF and NOT classes
Model                   Misclassified HOF   % MR (HOF)   Misclassified NOT   % MR (NOT)
Indic-BERT                      162             23.30            263             40.27
Multilingual BERT               207             29.78            193             29.55
XLM-RoBERTa                     204             29.35            204             31.24
Soft Voting Ensemble            168             24.17            205             31.39
Hard Voting Ensemble            165             23.74            204             31.24

Note: % MR (HOF) and % MR (NOT) refer to the percentage misclassification rate for the HOF and NOT classes respectively, i.e., the number of misclassified samples of a class as a percentage of the total number of samples of that class. These are calculated with 695 HOF samples and 653 NOT samples in the test data.
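For example, Indic-BERT misclassifies 162 of the 695 HOF test samples, giving % MR (HOF) = 162/695 ≈ 23.30%.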


[Figure 4: confusion matrices on test data, reconstructed here in tabular form. Rows are true labels and columns are predicted labels; e.g. NOT→HOF counts NOT samples predicted as HOF.]

Model                   NOT→NOT   NOT→HOF   HOF→NOT   HOF→HOF
Indic-BERT                  390       263       162       533
Multilingual BERT           460       193       207       488
XLM-RoBERTa                 449       204       204       491
Soft Voting Ensemble        448       205       168       527
Hard Voting Ensemble        449       204       165       530

Figure 4: Confusion Matrix on Test Data for Baseline and Ensemble Models


   As far as our ensemble models are concerned, we observe that the misclassification rate is well balanced across both classes. Compared with the baseline transformer models, the ensemble models have the lowest misclassification rate for the HOF class, ranging between 23% and 24% (which was also the case for Indic-BERT), and their misclassification rate for the NOT class is close to the lowest among the baselines. This suggests that model ensembling minimizes the misclassification rate for both classes, and that a single model’s mistake is likely to be corrected by the other two models in the ensemble. However, the ensemble models still make 7-8% more mistakes in identifying the NOT class than the HOF class.


7. Conclusion
In this paper, we deal with the novel problem of detecting hateful tweets in Twitter conversations, where a comment or reply might not itself be offensive or toxic but contributes to the hate associated with the parent tweet. We performed a thorough experimental analysis with state-of-the-art models such as XLM-RoBERTa, Indic-BERT, and Multilingual BERT to show that pre-trained multilingual transformer models can achieve decent performance on this task. We further demonstrated that this performance can be improved with model ensembling techniques such as Soft Voting and Hard Voting. As this problem is addressed here for the first time, there could be many ways to improve these numbers and build a more robust system, for instance by incorporating other factors such as emojis and hashtags, which may equally or partially contribute to the hatefulness of a tweet. Additionally, we aim to explore better architectures for taking the context of a comment, such as the parent tweet and replies, into account when judging the nature of the comment. Author profiling would also be a potential area of research for detecting implicit hate in conversations.


References
 [1] M. L. Williams, P. Burnap, A. Javed, H. Liu, S. Ozalp, Hate in the machine: anti-black and
     anti-muslim social media posts as predictors of offline racially and religiously aggravated
     crime, The British Journal of Criminology 60 (2020) 93–117.
 [2] L. Dixon, J. Li, J. Sorensen, N. Thain, L. Vasserman, Measuring and mitigating unintended
     bias in text classification, in: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics,
     and Society, 2018, pp. 67–73.
 [3] D. Borkan, L. Dixon, J. Sorensen, N. Thain, L. Vasserman, Nuanced metrics for measuring
     unintended bias with real data for text classification, in: Companion Proceedings of The
     2019 World Wide Web Conference, 2019, pp. 491–500.
 [4] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri,
     Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content
     Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in:
     FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December
     2021, ACM, 2021.
 [5] S. Satapara, S. Modha, T. Mandl, H. Madhu, P. Majumder, Overview of the HASOC Subtrack
     at FIRE 2021: Conversational Hate Speech Detection in Code-mixed language, in: Working
     Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
 [6] R. Guermazi, M. Hammami, A. B. Hamadou, Using a semi-automatic keyword dictionary
     for improving violent web site filtering, in: 2007 Third International IEEE Conference on
     Signal-Image Technologies and Internet-Based System, 2007, pp. 337–344. doi:10.1109/SITIS.2007.137.
 [7] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate speech
     detection with comment embeddings, in: Proceedings of the 24th International Conference
     on World Wide Web, WWW ’15 Companion, Association for Computing Machinery, New
     York, NY, USA, 2015, pp. 29–30. URL: https://doi.org/10.1145/2740908.2742760. doi:10.1145/2740908.2742760.
 [8] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection
     in tweets, in: Proceedings of the 26th international conference on World Wide Web
     companion, 2017, pp. 759–760.
 [9] A.-M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali,
     M. Sirivianos, N. Kourtellis, Large scale crowdsourcing and characterization of twitter
     abusive behavior, 2018. arXiv:1802.00393.
[10] S. Carta, A. Corriga, R. Mulas, D. R. Recupero, R. Saia, A supervised multi-class multi-label
     word embeddings approach for toxic comment classification., in: KDIR, 2019, pp. 105–112.
[11] H. H. Saeed, K. Shahzad, F. Kamiran, Overlapping toxic sentiment classification using deep
     neural architectures, in: 2018 IEEE International Conference on Data Mining Workshops
     (ICDMW), IEEE, 2018, pp. 1361–1366.
[12] A. Vaidya, F. Mai, Y. Ning, Empirical analysis of multi-task learning for reducing identity
     bias in toxic comment detection, in: Proceedings of the International AAAI Conference
     on Web and Social Media, volume 14, 2020, pp. 683–693.
[13] T. Tran, Y. Hu, C. Hu, K. Yen, F. Tan, K. Lee, S. Park, Habertor: An efficient and effective
     deep hatespeech detector, 2020. arXiv:2010.08865.
[14] H. Hitkul, R. R. Shah, P. Kumaraguru, S. Satoh, Maybe look closer? detecting trolling
     prone images on instagram, in: 2019 IEEE Fifth International Conference on Multimedia
     Big Data (BigMM), IEEE, 2019, pp. 448–456.
[15] Hitkul, K. Aggarwal, P. Bamdev, D. Mahata, R. R. Shah, P. Kumaraguru, Trawling for
     trolling: A dataset, 2020. arXiv:2008.00525.
[16] S. Ghosh, S. Lepcha, S. Sakshi, R. R. Shah, Speech toxicity analysis: A new spoken language
     processing task, arXiv preprint arXiv:2110.07592 (2021).
[17] O. Kamal, A. Kumar, T. Vaidhya, Hostility detection in hindi leveraging pre-trained lan-
     guage models, 2021. arXiv:2101.05494.
[18] J. A. Leite, D. F. Silva, K. Bontcheva, C. Scarton, Toxic language detection in social
     media for brazilian portuguese: New dataset and multilingual analysis, arXiv preprint
     arXiv:2010.04543 (2020).
[19] A. Saroj, S. Pal, An indian language social media collection for hate and offensive speech, in:
     Proceedings of the Workshop on Resources and Techniques for User and Author Profiling
     in Abusive Language, 2020, pp. 2–8.
[20] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, P. Rosso, M. Sanguinetti,
     SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women
     in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation,
     Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 54–63.
     URL: https://aclanthology.org/S19-2007. doi:10.18653/v1/S19-2007.
[21] A. Ghosh Chowdhury, A. Didolkar, R. Sawhney, R. R. Shah, ARHNet - leveraging commu-
     nity interaction for detection of religious hate speech in Arabic, in: Proceedings of the
     57th Annual Meeting of the Association for Computational Linguistics: Student Research
     Workshop, Association for Computational Linguistics, Florence, Italy, 2019, pp. 273–280.
     URL: https://aclanthology.org/P19-2038. doi:10.18653/v1/P19-2038.
[22] P. Mathur, R. Shah, R. Sawhney, D. Mahata, Detecting offensive tweets in hindi-english
     code-switched language, in: Proceedings of the Sixth International Workshop on Natural
     Language Processing for Social Media, 2018, pp. 18–26.
[23] R. Kapoor, Y. Kumar, K. Rajput, R. R. Shah, P. Kumaraguru, R. Zimmermann, Mind your
     language: Abuse and offense detection for code-switched languages, in: Proceedings of
     the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 9951–9952.
[24] P. Mathur, R. Sawhney, M. Ayyar, R. Shah, Did you offend me? classification of offensive
     tweets in hinglish language, in: Proceedings of the 2nd Workshop on Abusive Language
     Online (ALW2), 2018, pp. 138–148.
[25] S. Chopra, R. Sawhney, P. Mathur, R. R. Shah, Hindi-english hate speech detection: Author
     profiling, debiasing, and practical perspectives, in: Proceedings of the AAAI Conference
     on Artificial Intelligence, volume 34, 2020, pp. 386–393.
[26] S. Kamble, A. Joshi, Hate speech detection from code-mixed hindi-english tweets using
     deep learning models, 2018. arXiv:1811.05145.
[27] T. Ranasinghe, S. Gupte, M. Zampieri, I. Nwogu, Wlv-rit at hasoc-dravidian-codemix-
     fire2020: Offensive language identification in code-switched youtube comments, 2020.
     arXiv:2011.00559.
[28] K. Yasaswini, K. Puranik, A. Hande, R. Priyadharshini, S. Thavareesan, B. R. Chakravarthi,
     IIITT@DravidianLangTech-EACL2021: Transfer learning for offensive language detection
     in Dravidian languages, in: Proceedings of the First Workshop on Speech and Language
     Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv,
     2021, pp. 187–194. URL: https://aclanthology.org/2021.dravidianlangtech-1.25.
[29] A. Baruah, K. A. Das, F. A. Barbhuiya, K. Dey, Iiitg-adbu@hasoc-dravidian-codemix-
     fire2020: Offensive content detection in code-mixed dravidian text, 2021. arXiv:2107.14336.
[30] J. Pavlopoulos, L. Laugier, J. Sorensen, I. Androutsopoulos, Semeval-2021 task 5: Toxic
     spans detection, in: Proceedings of the 15th International Workshop on Semantic Evalua-
     tion, 2021.
[31] S. Ghosh, S. Kumar, Cisco at semeval-2021 task 5: What’s toxic?: Leveraging transformers
     for multiple toxic span extraction from online comments, 2021. arXiv:2105.13959.
[32] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, A. Mukherjee, Hatexplain: A
     benchmark dataset for explainable hate speech detection, 2020. arXiv:2012.10289.
[33] P. Mishra, M. D. Tredici, H. Yannakoudakis, E. Shutova, Abusive language detection with
     graph convolutional networks, 2019. arXiv:1904.04073.
[34] M. Das, P. Saha, R. Dutt, P. Goyal, A. Mukherjee, B. Mathew, You too brutus! trapping
     hateful users in social media: Challenges, solutions and insights, 2021. arXiv:2108.00524.
[35] J. Qian, M. ElSherief, E. Belding, W. Y. Wang, Hierarchical cvae for fine-grained hate speech
     classification, 2018. arXiv:1809.00088.
[36] J. Pavlopoulos, J. Sorensen, L. Dixon, N. Thain, I. Androutsopoulos, Toxicity detection:
     Does context really matter?, 2020. arXiv:2006.00998.
[37] M. Saveski, B. Roy, D. Roy, The structure of toxic conversations on twitter, 2021.
     arXiv:2105.11596.
[38] J. Qian, A. Bethke, Y. Liu, E. Belding, W. Y. Wang, A benchmark dataset for learning to
     intervene in online hate speech, arXiv preprint arXiv:1909.04251 (2019).
[39] H. Almerekhi, H. Kwak, J. Salminen, B. J. Jansen, Are these comments triggering? pre-
     dicting triggers of toxicity in online discussions, WWW ’20, Association for Comput-
     ing Machinery, New York, NY, USA, 2020. URL: https://doi.org/10.1145/3366423.3380074.
     doi:10.1145/3366423.3380074.
[40] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Ku-
     mar, IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained
     Multilingual Language Models for Indian Languages, in: Findings of EMNLP, 2020.
[41] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, 2019. arXiv:1810.04805.
[42] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
     M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
     scale, 2020. arXiv:1911.02116.
[43] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, 2019. arXiv:1711.05101.
[44] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural lan-
     guage processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
     Language Processing: System Demonstrations, Association for Computational Linguistics,
     Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.