Multi-Task Learning with Sentiment, Emotion, and
Target Detection to Recognize Hate Speech and
Offensive Language
Flor Miriam Plaza-del-Arco1,3 , Sercan Halat2,3 , Sebastian Padó3 and Roman Klinger3
1
  Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC),
Universidad de Jaén, Campus Las Lagunillas, E-23071, Jaén, Spain
2
  Department of Turkish Language Teaching, Faculty of Education, Mugla Sitki Kocman University, Kötekli, TR-48000
Mugla, Turkey
3
  Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Pfaffenwaldring 5b, 70569 Stuttgart, Germany


                                      Abstract
                                      The recognition of hate speech and offensive language (HOF) is commonly formulated as a classification
                                      task which asks models to decide if a text contains HOF. This task is challenging because of the large
                                      variety of explicit and implicit ways to verbally attack a target person or group. In this paper, we
                                      investigate whether HOF detection can profit by taking into account the relationships between HOF and
                                      similar concepts: (a) HOF is related to sentiment analysis because hate speech is typically a negative
                                      statement and expresses a negative opinion; (b) it is related to emotion analysis, as expressed hate
                                      points to the author experiencing (or pretending to experience) anger while the addressees experience
                                      (or are intended to experience) fear. (c) Finally, one constitutive element of HOF is the (explicit or
                                      implicit) mention of a targeted person or group. On this basis, we hypothesize that HOF detection shows
                                      improvements when being modeled jointly with these concepts, in a multi-task learning setup. We base
                                      our experiments on existing data sets for each of these concepts (sentiment, emotion, target of HOF) and
                                      evaluate our models as a participant (as team IMS-SINAI) in the HASOC FIRE 2021 English Subtask 1A:
                                      “Identifying Hate, offensive and profane content from the post”. Based on model-selection
                                      experiments in which we consider multiple available resources and submissions to the shared task, we
                                      find that the combination of the CrowdFlower emotion corpus, the SemEval 2016 Sentiment Corpus,
                                      and the OffensEval 2019 target detection data leads to an F1 =.7947 in a multi-head multi-task learning
                                      model based on BERT, in comparison to .7895 of a plain BERT model. On the HASOC 2019 test data,
                                      the improvement is more substantial, with an increase of 2pp in F1 (from 0.78 to 0.80) and a considerable
                                      increase in recall. Across both data sets (2019, 2021), the recall is particularly increased for the class of
                                      HOF (6pp for the 2019 data and 3pp for the 2021 data), showing that MTL with emotion, sentiment, and
                                      target identification is an appropriate approach for early warning systems that might be deployed in
                                      social media platforms.

                                      Keywords
                                      multi-task learning, hate and offensive language detection, sentiment analysis, emotion analysis, target
                                      classification, social media mining




Forum for Information Retrieval Evaluation, December 13-17, 2021, India
fmplaza@ujaen.es (F. M. Plaza-del-Arco); sercanhalat@mu.edu.tr (S. Halat); pado@ims.uni-stuttgart.de (S. Padó);
klinger@ims.uni-stuttgart.de (R. Klinger)
                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
1. Introduction
The widespread adoption of social media platforms has made it possible for users to express
their opinions easily in a manner that is visible to a huge audience. These platforms provide
a large step forward for freedom of expression. At the same time, social media posts can also
contain harmful content like hate speech and offensive language (HOF), often eased by the
quasi-anonymity on social media platforms [1]. The European Commission’s recommendation
against racism and intolerance defines HOF as “the advocacy, promotion or incitement of the
denigration, hatred or vilification of a person or group of persons, as well any harassment, insult,
negative stereotyping, stigmatization or threat of such person or persons and any justification of
all these forms of expression – that is based on a non-exhaustive list of personal characteristics
or status that includes ‘race’, color, language, religion or belief, nationality or national or ethnic
origin, as well as descent, age, disability, sex, gender, gender identity, and sexual orientation”
[2].
   With the number of social media posts rising sharply, purely manual detection of HOF does
not scale. Therefore, there has been a growing interest in methods for automatic HOF detection.
A straightforward approach one might consider is to use basic word filters based on lexicons of words that are frequently used in hate speech [3]. This approach,
however, has its limitations, given that HOF depends on discourse, the media, daily politics,
and the identity of the target [4]. It also disregards the different use of potentially offending
expressions across communities [5].
   These factors motivate an interest in more advanced approaches as they are developed in the
field of natural language processing (NLP). Most recent, well-performing systems make use of
machine learning methods to associate textual expressions in a contextualized manner with
the concept of HOF. Mostly, existing models build on top of end-to-end learning, in which the
model needs to figure out this association purely from the training data (and a general language
representation which originates from self-supervised pretraining of a language model).
   In this paper, we build on the intuition that HOF is related to other concepts that might help
to direct this learning process. Analyzing the definition above, HOF is potentially related to
sentiment, emotion, and the target of hate speech. First, sentiment analysis is often defined as
the task of classifying an opinion expression into being positive or negative, given a particular
target [6, 7]. HOF is related since it typically contains a negative expression or, at least, a negative intent.
Second, emotion analysis is concerned with the categorization of text into a predefined reference
system, for instance basic emotions as they have been proposed by Paul Ekman [8] (fear, joy,
sadness, surprise, disgust, anger). HOF contains expressions of anger and might cause fear or
other emotions in a target group. Finally, the target is, by definition, a crucial element of hate
speech, whether mentioned explicitly or not.
   The concrete research question we test is whether a HOF detection system can be improved
by exploiting existing resources that are annotated for emotion, sentiment and HOF target, and
carrying out joint training of a model for HOF and these aspects. In building such a model,
the developer has to decide (a) which of these aspects to include, (b) which corpora to use for
training for each aspect, and (c) how to combine these aspects. We assume a simple multi-task
learning architecture for (c) and perform model selection on the HASOC FIRE 2019 development
data to address (b). Finally, we address question (a) through our submissions to HASOC FIRE
2021 Shared Task [9] subtask 1A1 [10] which asks systems to carry out a binary distinction
between “non hate/offensive” and “hate/offensive” English tweets. We find that a combination of
all concepts leads to an improvement by about 2pp in F1 on the HASOC 2019 test data, with a
notable increase in recall by 6pp, and an increase by 0.5pp in F1 in the HASOC 2021 test data
(with an increase of 3pp in recall).


2. Related Work
As argued above, detecting hate and offensive language on Twitter is a task closely linked to
sentiment analysis, emotion analysis, and target classification. In this section, we introduce these tasks
alongside previous work and also mention some HOF detection shared tasks that took place in
recent years in the NLP community.

2.1. Emotion Analysis
Emotion analysis from text (EA) consists of mapping textual units to a predefined set of emotions,
for instance basic emotions, as they have been proposed by Ekman [8] (anger, fear, sadness,
joy, disgust, surprise), the dimensional model of Plutchik [11] (adding trust and anticipation),
or the discrete model proposed by Shaver et al. [12] (anger, fear, joy, love, sadness, surprise).
Great efforts have been conducted in the last years by the NLP community in a variety of
emotion research tasks including emotion intensity prediction [13, 14, 15], emotion stimulus or
cause detection [16, 17, 18, 19], or emotion classification [20, 21]. Studying patterns of human
emotions is essential in various applications such as the detection of mental disorders, social
media mining, dialog systems, business intelligence, or e-learning. An important application is
the detection of HOF, since it is inextricably linked to the emotional and psychological state
of the speaker [22]. Negative emotions such as anger, disgust and fear can be conveyed in the
form of HOF. For example, in the text “I am sick and tired of this stupid situation” the author
feels angry and at the same time is using offensive language to express that emotion. Therefore,
the detection of negative emotions can be a clue to detect this type of behavior on the web.
   An important aspect of EA is the creation of annotated corpora to train machine learning
models. The availability of emotion corpora is highly fragmented, not only because of the
different emotion theories, but also because emotion classification appears to be genre- and
domain-specific [21]. We will limit the discussion of corpora in the following to those we use
in this paper. The Twitter Emotion Corpus (TEC) was annotated with labels corresponding to
Ekman’s model of basic emotions (anger, disgust, fear, joy, sadness, and surprise) and consists
of 21,051 tweets. It was automatically labeled with the use of hashtags that the authors self-
assigned to their posts. The grounded emotions corpus created by Liu et al. [23] is motivated
by the assumption that emotions are grounded in contextual experiences. It consists of 2,557
instances, labeled by domain experts for the emotions of happiness and sadness. EmoEvent,
in contrast, was labeled via crowdsourcing on Amazon Mechanical Turk. It contains a
total of 8,409 tweets in Spanish and 7,303 in English, based on events related to different topics
such as entertainment, events, politics, global commemoration, and global strikes. The labels

   1
       https://hasocfire.github.io/hasoc/2021/dataset.html
that we use from this corpus correspond to Ekman’s basic emotions, complemented by ‘other’.
DailyDialog, developed by Li et al. [24], is a corpus of 13,118 sentences reflecting everyday
communication style. The dialogues in the dataset cover a total of ten topics related to daily
life. It was annotated with Ekman's emotions by domain experts. The ISEAR dataset was
collected in the 1990s by Klaus R. Scherer and Harald Wallbott by asking people to report on
their experience of emotion-eliciting events [25]. The dataset
contains a total of 7,665 sentences from 3,000 participant reports labeled with single emotions.
The last dataset that we use, CrowdFlower2 , consists of 39,740 tweets labeled for 13 emotions.
It is quite large, but more noisy than some other corpora, given the annotation procedure via
crowdsourcing.

2.2. Sentiment Analysis
Sentiment analysis (SA) has emerged as one of the most well-known areas in NLP due to its
significant implications in social media mining. Construed broadly, the task includes sentiment
polarity classification, identifying the sentiment target or topic, opinion holder identification,
and identifying the sentiment of one specific aspect (e.g., a product, topic, or organization) in
its context sentence [7, 26]. Sentiment analysis in a stricter sense, i.e., polarity classification,
is often modeled as a two-class (positive, negative) or three-class (positive, negative, neutral)
categorization task. For instance, the opinionated expression “The movie was terrible, I wasted
my time watching it” is clearly negative. A negative sentiment can be an indicator of the
presence of offensive language, as previous studies have shown [27, 6]. Sentiment analysis
and the identification of HOF share common discursive properties. Considering the example
shown in Section 2.1, “I am sick and tired of this stupid situation”, in addition to expressing
anger, conveys a negative sentiment along with the presence of expletive language targeted at
a situation. Therefore, both sentiment and emotion features can serve as useful information
in NLP systems to benefit the task of HOF detection in social media. Note that sentiment
analysis is not a “simplified” version of emotion analysis – sentiment analysis is about the
expression of an opinion, while emotion analysis is about inferring an emotional private state
of a user. These tasks are related, but at least to some degree complementary [28].
In contrast to EA, since SA is one of the most studied tasks due to its broad range of applications,
a larger number of corpora annotated with sentiment is available, particularly from Twitter.
For instance, one of the most well-known datasets is the Stanford Sentiment Treebank [29]. It
contains movie reviews in English from Rotten Tomatoes. Another popular dataset, released
for Task 4 in SemEval 2016, is labeled with positive, negative, or neutral sentiments and includes
a mixture of entities (e.g., Gaddafi, Steve Jobs), products (e.g., kindle, android phone), and events
(e.g., Japan earthquake, NHL playoffs) [30]. The same year, another dataset was released in
SemEval for Task 6 [31], the Twitter stance and sentiment corpus which is composed of 4,870
English tweets labeled with positive and negative sentiments. For a more detailed overview, we
refer the reader to recent surveys on the topic [32, 33, 34].




    2
        https://data.world/crowdflower/sentiment-analysis-in-text
2.3. Hate Speech and Offensive Language Detection
Hate speech and offensive language (HOF) is a phenomenon that has been observed with increasing
frequency in social media in recent years. As HOF became more widespread, attention from
the NLP community also increased substantially [35]. Most research is targeting Twitter as a
platform, due to its popularity across various user groups and its relatively liberal API terms
and conditions for researchers. The methodological approaches vary between lexicon-based
approaches (among others, [36]), which are preferable due to their transparency in decision-
making, and machine learning methods, which typically show higher performance [37, 38,
i.a.]. Definitions of hate speech for the operationalization in automatic detection systems
vary and do not always strictly follow the ECRI definition that we introduced in Section 1.
Sometimes, different categories such as hate speech, profanity, offensive language, and abuse
are collapsed into one class because they are related [4] and might trigger similar responses
(e.g., by authorities).
   An important role in the HOF field has been played by a series of shared tasks. One of the first
of such events was GermEval [39], which was organized for the first time in 2014 and focuses on
German. After initially focusing on information extraction tasks, the identification of offensive
language was introduced in 2018 [40]. Two subtasks were offered: one on fine-grained
classification into a non-offensive ‘other’ class and the classes ‘profanity’, ‘insult’, and ‘abuse’;
and one in which the three latter classes are collapsed to obtain a coarse-grained binary classification setup. This setup
was retained in the 2019 edition of the shared task [41].
   Another well-known shared task event is OffensEval, which was held as part of the Interna-
tional Workshop on Semantic Evaluation (SemEval) in 2019 and 2020 [42, 43]. As part of the first
OffensEval event, the OLID dataset was published, which contains a total of 14,100 tweets in
English. It was annotated using a three-level hierarchical annotation scheme by crowdsourcing.
   A third shared task series that took place in 2021 for the third time is HASOC (Hate Speech
and Offensive Content Identification in English and Indo-Aryan Languages) [44, 45]. In the
first edition of the HASOC, in 2019 [44], Hindi, German and English datasets were created
for the definition of HOF based on Twitter and Facebook posts. HASOC 2020 introduced two
tasks, one on coarse-grained HOF vs. non-HOF language and one which distinguishes hate,
offensive language, and profane language for all these languages. HASOC 2021 was extended by
a subtask on code-mixed language. This paper describes our system and participation in the
coarse-grained identification of HOF in English (Subtask 1A) in the 2021 edition of HASOC3 .
   Another direction of research relevant for this paper is formed by studies that aim at obtaining
a better understanding of the challenges of HOF detection. Davidson et al. [46] focused on the
separation between the classes of hate speech and offensive language. They collected 33,458
tweets based on a crowdsourced lexicon and found, based on bag-of-words maximum entropy
classifiers, that racist and homophobic tweets are more likely to be classified as hate speech,
but that sexist tweets are generally classified as offensive. With a similar goal in mind, to
understand which cases are particularly challenging for HOF detection models, Röttger et al.
[47] introduced HateCheck, a suite of functional tests, to enable more detailed insights into
where models might fail. They particularly analyzed distinct expressions of hate, like derogatory
hate speech, threatening language, slurs, and profanity. Finally, Waseem and Hovy [48] perform
   3
       https://hasocfire.github.io/hasoc/2021/
a corpus study to understand which properties hate speech exhibits in contrast to non-hateful
language. This work is noteworthy because it focuses on properties that are grounded in
theories from social sciences rather than being primarily data-driven.
   Other research focused on the development of well-performing models for HOF detection
with adaptations to recent approaches to text classification via transfer learning. Mathur
et al. [49] investigated the usage of mixed language. They presented the Multi-Input Multi-
Channel Transfer Learning-based model (MIMCT) to detect HS, offensiveness, and abusive
language tweets from the proposed Hinglish Offensive Tweet (HOT) dataset using transfer
learning coupled with multiple feature inputs. They stated that their proposed MIMCT model
outperforms basic supervised classification models. Wiedemann et al. [50], participants in the
GermEval competition, used a different strategy for automatic offensive language classification
on German Twitter data. For this task, they used a set of BiLSTM and CNN neural networks
and included background knowledge in the form of topics in the models.
   We refer the interested reader to recent surveys on the topic of hate speech and offensive
language detection for a more comprehensive overview [35, 51].

2.4. Target Classification
According to the definition of hate speech, it must be targeted at a particular individual or
group, whether that target is mentioned explicitly or not. Typical examples include black people,
women, LGBT individuals or people of a particular religion [52]. The majority of current studies
do not aim at detecting targets that are mentioned in the text (in the sense of information
extraction), but aim at analyzing the properties of HOF towards a particular group by sampling
only posts aimed at that group from social media. For example, Kwok and Wang [53] focused
on the analysis of hate speech towards people of color, while Grimminger and Klinger [54]
analyzed hate and offensive language by and towards supporters of particular political parties.
   Some studies aim at answering the research question of how various target groups are referred
to. As an example, Lemmens et al. [55] analyzed the language of hateful Dutch comments
regarding classes of metaphoric terms, including body parts, products, animals, or mental
conditions. Such a closed-world approach, however, does not permit the identification of targets
that were not known at the development time of the HOF detection system. This is to some
degree addressed by Silva et al. [56], who developed a rule-based method to identify target
mentions that are then, similarly to Lemmens et al. [55], compared regarding the expressions
that are used.
   ElSherief et al. [57] focused on the distinction between directed hate, towards a particular
individual as a representative of a group, and generalized hate which mentions the group itself.
Their study is not focused on target classification, but on the analysis of which groups and
individuals are particularly in focus of hate speech, including religious groups, genders, and
ethnicities. To be able to do that, however, they needed to automatically detect words in context
of HOF. They did that with the use of a mixed-effect topic model [58].
   This label set is also used in the shared task OffensEval 2019 [43], which is the only competition
we are aware of which included target classification as a subtask. The OLID dataset of OffensEval
2019 has, next to HOF annotations, labels which indicate whether the target is an individual, a
group, some other target, or if it is omitted. We use this annotation in our study.
3. Model
Perhaps the most pertinent question arising from our intuition above – namely that HOF detection
is related to the tasks of emotion, sentiment and target classification – is how this intuition
can be operationalized as a computational architecture. Generally speaking, this is a transfer
learning problem, that is, a problem which involves generalization of models across tasks and/or
domains. There are a number of strategies to address transfer learning problems; see Ruder
[59] for a taxonomy. Structurally, our setup falls into the inductive transfer learning category,
where we consider different tasks and have labeled data for each. Procedurally, we propose to
learn the different tasks simultaneously, which amounts to multi-task learning (MTL). In the
MTL scenario, multiple tasks are learned in parallel while using a shared representation [60]. In
comparison to learning multiple tasks individually, this joint learning effectively increases the
sample size while training a model, which leads to improved performance by increasing the
generalization of the model [61].
   The concrete MTL architecture that we use is shown in Figure 1. We build on a standard
contextualized embedding setup where the input is represented by a transformer-based encoder,
BERT, pre-trained on a very large English corpus [62]. We add four sequence classification
heads to the encoder, one for each task, and fine-tune the model on the four tasks in question
(binary/multiclass classification tasks). In the sentiment classification task, a tweet is categorized
as positive or negative; emotion classification assigns a tweet to one of several emotion
categories (anger, disgust, fear, joy, sadness, surprise, enthusiasm, fun, hate, neutral, love,
boredom, relief, none), where the subset of categories considered depends on the emotion
corpus used to represent the concept of an emotion. Target classification categorizes the
target of the offense as an individual, a group, others, or not mentioned; and HOF detection
classifies a tweet as HOF or non-HOF. During training, the
objective function weights each task equally. At prediction time, for each tweet in the HASOC
dataset, four predictions are assigned, one for each task.
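The architecture can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: a small embedding-plus-mean-pooling module stands in for the pretrained BERT encoder so the sketch runs without downloading weights, and the head sizes follow the label sets described above (binary sentiment and HOF, 14 emotion labels, four target classes).

```python
# Minimal sketch of the multi-head MTL architecture (illustrative only):
# a shared encoder feeding one classification head per task.
import torch
import torch.nn as nn

class MultiTaskHOFModel(nn.Module):
    def __init__(self, vocab_size, hidden, task_num_labels):
        super().__init__()
        # Stand-in for the shared BERT encoder (embedding + mean pooling).
        self.encoder = nn.Embedding(vocab_size, hidden)
        # One sequence classification head per task, as in Figure 1.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_num_labels.items()}
        )

    def forward(self, input_ids, task):
        pooled = self.encoder(input_ids).mean(dim=1)  # [batch, hidden]
        return self.heads[task](pooled)               # [batch, num_labels]

model = MultiTaskHOFModel(
    vocab_size=30522,
    hidden=64,
    task_num_labels={"sentiment": 2, "emotion": 14, "target": 4, "hof": 2},
)
logits = model(torch.randint(0, 30522, (8, 16)), task="emotion")
```

Because the encoder parameters are shared, gradient updates from any one head adjust the representation used by all four tasks, which is the mechanism by which information flows between tasks.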


4. Experimental Setup
4.1. Experimental Procedure
Our main research question is whether HOF detection can be improved by joint training with
sentiment, emotion and target. Even the adoption of the architecture described in Section 3
leaves open a number of design choices, which makes a model selection procedure necessary.
   For the purpose of model selection, we decided to use the dataset provided by the 2019 edition
of the HASOC shared task, under the assumption that the datasets are fundamentally similar
(we also experimented with the HASOC 2020 dataset, but the results indicated that this dataset
is sampled from a different distribution than the 2021 dataset). During the evaluation phase, we
then used the best model configurations we identified on HASOC 2019 to train a model on the
HASOC 2021 training data and produce predictions for the HASOC 2021 test set.
   The two main remaining model selection decisions are (a) which corpora to use to train the
components and (b) which components to include. In the following, we first provide details on
the corpora we considered, addressing (a). We also describe the details of data preprocessing,
[Figure 1 schematic: input tokens ([CLS], token1 … tokenn , [SEP]) pass through an Input Embedding Layer into a shared BERT encoder, which feeds four classification heads: sentiment classification, emotion classification, target classification, and HOF detection.]
Figure 1: Proposed multi-task learning system to evaluate the impact of including emotion, sentiment,
and target classification. The input representation is BERT-based tokenization and each task corresponds
to one classification head. Information can flow from one task to another through the shared encoder
that is updated during training via backpropagation.


training regimen and hyperparameter handling. The results are reported in Section 5 to address
point (b).

4.2. Corpora
We carry out MTL experiments to predict HOF jointly with the concepts of emotion, sentiment
and HOF target. The datasets are listed in Table 1. To represent sentiment in our MTL exper-
iments, we use the SemEval 2016 Task 6 dataset [31] composed of 4,870 tweets in total. We
include the task of target classification with the OLID dataset [63], which consists of 14,100
English Tweets. The concept of HOF is modelled based on the HASOC 2021 dataset, which
provides three sub-tasks. We participate as the team IMS-SINAI in sub-task 1A, which contains
5,124 English tweets, split into 3,074 tweets in the training set, 769 in the development set and
1,281 in the test set.4
   For emotion detection, we consider a set of six corpora in the model selection experiment.
These are the Crowdflower data5 , the TEC corpus [64], the Grounded Emotions corpus [23],
EmoEvent [65], DailyDialogues [24], and ISEAR. Among the available emotion corpora, we
chose those because they cover a range of general topics and/or the genre of tweets.

4.3. Data Preprocessing
Tweets present numerous challenges for tokenization, such as user mentions, hashtags, emojis,
and misspellings. To address these challenges, we make use of the ekphrasis

    4
        https://hasocfire.github.io/hasoc/2021/call_for_participation.html
    5
        https://www.crowdflower.com/data/sentiment-analysis-emotion-text/
Table 1
Selection of resources for EA, SA and offensive target. The data sets that we use in our final experiments
are marked with a star∗ .
 Category         Dataset              Annotation                 Size      Source
                                 ∗
                  CrowdFlower          Ekman’s emo.               39,740    CrowdFlower (2016)
                  TEC                  Ekman’s emo                21,051    Mohammad (2012)
                  GroundedEmo.         sadness/joy                2,585     Liu et al. (2017)
 Emotion
                  EmoEvent             Ekman’s emo/other          7,303     Plaza-del-Arco et al. (2020)
                  DailyDialogues       Ekman’s emo                13,118    Li et al. (2017)
                  ISEAR                Ekman’s emo/shame/guilt    7,665     Scherer (1994)
 Sentiment        SemEval 2016∗        neg./pos./neutr.           63,192    Mohammad, Saif M. (2017)
 HOF              HASOC 2021∗          Non/HOF                    5,124     HASOC (2021)
 Target           OLID∗                None/ind./group/other      14,200    OffensEval (2019)


Python library6 [66]. In particular, we normalize all URLs, emails, user mentions,
percentages, monetary amounts, time and date expressions, and phone numbers. For example,
“@user” is replaced by the token “<user>”. We further normalize hashtags and split them
into their constituent words. As an example, “#CovidVaccine” is replaced by “Covid Vaccine”.
Further, we replace emojis by their aliases. For instance, the emoji 😂 is replaced by the token
“:face_with_tears_of_joy:” using the emoji Python library7 . Finally, we replace multiple consecutive
spaces by single spaces and replace line breaks by a space.

4.4. Training Regimen and Hyper-parameters
In the MTL stage, during each epoch, a mini-batch bt is selected among all 4 tasks, and the
model is updated according to the task-specific objective for the task t. This approximately
optimizes the sum of all multi-task objectives. As we are dealing with sequence classification
tasks, a standard cross-entropy loss function is used as the objective.
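This sampling scheme can be illustrated with a small stdlib sketch (the helper name and batch counts are ours, purely for illustration): an epoch is a shuffled sequence of task-tagged mini-batches, and each step updates the shared model with that one task's cross-entropy loss.

```python
import random

def make_epoch_schedule(task_num_batches, seed=0):
    # One entry per mini-batch; shuffling interleaves the tasks so each
    # update step optimizes a single task-specific objective, which over
    # the epoch approximates the sum of all multi-task objectives.
    schedule = [task for task, n in task_num_batches.items()
                for _ in range(n)]
    random.Random(seed).shuffle(schedule)
    return schedule

# Hypothetical per-task batch counts, for illustration only.
schedule = make_epoch_schedule(
    {"sentiment": 10, "emotion": 40, "target": 14, "hof": 5})
```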

Hyper-parameters. For hyper-parameter optimization, we split the HASOC 2021 training data
into train (80 %) and validation (20 %) portions. Afterwards, in the evaluation phase, we use the complete
training set of HASOC 2021 in order to take advantage of having more labeled data to train
our models. For the baseline BERT, we fine-tuned the model for four epochs, the learning rate
was set to 4 · 10−4 and the batch size to 32. For HASOC_sentiment and HASOC_emotion,
we fine-tuned the model for three epochs, the learning rate was set to 3 · 10−5 and 4 · 10−5
respectively, and the batch size to 32. For HASOC_target, the epochs were set to four, the
learning rate to 4 · 10−5 and the batch size to 16. For HASOC_all, we fine-tuned the model for
two epochs, the learning rate was set to 3 · 10−4 and the batch size to 16. All the configurations
used AdamW as optimizer.
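For reference, the configurations reported above can be collected in one place. The values are taken directly from the preceding paragraph; the dictionary and its keys are our own naming.

```python
# Hyper-parameter settings as reported above; AdamW is the optimizer throughout.
CONFIGS = {
    "BERT_baseline":   {"epochs": 4, "lr": 4e-4, "batch_size": 32},
    "HASOC_sentiment": {"epochs": 3, "lr": 3e-5, "batch_size": 32},
    "HASOC_emotion":   {"epochs": 3, "lr": 4e-5, "batch_size": 32},
    "HASOC_target":    {"epochs": 4, "lr": 4e-5, "batch_size": 16},
    "HASOC_all":       {"epochs": 2, "lr": 3e-4, "batch_size": 16},
}
```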
   We run all experiments with the PyTorch high-performance deep learning library [67] on a
compute node equipped with a single Tesla-V100 GPU with 32 GB of memory.

    6
        https://github.com/cbaziotis/ekphrasis
    7
        https://pypi.org/project/emoji/
Table 2
MTL results for HOF detection on the HASOC 2019 test set, varying the emotion dataset
                                                        Macro Average
                         Emotion Dataset           P           R         F1
                         TEC                     0.7583    0.7900      0.7707
                         Grounded-Emotions       0.7744    0.7738      0.7741
                         EmoEvent                0.7739    0.7807      0.7772
                         DailyDialog             0.7715    0.7865      0.7783
                         ISEAR                   0.7686    0.7917      0.7785
                         CrowdFlower             0.7981    0.7778      0.7870


Table 3
MTL results for HOF detection on HASOC 2019 test set.
                                               Macro Average                      Class HOF
   Model                                 P          R           F1            P      R         F1
   Baseline    BERT                    0.775       0.779       0.777      0.66      0.674     0.667
   MTL         HASOC_sentiment         0.773       0.789       0.780     0.646      0.708     0.676
               HASOC_emotion           0.798       0.778       0.787     0.712      0.642     0.675
               HASOC_target            0.778       0.802       0.788     0.648      0.736     0.689
               HASOC_all               0.791       0.807       0.799     0.674      0.733     0.702


5. Results
In this section, we present the results obtained by the systems we developed as part of our
participation in HASOC 2021 English subtask 1A. We use the official competition metrics,
macro-averaged precision, recall, and F1 -score, as evaluation measures, and further report
HOF-specific results, as we believe that, for real-world applications, detecting HOF is more
important than detecting non-HOF. The experiments are performed in two phases, the model
selection phase and the evaluation phase, which are explained in the following two sections.
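As a reminder of how the reported scores relate to the per-class values: macro averaging computes precision, recall, and F1 for each class and then averages them with equal weight, regardless of class frequency. A minimal sketch (the function and label names are ours):

```python
def macro_prf(gold, pred, labels=("HOF", "NOT")):
    """Per-class precision, recall, and F1, macro-averaged with equal class weight."""
    per_class = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    # Average each of the three measures over the label set
    return tuple(sum(s[i] for s in per_class) / len(labels) for i in range(3))
```

Because both classes count equally, a model cannot obtain a high macro score by doing well only on the majority class non-HOF.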

5.1. Model Selection (HASOC 2019)
As described above, we perform model selection by training our systems on the training set
of HASOC 2019 and evaluating them on the corresponding test set. Our hypothesis is that an
MTL system trained on tasks related to HOF detection improves the generalization of the
model; as a baseline for comparison, we therefore use the pre-trained language model BERT
fine-tuned on the HASOC 2019 corpus.
   In order to decide which emotion corpus to use for the emotion classification task in
the MTL setting, we test a number of emotion datasets, obtaining the results shown in
Table 2. All of these results are for the main task of hate and offensive language detection;
only the emotion dataset used for MTL varies. As can be seen, the best performance is obtained
with the CrowdFlower dataset, with a substantial margin in terms of Macro-P. This is despite our
Table 4
BERT vs. MTL prediction samples from the HASOC 2019 test set, showing improved MTL performance.
neg.: negative sentiment, pos.: positive sentiment, noemo: no emotion, ind.: individual target, None: no
target detected
                                                                                        MTL
 ID    Tweet                                                Gold BERT HOF Sent.          Emot.    Targ.
 107   But Arlene and the extreme unionists do not          NOT    HOF    NOT    neg.    noemo    None
       want that, and they are the Jenga brick stopping
       the Tory roof collapsing
 952   I’ts his choice, you can’t force him to get served   NOT    HOF    NOT    neg.    noemo    None
       by Muslims
 506   Sad watching the UK making a total arse of itself    NOT    HOF    NOT    neg.    sadness  None
 4517 When you got average marks in exam... And ur          NOT    HOF    NOT    pos.    noemo    None
      Dad is like... dad.. She is so Beautiful. :- !
 254  I don’t think I have ever disliked anyone more        HOF    NOT    HOF    neg.    sadness  ind.
      than I dislike you.
 684  Yet you project the shortcomings of the muslim        HOF    NOT    HOF    neg.    anger    None
      ruling class on to others, DEFLECTING, DIVERT-
      ING AND LYING TO THE MASSES!!!
 821  Really, sounds like youre inviting open hostilities   HOF    NOT    HOF    neg.    fear     ind.
      again. Are you sure your up to this job? Don’t
      want to be rude but you’re just not very bright
      and have a persistent habit of telling lies too.


impression that this dataset is comparatively noisy [21]. We believe that what makes the dataset
suitable for HOF detection is that it contains a large number of tweets labeled with a wide range
of emotion tags, including hate. Therefore, we decided to use this emotion dataset in the MTL
setting for our final submission to HASOC 2021.
   Table 3 shows the results of the MTL models including the different auxiliary tasks on the
HASOC 2019 test data. The setting HASOC_all refers to the MTL model trained on the combi-
nation of all tasks (HOF detection, emotion classification, polarity classification and offensive
target classification). As can be seen, all MTL models surpass the baseline BERT in terms of
Macro-F1 , by up to 2 percentage points. In particular, the MTL model that obtains the best
performance is HASOC_all, followed by HASOC_target, HASOC_emotion, and HASOC_sentiment.
HASOC_all improves by 2 points Macro-F1 over the baseline, with Macro-Precision increasing
by roughly 1.5 points and Macro-Recall by roughly 3 points.
   Table 3 further shows the results of the MTL models on the HOF class of the HASOC 2019 test
set. In all MTL systems except HASOC_emotion, recall improves over the BERT baseline. The
largest improvement in this measure is observed for the HASOC_target model, with an increase
of 6.2 points. Precision increases by 5.2 points in the HASOC_emotion model. The best run
(HASOC_all) outperforms the BERT baseline by a substantial margin (F1 of 0.702 vs. 0.667).
Table 5
MTL results for HOF detection on HASOC 2021 dev set
                                                            Macro Average
                      Model                             P        R       F1
                      Baseline   BERT                 0.801     0.796   0.798
                      MTL        HASOC_sentiment      0.815     0.784   0.795
                                 HASOC_emotion        0.819     0.799   0.807
                                 HASOC_target         0.819     0.802   0.809
                                 HASOC_all            0.824     0.799   0.809


Model Analysis. As we aimed to improve HOF detection by integrating emotion, sentiment,
and target datasets into the MTL model, we compare the pre-trained language model BERT
fine-tuned on the HASOC 2019 corpus with the MTL HASOC_all model. The comparison of the
two systems can be seen in Table 4. Specifically, we show seven examples: four false positives
and three false negatives produced by the baseline BERT model. Regarding the false positives,
the first two tweets (IDs 107 and 952) are predicted as HOF by the BERT model, but the MTL
model correctly classifies them as non-HOF, presumably because, although the predicted
sentiment is negative, the model recognizes neither a negative emotion nor a target that
would indicate HOF. The tweet with ID 506 is also correctly predicted as non-HOF by the MTL
model; in this case, although the emotion sadness is negative, we believe it is not strongly
linked to HOF, and moreover the model does not recognize a specific target. The last false
positive (tweet ID 4517) expresses a positive sentiment, which the model recognizes; we
therefore suppose that the MTL model benefits from this affective knowledge to classify the
tweet as non-HOF. Regarding the false negatives, the tweet with ID 254 is classified by the
MTL system as expressing negative sentiment and a negative emotion (sadness) and as being
directed at a person; since these aspects are closely linked to the presence of HOF, we assume
that the MTL model takes advantage of them to correctly classify the tweet. The next example,
the tweet with ID 684, expresses a negative opinion and the emotion anger, both correctly
predicted by the MTL model; anger is one of the emotions most closely related to HOF and,
together with the negative sentiment, may give the system a clue to correctly classify the
tweet as HOF, even though the target is not identified. Finally, instance 821 expresses a
negative sentiment towards a person, correctly identified by the MTL model. The model predicts
fear for this instance, which we would consider a wrong classification; nevertheless, even
this classification (fear instead of anger) helps the MTL model arrive at the correct
prediction, which was not possible for the plain BERT model.
   These examples indicate that our MTL system predicts the class HOF more accurately than
BERT, particularly in cases that were missed by the plain model (which is also reflected in
the increased recall on the HASOC 2019 data).
Table 6
MTL results for HOF detection on HASOC 2021 test set (IMS-SINAI Team submissions). The official
metric is the macro average score.
                                               Macro Average                   Class HOF
   Model                                 P          R           F1      P         R         F1
   Baseline    BERT                    0.802       0.783       0.790   0.820     0.886     0.852
   MTL         HASOC_sentiment         0.805       0.784       0.792   0.820     0.891     0.854
               HASOC_emotion           0.790       0.762       0.771   0.800     0.892     0.844
               HASOC_target            0.800       0.776       0.785   0.813     0.892     0.851
               HASOC_all               0.819       0.784       0.795   0.812     0.917     0.862


5.2. Model Evaluation (HASOC 2021)
For evaluation, we use the dataset provided by the organizers of the HASOC 2021 English
subtask 1A. First, we want to verify that the MTL models surpass the baseline BERT in the
evaluation setting as well. We train all models on the HASOC 2021 training set and test them
on the HASOC 2021 dev set. The results are shown in Table 5. As can be seen, all MTL systems
except HASOC_sentiment outperform the baseline, which validates our decision to select these
models for the final evaluation of HASOC 2021. HASOC_sentiment does improve over the baseline
in Macro-Precision, but shows a drop in Macro-Recall. One reason might be that the sentiment
data we use is, in some relevant characteristic, more similar to the 2019 data than to the
data of the 2021 edition of the shared task.
   Table 6 finally shows the five models that we submitted to the HASOC 2021 Shared Task as
team IMS-SINAI, both with the official macro-average evaluation and the class-specific values
(which were reported by the submission system during the submission period). We observe that
BERT achieves a Macro-F1 score of 0.790. In contrast to the HASOC 2019 results, the multi-task
learning models mostly improve in terms of precision, and less consistently in terms of recall.
Considering target classification or emotion classification in the multi-task learning models
does not show any improvement, but sentiment classification does. These results for the
separate concepts contradict the results on the 2019 data, which indicates that the evaluation
or annotation procedures or the data have changed in some relevant property: on the 2019 data,
sentiment+HOF is not better than HOF alone, but emotion+HOF and target+HOF are; on the 2021
data, it is vice versa. However, when combining all concepts of sentiment, emotion, target,
and HOF in one model (HASOC_all), we see an improvement that goes beyond the contribution of
the sentiment model alone. We therefore conclude that all of these concepts are indeed helpful
for the identification of hate speech and offensive language.
   In addition, we report the results for the class HOF in the same table, without averaging
them with the class non-HOF. We find this result particularly important, as the practical task
of detecting hate speech is more relevant than detecting non-hate speech. In contrast to the
macro-averaged results, the precision values are lower than the recall values. Recall is
particularly increased for the best model configuration (HASOC_all), with 0.917 compared to
0.886 for the plain BERT approach. It is noteworthy that all multi-task models increase recall
at the cost of precision for the class HOF. This is important both for practical applications
that detect hate speech in the real world and from a dataset perspective, as most resources
contain substantially fewer HOF instances than non-HOF instances.


6. Conclusion
Most of the research on the detection of hate speech and offensive language (HOF) has focused
on training automatic systems specifically for this task, without considering other phenomena
that are arguably correlated with HOF and could therefore be beneficial for recognizing it.
   Our study builds on the assumption that HOF discourse involves other affective components
(notably emotion and sentiment) and is, by definition, targeted at a person or group. Therefore,
in this paper, as part of our participation as the IMS-SINAI team in the HASOC FIRE 2021
English Subtask 1A, we explored whether training a model concurrently on all of these tasks
(sentiment, emotion, and target classification) via multi-task learning is useful for HOF
detection. We used corpora labeled for each of the tasks, studied how to combine these aspects
in our model, and explored which combination of these concepts is the most successful. Our
experiments show the utility of our enrichment method. In particular, we find that the model
that achieves the best performance in the final evaluation considers the concepts of emotion,
sentiment, and target together. This improvement is even clearer on the HASOC 2019 data. In an
analysis of the results, we found that the model is particularly good at correcting false
positive errors made by BERT. A plausible mechanism here is that positive sentiments and
positive emotions run counter to the general spirit of hate speech and offensive language, so
that the presence of these indicators permits the model to predict the absence of HOF more
accurately.
   This is in line with previous results on multi-task learning across related tasks in the
field of affective language. For example, Akhtar et al. [68] have shown that sentiment and
emotion prediction benefit from each other. Similarly, Chauhan et al. [69] showed an
improvement in sarcasm detection when emotion and sentiment are additionally considered. The
latter study is particularly relevant to our work, because the sharp and sometimes offensive
character of sarcasm is shared with hate speech and offensive language. Further, Rajamanickam
et al. [70] have shown that abusive language and emotion prediction benefit from each other in
a multi-task learning setup. This is also in line with our result, given that HOF is an
umbrella concept that subsumes abusive language.
   A clear downside of our model is its high resource requirement: it needs annotated corpora
for all the phenomena involved, and as our model selection experiments showed, the quality
of these resources is very important. While resources that meet these needs are available for
English, for the vast majority of languages no comparable resources exist. At the same time,
the availability of multilingually trained embeddings makes it possible to extend the transfer
setup that we adopted to a multilingual dimension, and train a model jointly on resources
from different languages. This perspective fell beyond the scope of our study, but represents
a clear avenue for future research, and one that looks promising given the outcome of our
experiments. Other plausible extensions include further affective phenomena that are arguably
correlated with hate speech, including stylistic ones such as sarcasm/irony [71] or
author-based ones such as the "big five" personality traits [72], or a more detailed modeling
of the hate speech target beyond the coarse-grained classification we used here, tying in, for
example, with emotion role labeling [73].
   Another aspect to study in more detail follows from the observation of substantial
differences between the results on the HASOC 2019 and the HASOC 2021 data: the improvements of
the MTL model are apparently clearer on the 2019 data. This variance in results is an
opportunity to study the aspects that influence the performance improvements obtained by
considering related concepts.


Acknowledgement
This work has been partially supported by a grant from European Regional Development Fund
(FEDER), LIVING-LANG project [RTI2018-094653-B-C21], and Ministry of Science, Innovation
and Universities (scholarship [FPI-PRE2019-089310]) from the Spanish Government.


References
 [1] P. Fortuna, S. Nunes, A Survey on Automatic Detection of Hate Speech in Text, ACM
     Comput. Surv. 51 (2018). doi:https://doi.org/10.1145/3232676.
 [2] European Commission against Racism and Intolerance (ECRI), ECRI General Policy
     Recommendation No. 15 on Combating Hate Speech, Online, 2015. https://rm.coe.int/
     ecri-general-policy-recommendation-no-15-on-combating-hate-speech/16808b5b01.
 [3] F.-M. Plaza-Del-Arco, M. D. Molina-González, L. A. Ureña López, M. T. Martín-Valdivia,
     Detecting Misogyny and Xenophobia in Spanish Tweets Using Language Technologies,
     ACM Trans. Internet Technol. 20 (2020). doi:https://doi.org/10.1145/3369869.
 [4] A. Schmidt, M. Wiegand, A Survey on Hate Speech Detection using Natural Language
     Processing, in: Proceedings of the Fifth International Workshop on Natural Language
     Processing for Social Media, Association for Computational Linguistics, Valencia, Spain,
     2017, pp. 1–10. URL: https://aclanthology.org/W17-1101.
 [5] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The Risk of Racial Bias in Hate Speech
     Detection, in: Proceedings of the 57th Annual Meeting of the Association for Compu-
     tational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp.
     1668–1678. URL: https://aclanthology.org/P19-1163.
 [6] A. Rodríguez, C. Argueta, Y.-L. Chen, Automatic Detection of Hate Speech on Facebook
     Using Sentiment and Emotion Analysis, in: 2019 International Conference on Artificial
     Intelligence in Information and Communication (ICAIIC), 2019, pp. 169–174. URL: https:
     //ieeexplore.ieee.org/abstract/document/8669073.
 [7] B. Liu, Sentiment Analysis – Mining Opinions, Sentiments, and Emotions, 2nd edition ed.,
     Cambridge University Press, 2012.
 [8] P. Ekman, An argument for basic emotions, Cognition and Emotion 6 (1992) 169–200.
     doi:10.1080/02699939208411068.
 [9] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, M. Zampieri,
     Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content
     Identification in English and Indo-Aryan Languages and Conversational Hate Speech, in:
     FIRE 2021: Forum for Information Retrieval Evaluation, Virtual Event, 13th-17th December
     2021, ACM, 2021.
[10] T. Mandl, S. Modha, G. K. Shahi, H. Madhu, S. Satapara, P. Majumder, J. Schäfer, T. Ranas-
     inghe, M. Zampieri, D. Nandini, A. K. Jaiswal, Overview of the HASOC subtrack at FIRE
     2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Lan-
     guages, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021. URL: http://ceur-ws.org/.
[11] R. Plutchik, The nature of emotions: Human emotions have deep evolutionary roots, a
     fact that may explain their complexity and provide tools for clinical practice, American
     scientist 89 (2001) 344–350.
[12] P. Shaver, J. Schwartz, D. Kirson, C. O’connor, Emotion knowledge: further exploration of
     a prototype approach, Journal of personality and social psychology 52 (1987) 1061–1086.
     doi:10.1037//0022-3514.52.6.1061.
[13] C. Strapparava, R. Mihalcea, SemEval-2007 task 14: Affective text, in: Proceedings of
     the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Association
     for Computational Linguistics, Prague, Czech Republic, 2007, pp. 70–74. URL: https://
     aclanthology.org/S07-1013.
[14] S. Mohammad, F. Bravo-Marquez, WASSA-2017 Shared Task on Emotion Intensity, in:
     Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Senti-
     ment and Social Media Analysis, Association for Computational Linguistics, Copenhagen,
     Denmark, 2017, pp. 34–49. URL: https://aclanthology.org/W17-5205.
[15] S. Mohammad, F. Bravo-Marquez, Emotion Intensities in Tweets, in: Proceedings of the
     6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), Association for
     Computational Linguistics, Vancouver, Canada, 2017, pp. 65–77. URL: https://aclanthology.
     org/S17-1007.
[16] Y. Chen, W. Hou, X. Cheng, S. Li, Joint Learning for Emotion Classification and Emotion
     Cause Detection, in: Proceedings of the 2018 Conference on Empirical Methods in Natural
     Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018,
     pp. 646–651. URL: https://aclanthology.org/D18-1066.
[17] L. A. M. Oberländer, R. Klinger, Token Sequence Labeling vs. Clause Classification for
     English Emotion Stimulus Detection, in: Proceedings of the Ninth Joint Conference
     on Lexical and Computational Semantics, Association for Computational Linguistics,
     Barcelona, Spain (Online), 2020, pp. 58–70. URL: https://aclanthology.org/2020.starsem-1.7.
[18] R. Xia, Z. Ding, Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in
     Texts, in: Proceedings of the 57th Annual Meeting of the Association for Computational
     Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1003–1012.
     URL: https://www.aclweb.org/anthology/P19-1096.
[19] B. M. Doan Dang, L. Oberländer, R. Klinger, Emotion Stimulus Detection in German
     News Headlines, in: Proceedings of the 17th Conference on Natural Language Processing
     (KONVENS 2021), German Society for Computational Linguistics & Language Technology,
     Düsseldorf, Germany, 2021. URL: https://konvens2021.phil.hhu.de/wp-content/uploads/
     2021/09/2021.KONVENS-1.7.pdf.
[20] S. Mohammad, F. Bravo-Marquez, M. Salameh, S. Kiritchenko, SemEval-2018 task 1: Affect
     in tweets, in: Proceedings of The 12th International Workshop on Semantic Evaluation,
     Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1–17. URL:
     https://aclanthology.org/S18-1001.
[21] L.-A.-M. Bostan, R. Klinger, An Analysis of Annotated Corpora for Emotion Classification
     in Text, in: Proceedings of the 27th International Conference on Computational Linguistics,
     Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 2104–
     2119. URL: https://aclanthology.org/C18-1179.
[22] G. T. Patrick, The psychology of profanity, Psychological Review 8 (1901) 113. doi:10.
     1037/h0074772.
[23] V. Liu, C. Banea, R. Mihalcea, Grounded emotions, in: 2017 Seventh International
     Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 477–483.
     doi:10.1109/ACII.2017.8273642.
[24] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, S. Niu, DailyDialog: A Manually Labelled Multi-turn
     Dialogue Dataset, in: Proceedings of the Eighth International Joint Conference on Natural
     Language Processing (Volume 1: Long Papers), Asian Federation of Natural Language
     Processing, Taipei, Taiwan, 2017, pp. 986–995. URL: https://aclanthology.org/I17-1099.
[25] K. R. Scherer, H. G. Wallbott, The ISEAR Questionnaire and Codebook, Geneva Emotion
     Research Group, 1997. URL: https://www.unige.ch/cisa/research/
     materials-and-online-research/research-material/.
[26] A. Abbasi, H. Chen, A. Salem, Sentiment Analysis in Multiple Languages: Feature Selection
     for Opinion Classification in Web Forums, ACM Trans. Inf. Syst. 26 (2008). doi:10.1145/
     1361684.1361685.
[27] A. Elmadany, C. Zhang, M. Abdul-Mageed, A. Hashemi, Leveraging Affective Bidirectional
     Transformers for Offensive Language Detection, in: Proceedings of the 4th Workshop
     on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive
     Language Detection, European Language Resource Association, Marseille, France, 2020,
     pp. 102–108. URL: https://aclanthology.org/2020.osact-1.17.
[28] H. Schuff, J. Barnes, J. Mohme, S. Padó, R. Klinger, Annotation, Modelling and Analysis of
     Fine-Grained Emotions on a Stance and Sentiment Detection Corpus, in: Proceedings of
     the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social
     Media Analysis, Association for Computational Linguistics, Copenhagen, Denmark, 2017,
     pp. 13–23. URL: https://aclanthology.org/W17-5203.
[29] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, C. Potts, Recursive Deep
     Models for Semantic Compositionality Over a Sentiment Treebank, in: Proceedings of the
     2013 Conference on Empirical Methods in Natural Language Processing, Association for
     Computational Linguistics, Seattle, Washington, USA, 2013, pp. 1631–1642. URL: https:
     //aclanthology.org/D13-1170.
[30] P. Nakov, A. Ritter, S. Rosenthal, F. Sebastiani, V. Stoyanov, SemEval-2016 task 4: Sentiment
     analysis in Twitter, in: Proceedings of the 10th International Workshop on Semantic Eval-
     uation (SemEval-2016), Association for Computational Linguistics, San Diego, California,
     2016, pp. 1–18. URL: https://aclanthology.org/S16-1001.
[31] S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, C. Cherry, SemEval-2016 Task 6:
     Detecting Stance in Tweets, in: Proceedings of the 10th International Workshop on
     Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, San
     Diego, California, 2016, pp. 31–41. URL: https://aclanthology.org/S16-1003.
[32] W. Medhat, A. Hassan, H. Korashy, Sentiment analysis algorithms and applications: A
     survey, Ain Shams engineering journal 5 (2014) 1093–1113. doi:10.1016/j.asej.2014.
     04.011.
[33] K. Chakraborty, S. Bhattacharyya, R. Bag, A Survey of Sentiment Analysis from Social
     Media Data, IEEE Transactions on Computational Social Systems 7 (2020) 450–464. doi:10.
     1109/TCSS.2019.2956957.
[34] J. Barnes, R. Klinger, S. Schulte im Walde, Assessing State-of-the-Art Sentiment Mod-
     els on State-of-the-Art Sentiment Datasets, in: Proceedings of the 8th Workshop on
     Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, As-
     sociation for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2–12. URL:
     https://aclanthology.org/W17-5202.
[35] A. Tontodimamma, E. Nissi, A. Sarra, L. Fontanella, Thirty years of research into hate
     speech: topics of interest and their evolution, Scientometrics 126 (2021) 157–179. doi:10.
     1007/s11192-020-03737-6.
[36] N. D. Gitari, Z. Zuping, H. Damien, J. Long, A Lexicon-based Approach for Hate Speech
     Detection, International Journal of Multimedia and Ubiquitous Engineering 10 (2015)
     215–230. doi:10.14257/ijmue.2015.10.4.21.
[37] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate Speech
     Detection with Comment Embeddings, in: Proceedings of the 24th International Confer-
     ence on World Wide Web, WWW ’15 Companion, Association for Computing Machinery,
     New York, NY, USA, 2015, p. 29–30. URL: https://doi.org/10.1145/2740908.2742760.
[38] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep Learning for Hate Speech Detection
     in Tweets, in: Proceedings of the 26th International Conference on World Wide Web
     Companion, WWW ’17 Companion, International World Wide Web Conferences Steering
     Committee, Republic and Canton of Geneva, CHE, 2017, p. 759–760. URL: https://doi.org/
     10.1145/3041021.3054223.
[39] D. Benikova, C. Biemann, M. Kisselew, S. Pado, GermEval 2014 Named Entity Recognition
     Shared Task for German, in: Proceedings of the GermEval 2014 Workshop, 2014, pp.
     104–112. URL: https://hildok.bsz-bw.de/files/283/03_00.pdf.
[40] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the GermEval 2018 Shared Task
     on the Identification of Offensive Language, in: Proceedings of GermEval 2018, 14th
     Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria, 2018. URL:
     https://epub.oeaw.ac.at/0xc1aa5576_0x003a10d2.pdf.
[41] J. M. Struß, M. Siegel, J. Ruppenhofer, M. Wiegand, M. Klenner, Overview of GermEval
     Task 2, 2019 Shared Task on the Identification of Offensive Language, in: Proceedings of
     the 15th Conference on Natural Language Processing (KONVENS 2019), German Society
     for Computational Linguistics & Language Technology, Erlangen, Germany, 2019, pp.
     354–365. URL: https://d-nb.info/1198208546/34.
[42] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Der-
     czynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 Task 12: Multilingual Offensive Language
     Identification in Social Media (OffensEval 2020), in: Proceedings of the Fourteenth Work-
     shop on Semantic Evaluation, International Committee for Computational Linguistics,
     Barcelona (online), 2020, pp. 1425–1447. URL: https://aclanthology.org/2020.semeval-1.188.
[43] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 Task
     6: Identifying and Categorizing Offensive Language in Social Media (OffensEval), in:
     Proceedings of the 13th International Workshop on Semantic Evaluation, Association for
     Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 75–86. URL: https:
     //aclanthology.org/S19-2010.
[44] T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, A. Patel, Overview of
     the HASOC Track at FIRE 2019: Hate Speech and Offensive Content Identification in
     Indo-European Languages, in: Proceedings of the 11th Forum for Information Retrieval
     Evaluation, FIRE ’19, Association for Computing Machinery, New York, NY, USA, 2019, p.
     14–17. doi:10.1145/3368567.3368584.
[45] T. Mandl, S. Modha, G. K. Shahi, A. K. Jaiswal, D. Nandini, D. Patel, P. Majumder, J. Schäfer,
     Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identifi-
     cation in Indo-European Languages, in: Proceedings of the 12th Forum for Information
     Retrieval Evaluation, 2020. URL: http://ceur-ws.org/Vol-2826/T2-1.pdf.
[46] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated Hate Speech Detection and the
     Problem of Offensive Language, in: Proceedings of the International AAAI Conference
     on Web and Social Media, 2017. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/
     14955.
[47] P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Margetts, J. Pierrehumbert, HateCheck:
     Functional Tests for Hate Speech Detection Models, in: Proceedings of the 59th Annual
     Meeting of the Association for Computational Linguistics and the 11th International Joint
     Conference on Natural Language Processing (Volume 1: Long Papers), Association for
     Computational Linguistics, Online, 2021, pp. 41–58. URL: https://aclanthology.org/2021.
     acl-long.4.
[48] Z. Waseem, D. Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate
     Speech Detection on Twitter, in: Proceedings of the NAACL Student Research Workshop,
     Association for Computational Linguistics, San Diego, California, 2016, pp. 88–93. URL:
     https://aclanthology.org/N16-2013.
[49] P. Mathur, R. Sawhney, M. Ayyar, R. Shah, Did you offend me? Classification of Offensive
     Tweets in Hinglish Language, in: Proceedings of the 2nd Workshop on Abusive Language
     Online (ALW2), Association for Computational Linguistics, Brussels, Belgium, 2018, pp.
     138–148. URL: https://aclanthology.org/W18-5118.
[50] G. Wiedemann, E. Ruppert, R. Jindal, C. Biemann, Transfer Learning from LDA to BiLSTM-
     CNN for Offensive Language Detection in Twitter, in: Proceedings of GermEval 2018,
     14th Conference on Natural Language Processing (KONVENS 2018), 2018. URL: https:
     //arxiv.org/abs/1811.02906.
[51] S. MacAvaney, H.-R. Yao, E. Yang, K. Russell, N. Goharian, O. Frieder, Hate speech detection:
     Challenges and solutions, PLOS ONE 14 (2019) 1–16. URL: https://doi.org/10.1371/journal.
     pone.0221152.
[52] V. Lingiardi, N. Carone, G. Semeraro, C. Musto, M. D’Amico, S. Brena, Mapping Twitter
     hate speech towards social and sexual minorities: a lexicon-based approach to semantic
     content analysis, Behaviour & Information Technology 39 (2020) 711–721. doi:10.1080/
     0144929X.2019.1607903.
[53] I. Kwok, Y. Wang, Locate the Hate: Detecting Tweets against Blacks, in: Proceedings of the
     Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI’13, AAAI Press, 2013,
     pp. 1621–1622. URL: https://www.aaai.org/ocs/index.php/AAAI/AAAI13/paper/viewFile/
     6419/6821.
[54] L. Grimminger, R. Klinger, Hate Towards the Political Opponent: A Twitter Corpus
     Study of the 2020 US Elections on the Basis of Offensive Speech and Stance Detection,
     in: Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity,
     Sentiment and Social Media Analysis, Association for Computational Linguistics, Online,
     2021, pp. 171–180. URL: https://aclanthology.org/2021.wassa-1.18.
[55] J. Lemmens, I. Markov, W. Daelemans, Improving Hate Speech Type and Target Detection
     with Hateful Metaphor Features, in: Proceedings of the Fourth Workshop on NLP for Inter-
     net Freedom: Censorship, Disinformation, and Propaganda, Association for Computational
     Linguistics, Online, 2021, pp. 7–16. URL: https://aclanthology.org/2021.nlp4if-1.2.
[56] L. Silva, M. Mondal, D. Correa, F. Benevenuto, I. Weber, Analyzing the Targets of Hate in
     Online Social Media, in: Proceedings of the International AAAI Conference on Web and
     Social Media, 2016, pp. 687–690. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/
     14811.
[57] M. ElSherief, V. Kulkarni, D. Nguyen, W. Y. Wang, E. Belding, Hate lingo: A target-based
     linguistic analysis of hate speech in social media, in: Proceedings of the International
     AAAI Conference on Web and Social Media, 2018. URL: https://www.aaai.org/ocs/index.
     php/ICWSM/ICWSM18/paper/viewFile/17910/16995.
[58] J. Eisenstein, A. Ahmed, E. P. Xing, Sparse Additive Generative Models of Text, in:
     Proceedings of the 28th International Conference on International Conference on Machine
     Learning, ICML’11, Omnipress, Madison, WI, USA, 2011, pp. 1041–1048. URL: https://
     citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.2202&rep=rep1&type=pdf.
[59] S. Ruder, Neural Transfer Learning for Natural Language Processing, Ph.D. thesis, NUI
     Galway, 2019. URL: https://ruder.io/thesis/neural_transfer_learning_for_nlp.pdf.
[60] R. Caruana, Multitask learning, Machine learning 28 (1997) 41–75. doi:10.1023/A:
     1007379606734.
[61] Y. Zhang, Q. Yang, A Survey on Multi-Task Learning, IEEE Transactions on Knowledge
     and Data Engineering (2021) 1–1. doi:10.1109/TKDE.2021.3070203.
[62] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
     Transformers for Language Understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/
     N19-1423. doi:10.18653/v1/N19-1423.
[63] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the Type
     and Target of Offensive Posts in Social Media, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 1415–1420. URL: https://aclanthology.org/
     N19-1144.
[64] S. Mohammad, #Emotional Tweets, in: *SEM 2012: The First Joint Conference on Lexical
     and Computational Semantics – Volume 1: Proceedings of the main conference and the
     shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic
     Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada,
     2012, pp. 246–255. URL: https://aclanthology.org/S12-1033.
[65] F. Plaza-del-Arco, C. Strapparava, L. A. Ureña-López, M. T. Martín-Valdivia, EmoEvent:
     A Multilingual Emotion Corpus based on different Events, in: Proceedings of the 12th
     Language Resources and Evaluation Conference, European Language Resources Associa-
     tion, Marseille, France, 2020, pp. 1492–1498. URL: https://www.aclweb.org/anthology/2020.
     lrec-1.186.
[66] C. Baziotis, N. Pelekis, C. Doulkeridis, DataStories at SemEval-2017 Task 4: Deep
     LSTM with Attention for Message-level and Topic-based Sentiment Analysis, in: Pro-
     ceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017),
     Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 747–754. URL:
     https://aclanthology.org/S17-2126.
[67] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
     N. Gimelshein, L. Antiga, et al., PyTorch: An Imperative Style, High-Performance Deep
     Learning Library, in: Advances in Neural Information Processing Systems, 2019, pp.
     8026–8037. URL: https://proceedings.neurips.cc/paper/2019/file/
     bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
[68] M. S. Akhtar, D. Chauhan, D. Ghosal, S. Poria, A. Ekbal, P. Bhattacharyya, Multi-task
     Learning for Multi-modal Emotion Recognition and Sentiment Analysis, in: Proceedings of
     the 2019 Conference of the North American Chapter of the Association for Computational
     Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association
     for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 370–379. URL: https:
     //aclanthology.org/N19-1034. doi:10.18653/v1/N19-1034.
[69] D. S. Chauhan, D. S R, A. Ekbal, P. Bhattacharyya, Sentiment and Emotion help Sarcasm?
     A Multi-task Learning Framework for Multi-Modal Sarcasm, Sentiment and Emotion
     Analysis, in: Proceedings of the 58th Annual Meeting of the Association for Computational
     Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4351–4360. URL:
     https://aclanthology.org/2020.acl-main.401. doi:10.18653/v1/2020.acl-main.401.
[70] S. Rajamanickam, P. Mishra, H. Yannakoudakis, E. Shutova, Joint Modelling of Emotion
     and Abusive Language Detection, in: Proceedings of the 58th Annual Meeting of the
     Association for Computational Linguistics, Association for Computational Linguistics,
     Online, 2020, pp. 4270–4279. URL: https://aclanthology.org/2020.acl-main.394. doi:10.
     18653/v1/2020.acl-main.394.
[71] A. Reyes, P. Rosso, D. Buscaldi, From humor recognition to irony detection: The figurative
     language of social media, Data & Knowledge Engineering 74 (2012) 1–12.
[72] L. Flek, Returning the N to NLP: Towards Contextually Personalized Classification Mod-
     els, in: Proceedings of the 58th Annual Meeting of the Association for Computational
     Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7828–7838. URL:
     https://aclanthology.org/2020.acl-main.700. doi:10.18653/v1/2020.acl-main.700.
[73] S. Mohammad, X. Zhu, J. Martin, Semantic Role Labeling of Emotions in Tweets, in:
     Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment
     and Social Media Analysis, Association for Computational Linguistics, Baltimore, Maryland,
     2014, pp. 32–41. URL: https://aclanthology.org/W14-2607. doi:10.3115/v1/W14-2607.