<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Mandl</string-name>
          <email>mandl@uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandip Modha</string-name>
          <email>sandip_ce@ldrp.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gautam Kishore Shahi</string-name>
          <email>gautam.shahi@uni-due.de</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Kumar Jaiswal</string-name>
          <email>amitkumarj441@gmail.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Durgesh Nandini</string-name>
          <email>durgeshnandini16@yahoo.in</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daksh Patel</string-name>
          <email>dakshpatel68@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasenjit Majumder</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Schäfer</string-name>
          <email>johannes.schaefer@uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dalhousie University</institution>
          ,
          <addr-line>Halifax</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LDRP-ITR</institution>
          ,
          <addr-line>Gandhinagar</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bamberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Bedfordshire</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Duisburg-Essen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Hildesheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the growth of social media, the spread of hate speech is also increasing rapidly. Social media are widely used in many countries, and hate speech is spreading in these countries as well. This brings a need for multilingual hate speech detection algorithms. Much research in this area is currently dedicated to English. The HASOC track intends to provide a platform to develop and optimize hate speech detection algorithms for Hindi, German and English. The dataset was collected from a Twitter archive and pre-classified by a machine learning system. HASOC has two sub-tasks for all three languages: task A is a binary classification problem (Hate and Offensive vs. Not Offensive) while task B is a fine-grained classification problem with three classes (HATE hate speech, OFFENSIVE and PROFANITY). Overall, 252 runs were submitted by 40 teams. The best classification algorithms for task A achieved F1 measures of 0.51, 0.53 and 0.52 for English, Hindi, and German, respectively. For task B, the best classification algorithms achieved F1 measures of 0.26, 0.33 and 0.29 for English, Hindi, and German, respectively. This article presents the tasks, the data development and the results. The best performing algorithms were mainly variants of the transformer architecture BERT. However, other systems were also applied with good success.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech</kwd>
        <kwd>Offensive Language</kwd>
        <kwd>Multilingual Text Classification</kwd>
        <kwd>Online Harm</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Evaluation</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction: Hate Speech and Its Identification</title>
      <p>
        The large quantity of posts on social media has led to a growth in problematic content. Such
content is often considered harmful for a rational and constructive debate [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Many countries
have defined increasingly detailed rules for dealing with offensive posts [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Companies
and platforms are also concerned about problematic content. Problematic content, and in
particular hate speech, has become a growing research area. Linguists have analysed and described
various forms of hate speech [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Political scientists and legal experts search for ways to
regulate platforms and to handle problematic content without oppressing free speech [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The identification of hate speech within large collections of posts has led to much research
in information science and computer science. Much research is carried out within big internet
platforms. It is important to provide open resources to keep the society informed about the
current performance of technology and the challenges of hate speech identification.</p>
      <p>
        Algorithms are continuously improved and diverse collections for a variety of related tasks
and for several languages are being generated and analysed. Collections were built recently for
many languages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], e.g. for Greek [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Portuguese [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Danish [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Mexican Spanish [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and
Turkish [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The availability of several benchmarks allows better analysis of their differences
and their reliability [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        HASOC contributes to this research. In this second edition, a different approach for creating
the data was applied. The two main tasks and the languages were kept identical for better
comparison with the results from HASOC 2019 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. HASOC Task Description</title>
      <p>In HASOC 2020, two tasks in the research area of hate speech detection are proposed. These
two sub-tasks are offered for all three languages: Hindi, German and English. We chose
Hindi and German in particular as languages with fewer resources. HASOC also provides a testbed for
English to see how the algorithms perform in comparison to a language with many resources.
Below is a description of each task.
2.1. Sub-task A: Identifying Hate-Offensive and Non Hate-Offensive Content
(Binary)
This task focuses on hate speech and offensive language identification and is offered for English,
German, and Hindi. Sub-task A is a coarse-grained binary classification task in which participating
systems are required to classify tweets into two classes, namely Hate and Offensive (HOF) and
Non-Hate and Offensive (NOT).</p>
      <p>• NOT - Non Hate-Offensive: This post does not contain any hate speech, profane or
offensive content.
• HOF - Hate and Offensive: This post contains hate, offensive, or profane content.</p>
      <p>Table 1: Example posts from the final set with their sub-task A and sub-task B labels:
(NOT, NONE) RT @rjcmxrell: im not fine, i need you
(HOF, PRFN) You be playin= I’m tryna fuck
(HOF, OFFN) RT @femmevillain: jon snow a punk ass bitch catelyn was right to bully him
(NOT, NONE) Buhari His Not Our President, I’m Ready To Go To Prison – Ayo Adebanjo Dares FG https://t.co/XXR6VRRI5b
(HOF, HATE) This shit sad af “but I don’t have a daddy” ... you niggas gotta do better by these kids they didn’t ask to be here.
(HOF, PRFN) RT @GuitarMoog: As for bullshit being put about by people who do know better, neither of the two biggest groups in the EP are going to get. . .</p>
      <p>
2.2. Sub-task B: Identifying Hate, Profane and Offensive Posts (Fine-Grained)
This sub-task is a fine-grained classification task, also offered for English, German, and Hindi.
Hate speech and offensive posts from sub-task A are further classified into three categories:
• HATE - Hate speech: Posts under this class contain hate speech content. Ascribing
negative attributes or deficiencies to groups of individuals because they are members of
a group (e.g. all poor people are stupid) would belong to this class. In more detail, this
class comprises any hateful comment toward groups because of race, political opinion,
sexual orientation, gender, social status, health condition or similar.
• OFFN - Offensive: Posts under this class contain offensive content. In particular, this
refers to degrading, dehumanizing or insulting an individual. Threatening with violent
acts also belongs to this class.
• PRFN - Profane: These posts contain profane words, i.e. unacceptable language in the
absence of hate and offensive content. This typically concerns the usage of obscenity,
swearwords and cursing.</p>
      <p>Some examples of posts from all classes of the final set are shown in Table 1.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Description</title>
      <p>
        Most hate speech datasets, including HASOC 2019 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], are sampled by crawling social
media platforms or addressing their API using keywords considered relevant for hate speech or
scraping hashtags. As a variant, these methods are used to find users of social media who
frequently post hate speech messages and collect all messages from their timeline [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. All of
these methods are based on hand-crafted lists of hate speech related terms. This may introduce
a bias because this process might limit the collection to topics and words which come to mind.
An empirical analysis [15] has pointed out that these practices may lead to bias. It concluded
that datasets that are created using focused sampling exhibit more bias than those created by
random sampling. Furthermore, deep learning or machine learning models that reported the
best results on these biased datasets substantially underperform on benchmark datasets
sampled using different keywords. Similar observations on bias induced from training data were
also reported [16]. However, fully random sampling leads to datasets with very few hate speech
samples, which requires much more manual annotation.</p>
      <p>Table 2: Class distribution of the training sets.
English: NOT 1,852, HOF 1,856 (PRFN 1,377, HATE 158, OFFN 321), sum 3,708.
German: NOT 1,700, HOF 673 (PRFN 387, HATE 146, OFFN 140), sum 2,373.
Hindi: NOT 2,116, HOF 847 (PRFN 148, HATE 234, OFFN 465), sum 2,963.</p>
      <p>Table 3: Class distribution of the development sets.
English: NOT 391, HOF 423 (PRFN 293, HATE 25, OFFN 82), sum 814.
German: NOT 392, HOF 134 (PRFN 88, HATE 24, OFFN 36), sum 526.
Hindi: NOT 466, HOF 197 (PRFN 27, HATE 56, OFFN 87), sum 663.</p>
      <p>Table 4: Class distribution of the test sets.
English: NOT 395, HOF 418 (PRFN 287, HATE 33, OFFN 87), sum 813.
German: NOT 393, HOF 133 (PRFN 89, HATE 33, OFFN 30), sum 526.
Hindi: NOT 436, HOF 226 (PRFN 36, HATE 57, OFFN 104), sum 662.</p>
      <p>One of the HASOC 2020 dataset’s main objectives is to minimize the impact of bias in the
training data. The observations made in previous research by Wiegand et al. [15] and Davidson
et al. [16] inspired us to develop a hate speech dataset based on a sampling process which relies
on less manual input. The final sizes of the training, development and testing sets are shown in Tables
2, 3 and 4. The development set was used to calculate the metrics for participants during the
campaign.</p>
      <p>During planning for HASOC 2020, the organizers searched for tweet collections and
identified archive.org 1. We downloaded the entire archive for the month of May 2019. The archive
contains tweets on an hourly basis. The volume of the tweets for a full day is around 2.25 GB
in compressed format and 22.9 GB in uncompressed format. To obtain a set of hateful tweets,
we developed a sampling method that is presented in the next paragraphs.</p>
      <p>After downloading the archive of tweets, we extracted English, German, and Hindi tweets
using the language attribute provided by the Twitter metadata. We trained an SVM machine
learning model with word features weighted by TF-IDF without considering any further
features.</p>
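      <p>The weak classifier is described only at a high level in the text; the following is a minimal sketch of such a TF-IDF word-feature linear SVM, assuming scikit-learn is available and using purely illustrative toy data rather than the OLID or HASOC corpora:</p>

```python
# Weak binary classifier sketch: word unigrams weighted by TF-IDF fed to a
# linear SVM. The toy training data below is illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "you are all idiots",     # HOF
    "what a lovely morning",  # NOT
    "shut up you fool",       # HOF
    "great game last night",  # NOT
]
train_labels = ["HOF", "NOT", "HOF", "NOT"]

# Word features weighted by TF-IDF, no further features, then an SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["shut up you fool"])[0])
```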
      <p>For the dataset development, the entire May 2019 archive was crawled for German. For
English and Hindi the archives of 1st, 10th and 19th May 2019 were crawled. We used Python
scripts for filtering.</p>
      <p>We found an average of some 1,301,000 tweets per day for English in the archive. We
sampled only 35,000 tweets as potential hate speech candidates. Similarly, the average daily volume
of Hindi tweets is around 24,000 and the amount of German tweets is around 12,500.</p>
      <p>To obtain potentially hateful tweets in English, we trained the SVM model on the OLID
[17] and HASOC 2019 datasets. The purpose was to create a weak binary classifier that gives
an F1 score around 0.5. We applied this model to the English tweets that were extracted
from the archive. We kept all the tweets that were classified as hateful by the weak
classifier. We randomly added 5 percent of the tweets which were not classified as hateful. The
main idea behind this merge is to ensure that the final dataset contains an equal distribution of
hateful and non-hateful tweets. Then this set of English tweets was distributed to the
annotators using heuristics within the platform. Out of 35,000 tweets, some 2,600 tweets were labeled
as potentially hateful by the classifier.</p>
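      <p>The merging step above can be sketched as follows; the function name and the stand-in classifier are hypothetical, and only the 5 percent figure is taken from the text:</p>

```python
import random

def sample_candidates(tweets, weak_classifier, extra_fraction=0.05, seed=42):
    """Keep every tweet the weak classifier flags as hateful (HOF), plus a
    random 5 percent of the tweets it does not flag."""
    flagged = [t for t in tweets if weak_classifier(t) == "HOF"]
    rest = [t for t in tweets if weak_classifier(t) != "HOF"]
    rng = random.Random(seed)
    extra = rng.sample(rest, k=int(len(rest) * extra_fraction))
    return flagged + extra

# Toy run with a trivial stand-in for the SVM classifier.
toy = ["you fool"] * 3 + [f"tweet {i}" for i in range(100)]
predict = lambda t: "HOF" if "fool" in t else "NOT"
candidates = sample_candidates(toy, predict)
print(len(candidates))  # 3 flagged tweets plus 5 of the 100 others
```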
      <p>The Hindi dataset was prepared using the same method, but the SVM model was trained
with the TRAC corpus [18] and the HASOC 2019 corpus. The average number of potential
hate speech tweets is around 5,700 out of 24,000 (around 24 percent). The German dataset was extracted
from the archive using the SVM model that was trained with the dataset from GermEval 2018
[19] 2 and the dataset from HASOC 2019. It is worth noting that the number of potentially
hateful tweets found for English and Hindi is substantially higher than for German (more
than ten times). Therefore, we had to crawl the entire month of May 2019 to obtain a dataset of
reasonable size for German. The weak classifier based on an SVM labeled only 150 tweets (around 1.25
percent) as potentially hateful.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Annotation</title>
        <p>All tweets in these sets were annotated manually by people who use social media in the
respective language. The annotators were students who received a small amount of money for
their work. They were neither aware of how the tweets were collected nor did they know the
classification result for a tweet.</p>
        <p>Tweet allocation for annotations. The tweet allocation was performed in such a way that
each tweet was annotated twice. In cases where there was a conflict in the annotation between
the first two annotators, the tweet was automatically scheduled to be assigned to a third annotator
who had not yet seen the respective tweet. This way we ensured the integrity of the annotation
and tried to avoid human bias. Annotators could also report tweets for a variety of reasons. In
cases where a particular tweet was reported by both initial annotators, it was considered an
outlier and not used further when generating the dataset. However, the resources for labelling
were limited, so not all conflict cases could be resolved by a third annotation.
1https://archive.org/details/archiveteam-twitter-stream-2019-05
2https://projects.cai.fbi.h-da.de/iggsa/
Annotation platform and process. During the labelling process, the annotators for each
language engaged with an online system to judge the tweets. The online system presented only the
text of the tweet and users had to make the decision. The annotation system allows
oversight of the process so that progress can be monitored.</p>
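        <p>The allocation and escalation logic described above can be sketched like this (function names are illustrative, not the platform's actual code):</p>

```python
def resolve(label_a, label_b, third_annotator=None):
    """Each tweet gets two annotations; on conflict the tweet is escalated to
    a third annotator who has not seen it yet, if resources allow."""
    if label_a == label_b:
        return label_a  # the two annotators agree
    if third_annotator is not None:
        label_c = third_annotator()
        # Majority vote among the three annotations.
        for label in (label_a, label_b):
            if label == label_c:
                return label
    return None  # conflict left unresolved (limited labelling resources)

print(resolve("HOF", "HOF"))                 # prints HOF
print(resolve("HOF", "NOT", lambda: "NOT"))  # third annotator breaks the tie
```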
        <p>The interface of the system can be seen in Figure 1 and Figure 2. For Hindi and German,
native speakers were contracted as annotators. For English, students from India (Gujarat) were
contracted who are educated in English and who use social media regularly in English. The
annotators were given short guidelines that contained the information mentioned in sections
2.1 and 2.2. Apart from the definitions listed above, the guidelines listed the following rules.
• Dubious cases which are difficult to decide even for humans should be left out.
• Content behind links is not considered.
• Hashtags are considered.
• Content in other languages is not labelled, but reported.</p>
        <p>The annotators met online in brief meetings during the process at least twice for each
language. They discussed the guidelines and borderline cases in order to find a common practice
and interpretation of hate speech.</p>
        <p>Nevertheless, the process remains highly subjective, and even after discussion of
questionable cases, often no agreement could be reached. This lies in the nature of hate speech and its
perception by humans. Overall, the new sampling method led to a large portion of profane content.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Inter-Coder Reliability</title>
        <p>We randomly assigned the data to the annotators, and two or three annotators annotated each
tweet. For tweets with three annotators, the majority vote was taken. For the
tweets annotated by two annotators, we used the following two approaches. Case I: when both
annotators voted for the same judgement, this majority decision was accepted. Case II: when the two annotators
contradicted each other, we considered the rating of the more reliable annotator. We
measured the reliability of the annotators with heuristics based on the overlap of their previous
annotations with others. The data were annotated by 11, 11 and 8 different annotators for
Hindi, English, and German, respectively.</p>
        <p>In the first round, the data was annotated by different annotators and the majority vote was
considered; the annotation agreement is shown in Table 5. For the data with disagreement, we ran
a second round of annotation. For the English and German languages, we considered the
vote of the most reliable annotator. The algorithm used for determining the voting reliability
of each annotator is shown in Figure 3. For Hindi, we re-annotated the conflicting data with
different annotators.</p>
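        <p>The actual algorithm for the reliability score is given only in Figure 3; the following is a plausible sketch of an overlap-based reliability heuristic with the Case II tie-breaking, under our own assumptions:</p>

```python
from collections import defaultdict

def reliability(history):
    """Score each annotator by the fraction of past tweets on which they
    agreed with the co-annotator (overlap-based heuristic; illustrative)."""
    agree, total = defaultdict(int), defaultdict(int)
    for votes in history.values():  # votes: {annotator: label}, two entries
        (a1, l1), (a2, l2) = list(votes.items())[:2]
        for a in (a1, a2):
            total[a] += 1
            agree[a] += int(l1 == l2)
    return {a: agree[a] / total[a] for a in total}

def break_tie(votes, scores):
    """Case II: adopt the label of the more reliable annotator."""
    best = max(votes, key=lambda a: scores.get(a, 0.0))
    return votes[best]

history = {
    "t1": {"ann1": "HOF", "ann2": "HOF"},
    "t2": {"ann1": "NOT", "ann3": "HOF"},
    "t3": {"ann2": "NOT", "ann3": "NOT"},
}
scores = reliability(history)
print(break_tie({"ann1": "NOT", "ann2": "HOF"}, scores))  # prints HOF
```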
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Participation and Evaluation</title>
      <p>This section details the participants' submissions and the evaluation method for each of the three
languages, i.e., English, German and Hindi. Each language consists of 2 sub-tasks,
and registered teams were able to take part in any of the sub-tasks. There were
116 registered participants and 40 teams finally submitted results.</p>
      <sec id="sec-4-1">
        <title>4.1. Submission Format</title>
        <p>The submission and evaluation of experiments was handled on Codalab 3. The system can be
seen in Figure 4.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Measure</title>
        <p>We posed two sub-tasks for each of the languages English, German and Hindi. As each of
the language sub-tasks contains multiple classes with non-uniform numbers of samples, we
decided to use a measure that is robust to this class imbalance to rank the submissions of the teams, in our case the
macro F1 measure.</p>
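        <p>Macro F1 averages the per-class F1 scores with equal weight per class, regardless of class size; a minimal self-contained computation:</p>

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["HOF", "HOF", "NOT", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT", "HOF"]
print(round(macro_f1(y_true, y_pred), 3))  # prints 0.583
```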
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Timeline</title>
        <p>The participants received online access to the training data and worked on the tasks. The
data of HASOC 2019 was also available to participants; however, not many reported using
it. During that phase, they could observe their performance based on the development set. In
particular, the position relative to other teams is interesting.</p>
        <p>• Release of Training data: September 15, 2020
3https://competitions.codalab.org/competitions/26027</p>
        <p>• Result submission on Codalab: September 27, 2020</p>
        <p>Overall, more than 252 experiments were submitted.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results for Tasks</title>
      <p>
        This section gives the results of the participating systems. They are ordered by language and
each subsection reports on both tasks. Unfortunately, a description was not submitted for all
systems. All metrics in the following tables are reported for the test set 4. The results overall
prove that the task of identifying and further classifying hate speech or offensive language is
still challenging. No F1 score above 0.55 could be achieved. These scores are lower than those
achieved for the HASOC 2019 dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
5.1. Hindi
The results for Hindi are shown in Tables 6 and 7. Figures 5 and 6 show the performance
by date of submission. They suggest that the leaderboard on the website
was helpful for some teams to improve their score.
      </p>
      <p>The best result for task A achieved an F1 score of slightly above 0.53. This can be considered
a low score and the task proved to be more challenging than the HASOC 2019 experiments.
The 10 best submissions received very similar scores.</p>
      <p>The best system applied a BiLSTM with 1 layer and fastText word embeddings as the basic
representation of the input [20]. The submission at position 3 did not use any deep learning
but a lexical approach with TF-IDF weighting and an SVM for classification. This system also
translated the tweets automatically to augment the set of available training samples [21]. The
fourth system applied the BERT model distilBERT-base for different languages [22].</p>
      <p>For task B, the performance is even lower and the best experiment reaches only a score just above
0.33. However, the first system has a much better performance than the following submissions.
Ranks 2 to 10 again score very similarly.</p>
      <p>The best ranked system uses a fine-tuned BERT model for the classification [23]. The
second-ranked system was already successful for task A. It applied a BiLSTM and fastText as
the basic representation [20]. The third ranked team from LDRP-ITR experimented with BERT and
GPT-2. For this run, they used a CNN which conducted a bigram and a trigram analysis in
parallel and fused the results [24].</p>
      <sec id="sec-5-1">
        <title>5.2. German</title>
        <p>The results for German are shown in Tables 8 and 9. The submission data is given in Figures 7
and 8. The situation for German is similar to the results of Hindi. The best F1 score is not very
high and the best submissions are close to each other.</p>
        <p>The best performance for German was achieved using fine-tuned versions of BERT,
DistilBERT and RoBERTa [25]. The second best system also used BERT; the group adapted the
upper layer structure of BERT-Ger [26]. Other systems also applied BERT and variants,
e.g. position 4 [27], position 8 [22] and position 14 [28].</p>
        <p>For task B, the results are very close together. The best model was submitted by team Siva
[29]. The second best submission used ALBERT [28]. For the third rank, experiments with
versions of BERT, DistilBERT and RoBERTa were submitted [25]. Huiping Shi used a
self-attention model [30].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3. English</title>
        <p>English attracted the most experiments for both tasks. The results for English are shown in Tables
10 and 11. Again, submission data is summarized in Figures 9 and 10. The performance
differences between the top 30 teams are extremely small; the relative improvement is about 5%.
As for Hindi and German, the F1 measure shows rather low values. The best system achieved
a performance of 0.52.</p>
        <p>The best result for English is based on an LSTM which used GloVe embeddings as input [31].
The TU Berlin team used a character-based LSTM which performed better than their
experiments with BERT [32]. Many other submissions used BERT. The team in position 6 used a
self-developed transformer architecture [30]. The team from IIT Patna used a standard BERT
model and reached the third position [33]. The team from Jadavpur University (JU) [34] and one
team from Yunnan University [27] used RoBERTa. The second team from Yunnan University
applied an ensemble of three classifiers including BERT, LSTM and CNN [35].</p>
        <p>For task B, the top three systems used BERT and variants. The best result was achieved by
team Chrestotes with an F1 value of 0.26 for English [36]. They used a fine-tuned version of
BERT. The team HUB from Yunnan University applied ALBERT and BERT. The team ZEUS
from Yunnan University applied ALBERT and DPCNN [37].</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.4. Leaderboard</title>
        <p>We report the participant statistics at team level across all three languages for the
corresponding sub-tasks in Table 12.
Figure 8: German Task B. Figure 9: English Task A. Figure 10: English Task B.</p>
        <p>Table 12 reports the number of runs each team submitted per dataset and sub-task. The 40 participating teams were:
IIIT_DWD, CONCORDIA_CIT_TEAM, AI_ML_NIT_Patna, Oreo, MUM, Huiping Shi, TU Berlin, NITP-AI-NLP, JU, HASOCOne,
Astralis, YNU_WU, YNU_OXZ, HRS-TECHIE, ZYJ, Buddi_SAP, HateDetectors, QutBird, NLP-CIC, SSN_NLP_MLRG,
Fazlourrahman Balouchzahi, Lee, IRIT-PREVISION, chrestotes, zeus, DLRG, ComMA, Siva, hub, CFILT IIT Bombay,
Salil Mishra, NSIT_ML_Geeks, Buddi_avengers, yasuo, UDE-LTL, Sushma Kumari, simon, IRLab@IITVaranasi, YUN111
and LoneWolf (# Teams = 40).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Interpretation</title>
      <p>The top teams are close together. This shows that despite the variety of approaches that
were used, no advantage of a particular technology was identified. Most participants used deep
learning models, and in particular transformer-based architectures were popular. Variants of
BERT like ALBERT were used frequently. The best systems for the tasks applied the following
methodology. The best submission for Hindi used a BiLSTM with fastText embeddings as input
[20]. The best performance for German was achieved using fine-tuned versions of BERT,
DistilBERT and RoBERTa [25]. The best result for English is based on an LSTM which used GloVe
embeddings as input [31]. Very heterogeneous approaches led to the best results for the respective
languages.</p>
      <p>For Task B, the best systems reached 0.29 for German, 0.33 [29] for Hindi and 0.26 for English
[36]. The fine-grained classification turned out to be a big challenge.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Outlook</title>
      <p>The results of HASOC 2020 have shown that hate speech identification remains a difficult
challenge. The performance measures for the test set in HASOC 2020 have been considerably
lower than those for HASOC 2019. This is likely an effect of the different data sampling method.
Although the method is close to realistic conditions at a platform, it has led to much
profane content. The best results were achieved with state-of-the-art transformer models and their
variants like ALBERT. The differences between the results for the three languages are small.
This seems to indicate that pre-trained deep learning models have the potential to deliver good
performance even for languages with few traditional resources. The organizers hope that the
data will be used for further research related to hate speech. Apart from classification, topic
modelling, analysis of the reliability of the evaluation and failure analysis seem promising areas
of research.</p>
      <p>For future evaluations, the analysis of language might need to be supplemented by an
analysis of visual material posted in social media. Often, offensive intent can only be seen when
considering both text and, e.g., image content [38]. Many hateful tweets are also spread as
misinformation, which deserves a closer look [39, 40, 41].</p>
      <p>The identification of offensive content still leaves the social questions unanswered: How to
react? Different approaches have been proposed; they reach from deletion [42] to labeling [43]
and to counter speech by either bots [44] or humans [45]. Societies need to find strategies
adequate for their specific demands. We could also use a different kind of algorithm, like a
spiking neural network, to improve the performance of hate speech detection using temporal
and non-temporal features [46, 47].</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgement</title>
      <p>We thank all participants for their submissions and their valuable work. We thank all the jurors
who labelled the tweets in a short period of time. We also thank the FIRE organisers for their
support in organising the track.
task 2, 2019 shared task on the identification of offensive language (2019). URL:
https://doi.org/10.5167/uzh-178687.
[15] M. Wiegand, J. Ruppenhofer, T. Kleinbauer, Detection of abusive language: the problem of
biased datasets, in: Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), 2019, pp. 602–608.
[16] T. Davidson, D. Bhattacharya, I. Weber, Racial bias in hate speech and abusive language
detection datasets, arXiv preprint arXiv:1905.12516 (2019).
[17] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type
and target of offensive posts in social media, arXiv preprint arXiv:1902.09666 (2019).
[18] R. Kumar, A. K. Ojha, B. Lahiri, M. Zampieri, S. Malmasi, V. Murdock, D. Kadar,
Proceedings of the second workshop on trolling, aggression and cyberbullying, in:
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, 2020. URL:
https://www.aclweb.org/anthology/2020.trac-1.0.pdf.
[19] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the GermEval 2018 shared task on the
identification of offensive language, in: Proceedings of GermEval 2018, 14th Conference
on Natural Language Processing (KONVENS 2018), 2018. URL: https://www.austriaca.at/
8435-5.
[20] R. Raj, S. Srivastava, S. Saumya, NSIT &amp; IIITDWD @HASOC 2020: Deep learning model
for hate-speech identification in indo-european languages, in: FIRE (Working Notes),
CEUR, 2020.
[21] R. Rajalakshmi, B. Y. Reddy, DLRG@HASOC 2020: A hybrid approach for hate and
offensive content identification in multilingual text, in: FIRE (Working Notes), CEUR, 2020.
[22] S. Kumar, A. Saumya, J. P. Singh, NITP-AINLP@HASOC-Dravidian-CodeMix-FIRE2020:
A Machine Learning Approach to Identify Offensive Languages from Dravidian
CodeMixed Text, in: FIRE (Working Notes), CEUR, 2020.
[23] S. Kumari, Nohate at HASOC 2020: Multilingual hate speech detection, arXiv preprint
(2021).
[24] H. Madhu, S. Satapara, H. Rathod, Astralis @HASOC 2020: Analysis on identification of
hate speech in Indo-European languages with fine-tuned transformers, in: FIRE (Working
Notes), CEUR, 2020.
[25] R. Kumar, B. Lahiri, A. K. Ojha, A. Bansal, ComMA@FIRE 2020: Exploring multilingual
joint training across different classification tasks, in: FIRE (Working Notes), CEUR, 2020.
[26] Q. Que, R. Sun, S. Xie, Simon @HASOC 2020: Detecting hate speech and offensive content
in German language with BERT and ensembles, in: FIRE (Working Notes), CEUR, 2020.
[27] X. Ou, H. Li, YNU OXZ at HASOC 2020: Multilingual Hate Speech and Offensive Content
Identification based on XLM-RoBERTa, in: FIRE (Working Notes), CEUR, 2020.
[28] A. Kalaivani, D. Thenmozhi, SSN NLP MLRG @HASOC-FIRE2020: Multilingual hate
speech and offensive content detection in Indo-European languages using ALBERT, in:
FIRE (Working Notes), CEUR, 2020.
[29] S. Sai, Y. Sharma, Siv@HASOC-Dravidian-CodeMix-FIRE-2020: Multilingual Offensive
Speech Detection in Code-mixed and Romanized Text, in: FIRE (Working Notes), CEUR,
2020.
[30] H. Shi, X. Zhou, Huiping Shi @HASOC 2020: Multi-top self-attention with k-max
pooling for discrimination between hate, profane and offensive posts, in: FIRE (Working
Notes), CEUR, 2020.
[31] A. K. Mishra, S. Saumya, A. Kumar, IIIT_DWD@HASOC 2020: Identifying offensive
content in multitask Indo-European languages, in: FIRE (Working Notes), CEUR, 2020.
[32] S. Mohtaj, V. Woloszyn, S. Möller, TUB at HASOC 2020: Character based LSTM for hate
speech detection in Indo-European languages, in: FIRE (Working Notes), CEUR, 2020.
[33] K. Kumari, J. P. Singh, AI_ML NIT Patna @HASOC 2020: BERT Models for Hate Speech
Identification in Indo-European Languages, in: FIRE (Working Notes), CEUR, 2020.
[34] B. Ray, A. Garain, JU at HASOC 2020: Deep learning with RoBERTa and random forest
for hate speech and offensive content identification in Indo-European languages, in: FIRE
(Working Notes), CEUR, 2020.
[35] Z. Zhang, Y. Wu, H. Wu, YUN DE at HASOC 2020 subtask a: Multi-model ensemble
learning for identifying hate speech and ofensive language, in: FIRE (Working Notes),
CEUR, 2020.
[36] T. Ezike, M. Sivanesan, Chrestotes at HASOC 2020: BERT Fine-tuning for the Identification
of Hate Speech and Offensive Language in Tweets, in: FIRE (Working Notes), CEUR, 2020.
[37] S. Zhou, R. Fu, J. Li, Zeus at HASOC 2020: Hate speech detection based on
ALBERT-DPCNN, in: FIRE (Working Notes), CEUR, 2020.
[38] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, The
hateful memes challenge: Detecting hate speech in multimodal memes, arXiv preprint
arXiv:2005.04790 (2020). URL: https://arxiv.org/abs/2005.04790.
[39] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of COVID-19 misinformation
on Twitter, arXiv preprint arXiv:2005.05710 (2020).
[40] G. K. Shahi, AMUSED: An annotation framework of multi-modal social media data, arXiv
preprint arXiv:2010.00502 (2020).
[41] G. K. Shahi, D. Nandini, FakeCovid – a multilingual cross-domain fact check news dataset
for COVID-19, arXiv preprint arXiv:2006.11343 (2020).
[42] B. Kalsnes, K. A. Ihlebaek, Hiding hate speech: political moderation on Facebook,
Media, Culture &amp; Society (2020) 1–17. URL: https://journals.sagepub.com/doi/pdf/10.1177/
0163443720957562.
[43] S. Modha, P. Majumder, T. Mandl, C. Mandalia, Detecting and visualizing hate speech in
social media: A cyber watchdog for surveillance, Expert Systems with Applications 161
(2020) 113725. URL: https://doi.org/10.1016/j.eswa.2020.113725. doi:10.1016/j.eswa.
2020.113725.
[44] A. M. de los Riscos, L. F. D’Haro, ToxicBot: A conversational agent to fight online hate
speech, in: Conversational Dialogue Systems for the Next Decade, Springer, 2020, pp.
15–30. URL: https://doi.org/10.1007/978-981-15-8395-7_2.
[45] A. Oboler, K. Connelly, et al., Building smarter communities of resistance and solidarity,
Cosmopolitan Civil Societies: An Interdisciplinary Journal 10 (2018) 99. URL: http://dx.
doi.org/10.5130/ccs.v10i2.6035.
[46] D. Nandini, E. Capecci, L. Koefoed, I. Laña, G. K. Shahi, N. Kasabov, Modelling and
analysis of temporal gene expression data using spiking neural networks, in: International
Conference on Neural Information Processing, Springer, 2018, pp. 571–581.
[47] G. K. Shahi, I. Bilbao, E. Capecci, D. Nandini, M. Choukri, N. Kasabov, Analysis,
classification and marker discovery of gene expression data with evolving spiking neural
networks, in: International Conference on Neural Information Processing, Springer, 2018,
pp. 517–527.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Vedeler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Olsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eriksen</surname>
          </string-name>
          ,
          <article-title>Hate speech harms: A social justice discussion of disabled norwegians' experiences</article-title>
          ,
          <source>Disability &amp; Society</source>
          <volume>34</volume>
          (
          <year>2019</year>
          )
          <fpage>368</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Asogwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ezeibe</surname>
          </string-name>
          ,
          <article-title>The state, hate speech regulation and sustainable democracy in Africa: a study of Nigeria and Kenya</article-title>
          ,
          <source>African Identities</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Quintel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ullrich</surname>
          </string-name>
          ,
          <article-title>Self-regulation of fundamental rights? the EU code of conduct on hate speech, related initiatives and beyond, Fundamental Rights Protection Online: The Future Regulation Of Intermediaries</article-title>
          , Edward Elgar Publishing, Summer/Autumn (
          <year>2019</year>
          ). URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3298719.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jaki</surname>
          </string-name>
          , T. De Smedt,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gwóźdź</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Panchal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rossa</surname>
          </string-name>
          , G. De Pauw,
          <article-title>Online hatred of women in the incels.me forum: Linguistic analysis and automatic detection</article-title>
          ,
          <source>Journal of Language Aggression and Conflict</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>240</fpage>
          -
          <lpage>268</lpage>
          . doi:10.1075/jlac.00026.jak.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Casey</surname>
          </string-name>
          ,
          <article-title>Ending the incel rebellion: The tragic impacts of an online hate group</article-title>
          ,
          <source>Loyola Journal of Public Interest Law</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <fpage>71</fpage>
          . URL: https://heinonline.org/HOL/P?h=hein.journals/loyjpubil21&amp;i=79.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>Resources and benchmark corpora for hate speech detection: a systematic review</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          . URL: https://doi.org/10.1007/s10579-020-09502-8.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitenis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , T. Ranasinghe,
          <article-title>Offensive language identification in Greek</article-title>
          , arXiv preprint arXiv:2003.07459 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>da Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          , et al.,
          <article-title>A hierarchically-labeled portuguese hate speech dataset</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Abusive Language Online</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>104</lpage>
          . URL: https://www.aclweb.org/anthology/W19-3510.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Sigurbergsson</surname>
          </string-name>
          , L. Derczynski,
          <article-title>Offensive language and hate speech detection for Danish</article-title>
          , arXiv preprint arXiv:1908.04531 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Aragón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Á.</given-names>
            <surname>Á. Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moctezuma</surname>
          </string-name>
          ,
          <article-title>Overview of MEX-A3T at IberLEF 2019: Authorship and aggressiveness analysis in mexican spanish tweets</article-title>
          .,
          <source>in: Iberian Languages Evaluation Forum (IberLEF) SEPLN</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>494</lpage>
          . URL: http://ceur-ws.org/Vol-2421/MEX-A3T_overview.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ç.</given-names>
            <surname>Çöltekin</surname>
          </string-name>
          ,
          <article-title>A corpus of Turkish offensive language on social media</article-title>
          ,
          <source>in: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6174</fpage>
          -
          <lpage>6184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Madukwe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <article-title>In data we trust: A critical analysis of hate speech detection datasets</article-title>
          ,
          <source>in: Proceedings of the Fourth Workshop on Online Abuse and Harms</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>150</fpage>
          -
          <lpage>161</lpage>
          . URL: https://www.aclweb.org/anthology/2020.alw-1.18/.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          ,
          <source>in: Working Notes of the Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE, CEUR-WS</source>
          ,
          <year>2019</year>
          . URL: http://ceur-ws.org/Vol-2517/T3-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klenner</surname>
          </string-name>
          , Overview of GermEval
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>