<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Abusive and Threatening Language Detection in Urdu at FIRE 2021</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maaz Amjad</string-name>
          <email>hamzaimamamjad@phystech.edu</email>
          <email>maazamjad@phystech.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alisa Zhila</string-name>
          <email>alisa.zhila@ronininstitute.org</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Labunets</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabur Butt</string-name>
          <email>sabur@nlp.cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Independent Researcher</institution>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC)</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Moscow Institute of Physics and Technology</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Ronin Institute for Independent Scholarship</institution>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>With the growing influence of social media platforms, the effects of their misuse become more and more impactful. The importance of automatic detection of threatening and abusive language cannot be overestimated. However, most existing studies and state-of-the-art methods focus on English as the target language, with limited work on low- and medium-resource languages. In this paper, we present two shared tasks of abusive and threatening language detection for the Urdu language, which has more than 170 million speakers worldwide. Both are posed as binary classification tasks in which participating systems are required to classify tweets in Urdu into two classes, namely: (i) Abusive and Non-Abusive for the first task, and (ii) Threatening and Non-Threatening for the second. We present two manually annotated datasets containing tweets labeled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening. The abusive dataset contains 2,400 annotated tweets in the training part and 1,100 annotated tweets in the test part. The threatening dataset contains 6,000 annotated tweets in the training part and 3,950 annotated tweets in the test part. We also provide logistic regression and BERT-based baseline classifiers for both tasks. In this shared task, 21 teams from six countries (India, Pakistan, China, Malaysia, United Arab Emirates, and Taiwan) registered for participation, 10 teams submitted runs for Subtask A (Abusive Language Detection), 9 teams submitted runs for Subtask B (Threatening Language Detection), and 7 teams submitted technical reports. The best performing system achieved an F1-score of 0.880 for Subtask A and 0.545 for Subtask B. For both subtasks, an mBERT-based transformer model showed the best performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural language processing</kwd>
        <kwd>text classification</kwd>
        <kwd>Twitter tweets</kwd>
        <kwd>Urdu language</kwd>
        <kwd>shared task</kwd>
        <kwd>abusive</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In cyberspace, abusive and threatening content is a glaring problem that has been present since
the beginning of human interaction on the internet and will continue to persist in the future. Social
media platforms today are venues of free expression for all communities, and community
backlashes can result in substantial negative externalities. Thus, with the growth of social media
platforms and their audiences, regulating threatening and abusive content becomes a concern
for the welfare of all stakeholders. Though leading platforms such as Twitter and Facebook
have set up community standards for the prevention of cybercrimes, early detection of such
content is vital for the safety of cyberspace.</p>
      <p>Detection of abusive and threatening text is a complex problem, as platforms find it
challenging to maintain a balance between limiting abuse and giving users ample freedom
to express themselves. Failing to strike this balance can result in users losing trust in the platform
as well as disengaging from its content. Platforms also find it challenging to detect such texts
in multiple languages, especially low-resource and code-mixed languages. Manual
filtering and review of this content is logistically daunting and resource-intensive. It
can also delay the necessary and timely action needed in cases of active threats and
abuse. Hence, Natural Language Processing (NLP) researchers have been working on the early
detection of threats and abuse by providing various automated solutions based on machine
learning and, in particular, deep learning.</p>
      <p>
        Several previous studies have attempted to deal with the problem of abusive language [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]
and threat detection [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These problems have been addressed through supervised machine
learning [5, 6, 7] and deep learning [8, 9, 10] approaches, framing them as binary, multi-label,
or multi-class classification problems. However, these attempts are largely limited to European
languages, Arabic, and a few South Asian languages such as Hindi, Bengali, and Indonesian.
      </p>
      <p>Here we present a new shared task for abusive and threatening language detection in tweets
written in Urdu. The task is aimed at drawing the attention and effort of the research community
to developing more efficient methods and approaches for this widely spoken language and
at highlighting difficulties specific to the writing system and use of Urdu. The paper
describes the abusive and threatening language tracks 1 organized by the authors within the Hate
Speech and Offensive Content Identification (HASOC) evaluation tracks of the 13th meeting of
the Forum for Information Retrieval Evaluation (FIRE) 2021 2 and co-hosted by the Open Data Science
(ODS) Summer of Code initiative 2021 3. The task comprises two sub-tasks:
1. Sub-task A: Abusive language detection 4. The task offered a dataset of Twitter posts
("tweets") in the Urdu language, split into a training part with annotations available to
participants and a testing part provided without annotations. The dataset annotation
procedure followed Twitter's definition of abusive language 5 to identify posts that are
abusive towards a community, group, or individual as those meant to harass,
intimidate, or silence someone else's voice. The tweets were annotated in a binary manner,
i.e., abusive or non-abusive. The participants were asked to determine the correct labels
for the testing part and submit their annotations. The solutions were evaluated using F1
and ROC-AUC metrics.
2. Sub-task B: Threatening language detection 6. Similarly, the task offered a dataset
1https://www.urduthreat2021.cicling.org
2http://fire.irsi.res.in/fire/2021/hasoc
3https://ods.ai/competitions
4https://ods.ai/competitions/urdu-hack-soc2021
5https://help.twitter.com/en/rules-and-policies/abusive-behavior
6https://ods.ai/competitions/urdu-hack-soc2021-threat
of tweets in Urdu annotated as threatening or non-threatening, split into training
and testing parts, with the annotations of the testing part hidden from the participants.
The annotation procedure for the Sub-task B dataset followed Twitter's definition of
threatening tweets 7 as those directed against an individual or group and meant to threaten
violent acts, to kill or inflict serious physical harm, to intimidate, or to use violent
language. The task and the evaluation procedure were identical to Sub-task A.
With these shared tasks, our contributions are:
• spreading awareness and motivating the community to propose more efficient methods
for automated detection of abusive and threatening messages in social media in Urdu, as
well as providing means for standardized comparison, as emphasized in Section 2;
• collection and annotation of the largest datasets so far for abuse and threat detection in
the Urdu language, described in Section 4, in particular, 3,500 tweets annotated as abusive
or not and 9,950 tweets annotated as threatening or not;
• a train and test split that allows for a fair comparison of results (see Section 5 for details
and grounds) not only among the current participants but also for future research;
• provision of highly competitive baseline classifiers in Section 7;
• an overview and comparison of the submitted solutions for abusive language and threat
detection in Urdu in Sections 8 and 9.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Importance of Identifying Abuse and Threat in Urdu</title>
      <p>Urdu is one of the most widely spoken languages in South Asia. It is the national language of Pakistan
and has its roots in Persian and Arabic, bearing additional structural similarities with many
languages from other language families, e.g., Hindi [11, 12]. Urdu is spoken by more than 170
million 8 people worldwide, and this number is growing. Yet it lacks solutions and
resources for the most essential natural language processing problems.</p>
      <p>Urdu is mostly written using the Nastalíq script. However, certain populations also use the
Devanagari script, which is normally used for writing Hindi. Hence, Urdu texts exhibit
the phenomenon of digraphia, that is, the use of more than one writing system for the same
language. Additionally, Urdu is linguistically quite complex [13], as its morphology and
syntax combine influences from Turkish, Arabic, Persian, Sanskrit, and English. Hence,
contributions to Urdu can also benefit work on these related languages.</p>
      <p>The populations of Urdu-speaking countries have substantial access to social media, and
millions of speakers are exposed to unregulated or poorly regulated hate, abuse, and threats.
Various extremist and terrorist groups have developed communities on social media platforms
that spread abuse, threats, and terror [14]. As they post in local languages, in particular in
Urdu, much of this content is left unchecked until reviewed and reported manually. Pakistan
suffered decades of terrorism and had to resort to banning social media on several occasions
to tackle it [15]. Hence, the development of Urdu resources for threat and abuse
detection is an urgent requirement for the safety of millions.</p>
      <p>7https://help.twitter.com/en/rules-and-policies/violent-threats-glorification
8https://www.ethnologue.com/language/urd</p>
    </sec>
    <sec id="sec-3">
      <title>3. Literature Review</title>
      <p>
        Offensive content encompasses a variety of phenomena including aggression [16, 17],
sexism [18], hate speech [19, 5], threat detection [4], toxic comment detection [20], and abusive
language detection [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], among many others. Previous research [
        <xref ref-type="bibr" rid="ref5">21</xref>
        ] has attempted to distinguish
various types of abuse, such as implicit vs. explicit abuse or identity- vs. person-directed abuse,
to identify more nuanced expressions of abuse.
      </p>
      <p>
        Multiple annotated datasets are available for a variety of offensive content phenomena, sourced
from numerous social media platforms and portals. The Yahoo Finance corpus [19] comprises English-language
texts from Yahoo's finance portal annotated into two classes: clean and
hate speech. The study [5] collected a dataset of Twitter posts in English and annotated them
into three classes: sexism, racism, or neither. Similarly, the work [
        <xref ref-type="bibr" rid="ref6">22</xref>
        ] also annotated tweets in
English into three classes, though different from those of [5]: hate speech, offensive language, and
neither. In contrast, the study [
        <xref ref-type="bibr" rid="ref7">23</xref>
        ] distinguished four offensive classes in their collection of
Twitter posts in English: hateful, spam, abusive, and neutral.
      </p>
      <p>
        YouTube has been another source of data collection for abusive language in English [
        <xref ref-type="bibr" rid="ref8">20, 24</xref>
        ]
as well as in other languages, in particular Arabic [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Notably, the study by Ashraf et
al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is based on the YouTube comments and replies collection introduced by Hammer et al. [
        <xref ref-type="bibr" rid="ref8">24</xref>
        ],
with additional annotation of a subset indicating whether a threat is directed towards a group or an
individual. Another study [
        <xref ref-type="bibr" rid="ref9">25</xref>
        ] collects a dataset of 2,304 YouTube comments with 6,139 replies
in English and annotates it in two ways: a binary annotation for abusive language, as well as a
three-class annotation for topics: politics, religion, and other.
      </p>
      <p>
        Attempts have also been made to create threat and abuse detection models for the Bengali language [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Posts and comments were collected from different Facebook pages. Threatening and
abusive language was labeled "YES" in the dataset, and the rest of the data, which is not
abusive, was labeled "NO". For a more detailed analysis of the available datasets, we recommend
these studies [
        <xref ref-type="bibr" rid="ref1 ref10 ref11">26, 1, 27</xref>
        ].
      </p>
      <p>
        Apart from papers proposing a single solution, a number of shared tasks have been
organized to incentivize the creation of multiple robust systems for detecting offensive phenomena
in texts. Some of the popular shared tasks are OffensEval [
        <xref ref-type="bibr" rid="ref12 ref13">28, 29</xref>
        ] with available datasets in
Greek, English, Danish, Arabic, and Turkish; GermEval 2018 [
        <xref ref-type="bibr" rid="ref14">30</xref>
        ] for texts in German; the
TRAC shared task [
        <xref ref-type="bibr" rid="ref15">31</xref>
        ] for Hindi, English, and Bengali; SemEval-2019 [
        <xref ref-type="bibr" rid="ref16">32</xref>
        ] for hate speech
detection in English and Spanish; and HASOC 2019 and 2020 [
        <xref ref-type="bibr" rid="ref17 ref18">33, 34</xref>
        ] for German, English, Tamil,
Malayalam, and Hindi.
      </p>
      <p>
        Among the common approaches to offensive language detection, we observe feature-based
approaches with traditional ML classifiers. The works [
        <xref ref-type="bibr" rid="ref12 ref13 ref16 ref17 ref18">6, 7, 5, 35, 33, 34, 32, 29, 28</xref>
        ] use various
combinations of features such as N-grams, Bag-of-Words (BoW), Part-of-Speech (POS) tags,
Term Frequency-Inverse Document Frequency (TF-IDF) representations, word2vec representations,
sentiments, and dependency parsing features provided as input to traditional ML models
such as Support Vector Machines (SVM), Logistic Regression (LR), Random Forest (RF), Decision
Tree (DT), Naive Bayes (NB), etc.
      </p>
      <p>
        Among the more efficient approaches for the task, we see boosting-based ensembles as well
as neural networks, in particular deep NNs such as transformers. For example, Ashraf et
al. [
        <xref ref-type="bibr" rid="ref4 ref9">25, 4</xref>
        ] used n-grams and pre-trained word embeddings in combination with traditional ML
(LR, RF, SVM, NB, DT, a voting classifier, and the boosting-based ensemble AdaBoost) as well as
neural-network-based methods (MLP, 1D-CNN, LSTM, and Bi-LSTM) for abusive language
detection and for the prediction of individual- vs. group-targeted threats, respectively.
While the BiLSTM approach achieved an F1 score of 85%, the use of conversational context
along with linguistic features achieved an even higher F1 score of 91.96% using an ensemble
AdaBoost classifier.
      </p>
      <p>BiLSTM and Convolutional Neural Networks (CNN) were used to tackle abusive language and
hate speech detection in multiple other works. Studies employing graph embeddings to learn
graph representations from online texts [9], paragraph2vec [19], Recurrent Neural Networks
(RNN) with attention [10], and RNNs with Gated Recurrent Units (GRUs) [8] have also shown
encouraging results. Pre-trained transformer models such as RoBERTa, BERT, ALBERT,
and GPT-2 have achieved high accuracy in hate speech detection [36, 37, 38]. A
recent study [39] applied XLM, BERT, and BETO models to achieve promising results on similar
hate speech detection tasks.</p>
      <p>While each offensive subcategory uses different definitions for annotation, similar methods
can be applied across offensive content detection tasks. All these techniques can be used to
test the best combinations for the detection of abuse and threat in the Urdu language [40, 41], and
our study opens vast avenues for researchers to achieve this goal.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Datasets Collection and Annotation</title>
      <sec id="sec-4-1">
        <title>4.1. Threatening and Abusive Datasets Collection</title>
        <p>First, we created a dictionary of the most frequently used abusive and threatening words in Urdu.
We used those words as keywords on Twitter to mine tweets containing further abusive and
threatening words in Urdu, which we manually added to our dictionary. The dictionary includes
words that appeared even once to threaten or abuse someone. This dictionary is publicly available
for research purposes9. Thus, we collected a sufficient number of abusive and threatening seed
words, which were further used to crawl tweets through the Twitter Developer Application
Programming Interface (API)10 using the Tweepy library. In this way, we gathered enough words and
phrases that are used to threaten or abuse individuals. We collected tweets containing any of these
keywords from our dictionary for a 20-month period from January 1st, 2018 to August 31st,
2019. This period covers the general elections held in Pakistan in July 2018. Typically, during
an election season, people tend to be more expressive when supporting as well as opposing
political parties. In total, we crawled 55,600 tweets containing the seed words.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Threatening and Abusive Datasets Pre-processing</title>
        <p>Since Urdu shares many words with Persian, Turkish, and Arabic, when we crawled
tweets using our initially collected words, the Twitter API also returned many non-Urdu
tweets. Since this research was primarily focused on the Urdu language, we discarded all the
9https://github.com/MaazAmjad/Threatening 
10https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets
non-Urdu tweets manually. Thus, two different datasets were created: (i) the abusive dataset 11,
containing 3,500 tweets, of which 1,750 are abusive and 1,750 are non-abusive; (ii) the
threatening dataset12, containing 9,950 tweets, of which 1,782 are threatening and the rest
are non-threatening.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Threatening and Abusive Datasets Annotation</title>
        <p>We defined guidelines to annotate abusive and threatening tweets.</p>
        <p>Annotators were recruited to label the dataset. All of them satisfied the following
criteria: (i) country of origin: Pakistan; (ii) native speakers of Urdu; (iii) familiar with
Twitter; (iv) aged 20–35 years; (v) not attached to any political party or organization; (vi) with
prior experience in annotating data; (vii) educational level of a master's degree or above. We
computed Inter-Annotator Agreement (IAA) using Cohen's Kappa coefficient [39], as it is a
statistical measure of the reliability of agreement between two annotators. We provided instructions with
task definitions (reproduced below) and examples. A hierarchical annotation schema
was used, and the main dataset was divided into two different datasets to distinguish
whether the language is threatening or non-threatening, and abusive or non-abusive. We followed
Twitter's definitions of abusive 13 and threatening14 comments towards an individual or
group meant to harass, intimidate, or silence someone else's voice.</p>
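        <p>As an illustration, Cohen's Kappa for two annotators can be computed as follows (a minimal pure-Python sketch; the label values and toy annotations are illustrative, not drawn from the actual dataset):</p>
        <preformat>
```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa between two annotators labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators labeling ten tweets as abusive (1) or not (0).
ann_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
ann_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(ann_a, ann_b)
```
        </preformat>
        <p>Kappa corrects the raw agreement rate for agreement expected by chance, which is why it is preferred over simple percent agreement for reporting IAA.</p>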
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Training and Testing Dataset Split</title>
      <p>Due to the requirements imposed by the competition conditions and for the purpose of fair
evaluation of the participants' submissions, a slightly larger portion of the datasets was withheld
as the corresponding testing parts than would be done under 'normal' data science operations.
Namely, 40% of the data was withheld for the Threatening Language task, and 32% for the
Abusive Language task. This was done, first of all, to ensure that the testing set is non-trivial
and represents well the variety of possible lexical expressions for both classes. Second, during
the active period of the competition, the participants could observe the scores only on the
"public" part of the testing set, whereas the scores on the "private" part of the testing set were
made public only after the end of the competition. The partitioning of the test set into public
and private parts is necessary to prevent pure guessing or tampering with predictions. We ensured
that each partition of the testing data was large enough to compute a score that is sufficiently
reflective of the actual performance of a system. The details are presented in Table 1.</p>
      <p>To be clear, the participants were handed the entire test set without true labels. After
a submission, the scores were shown only for the public partition of the test set. As can be
observed from Tables 5 and 6, there was still some amount of shake-up among the scores and
corresponding ranks on the public and private partitions.</p>
      <p>Now that both the training and the testing sets along with their true labels are available
to the research community, a different approach to the train/test split is possible. However,
11https://github.com/MaazAmjad/Urdu-abusive-detection-FIRE2021
12https://github.com/MaazAmjad/Threatening 
13https://help.twitter.com/en/rules-and-policies/abusive-behavior
14https://help.twitter.com/en/rules-and-policies/glorification-of-violence
for a fair comparison with the competition submissions and results provided in this paper, we
suggest following the original split.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation Metrics</title>
      <p>The submitted systems were evaluated by comparing the labels predicted by the participants'
classifiers to the hidden ground truth annotations. To quantify classification performance,
we computed the commonly used evaluation metrics: the F1 score and the ROC-AUC score. The F1
score serves as a better metric for unbalanced datasets than Accuracy and therefore
suits our setting. The ROC-AUC score estimates the overall quality of a model
across prediction confidence thresholds and serves as a more holistic evaluator.</p>
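      <p>For reference, both metrics can be computed as in the following self-contained sketch (the actual evaluation used standard library implementations; the toy labels and scores below are illustrative):</p>
      <preformat>
```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive instance is scored above a
    random negative one (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```
      </preformat>
      <p>Unlike F1, ROC-AUC is computed over raw confidence scores rather than thresholded labels, which is what makes it robust to the choice of decision threshold.</p>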
    </sec>
    <sec id="sec-7">
      <title>7. Baselines</title>
      <p>For the competition, the organizers prepared three baseline systems: two of them reflected
different aspects of the traditional ML approach involving Bag-of-Words features and were meant
to be lower-bound baselines, while the third system was based on the recent deep
learning approach involving fine-tuning of the BERT model [36].</p>
      <sec id="sec-7-1">
        <title>7.1. LogReg with Lexical Features</title>
        <p>All data pre-processing steps and most of the modeling details are identical for both subtasks,
abusive and threat detection, unless explicitly indicated otherwise.</p>
        <p>First, all possible word unigrams and bigrams were extracted from the training dataset using
the popular NLTK15 [42] software package for NLP, v. 3.4.5, counting n-gram
occurrences in the dataset. Then, an occurrence threshold of 3 was applied to unigrams,
corresponding to the 75th percentile of all encountered unigrams. In other words, we took
the top 25% most frequently occurring unigrams as features. Similarly, the 95th-percentile
occurrence threshold of 4 was applied to bigrams. We also added 2 additional features to account
for Out-Of-Vocabulary (OOV) unigrams and bigrams, respectively. Eventually, the feature
set comprised the top-occurring unigram features, top-occurring bigram features, and
the two OOV features. The statistics for each feature type by subtask dataset and the total
number of features are provided in Table 2.</p>
        <p>Further, each tweet instance was represented as a straightforward count of feature occurrences
in the tweet, with all OOV n-grams counting towards the corresponding special OOV features. No
normalization was done, as all tweets have approximately the same length.</p>
        <p>Logistic regression was selected as the classifier algorithm for our traditional ML baseline
solutions. In the first system, we used the implementation from scikit-learn16 [43] v. 0.22.1,
a popular software package that includes a number of ML algorithms. The max_iter
parameter was set to 1000 to make sure the training converges.</p>
        <p>For the Threat Subtask dataset, where the positive and negative classes are imbalanced, we
also set the class_weight parameter to balanced, which enabled automatic instance reweighting.</p>
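        <p>This set-up can be sketched with scikit-learn as follows (the toy English corpus stands in for the Urdu tweets, and min_df stands in for the percentile-based n-gram thresholds described above):</p>
        <preformat>
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the real system used annotated Urdu tweets.
train_texts = [
    "you are awful",
    "have a nice day",
    "awful terrible person",
    "what a nice view",
]
train_labels = [1, 0, 1, 0]  # 1 = abusive/threatening, 0 = neutral

model = make_pipeline(
    # Unigram and bigram occurrence counts, as in the baseline.
    CountVectorizer(ngram_range=(1, 2), min_df=1),
    # max_iter=1000 ensures convergence; class_weight="balanced"
    # reweights instances for the imbalanced Threat Subtask.
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(train_texts, train_labels)
preds = model.predict(["nice day", "awful person"])
```
        </preformat>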
        <p>The code is available in the organizers' GitHub repository17.</p>
        <p>The balanced baseline secured 8th place on the Threat Subtask private leaderboard with
an F1-score of 0.49186 and a ROC-AUC of 0.76991. The unbalanced version applied to the Abusive
Subtask came 12th on the private leaderboard, scoring 0.78684 F1-score and 0.88295 ROC-AUC.
7.1.1. A version of LogReg with lexical features and TF-IDF count
We also submitted a variation of the LogReg-based classifier with a few technical as well as
conceptual modifications. Instead of a simple n-gram occurrence count, the TF-IDF vectorization
approach was used for text representation. For this, the TfidfVectorizer class from the
scikit-learn package was used. Note that the features were unigrams only. The
number of features was set as in the previous approach.</p>
        <p>Another purely technical difference was that the LogReg classifier was implemented as a
"single-node neural network", which is mathematically equivalent to logistic
regression.</p>
        <p>The implementation was done using the PyTorch framework18 [44]. This training set-up
converged much sooner, in a mere 30 epochs (or, in the terminology of traditional ML, iterations)
for both datasets. The optimal number of epochs was determined using a validation dataset
comprising 10% of the corresponding training data.</p>
        <p>For the threatening language detection dataset, similarly to the previous approach, dataset
balancing was performed via the torch.nn.BCEWithLogitsLoss function.</p>
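        <p>A minimal PyTorch sketch of such a "single-node neural network" is given below; the feature dimension, the pos_weight value, and the random stand-in batch are illustrative assumptions, not the actual TF-IDF features or class ratio:</p>
        <preformat>
```python
import torch

# One linear layer trained with a sigmoid cross-entropy loss is
# mathematically equivalent to logistic regression.
n_features = 500
model = torch.nn.Linear(n_features, 1)
# pos_weight upweights the rare positive (threatening) class.
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([4.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Random stand-in batch in place of the TF-IDF vectors.
x = torch.randn(32, n_features)
y = (torch.rand(32, 1) > 0.8).float()

for epoch in range(30):  # the baseline converged in about 30 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```
        </preformat>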
        <p>These differences in approaches were reflected in the final score difference. Interestingly, for
the abusive language detection task, while this variant showed slightly higher scores (0.77008
F1-score for this version vs. 0.72928 F1 for the above version, and 0.86674 vs. 0.85286 ROC-AUC)
16https://scikit-learn.org
17https://github.com/UrduFake/urdutask2021/
18https://pytorch.org
and, hence, a higher rank (11th vs. 13th) on the public leaderboard, it actually showed the same scores on
the private leaderboard, 0.78684 F1 and 0.88295 ROC-AUC, to the extent of the decimal precision
displayed, sharing the 12th and 13th ranks.</p>
        <p>More notably, in the threatening language detection task, the scores returned
by the two versions not only varied considerably, but the score difference between the systems actually
flipped between the public and private leaderboards. On the public leaderboard, the
scikit-learn version gained higher scores: 0.46471 F1 vs. 0.45161 F1 for the PyTorch version, and
0.79502 ROC-AUC vs. 0.78899 ROC-AUC for the PyTorch version. Yet on the private leaderboard,
the scikit-learn version gained less: 0.49186 F1 vs. 0.51404 F1 for the PyTorch version, and
0.76991 ROC-AUC vs. 0.78212 ROC-AUC, respectively.</p>
        <p>This suggests that, regardless of the ML package, for the abusive task the
LogReg classifier with lexical bag-of-words features is a sufficiently powerful tool that
can properly converge on the provided dataset and learn a coherent pattern.</p>
        <p>However, threat detection is a more complex task, not only due to the label imbalance
but also due to the intrinsic semantic complexity of the phrases, with the latter having the much larger
effect. Therefore, simple classifiers and purely lexical features are too weak to capture these higher
levels of semantic complexity and should not be relied on for this subtask.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. BERT-based baseline</title>
        <p>The dataset sizes of 2,400 and 6,000 items, along with training example lengths below 200 characters,
made the tasks approachable with transfer learning-based methods using foundational deep
BERT-like models.</p>
        <p>The proposed deep learning-based solutions for both subtasks, Abusive and Threat detection,
used the pretrained multilingual uncased BERT19 [36] from the Hugging Face transformers library [45]
as the base model.</p>
        <p>The Hugging Face built-in BertForSequenceClassification 20 class with 2 output units was
selected as the classification head, where the pooled output of the [CLS] token is passed through a
dropout layer, followed by a linear layer whose output units lead into a cross-entropy loss
function.</p>
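        <p>This head can be mirrored in plain PyTorch as follows (a sketch assuming the BERT-base hidden size of 768; the baseline itself used the library class directly rather than this re-implementation):</p>
        <preformat>
```python
import torch

class ClassificationHead(torch.nn.Module):
    """Dropout over the pooled [CLS] representation, then a linear
    layer with 2 output units, as in BertForSequenceClassification."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        return self.classifier(self.dropout(pooled_output))

head = ClassificationHead()
pooled = torch.randn(8, 768)  # stand-in for BERT's pooled output
logits = head(pooled)
labels = torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(logits, labels)
```
        </preformat>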
        <p>For the Abusive Subtask, we split the provided training dataset into TRAIN/DEV sets at
the standard 80:20 ratio. Using the TRAIN set, the model was further fine-tuned for the target
classification task for 3 epochs with a minibatch size of 32 and 60 minibatches per epoch. The
total number of minibatches, and correspondingly of optimization steps, was 180. The fine-tuning
process used the DEV set to evaluate the model performance every 4 minibatches in order to
load the model with the best F1 score from checkpoints at the end of the fine-tuning.</p>
        <p>For the Threat Subtask, we split the provided training dataset into TRAIN/DEV sets via an
85:15 ratio. We deviated from the standard 80:20 split to let the model train on slightly more
data, and consequently more negative examples, at the cost of a less accurate F1 estimate. Using the
TRAIN set, the model was fine-tuned for the target task for 5 epochs with a minibatch size
of 32 and 160 minibatches per epoch (the total number of minibatches / optimization steps was
800). In our set-up, the model for the Threat Subtask converged more slowly than the one for the
Abusive Subtask, therefore we trained the network for 5 epochs instead of 3. The cross-entropy
loss function additionally used inverse class sizes as weights to account for the imbalance. The
fine-tuning used the DEV set to evaluate the model every 8 minibatches (not 4, due to longer
training) in order to load the model with the best F1 score from checkpoints at the end of
fine-tuning.
19https://huggingface.co/bert-base-multilingual-uncased
20https://github.com/huggingface/transformers/blob/27d4639779d2d316a7c5f18d22f22d2565b84e5e/src/transformers/models/bert/modeling_bert.py#L1486</p>
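        <p>The class-weighting scheme described above can be sketched in PyTorch as follows; the class counts below are illustrative stand-ins, not the actual label distribution of the Threat dataset.</p>

```python
import torch
import torch.nn as nn

# Inverse class sizes as cross-entropy weights, as described above.
# The counts are hypothetical: a majority negative class and a small
# positive (threatening) class.
class_counts = torch.tensor([4500.0, 600.0])
weights = 1.0 / class_counts             # rarer class gets a larger weight
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.tensor([[2.0, -1.0], [0.5, 0.5]])
targets = torch.tensor([0, 1])
loss = loss_fn(logits, targets)
print(loss.item() > 0)  # True
```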
        <p>The first baseline model, for the Abusive Subtask, came 3rd on the private leaderboard with
an F1 score of 0.86221 and ROC-AUC of 0.92194. The second baseline model, for the Threatening
Subtask, came 9th on the private leaderboard, scoring an F1 of 0.48567 and ROC-AUC of 0.70047.
Considering the original BERT’s [36] scores on GLUE and other benchmarks, as well as further
progress in language model pretraining [46], the first model’s relatively high F1 score was
expected. The Abusive Subtask was a sentence classification task with few specific constraints
(such as overly large sequence length or similar obstacles), where deep bidirectional
architecture-based and other large pretrained language models generally outperform traditional machine
learning approaches in a number of domains. At the same time, better handling of the class
imbalance in the Threat Subtask could help the second baseline model achieve better convergence
and a higher F1 score. We speculate that domain-specific improvements in preprocessing,
additional intermediate-task training, and complementary handcrafted features used alongside
the sentence embeddings could further boost the scores of both models. In other words, subject
matter knowledge of the language and the relevant threat landscape is indispensable for real-world
threat and abuse detection in Urdu. Finally, we see incorporating continued training
and more domain-specific research in adversarial training, out-of-distribution detection, and
outlier detection as viable directions to make a model robust to adversarial examples and
distribution shifts when it is deployed.</p>
        <p>The code for this baseline is available in the organizers’ GitHub repository21.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Overview of Submitted Solutions</title>
      <p>This section gives a brief overview of the systems submitted to this competition. 21 teams
registered for participation, 10 teams submitted their runs for Subtask A —Abusive Language
Detection, and 9 teams submitted their runs for Subtask B —Threatening Language Detection.
Registered participants came from different countries: India, Pakistan, China, Malaysia, the United
Arab Emirates, and Taiwan. This wide range of regions where the interested participants were
located confirms the importance of the task. The team members came from various types of
organizations: universities, research centers, and industry.</p>
      <sec id="sec-8-1">
        <title>8.1. Approaches to Text Representation</title>
        <p>Participants used a variety of text representation techniques for tweet representation. Team
SAKSHI SAKSHI represented tweets using contextual embedding representations that were
obtained from training on an Urdu news corpus. Individual participant Muhammad Humayoun
used a traditional bag-of-words representation for Subtask A and word2vec for word n-grams,
n = 1, 2, for Subtask B. The hate-alert team used pre-trained Urdu LASER embeddings and
multilingual BERT22 pre-trained embeddings generated from an Arabic dataset. Team Alt-Ed used
a TF-IDF text representation. Participant Abhinav Kumar used character-level 1- to 6-gram TF-IDF
features for tweet representation. A summary of approaches is presented in Tables 3 and 4.
21https://github.com/UrduFake/urdutask2021/blob/main/bert</p>
      </sec>
      <sec id="sec-8-2">
        <title>8.2. Classification Methods</title>
        <p>To implement their classifiers, some participating teams used traditional, i.e., non-neural-network-based,
machine learning algorithms, while other teams’ submissions were based on
various neural network architectures.</p>
        <p>For Subtask B, team SAKSHI SAKSHI fine-tuned a pre-trained RoBERTa model from the
popular HuggingFace library23 on the Urdu news corpus in an unsupervised manner. The
same team used three transformer-based techniques for Subtask A: (i) Urduhack, (ii) BERT,
and (iii) XLM-RoBERTa. Team hate-alert used the Hate-speech-CNERG/dehatebert-mono-arabic24
model, which had previously been fine-tuned on an Arabic hate speech dataset. Another participant,
Muhammad Humayoun, used an SVM with a sigmoid kernel for Subtask A and an SVM with a polynomial
kernel of degree 3 for Subtask B. Participant Abhinav Kumar used an ensemble of ML models,
SVM + LogReg + RF, for both subtasks. Similarly to one of the organizers’ baseline systems, team
Alt-Ed used Logistic Regression for Subtask A, which turned out to be the team’s best classifier for
the task.</p>
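        <p>For illustration, the reported SVM configurations can be instantiated in scikit-learn as follows; all hyperparameters beyond the kernel choices, and the toy features, are assumptions.</p>

```python
from sklearn.svm import SVC

# The two SVM variants described above: a sigmoid kernel for Subtask A
# and a degree-3 polynomial kernel for Subtask B.
svm_subtask_a = SVC(kernel="sigmoid")
svm_subtask_b = SVC(kernel="poly", degree=3)

# Toy two-dimensional stand-in features; the real systems used text features.
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]
svm_subtask_a.fit(X, y)
svm_subtask_b.fit(X, y)
print(svm_subtask_b.predict([[1.0, 1.0]]).shape)  # (1,)
```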
        <p>A summary of approaches is presented in Tables 3 and 4.
22https://huggingface.co/bert-base-multilingual-cased
23https://huggingface.co/transformers/model_doc/roberta.html
24https://huggingface.co/Hate-speech-CNERG/dehatebert-mono-arabic</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>9. Results and Discussion</title>
      <p>Table 5 presents the results and ranking for the Abusive Language detection subtask. Table 6 presents
the results and ranking for the Threatening Language detection subtask. The systems are ranked by
their F1 score on the private leaderboard.</p>
      <p>We observe that, except for one participant system, all the other participating teams’ systems
outperformed the proposed LogReg baselines in terms of F1 score for Subtask A. However, only
two systems, hate-alert’s and SAKSHI SAKSHI’s, outperformed the proposed BERT-based
baseline. For Subtask B, in contrast, quite a few systems scored below the described LogReg
baseline solutions. Interestingly, even the organizers’ BERT-based solution did not achieve
higher scores than the LogReg baselines, despite the training dataset for Subtask
B being larger than the one for Subtask A. Eventually, only the top 3 systems, hate-alert, SATLab,
and participant Somnath Banerjee, achieved higher F1 scores than the organizers’ Keras-based
implementation of Logistic Regression described in Section 7.1.1. Interestingly, although the
two LogReg-based baselines score closely on Subtask A, their scores differ substantially for
Subtask B. This might be due to the different values of the number-of-iterations parameter, which
permitted the LogReg-v2 system to converge on the larger training set in Subtask B, while in
Subtask A LogReg convergence arrives sooner, partly due to the smaller training set size.</p>
      <p>Among all the submitted runs for both subtasks, the hate-alert team’s solution achieved
the best F1 score and ranked highest. Their solutions are based on the mBERT
dehatebert-mono-arabic25 model, which is trained on an Arabic news corpus. It is plausible that the combination of a
powerful deep learning model and fine-tuning on a relevant, although somewhat unexpected,
dataset was key to the high performance. These results may open a way to further research
on the effect of direct knowledge transfer among languages that use the same script, in
particular, Nastalíq.</p>
      <p>Overall for Subtask A, 75% of the participating systems obtained an F1 score higher than 0.814,
as can be observed from the 25th percentile in Table 7. This is a good indicator that the task
of abuse detection for tweets in Urdu can be achieved by automated means. In Table 5 we also
observe that most of the top-performing systems achieve both a better F1 score and a better ROC
AUC for Subtask A.</p>
      <p>In contrast, the task of threat detection for tweets in Urdu turned out to be extremely
challenging, as more than 90% of the systems could not pass the 0.8 F1 score bar, as may be
observed in Table 8. Nevertheless, the top-performing system SSNCSE_NLP, achieving an F1 score
of 0.805 (Table 6), provides a promising perspective that this task is also solvable with the current
methods and means of NLP available for the Urdu language.
25https://huggingface.co/Hate-speech-CNERG/dehatebert-mono-arabic</p>
      <p>However, at this moment it is still too soon to judge whether any of these approaches is
ready to be applied “in the wild”. While the results of over 0.88 F1 score shown by the winning
hate-alert system on Subtask A are impressively high, the modest size of the provided training
and testing datasets cannot guarantee the same performance on arbitrary text input. To
ensure the robustness of the presented approaches, more multifaceted research at a larger scale
is needed. We see that one of the paths is a community-driven effort towards increasing the
available resources and datasets in the Urdu language.</p>
    </sec>
    <sec id="sec-9b">
      <title>10. Conclusion</title>
      <p>This paper presents a shared task in identifying threatening and abusive language in Urdu,
namely, the CICLing 2021 track @ FIRE 2021 co-hosted with ODS SoC 2021. For this track,
the organizers collected two original datasets of text tweets in Urdu, one annotated for abuse
(Subtask A) and the other for threatening content (Subtask B). We also provided a training and
testing split for both datasets, with the ground truth labels hidden from the participants for the
testing parts of the datasets. The solutions were submitted in the form of proposed annotations
for the testing sets along with confidence scores produced by the participants’ systems. The
submitted annotations were compared with the ground truth labels to compute the F1 score,
while the submitted confidence scores served for the ROC AUC metric computation. The solutions
were ranked by the achieved F1 scores.</p>
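      <p>The evaluation protocol described above can be sketched with scikit-learn’s metrics; the labels and confidence scores below are toy values for illustration.</p>

```python
from sklearn.metrics import f1_score, roc_auc_score

# F1 is computed from the submitted binary annotations; ROC AUC is
# computed from the submitted confidence scores, as described above.
ground_truth = [1, 0, 1, 1, 0]                 # hidden test labels (toy values)
submitted_labels = [1, 0, 0, 1, 0]             # a participant's annotations
submitted_scores = [0.9, 0.2, 0.4, 0.8, 0.1]   # the same system's confidences

f1 = f1_score(ground_truth, submitted_labels)
auc = roc_auc_score(ground_truth, submitted_scores)
print(round(f1, 3), round(auc, 3))  # 0.8 1.0
```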
      <p>In this shared task, twenty-one teams from six different countries registered for the competition,
and seven teams submitted their solutions. Participants used different techniques, ranging
from traditional feature crafting and application of traditional ML algorithms, to word
representation through pre-trained embeddings, to contextual representation and end-to-end
transformer-based methods. An uncommon solution included an ensemble of traditional ML
classifiers, SVM+LogReg+RF, whereas the particularly successful solutions used specialized
BERT-based systems such as multilingual BERT and XLM-RoBERTa.</p>
      <p>In the abuse detection subtask, team hate-alert outperformed all other systems with an m-BERT
transformer model, achieving an F1 score of 0.880. This and the rest of the top 3 results in Subtask
A indicate that specialized transformer-based models tend to perform better compared to
feature-based traditional ML models.</p>
      <p>In the threat detection subtask, the hate-alert team was also the leader during the official part
of the competition, with a 0.545 F1 score achieved by the same m-BERT system. However, the
results submitted by team SSNCSE_NLP after the official part of the competition was closed
showed a much higher F1 score of 0.805. We note that after the end of the official part of the
competition, the ground truth annotations for the testing sets were revealed to the public, thereby
potentially putting the late-submitting teams in a more advantageous position compared to the
official track participants. Therefore, late submissions were not assigned a rank. Additionally,
the technical details of SSNCSE_NLP’s solution should be obtained from the corresponding
team.</p>
      <p>This shared task aims to attract and encourage researchers working in different NLP domains
to address the threatening and abusive language detection problem and to help mitigate the
proliferation of offensive content on the web. Moreover, this track offers a unique opportunity
to fully explore the sufficiency of the textual content modality and the effectiveness of fusion methods.
And last but not least, the annotated datasets in Urdu are provided to the public to encourage
further research and improvement of automatic detection of threatening and abusive texts in
Urdu.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>This competition was organized with the support from the Mexican Government through the
grant A1-S- 47854 of the CONACYT, Mexico and grants 20211784, 20211884, and 20211178 of
the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico.</p>
      <p>[5] Z. Waseem, D. Hovy, Hateful symbols or hateful people? predictive features for hate
speech detection on twitter, in: Proceedings of the NAACL student research workshop,
2016, pp. 88–93.
[6] Y. Chen, Y. Zhou, S. Zhu, H. Xu, Detecting offensive language in social media to protect
adolescent online safety, in: 2012 International Conference on Privacy, Security, Risk and
Trust and 2012 International Conference on Social Computing, IEEE, 2012, pp. 71–80.
[7] C. Van Hee, E. Lefever, B. Verhoeven, J. Mennes, B. Desmet, G. De Pauw, W. Daelemans,
V. Hoste, Detection and fine-grained classification of cyberbullying events, in: International
conference recent advances in natural language processing (RANLP), 2015, pp. 672–680.
[8] J. Pavlopoulos, P. Malakasiotis, I. Androutsopoulos, Deep learning for user comment
moderation, in: Proceedings of the First Workshop on Abusive Language Online, Association
for Computational Linguistics, 2017, pp. 25–35.
[9] N. Cecillon, V. Labatut, R. Dufour, G. Linares, Graph embeddings for abusive language
detection, SN Computer Science 2 (2021) 1–15.
[10] E. Wulczyn, N. Thain, L. Dixon, Ex machina: Personal attacks seen at scale, in: Proceedings
of the 26th international conference on world wide web, 2017, pp. 1391–1399.
[11] T. Ahmed, A. Hautli, Developing a basic lexical resource for Urdu using Hindi WordNet,</p>
      <p>Proceedings of CLT10, Islamabad, Pakistan (2010).
[12] K. Visweswariah, V. Chenthamarakshan, N. Kambhatla, Urdu and Hindi: Translation
and sharing of linguistic resources, in: Coling 2010: Posters, Beijing, China, 2010, pp.
1283–1291.
[13] F. Adeeba, S. Hussain, Experiences in building Urdu WordNet, in: Proceedings of the 9th
workshop on Asian language resources, 2011, pp. 31–35.
[14] L. Bertram, Terrorism, the Internet and the Social Media Advantage: Exploring how
terrorist organizations exploit aspects of the internet, social media and how these same
platforms could be used to counter-violent extremism., Journal for deradicalization (2016)
225–252.
[15] K. Hassan, Social media, media freedom and Pakistan’s war on terror, The Round Table
107 (2018) 189–202.
[16] S. T. Aroyehun, A. Gelbukh, Aggression detection in social media: Using deep neural
networks, data augmentation, and pseudo labeling, in: Proceedings of the First Workshop
on Trolling, Aggression and Cyberbullying (TRAC-2018), 2018, pp. 90–97.
[17] A. Y. A. R. B. Farhan, A. Noman, R. U. Mustafa, Human aggressiveness and reactions
towards uncertain decisions, International Journal of Advanced and Applied Sciences 6
(2019) 112–116.
[18] S. Butt, N. Ashraf, G. Sidorov, A. Gelbukh, Sexism identification using BERT and Data
Augmentation–EXIST2021, in: International Conference of the Spanish Society for Natural
Language Processing SEPLN, 2021.
[19] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate speech
detection with comment embeddings, in: Proceedings of the 24th international conference
on world wide web, 2015, pp. 29–30.
[20] A. Obadimu, E. Mead, M. N. Hussain, N. Agarwal, Identifying Toxicity within YouTube
video comment text data, in: International Conference on Social Computing,
Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation,
Hindi, English and German, in: Forum for Information Retrieval Evaluation, Association
for Computing Machinery, 2020, pp. 29–32.
[35] D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kontostathis, L. Edwards, Detection of harassment
on Web 2.0, in: Proceedings of the Content Analysis in the WEB, volume 2, 2009, pp. 1–7.
[36] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding (2019) 4171–4186.
[37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are
unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[38] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A Lite BERT for</p>
      <p>Self-supervised Learning of Language Representations (2020).
[39] N. Vashistha, A. Zubiaga, Online multilingual hate speech detection: experimenting with</p>
      <p>Hindi and English social media, Information 12 (2021) 5.
[40] M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, A. Zubiaga, A. Gelbukh, Threatening language
detection and target identification in Urdu tweets, IEEE Access 9 (2021) 128302–128313.
[41] M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, L. Chanona-Hernandez, A. Gelbukh, Automatic
abusive language detection in Urdu tweets, Acta Polytechnica Hungarica (2021) 1785–8860.
[42] S. Bird, E. Loper, NLTK: The natural language toolkit, in: Proceedings of the ACL
Interactive Poster and Demonstration Sessions, Association for Computational Linguistics,
Barcelona, Spain, 2004, pp. 214–217. URL: https://aclanthology.org/P04-3031.
[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
Learning Research 12 (2011) 2825–2830.
[44] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin,
N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani,
S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style,
high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing
Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035.
[45] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural
language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, Association for Computational Linguistics,
Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[46] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V.
Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
(2019). URL: http://arxiv.org/abs/1907.11692.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>U.</given-names>
            <surname>Naseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farasat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <article-title>Abusive language detection: a comprehensive review</article-title>
          ,
          <source>Indian Journal of Science and Technology</source>
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          , W. Magdy,
          <article-title>Abusive language detection on arabic social media</article-title>
          ,
          <source>in: Proceedings of the first workshop on abusive language online</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Seddiqui</surname>
          </string-name>
          ,
          <article-title>Threat and Abusive Language Detection on Social Media in Bengali Language</article-title>
          ,
          <source>in: 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
          doi:10.1109/ICASERT.2019.8934609.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Individual vs. group violent threats classification in online discussions</article-title>
          ,
          <source>in: Companion Proceedings of the Web Conference</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>629</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Waseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warmsley</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Weber</surname>
          </string-name>
          ,
          <article-title>Understanding Abuse: A Typology of Abusive Language Detection Subtasks</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Abusive Language Online</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warmsley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macy</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Weber</surname>
          </string-name>
          ,
          <source>Automated Hate Speech Detection and the Problem of Offensive Language, in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>11</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23]
          <string-name>
            <surname>A. M. Founta</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Djouvas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Chatzakou</surname>
            ,
            <given-names>I. Leontiadis</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blackburn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stringhini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vakali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sirivianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kourtellis</surname>
          </string-name>
          ,
          <article-title>Large scale crowdsourcing and characterization of Twitter abusive behavior</article-title>
          , in: Twelfth
          <source>International AAAI Conference on Web and Social Media</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Øvrelid</surname>
          </string-name>
          , E. Velldal,
          <article-title>Threat: A large annotated corpus for detection of violent threats</article-title>
          ,
          <source>in: 2019 International Conference on Content-Based Multimedia Indexing (CBMI)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
          doi:10.1109/CBMI.2019.8877435.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Abusive Language Detection in YouTube Comments Leveraging Replies as Conversational Context</article-title>
          , PeerJ Computer Science (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          , L. Derczynski,
          <article-title>Directions in abusive language training data, a systematic review: Garbage in, garbage out</article-title>
          ,
          <source>PloS one 15</source>
          (
          <year>2020</year>
          )
          <article-title>e0243300</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nayak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhatawdekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Sarwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zlatkova</surname>
          </string-name>
          , G. Bouchard,
          <string-name>
            <surname>I. Augenstein</surname>
          </string-name>
          ,
          <article-title>Detecting abusive language on online platforms: A critical analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2103.00153</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          , G. Karadzhov,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Derczynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitenis</surname>
          </string-name>
          , Ç. Çöltekin, SemEval-2020
          <source>Task 12: Multilingual Offensive Language Identification in Social Media</source>
          , (OffensEval),
          <source>International Committee for Computational Linguistics</source>
          (
          <year>2020</year>
          )
          <fpage>1425</fpage>
          -
          <lpage>1447</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)</article-title>
          ,
          <source>Association for Computational Linguistics</source>
          (
          <year>2019</year>
          )
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          ,
          <article-title>Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Routar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>Merging datasets for aggressive text identification</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M. R.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , et al.,
          <article-title>Multilingual detection of hate speech against immigrants and women in Twitter at SemEval-2019 task 5: Frequency analysis interpolation for hate in speech detection</article-title>
          ,
          <source>in: 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mandlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>
          ,
          <source>in: Proceedings of the 11th forum for information retrieval evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Kumar</given-names>
            <surname>M.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Overview of the HASOC Track at FIRE 2020: Hate Speech and Offensive Language Identification in Tamil, Malayalam,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>