=Paper=
{{Paper
|id=Vol-3159/T4-2
|storemode=property
|title=Abusive and Threatening Language Detection in Urdu using Supervised Machine Learning and Feature Combinations
|pdfUrl=https://ceur-ws.org/Vol-3159/T4-2.pdf
|volume=Vol-3159
|authors=Muhammad Humayoun
|dblpUrl=https://dblp.org/rec/conf/fire/Humayoun21
}}
==Abusive and Threatening Language Detection in Urdu using Supervised Machine Learning and Feature Combinations==
<pdf width="1500px">https://ceur-ws.org/Vol-3159/T4-2.pdf</pdf>
<pre>
Abusive and Threatening Language Detection in Urdu using
Supervised Machine Learning and Feature Combinations
Muhammad Humayoun
Higher Colleges of Technology, Abu Dhabi, United Arab Emirates

                   Abstract
                   This paper presents the system descriptions submitted at the FIRE Shared Task 2021 on Urdu's
                   Abusive and Threatening Language Detection Task. This challenge aims at automatically
                   identifying abusive and threatening tweets written in Urdu. Our submitted results were selected
                   for the third recognition at the competition. This paper reports a non-exhaustive list of
                   experiments that allowed us to reach the submitted results. Moreover, after the result
                   declaration of the competition, we managed to attain even better results than the submitted
                   results. Our models achieved 0.8318 F1 score on Task A (Abusive Language Detection for
                   Urdu Tweets) and 0.4931 F1 score on Task B (Threatening Language Detection for Urdu
                   Tweets). Results show that Support Vector Machines with stopwords removed, lemmatization
                   applied, and a feature vector created from combinations of word n-grams for n=1,2,3 produced
                   the best results for Task A. For Task B, Support Vector Machines with stopwords removed,
                   lemmatization not applied, a feature vector created from a pre-trained Urdu Word2Vec (on word
                   unigrams and bigrams), and the dataset balanced by oversampling
                   produced the best results. The code is made available for reproducibility 1.

                   Keywords 2
                   Abusive Language, Threatening Language, Urdu, SVM, CNN, Word2Vec

1. Introduction
    With the exponential growth of social media users, hate speech is becoming a pandemic in which
no one is safe. With the luxury of anonymity and virtual space, people tend to say things they might
have filtered in a physical setting. Two important types of hate speech are written language that is
(1) abusive and (2) threatening. Research on methods that detect such hate speech automatically is
growing rapidly for English. However, resource-poor languages do not get as much attention, and
shared tasks like this one are generally considered a good step. The Abusive and Threatening
Language Detection Task in Urdu 3 is a shared task in the CICLing 2021 track at FIRE 2021 [1] [2],
co-hosted with ODS SoC 2021 4. Two tasks are proposed in this challenge. Task A focuses on
detecting abusive language, whereas Task B focuses on detecting threatening language in
tweets written in Urdu. Both can be treated as binary classification problems, for which
supervised learning models can be trained.
    This paper presents the system descriptions that were submitted to the competition. Our submitted
results were selected for the third recognition, with a monetary prize of 10K Rub sponsored through the
ODS Summer of Code. This paper reports a non-exhaustive list of experiments that allowed us to reach
the submitted results. In addition, after the competition we managed to attain better results than
those we submitted. Related research outside this competition, which describes the dataset construction
and reports excellent results, is presented in [3] [4].


1
    https://github.com/humsha/UrduAbusiveandThreatDetectionTasks
Forum for Information Retrieval Evaluation, December 13-17, 2021, India
EMAIL: mhumayoun@hct.ac.ae
ORCID: 0000-0003-0623-6703
                © 2020 Copyright for this paper by its authors.
                Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                CEUR Workshop Proceedings (CEUR-WS.org)
3
  https://www.urduthreat2021.cicling.org/home
4
  https://ods.ai/tracks/summer-of-code-2021
   There are more than 100 million people worldwide speaking Urdu. It is an Indo-Aryan language,
having a modified Perso-Arabic alphabet [5], and written in Nastalique writing style. Urdu draws its
advanced vocabulary from Persian and Arabic and day-to-day usage vocabulary from the native
languages of South Asia [6]. Urdu lacks capitalization, which makes identifying proper nouns, titles,
acronyms, and abbreviations difficult. As in Arabic and Persian, vowels are optional and rarely
present in written text [7]. Thus, words are often guessed with the help of context. Urdu is a free
word order language with a default Subject-Object-Verb order [8].

2. Dataset Description
    The dataset related to the Abusive task has 2,400 instances in the training set (1,213 instances labeled
as non-abusive, and 1,187 instances labeled as abusive). The test set has 1,100 instances (non-abusive
instances: 537, abusive instances: 563). A cursory analysis of the dataset reveals that many of the
abusive instances contain abusive words making the detection straightforward.
    The dataset related to the Threatening task has 6K instances in the training set (4,929 instances
labeled as neutral and 1,071 instances as threatening). The class balance is roughly 82% (neutral) to
18% (threat), which is substantially imbalanced. The test set has 3,950 instances (neutral instances:
3,231, threatening instances: 719). In addition, threat detection can be a fundamentally
difficult problem, mainly because the hints of threat in the threatening instances are often subtle. Also, there can be
unavoidable variation in human judgment regarding an instance in hand, and a difference of opinion
may arise. For instance, some of the tweets labeled "threatening" do not seem to be threatening to us as
native speakers. Political, religious, and regional affiliations, among others, can also add bias to the
annotation process. Indeed, this is what makes this task difficult. Figure 1 shows some such instances
along with our comments.
Further Observations:
    A superficial analysis of both datasets reveals extensive use of non-standard script in tweets with a
high number of spelling mistakes (which is quite normal for informal text such as tweets). We also
found a high number of words written in non-canonical forms with spelling variations. Proper
segmentation of Urdu words is another important problem: we found a number of words joined
together as one token that should be separated by a space. All of this limits the effectiveness of NLP
tools such as stopword removal, normalization, lemmatization, and Word2Vec.

3. Preprocessing, Feature Extraction and Classification Techniques
3.1. The Preprocessing
   Preprocessing is the first step in NLP, and in particular in text classification. We apply the following
preprocessing:
1. Compulsory Preprocessing. The following preprocessing steps are applied to every document.
       a. Diacritic Removal. Diacritic marks are not consistently used in Urdu. To ensure the
           consistency of data, removing all the diacritics is a common practice, though we lose
           valuable information in this step.
       b. Text Normalization. Urdu is written using a modified Perso-Arabic script. Characters that
           visually look similar have different Unicode resulting in orthographic variations. We
           normalize all such variations in Normalization Form C (NFC) [9].
2. Stopword Removal. The stopword list we used is provided by [10, 11], and it contains nearly 500
   words.
3. Lemmatization. We used Urdu Morphological Analyzer [5] to convert all the surface forms of a
   word to its lemma or root. This tool covers approximately 5000 words, capable of handling 140,000
   word forms.
4. Artificial Data Generation. Synthetic Minority Oversampling Technique (SMOTE) [12] is used
   to balance the imbalanced dataset in Task B. The minority class is enlarged with SMOTE
   oversampling.
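
The compulsory preprocessing and stopword removal steps above can be sketched roughly as follows. This is a minimal illustration and not the author's code: it assumes that diacritics can be approximated by Unicode combining marks (category Mn), it uses only NFC rather than the full normalization utility of [9], and the three-word `STOPWORDS` set is a hypothetical stand-in for the ~500-word list of [10, 11]. Lemmatization is omitted because it depends on the external morphological analyzer [5].

```python
import unicodedata

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"\u06A9\u0627", "\u06A9\u06CC", "\u06C1\u06D2"}  # "ka", "ki", "hai"


def normalize(text):
    """Apply NFC first, then drop the remaining combining marks (diacritics).

    Composing first keeps letters such as alef with madda (U+0622) intact,
    because after NFC they are single letters rather than base + mark.
    """
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if unicodedata.category(ch) != "Mn")


def preprocess(text, remove_stopwords=True):
    """Normalize a tweet, split it on whitespace, optionally drop stopwords."""
    tokens = normalize(text).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens
```

Applying NFC before stripping marks is a design choice of this sketch; the paper does not state the order in which the two steps were run.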
3.2.      Feature Extraction
   We have used the following word-level and character-level features.
1. Bag of words model. A standard bag of words model using a combination of a non-exhaustive list
   of features produced by character n-grams and word n-grams.
   1.1. Weighting schemes. The value of a feature in a bag of word model is calculated using:
       1.1.1. Raw counts
       1.1.2. Binary: words are marked as present (1) or absent (0).
       1.1.3. Frequency: words are scored based on their frequency of occurrence within the document.
       1.1.4. TF-IDF: words are scored based on their frequency, and common words across the
              documents are penalized.
2. Selecting the K best features using the SelectKBest algorithm available in scikit-learn. We report results
   with K=1000 and 5000.
3. Word Embeddings. Urdu Word Embeddings [13] is a pre-trained Word2Vec model for
   Urdu that we have used in some of our models. It covers 100K words. The procedure to form a
   vector model for the tasks at hand is given in Figure 2.
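
To make the four weighting schemes concrete, the following sketch builds word n-gram vectors in each mode. It is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, and the TF-IDF variant shown (raw count times a smoothed inverse document frequency) is only one of several common definitions, since the paper does not specify which formula was used.

```python
import math
from collections import Counter


def word_ngrams(tokens, ns=(1, 2)):
    """Return all word n-grams for each n in ns, joined with a space."""
    grams = []
    for n in ns:
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def vectorize(docs, ns=(1,), mode="count"):
    """docs is a list of token lists; returns (vocabulary, matrix of rows)."""
    if mode not in ("count", "binary", "freq", "tfidf"):
        raise ValueError("unknown mode: " + mode)
    grams_per_doc = [word_ngrams(doc, ns) for doc in docs]
    vocab = sorted({g for grams in grams_per_doc for g in grams})
    index = {g: j for j, g in enumerate(vocab)}
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(g for grams in grams_per_doc for g in set(grams))
    n_docs = len(docs)
    matrix = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = sum(counts.values())
        row = [0.0] * len(vocab)
        for gram, c in counts.items():
            if mode == "count":
                value = float(c)
            elif mode == "binary":
                value = 1.0
            elif mode == "freq":
                value = c / total
            else:  # tfidf with a smoothed idf (an assumption of this sketch)
                value = c * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
            row[index[gram]] = value
        matrix.append(row)
    return vocab, matrix
```

For example, `vectorize([["a", "b", "a"], ["b", "c"]], mode="freq")` scores `a` in the first document as 2/3, while `mode="binary"` scores it as 1.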

3.3.      Classification Techniques
Both traditional machine learning and neural network techniques have been extensively used in the
literature for similar tasks [3] [4]. We selected the following machine learning classifiers because of
their superior results on the tasks in hand 5:
1. Support Vector Machines with kernels RBF, Sigmoid and Polynomial (with degree 1, 2, 3).
2. AdaBoost with default settings.
3. A Convolutional Neural Network for sentence classification is reported in [14]. Our model is a
   simplified version of it. The main difference is that our model does not use pre-trained embeddings. We
   define a model with four input channels. In these channels the text is processed with word n-gram
   (n is 1 to 4) settings. Each channel is defined as:
   3.1. An input layer
   3.2. Embedding layer set to the size of the vocabulary and 100-dimensional real-valued
         representation.
   3.3. Convolutional layer of 1-dimension with 32 filters and a kernel size set to the number of words
         to read at once (word n-grams where n=k for channelk with k=1, 2, 3, 4, i.e. channel1 used
         unigrams, channel2 used bigrams, and so on).
   3.4. Max Pooling layer to combine the output from the convolutional layer.
   3.5. Flatten layer to reduce the three-dimensional output to two dimensional for concatenation.
4. The outputs from the four channels are concatenated into a single vector and processed by a Dense
   layer and an output layer.
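
As a rough aid for reasoning about the size of the concatenated vector, the shape arithmetic of the four channels can be sketched as below. Several details are assumptions that the text does not state: a stride of 1 with no padding in the convolution, a pooling window of 2, and the sequence length of 50 used in the example.

```python
def channel_output_size(seq_len, kernel_size, filters=32, pool_size=2):
    """Flattened feature count of one channel: Conv1D -> MaxPooling1D -> Flatten."""
    conv_len = seq_len - kernel_size + 1   # no padding, stride 1 (assumption)
    pooled_len = conv_len // pool_size     # non-overlapping pooling (assumption)
    return pooled_len * filters


def merged_size(seq_len, kernel_sizes=(1, 2, 3, 4)):
    """Length of the vector obtained by concatenating all channels."""
    return sum(channel_output_size(seq_len, k) for k in kernel_sizes)


# Hypothetical padded tweet length of 50 tokens:
# 800 + 768 + 768 + 736 features from the four channels.
print(merged_size(50))  # 3072
```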

4. Experiments
   For both tasks, train and test sets are given. During the competition, a portion of the test set was
made public for the participants to compare the results. The private part of the test set was released after
the competition, and final results were announced 6. The results reported in this paper are from the
complete test set (both private and public). The models are produced by training a classifier on the training
set, and predictions are made on the test set. The experiments are performed on a laptop with
an 8th-generation Intel Core i7 processor and 8 GB of RAM.
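
All results in this paper are reported as F1 scores. As a reminder of how this metric is computed, here is a minimal sketch. It assumes the F1 of the positive class (label 1, i.e. abusive or threatening); the text does not say whether the reported scores use this or an averaged variant.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```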


5
  Naïve Bayes, Logistic Regression, Random Forests, Decision Tree and Multilayer Perceptron were excluded due to space limitations and
somewhat mediocre results.
6
  https://www.urduthreat2021.cicling.org/
4.1.    Abusive language detection task (Task A)

Experiment A1: In this experiment, we produce a bag of words feature vector for an exhaustive list of
combinations of stopword removal (yes/no), lemmatization (yes/no), word n-grams
(n=1, 2, 3) and modes (freq, count, binary, tfidf). Character-level n-grams are not included in this
experiment in order to keep the vector size reasonable for the laptop. The top results with an F1 score
greater than 80% are shown in Table 1. The results reported in this section are the best among all the experiments
for Task A.
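
The grid of settings explored in Experiment A1 can be enumerated with a few lines of Python. This is a sketch under an assumption: the n-gram settings are taken to be the three cumulative combinations that appear in Table 1 (1; 1,2; 1,2,3), and the dictionary keys are hypothetical names. Each resulting configuration would then be paired with every classifier of Section 3.3.

```python
from itertools import product

STOPWORD_REMOVAL = (True, False)
LEMMATIZATION = (True, False)
NGRAM_SETTINGS = ((1,), (1, 2), (1, 2, 3))  # assumption based on Table 1
MODES = ("freq", "count", "binary", "tfidf")


def feature_configs():
    """Yield one dictionary per combination of preprocessing and feature settings."""
    for stw, lem, ngrams, mode in product(
        STOPWORD_REMOVAL, LEMMATIZATION, NGRAM_SETTINGS, MODES
    ):
        yield {
            "remove_stopwords": stw,
            "lemmatize": lem,
            "ngrams": ngrams,
            "mode": mode,
        }


configs = list(feature_configs())
print(len(configs))  # 2 * 2 * 3 * 4 = 48 feature configurations
```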
Experiment A2: In addition to the feature combinations in Experiment A1, we include character n-
grams of n=1, 2, 3. Then, K-best features for K=1000 and 5000 are selected. Finally, the traditional
classifiers are applied for training and prediction. The top results with F1 greater than 82% are shown in
Table 2.
Experiment A3: The feature vector for each comment is produced by a pre-trained Urdu Word2Vec
[13] using the procedure given in Figure 2. The feature vector is produced for word n-grams n=1,2.
Then, the traditional classifiers are applied for training and prediction. The results do not improve
beyond 79%, as shown in Table 3. Because Experiment A1 demonstrates that removing stopwords and
applying lemmatization produces some of the highest results, we report only these preprocessing settings
in Table 3.
Experiment A4: A Convolutional Neural Network (CNN) model is discussed in Section 3.3. In this
experiment, we apply this model to the dataset. As preprocessing, stopwords are removed, and
lemmatization is applied. As discussed in Section 3.3, word n-grams for n=1,2,3,4 are applied in 4
channels. The vocabulary size is ~25K, and the vector size is 300. As the model is stochastic, an average
of 5 runs of training and prediction is reported in Table 4.

4.2.    Threatening language detection task (Task B)
Experiment Task B1: As mentioned in Section 2, the dataset for the Urdu threatening language detection
task is substantially imbalanced. Most of the experiments we performed for Task A did not produce
reasonable results when applied to Task B. However, when an additional step of balancing the dataset
is added, we get somewhat better results, as described below:
    The feature vector for each comment is produced using the procedure given in Figure 2. It means
that we employed the pre-trained Urdu Word2Vec to generate the feature vector for each tweet in the
dataset (the process is similar to experiment A3). The feature vector is produced for word n-grams
n=1,2. The dataset is balanced by oversampling using SMOTE [12]. Finally, the classifiers are applied
for training and prediction. Note that a non-exhaustive list of combinations of preprocessing settings
(stopwords and lemmatization) is applied, but only top results with F1 greater than 47% are reported in
Table 5. Character n-grams are not used as we cannot get vectors for characters from Urdu Word2Vec.
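
The core idea of SMOTE [12], as used for balancing here, is to create each synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that idea in plain Python; it is not the implementation used in the experiments, and the function name, the random choice of base samples, and the default `k=5` are assumptions of this sketch.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic vectors from a list of minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen base sample.
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: dist2(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

With the Task B training counts from Section 2 (4,929 neutral and 1,071 threatening instances), full balancing would require 4,929 - 1,071 = 3,858 synthetic threatening instances.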
Experiment Task B2: In this experiment, we produce a bag of words feature vector. The stopwords are
removed, and lemmatization is applied; combinations of word n-grams (n=1, 2) with mode ‘freq’ are
used. The larger size of the feature vectors (unigrams and bigrams together for the last two experiments in Table 6)
exhausted the 8 GB of RAM. Therefore, we reduced the vocabulary size for these two experiments by only
considering the tokens that occur at least 4 times in the dataset. The dataset is balanced by oversampling
using SMOTE. Given the mediocre results, we saw no point in producing an exhaustive list
of preprocessing combinations. The results are reported in Table 6.
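
The vocabulary reduction described in Experiment B2 amounts to a frequency cut-off. A minimal sketch follows, assuming that occurrences are counted over the whole dataset and that the same cut-off is applied to unigrams and bigrams alike; the function name is hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, min_count=4):
    """Keep only tokens that occur at least min_count times across all docs."""
    counts = Counter(tok for doc in docs for tok in doc)
    keep = {tok for tok, c in counts.items() if c >= min_count}
    pruned = [[tok for tok in doc if tok in keep] for doc in docs]
    return pruned, keep
```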
Experiment Task B3: Another way of reducing the size of feature vector is to select top K features. For
K=2000, the results are shown in Table 7. Stopwords are removed and lemmatization is applied.
Combinations of word and character n-grams (n=1, 2) and modes (‘freq’, ‘tfidf’) are used. We note
that the score started to decrease when character n-grams with n=2 were added to the word n-grams with n=1,2 in
the last two experiments of Table 7. Therefore, we did not complete an exhaustive list of settings or go
on to n=3 for word and character n-grams.
Experiment Task B4: In this experiment, we apply the Convolutional Neural Network (CNN) model
discussed in Section 3.3 to the dataset. As preprocessing, stopwords are removed, and lemmatization
is applied. The dataset is balanced by oversampling using SMOTE. Finally, the CNN is trained. As
discussed in Section 3.3, word n-grams for n=1,2,3,4 are used in 4 channels. The vocabulary size is
~11K, and the vector size is 300. As the model is stochastic, an average of 5 runs of training and
prediction is reported in Table 8. However, the results are not good.

5. Discussion on Results
    For the abusive language task (Task A), the traditional models (bag of words with SVM) performed
best in various settings. This is understandable, as the dataset is small and a CNN probably needs more data
to outperform the traditional models. The pre-trained Urdu Word2Vec in Experiment A3 also
underperforms, probably because of the limited vocabulary of the pre-trained Word2Vec and the
non-standard orthographic variations found in the tweets. Experiment A1 also reveals that enlarging
the feature set does not necessarily improve the results (the results remain comparable but become
computationally more expensive). Removing stopwords and applying lemmatization turns out to be a
reliable preprocessing setting. Combining word-level unigrams, bigrams and trigrams produces results among
the best. Combining more n-grams increases the sentence vector size enormously and should be
avoided when computational power is scarce. Nevertheless, unigrams combined with bigrams produced the highest
score for Task A (see Table 1). In terms of weighting scheme, “freq” turns out to be reasonably
reliable. Similarly, SVMs with a polynomial kernel of degree 1, a sigmoid kernel, and an RBF kernel turn out to be reasonably
reliable across various settings.
    As shown in Experiment A2 (Table 2), combining word-level and character-level unigrams, bigrams,
and trigrams first and then selecting the top 1K to 5K features is also a suitable method, producing F1 scores
above 80%. In this case, keeping the stopwords and not applying lemmatization produces
the best results in most cases. It seems that keeping the top features from a huge feature set of word- and
character-level n-grams with n=1,2,3 manages to discover lexical patterns.

    The dataset for the threatening language detection task (Task B) is highly imbalanced. Most of the
experiments we performed for Task A did not produce good results on Task B. One of the main points
worth noting is the need to make the dataset balanced. It seems that oversampling has increased the
scores, but the best score reported is still below 50% for F1 score. The reason might be the subtle nature
of the task at hand, as discussed in Section 2.
    The highest F1 score is found in experiment B1 when the feature vector is produced by a pre-trained
Word2Vec using word n-grams for n=1,2. We think that oversampling by SMOTE managed to create
feature vectors for new instances in such a way that some separation of the labels was possible.
Results in all other experiments are even worse. In particular, the results for CNN with or without
oversampling are very low.

6. Conclusion
    In this paper, we have described the systems that produced the reported results for the competition.
Moreover, after the results of the competition were declared, we managed to attain even better results than
those submitted. Our analysis of the results shows that, until large enough datasets are made
available, we can rely on (1) the traditional bag of words models to create features and (2)
conventional classifiers such as SVM. Neural network-based methods can catch
up only if a large training set is readily available. We found Task B quite challenging due to
the inherent shortcomings of an imbalanced dataset and the difficulty of the task at hand. The
limits of the oversampling technique that we employed are also evident in the results. Generally, oversampling is
a challenging task for NLP datasets. We learnt that having a balanced and large enough dataset is an
important prerequisite for better results, though such datasets might be hard to produce. As future
work, we shall explore recent transfer-learning techniques such as BERT fine-tuning, though getting
a large enough Urdu BERT model might be a challenge.
Acknowledgment
  We thank the anonymous reviewers for their insightful comments and suggestions.

7. References


  [1]      M. Amjad, A. Zhila, G. Sidorov, A. Labunets, S. Butt, H. I. Amjad, O. Vitman and A.
       Gelbukh, "Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021,"
       in HASOC-FIRE 2021, 2021.
  [2]      M. Amjad, A. Zhila, G. Sidorov, A. Labunets, S. Butt, H. I. Amjad, O. Vitman and A.
       Gelbukh, "UrduThreat@ FIRE2021: Shared Track on abusive threat Identification in Urdu,"
       in Forum for Information Retrieval Evaluation, 2021.
  [3]      M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, A. Zubiaga and A. Gelbukh, "Threatening
       Language Detecting and Threatening Target Identification in Urdu Tweets," IEEE Access,
       vol. 9, pp. 128302-128313, 2021.
  [4]      M. Amjad, N. Ashraf, G. Sidorov, A. Zhila, L. Chanona-Hernandez and A. Gelbukh,
       "Automatic Abusive Language Detection in Urdu Tweets," Acta Polytechnica Hungarica
       [accepted], 2021.
  [5]      M. Humayoun, H. Hammarstrom and A. Ranta, "Urdu morphology, orthography and
       lexicon extraction," in CAASL-2: The Second Workshop on Computational Approaches to
       Arabic Script-based Languages, LSA Linguistic Institute Stanford University, California USA,
       2007.
  [6]      M. Virk Shafqat, M. Humayoun and A. Ranta, "An open source Urdu resource grammar,"
       in Proceedings of the Eighth Workshop on Asian Language Resources, pp. 153-160, 2010.
  [7]      M. Humayoun, H. Hammarström and A. Ranta, "Implementing Urdu Grammar as Open
       Source Software," in Conference on Language and Technology, University of Peshawar, 2007.
  [8]      M. Humayoun and N. Akhtar, "CORPURES: Benchmark Corpus for Urdu Extractive
       Summaries and Experiments using Supervised Learning," Intelligent Systems with
       Applications, Elsevier, 2021.
  [9]      A. Gulzar, "Urdu Normalization Utility v1.0," Technical Report, Center for Language
       Engineering, Al-Khawarizmi Institute of Computer Science (KICS), University of Engineering,
       Lahore, Pakistan, 2007.
  [10]     M. Humayoun, R. M. A. Nawab, M. Uzair, S. Aslam and O. Farzand, "Urdu summary
       corpus," in Proceedings of the Tenth International Conference on Language Resources and
       Evaluation (LREC 2016), pp. 796–800, Portoroz, Slovenia, 2016.
  [11]     M. Humayoun and H. Yu, "Analyzing Preprocessing Settings for Urdu Single-document
       Extractive Summarization," in Proceedings of the Tenth International Conference on
       Language Resources and Evaluation (LREC'16), Portoroz, Slovenia, 2016.
  [12]     N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, "SMOTE: synthetic
       minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, no. 1,
       pp. 321-355, 2002.
  [13]     S. Haider, "Urdu Word Embeddings," in Proceedings of the Eleventh International
       Conference on Language Resources and Evaluation (LREC 2018), European Language
       Resources Association (ELRA), isbn:979-10-95546-00-9, Miyazaki, Japan, 2018.
  [14]     Y. Kim, "Convolutional Neural Networks for Sentence Classification," in Empirical
       Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
  [15]     M. Amjad, S. Butt, H. I. Amjad, A. Zhila, G. Sidorov and A. Gelbukh, "Overview of the
       shared task on fake news detection in Urdu at Fire 2021," in CEUR Workshop Proceedings,
       2021.
Appendix
ID     Label   Tweet (English translation; the Urdu original is not legible in this extraction)
1594   1       English: This is the cure for these shameless people. The longer they stay, the more they will create mess.
               Comment: Certainly abusive, but only a subtle hint of threat (it is more like a threatening prediction).
               Author agrees with the label.
1533   1       English: National pride is something that makes the reputation of a nation worldwide. Today, India has
               imposed all worldly laws on Kashmir which ...
               Comment: Not threatening; not abusive. Author disagrees with the label.
413    1       English: History will surely write who was a traitor, how and who destroyed the prosperous country
               Comment: Not threatening; not abusive. Author disagrees with the label.
2150   1       English: I condemn it as a member of Muslim League
               Comment: Not threatening; not abusive. Author disagrees with the label.
2379   1       English: They were given a tough time and finally expelled from Afghanistan with humiliation.
               Comment: Not threatening; not abusive. Author disagrees with the label.
  Figure 1: Some Sample tweets from Threatening language dataset


       x_doc  = [token_1, token_2, …, token_n]   where token_i is a space-separated chunk of the string
       x_docs = [x_doc_1, x_doc_2, …, x_doc_m]   where x_doc_i is a complete comment
       doc_vectors = [doc_vector_1, doc_vector_2, …, doc_vector_m]
                                              where doc_vector_i is the document vector of x_doc_i
       for each doc in x_docs:
           temp = an empty list of 300-dimensional vectors
           for each word in doc:
                   word_vec = the 300-dimensional vector for word from the embedding
                   add word_vec to temp
           end for
           doc_vector = (aggregate the n vectors in temp into one vector
                                              of size 300 using a column-wise average)
           add doc_vector to doc_vectors
       end for
       return doc_vectors

                      Figure 2: Procedure to form a vector model from Word2Vec
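
The procedure in Figure 2 can be sketched in Python as follows. This is a minimal illustration rather than the author's code: the embedding is modelled as a plain dictionary from word to vector, and two details that the figure leaves open are filled in as assumptions, namely that words missing from the embedding are skipped and that a document with no known words is mapped to a zero vector.

```python
def doc_vector(tokens, embedding, dim):
    """Column-wise average of the embedding vectors of the known tokens."""
    vectors = [embedding[tok] for tok in tokens if tok in embedding]
    if not vectors:
        return [0.0] * dim  # assumption: all-unknown documents get zeros
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def doc_vectors(docs, embedding, dim=300):
    """Map each tokenized document to one dim-sized vector."""
    return [doc_vector(doc, embedding, dim) for doc in docs]


# Toy 2-dimensional embedding, for illustration only.
toy_embedding = {"a": [1.0, 3.0], "b": [3.0, 5.0]}
print(doc_vectors([["a", "b", "zzz"], ["zzz"]], toy_embedding, dim=2))
# [[2.0, 4.0], [0.0, 0.0]]
```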
 Table 1
 Abusive language detection Task A1: Bag of words with traditional classifiers – Top results greater than
 80% F1 score
STW Removed             Lemmatized          Word ngrams           Mode         Model                  Vocab size         Vector size       F1
        Yes                   Yes           1,2                   freq         SVM-POLY-1             24949              18963             0.8318
        Yes                   Yes           1                     freq         SVM-POLY-1             5695               4657              0.8312
        Yes                   Yes           1,2,3                 freq         SVM-POLY-1             45014              33596             0.8245
        Yes                   Yes           1                     freq         SVM-SIGMOID            5695               4657              0.8231
        Yes                   Yes           1,2                   freq         SVM-SIGMOID            24949              18963             0.8218
        Yes                   Yes           1,2,3                 freq         SVM-SIGMOID            45014              33596             0.8209
        Yes                   Yes           1,2                   freq         SVM-RBF                24949              18963             0.819
        Yes                   Yes           1,2                   count        SVM-SIGMOID            24949              18963             0.8174
        Yes                   Yes           1,2                   binary       SVM-SIGMOID            24949              18963             0.8165
        Yes                   Yes           1,2                   count        SVM-RBF                24949              18963             0.8153
        Yes                   Yes           1,2                   count        SVM-POLY-1             24949              18963             0.815
        Yes                   No            1                     freq         SVM-SIGMOID            6711               5466              0.8144 7
        Yes                   Yes           1,2,3                 binary       SVM-RBF                45014              33596             0.8127
        Yes                   Yes           1,2                   binary       SVM-RBF                24949              18963             0.8126
        Yes                   Yes           1,2                   tfidf        SVM-POLY-1             24949              18963             0.8115
        Yes                   Yes           1,2,3                 tfidf        SVM-POLY-1             45014              33596             0.8104
        No                    No            1                     binary       SVM-SIGMOID            6955               5707              0.8102
        Yes                   Yes           1,2,3                 count        SVM-SIGMOID            45014              33596             0.8101
        Yes                   Yes           1,2,3                 binary       SVM-SIGMOID            45014              33596             0.8098
        No                    No            1                     binary       SVM-POLY-1             6955               5707              0.8098
        Yes                   Yes           1,2,3                 freq         SVM-RBF                45014              33596             0.8095
        No                    No            1                     tfidf        SVM-SIGMOID            6955               5707              0.809
        Yes                   Yes           1,2                   binary       SVM-POLY-1             24949              18963             0.8089
        No                    No            1                     count        SVM-SIGMOID            6955               5707              0.8084
        Yes                   Yes           1,2,3                 count        SVM-RBF                45014              33596             0.8082
        No                    No            1                     count        SVM-POLY-1             6955               5707              0.8081
        Yes                   Yes           1,2                   tfidf        SVM-SIGMOID            24949              18963             0.8068
        No                    No            1                     tfidf        SVM-POLY-1             6955               5707              0.8065
        Yes                   Yes           1,2                   count        AdaBoost               24949              18963             0.8063
        Yes                   Yes           1,2                   tfidf        AdaBoost               24949              18963             0.8063
        Yes                   Yes           1,2,3                 count        SVM-POLY-1             45014              33596             0.8063
        Yes                   Yes           1,2,3                 count        AdaBoost               45014              33596             0.8057
        Yes                   Yes           1,2,3                 tfidf        AdaBoost               45014              33596             0.8054
        No                    No            1                     freq         SVM-SIGMOID            6955               5707              0.8054
        Yes                   Yes           1,2,3                 tfidf        SVM-SIGMOID            45014              33596             0.8048
        No                    No            1                     freq         SVM-POLY-1             6955               5707              0.8028


 7 This result was reported to the competition for Task A. There is a negligible variation because of slight changes made to the stopword list.
Table 2
Abusive language detection Task A2: Bag of words with traditional classifiers. Top results with F1
greater than 80% (both word-level and character-level n-grams included for n=1,2,3)
STW Removed          Lemmatized    Mode          Model            Vocab size      Vector size   Top K    F1
No              No                 count         SVM-POLY-1       135747          104243        5000     0.8282
No              No                 freq          SVM-POLY-1       135747          104243        1000     0.8277
Yes             Yes                binary        SVM-RBF          96923           76110         1000     0.8254
No              No                 count         SVM-POLY-1       135747          104243        1000     0.8251
No              No                 binary        SVM-RBF          135747          104243        1000     0.8236
Yes             Yes                freq          SVM-POLY-1       96923           76110         1000     0.8232
No              No                 freq          SVM-POLY-1       135747          104243        5000     0.8231
No              No                 binary        SVM-POLY-1       135747          104243        1000     0.8229
No              No                 tfidf         SVM-RBF          135747          104243        5000     0.8226
No              No                 binary        SVM-POLY-1       135747          104243        5000     0.822
No              No                 tfidf         SVM-RBF          135747          104243        1000     0.8206
Yes             Yes                tfidf         SVM-POLY-1       96923           76110         1000     0.8202
Yes             Yes                binary        SVM-POLY-1       96923           76110         1000     0.8198
No              No                 freq          SVM-RBF          135747          104243        5000     0.8196
No              No                 count         SVM-RBF          135747          104243        1000     0.8188
Yes             Yes                tfidf         SVM-RBF          96923           76110         1000     0.8185
No              No                 binary        SVM-RBF          135747          104243        5000     0.8181
Yes             Yes                freq          SVM-RBF          96923           76110         1000     0.8171
No              No                 tfidf         SVM-POLY-1       135747          104243        1000     0.8153
Yes             Yes                binary        AdaBoost         96923           76110         1000     0.8146
No              No                 tfidf         SVM-POLY-1       135747          104243        5000     0.8136
No              No                 count         SVM-RBF          135747          104243        5000     0.8114
No              No                 freq          SVM-RBF          135747          104243        1000     0.8113
Yes             Yes                tfidf         AdaBoost         96923           76110         1000     0.81
No              No                 count         AdaBoost         135747          104243        5000     0.8045
Yes             Yes                freq          AdaBoost         96923           76110         1000     0.8038
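
The "Mode" column in the bag-of-words tables can be illustrated with a minimal sketch. It assumes that
"binary" marks the presence of an n-gram, "count" stores its raw count, and "freq" divides the count by
the number of in-vocabulary n-grams in the tweet. The "tfidf" mode is left out because its exact
weighting is not stated in this section. The function names and the romanized sample sentences are
hypothetical and are not taken from the submitted system.

```python
from collections import Counter


def word_ngrams(tokens, n_values=(1, 2, 3)):
    """Return all word n-grams of the requested sizes as space-joined strings."""
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocab(docs, n_values=(1, 2, 3)):
    """Map every n-gram seen in the corpus to a column index."""
    vocab = {}
    for doc in docs:
        for gram in word_ngrams(doc.split(), n_values):
            vocab.setdefault(gram, len(vocab))
    return vocab


def vectorize(doc, vocab, mode="count", n_values=(1, 2, 3)):
    """Turn one document into a bag-of-n-grams vector (assumed mode semantics)."""
    grams = [g for g in word_ngrams(doc.split(), n_values) if g in vocab]
    counts = Counter(grams)
    total = sum(counts.values())
    vec = [0.0] * len(vocab)
    for gram, c in counts.items():
        idx = vocab[gram]
        if mode == "binary":
            vec[idx] = 1.0
        elif mode == "count":
            vec[idx] = float(c)
        elif mode == "freq":
            vec[idx] = c / total
        else:
            raise ValueError("unsupported mode: " + mode)
    return vec


# Hypothetical two-document corpus, whitespace-tokenized.
corpus = ["tum bohat bure ho", "tum ache ho"]
vocab = build_vocab(corpus, n_values=(1, 2))
count_vec = vectorize("tum tum ho", vocab, mode="count", n_values=(1, 2))
```

In this sketch the vector size equals the vocabulary size. The tables report a smaller vector size than
vocabulary size, so the submitted system presumably applied some additional filtering that is not
reproduced here.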


Table 3
Abusive language detection Task A3: Feature vector by Word2Vec using (word n-grams for n=1,2) with
traditional classifiers
        Model            Vocab size        Vector size      F1
      SVM-POLY-1           24949              300        0.7805
      SVM-POLY-2           24949              300        0.7912
      SVM-POLY-3           24949              300        0.7855
        SVM-RBF            24949              300        0.7916
      SVM-SIGMOID          24949              300        0.7579

Table 4
Abusive language detection Task A4: CNN with 4 channels
                         F1
      Average          0.797
      Maximum          0.812
      Minimum          0.7804
      Std. dev.        0.89

Table 5
Threatening language detection Task B1: Feature vector by Word2Vec using (word n-grams for n=1,2)
with traditional classifiers. Top results with F1 greater than 47%.
       STW Removed            Lemmatized           Word ngrams            Model                Vocab size         Vector size        F1
              Yes                    No            1,2                    SVM-POLY-3           79885              300                0.4931
              Yes                    No            1,2                    SVM-POLY-2           79885              300                0.4921
              Yes                    No            1,2                    SVM-RBF              79885              300                0.4883
              Yes                    No            1                      SVM-POLY-3           15009              300                0.487 (see footnote 8)
              No                     No            1,2                    SVM-POLY-3           102586             300                0.4866
              Yes                    Yes           1,2,3                  SVM-POLY-3           145299             300                0.485
              No                     No            1,2                    SVM-RBF              102586             300                0.484
              No                     No            1,2                    SVM-POLY-2           102586             300                0.481
              Yes                    No            1                      SVM-RBF              15009              300                0.4778
              Yes                    Yes           1,2,3                  SVM-POLY-2           145299             300                0.4766
              Yes                    Yes           1,2                    SVM-POLY-2           75244              300                0.475
              Yes                    Yes           1,2                    SVM-POLY-3           75244              300                0.4702

Table 6
Threatening language detection Task B2: Feature vector by bag of words using (word n-grams for n=1,2)
with traditional classifiers.
STW Removed               Lemmatized         Word ngrams           Mode       Model                  Vocab size        Vector size        F1
          Yes                  Yes           1                     freq       SVM-POLY-1             12803             9494               0.4749
          Yes                  Yes           1                     freq       SVM-SIGMOID            12803             9494               0.4503
          Yes                  Yes           1                     freq       SVM-POLY-2             12803             9494               0.4231
          Yes                  Yes           1                     freq       SVM-RBF                12803             9494               0.3886
          Yes                  Yes           1                     freq       SVM-POLY-3             12803             9494               0.3774
          Yes                  Yes           1,2                   freq       SVM-SIGMOID            55179             36036              0.3665
          Yes                  Yes           1,2                   freq       SVM-RBF                55179             36036              0.1814
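
For reference, the F1 score reported throughout these tables is the harmonic mean of precision and
recall. The sketch below computes it for a single positive class. Treating label 1 (threatening) as the
positive class is an assumption, as the averaging scheme used by the shared task is not stated in this
section, and the labels in the example are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for six tweets.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
score = f1_score(gold, pred)
```

Multiplying such a score by 100 gives the percentage scale that Tables 7 and 8 appear to use.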

Table 7
Threatening language detection Task B2: Feature vector by bag of words using (word n-grams for n=1,2
and character n-grams for n=2) with traditional classifiers. Top K features selected and used.
    Word ngrams            Char ngrams          Mode         Model                   Vocab size            Vector size       Top K        F1 (%)
           1,2               Not used           tfidf        SVM-SIGMOID             55179                 36036             2000         43.49
           1,2               Not used           freq         AdaBoost                55179                 36036             2000         43.22
           1,2               Not used           freq         SVM-POLY-1              55179                 36036             2000         43.09
           1,2               Not used           tfidf        AdaBoost                55179                 36036             2000         42.71
           1,2                  2               tfidf        SVM-POLY-1              58043                 38277             2000         42.6
           1,2                  2               tfidf        SVM-RBF                 58043                 38277             2000         42.42

    Table 8
    Threatening language detection Task B4: CNN with 4 channels
                                       Without SMOTE F1 (%)               With SMOTE F1 (%)
           Average                              34.21                           34.52
          Maximum                               35.47                           35.28
          Minimum                               32.26                           33.7
      Standard deviation                        1.23                            0.61
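
The summary rows in the CNN tables (average, maximum, minimum, standard deviation) describe repeated
training runs. A minimal sketch of how such a summary can be computed is given below. The per-run scores
are hypothetical, and the use of the sample standard deviation (n-1 denominator) is an assumption, since
the section does not say which estimator was used.

```python
import statistics


def summarize_runs(f1_scores):
    """Summarize F1 scores obtained from repeated runs of the same model."""
    return {
        "average": statistics.mean(f1_scores),
        "maximum": max(f1_scores),
        "minimum": min(f1_scores),
        # Sample standard deviation (n-1 denominator); an assumption.
        "std_dev": statistics.stdev(f1_scores),
    }


# Hypothetical F1 scores (in percent) from five runs; not the paper's data.
runs = [34.0, 35.0, 33.0, 36.0, 32.0]
summary = summarize_runs(runs)
```

Reporting the spread alongside the average makes it possible to judge whether a difference such as the
one between the two columns above exceeds run-to-run variation.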


8 This result was reported to the competition for Task B. There is a negligible variation because of slight changes made to the stopword list.

</pre>