Vito at HASOC 2019: Detecting Hate Speech and Offensive Content through Ensembles

Victor Nina-Alcocer
Department of Computer Systems and Computation
Universitat Politècnica de València
Camí de Vera s/n, 46022 València, Spain
vicnial@inf.upv.es

Abstract. This paper describes our participation in the shared task "Hate Speech and Offensive Content Identification in Indo-European Languages" (HASOC) at the Forum for Information Retrieval Evaluation (FIRE) 2019. This work studies the detection of hate or offensive content in English posts published on Facebook or Twitter. For a fine-grained study of the task, we analyzed two different approaches: the first concerns the design of two architectures using convolutional and recurrent neural networks, while the second examines a range of paradigms based on classical machine learning algorithms, neural networks, and transformers.

Keywords: Facebook · Twitter · Hate Speech · Offensive Content · Convolutional Neural Networks · Recurrent Neural Networks · Transformers.

1 Introduction

HASOC [12] at FIRE [2] aims to identify hate and offensive content in social media, taking into account tweets and Facebook posts in Indo-European languages. To accomplish this goal, the organizers propose three sub-tasks that tackle this issue in three languages: English, German, and Hindi.

– Sub-task A: Hate or Offensive Identification.
– Sub-task B: Category Classification.
– Sub-task C: Target Identification.

The aim of sub-task A is to classify short posts into two classes: "Hate and Offensive" (HOF) and "Non-Hate and Non-Offensive" (NOT). For the HOF posts, sub-task B performs a further fine-grained classification into three groups: Hate speech (HATE), Offensive (OFFN), and Profane (PRFN). Furthermore, sub-task C focuses on whether a HOF post targets a particular individual or group (Targeted Insult, TIN) or is just generally profane (Untargeted, UNT). Further information about these tasks can be found in [12].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12–15 December 2019, Kolkata, India.

As in our previous work [13–15], we address this kind of challenge using Machine Learning (ML) and Natural Language Processing (NLP). In this paper, we draw on the same fields but dig a little deeper into Deep Learning (DL) [10], Neural Networks (NN) [11], and architectures based on transformers [19].

We organize this paper into four sections. This first section provides an introduction; the second describes the proposed approaches and their respective systems; the third presents our experiments and the results achieved; the final section concludes our work.

2 Systems Description

This section presents the main approaches considered in this research.

The first approach comprises two architectures. The first one is DL-NN, which utilizes Convolutional Neural Networks (CNNs) [8] and Recurrent Neural Networks (RNNs). It has three embedding layers as inputs: the first (I1) takes the embedding of a pre-processed post, while the other two take the embedding of its Part-of-Speech (POS) tags [3, 16] (I2) and of the presence of positive or negative words according to a pre-defined lexicon [7] (I3). The first embedding input feeds a CNN layer followed by an LSTM layer (O1). The second and third input layers feed a CNN (O2) and a dense layer (O3), respectively. The outputs of these three branches (O1, O2, O3) are concatenated and fed to a dense layer composed of two nodes; softmax is used to obtain the predictions.
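As an illustration, the following Keras sketch reproduces the DL-NN layout described above, using the hyperparameters reported later in Section 3 (kernel size 2, 256 filters, a 128-unit LSTM, a 64-node merge layer, and 200-dimensional embeddings). Vocabulary sizes, the sequence length, and the pooling used to flatten the I2/I3 branches are not stated in the paper, so they are assumptions here.

```python
# Minimal sketch of the DL-NN architecture; unstated details are assumptions.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (Concatenate, Conv1D, Dense, Embedding,
                                     GlobalMaxPooling1D, LSTM)
from tensorflow.keras.optimizers import Adam

MAX_LEN, VOCAB, POS_VOCAB, LEX_VOCAB, EMB_DIM = 50, 20000, 50, 4, 200

i1 = Input(shape=(MAX_LEN,), name="post_tokens")       # I1: pre-processed post
i2 = Input(shape=(MAX_LEN,), name="pos_tags")          # I2: POS-tag sequence
i3 = Input(shape=(MAX_LEN,), name="lexicon_polarity")  # I3: pos./neg. words

# O1: CNN over word embeddings, followed by an LSTM
x1 = Embedding(VOCAB, EMB_DIM)(i1)
x1 = Conv1D(256, kernel_size=2, activation="relu")(x1)
o1 = LSTM(128)(x1)

# O2: CNN over POS-tag embeddings (pooled to a fixed-size vector)
x2 = Embedding(POS_VOCAB, EMB_DIM)(i2)
x2 = Conv1D(256, kernel_size=2, activation="relu")(x2)
o2 = GlobalMaxPooling1D()(x2)

# O3: dense layer of 128 nodes over the lexicon-feature embeddings
x3 = Embedding(LEX_VOCAB, EMB_DIM)(i3)
x3 = GlobalMaxPooling1D()(x3)
o3 = Dense(128, activation="relu")(x3)

# Concatenate O1-O3, pass through a 64-node dense layer, predict with softmax
merged = Dense(64, activation="relu")(Concatenate()([o1, o2, o3]))
output = Dense(2, activation="softmax")(merged)

model = Model(inputs=[i1, i2, i3], outputs=output)
model.compile(optimizer=Adam(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
```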
Our second architecture is NN-AA. It implements Long Short-Term Memory (LSTM) [21] layers, a typical type of RNN. It takes the embedding representation of the pre-processed post, together with an additional feature vector extracted from the post's POS tags. The core is essentially an embedding layer followed by an LSTM layer. The model also incorporates an attention layer [5] to focus on the critical words of the sentence. We concatenate the output of this attention layer with the POS-tag vector representation, computed as the frequency of each POS tag in the sentence divided by the total number of words. The concatenated vector is fed into a dense layer with a softmax function that generates the predictions.

Our second approach, called MA-MO, examines a variety of models, ranging from traditional Machine Learning, i.e., Support Vector Machines [1], Logistic Regression [20], and Naive Bayes [18], to Deep Learning models, i.e., a simple CNN, a Simple Dense Layer (SDL), and a Multi-Layer Perceptron (MLP) [9]. We also considered fine-tuning some architectures based on slightly modified transformers. We combined the outputs of all these systems to feed an ensemble classifier.

3 Experiments and Results

In this section, we describe our experiments and the results achieved. HASOC provides an English training set of approximately 6000 posts. Table 1 shows that sub-task A covers all 5852 posts, labeled as HOF (Hate and Offensive) or NOT (Non-Hate and Non-Offensive). Sub-tasks B and C each cover the 2261 HOF posts: HATE (Hate speech), OFFN (Offensive), and PRFN (Profane) are the labels for sub-task B, while TIN (Targeted Insult) and UNT (Untargeted) are those for sub-task C.

Table 1. Summary of the English training set provided by the organizers.

            Sub-task A     Sub-task B            Sub-task C
            HOF    NOT     HATE   OFFN   PRFN    TIN    UNT
Num. posts  2261   3591    1143   451    667     2041   220
Total       5852           2261                  2261

Table 1 also shows that the dataset for sub-task A is only slightly imbalanced, whereas the others are not balanced at all, especially for the PRFN, OFFN, TIN, and UNT classes. We tackled this issue by applying data augmentation to the posts of the under-represented categories, which generates an acceptable training set. In addition to this training set, the organizers provided an unlabeled test set of almost 1000 posts on which to evaluate our systems. To obtain a development set, we split the training set in two, using 10% of it to tune our parameters. The official metrics used to evaluate the proposed systems on the three sub-tasks were macro F1 and weighted F1 [12].

We applied the following text pre-processing to the posts fed into our systems: we remove URLs, numbers, user mentions, times, dates, emails, and percentages. Besides, we normalize hashtags (e.g., converting "#TrumpIsATraitor" to "trump is a traitor"), emojis (replacing them with their CLDR short names), elongated words (e.g., transforming "haaaaahaa" to "haha"), repetitions, emphasis, and censored words. Both proposed approaches use this pre-processing. Additionally, the first approach and part of the second (the traditional ML and DL models) use GloVe embeddings [17], whereas the architectures based on transformers are fed with pre-processed text only.
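The exact normalization rules are not published with the paper, so the following sketch only approximates the pre-processing described above; the regular expressions and the `preprocess` helper are our own assumptions.

```python
# Approximate sketch of the described pre-processing; rules are assumptions.
import re

def preprocess(post: str) -> str:
    post = re.sub(r"https?://\S+", " ", post)   # remove URLs
    post = re.sub(r"@\w+", " ", post)           # remove user mentions
    post = re.sub(r"\S+@\S+", " ", post)        # remove emails
    post = re.sub(r"\d+%?", " ", post)          # remove numbers and percents
    # Split hashtags on CamelCase: "#TrumpIsATraitor" -> "Trump Is A Traitor"
    camel = r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
    post = re.sub(r"#(\w+)", lambda m: re.sub(camel, " ", m.group(1)), post)
    # Collapse elongated words: "haaaaahaa" -> "haha" (this aggressive rule
    # also shortens legitimate double letters, e.g. "good" -> "god")
    post = re.sub(r"(.)\1+", r"\1", post)
    return post.lower().strip()

print(preprocess("@user #TrumpIsATraitor haaaaahaa http://t.co/x 100%"))
# -> "trump is a traitor haha"
```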
Regarding our first approach, DL-NN allows us to try a variety of settings. To capture patterns over word n-grams, we use CNN layers with 256 filters and several kernel sizes (1, 2, 3, and 4) in order to find the optimum. As described in Section 2, we place an LSTM after the CNN layers to capture long-term dependencies; GRU [6] and BiLSTM [4] layers were our first attempts, but the LSTM gave much better results. In short, the architecture is as defined in Section 2: it uses a kernel size of 2 for both CNNs (O1, O2), 128 nodes for the dense layer (O3), and 128 units for the LSTM. We then concatenate the outputs and feed a fully-connected dense layer of 64 nodes. Finally, a layer with two nodes and softmax produces the predictions. The word embeddings have 200 dimensions, and we trained for 50 epochs with the Adam optimizer and a learning rate of 0.01.

In contrast to DL-NN, the NN-AA architecture implements an attention mechanism, which focuses on the important parts (words) of the sentence (post). For the recurrent layer beneath the attention, we experimented with GRU, BiLSTM, and LSTM structures, and the LSTM outperformed the others. Another feature that boosted some of our results was the addition of the POS-tag vector representation. To obtain this feature, we took the main combinations of tags, e.g., verbs or adjectives followed by a noun, which are among the most repeated patterns in the training set, and concatenated the resulting pattern vector with the output of the attention layer. A dropout rate of 0.2 was tried, but the results did not improve. Softmax is then applied to make the predictions.

We carried out many experiments with various setups for the second approach (MA-MO). The same features used by the other two architectures, together with GloVe word embeddings, were tested on the conventional ML models, most of which ran with default parameter settings. The models based on a CNN (128 filters and a kernel size of 2), an SDL (128 nodes), or an MLP (256 nodes) were fed only with the embeddings mentioned above (200 dimensions). Meanwhile, the fine-tuned transformer models used raw tokenized posts as inputs. All the models mentioned have a two-node output layer with a softmax function for generating predictions.

Table 2. Results on the English development set for sub-tasks A, B, and C.

        Sub-task A          Sub-task B          Sub-task C
        F1 macro  F1 weig.  F1 macro  F1 weig.  F1 macro  F1 weig.
DL-NN   0.67091   0.72436   0.28458   0.61162   0.46098   0.70064
NN-AA   0.70324   0.71982   0.35568   0.77016   0.48912   0.73466
MA-MO   0.77912   0.72887   0.54701   0.82878   0.51928   0.80119
ENSEM   0.78093   0.84193   0.55161   0.81019   0.51410   0.83994

Table 2 summarizes the results achieved by each of the proposed approaches. We can observe that DL-NN has the worst performance, and that incorporating the attention layer and the POS vector representation (NN-AA) increases performance. Performance improves significantly again with the variety of models in MA-MO. Finally, we combined the first two architectures with MA-MO to obtain the ENSEM model, which clearly outperforms the others.
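The paper does not detail how the ensemble classifier combines the individual systems, so the sketch below shows one common choice, purely as an assumption: hard majority voting over the per-system labels, breaking ties in favor of the system listed first.

```python
# Hypothetical ensemble by hard majority voting; the actual combination
# rule used for ENSEM is not specified in the paper.
from collections import Counter
from typing import List

def majority_vote(runs: List[List[str]]) -> List[str]:
    """runs[i] is the label sequence predicted by system i for the test set."""
    ensembled = []
    for labels in zip(*runs):
        counts = Counter(labels)
        top, freq = counts.most_common(1)[0]
        if list(counts.values()).count(freq) > 1:
            top = labels[0]  # tie: trust the first (assumed strongest) system
        ensembled.append(top)
    return ensembled

runs = [["HOF", "NOT", "HOF"],   # e.g., MA-MO predictions
        ["HOF", "HOF", "NOT"],   # e.g., NN-AA predictions
        ["NOT", "NOT", "HOF"]]   # e.g., DL-NN predictions
print(majority_vote(runs))       # ['HOF', 'NOT', 'HOF']
```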
3.1 Official Results

As mentioned above, HASOC provides a test set of approximately 1000 unlabeled posts to evaluate our systems. We show the results of this evaluation in Table 3. For each of the sub-tasks, three runs were submitted: NN-AA, ENSEM, and MA-MO as run1, run2, and run3, respectively. We emphasize that the same systems were used to face all three sub-tasks. The proposed systems exceeded our expectations, and the ensemble classifier boosted our position in the ranking.

Table 3. Official ranking for the English sub-tasks A, B, and C.

            Sub-task A          Sub-task B          Sub-task C
            F1 macro  F1 weig.  F1 macro  F1 weig.  F1 macro  F1 weig.
NN-AA.run1  0.65357   0.70063   0.25579   0.58895   0.41692   0.66778
MA-MO.run3  0.74706   0.80711   0.50638   0.75140   0.49399   0.77353
ENSEM.run2  0.75676   0.81820   0.50511   0.75957   0.48799   0.78403
Ranking     3rd of 79           2nd of 50           2nd of 45

4 Conclusions

In this paper, we proposed two approaches to face this shared task. We observe that, in the first approach, the combination of CNNs and RNNs for handling n-grams and long-term dependencies does not by itself produce satisfactory results. This issue is solved when we incorporate attention layers and the POS vector representation; we believe this improvement results from the attention given to important words and to the most repeated POS patterns inside the sentences. The systems of the second approach perform robustly enough; we attribute this to each system capturing specific patterns that the other systems ignore, while the ensemble manages to join all of these patterns. It is also worth mentioning that, for sub-tasks B and C, data augmentation improved the performance. As future work, we believe that more in-depth studies of DL and consideration of the current state of the art can help us improve our results.

Acknowledgments

I would like to express my gratitude to Zheng Gong, whose suggestions were important in writing this paper.

References

1. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (May 2011). https://doi.org/10.1145/1961189.1961199
2. Forum for Information Retrieval Evaluation: FIRE 2019 (2019), http://fire.irsi.res.in/fire/2019/home
3. Gimpel, K., Schneider, N., O'Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., Smith, N.A.: Part-of-speech tagging for Twitter: Annotation, features, and experiments. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2. pp. 42–47. HLT '11, Association for Computational Linguistics, Stroudsburg, PA, USA (2011), http://dl.acm.org/citation.cfm?id=2002736.2002747
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (Nov 1997). https://doi.org/10.1162/neco.1997.9.8.1735
5. Hu, D.: An introductory survey on attention mechanisms in NLP problems. In: Bi, Y., Bhatia, R., Kapoor, S. (eds.) Intelligent Systems and Applications. pp. 432–448. Springer International Publishing, Cham (2020)
6. Jozefowicz, R., Zaremba, W., Sutskever, I.: An empirical exploration of recurrent network architectures. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. pp. 2342–2350. ICML'15, JMLR.org (2015), http://dl.acm.org/citation.cfm?id=3045118.3045367
7. Khoo, C.S., Johnkhan, S.B.: Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science 44(4), 491–511 (2018). https://doi.org/10.1177/0165551517703514
8. Kim, Y.: Convolutional neural networks for sentence classification. CoRR abs/1408.5882 (2014), http://arxiv.org/abs/1408.5882
9. Kůrková, V.: Kolmogorov's theorem and multilayer neural networks. Neural Netw. 5(3), 501–506 (Mar 1992). https://doi.org/10.1016/0893-6080(92)90012-8
10. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (May 2015). https://doi.org/10.1038/nature14539
11. Mikolov, T., Kombrink, S., Deoras, A., Burget, L., Černocký, J.: RNNLM - recurrent neural network language modeling toolkit. In: Proceedings of ASRU 2011. pp. 1–4. IEEE Signal Processing Society (2011), https://www.fit.vut.cz/research/publication/10087
12. Modha, S., Mandl, T., Majumder, P., Patel, D.: Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In: Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation (December 2019)
13. Nina-Alcocer, V.: AMI at IberEval2018: Automatic misogyny identification in Spanish and English tweets. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18th, 2018. pp. 274–279 (2018), http://ceur-ws.org/Vol-2150/AMI_paper8.pdf
14. Nina-Alcocer, V.: HATERecognizer at SemEval-2019 Task 5: Using features and neural networks to face hate recognition. In: Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2019, Minneapolis, MN, USA, June 6–7, 2019. pp. 409–415 (2019), https://www.aclweb.org/anthology/S19-2072/
15. Nina-Alcocer, V., González, J., Hurtado, L., Pla, F.: Aggressiveness detection through deep learning approaches. In: Proceedings of the Iberian Languages Evaluation Forum, co-located with the 35th Conference of the Spanish Society for Natural Language Processing, IberLEF@SEPLN 2019, Bilbao, Spain, September 24th, 2019. pp. 544–549 (2019), http://ceur-ws.org/Vol-2421/MEX-A3T_paper_9.pdf
16. Padró, L., Stanilovsky, E.: FreeLing 3.0: Towards wider multilinguality. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012). ELRA, Istanbul, Turkey (May 2012)
17. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pp. 1532–1543 (2014), https://nlp.stanford.edu/projects/glove/
18. Rish, I.: An empirical study of the naive Bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence. vol. 3, pp. 41–46. IBM, New York (2001)
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
20. Wright, R.E.: Logistic regression. In: Reading and Understanding Multivariate Statistics, pp. 217–244. American Psychological Association, Washington, DC, US (1995)
21. Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of CNN and RNN for natural language processing. CoRR abs/1702.01923 (2017), http://arxiv.org/abs/1702.01923