UR NLP @ HaSpeeDe 2 at EVALITA 2020: Towards Robust Hate Speech Detection with Contextual Embeddings

Julia Hoffmann (University of Regensburg) Julia1.Hoffmann@ur.de
Udo Kruschwitz (University of Regensburg) Udo.Kruschwitz@ur.de

Abstract

We describe our approach to address Task A of the EVALITA 2020 Hate Speech Detection (HaSpeeDe 2) challenge. We submitted two runs that are both based on contextual embeddings, which we chose due to their effectiveness in solving a wide range of NLP problems. For our baseline run we use stacked embeddings that serve as features in a linear SVM. Our second run is a simple ensemble of three SVMs with majority voting. Both approaches outperform the official baselines by a large margin, and the ensemble classifier in particular demonstrates robust performance on different types of test data, coming 6th (out of 27 runs) for news headlines and 10th (out of 27) for Twitter feeds.

1 Introduction

Hate speech in social media (and its automatic detection) has become a major problem in recent years. It can be generically defined as "language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group" (Davidson et al., 2017) and is often based on aspects like race, religion, ethnicity, and gender. The problem is that what is considered acceptable for some might not be for others. In addition, there is a fine line between freedom of expression on the one hand and censorship and illegal discrimination on the other (Zimmerman et al., 2018). In fact, this fine balance is reflected in the fundamental human rights (as outlined in articles 19 and 20 of (The United Nations, 1948) and (The United Nations General Assembly, 1966)), which simultaneously provide rights to freedom of expression and prevent censorship and illegal discrimination. All this contributes to making the automatic detection of hate speech a challenging task.

Nevertheless, social media platforms such as Twitter have defined clear guidelines prohibiting hateful conduct.[1] Accounts with such content can be reported and are subsequently deleted. The challenge is to be able to detect such content automatically with both high precision and high recall.

The EVALITA evaluation campaign introduced a hate speech detection challenge applied to Italian social media in 2018 (Bosco et al., 2018). Its success led to the continuation of the challenge in 2020, now called HaSpeeDe 2, which is split into three subtasks (Sanguinetti et al., 2020). This report discusses the two runs that we submitted to HaSpeeDe 2 Task A of EVALITA 2020 (Basile et al., 2020). We will first give some background on the problem aimed at motivating our choice of approach. We will then introduce our systems, report results and discuss some findings. We will also outline some scope for future developments.

[1] https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Background

We will provide some background that should motivate the system architectures we developed. There are several aspects to be mentioned here.

First of all, given the impressive advances in a broad range of natural language processing tasks using transformer-based architectures (Vaswani et al., 2017) capturing contextual embeddings – most prominently utilizing the various flavours of BERT (Devlin et al., 2019) – we decided to adopt a transformer architecture as well. There are two ways language models such as BERT can be used: pre-training plus fine-tuning, or feature-based without fine-tuning.

This leads us to the next design decision.
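To make the feature-based option concrete, the following minimal sketch trains only a light-weight classifier on top of a frozen encoder. This is illustrative rather than our actual implementation: the `encode` function is a stand-in for a frozen language model (in our system that role is played by document embeddings over XLM-R), and all names, labels and dimensions are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def encode(texts):
    """Stand-in for a frozen contextual encoder (e.g. XLM-R).

    Returns one fixed-size vector per document; the encoder itself is never
    updated, which is what 'feature-based without fine-tuning' means."""
    rng = np.random.RandomState(0)
    return rng.randn(len(texts), 16)  # placeholder 16-dimensional "embeddings"

texts = ["esempio uno", "esempio due", "esempio tre", "esempio quattro"]
labels = [0, 1, 0, 1]  # 0 = no hate speech, 1 = hate speech

X = encode(texts)                 # features come from the frozen model ...
clf = LinearSVC().fit(X, labels)  # ... and only the SVM is trained
```

In the fine-tuning route, by contrast, the encoder's own weights would also be updated on the task data.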
The winning team in the 2018 HaSpeeDe competition, ItaliaNLP, submitted as one of their runs an SVM with three different feature categories – raw and lexical text, morpho-syntactic features, and lexicon features – which performed extremely well, in particular when trained and tested on Twitter data (Cimino et al., 2018). Rather than designing an end-to-end neural architecture that would be fine-tuned on the available training data, we therefore opted for a simpler and slightly more transparent architecture with an SVM backbone as our classifier, i.e. the feature-based approach mentioned above.

Ensemble methods have repeatedly been shown to outperform individual classifiers for a variety of tasks including hate speech detection. For example, an ensemble of ten simple neural classifiers proposed by (Zimmerman et al., 2018) outperformed a BERT-based approach on the standard HatebaseTwitter benchmark dataset (MacAvaney et al., 2019). Other recent examples that demonstrate the effectiveness of ensemble methods for hate speech detection include (Alonso et al., 2020; Nourbakhsh et al., 2019; Seganti et al., 2019; Zampieri et al., 2020; Badjatiya et al., 2017; Park and Fung, 2017). We should add that these findings are not limited to the area of hate speech detection, as ensemble methods have a long history of being successfully utilized in a broad range of machine learning approaches, e.g. (Molteni et al., 1996). Simple but effective ensemble approaches have also been used, for example, in sentiment classification of tweets, e.g. (Hagen et al., 2015), and other social media classification tasks.

Finally, given the task definition – in which the classifier was to be trained on social media data but then tested on both social media and news headlines – we were aiming at an approach that would perform robustly across domains rather than being tailored specifically to one type of data.

One additional motivation for our work is the intention to develop approaches that can be applied to different languages (we will get back to that point when we outline future directions).

We will now demonstrate how those motivating considerations lead to the system architecture we propose.

3 System Architecture

We submitted two runs, of which the first one can be considered our own baseline approach. We first present both architectures at a conceptual level and will go into the technical details when we discuss the experimental setup in the next section. Our runs are:

• Model 1: Stacked embeddings as features of a linear SVM

• Model 2: Ensemble of several SVMs with different text representations – both contextual embeddings and TF-IDF-based.

Both models can be realised in many different ways. The core idea, as motivated before, is to experiment with transformer-based contextual embeddings but to avoid fine-tuning and instead deploy a traditional, more transparent SVM classifier. The ensemble can consist of a variety of different systems that can be aggregated in many ways. In this paper (and as submitted) we treat each system as equally important and use a simple majority vote.

Stacked embeddings have been shown to be effective in NLP applications, e.g. (Akbik et al., 2018; Akbik et al., 2019). Conceptually there is some similarity to ensemble approaches in that a combination of differently derived embedding models turns out to be more effective than each approach individually.

3.1 Model 1: Stacked embeddings + SVM

Our own baseline model combines two different document embeddings – transformer document embeddings and document pool embeddings – which are then fed into a linear SVM to train a classifier. We keep the architecture deliberately simple.

There is a wide range of transformer-based language models. One of our motivations was to train a classifier that generalises beyond a specific domain but also has the potential to generalise beyond a specific language. We therefore opted for XLM-RoBERTa (XLM-R), which has been shown to outperform alternative multilingual models such as mBERT in various NLP tasks (Conneau et al., 2020). XLM-R builds on XLM and RoBERTa and is trained on data covering 100 languages from a very large (2TB) CommonCrawl corpus. Transformer document embeddings are obtained from (the large version of) XLM-R. In addition we use document pool embeddings, which pool word embeddings obtained with Flair (Akbik et al., 2019). The exact experimental choices are described further down.

3.2 Model 2: Ensemble of SVMs

Our second system is an ensemble classifier consisting of three SVMs, each trained on a different text representation, namely:

• Transformer document embeddings using XLM-R

• Document pool embeddings

• Straightforward TF-IDF.

The first two of these are exactly the same as in Model 1, except that they are not stacked but fed into different classifiers. Again the general setup is kept simple to avoid overfitting to the specific problem at hand, thereby allowing more scope for future experiments.

4 Experimental Setup

We applied our systems to Task A – Hate Speech Detection (Main Task).

4.1 Data Sets

Training and test data are briefly described here.

• Training Data Set: the training data set consists of 6,839 tweets in total, 2,766 of them classified as hate speech. The corpus has three columns: tweet ID, text and label (0 = no hate speech, 1 = hate speech). Table 1 summarises these numbers.

Label   Training Data Set
0       4,073
1       2,766
Total   6,839

Table 1: Training Data

• Test Data Set: unlike the training data, which consists entirely of tweets, there were two sets of test data, the first one sampled from Twitter and the second one from news headlines. The Twitter test set has 1,263 entries in total, the news test set 500. The two columns in both sets are the ID and the text of the tweet or news headline, respectively. The classes 0 and 1 in the Twitter test set comprise 641 and 622 tweets, respectively. In the news headline test set, 319 entries have the label 0 and 181 the label 1 (see Table 2).

Label   Twitter Test Set   News Test Set
0       641                319
1       622                181
Total   1,263              500

Table 2: Test Data

4.2 Data Preprocessing

In line with our overall aim of simplicity and generalisability (rather than tuning) we applied a simple pre-processing pipeline that applies to Twitter data as well as news headlines. There are only small variations in the normalization steps, as follows.

For any embedding-based processing the text was lower-cased and punctuation was removed, so that any input, be it tweet or news headline, is represented as a string of unpunctuated tokens. For the calculation of our (sparse) TF-IDF representation the text was tokenized and, in addition, stopwords were removed. After that each token was vectorized using TF-IDF. Figure 1 shows an overview of the preprocessing.

Figure 1: Data Preprocessing

4.3 Implementation

All implementation was done in Python. For all text and document embeddings we used flairNLP.[2] Our SVMs were developed using scikit-learn (Pedregosa et al., 2011), and for the preprocessing of the TF-IDF version and the TF-IDF calculation we used NLTK[3] and scikit-learn.

Stacked embeddings + SVM: as outlined, we use stacked embeddings composed of Transformer Document and Document Pool Embeddings. The Transformer Document Embeddings are obtained using XLM-R. The Document Pool Embeddings are calculated using mean-pooling over all word embeddings; these consist of forward and backward embeddings for the Italian language as provided by flair (Akbik et al., 2018) and as recommended. An overview is given in Figure 2.

Figure 2: Embeddings in our Baseline (Model 1)

Flair allows for the easy combination of embeddings to create stacked embeddings – one for each input text. These vectors (together with the labels) are then used to train the SVM. Using grid search on the training data the most suitable parameter settings were determined, and Table 3 specifies the settings used in the submitted run.

Parameter   Value
C           1.0
kernel      'linear'
degree      3
gamma       1

Table 3: Parameters of the SVM (Baseline)

Ensemble of SVMs: three different feature representations are used to train one SVM each, as illustrated in Table 4. The first two incorporate the same representations as already seen in Figure 2.

Classifier   Features
SVM2.1       Transformer Document Embeddings
SVM2.2       Document Pool Embeddings
SVM2.3       TF-IDF

Table 4: Overview of SVM Ensemble

Again we used grid search for parameter tuning (see Table 5).

Parameter   SVM2.1     SVM2.2     SVM2.3
C           1.0        1.0        1.0
kernel      'linear'   'linear'   'rbf'
degree      3          3          3
gamma       1          1          1

Table 5: Parameters of the SVMs for Model 2 (Ensemble of SVMs)

Input is run against each classifier, and the final classification category is determined by majority voting over these three predictions.

[2] https://github.com/flairNLP/flair
[3] https://www.nltk.org

5 Results

We first present detailed results and then discuss our findings and insights. We start with our baseline approach and then move on to the classifier ensemble. Macro-F1 is the official metric for this competition. In addition we report Precision, Recall and F1 at category level and include confusion matrices for each approach (Model 1 and Model 2) and test set (Twitter data and news headlines). There were 27 runs submitted for each dataset, and the official baseline was a linear SVM with TF-IDF over word and character n-grams.

5.1 Model 1: Our Baseline

Twitter Data: Training and testing on Twitter data results in a Macro-F1 score of 0.7399, which corresponds to position 16 (out of 27). The official task baseline is 0.7212. Details are displayed in Table 6 and Figure 3.

Metric      0        1
Precision   0.7722   0.7137
Recall      0.6927   0.7894
F1          0.7303   0.7496

Table 6: Results: Model 1 (Stacked embeddings + SVM) on Twitter Data

Figure 3: Confusion Matrix: Model 1 (Stacked embeddings + SVM) on Twitter Data (p = predicted, t = true)

News Headlines: On the news headlines test data we obtain a Macro-F1 of 0.6684 against an official baseline of 0.6210 (rank 12). More details are in Table 7 and Figure 4.

Metric      0        1
Precision   0.7356   0.6780
Recall      0.8809   0.4420
F1          0.8017   0.5351

Table 7: Results: Model 1 (Stacked embeddings + SVM) on News Data

Figure 4: Confusion Matrix: Model 1 (Stacked embeddings + SVM) on News Data (p = predicted, t = true)

5.2 Model 2: Ensemble

Twitter Data: Our ensemble approach achieves a Macro-F1 of 0.7599 (rank 10). More details are included in Table 8 and Figure 5.

Metric      0        1
Precision   0.7894   0.7349
Recall      0.7192   0.8023
F1          0.7527   0.7671

Table 8: Results: Model 2 (Ensemble of SVMs) on Twitter Data

Figure 5: Confusion Matrix: Model 2 (Ensemble of SVMs) on Twitter Data (p = predicted, t = true)

News Headlines: On the news headlines test data we obtain a Macro-F1 of 0.6984 against an official baseline of 0.6210 (rank 6). More details can be found in Table 9 and Figure 6.

Metric      0        1
Precision   0.7445   0.8280
Recall      0.9498   0.4254
F1          0.8347   0.5620

Table 9: Results: Model 2 (Ensemble of SVMs) on News Data

Figure 6: Confusion Matrix: Model 2 (Ensemble of SVMs) on News Data (p = predicted, t = true)

6 Discussion

The first observation we derive from the results is that the ensemble approach we proposed for this task provides robust and solid performance – solid in that it scores well in the ranked list of systems, and robust in that it also ranks highly when applied to out-of-domain data (coming 6th out of 27 submitted runs on data it had not been trained on).
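For concreteness, the core of the majority-voting ensemble (Model 2) can be sketched as follows. This is a minimal sketch, not our exact submission code: the three feature matrices are random stand-ins for the transformer, pool and TF-IDF representations, while the SVM hyperparameters mirror Table 5.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 100)  # binary hate-speech labels
# Three stand-in feature views (transformer, pool and TF-IDF in the real system)
views = [rng.randn(100, 8) + y[:, None] for _ in range(3)]

# SVM2.1 and SVM2.2 use a linear kernel, SVM2.3 an RBF kernel (cf. Table 5)
clfs = [SVC(C=1.0, kernel="linear").fit(views[0], y),
        SVC(C=1.0, kernel="linear").fit(views[1], y),
        SVC(C=1.0, kernel="rbf", gamma=1).fit(views[2], y)]

# Each classifier votes on its own view; two or more of three votes decide
votes = np.stack([clf.predict(X) for clf, X in zip(clfs, views)])
majority = (votes.sum(axis=0) >= 2).astype(int)
```

Because the three classifiers are treated as equally important, no weighting or confidence calibration is needed; a tie is impossible with an odd number of voters.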
Given the simplicity of our system architecture and the composition of the official baseline system, we also note the superiority of transformer-based contextual embeddings over bag-of-words approaches (while this comes as no surprise, it is still worth pointing out). Moving from a feature-based to a pre-training plus fine-tuning approach would almost certainly push the scores up further.

Looking at the balance between precision and recall, we find that both our approaches have a tendency to return a fair number of false positives for the Twitter data set. This could indicate that words and phrases used to express hateful content are quite common in social media even when they do not actually represent hate speech. On the other hand, we record a large proportion of false negatives when classifying news headlines. This could be an indicator of a more subtle way in which hate speech is expressed in traditional news outlets.

Generally speaking, both models perform better on Twitter data than on news headlines – again an insight that was to be expected given the training data. However, the fact that our approach managed to score higher in the ranked list of systems for data it was not trained on confirms our initial assumption: that using a corpus with a very broad range of topics, styles and languages as our core language model would help the system transfer more easily to unseen input.

This leads us to an area of future research. While it would be possible to improve the performance of our system by making the preprocessing, the language model and any fine-tuning step match the expected test data more closely – e.g. by using AlBERTo, a BERT-based transformer trained on Italian Twitter data (Polignano et al., 2019) – we are actually aiming at something else. As part of the COURAGE research project[4] we are exploring ways to help teenagers manage social media exposure by providing a virtual companion that would, among other things, automatically identify examples of hate speech, bullying or other toxic content. Given that this is a multi-national effort, we are interested in architectures that work for languages including Italian, Spanish, German and English with as little fine-tuning as possible. The ensemble introduced here, with its multilingual transformer backbone, is a step in that direction.

[4] https://www.upf.edu/web/courage

7 Conclusion

We presented a simple but effective architecture to detect hate speech in Italian social media and news headlines. Our ensemble-based architecture relies on contextual embeddings trained on a large multilingual corpus, which we see as the basis for the robustness of the approach. There is plenty of room for further improvement, and the results we report here will serve as a benchmark in this development.

Acknowledgements

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation, grant number 95564.

References

A. Akbik, D. Blythe, and R. Vollgraf. 2018. Contextual string embeddings for sequence labeling. In E. M. Bender, L. Derczynski, and P. Isabelle, editors, Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1638–1649. Association for Computational Linguistics.

A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Demonstrations, pages 54–59, Minneapolis, Minnesota, June. Association for Computational Linguistics.

P. Alonso, R. Saini, and G. Kovács. 2020. Hate Speech Detection Using Transformer Ensembles on the HASOC Dataset. In A. Karpov and R. Potapova, editors, Speech and Computer, pages 13–21, Cham. Springer International Publishing.

P. Badjatiya, S. Gupta, M. Gupta, and V. Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

V. Basile, D. Croce, M. Di Maro, and L. C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

C. Bosco, F. Dell'Orletta, F. Poletto, M. Sanguinetti, and M. Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In T. Caselli, N. Novielli, V. Patti, and P. Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

A. Cimino, L. De Mattei, and F. Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In T. Caselli, N. Novielli, V. Patti, and P. Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, editors, Proceedings of ACL, pages 8440–8451. Association for Computational Linguistics.

T. Davidson, D. Warmsley, M. W. Macy, and I. Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017, pages 512–515. AAAI Press.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

M. Hagen, M. Potthast, M. Büchner, and B. Stein. 2015. Webis: An ensemble for twitter sentiment detection. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 582–589.

S. MacAvaney, H. Yao, E. Yang, K. Russell, N. Goharian, and O. Frieder. 2019. Hate speech detection: Challenges and solutions. PLoS ONE, 14:1–16.

F. Molteni, R. Buizza, T. N. Palmer, and T. Petroliagis. 1996. The ECMWF ensemble prediction system: Methodology and validation. Quarterly Journal of the Royal Meteorological Society, 122(529):73–119.

A. Nourbakhsh, F. Vermeer, G. Wiltvank, and R. van der Goot. 2019. sthruggle at SemEval-2019 task 5: An ensemble approach to hate speech detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 484–488, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.

J. H. Park and P. Fung. 2017. One-step and two-step classification for abusive language detection on twitter. In Proceedings of The First Workshop on Abusive Language Online, pages 41–45. Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, and V. Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In R. Bernardi, R. Navigli, and G. Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

M. Sanguinetti, G. Comandini, E. Di Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, and I. Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

A. Seganti, H. Sobol, I. Orlova, H. Kim, J. Staniszewski, T. Krumholc, and K. Koziel. 2019. NLPR@SRPOL at SemEval-2019 task 6 and task 5: Linguistically enhanced deep learning offensive sentence classifier. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 712–721, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.

The United Nations. 1948. Universal Declaration of Human Rights. The United Nations, December.

The United Nations General Assembly. 1966. International covenant on civil and political rights. Treaty Series, 999:171, December.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, USA. Curran Associates Inc.

M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin. 2020. SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). CoRR, abs/2006.07235.

S. Zimmerman, U. Kruschwitz, and C. Fox. 2018. Improving Hate Speech Detection with Deep Learning Ensembles. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).