=Paper= {{Paper |id=Vol-2481/paper22 |storemode=property |title=Cross-Platform Evaluation for Italian Hate Speech Detection |pdfUrl=https://ceur-ws.org/Vol-2481/paper22.pdf |volume=Vol-2481 |authors=Michele Corazza,Stefano Menini,Elena Cabrio,Sara Tonelli,Serena Villata |dblpUrl=https://dblp.org/rec/conf/clic-it/CorazzaMCTV19 }} ==Cross-Platform Evaluation for Italian Hate Speech Detection== https://ceur-ws.org/Vol-2481/paper22.pdf
            Cross-Platform Evaluation for Italian Hate Speech Detection
                                  Michele Corazza† , Stefano Menini‡ ,
                             Elena Cabrio† , Sara Tonelli‡ , Serena Villata†
                           † Université Côte d’Azur, CNRS, Inria, I3S, France
                              ‡ Fondazione Bruno Kessler, Trento, Italy
                                   michele.corazza@inria.fr
                                   {menini,satonelli}@fbk.eu
                         {elena.cabrio,serena.villata}@unice.fr

                        Abstract

English. Despite the number of approaches recently proposed in NLP for detecting abusive language on social networks, the issue of developing hate speech detection systems that are robust across different platforms is still an unsolved problem. In this paper we perform a comparative evaluation on datasets for hate speech detection in Italian, extracted from four different social media platforms, i.e. Facebook, Twitter, Instagram and WhatsApp. We show that combining such platform-dependent datasets to take advantage of training data developed for other platforms is beneficial, although their impact varies depending on the social network under consideration.1

Italiano (English translation). Despite growing interest in NLP approaches that identify offensive language on social networks, the need to develop systems that maintain good performance across different platforms is still an open research topic. In this contribution we present a comparative evaluation on datasets for hate speech detection coming from four different platforms: Facebook, Twitter, Instagram and WhatsApp. The study shows that combining different datasets to increase the amount of training data improves classification performance, although the impact varies depending on the platform considered.

    1 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

Given the well-acknowledged rise in the presence of toxic and abusive speech on social media platforms like Twitter and Facebook, there have been several efforts within the Natural Language Processing community to deal with this problem, since the computational analysis of language can be used to quickly identify offenses and ease the removal of abusive messages. Several workshops (Waseem et al., 2017; Fišer et al., 2018) and evaluation campaigns (Fersini et al., 2018; Bosco et al., 2018; Wiegand et al., 2018) have recently been organized to discuss existing approaches to hate speech detection, propose shared tasks and foster the development of benchmarks for system evaluation.

However, most of the available datasets and approaches for hate speech detection proposed so far concern the English language, and even more frequently they target a single social media platform (mainly Twitter). In low-resource scenarios it is therefore common to have smaller datasets for specific platforms, raising research questions such as: would it be advisable to combine such platform-dependent datasets to take advantage of training data developed for other platforms? Should such data just be added to the training set, or should they be selected in some way? And what happens if training data are available only for one platform and not for the others?

In this paper we address all the above questions focusing on hate speech detection for Italian. After identifying a modular neural architecture that is rather stable and well-performing across different languages and platforms (Corazza et al., to appear), we perform our comparative evaluation on freely available datasets for hate speech detection in Italian, extracted from four different social media platforms, i.e. Facebook, Twitter, Instagram and WhatsApp. In particular, we
test the same model while altering only some features and pre-processing aspects. Besides, we use a multi-platform training set but test on data taken from the single platforms. We show that the proposed solution of combining platform-dependent datasets in the training phase is beneficial for all platforms but Twitter, for which the results obtained by training on tweets only outperform those obtained by training on the mixed dataset.

2   Related work

In 2018, the first Hate Speech Detection (HaSpeeDe) task for Italian (Bosco et al., 2018) was organized at EVALITA 2018,2 the evaluation campaign for NLP and speech processing tools for Italian. The task consists in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or not) of hate speech. Two cross-platform tasks (Cross-HaSpeeDe) were also proposed, where the training was done on platform-specific data (Facebook or Twitter) and the test on data from the other platform (Twitter or Facebook). In general, as expected, results obtained for Cross-HaSpeeDe were lower compared to those obtained for the in-domain tasks, due to the heterogeneous nature of the datasets provided for the task, both in terms of class distribution and data composition. Indeed, not only are Facebook posts in the task dataset longer, but they are also on average more likely to contain hate speech (68% hate posts in the Facebook test set vs. 32% in the Twitter one). This led to a performance drop, with the best system scoring 0.8288 F1 on in-domain Facebook data and 0.6068 when the same model is tested on Twitter data (Cimino et al., 2018).

The best performing systems on the cross-tasks were ItaNLP (Cimino et al., 2018) when training on Twitter data and testing on Facebook, and Inria-FBK (Corazza et al., 2018) in the other configuration. The former adopts a newly-introduced approach based on a 2-layer BiLSTM which exploits multi-task learning with additional data from the 2016 SENTIPOLC task.3 The latter, instead, uses a simple recurrent model with one hidden layer of size 500, a GRU of size 200 and no dropout.

The Cross-HaSpeeDe tasks and the analysis of system performance in a cross-platform scenario are the starting point of this study. The task summary presented in (Bosco et al., 2018) listed some remarks on the elements affecting system robustness that led us to extend the cross-platform experiments to new platforms, including also WhatsApp and Instagram data. To our knowledge, there have been no attempts to develop Italian systems for hate speech detection on these two platforms, probably because of the lack of suitable datasets. We therefore annotate our own Instagram data for the task, while we take advantage of a recently developed dataset for cyberbullying detection to test our system on WhatsApp.

3   Data and linguistic resources

In the following, we present the datasets used to train and test our system and their annotations (Section 3.1). Then, we describe the word embeddings (Section 3.2) we have used in our experiments.

3.1   Datasets

Twitter dataset: released for the HaSpeeDe (Hate Speech Detection) shared task organized at EVALITA 2018. This dataset includes a total amount of 4,000 tweets (2,704 negative and 1,296 positive instances, i.e. containing hate speech), comprising for each tweet the respective annotation, as can be seen in Example 1. The two classes considered in the annotation are “hateful post” or “not”.

  1. Annotation: hateful.
     altro che profughi? sono zavorre e tutti uomini (EN: other than refugees? they are ballast and all men).

Facebook dataset: also released for the HaSpeeDe (Hate Speech Detection) shared task. It consists of 4,000 Facebook comments collected from 99 posts crawled from web pages (1,941 negative and 2,059 positive instances), comprising for each comment the respective annotation, as can be seen in Example 2. The two classes considered in the annotation are “hateful post” or “not”.

  2. Annotation: hateful.
     Matteo serve un colpo di stato. Qua tra poco dovremo andare in giro tutti armati come in America. (EN: Matteo, we need a coup. Soon we will have to go around armed as in the U.S.).

    2 http://www.evalita.it/2018
    3 http://www.di.unito.it/~tutreeb/sentipolc-evalita16/index.html
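As a quick sanity check, the class balance of the two HaSpeeDe datasets above can be recomputed from the reported counts. This is an illustrative sketch, not part of the original system; the dictionary layout and variable names are ours:

```python
# Positive/negative counts reported for the two HaSpeeDe datasets.
counts = {
    "Twitter":  {"negative": 2704, "positive": 1296},
    "Facebook": {"negative": 1941, "positive": 2059},
}

for platform, c in counts.items():
    total = c["negative"] + c["positive"]
    share = c["positive"] / total
    print(f"{platform}: {total} messages, {share:.1%} hateful")
# Twitter: 4000 messages, 32.4% hateful
# Facebook: 4000 messages, 51.5% hateful
```

The much higher positive rate of the Facebook data is one of the class-distribution differences discussed in Section 2.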
WhatsApp dataset: collected to study pre-teen cyberbullying (Sprugnoli et al., 2018). This dataset was collected through a WhatsApp experimentation with Italian lower secondary school students and contains 10 chats, subsequently annotated according to different dimensions such as the roles of the participants (e.g. bully, victim) and the presence of cyberbullying expressions in the message, distinguishing between different classes of insults, discrimination, sexual talk and aggressive statements. The annotation is carried out at token level. To create additional training instances for our model, we join subsequent sentences by the same author (to avoid cases in which the user writes one word per message), resulting in 1,640 messages (595 positive instances). We consider as positive instances of hate speech the ones in which at least one token was annotated as a cyberbullying expression, as in Example 3.

  3. Annotation: cyberbullying expression.
     fai schifo, ciccione! (EN: you suck, fat guy).

Instagram dataset: includes a total amount of 6,710 messages, which we randomly collected from Instagram focusing on students’ profiles (6,510 negative and 200 positive instances) identified through the monitoring system described in (Menini et al., 2019). Since no Instagram datasets in Italian were available, and we wanted to include this platform in our study, we manually annotated the messages as “hateful post” (as in Example 4) or “not”.

  4. Annotation: hateful.
     Sei una troglodita (EN: you are a caveman).

3.2   Word Embeddings

In our experiments we test two types of embeddings, with the goal of comparing generic embeddings with social media-specific ones. In both cases, we rely on Fasttext embeddings (Bojanowski et al., 2017), since they include both word and subword information, tackling the issue of out-of-vocabulary words, which are very common in social media data:

  • Generic embeddings: we use embedding spaces obtained directly from the Fasttext website4 for Italian. In particular, we use the Italian embeddings trained on Common Crawl and Wikipedia (Grave et al., 2018) with size 300. A binary Fasttext model is also available and was therefore used;

  • Domain-specific embeddings: we trained Fasttext embeddings on a sample of Italian tweets (Basile and Nissim, 2013), with an embedding size of 300. We used the binary version of the model.

    4 https://fasttext.cc/docs/en/crawl-vectors.html

4   System Description

Since our goal is to compare the effect of various features, word embeddings and pre-processing techniques on hate speech detection applied to different platforms, we use a modular neural architecture for binary classification that is able to support both word-level and message-level features. The components are chosen to support the processing of social media-specific language.

4.1   Modular neural architecture

We use a modular neural architecture (see Figure 1) implemented in Keras (Chollet and others, 2015). The architecture that constitutes the base for all the different models uses a single feed-forward hidden layer of 500 neurons with a ReLU activation and a single output with a sigmoid activation. The loss used to train the model is binary cross-entropy. We choose this particular architecture because it showed good performance in the EVALITA shared task for cross-platform hate speech detection, as well as in other hate speech detection tasks for German and English (Corazza et al., to appear). The architecture is built to support both word-level (i.e. embeddings) and message-level features. In particular, we use a recurrent layer to learn an encoding (xn in the figure) derived from the word embeddings, obtained as the output of the recurrent layer at the last timestep. This encoding is then concatenated with the other selected features, obtaining a vector of message-level features.

[Figure 1 diagram omitted: an RNN reads the word embeddings; its last-timestep encoding xn is concatenated (⊕) with the other message-level features x1, …]

Figure 1: Modular neural architecture for Italian hate speech detection
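The architecture above can be sketched with the Keras functional API. This is a minimal illustration, not the authors’ released code: the GRU of size 200, the 500-unit ReLU hidden layer, the sigmoid output and the binary cross-entropy loss follow the text, while the maximum sequence length, the number of message-level features and the optimizer are assumptions of ours:

```python
from tensorflow import keras
from tensorflow.keras import layers

EMB_DIM = 300   # Fasttext embedding size used in the paper
MAX_LEN = 50    # assumed maximum message length (not specified in the paper)
N_EXTRA = 6     # assumed number of message-level features

# Word-level input: a fixed-length sequence of precomputed embedding vectors.
words_in = keras.Input(shape=(MAX_LEN, EMB_DIM), name="word_embeddings")
# Message-level input: e.g. the social-media counts or a Hurtlex score.
extra_in = keras.Input(shape=(N_EXTRA,), name="message_features")

# GRU of size 200; its output at the last timestep encodes the message (xn).
encoding = layers.GRU(200)(words_in)
# Concatenate the recurrent encoding with the message-level features.
merged = layers.concatenate([encoding, extra_in])
# Single feed-forward hidden layer of 500 neurons with ReLU, no dropout.
hidden = layers.Dense(500, activation="relu")(merged)
# Single sigmoid output for the binary hate / non-hate decision.
output = layers.Dense(1, activation="sigmoid")(hidden)

model = keras.Model([words_in, extra_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The recurrent-layer size and the absence of dropout match the fixed hyper-parameters reported later in Section 5.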
4.2   Preprocessing

The language used on social media platforms has some peculiarities with respect to standard language, as for example the presence of URLs, “@” user mentions, emojis and hashtags. We therefore run the following pre-processing steps:

  • URL and mention replacement: both URLs and mentions are replaced by the strings “URL” and “username” respectively;

  • Hashtag splitting: since hashtags often provide important semantic content, we wanted to test how splitting them into single words would impact the performance of the classifier. To this end, we use the Ekphrasis tool (Baziotis et al., 2017) to do hashtag splitting and evaluate the classifier performance with and without splitting. Since the aforementioned tool only supports English, it has been adapted to Italian by using language-specific Google ngrams.5

    5 http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

4.3   Features

We experiment with the following features:

  • Word embeddings: we evaluate the contribution of word embeddings extracted from social media data, compared with the performance obtained using generic embedding spaces, as described in Section 3.2;

  • Emoji transcription: we evaluate the impact of keeping emojis or transcribing them in plain text. To this purpose, we use the official plain-text descriptions of the emojis (from the Unicode Consortium website), translated to Italian with Google Translate and then manually corrected, as a substitute for emojis;

  • Hurtlex: we assess the impact of using a lexicon of hurtful words (Bassignana et al., 2018), created starting from the Italian hate lexicon developed by the linguist Tullio De Mauro and organized in 17 categories. This is used to associate a ‘hurtfulness’ score to each message;

  • Social media-specific features: we consider a number of metrics related to the language used on social media platforms. In particular, we measure the number of hashtags and mentions, the number of exclamation and question marks, the number of emojis, and the number of words written in uppercase.

5   Experimental Setup

In order to be able to compare the results obtained while experimenting with different training datasets and features, we used fixed hyper-parameters, derived from our best submission at EVALITA 2018 for the cross-platform task that involved training on Facebook data and testing on Twitter. In particular, we used a GRU (Cho et al., 2014) of size 200 as the recurrent layer and we applied no dropout to the feed-forward layer. Additionally, we used the provided test set for the two EVALITA tasks, using 20% of the development set for validation. For Instagram and WhatsApp, since no standard test set is available, we split the whole dataset using 60% of it for training, while the remaining 40% is split in half and used for validation and testing. For this purpose, we use the train_test_split function provided by sklearn (Pedregosa et al., 2011), using 42 as the seed for the random number generator.

One of our goals was to establish whether merging data from multiple social media platforms can improve performance on single-platform test sets. In particular, we used the following datasets for training:

  • Multi-platform: we merge all the datasets mentioned in Section 3 for training;

  • Multi-platform filtered by length: we use the same datasets mentioned before, but only consider instances with a length lower than or equal to 280 characters, ignoring URLs and user mentions. This was done to match Twitter length restrictions;

  • Same platform: for each of the datasets, we train and test the model on data from the same platform.

In addition to the experiments performed on different datasets, we also compare the system performance obtained by using different embeddings. In particular, we train the system by using Italian Fasttext word embeddings trained on Common Crawl and Wikipedia, and Fasttext word embeddings trained by us on a sample of Italian tweets
  Platform    Training set              Embeddings  Features  Emoji transcription  F1 no hate  F1 hate  Macro AVG
  Instagram   Multi-platform            Twitter     Social    Yes                    0.984      0.432     0.708
              Single platform           Twitter     Social    Yes                    0.981      0.424     0.702
  Facebook    Multi-platform            Twitter     Social    Yes                    0.773      0.871     0.822
              Single platform           Twitter     Social    Yes                    0.733      0.892     0.812
  WhatsApp    Multi-platform            Twitter     Social    Yes                    0.852      0.739     0.796
              Single platform           Twitter     Social    Yes                    0.814      0.694     0.754
  Twitter     Single platform           Twitter     Hurtlex   No                     0.879      0.717     0.798
              Filtered multi-platform   Twitter     Hurtlex   No                     0.858      0.720     0.789
              Multi-platform            Twitter     Hurtlex   No                     0.851      0.712     0.782

                                         Table 1: Classification results
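The “Macro AVG” column in Table 1 is the unweighted mean of the two per-class F1 scores; a quick check on the first row (an illustrative sketch, with the values taken from the table):

```python
# Per-class F1 scores from the first Table 1 row (Instagram, multi-platform).
f1_no_hate = 0.984
f1_hate = 0.432

# Macro-averaged F1: the unweighted mean of the per-class scores.
macro_avg = (f1_no_hate + f1_hate) / 2
print(round(macro_avg, 3))  # 0.708
```

Because the macro average weights both classes equally, it penalizes the weak “hate” class on the strongly imbalanced Instagram data despite the very high “no hate” F1.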

(Basile and Nissim, 2013), with an embedding size of 300. As described in Section 4.3, we also train our models including either social-media or Hurtlex features. Finally, we compare classification performance with and without emoji transcription.

6   Results

For each platform, we report in Table 1 the best performing configuration considering embedding type, features and emoji transcription. We also report the performance obtained by merging all training data (Multi-platform), using only platform-specific training data (Single platform) and, when testing on Twitter, filtering out training instances longer than 280 characters (Filtered multi-platform).

For Instagram, Facebook and WhatsApp, the best performing configuration is identical: they all use emoji transcription, Twitter embeddings and social-specific features. Using multi-platform training data is also helpful, and all the best performing models on the aforementioned datasets use data obtained from multiple sources. However, the only substantial improvement can be observed on the WhatsApp dataset, probably because it is the smallest one and the classifier benefits from more training data.

The results obtained on the Twitter test set differ from the aforementioned ones in several ways. First of all, the in-domain training set is the best performing one, while the length-restricted multi-platform dataset is slightly better than the unrestricted one. These results suggest that learning to detect hate speech on the short interactions that happen on Twitter does not benefit from using data from other platforms. This effect can be at least partially mitigated by restricting the length of the social interactions considered, retaining only the training instances that are more similar to Twitter ones.

Another remark concerning only Twitter is that Hurtlex is in this case more useful than the social network-specific features. While the precise cause would require more investigation, one possible explanation is that Twitter is known for having a relatively lenient approach to content moderation. This would let more hurtful words slip in, increasing the effectiveness of Hurtlex as a feature in addition to word embeddings. Additionally, emoji transcription seems to be less useful for Twitter than for the other platforms, which might be explained by the fact that the Twitter dataset contains relatively fewer emojis than the others.

One final takeaway confirmed by the results is that embeddings trained on social media data (in this case Twitter) always outperform general-purpose embeddings. This shows that the language used on social platforms has peculiarities that might not be present in generic corpora, and that it is therefore advisable to use domain-specific resources.

7   Conclusions

In this paper, we examined the impact of using datasets from multiple platforms in order to classify hate speech on social media. While the results of our experiments demonstrate that using data from multiple sources helps the performance of our model in most cases, the resulting improvement is not always sizeable enough to be useful. Additionally, when dealing with tweets, using data from other social platforms slightly decreases performance, even when we filter the data to contain only short sequences of text. As for future work, further experiments could be performed by testing all possible combinations of training sources and test sets. This way, we could establish which social platforms share more traits when it comes to hate speech, allowing for better detection systems. At the moment, however, the
size of the datasets varies too broadly to allow for a fair comparison, and we would need to extend some of the datasets. Finally, another approach could be tested, where a model trained on Facebook is used for longer sequences of text, while the Twitter model is applied to the shorter ones.

Acknowledgments

Part of this work was funded by the CREEP project (http://creep-project.eu/), a Digital Wellbeing Activity supported by EIT Digital in 2018 and 2019. This research was also supported by the HATEMETER project (http://hatemeter.eu/) within the EU Rights, Equality and Citizenship Programme 2014-2020.

References

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. Hurtlex: A multilingual lexicon of words to hurt. In 5th Italian Conference on Computational Linguistics, CLiC-it 2018, volume 2253, pages 1–6. CEUR-WS.

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, Vancouver, Canada, August. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Cristina Bosco, Felice Dell’Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics.

François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell’Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing different supervised approaches to hate speech detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. to appear. Robust hate speech detection: A cross-language evaluation. Transactions on Internet Technology.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the task on automatic misogyny identification at IberEval 2018. In IberEval@SEPLN, volume 2150 of CEUR Workshop Proceedings, pages 214–228. CEUR-WS.org.

Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, and Jacqueline Wernimont, editors. 2018. Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Association for Computational Linguistics.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Stefano Menini, Giovanni Moretti, Michele Corazza, Elena Cabrio, Sara Tonelli, and Serena Villata. 2019. A system to monitor cyberbullying based on message classification and social network analysis. In Proceedings of the Third Workshop on Abusive Language Online, pages 105–110, Florence, Italy, August. Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Rachele Sprugnoli, Stefano Menini, Sara Tonelli, Filippo Oncini, and Enrico Piras. 2018. Creating a WhatsApp dataset to study pre-teen cyberbullying. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 51–59. Association for Computational Linguistics.

Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel Tetreault, editors. 2017. Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics.

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 shared task on the identification of offensive language. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).