=Paper=
{{Paper
|id=Vol-2765/175
|storemode=property
|title=CHILab @ HaSpeeDe 2: Enhancing Hate Speech Detection with Part-of-Speech Tagging (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2765/paper175.pdf
|volume=Vol-2765
|authors=Giuseppe Gambino,Roberto Pirrone
|dblpUrl=https://dblp.org/rec/conf/evalita/GambinoP20
}}
==CHILab @ HaSpeeDe 2: Enhancing Hate Speech Detection with Part-of-Speech Tagging (short paper)==
CHILab @ HaSpeeDe 2: Enhancing Hate Speech Detection with Part-of-Speech Tagging

Giuseppe Gambino and Roberto Pirrone
Dipartimento di Ingegneria, Università degli Studi di Palermo
giuseppe.gambino09@community.unipa.it, roberto.pirrone@unipa.it

Abstract

The present paper describes two neural network systems used for Hate Speech Detection tasks that make use not only of the pre-processed text but also of its Part-of-Speech (PoS) tags. The first system uses a Transformer Encoder block, a relatively novel neural network architecture that has emerged as a substitute for recurrent neural networks. The second system uses a Depth-wise Separable Convolutional Neural Network, a type of CNN that has become known in the field of image processing thanks to its computational efficiency. These systems were used for the participation in the HaSpeeDe 2 task of the EVALITA 2020 workshop under the team name CHILab, where our best system, the one based on the Transformer, ranked first in two out of four tasks and third in the other two. The systems have also been tested on the English, Spanish and German languages.

1 Introduction

Hate speech is unfortunately not a new problem in society, but it has recently found fertile ground in social media platforms that enable users to express themselves freely and often anonymously. While the ability to express oneself freely is a human right, inducing and spreading hate towards another group is an abuse of this liberty (MacAvaney et al., 2019).

As such, many online platforms such as Facebook, YouTube, Reddit, and Twitter consider hate speech harmful, and have both policies and instruments to remove hate speech content, which are getting better over time. Due to the societal concern and to how widespread hate speech is becoming on the Internet, there is strong motivation to study its automatic detection. In this way the spread of hateful content can be reduced, making the community's online space safer, and also more attractive for advertising sponsors who do not want their brand to be associated with hateful content. Detecting hate speech is, of course, a challenging task. In case of a wrong classification, for example, a content creator could suffer socio-economic consequences such as the demonetization of one of their contents or a ban from the platform. Therefore, the goal of hate speech detection is not only to identify a text that contains words that at first sight could be negative, but also to distinguish news headlines that report crime news from a text that contains an actual "attack" against a person or a group on the basis of attributes such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity.

The rest of the paper is arranged as follows. Section 2 describes our systems for hate speech detection. Section 3 presents the results obtained in the HaSpeeDe 2 (Sanguinetti et al., 2020) task of the EVALITA 2020 (Basile et al., 2020) conference, together with results obtained on other languages. Results are discussed in Section 4 and conclusions are drawn in Section 5.

2 Description of the Systems

In this section we present the implementation details of all the architectures we used. Both systems share the use of a PoS Tagging technique, which is applied to the pre-processed text and passed as an additional input to the neural network.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2.1 Pre-processing

Before training a model, it is common practice to clean the data, especially when they are retrieved from social media. For this reason we implemented a classic text pre-processing pipeline that consists of: lower-casing the text; removing HTML tags, mentions and symbols; and standardizing words by cutting characters repeated more than two times in a row. We also made some keyword substitutions in all our data sets:

• URLs and the "url" keyword of the HaSpeeDe 2 data set were replaced by the symbol LINKURL

• Happy emoticons like " :) " or " :D " were replaced by the symbol HAPPYEMO

• Angry or sad emoticons like " :@ " or " :( " were replaced by the symbol BADEMO

It is important to note that we have not removed the emojis from the text, as our word embedding treats emojis as plain words.

Tweet | HS
@user useless people like all Muslims | 1
@user no more refugees in Italy please no more | 1
Four bicycles stolen from Milan-Sanremo cyclists: found in a gypsy camp url | 0
TRAGEDY IN PRISON - The nomad Carlo Helt takes his own life url | 0

Table 1: Some examples, translated into English, drawn from the development data set proposed in the HaSpeeDe 2 competition, together with their labels: nominal utterances used in hate speech along with journalistic tweets
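As a minimal sketch, the pre-processing steps described above can be expressed with Python's `re` module. The paper does not list its actual regular expressions, so the patterns below are illustrative assumptions only:

```python
import re

def preprocess(text: str) -> str:
    """Clean a tweet following the pipeline in the paper (patterns are illustrative)."""
    text = text.lower()                                            # lower-casing
    text = re.sub(r"<[^>]+>", " ", text)                           # drop HTML tags
    text = re.sub(r"@\w+", " ", text)                              # drop mentions
    text = re.sub(r"(https?://\S+|\burl\b)", " LINKURL ", text)    # URLs / "url" keyword
    text = re.sub(r"(:\)|:d)", " HAPPYEMO ", text)                 # happy emoticons
    text = re.sub(r"(:@|:\()", " BADEMO ", text)                   # angry or sad emoticons
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                     # cap repeated chars at two
    return re.sub(r"\s+", " ", text).strip()                       # normalize whitespace
```

For example, `preprocess("@user Cooool :) url")` yields `"cool HAPPYEMO LINKURL"`: the mention is stripped, the character run is capped, and the emoticon and URL keyword are mapped to their placeholder symbols.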
2.2 Part-of-Speech Tagging

In this work we use the PoS Tagging technique to provide our networks with more information about the meaning of a sentence, through an explicit classification based on its grammatical structure. This is a crucial point with regard to hate sentences, which tend to have particular structures. One of the most widespread hate sentence structures, for example, is the verbless one, also known as nominal utterance (Comandini et al., 2018). Another example are journalistic tweets (Comandini and Patti, 2019). Starting from a preliminary direct inspection of the development data set proposed in HaSpeeDe 2, we found that a journalistic tweet is usually a short tweet that ends with a URL. Such texts can easily be misclassified due to the presence of some negative words that describe the news. Table 1 reports some examples of these types of statements.

As the HaSpeeDe 2 organizers explicitly required the use of the same system for both tasks A and B, we set up a PoS Tagging model not too biased towards either news headlines or tweets. As a consequence, we enriched the PoS Tagger provided by the Python spaCy library (Honnibal and Montani, 2017). As this model is trained on Wikipedia, we used some regular expressions to add the keywords for emoticons, emojis, hashtags, and URLs to the vocabulary. In this way we injected some parts of speech of the social media language into a standard PoS Tagging model. We were definitely aware that tweet-oriented models such as the UDPipe tool (Straka, 2018) trained on the POSTWITA-UD Treebank (Sanguinetti et al., 2018) would have performed better than our solution on the in-domain data, but our solution guaranteed a more balanced performance. An example of our PoS Tagging is shown in Figure 1.

Figure 1: PoS Tagging example

2.3 Word Embedding

It is well known in the NLP community that word embeddings are one of the features that most affect the performance of a model. For our application we chose fastText (Bojanowski et al., 2016), a word embedding developed by Facebook Research. FastText enriches word vectors with subword information, treating each word as composed of n-grams: each word vector is the sum of the vector representations of each of its n-grams. In this way, two words will have nearby vectors not only if they appear in similar contexts but also if they are similar themselves. This is a great feature for handling the misspellings that occur often in social language.
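The subword idea can be illustrated with a toy sketch. The boundary markers and the 3-to-6 character n-gram range follow the fastText paper's description; the misspelled word below is a made-up example, not taken from the data set:

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set:
    """fastText-style character n-grams, with '<' and '>' boundary markers."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

# A word and a (hypothetical) misspelling share most of their n-grams,
# so their summed n-gram vectors end up close in the embedding space.
a = char_ngrams("migranti")
b = char_ngrams("migrantii")          # hypothetical misspelling
jaccard = len(a & b) / len(a | b)     # high overlap despite the typo
```

Here the two forms share 22 of their 34 distinct n-grams, which is why vectors built as sums of n-gram vectors remain robust to the kind of typos common in tweets.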
We trained the word embedding for the Italian language from scratch with the Gensim library (Řehůřek and Sojka, 2010) on a 2014 MacBook Pro 13" with 8 GB RAM and the AVX2 FMA CPU extension; training took about 5 hours. The embedding model was trained for 10 epochs on 5 million Italian tweets, with size = 300, window size = 5, and min count = 2. These tweets were extracted from the TWITA 2018 dataset (Basile and Nissim, 2013) and are all related to the words: immigrati, islam, migranti, musulmani, profughi, rom, stranieri, salvini, criminali, africani, terroni, #dallavostraparte, #salvini, #stopinvasione, #piazzapulita, #quintacolonna. For the French, English and German tweets we used pre-trained models (Camacho-Collados et al., 2020). Regarding the PoS Tagging embedding, we applied TensorFlow's Embedding Layer for all the languages considered.

2.4 System 1: The Transformer

Transformers (Vaswani et al., 2017) are the current state-of-the-art models for dealing with sequences. Unlike previous architectures for NLP, such as LSTM and GRU, they have no recurrent connections and thus no real memory of previous states: Transformers get around this lack of memory by perceiving entire sequences simultaneously and treating them with an attention mechanism. In this way, Transformers achieve a degree of parallelism that leads to a significantly shorter training time than recurrent solutions. Attention is a means of selectively weighting different elements in the input data, so that they have an adjusted impact on the hidden states of downstream layers.

The Transformer was conceived as an encoder-decoder model, which is an ideal approach for machine translation tasks and language modeling. In this work we used the Transformer encoder architecture as an alternative to recurrent or convolutional neural networks (CNN) (see Figure 2). We used one Transformer encoder for the text input and one for the PoS input, then we averaged them through max pooling. Finally, we used dropout and a dense layer to get the output probabilities. After testing various combinations of parameters, we found the most efficient ones for this task to be: 12 heads in the Multi-Head attention layer, 768 hidden units, embedding size equal to 300, dropout = 0.2, and batch size equal to 128. Training lasted 3 epochs, about 40 seconds each.

Figure 2: The Transformer System
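The attention mechanism at the heart of each encoder can be sketched in simplified single-head form, leaving out the learned projection matrices of the full Multi-Head layer:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens, d_model = 5, 300          # embedding size 300, as in the paper
x = rng.standard_normal((tokens, d_model))
out, attn = attention(x, x, x)    # self-attention over one toy sequence
```

Each output row is a weighted mixture of all token vectors at once, which is what removes the sequential dependency of recurrent networks and enables the parallelism noted above.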
2.5 System 2: Depth-wise Separable Convolutional Neural Network

Depth-wise Separable Convolution (DSC) is a well-known technique in Computer Vision used to dramatically lower the number of parameters in a CNN. DSC decomposes the classical 3D convolution by performing first a depth-wise spatial convolution for each channel, followed by a point-wise convolution which mixes the resulting output channels together. This computational trick mimics the true convolution kernel operation while reducing the size of the model and speeding up the training, with almost the same accuracy.

Our neural network architecture is reported in Figure 3, and takes inspiration from Yoon Kim's well-known architecture (Kim, 2014). We made some changes to take into consideration both the vectorized text and its PoS Tagging. The overall architecture is made of two parallel DSC networks that receive the text and PoS embeddings respectively. The two convolutional blocks are then averaged through max pooling.

Figure 3: The DSC System
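The parameter saving of the decomposition can be checked with a back-of-the-envelope count for a 1D convolution (bias terms ignored). The kernel size and embedding width below match the paper's setup, but the comparison itself is only illustrative:

```python
def conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Standard 1D convolution: one (kernel_size x c_in) filter per output channel."""
    return kernel_size * c_in * c_out

def separable_conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Depth-wise pass (one kernel per input channel) + point-wise 1x1 channel mixing."""
    return kernel_size * c_in + c_in * c_out

# With kernel size 2, a 300-dim embedding, and a 64-filter layer:
standard = conv1d_params(2, 300, 64)             # 2 * 300 * 64 = 38400 weights
separable = separable_conv1d_params(2, 300, 64)  # 2 * 300 + 300 * 64 = 19800 weights
```

The separable form needs roughly half the weights here, and the gap widens as kernel size or channel counts grow, which is why the DSC model trains so quickly.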
After testing various combinations of parameters, we found the most efficient setup for this task to be: [16, 32, 64] convolutional filters, kernel size = 2, dropout = 0.3, and batch size = 32. Training lasted 8 epochs, about 5 seconds each.

3 Results

In this Section we describe the HaSpeeDe 2 tasks of the EVALITA 2020 competition, and we present the results we obtained in each of them. To evaluate the degree of generality of our approach, we also tested it on hate speech detection tasks for languages other than Italian, namely English, Spanish and German. The official ranking reported for each run is given in terms of macro-average F-score.

3.1 HaSpeeDe 2 Task A - Hate Speech Detection

This is the main task, and it consists of a binary classification aimed at determining whether a message contains Hate Speech or not. We fine-tuned the parameters for this task and then used the resulting model as-is for the other tasks. We were provided with a labeled training set – made of tweets only – and two unlabeled test sets: one containing in-domain data, i.e. tweets, and the other out-of-domain data, i.e. news headlines. Our results for both Task A test sets are reported in Table 2.

Test data | Model | Rank | F1
news | Transformer | 1/27 | 0.7744
news | DSC | 4/27 | 0.7183
tweets | Transformer | 3/27 | 0.7893
tweets | DSC | 5/27 | 0.7782

Table 2: Results of the HaSpeeDe 2 Task A

3.2 HaSpeeDe 2 Task B - Stereotype Detection

Task B is a binary classification aimed at determining whether a message contains stereotypes or not. The task is motivated by the fact that stereotypes constitute a common source of error in HS identification (Francesconi et al., 2019). Task B data sets are the same as those of Task A. Our results for both the in-domain and out-of-domain test sets are reported in Table 3.

Test data | Model | Rank | F1
news | Transformer | 1/12 | 0.7203
news | DSC | 2/12 | 0.7184
tweets | Transformer | 3/12 | 0.7615
tweets | DSC | 5/12 | 0.7386

Table 3: Results of the HaSpeeDe 2 Task B

3.3 Multilingual Detection of Hate Speech

We also tested our systems against data sets coming from Hate Speech or Offensive Language detection tasks for other languages.

Table 4 reports the results of SemEval 2019 Task 5 (HatEval) (Basile et al., 2019) about the binary detection of hate speech against immigrants and women in Spanish and English messages extracted from Twitter.

 | English | Spanish
Min | 0.3500 | 0.4930
Mean | 0.4484 | 0.6821
Median | 0.4500 | 0.7010
Max | 0.6510 | 0.7300
Transformer | 0.6041 | 0.7423
DSC | 0.5823 | 0.7375

Table 4: Results of the HatEval Subtask A

Table 5 shows the results of GermEval 2019 Task 2 - Subtask A (Struß et al., 2019). The purpose of this task is to initiate and foster research on the binary identification of offensive content in German-language micro-posts.

 | German
Min | 0.5487
Mean | 0.7151
Median | 0.7295
Max | 0.7695
Transformer | 0.7384
DSC | 0.7240

Table 5: Results of the GermEval 2019 Task 2

4 Discussion

As can be seen from the results, the Transformer model always outperformed the DSC model: we expected this outcome due to the nature of the DSC model, which is designed to be as light as possible while still performing well. Regarding the results obtained with the Italian language, we are satisfied with our implementations, which achieved excellent ranking positions in all tasks. In particular, the Transformer model outperformed all the participating systems on out-of-domain data, ranking first. This can be seen as evidence of our model's ability to generalize starting from a training data set different from that of the application.
Regarding the results obtained with in-domain data, we performed slightly worse, ranking third. This is probably due to the PoS Tagging model we used: it is a model trained on Wikipedia rather than on social language and, even if slightly modified to handle hashtags, emoticons and URLs, it certainly does not perform as well on social texts as a PoS Tagging model trained purely on social media language.

As regards the results obtained with the other languages, we achieved an excellent result with Spanish, surpassing the first officially ranked system of the HatEval 2019 competition in Spanish. Our models do not achieve equally good results in English and German, even if the Transformer's score is always above the median value. We think that this is due to the nature of the languages: Germanic languages, such as English and German, probably benefit less than Romance ones from the additional use of PoS Tagging, at least in the way we used it. We are still investigating how to get added value from PoS Tagging for the English and German languages.

5 Conclusion

In this paper we have introduced two systems for hate speech detection in social media texts in the Italian, Spanish, English and German languages. The main feature of these models is that they use as input to the neural network not only the pre-processed text, but also its PoS tags. We are satisfied with the results obtained, because the systems implemented are light and well-performing. Furthermore, we have shown that including PoS Tagging as an additional input adds value, reaching the top positions in the task rankings. Our future work will focus on injecting more and more of the grammatical structure of a sentence into a model: in fact, we are planning a language model whose purpose is not only to predict a word given its context, but also to predict the PoS tag of that word.

References

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

Jose Camacho-Collados, Yerai Doval, Eugenio Martínez-Cámara, Luis Espinosa-Anke, Francesco Barbieri, and Steven Schockaert. 2020. Learning Cross-lingual Embeddings from Twitter via Distant Supervision. In Proceedings of ICWSM.

Gloria Comandini and Viviana Patti. 2019. An Impossible Dialogue! Nominal Utterances and Populist Rhetoric in an Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the Third Workshop on Abusive Language Online, pages 163–171. Association for Computational Linguistics.

Gloria Comandini, Manuela Speranza, and Bernardo Magnini. 2018. Effective Communication without Verbs? Sure! Identification of Nominal Utterances in Italian Social Media Texts. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018, volume 2253 of CEUR Workshop Proceedings. CEUR.org.

Chiara Francesconi, Cristina Bosco, Fabio Poletto, and M. Sanguinetti. 2019. Error Analysis in a Hate Speech Detection Task: The Case of HaSpeeDe-TW at EVALITA 2018. In CLiC-it.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

Sean MacAvaney, Hao-Ren Yao, Eugene Yang, Katina Russell, Nazli Goharian, and Ophir Frieder. 2019. Hate speech detection: Challenges and solutions. PLOS ONE, 14(8):1–16, 08.

Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Milan Straka. 2018. UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207, Brussels, Belgium, October. Association for Computational Linguistics.

Julia Struß, Melanie Siegel, Josef Ruppenhofer, Michael Wiegand, and Manfred Klenner. 2019. Overview of GermEval Task 2, 2019 Shared Task on the Identification of Offensive Language. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30.