No Place For Hate Speech @ HaSpeeDe 2: Ensemble to Identify Hate Speech in Italian

Adriano dos S.R. da Silva
School of Arts, Sciences and Humanities – University of Sao Paulo
Sao Paulo - Brazil
adriano.santos.silva@usp.br

Norton T. Roman
School of Arts, Sciences and Humanities – University of Sao Paulo
Sao Paulo - Brazil
norton@usp.br

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. In this article, we present the results of applying a Stacking Ensemble method to the problem of hate speech classification proposed in the main task of HaSpeeDe 2 at EVALITA 2020. The model was then compared to a Logistic Regression classifier, along with two other benchmarks defined by the competition's organising committee (an SVM with a linear kernel and a majority class classifier). Results showed our Ensemble to outperform the benchmarks to various degrees, both when testing in the same domain as training and in a different domain.

Italiano. In questo articolo presentiamo i risultati dell'applicazione di un modello di Stacking Ensemble al problema della classificazione dei discorsi di incitamento all'odio nel compito A di EVALITA (HaSpeeDe 2). Il modello è stato quindi confrontato con un modello di regressione logistica, insieme ad altri due benchmark definiti dal comitato organizzatore della competizione (un SVM con un kernel lineare e un classificatore di classe maggioritaria). I risultati hanno mostrato che il nostro Ensemble supera i benchmark a vari livelli, sia durante i test nello stesso dominio di sviluppo che in un dominio diverso.

1 Introduction

Social networks are already part of people's lives, generating thousands of publications on a daily basis. Even though most of this material presents no real harm to other people, some of it bears discriminating discourse, not rarely filled with hate for minorities or people with different viewpoints. Defined as "language which attacks or demeans a group based on race, ethnic origin, religion, gender, age, disability, or sexual orientation/gender identity" (Nobata et al., 2016), hate speech represents a problem that cannot be allowed to grow, under the risk of having it lead to more concrete actions, by some people, with truly undesired results.

This is so much of an issue that some companies have already decided to stop advertising on Facebook[1], for example, as a way to pressure the company into facing this problem. Some initiatives have also emerged in order to monitor and combat this type of content, such as the code of conduct signed by some companies (YouTube, Facebook, Twitter) so that this type of publication can be monitored and removed within 24 hours[2].

Due to the large volume of data, machine learning techniques, along with natural language processing, are being used to automate this activity and identify this type of speech more accurately. Other initiatives include the setting up of competitions aimed at developing and testing different ways to tackle the problem.

One such competition is the evaluation campaign of Natural Language Processing and Speech Tools for Italian (EVALITA), which started in 2007 aiming at promoting the development and dissemination of language resources for Italian. In its 2018 edition, a task (HaSpeeDe) was proposed to identify hate speech on Facebook and Twitter (Bosco et al., 2018).

[1] https://www.nytimes.com/2020/08/01/business/media/facebook-boycott.html
[2] https://ec.europa.eu/info/policies/justice-and-fundamental-rights/combatting-discrimination/racism-and-xenophobia/eu-code-conduct-countering-illegal-hate-speech-online_en
HaSpeeDe attracted the participation of several teams, and the promising results presented there stimulated the development of a second edition of the event (HaSpeeDe 2) at EVALITA 2020 (Sanguinetti et al., 2020; Basile et al., 2020). In this work, we describe our attempt to deal with the hate speech identification problem of HaSpeeDe 2 by developing a stacking ensemble of three machine learning models for this task. The weak classifiers used in the ensemble were an SVM with RBF kernel, a Bernoulli Naïve Bayes (NB), and a Random Forest (RF) model, with a Logistic Regression (LR) model serving as meta-classifier.

For the sake of comparison, and as a way to define some benchmarks for our model, we also developed and tested a Logistic Regression classifier with L2 regularisation, along with both models suggested by the HaSpeeDe 2 organising committee, to wit, an SVM model with a linear kernel and a majority class classifier. As will be made clearer in the forthcoming sections, with a Macro F1-score of 0.749 our ensemble outperforms all benchmarks, for both in-domain and out-of-domain test sets, even though sometimes the differences were not high.

The rest of this article is organized as follows. Section 2 presents some related work aiming at identifying hate speech. Section 3, in turn, gives an overview of the HaSpeeDe 2 task. Next, in Sections 4 and 5 we explain the preprocessing we made, along with the classifiers we built for this task. Section 6, in turn, presents our results, which are further discussed in Section 7. Finally, Section 8 presents our final considerations on this work.

2 Related Work

Several strategies have been used to identify hate speech. Some classic algorithms, like Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), and ensembles of these techniques, have shown good results (e.g. (Basile et al., 2019; Saha et al., 2018; Malmasi and Zampieri, 2018)).

An SVM with RBF kernel, for example, was used to identify hate speech against immigrants and women in tweets written in English. Achieving a macro-averaged F1 score of 0.65, this model was the winner at SemEval 2019 (Basile et al., 2019).

Logistic Regression was another classic model applied to hate speech identification in English, in this case focusing on hate speech towards women, with a reported accuracy of 0.70 (Saha et al., 2018). Delivering an accuracy of 79.8%, an ensemble associated with a meta-classifier was also found to perform well in the task (Malmasi and Zampieri, 2018).

With an overall performance of F1 = 0.749, our ensemble method looks competitive when compared to these models. Even though one cannot really make a true comparison between them, we believe it to be an alternative worth considering.

3 Task

HaSpeeDe 2 Task A consists of a binary classification to identify the presence or absence of hate speech in tweets written in Italian. The competition's organising committee provides participants with a data set for training and testing competing models. This data set is slightly imbalanced, with approximately 40% of tweets presenting hate speech language, as shown in Table 1.

Table 1: Data set class distribution

Hate Speech   Not Hate Speech   Total
2766          4073              6839

This data set is supposed to be used by the competition participants to train and test their models. Competing models are then evaluated on a separate data set, which consists of in-domain and out-of-domain data, defined by the competition's organisation.

4 Preprocessing

As a preprocessing step, we removed stopwords using the NLTK (Natural Language Toolkit[3]) library. For each tweet in the corpus, we also added the following new features:

• The number of words in the tweet;
• The number of exclamation points ('!') present in the tweet; and
• The presence or not of a question mark ('?') in the tweet.

As a final measure, all features related to the tweet's text were normalised in the range between 0 and 1.

[3] https://www.nltk.org/
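As an illustration only, the sketch below shows how this kind of preprocessing could be implemented in Python with NLTK and scikit-learn. It is not the authors' code: the use of NLTK's Italian stopword list, whitespace tokenisation, and min-max scaling are assumptions, and the names `preprocess` and `extra_features` are made up for this example.

```python
import nltk
from nltk.corpus import stopwords
from sklearn.preprocessing import MinMaxScaler

nltk.download("stopwords", quiet=True)
ITALIAN_STOPWORDS = set(stopwords.words("italian"))  # assumption: NLTK's Italian list


def preprocess(tweet: str) -> str:
    """Remove stopwords from a tweet (illustrative whitespace tokenisation)."""
    return " ".join(t for t in tweet.split() if t.lower() not in ITALIAN_STOPWORDS)


def extra_features(tweet: str):
    """The three additional per-tweet features listed above."""
    return [
        len(tweet.split()),   # number of words in the tweet
        tweet.count("!"),     # number of exclamation points
        int("?" in tweet),    # presence (1) or absence (0) of a question mark
    ]


# Normalise the text-related features to the [0, 1] range (min-max scaling assumed)
tweets = ["Non ci posso credere!!!", "Sei serio?"]
extra = MinMaxScaler().fit_transform([extra_features(t) for t in tweets])
```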
5 Classifiers

Three individual classifiers were developed using the Python Sklearn[4] library. These were a Naïve Bayes (NB) with Bernoulli distribution, a Logistic Regression (LR) with L2 regularization, and a Random Forest (RF) with 150 trees. Each classifier was tested with N-gram representations (N ranging from 3 to 5), with and without term frequency-inverse document frequency (TF-IDF) (Rajaraman and Ullman, 2011) normalisation, and with and without preprocessing of the training and test sets.

We then chose the two best models to compose the ensemble to be used at the competition. As will be shown in the next section, these were Random Forest and Naïve Bayes. We also added an SVM classifier, with RBF kernel and C = 2 penalty, to the ensemble, making Logistic Regression our meta-classifier.

The training set was divided into 90% for training/validation and 10% for a test set. Models were trained on the training/validation set using 10-fold cross-validation (Han et al., 2011).

[4] https://scikit-learn.org/stable/
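To make the setup above concrete, here is a minimal sketch of such a stacking ensemble with scikit-learn. It is only an approximation of the configuration described in this section: `load_tweets` is a hypothetical placeholder for reading the HaSpeeDe 2 training data, the word n-gram range and any hyper-parameters not reported above are assumptions, and scikit-learn's `StackingClassifier` stands in for whatever stacking implementation was actually used.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts, labels = load_tweets()  # hypothetical loader: raw tweets and 0/1 hate-speech labels

# One of the tested representations: word n-grams (up to length 4) with TF-IDF weighting
vectorizer = TfidfVectorizer(ngram_range=(1, 4))

# Weak classifiers: Bernoulli NB, Random Forest (150 trees), SVM (RBF kernel, C = 2)
base_learners = [
    ("nb", BernoulliNB()),
    ("rf", RandomForestClassifier(n_estimators=150)),
    ("svm", SVC(kernel="rbf", C=2)),
]
# Logistic Regression with L2 regularisation (scikit-learn's default) as meta-classifier
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(penalty="l2"))
model = make_pipeline(vectorizer, stack)

# 90% training/validation and 10% held-out test, with 10-fold cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, stratify=labels, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="f1_macro")
model.fit(X_train, y_train)
```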
6 Results

Tables 2 and 3 show the performance and settings of each classifier on the training/validation and test sets, respectively. During training, the best results were observed without preprocessing for RF and LR, whereas NB showed better results with preprocessing. These results, however, were very close to each other, ranging from F1 = 0.69 to F1 = 0.71. Regarding the language model, the best results were observed with 5-grams for RF and LR, and with 4-grams for LR and NB.

Table 2: Results of the classifiers in the training stage in terms of F1

Classifier  Lang. Model   Without Preprocessing      With Preprocessing
                          No Norm.   TF-IDF          No Norm.   TF-IDF
RF          3-Gram        0.662      0.657           0.6687     0.667
RF          4-Gram        0.683      0.694           0.690      0.689
RF          5-Gram        0.701      0.701           0.687      0.686
LR          3-Gram        0.681      0.703           0.676      0.696
LR          4-Gram        0.711      0.701           0.706      0.697
LR          5-Gram        0.711      0.673           0.708      0.673
NB          3-Gram        0.679      0.679           0.681      0.681
NB          4-Gram        0.689      0.689           0.694      0.694
NB          5-Gram        0.654      0.654           0.668      0.668

Table 3: Results of the classifiers in the test stage in terms of F1

Classifier  Lang. Model   Without Preprocessing      With Preprocessing
                          No Norm.   TF-IDF          No Norm.   TF-IDF
RF          3-Gram        0.650      0.668           0.650      0.674
RF          4-Gram        0.693      0.694           0.710      0.696
RF          5-Gram        0.707      0.709           0.703      0.700
LR          3-Gram        0.675      0.701           0.675      0.709
LR          4-Gram        0.684      0.696           0.685      0.710
LR          5-Gram        0.669      0.665           0.707      0.680
NB          3-Gram        0.696      0.696           0.707      0.707
NB          4-Gram        0.718      0.718           0.740      0.740
NB          5-Gram        0.658      0.658           0.687      0.687

On the test set, the best results for all methods were observed with preprocessing of the data. Normalising the vectors does not seem, however, to have influenced results when preprocessing is used. All best values were obtained with 4-grams. Overall, the best result was achieved with Naïve Bayes (F1 = 0.74), with preprocessing, using a 4-gram language model, both with and without TF-IDF normalisation.

The ensemble model was tested with only one configuration: 4-gram, with normalisation, and without preprocessing. This configuration resulted in F1 = 0.729 on the training set (a 2.5% increase over the best model on this set) and F1 = 0.751 on the test set, corresponding to a 1.5% improvement over the best model on this set. As it turns out, especially on the test set, the differences between the ensemble and its best constituent method do not seem so high.
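The F1 values reported here and in the next section appear to be macro-averaged F1 scores (the paper calls its 0.749 a "Macro F1-score"), i.e. the unweighted mean of the per-class F1 scores. As a small, purely illustrative reminder of how such a score can be computed (the label vectors below are toy values, not HaSpeeDe 2 data):

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 0, 1, 0, 1, 0]   # 1 = hate speech, 0 = not hate speech (toy labels)
y_pred = [1, 0, 1, 1, 0, 0, 0]

# Macro F1: compute F1 for each class separately, then take their unweighted mean
print(f1_score(y_true, y_pred, average="macro"))
```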
7 Discussion

The competition rules allow only two models to be sent by each team. Although our Naïve Bayes model showed good performance on the test set we had at hand, we chose not to send it to HaSpeeDe 2 because it would also be tested on an out-of-domain data set. Since this classifier can be very sensitive to domain changes, especially regarding null-frequency words, which might bring the whole model down to multiplying smoothing values, we thought we would be better off not sending it. Still, it remained as one of the weak classifiers in the Ensemble we sent, so it was not completely put aside.

The organisation of the competition presented F1 results corresponding to two classifiers, run on the same data set distributed to all participants in the competition. These were supposed to be taken as baselines by all competing teams. The first consisted of a majority class classifier (Baseline-MC), which always chooses the majority class to label new examples. The second classifier, in turn, consisted of an SVM with linear kernel, running with TF-IDF normalisation (Baseline-SVM).

Table 4 shows the results of these two baseline classifiers, along with the classifiers we submitted to the competition (i.e. our Ensemble model and its constituent Logistic Regression classifier).

Table 4: Results of the baselines and final performance of our classifiers in Task A in terms of F1

Classifier     Out-of-domain   In-domain
Baseline-MC    0.3894          0.3366
Baseline-SVM   0.621           0.7212
Ensemble       0.632           0.749
LR             0.621           0.705

As it turns out, for the within-domain task, only our Ensemble was superior to the baselines (3.9% over the baseline SVM and almost 123% over the majority class baseline). When moving to the out-of-domain test set, this difference dropped to only 1.8% over the SVM model and 62.3% over the majority class, still outscoring both baselines.

Regarding our Logistic Regression model, when run on the within-domain test set it outscored only the majority class baseline (109% better), being however outscored by the baseline SVM by 2.3%. As for the out-of-domain test set, our Logistic Regression model presented the same result as the baseline SVM, outscoring the majority class baseline by 59.5%. Interestingly, both the Ensemble and Logistic Regression models scored similarly in this set.

8 Conclusion

In this article we reported on the results obtained by two models submitted to EVALITA's HaSpeeDe 2 task. Even though our Ensemble model outscored both benchmarks, we believe it could do better, should other choices regarding the language model be made.

Since the best results were obtained with longer word sequences (in our case, 4-grams), it might be the case that other language models, such as GloVe or CBOW, for example, which make use of context words on both sides of the target word, could come up as better alternatives to the 4-gram model we used. BERT could also be a possibility to test.

Our best results were also obtained, at least during testing, with preprocessing of the data. We thus believe this is something to be kept. Regarding the normalisation of feature vectors, we could not observe great differences between using it or not, at least when it comes to TF-IDF normalisation.

Another direction to be followed might be to test other models as weak classifiers in the Ensemble, or even ensemble strategies other than stacking. This is something we leave for future work.

References

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, USA, June.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18).

Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data Mining: Concepts and Techniques. Elsevier.

Shervin Malmasi and Marcos Zampieri. 2018. Challenges in discriminating profanity from hate speech. Journal of Experimental & Theoretical Artificial Intelligence, 30(2):187–202.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web.

Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge.

Punyajoy Saha, Binny Mathew, Pawan Goyal, and Animesh Mukherjee. 2018. Hateminers: Detecting hate speech against women. CoRR, abs/1812.06700.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2 @ EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.