TheNorth @ HaSpeeDe 2: BERT-based Language Model Fine-tuning for Italian Hate Speech Detection

Eric Lavergne, Rajkumar Saini, György Kovács and Killian Murphy
Luleå Tekniska Universitet
eric.lavergne@gmx.fr, rajkumar.saini@ltu.se, gyorgy.kovacs@ltu.se, killian.murphy@telecom-sudparis.eu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. This report describes the systems submitted by the team "TheNorth" for the HaSpeeDe 2 shared task organised within EVALITA 2020. To address the main task, hate speech detection, we fine-tuned BERT-based models. We evaluated both multilingual and Italian language models trained with the data provided and with additional data. We also studied the contributions of multitask learning considering both the hate speech detection and stereotype detection tasks.

1 Introduction

Organised as part of the 7th EVALITA evaluation campaign (Basile et al., 2020), the HaSpeeDe 2 shared task focuses on the detection of online hate speech (Sanguinetti et al., 2020) in Italian. Hate speech occurs frequently on social media. It is defined as "any communication that disparages a person or a group on the basis of some characteristics such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics" (Nockleby, 2000). Regulating all user messages is very time-consuming for a human, which is one of the reasons why automatic methods are important.

Besides the main task of binary hate speech classification - aimed at deciding whether a message contains hate speech or not - the HaSpeeDe 2 shared task has two more sub-tasks: stereotype detection and the identification of nominal utterances. All tasks are evaluated both on in-domain data (tweets) and on out-of-domain data (newspaper headlines). Here, we tackle both the main task and the first sub-task of Stereotype Detection, which is potentially useful for the main task. For this sub-task the organisers use the following definition of Stereotype: "a standardized mental picture that is held in common by members of a group and that represents an oversimplified opinion, prejudiced attitude, or uncritical judgment" (Merriam-Webster, 2020).

We thus have two binary classification tasks. A simple way to perform text classification is based on a bag-of-words representation counting the number of occurrences of each word within the text. It is often combined with the term frequency-inverse document frequency (TF-IDF) representation (Sparck Jones, 1988), which normalizes the frequencies according to how often the words appear across all documents. With the rise of neural networks, word vectors have provided useful features for text classification tasks. Recurrent Neural Networks such as the Bidirectional Long Short-Term Memory (BiLSTM) network (Schuster and Paliwal, 1997) have then been used to encode the long-term dependencies between words. Such systems were the most successful in the previous HaSpeeDe campaign (Bosco et al., 2018).

In (Aluru et al., 2020), the authors showed that when dealing with very low monolingual resources, multilingual approaches can be interesting for hate speech. In (Polignano et al., 2019b), the AlBERTo monolingual Italian BERT-based language model was trained, and it outperformed the state-of-the-art on the HaSpeeDe 2018 evaluation (Polignano et al., 2019a).

We have chosen to deepen the approach of fine-tuning a BERT-based language model, comparing multilingual and monolingual settings. We also assessed the contribution of additional hate speech data from different online sources. Finally, we submitted the results of the same model fine-tuned with and without multitask learning between the hate speech and stereotype detection tasks.
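As a concrete point of reference for the systems described below, a TF-IDF bag-of-words classifier of the kind mentioned above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not our exact baseline code; `texts` and `labels` are placeholder names for the training tweets and their binary labels.

```python
# Minimal TF-IDF bag-of-words baseline sketch (illustrative, not the exact
# baseline code). `texts` and `labels` are placeholders for the training
# tweets and their binary hate speech annotations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(),                    # word counts re-weighted by inverse document frequency
    LogisticRegression(max_iter=1000),    # simple linear classifier on top
)

# Macro F1 over 5 folds, mirroring the cross-validation protocol used later.
scores = cross_val_score(baseline, texts, labels, cv=5, scoring="f1_macro")
print(scores.mean())
```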
2 System Description

2.1 Fine-tuning process

The chosen classification approach is to fine-tune a BERT-based language model. This kind of approach is the state-of-the-art for many text classification tasks today (Sun et al., 2019; Seganti et al., 2019). BERT is a language model which aims to learn the distribution of language (Devlin et al., 2018). It is trained by predicting masked tokens in a text. The next sentence prediction task that was originally used alongside it has been removed in some later BERT-based models such as RoBERTa (Liu et al., 2019). BERT is a Transformer; in a Transformer, the recurrence of Recurrent Neural Networks is replaced by the attention mechanism (Vaswani et al., 2017).

It has been shown that these models can be fine-tuned for many downstream natural language processing tasks, including the one we are interested in, text classification. This is achieved by removing the language modelling head and replacing it with a head appropriate for the target task. The designers of BERT prepared for this by adding a token at the beginning of each text sequence, named CLS for classification. The purpose of this token is to accumulate, by the end of the forward pass, the information useful for the classification task. A classifier head can then take this CLS token as input to classify the whole text sequence. In our case we decided to add a simple linear layer with a softmax on top of it, for simplicity and because it is efficient enough given that the other layers are fine-tuned.

2.2 Layer-wise learning rate

An important consideration in fine-tuning, described in (Sun et al., 2019), is the choice of the learning rate. Besides being, as usual, the most important hyper-parameter of the gradient descent learning algorithm, it could also be responsible here for catastrophic forgetting if it were too high. Catastrophic forgetting refers to erasing the information contained in the weights of the pretrained model, and it can happen when the gradient updates are too large.

Moreover, the learning rate can be gradually decreased in the first layers of the model. This aims at limiting the updates in these first layers, which have been shown to contain the most primal information about the language. One can think of the classical example of computer vision neural networks, where basic shape features are extracted by the first layers and task-specific combinations are processed in the last ones. We therefore applied a layer-wise learning rate following a geometric progression: the learning rate of a layer is that of the following layer multiplied by a decay factor γ between 0 and 1,

LR_{k-1} = γ × LR_k

where LR_k is the learning rate of the k-th layer. The case γ = 1 is classic fine-tuning with the same learning rate everywhere, and the case γ = 0 is feature extraction, with all the language model weights frozen and only the parameters of the classification head trainable. The hyper-parameter γ was tuned together with the others during the hyper-parameter tuning process.
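To make the geometric schedule concrete, the sketch below builds per-layer parameter groups for a BERT-style encoder with the Hugging Face transformers library. The checkpoint name is an example (UmBERTo on the Hugging Face hub), the grouping assumes the standard BERT/RoBERTa module layout in transformers, and AdamW is a common default rather than a detail reported here.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative values; the tuned values used for our submissions are in Table 1.
base_lr, gamma = 2e-4, 0.35

# Any BERT-style checkpoint with a classification head over the CLS token works here.
model = AutoModelForSequenceClassification.from_pretrained(
    "Musixmatch/umberto-commoncrawl-cased-v1", num_labels=2
)

# Embeddings count as the lowest "layer", then the Transformer blocks in order.
encoder = model.base_model
blocks = [encoder.embeddings] + list(encoder.encoder.layer)

# LR_{k-1} = gamma * LR_k: the top block trains at base_lr and each block
# below it is scaled down by gamma; gamma = 0 would freeze the whole encoder.
param_groups = [
    {"params": block.parameters(), "lr": base_lr * gamma ** (len(blocks) - 1 - k)}
    for k, block in enumerate(blocks)
]
# The freshly initialised classification head gets the full learning rate.
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups)
```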
2.3 Monolingual and multilingual language models

We compared the use of several language models. Many models similar to BERT have been trained since 2018, and many are available for use. Although such models are often first and foremost trained for English, multilingual models have been trained on data from several languages in order to counteract the lack of data for some of them; this is the case for mBERT and XLM-RoBERTa (Conneau et al., 2020). Machine learning researchers have also trained monolingual models for their own languages, such as CamemBERT for French and AlBERTo or UmBERTo for Italian.

Multilingual models have the advantage of being trainable on data in different languages, which is very useful for low-resource tasks. However, they are expected to perform in dozens of languages while monolingual models focus on just one, with the same number of parameters. For this reason, monolingual models often perform better when sufficient data is available, as we show here.

We evaluated two multilingual models, mBERT and XLM-RoBERTa, and three Italian monolingual models, AlBERTo, UmBERTo, and PoliBERT. AlBERTo was pretrained on TWITA, a collection of Italian tweets (Polignano et al., 2019b). UmBERTo was pretrained on Commoncrawl ITA, exploiting the large OSCAR Italian corpus (Parisi et al., 2020). Finally, PoliBERT was fine-tuned for sentiment analysis on Italian tweets by its creators (Barone, 2020).

We also tried to use more data, with different settings. For the multilingual models, we could use hate speech data of any language. For the monolingual models, we used the little data available for Italian, but we also tried translated multilingual data. These additions were not conclusive, so we stuck to the HaSpeeDe 2 data for the submissions.

2.4 Random search hyper-parameter tuning

The tuning of the hyper-parameters is relevant in order to get good results; this is especially the case for the learning rate and the layer-wise decay factor γ. We tuned the hyper-parameters with random search, which has been shown to often be more efficient than grid search (Bergstra and Bengio, 2012). The hyper-parameters to be tuned are the batch size, the learning rate, the layer-wise multiplier, and the length of the model (maximum number of tokens). We ran ten trials for each language model. The number of epochs is selected with early stopping on the validation macro F1-score, with an 80/20 split. Table 1 shows the best hyper-parameters obtained, which were used for the submitted systems.

Hyper-parameter    Value
Learning rate      2×10⁻⁴
Layer-wise γ       0.35
Batch size         32
Max length         100
Language model     UmBERTo

Table 1: Hyper-parameters used for our HaSpeeDe 2 submissions after the tuning process.

It is very important that the learning rate and the layer-wise multiplier γ are tuned simultaneously, because the choice of the multiplier strongly modifies the amplitude of the gradient updates.
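The random search itself can be as simple as sampling each hyper-parameter independently for each trial. The ranges below are illustrative assumptions (the report does not specify the exact search space), and `train_and_validate` is a hypothetical helper standing for one training run that returns the validation macro F1-score.

```python
import random

# One random-search trial: sample each hyper-parameter independently.
# The sampling ranges are illustrative assumptions, not the exact search space.
def sample_trial():
    return {
        "learning_rate": 10 ** random.uniform(-5, -3),  # log-uniform learning rate
        "gamma": random.uniform(0.0, 1.0),              # layer-wise decay factor
        "batch_size": random.choice([16, 32, 64]),
        "max_length": random.choice([50, 100, 128]),
    }

# Ten trials per language model, keeping the configuration with the best
# validation macro F1 (train_and_validate is a hypothetical helper).
best = max(
    (sample_trial() for _ in range(10)),
    key=lambda hp: train_and_validate(hp),
)
```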
2.5 Multitask Learning

We evaluated the use of multitask learning between the two classification tasks of the competition, hate speech detection and stereotype detection. Multitask learning consists of learning to perform several tasks. It can be done by learning the tasks simultaneously with common first layers but task-specific heads (Ruder, 2017); in our case each task has its own output linear layer. When the tasks can be based on similar representations, this is expected to provide good regularization through useful shared representations; it is then a kind of transfer learning. The error analysis conducted on the HaSpeeDe 2018 evaluation suggests a significant correlation between the usage of stereotype and hate speech (Francesconi et al., 2019). Moreover, the authors showed that the false positive rate on hate speech tweets is slightly higher for tweets with stereotype.

A question that arises when doing multitasking is how to combine the losses of the tasks into one. The simple solution is to sum them uniformly. This might not be the best solution when there is imbalance between the tasks, for instance when the scale of the outputs of one task is much higher than that of the others. A solution proposed by (Kendall et al., 2017) is to use trainable weights based on uncertainty. (Liebel and Körner, 2018) improves the regularisation term of this solution, and (Gong et al., 2019) shows in a benchmark that this last solution is often the best. We evaluated this solution and compared it with the single-task setting.
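As a concrete illustration of the uncertainty-based weighting, the module below is a sketch of the published formulation, following (Kendall et al., 2017) with the log(1 + σ²) regularisation term of (Liebel and Körner, 2018); it is not necessarily our exact training code, and the loss names in the usage comment are placeholders.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Trainable uncertainty-based loss weighting: a sketch following
    Kendall et al. (2017), with the log(1 + sigma^2) regularisation term
    of Liebel and Körner (2018)."""

    def __init__(self, num_tasks: int = 2):
        super().__init__()
        # One trainable log-variance s_i = log(sigma_i^2) per task, initialised
        # to 0 (sigma_i = 1), i.e. a uniform sum of the losses at the start.
        self.log_var = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        precision = torch.exp(-self.log_var)            # 1 / sigma_i^2
        penalty = torch.log1p(torch.exp(self.log_var))  # log(1 + sigma_i^2)
        return (precision * task_losses + penalty).sum()

combine = UncertaintyWeightedLoss(num_tasks=2)
# hs_loss and stereotype_loss would be the cross-entropy losses of the two
# task-specific heads (placeholder names):
# total_loss = combine(torch.stack([hs_loss, stereotype_loss]))
```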
2.6 Cross-validation ensembling and submitted models

Two submissions were allowed during the HaSpeeDe 2 test phase. We chose to submit a fine-tuned UmBERTo trained separately for each of the two tasks, and a fine-tuned UmBERTo with multitasking on both Stereotype and Hate Speech detection. The hyper-parameters used to train these models were presented in Table 1. Since we compared the different language models with 5-fold cross-validation, we then ensembled the 5 models obtained for each fold in order to get the final model. The ensembling was done by taking the mean of the probabilities returned by each model.

3 Data Description

The organisers provided a training dataset of 6,839 tweets, annotated with Hate Speech and Stereotype labels (as described in Table 2). The test data of HaSpeeDe 2 consists of two subsets: an in-domain set (1,263 tweets) and an out-of-domain set (500 newspaper headlines).

Dataset                      HS      Ster
Development data (tweets)    0.404   0.445
Test data (tweets)           0.492   0.450
Test data (news)             0.362   0.350

Table 2: Distribution of Hate Speech (HS) and Stereotype (Ster) labels in the HaSpeeDe 2 data.

The hate speech labels are slightly unbalanced towards non-hate speech. We therefore tried adapted losses to counteract a tendency towards non-hate speech predictions. We used a class-weighted loss, which assigns a higher weight to the observations from the minority class when computing the loss. We also tried a smoothed F1-score, a differentiable loss in phase with the F1. Neither approach improved the results in a significant way.

The pre-processing was simple. We removed emoticons and hashtags, and we replaced URLs and user names with the associated tags used in the evaluation data. Each tweet was padded to a length of 100 tokens. We then used the pre-processing and tokenization pipeline specific to each language model, as provided by the authors of the models.
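For illustration, this pre-processing can be written as a few regular expressions. The "URL" and "@user" placeholder tags and the emoji pattern below are assumptions made for this sketch, chosen to mimic the tags found in the evaluation data.

```python
import re

# Illustrative version of the pre-processing step; the placeholder tags
# "URL" and "@user" and the emoji range are assumptions for this sketch.
def preprocess(tweet: str) -> str:
    tweet = re.sub(r"https?://\S+", "URL", tweet)          # replace links
    tweet = re.sub(r"@\w+", "@user", tweet)                # replace user mentions
    tweet = re.sub(r"#\w+", "", tweet)                     # remove hashtags
    tweet = re.sub(r"[\U0001F300-\U0001FAFF]", "", tweet)  # remove (most) emoji
    return tweet.strip()

print(preprocess("@mario Guarda qui https://t.co/abc #politica"))
# -> "@user Guarda qui URL"
```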
4 Results

4.1 Macro F1-score

The metric used for the evaluation is the macro F1-score. The F1-score of a class is the harmonic mean of the precision and recall for that class. The macro F1-score is the mean of the F1-scores over all classes; it is less sensitive to class imbalance.

4.2 Baselines

We used several baselines to evaluate our results during the development process. The first are dummy classifiers: one that always predicts the most frequent class, and one that makes a random stratified prediction according to the distribution of the classes in the training data. We also computed the results of more developed systems, namely a TF-IDF bag-of-words model and a BiLSTM with trainable word vector inputs.

The HaSpeeDe 2 organisers provided two baseline systems after the results were submitted. The first is a most frequent class predictor; the second is a linear SVM with unigrams, char-grams, and TF-IDF representation.

4.3 Validation Results

We tuned the hyper-parameters for each evaluated language model as described in Section 2.4. For each language model, we then computed 5-fold cross-validation results on the HaSpeeDe 2 training data. The averages of the 5 macro F1-scores are shown in Table 3.

System                          HS      Ster
Baselines
  Most frequent class           0.374   0.353
  TF-IDF bag-of-words           0.703   0.677
  Word vectors + BiLSTM         0.721   0.654
Multilingual language models
  mBERT                         0.757   0.716
  XLM-RoBERTa                   0.761   0.677
Italian language models
  AlBERTo                       0.773   0.716
  PoliBERT                      0.795   0.733
  UmBERTo                       0.799   0.733

Table 3: Macro F1-scores averaged over 5-fold cross-validation on the HaSpeeDe 2 training data.

4.4 Test Results

The scores of our two systems evaluated on the HaSpeeDe 2 test data are summarized in Table 4. These systems are 5 UmBERTo models trained on each of the 5 training folds and ensembled. The second system is the same as the first, with the addition of multitask learning.

System                       Tweets   News
Hate Speech Detection
  Most frequent class        0.337    0.389
  Classic features + SVM     0.721    0.621
  UmBERTo                    0.790    0.671
  UmBERTo + multitasking     0.809    0.660
  Best HaSpeeDe 2            0.809    0.774
Stereotype Detection
  Most frequent class        0.355    0.394
  Classic features + SVM     0.715    0.669
  UmBERTo                    0.772    0.685
  UmBERTo + multitasking     0.768    0.647
  Best HaSpeeDe 2            0.772    0.720

Table 4: Macro F1-scores on the HaSpeeDe 2 test datasets.

5 Discussion

5.1 Multilingual and monolingual models

According to Table 3, multilingual models performed worse than monolingual models when trained on HaSpeeDe 2 data alone, although they achieved respectable results. Moreover, even when we used additional data from other languages to train the multilingual models, they still did not manage to outperform the monolingual models, as we were hoping they would.

Among the Italian models, UmBERTo and PoliBERT performed better than AlBERTo on these tasks. While the good performance of PoliBERT can be linked to its pre-training on a tweet classification task (sentiment analysis) potentially useful for hate speech detection, it is more difficult to explain the competitiveness of UmBERTo, which was trained on data not coming from Twitter and less plentiful than that used for AlBERTo. One explanation could be the better quality of this data, or a better optimisation by its creators.

5.2 Out-of-domain and in-domain data

Our results on the HaSpeeDe 2 test dataset are summarized in Table 4. The results obtained on in-domain data correspond to what we expected from our cross-validation results. Our systems achieved the best macro F1-scores on the in-domain test set (Tweets) for both hate speech and stereotype detection. However, the results on out-of-domain data (News) are far from being as good. This can be explained by the different distribution of this data compared to the training data.

Table 5 shows the confusion matrix of our first system evaluated on out-of-domain data. The error is mostly due to the high number of false negatives: the classifier predicts too many sequences as non-hate speech. This suggests that this classifier, trained on hate speech from Twitter, struggles to detect hate speech in newspaper headlines. It can be assumed that hate speech in newspapers is more subtle, with less of the coarseness and aggressiveness that make it easier to detect on Twitter.

              Predicted False   Predicted True
Actual False  312               7
Actual True   117               64

Table 5: Hate speech confusion matrix for UmBERTo evaluated on the news test data.

5.3 Multitasking Benefits

We chose to submit one system with multitask learning on both Stereotype and Hate Speech detection and one without, in order to study its benefits. Indeed, the system with multitask learning performed much better on the in-domain data for the hate speech detection task. This is not the case, however, for the out-of-domain data, nor for the stereotype detection task.

Table 6 describes in more detail the differences between the predictions of the two systems for data containing stereotypes and data not containing stereotypes. We observed that the improvement linked to multitask learning consists mainly in a reduction of the number of false positives in favour of the number of true negatives on data not labeled as Stereotype. Assuming that hate speech makes significant use of stereotype, one could suppose that the multitask model has learned to discard some data that do not have the characteristics of stereotypes and are therefore unlikely to contain hate speech.

Data labeled as Stereotype
              Predicted False   Predicted True
Actual False  +3                -3
Actual True   +7                -7

Data not labeled as Stereotype
              Predicted False   Predicted True
Actual False  +28               -28
Actual True   +1                -1

Table 6: Hate speech confusion matrix of the multitask system minus that of the single-task system, for Stereotype and non-Stereotype tweet test data.

6 Conclusion

In this work, we compared the fine-tuning of multilingual and monolingual BERT-based language models for hate speech detection. We also investigated the addition of multitask learning with the Stereotype detection task, which is linked to hate speech. We obtained the best macro F1-scores of HaSpeeDe 2 on the in-domain test data. However, the results were worse on the out-of-domain test data, and further research could be conducted to better understand the reasons for this and to address it.

References

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee. 2020. Deep Learning Models for Multilingual Hate Speech Detection.

Gianfranco Barone. 2020. Politic BERT based Sentiment Analysis. https://huggingface.co/unideeplearning/polibert_sa. Accessed on Sept 18, 2020.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.
James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-Parameter Optimization. The Journal of Machine Learning Research, 13:281–305.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In EVALITA@CLiC-it.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Chiara Francesconi, Cristina Bosco, Fabio Poletto, and Manuela Sanguinetti. 2019. Error Analysis in a Hate Speech Detection Task: The Case of HaSpeeDe-TW at EVALITA 2018. In CLiC-it.

Ting Gong, Tyler Lee, Cory Stephenson, Venkata Renduchintala, Suchismita Padhy, Anthony Ndirango, Gokce Keskin, and Oguz Elibol. 2019. A comparison of loss weighting strategies for multi-task learning in deep neural networks. IEEE Access.

Alex Kendall, Yarin Gal, and Roberto Cipolla. 2017. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. CoRR, abs/1705.07115.

Lukas Liebel and Marco Körner. 2018. Auxiliary Tasks in Multi-task Learning. CoRR, abs/1805.06334.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach.

Merriam-Webster. 2020. stereotype, noun. https://www.merriam-webster.com/dictionary/stereotype. Accessed on 2020-11-05.

John T. Nockleby. 2000. Hate Speech. Macmillan, New York.

Loreto Parisi, Simone Francia, and Paolo Magnani. 2020. UmBERTo: an Italian Language Model trained with whole word Masking. https://github.com/musixmatchresearch/umberto. Accessed on Sept 18, 2020.

Marco Polignano, Pierpaolo Basile, Marco De Gemmis, and Giovanni Semeraro. 2019a. Hate Speech Detection through AlBERTo Italian Language Understanding Model. In NL4AI@AI*IA.

Marco Polignano, Pierpaolo Basile, Marco De Gemmis, Giovanni Semeraro, and Valerio Basile. 2019b. AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481. CEUR.

Sebastian Ruder. 2017. An Overview of Multi-Task Learning in Deep Neural Networks. CoRR, abs/1706.05098.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. Overview of the EVALITA 2020 Second Hate Speech Detection Task (HaSpeeDe 2). In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Alessandro Seganti, Helena Sobol, Iryna Orlova, Hannam Kim, Jakub Staniszewski, Tymoteusz Krumholc, and Krystian Koziel. 2019. NLPR@SRPOL at SemEval-2019 Task 6 and Task 5: Linguistically enhanced deep learning offensive sentence classifier. In SemEval@NAACL-HLT.

Karen Sparck Jones. 1988. A Statistical Interpretation of Term Specificity and Its Application in Retrieval, pages 132–142. Taylor Graham Publishing, GBR.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification? In Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu, editors, Chinese Computational Linguistics, pages 194–206, Cham. Springer International Publishing.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.