=Paper=
{{Paper
|id=Vol-2936/paper-184
|storemode=property
|title=University of Regensburg @ PAN: Profiling Hate Speech Spreaders on Twitter
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-184.pdf
|volume=Vol-2936
|authors=Kwabena Odame Akomeah,Udo Kruschwitz,Bernd Ludwig
|dblpUrl=https://dblp.org/rec/conf/clef/AkomeahKL21
}}
==University of Regensburg @ PAN: Profiling Hate Speech Spreaders on Twitter==
University of Regensburg @ PAN: Profiling Hate Speech Spreaders on Twitter
Notebook for PAN at CLEF 2021

Kwabena Odame Akomeah, Udo Kruschwitz and Bernd Ludwig
University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany
kaodamie77@gmail.com (K. O. Akomeah); udo.kruschwitz@ur.de (U. Kruschwitz); bernd.ludwig@ur.de (B. Ludwig)
ORCID: 0000-0002-5503-0341 (U. Kruschwitz)

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper reports on our approach to the shared task Profiling Hate Speech Spreaders on Twitter for both English and Spanish, organised as part of the PAN@CLEF 2021 challenge. We submitted one run for each language, based on pre-trained language models. For English we fine-tuned a BERT model, while for Spanish we used a language-agnostic BERT-based sentence embedding model without fine-tuning. The second approach turned out to be considerably more effective than the first. Given the simplicity of both approaches, there is plenty of room for future work based on the architectures adopted here.

Keywords

Hate Speech, BERT, Embeddings, Sentence Encoder

1. Introduction

Hate speech is not a new phenomenon, but it has become more and more of a problem in recent years and has consequently attracted a lot of attention in the research community, making hate speech detection a very active research field, e.g. [1]. In particular, the growing impact of social media on the way people share and access information has demonstrated the need to tackle the problem systematically, as issues such as cyber-bullying and other hurtful and anti-social behaviours [2, 1, 3] have become a growing cancer that needs to be tackled broadly across many different platforms and applications. We should note that removing hate speech is not as simple as it seems: there is a fine balance between filtering hate speech and possibly restricting freedom of speech if a perfectly reasonable opinion is incorrectly flagged as hate speech and subsequently removed [4].

The motivation of this task [5] is to move from a purely reactive to a more pro-active approach that does not simply identify messages as hate speech but instead identifies social media users as hate speech spreaders, thereby allowing the problem to be addressed more effectively (e.g. by suggesting to the owner of the social media platform that such users be banned).

Transformer-based methods have been demonstrated to be highly effective for a wide range of NLP tasks, e.g. [6]. This is the reason we adopt state-of-the-art pre-trained transformer-based deep neural text embeddings for tackling the PAN sub-task on profiling hate speech spreaders on Twitter.

In this report, we provide an overview of the steps taken and the models used in our experiments. We start by briefly describing the dataset, the pre-processing steps and the models used, before we report the results obtained for our two submissions as compared to the baselines [7] provided by the organizers.
2. PAN Task 3: Profiling Hate Speech Spreaders on Twitter

The third task [5] of the PAN challenge at CLEF 2021 [8] involves profiling hate speech spreaders on Twitter, targeting for instance immigrants and women, using data sampled from individual users' timelines. The training dataset consists of 40,000 tweets in XML format: 200 tweets sampled for each of 200 anonymized users, for each of the two languages, English and Spanish. The test set contains tweets from 100 anonymized users per language. The tasks were treated separately for each language, and two different models were therefore used, one for English and one for Spanish. A small snapshot of the raw English training data is reproduced in Figure 1. Note that the 200 tweets of a hate speech spreader may not all contain hate speech. The aim of the experiment is to discover whether frequent hate spreaders can be identified based on their timeline history. Systems are ranked by the average of the accuracy scores achieved on the English and Spanish test sets. Submission and evaluation of this year's tasks were done on TIRA [9] or sent to the organizers by mail. All code used in this experiment can be accessed via GitHub (https://github.com/kaodamie).

Figure 1: Small sample of the raw XML of the English training data [figure not reproduced]

2.1. Preparing the Data using Contextual Embeddings

In recent years, transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) have emerged as the dominant paradigm in a broad range of NLP applications, ranging from translation to classification, e.g. [6, 10]. Part of the success story is the fact that the expensive task of pre-training is only done once, and the pre-trained model can subsequently be fine-tuned for each NLP task at hand with just one additional output layer to create state-of-the-art models. As a consequence, a large number of different BERT-based models have emerged, e.g. [11, 6, 12, 13, 14].

Specifically, we turn the textual representation of the input documents (tweets) into contextual embeddings as follows. The input data (in XML format) is first turned into tensors for input to the Keras model by extracting all text for each user using an XML parser and pandas in Python. The dataframes, indexed by user id, are then formatted into tensors ready to be used as input for the transformer model. The training dataset was split for train-test purposes into 160 users for training, 32 for validation and 8 for testing, and was shuffled for each run.
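As a concrete illustration, the following is a minimal sketch of this loading pipeline. It assumes the usual PAN author-profiling layout (one XML file per user containing <document> elements for the tweets, plus a truth.txt with "user-id:::label" lines); the directory path and function name are illustrative, not taken from our actual code.

```python
# Minimal data-loading sketch (assumed layout: one XML file per user with
# <document> elements, and a truth.txt mapping "user-id:::label").
import glob
import os
import xml.etree.ElementTree as ET

import pandas as pd
import tensorflow as tf


def load_pan_data(data_dir):
    """Parse all user XML files and join them with the truth labels."""
    labels = {}
    with open(os.path.join(data_dir, "truth.txt")) as f:
        for line in f:
            user_id, label = line.strip().split(":::")
            labels[user_id] = int(label)

    rows = []
    for path in glob.glob(os.path.join(data_dir, "*.xml")):
        user_id = os.path.splitext(os.path.basename(path))[0]
        tree = ET.parse(path)
        # Concatenate the 200 tweets of each user into one long string.
        tweets = [doc.text or "" for doc in tree.iter("document")]
        rows.append({"user_id": user_id,
                     "text": " ".join(tweets),
                     "label": labels[user_id]})
    return pd.DataFrame(rows).set_index("user_id")


df = load_pan_data("data/en").sample(frac=1.0)  # shuffle the 200 users

# 160 / 32 / 8 split as described above.
train_df, val_df, test_df = df[:160], df[160:192], df[192:]

# Tensors ready for the Keras model.
train_ds = tf.data.Dataset.from_tensor_slices(
    (list(train_df["text"]), list(train_df["label"]))).batch(32)
val_ds = tf.data.Dataset.from_tensor_slices(
    (list(val_df["text"]), list(val_df["label"]))).batch(32)
```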
2.2. Approach for the English Run

The experiment for the English task involved binary classification, with a test set of 100 users similarly parsed from XML. The training set is composed of 200 users with 200 tweets each. The model architecture used was a dense artificial neural network with a single sigmoid-activated output. Given the success of BERT-based models in NLP, we employed ALBERT, a BERT-based model trained on a large English corpus with a reduced number of parameters and without a significant effect on benchmark performance [11]. Reusing the model only required fine-tuning its parameters on the dataset, which means that the output of ALBERT is learned as well during training. The network had a dense layer receiving the BERT encoder outputs, with a dropout of 0.1 and sigmoid activation on a single output layer. The network ran for 10 epochs with 5 steps per epoch and a batch size of 32.

The loss function used was binary cross-entropy with an adaptive moment estimation (Adam) optimizer and a learning rate of 1e-6. The metric used for training was binary accuracy, in line with accuracy being the specified evaluation metric of the challenge. A checkpoint was implemented for the neural network: since the epochs had an average runtime of about 250 seconds, a larger number of epochs would have been costly. The checkpoint monitored the minimum binary validation loss with a patience of 3 epochs. The validation loss was chosen rather than the training loss in order to check for overfitting on the training dataset. A sketch of this setup is shown after Table 1.

Table 1: Accuracy results for English (ALBERT)

Sample            Size   Accuracy
Training          160    0.59
Validation        32     0.65
Test              8      0.86
Evaluation Test   100    0.53
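Putting the pieces together, a minimal Keras sketch of the English setup could look as follows. The TF Hub handles are assumptions (the standard public ALBERT preprocessor and encoder releases), the checkpoint file name is illustrative, and train_ds/val_ds are the tensors from the data-loading sketch in Section 2.1.

```python
# Minimal sketch of the English model: fine-tuned ALBERT encoder, dropout
# of 0.1, single sigmoid output; hyperparameters as described above.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the preprocessing ops

preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/albert_en_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/albert_en_base/3",
    trainable=True)  # fine-tune the encoder together with the head

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
pooled = encoder(preprocess(text_input))["pooled_output"]
x = tf.keras.layers.Dropout(0.1)(pooled)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(text_input, out)

model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
              metrics=["binary_accuracy"])

# Checkpoint on the minimum validation loss with a patience of 3 epochs.
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("albert_best_weights",
                                       monitor="val_loss",
                                       save_best_only=True,
                                       save_weights_only=True),
]
model.fit(train_ds.repeat(),  # repeat so 10 epochs x 5 steps are available
          validation_data=val_ds,
          epochs=10, steps_per_epoch=5,
          callbacks=callbacks)
```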
2.3. Approach for the Spanish Run

A model similar to the English run was used for the Spanish run, in that we used a transformer-based model consisting of a text input layer, a preprocessing layer, an encoding layer, and a single densely-connected output layer, again with a dropout of 0.1. The binary cross-entropy loss function and the Adam optimizer were also used for the Spanish run.

The preprocessing layer was the multilingual universal sentence encoder preprocessor [10]. This preprocessor is a companion to the BERT models that turns plain-text inputs into the input format expected by BERT. It uses a vocabulary for multilingual models extracted from Wikipedia, CommonCrawl and translation pairs from the Web. It has no trainable parameters and can be used in an input pipeline outside the training loop [10, 15].

The encoder layer used was the language-agnostic BERT sentence embedding model (LaBSE) [16], which supports about 109 languages including Spanish. LaBSE encodes sentences into high-dimensional vectors and is trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. Because of its usefulness for sentence translation in larger multilingual corpora, text classification, semantic similarity, clustering and other natural language tasks [16, 15], we applied it to this classification task. The encoder was not fine-tuned because of its large memory requirements: running on Google Colab required about 12 gigabytes of RAM, and for better performance and speed a GPU and more than 32 gigabytes of RAM are recommended. The model was trained for 10 epochs with callbacks on the binary validation loss.
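A minimal sketch of this frozen-encoder setup follows. Again the TF Hub handles are assumptions (the public LaBSE release and its companion multilingual preprocessor), train_ds_es/val_ds_es are hypothetical names for Spanish datasets built as in the earlier data-loading sketch, and the patience value simply mirrors the English run.

```python
# Minimal sketch of the Spanish model: frozen LaBSE encoder behind the
# multilingual preprocessor, with the same dropout-plus-sigmoid head.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the preprocessing ops

preprocess = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/"
    "multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2",
                         trainable=False)  # the encoder is not fine-tuned

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
# LaBSE returns a dict; "default" holds the pooled sentence embedding.
embedding = encoder(preprocess(text_input))["default"]
x = tf.keras.layers.Dropout(0.1)(embedding)
out = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(text_input, out)

model.compile(loss="binary_crossentropy",
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["binary_accuracy"])

# 10 epochs with a callback on the binary validation loss.
model.fit(train_ds_es, validation_data=val_ds_es, epochs=10,
          callbacks=[tf.keras.callbacks.EarlyStopping(
              monitor="val_loss", patience=3, restore_best_weights=True)])
```

Keeping the encoder frozen means only the small dense head is trained, which is what makes this run feasible within the memory limits described above.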
3. Results Obtained

During training, the fine-tuned ALBERT model used for the English task peaked at a best binary validation accuracy of 0.65, as illustrated in Table 1. A possible explanation for this rather low figure is the length of the training inputs: each instance concatenates the 200 tweets of one user into a very long sequence. Moreover, even with 200 tweets per user, those users profiled as hate speech spreaders may still look quite similar to non-hate spreaders, as not all tweets in the history of a hate spreader contain hate. Training a model to perform classification on such a dataset can therefore be quite a challenge. The Spanish model, which applied a BERT-based language-agnostic sentence encoder that was not fine-tuned, peaked at a binary validation accuracy of 0.71 (see Table 2) and, unlike the English system, performed better on the test set.

On the evaluation test set of the challenge, the models attained an accuracy of 0.53 for English and 0.77 for Spanish. It is not easy to put these numbers in context, other than observing that the performance of the English run was surprisingly low. The language-agnostic BERT-based sentence encoder, on the other hand, performed better: our Spanish run outperformed three of the baselines [7] submitted by the organizers (see Table 3).

Table 2: Accuracy results for Spanish (LaBSE)

Sample            Size   Accuracy
Training          160    0.56
Validation        32     0.71
Test              8      0.75
Evaluation Test   100    0.77

Table 3: Baseline comparison of accuracy results on the Spanish test set

Model            Accuracy
Word nGram+SVM   0.83
LSDE             0.82
USE+LSTM         0.79
LaBSE (ours)     0.77
MBERT-LSTM       0.76
XLMR-LSTM        0.73
TFIDF-LSTM       0.51

Understanding why different approaches perform better or worse on a particular dataset is not easy in any case, in particular when it comes to the explainability and interpretability of neural network-based approaches, e.g. [17, 18], as performance can be affected by many parameters, including the particular sample, the learning rate and the initialized weights used in training. What we do see, however, is a huge performance variation across training, validation and test data, as well as across different submissions for this task. We attribute this in part to the small sample, which makes it difficult to draw generalizable conclusions from a single run. We conclude that the approaches need to be tested on a wide range of additional collections to gain a better understanding of their strengths and weaknesses, as well as the variation of results and robustness more generally, something that fits well with the idea of moving away from training systems that do amazingly well on specific collections but tend to fall over when applied elsewhere, e.g. [19].

4. Conclusions

The use of transformer-based models for NLP tasks has pushed the state of the art in many applications. In this experiment, two different (and very simple) BERT-based models were applied in an attempt to classify hate speech spreaders by training on the bulk of their timelines. We extracted contextual embeddings using pre-trained transformer-based models and ran them through a single-layer output for classification. What we found in our experiments was that the power of transformer-based approaches varied substantially across runs, and some traditional baselines did in fact perform surprisingly well (at a much lower overall cost). We do not, however, see this as a reflection of the weakness of more sophisticated methodologies, but rather as an issue arising from the datasets used for training and testing. The aim should be to explore a wide range of datasets to find out which of the methods is most robust, something particularly important when thinking about hate speech. A particularly promising approach, which has been shown to work well in many NLP tasks including hate speech detection [4], is to use ensemble classifiers, which can tap into the different strengths of individual classifiers, be they transformer-based or traditional.

Acknowledgements

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation, grant number 95564.

References

[1] M.-A. Rizoiu, T. Wang, G. Ferraro, H. Suominen, Transfer learning for hate speech detection in social media, arXiv preprint arXiv:1906.03829 (2019).
[2] W. Warner, J. Hirschberg, Detecting hate speech on the world wide web, in: Proceedings of the Second Workshop on Language in Social Media, 2012, pp. 19–26.
[3] D. Ognibene, D. Taibi, U. Kruschwitz, R. S. Wilkens, D. Hernandez-Leo, E. Theophilou, L. Scifo, R. A. Lobo, F. Lomonaco, S. Eimler, H. U. Hoppe, N. Malzahn, Challenging social media threats using collective well-being aware recommendation algorithms and an educational virtual companion, arXiv preprint arXiv:2102.04211 (2021).
[4] S. Zimmerman, U. Kruschwitz, C. Fox, Improving hate speech detection with deep learning ensembles, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[5] F. Rangel, G. L. D. L. P. Sarracén, B. Chulvi, E. Fersini, P. Rosso, Profiling Hate Speech Spreaders on Twitter Task at PAN 2021, in: CLEF 2021 Labs and Workshops, Notebook Papers, CEUR-WS.org, 2021.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[7] F. Rangel, M. Franco-Salvador, P. Rosso, A Low Dimensionality Representation for Language Variety Identification, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2016, pp. 156–169.
[8] J. Bevendorff, B. Chulvi, G. L. D. L. P. Sarracén, M. Kestemont, E. Manjavacas, I. Markov, M. Mayerl, M. Potthast, F. Rangel, P. Rosso, E. Stamatatos, B. Stein, M. Wiegmann, M. Wolska, E. Zangerle, Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection, in: 12th International Conference of the CLEF Association (CLEF 2021), Springer, 2021.
[9] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[10] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al., Universal sentence encoder, arXiv preprint arXiv:1803.11175 (2018).
[11] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, arXiv preprint arXiv:1909.11942 (2019).
[12] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[13] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, arXiv preprint arXiv:2003.10555 (2020).
[14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[15] Z. Yang, Y. Yang, D. Cer, J. Law, E. Darve, Universal sentence representation learning with conditional masked language model, arXiv preprint arXiv:2012.14388 (2020).
[16] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding, arXiv preprint arXiv:2007.01852 (2020).
[17] C. Rudin, C. Chen, Z. Chen, H. Huang, L. Semenova, C. Zhong, Interpretable machine learning: Fundamental principles and 10 grand challenges, arXiv preprint arXiv:2103.11251 (2021).
[18] Y. Zhang, P. Tiňo, A. Leonardis, K. Tang, A survey on neural network interpretability, arXiv preprint arXiv:2012.14261 (2020).
[19] S. R. Bowman, G. E. Dahl, What will it take to fix benchmarking in natural language understanding?, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6–11, 2021, Association for Computational Linguistics, 2021, pp. 4843–4855. URL: https://www.aclweb.org/anthology/2021.naacl-main.385/.