Automatic Detection of Hope Speech Daniel García-Baena Computer Science Department, SINAI research group, CEATIC, Universidad de Jaén, Spain Abstract Hope speech is a type of discourse that has the power to help, inspire people for good and even relax hostile environments. The automatic detection of hope speech is an open challenge in Natural Language Processing that has been generally eclipsed by hate speech detection. Rather than simply deleting hate speech from the Internet, restricting freedom of speech, according to the outstanding importance that psychology gives to hope and the success that some social experiments had when they highlighted hope speech over the rest of the texts, we find specially necessary to study in depth the automatic identification of hope speech. In this work, we describe a thesis project that focuses on the development of new datasets and systems that allow the automatic detection, by means of different classical machine learning techniques and new deep learning architectures, of hope speech, mainly in Spanish. Keywords Hope speech, natural language processing, language that relaxes hostile environments, language that promotes equality, diversity and inclusion 1. Justification of the research Hope speech is the type of speech that is able to relax a hostile environment [1] and that helps, gives suggestions and inspires for good to a number of people when they are in times of illness, stress, loneliness or depression [2]. Detect it automatically, so that positive comments can be more widely disseminated, can have a very significant effect when it comes to combating sexual or racial discrimination or when we seek to foster less bellicose environments [1]. As stated in the work of Chakravarthi [2], hope speech is defined as the language that is related to fostering individuals’ potential, supporting them and reaffirming their self-confidence, as well as, again, making motivational and inspirational suggestions in difficult times of illness, loneliness, stress or depression [3]. However, Palakodety et al. [1] differ from the above definition and establish as hope speech simply that which has the capacity to relax situations of tension and violence. Even, Chakravarthi [2] also introduces a possible variation of what is meant by hope speech, now taking into account the ability of language to promote equality, diversity and inclusion (EDI) of women belonging to the fields of science, technology, engineering and management (STEM), lesbian, gay, bisexual, transgender, intersex and queer individuals (LGBTIQ); and racial minorities and individuals with disabilities. Doctoral Symposium on Natural Language Processing from the Proyecto ILENIA, 28 September 2023, Jaén, Spain. Envelope-Open daniel.gbaena@gmail.com (D. García-Baena) Orcid 0000-0002-3334-8447 (D. García-Baena) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings In this thesis it is pretended to elaborate resources in order to automatically classify hope speech. Therefore, it will be created a new Spanish written dataset for hope speech identification and it will be developed too some systems for detecting hope speech. 2. Previous works As this is a recent task to be tackled automatically from Natural Language Processing (NLP),only a few corpora are available. Until now, the work that has been done in relation to hope speech identification has been focused in developing new datasets for English, Malayalam and Tamil; and automatic detection systems based on classic machine learning strategies and modern deep learning architectures. They will be discussed below. 2.1. HopeEDI The HopeEDI dataset [2] contains comments in English, Malayalam and Tamil. It consists of data obtained from comments posted on YouTube videos that were collected from November 2019 to June 2020. The corpus can be downloaded free of charge at Hugging Face: https: //huggingface.co/datasets/hope_edi. The subject matter of the comments written in English is EDI (Equality, Diversity and Inclusion). In this case, the comments come from videos posted by Indian and Sri Lankan users. It is important to note that since India is a multilingual country, many of the comments may be written in several languages at the same time (code-mixing). For the HopeEDI corpus, its author applied different machine learning algorithms on a TF- IDF (Term Frequency-Inverse Document Frequency) representation of the tokens. Specifically, the corpus was evaluated with the following: Bayesian multinomial classifier (multinomial Naïve Bayes or MNB) with a value of alpha equal to 0.7, k-nearest neighbors method, support vector machine (SVM), decision tree (DT), and with Logistic Regression (LR). In any case, for all commented techniques, results scored an F1 value no better than 0.56 and, consequently, they were quite disappointing. 2.2. India-Pakistan This dataset contains data from English comments posted on videos from YouTube [1]. The researchers chose this site as the source of the data because it is the most widely used video broadcasting platform in India and Pakistan today. Unfortunately, this dataset is not publicly available. For their compilation, a series of queries were prepared and then extended with searches related to the Kashmir conflict by consulting trends from India and Pakistan that took place between February 14, 2019 and March 13, 2019. Finally, such queries were used to search for related videos on YouTube and subsequently obtain their comments using the public API of that social network. The comments are all written in English and come from mainly Indian and Pakistani users. There are also comments submitted by immigrants from India and Pakistan, whose were in Bangladesh, Nepal, United States, United Kingdom, Afghanistan, China, Canada and Russia. In this case, the origin of the users was taken into account with the intention of maintaining an equal representation of citizens belonging to both sides of the conflict. This time, the authors used a logistic regression with L2 regularization classifier (Ridge Regression). The experiment was run a total of one hundred times on one hundred random sections of the dataset and achieved an F1 value of 0.79. 2.3. KanHope KanHope dataset [4] contains comments in code-mixed Kannada-English. All data was collected with the app YouTube Comment Scraper between February 2020 and August 2020. The dataset is publicly available on Hugging Face: https://huggingface.co/datasets/kan_hope. KanHope gathers comments from several videos on distinctive topics such as movie trailers, India-China border dispute, people’s opinion about the ban on several mobile apps in India, Mahabharata and other social issues that involved oppression, marginalization and mental health. KanHope dataset authors emphasize on the inclusion of people of marginalized communities, such as LGBT, racial and gender minorities. All comments were from users based in India and, being it a multilingual country, researchers were motivated to extract the comments to work on code-mixed texts. The corpus authors applied from primitive machine learning to complex deep learning approaches. The model DC-BERT4HOPE (roberta-mbert) obtained the best results for F1-scores with 0.752, followed by DC-BERT4HOPE (bert-mbert): 0.735, mBERT: 0.726, DC-BERT4HOPE (roberta-xlm): 0.720, and random forest with 0.706. 2.4. SpanishHopeEDI Finally, we have generated a quality dataset SpanishHopeEDI [5], a new Spanish Twitter corpus on LGBT community, and we have conducted some experiments that can serve as a baseline for further research. The dataset consists of 1,650 LGBT-related tweets annotated as HS (Hope Speech) or NHS (Non Hope Speech). A tweet is considered as HS if the text: 1. Explicitly supports the social integration of minorities. 2. Is a positive inspiration for the LGTBI community. 3. Explicitly encourages LGTBI people who might find themselves in a situation or uncondi- tionally promotes tolerance. On the contrary, a tweet is marked as NHS if the text: 1. Expresses negative sentiment towards the LGTBI community 2. Explicitly seeks violence or uses gender-based insults. The dataset was created from LGBT-related tweets. All of those tweets were written in Spanish and were collected using the Twitter API. As seed for the search we used a lexicon of LGBT-related terms, such as #OrgulloLGTBI and #LGTB. In addtion, it should be mentioned that our SpanishHopeEDI dataset was included in the second workshop on Language Technology for Equality, Diversity and Inclusion that was held as a part of the ACL 2022 [6]. 3. Description, hypotheses and objectives EDI is an important issue in many different areas. Language is a fundamental tool for commu- nication and it must be inclusive and treat everyone equally. However, sometimes on social media this is not the case, as more offensive messages are posted towards people because of their race, color, ethnicity, gender, sexual orientation, nationality or religion. As Chakravarthi [2] stated, the importance of the social media on the lives of vulnerable groups, such as for people belonging to the LGBT community, racial minorities or individuals with disabilities; plays an essential role in shaping their personalities and how they perceive society [7, 8, 9]. Therefore, it is found important to focus on researching on the inclusion of this people and to use promoting positive content on social media, in pursuit of EDI. The importance of hope has already been carefully studied by psychologists and, consequently, we can affirm that hope plays a crucial role in the well-being, recovery and restoration of humans [2]. Greater hope is consistently related to a better academic, athletic, physical health, psychological adjustment and psychotherapy outcomes. In general, Hope Theory is comparable to theories of Learned Optimism, Optimism, Self-Efficacy and Self-Esteem [10]. Individuals with high doses of hope do not react in the same way to barriers as those with low amounts of hope, but instead view barriers as challenges to overcome and use their pathway thoughts to plan an alternative route to their goals [11, 12]. In addition, high levels of hope has been found to be correlated with a number of beneficial elements, such as academic performance [13] and lower levels of depression [14]. In contrast, low hope proportions are associated with negative outcomes, such as reduced well-being [15]. Therefore, it is relevant to analyze the state of the art of automated hope speech detection technologies from the perspective of NLP. In this sense, automated detection of hope speech can be especially useful in promoting the dissemination of hopeful messages to those in difficult times and can be used to promote positive messages to support EDI. Previous studies have shown that a snowball effect occurs in social media and abusive comments lead to more abusive comments and positive comments inspire people to leave more positive comments [16, 17]. In order to study this, Facebook conducted an experiment by modifying its Newsfeed algorithm to show more positive or negative posts to certain users [18]. Their results showed that people tend to write positive posts when they see happy posts in their newsfeeds and vice versa. All this suggest the importance of reinforce positivity on social media, focusing then on promoting hope speech. Hence, it was considered important to pursue the following objectives: 1. To theoretically study the concept of hope speech, as well as its treatment from an NLP point of view. 2. Analyzing the already existing hope speech detection solutions and discussing the prob- lems derived from them. 3. To make a review of all available resources, providing experiences and an accessible introduction to those researchers who may be interested in tackle this problem. 4. Make a new dataset focused on the LGBT community for Spanish hope speech detection. 5. Create baseline experiments using machine learning and deep learning algorithms, in- cluding, of course, cutting edge technologies as transformers models. 6. Develop an extensive error analysis in order to be able to determine future directions of this study. 4. Methodology The methodology that is proposed in order to achieve the objectives of this thesis is detailed below: 1. Firstly, it is necessary to carefully review the state of the art of hope speech classification. Therefore, it will be important to evaluate both already existing corpus and classification systems. 2. Secondly, we will part from some of the currently available resources, in relation to hope speech detection, and we will develop new ones with the intention of make it possible to detect hope speech sentences from texts written in Spanish. 3. Therefore, it will be created a new corpus, containing several texts written only in Spanish, that we will focus in EDI. 4. Then, we will create different systems that will use the last dataset for making possible to automatically identify hope speech texts. 5. And, finally, we will experiment with and evaluate our new resources so as to improve them, always sharing our work with the scientific community, publishing all the results and organizing shared tasks. 5. Research questions The main research questions that we pretend to respond with this work are all of them listed afterwards: • How similar it is to detect hope and hate speech? • It is possible to elaborate unambiguous hope speech tagging notes? • Are tagging notes for hope speech corpus dependent of the language in which the texts from the dataset were written? • It is interesting, or useful, to create hope speech datasets for making possible to automati- cally detect it? • What can we learn from the already existing datasets for hope speech detection in languages different than Spanish? • For new classification systems, how could we improve them? • In relation to hope speech detection, is it viable to identify the main causes of possible classification errors? • What algorithms are the best for automatic detection of hope speech? Acknowledgments This work has been partially supported by Project CONSENSO (PID2021-122263OB-C21), Project MODERATES (TED2021-130145B-I00) and Project SocialTox (PDC2022-133146-C21) funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, Project PRECOM (SUBV-00016) funded by the Ministry of Consumer Affairs of the Spanish Gov- ernment, Project FedDAP (PID2020-116118GA-I00) supported by MICINN/AEI/10.13039/501100011033, WeLee project (1380939, FEDER Andalucía 2014-2020) funded by the Andalusian Regional Gov- ernment and by a grant from Fondo Social Europeo and the Administration of the Junta de Andalucía (DOC_01073). References [1] S. Palakodety, A. R. KhudaBukhsh, J. G. Carbonell, Hope speech detection: A computational analysis of the voice of peace, arXiv preprint arXiv:1909.12940 (2019). [2] B. R. Chakravarthi, HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion, in: Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s in Social Media, Association for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 41–53. URL: https: //aclanthology.org/2020.peoples-1.5. [3] C. R. Snyder, S. J. Lopez, H. S. Shorey, K. L. Rand, D. B. Feldman, Hope theory, measurements, and applications to school psychology., School psychology quarterly 18 (2003) 122. [4] A. Hande, R. Priyadharshini, A. Sampath, K. P. Thamburaj, P. Chandran, B. R. Chakravarthi, Hope speech detection in under-resourced kannada language, 2021. arXiv:2108.04616 . [5] D. García-Baena, M. García-Cumbreras, S. M. Zafra, J. García-Díaz, R. Valencia-García, Hope speech detection in spanish, Language Resources and Evaluation (2023) 1–28. doi:10.1007/s10579- 023- 09638- 3 . [6] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, S. Chinnaudayar Navaneethakr- ishnan, J. P. McCrae, M. A. García-Cumbreras, S. M. Jiménez-Zafra, R. Valencia-García, P. Kumar Kumaresan, R. Ponnusamy, D. García-Baena, J. A. García-Díaz, Overview of the shared task on hope speech detection for equality, diversity, and inclusion, Association for Computational Linguistics (2022) 378–388. URL: https://aclanthology.org/2022.ltedi-1.58. doi:10.18653/v1/2022.ltedi- 1.58 . [7] V. Kitzie, I pretended to be a boy on the internet: Navigating affordances and constraints of social networking sites and search engines for lgbtq+ identity work, First Monday (2018). [8] P. Burnap, G. Colombo, R. Amery, A. Hodorog, J. Scourfield, Multi-class machine classifi- cation of suicide-related communication on twitter, Online social networks and media 2 (2017) 32–44. [9] D. N. Milne, G. Pink, B. Hachey, R. A. Calvo, Clpsych 2016 shared task: Triaging content in online peer-support forums, in: Proceedings of the third workshop on computational linguistics and clinical psychology, 2016, pp. 118–127. [10] C. R. Snyder, Hope theory: Rainbows in the mind., Psychological Inquiry 13 (2002) 249–275. [11] C. R. Snyder, The psychology of hope: You can get there from here, Simon and Schuster, 1994. [12] C. R. Snyder, Hypothesis: There is hope, in: Handbook of hope, Elsevier, 2000, pp. 3–21. [13] C. R. Snyder, H. S. Shorey, J. Cheavens, K. M. Pulvers, V. H. Adams III, C. Wiklund, Hope and academic success in college., Journal of educational psychology 94 (2002) 820. [14] C. R. Snyder, B. Hoza, W. E. Pelham, M. Rapoff, L. Ware, M. Danovsky, L. Highberger, H. Ribinstein, K. J. Stahl, The development and validation of the children’s hope scale, Journal of pediatric psychology 22 (1997) 399–421. [15] E. Diener, Subjective well-being, The science of well-being (2009) 11–58. [16] A. Sundar, A. Ramakrishnan, A. Balaji, T. Durairaj, Hope speech detection for dravidian languages using cross-lingual embeddings with stacked encoder architecture, SN Computer Science 3 (2022) 1–15. [17] L. Muchnik, S. Aral, S. J. Taylor, Social influence bias: A randomized experiment, Science 341 (2013) 647–651. [18] A. D. Kramer, J. E. Guillory, J. T. Hancock, Experimental evidence of massive-scale emotional contagion through social networks, Proceedings of the National Academy of Sciences 111 (2014) 8788–8790.