<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Hope Speech Detection using Machine Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mesay Gemeda Yigezu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Girma Yohannis Bade</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Kolesnikova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Gelbukh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), NLP Lab</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Millions of individuals use social media platforms such as Facebook, Twitter, Instagram, and YouTube to share or get opinions. These platforms spread both negative and positive thoughts. Hope speech is one of the positive kinds: it can relax an environment when people get anxious. This paper presents hope speech detection for posts in English and Spanish. As part of a shared task of IberLEF 2023, train and test data sets for both English and Spanish were provided, labeled as hope speech and not hope speech. We developed a model in Python and chose a support vector machine (SVM) for the assigned task. We trained the hope speech detection model on the train-development data set and evaluated it on the test data sets. The performance of the model was measured by the average macro F1 score metric. The model achieved an average macro F1 of 0.489 for English and 0.481 for Spanish.</p>
      </abstract>
      <kwd-group>
        <kwd>Hope speech</kwd>
        <kwd>Classification algorithm</kwd>
        <kwd>Shared task</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Social media platform</kwd>
        <kwd>Support vector machine</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Technology has a big impact on every aspect of our lives. It has been changing the way we communicate, purchase, and make decisions in different application areas. Millions of individuals use social media sites such as Facebook, Twitter, Instagram, and YouTube to share material and voice their opinions. These platforms spread both negative and positive opinions. Linguistic computational tasks that aim to find such posts on online social media and stop the spread of negativity include hate speech detection, offensive language identification, and abusive language detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [2]. On the other hand, people may look for good suggestions, encouragement, gratitude, appreciation, and acknowledgment; these are positive dimensions of social media posts and can be categorized as hope speech. Hope speech is a type of speech able to relax a hostile environment when people get anxious (Palakodety et al. [3]). Classifying a given comment as hope speech or non-hope speech is known as hope speech detection. This year, we participated in the IberLEF 2023 HOPE shared task [4]. In the context of this collaborative endeavor, hope speech supports people suffering from disease, stress, loneliness, or sadness; moreover, hope messages offer advice and inspire readers to do good things. To counteract sexual or racial prejudice or to promote less combative workplaces, it can be quite effective to automatically detect hope speech so that favorable remarks can be spread more widely.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Recent research on the improvement of free speech through social media was undertaken by [5]. The researchers presented a novel custom deep network architecture that employed a concatenation of embeddings from T5-Sentence, rather than removing seemingly offensive phrases, in order to detect and encourage positivity in the comments. Several machine learning methods, such as SVM, logistic regression, K-nearest neighbors, decision tree, and a newly suggested CNN-based model, were all tested. With a macro F1-score higher than the others, the suggested model performed best in the English language.</p>
      <p>Tonja et al. [6] discussed social media mining for health, particularly the classification of tweets and Reddit posts that self-report exact age. They applied transformer-based models such as BERT and RoBERTa to this classification task. The study also presented evaluation metrics for the classification of self-reported exact ages in tweets and Reddit posts and highlighted model performance that is comparable with previous works in the field.</p>
      <p>Puranik et al. [7] used a variety of transformer-based models to categorize social media comments in English, Malayalam, and Tamil as hope speech or not hope speech. The study’s full dataset includes 59,354 YouTube comments, of which 28,451 are in English, 20,198 are in Tamil, and 10,705 are in Malayalam. The comments are categorized as hope speech, not hope speech, and other languages. A character-level BERT model performed best on the validation dataset.</p>
      <p>Balouchzahi et al. [8] participated in the “Hope Speech Detection for Equality, Diversity, and
Inclusion-EACL 2021” shared task. The team proposed three models for classifying English and
code-mixed texts in Tamil-English and Malayalam-English into three categories - "Hope speech",
"Non-hope speech", and "other languages". The three models, CoHope-ML, CoHope-NN, and
CoHope-TL, are based on the Ensemble of classifiers, Keras Neural Network, and BiLSTM with
Conv1d model, respectively. The CoHope-ML model obtained the best results among the three
models, achieving the 1st, 2nd, and 3rd ranks with weighted F1-scores of 0.85, 0.92, and 0.59 for
Malayalam-English, English, and Tamil-English texts, respectively.</p>
      <p>
        Tonja et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] focused on violent and related problematic behaviors in social media, detecting and classifying aggressive and violent incidents in Spanish social media using language-specific pre-trained language models. Their model achieved an F1 score of 0.7455 for violent event identification and an F1 score of 0.4903 for violent event category recognition on the DA-VINCIS dataset.
      </p>
      <p>Mahajan et al. [9] carried out a study to forecast the presence of hope speech as well as the existence of samples from different languages in the data set. The method used RoBERTa to identify hope speech for English and XLM-RoBERTa for Tamil and Malayalam. Comments were labeled as hope speech, non-hope speech, and not-language. Their strategy achieved the highest F1 score in English. The work was also part of shared task 2 of 2022 on CodaLab.</p>
      <p>Arif et al. [10] presented the use of different algorithms for multiclass and cross-lingual fake news detection, achieving a macro F1-score of 28.60% for a monolingual task in English using a pre-trained RoBERTa model and 17.21% for a cross-lingual task in English and German using a Bi-LSTM deep learning algorithm.</p>
      <p>Balouchzahi et al. [11] provided a hope speech dataset that classifies English tweets into
two broad categories, "Hope" and "Not Hope," and then three more specific hope categories,
"Generalized Hope," "Realistic Hope," and "Unrealistic Hope." Finally, they provided a detailed
description of their annotation process and guidelines. In addition, in order to benchmark
the collected dataset, they reported several baselines that were based on various learning
approaches. These learning approaches included traditional machine learning, deep learning,
and transformers. They evaluated the baselines by using weighted-averaged and macro-averaged
F1 scores.</p>
      <p>Gupta et al. [12] looked for and promoted helpful and uplifting YouTube posts. They used a variety of machine learning models to categorize social media comments in English as hope speech or non-hope speech. The work was part of the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI-ACL 2022.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The specific requirements of the shared task and the constraints imposed by the classification task served as the basis for the development of our methodology. When selecting machine learning models, it is usual practice to base the decision on how well those models perform in binary classification tasks. The choice of a particular model can be influenced by a number of different considerations, including the size and complexity of the dataset, the desired level of accuracy, and the availability of computational resources. As Figure 1 depicts, the methodology applied to this shared task is as follows.</p>
      <p>Step-1 Data Understanding and Preparation: To begin, it is necessary to gain an understanding of the issue we are attempting to resolve and of the data. We make certain that the data is representative, well balanced, and pre-processed, using methods such as normalization and the imputation of missing values as appropriate.</p>
      <p>Step-2 Feature Selection: Choose the features that will be most helpful in building the SVM model. This step determines which aspects provide the most useful information. In accordance with the parameters of our task, we utilized term frequency-inverse document frequency (TF-IDF), which assigns greater significance to words that appear frequently in a given document but rarely across the corpus. The TF is a measurement of how often a particular term or phrase appears in a given document. It is determined by dividing the number of times a particular word appears in a document by the total number of words in that document.</p>
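As a toy illustration of the TF-IDF weighting described above, here is a sketch using scikit-learn's TfidfVectorizer on a hypothetical three-post corpus (not the shared-task data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy posts standing in for the shared-task comments.
posts = [
    "you can do it stay strong",
    "everything will be fine",
    "nothing will ever change",
]

# TF-IDF gives high weight to terms frequent in one document but rare
# across the corpus; "will" appears in two documents, so it is
# down-weighted relative to document-specific words.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(posts)

print(tfidf.shape)  # (number of documents, vocabulary size)
```

Each row of the resulting sparse matrix is the TF-IDF feature vector for one post, which is what the classifier consumes in the next step.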
      <p>Step-3 Model Selection: We decided that the support vector machine model is the best fit for the issue at hand. We chose SVM because it works well in high-dimensional spaces, which makes it well suited for addressing difficult classification problems in which there are many features.</p>
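A minimal sketch of this setup, assuming scikit-learn and a hypothetical handful of labeled posts (a linear SVM over TF-IDF features, in the spirit of the pipeline described here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled posts: 1 = hope speech, 0 = not hope speech.
texts = [
    "stay strong you can do it",
    "better days are coming",
    "everything is ruined",
    "there is no point anymore",
]
labels = [1, 1, 0, 0]

# A linear SVM handles high-dimensional sparse TF-IDF features well.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

prediction = model.predict(["you can get through this"])[0]
```

The pipeline object applies the same vectorizer to training and unseen text, so `predict` can be called directly on raw strings.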
      <p>Step-4 Training and Cross-Validation: These are crucial steps in the construction of machine learning models; they help ensure that the model can generalize effectively to new data and contribute to its overall accuracy.</p>
      <p>In machine learning, "training" refers to the process of teaching a model how to generate predictions using a labeled dataset. During training, the model learns to recognize patterns and correlations in the data that are important for producing correct predictions. The purpose of training is to develop a model that generalizes well to data it has not encountered before.</p>
      <p>The effectiveness of a machine learning model can be evaluated with a technique known as cross-validation. The dataset is divided into a training set and a validation set; the model is trained on the training set, and its accuracy is assessed on the validation set. This process is repeated several times, with different subsets of the data used for training and validation each time, and the performance indicators are then averaged across all iterations.</p>
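The split-train-validate-rotate-average procedure can be sketched as follows (synthetic features stand in for the real TF-IDF matrix):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold cross-validation: train on four folds, validate on the held-out
# fold, rotate through all five splits, then average the scores.
scores = cross_val_score(SVC(), X, y, cv=5, scoring="f1_macro")
print(len(scores), scores.mean())
```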
      <p>In general, training and cross-validation are two essential steps in developing a machine learning model. They help guarantee that the model is reliable and generalizes well to new data.</p>
      <p>Step-5 Model Evaluation: Evaluate the model’s performance on the testing set using appropriate evaluation metrics such as precision, recall, and F1 score. We evaluated our model on the test dataset and submitted the result; the final evaluation was done by the shared task organizers.</p>
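As a small worked example of the macro-averaged metrics used here (hypothetical gold labels and predictions, computed with scikit-learn):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical gold labels and predictions: 1 = hope, 0 = not hope.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# "Macro" averaging computes each metric per class and averages the two,
# weighting both classes equally regardless of how frequent they are.
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f = f1_score(y_true, y_pred, average="macro")
print(p, r, f)  # each class has P = R = 3/4 here, so all three are 0.75
```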
      <p>Step-6 Interpretation and Visualization: Create a visual representation of the findings in
order to acquire a deeper comprehension of the model’s behavior and overall performance.</p>
    </sec>
    <sec id="sec-4">
      <title>4. System Task Description</title>
      <sec id="sec-4-1">
        <title>4.1. Data sets Description</title>
        <p>Training, development, and test data sets were given to the participants for two languages, English [2] and Spanish [13]. The given data sets were annotated at the comment or post level, as shown in Table 1 and Table 2. A strength of the IberLEF 2023 shared task is that it used expanded and improved data sets compared with the previous shared tasks for both Spanish and English [14].</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Existing classification algorithms and selection</title>
        <p>Choosing the right classifier is the most crucial stage in the text classification pipeline. We cannot choose the most successful model for a text categorization application without a thorough conceptual knowledge of each approach (Lee and Shin [15]).</p>
        <p>
          In this section, we describe current text and document categorization methods. Historically, text categorization began with the Rocchio algorithm and later advanced to boosting and bagging, two popular ensemble learning approaches. Although more conventional, techniques such as logistic regression, Naive Bayes, and k-nearest neighbors [
          <xref ref-type="bibr" rid="ref2">16</xref>
          ] are still widely utilized in the scientific community. Support vector machines (SVM), particularly kernel SVM, are also widely employed as a classification method. For categorizing documents, tree-based classification algorithms such as decision trees and random forests are efficient and precise. There are also neural network-based text classification algorithms, including hierarchical attention networks (HAN), deep belief networks (DBN), CNN [
          <xref ref-type="bibr" rid="ref3">17</xref>
          ], RNN, and combination methods [
          <xref ref-type="bibr" rid="ref4">18</xref>
          ], and one could also apply transformer-based approaches [
          <xref ref-type="bibr" rid="ref5">19</xref>
          ].
        </p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Support Vector Machine (SVM) Classification Algorithm</title>
          <p>SVM was first developed for binary classification applications. However, many scholars use this prevalent strategy when working on multi-class problems.</p>
          <p>
            The study of text categorization using a string kernel is also known as kernel SVM. The fundamental concept behind the string kernel (SK) is to use a function to map strings into the feature space. Several applications, including the categorization of text, DNA, and proteins, have used such kernels as part of the SVM algorithm (Cervantes et al. [
            <xref ref-type="bibr" rid="ref6">20</xref>
            ]). SVM is most effective when there is a distinct line dividing classes and when the number of samples is less than the number of dimensions; for these advantages we chose SVM to classify whether the given social media posts are hope speech or not hope speech [
            <xref ref-type="bibr" rid="ref7">21</xref>
            ]. In addition, SVM is able to handle high-dimensional feature spaces and non-linear classification problems. SVMs are particularly useful when the data is not linearly separable and the decision boundary is complex, as they can transform the original data into a higher-dimensional space where it may become linearly separable [
            <xref ref-type="bibr" rid="ref8">22</xref>
            ].
          </p>
          <p>Furthermore, SVMs are robust to overfitting, as they use a regularization parameter that helps to prevent the model from fitting noise in the data. They can also work well with small to medium-sized datasets.</p>
        </sec>
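scikit-learn has no built-in string kernel, but a character n-gram representation fed to an SVM captures the same intuition of comparing texts by shared substrings. A hedged sketch on toy data (not the task corpus, and a practical stand-in rather than a true string kernel):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy labeled texts: 1 = hope speech, 0 = not hope speech.
texts = ["hope always wins", "keep hoping and smiling",
         "all hope is lost", "everything is pointless"]
labels = [1, 1, 0, 0]

# Character n-grams (2-4 chars, within word boundaries) approximate a
# string kernel's substring comparison; the SVC then separates the classes.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SVC(kernel="linear"),
)
model.fit(texts, labels)
prediction = model.predict(["hope remains"])[0]
```

Character n-grams also tolerate the misspellings common in social media text, since a typo still shares most substrings with the intended word.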
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Challenges of the task</title>
      <p>This task is one of the essential activities in the NLP area. However, the data poses many challenges for NLP due to its lack of context, informal language [4], and imbalanced classes.</p>
      <p>Lack of context: Twitter is a well-known social networking site that produces enormous amounts of data every day. It is difficult to infer the context of a tweet in this task because Twitter data is short, limited to 240 characters per tweet. The lack of context causes ambiguity, which makes it challenging to extract the meaning of a tweet accurately.</p>
      <p>Informal language: social media users write informally, with misspellings, acronyms, and emojis, which makes it challenging for NLP algorithms to understand the intended meaning of a post. We addressed these problems in the pre-processing stage.</p>
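A minimal pre-processing sketch of the kind described (a hypothetical `normalize` helper, not our exact pipeline; it keeps only ASCII letters, so it would need extending for Spanish accented characters):

```python
import re

def normalize(post: str) -> str:
    """Toy cleaner for informal social media text (English/ASCII only)."""
    post = post.lower()
    post = re.sub(r"https?://\S+", " ", post)  # drop URLs
    post = re.sub(r"@\w+", " ", post)          # drop @mentions
    post = re.sub(r"[^a-z\s]", " ", post)      # drop emojis, digits, punctuation
    return re.sub(r"\s+", " ", post).strip()   # collapse whitespace

print(normalize("Stay strong @user!! http://t.co/abc"))  # -> "stay strong"
```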
      <p>Imbalanced datasets: these datasets pose a significant challenge for NLP tasks, in this task
unbalanced English datasets were given. As a result, biased models are inaccurate in predicting
minority classes. Moreover, imbalanced datasets can lead to the overfitting of models and poor
generalization to unseen data.</p>
      <p>
        There are several techniques that can be used to address the problem of imbalanced datasets in machine learning [
        <xref ref-type="bibr" rid="ref9">23</xref>
        ]. Among these techniques, we used oversampling, which involves duplicating instances from the minority class until it is balanced with the majority class.
      </p>
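The random oversampling we used can be sketched in a few lines, on a toy imbalanced set of hypothetical posts (duplicating minority items until the classes match):

```python
import random

# Hypothetical imbalanced data: 8 "not hope" (0) posts, 2 "hope" (1) posts.
data = [("post %d" % i, 0) for i in range(8)] + [("hope a", 1), ("hope b", 1)]

majority = [d for d in data if d[1] == 0]
minority = [d for d in data if d[1] == 1]

# Duplicate minority instances (sampling with replacement) until the
# minority class is as large as the majority class.
random.seed(0)
extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
balanced = majority + minority + extra

counts = {0: 0, 1: 0}
for _, label in balanced:
    counts[label] += 1
print(counts)  # {0: 8, 1: 8}
```

Note that oversampling should be applied only to the training split, never to the test data, so that evaluation reflects the true class distribution.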
      <sec id="sec-5-1">
        <title>5.1. Result and Discussion</title>
        <p>The developed model was based on an SVM string kernel classifier. We evaluated the developed model in terms of F1 scores. The model classifies social media comments/posts as hope speech or not hope speech, as required by the shared task. We tabulated the Precision (P), Recall (R), F1-score, and average macro F1-score of the model on the test data set in Table 3. The average macro F1 is 0.4894 for English and 0.4815 for Spanish. The model performed better in English than in Spanish because the data set given for Spanish is smaller than the one for English.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        Hope is a positive frame of mind that is both present- and future-focused. It is founded on
the desire for favorable results in one’s life or the world as a whole and may also be found in
motivational speeches about those who have faced and overcome hardship [
        <xref ref-type="bibr" rid="ref10">24</xref>
        ]. This study described multilingual hope speech detection using a machine learning algorithm. We applied an SVM algorithm to automatically classify whether a given text in English or Spanish is hope speech or not hope speech. Two hope speech classification models were developed, one per language, and their performance was tested using the average macro F1-score metric. The performance of the models depends highly on the size and quality of the data sets.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Future Work</title>
      <p>Since hope speech builds the soft mindsets of human beings, the task should be extended to other languages. In addition, the performance of the proposed model should be improved by increasing the dataset sizes and applying further algorithms to the languages used here.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The work was done with partial support from the Mexican Government through grant A1S-47854 of CONACYT, Mexico, and grants 20220852, 20220859, and 20221627 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America Ph.D. Award.</p>
      <p>[1] (continued) Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2022), CEUR Workshop Proceedings, CEUR-WS.org, 2022.</p>
      <p>[2] B. R. Chakravarthi, HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion, in: Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, 2020, pp. 41–53.</p>
      <p>[3] S. Palakodety, A. R. KhudaBukhsh, J. G. Carbonell, Hope speech detection: A computational analysis of the voice of peace, arXiv preprint arXiv:1909.12940 (2019).</p>
      <p>[4] S. M. Jiménez-Zafra, M. Á. García-Cumbreras, D. García-Baena, J. A. García-Díaz, B. R. Chakravarthi, R. Valencia-García, L. A. Ureña-López, Overview of HOPE at IberLEF 2023: Multilingual Hope Speech Detection, Procesamiento del Lenguaje Natural 71 (2023).</p>
      <p>[5] M. Ahmed, A. Najmul Islam, Deep learning: hope or hype, Annals of Data Science 7 (2020) 427–432.</p>
      <p>[6] A. L. Tonja, O. E. Ojo, M. A. Khan, A. G. M. Meque, O. Kolesnikova, G. Sidorov, A. Gelbukh, CIC NLP at SMM4H 2022: A BERT-based approach for classification of social media forum posts, in: Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop &amp; Shared Task, 2022, pp. 58–61.</p>
      <p>[7] K. Puranik, A. Hande, R. Priyadharshini, S. Thavareesan, B. R. Chakravarthi, IIITT@LT-EDI-EACL2021 - Hope speech detection: There is always hope in transformers, arXiv preprint arXiv:2104.09066 (2021).</p>
      <p>[8] F. Balouchzahi, B. Aparna, H. Shashirekha, MUCS@LT-EDI-EACL2021: CoHope - Hope speech detection for equality, diversity, and inclusion in code-mixed texts, in: Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, 2021, pp. 180–187.</p>
      <p>[9] K. Mahajan, E. Al-Hossami, S. Shaikh, TeamUNCC@LT-EDI-EACL2021: Hope speech detection using transfer learning with transformers, in: Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion, 2021, pp. 136–142.</p>
      <p>[10] M. Arif, A. L. Tonja, I. Ameer, O. Kolesnikova, A. Gelbukh, G. Sidorov, A. G. M. Meque, CIC at CheckThat! 2022: Multi-class and cross-lingual fake news detection, Working Notes of CLEF (2022).</p>
      <p>[11] F. Balouchzahi, G. Sidorov, A. Gelbukh, PolyHope: Two-level hope speech detection from tweets, Expert Systems with Applications 225 (2023) 120078.</p>
      <p>[12] V. Gupta, R. Kumar, R. Pamula, IIT Dhanbad@LT-EDI-ACL2022 - Hope speech detection for equality, diversity, and inclusion, in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, 2022, pp. 229–233.</p>
      <p>[13] D. García-Baena, M. Á. García-Cumbreras, S. M. Jiménez-Zafra, J. A. García-Díaz, R. Valencia-García, Hope speech detection in Spanish: The LGBT case, Language Resources and Evaluation (2023) 1–28.</p>
      <p>[14] S. M. Jiménez-Zafra, F. Rangel, M. Montes-y-Gómez, Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023), co-located with the 39th Conference of the Spanish Society for Natural Language Processing (SEPLN 2023), CEUR-WS.org, 2023.</p>
      <p>[15] I. Lee, Y. J. Shin, Machine learning for enterprises: Applications, algorithm selection, and challenges, Business Horizons 63 (2020) 157–170.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          , G. Sidorov,
          <article-title>Detection of aggressive and violent incidents from social media in spanish using pre-trained language model</article-title>
          , in:
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ahani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gemeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <article-title>Word level language identification in code-mixed kannada-english texts using traditional machine learning algorithms</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Word level language identification in code-mixed kannada-english texts using deep learning approach</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kowsari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jafari Meimandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heidarysafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mendu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <article-title>Text classification algorithms: A survey</article-title>
          ,
          <source>Information</source>
          <volume>10</volume>
          (
          <year>2019</year>
          ). URL: https://www.mdpi.com/2078-2489/10/4/150. doi:10.3390/info10040150.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lambebo Tonja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Gemeda</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Shahiki</given-names>
            <surname>Tash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Transformer-based model for word level language identification in code-mixed kannada-english texts</article-title>
          , arXiv e-prints (
          <year>2022</year>
          ) arXiv-
          <fpage>2211</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cervantes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Garcia-Lamont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodríguez-Mazahua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on support vector machine classification: Applications, challenges and trends</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>408</volume>
          (
          <year>2020</year>
          )
          <fpage>189</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Karamizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Javad Rajabi</surname>
          </string-name>
          ,
          <article-title>Advantage and drawback of support vector machine functionality</article-title>
          , in: 2014 international conference on computer,
          <source>communications, and control technology (I4CT)</source>
          , IEEE,
          <year>2014</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Madhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <article-title>A comparative study of support vector machine and artificial neural network for option price prediction</article-title>
          ,
          <source>Journal of Computer and Communications</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>78</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <article-title>Learning from imbalanced data</article-title>
          ,
          <source>IEEE Transactions on knowledge and data engineering 21</source>
          (
          <year>2009</year>
          )
          <fpage>1263</fpage>
          -
          <lpage>1284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>Multilingual hope speech detection in english and dravidian languages</article-title>
          ,
          <source>International Journal of Data Science and Analytics</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>389</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>