ABCD Team at HOPE 2024: Hope Detection with
                                BERTology Models and Data Augmentation
                                Bui Hong Son1,2,* , Le Minh Quan1,2,* and Dang Van Thin1,2
                                1
                                  University of Information Technology-VNUHCM, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City,
                                Vietnam
                                2
                                  Vietnam National University, Ho Chi Minh City, Vietnam


                                                                         Abstract
                                                                         This paper presents our participation in the HOPE tasks at IberLEF 2024[1, 2, 3, 4, 5], focusing on two of
                                                                         them: Task 1: Hope for Equality, Diversity, and Inclusion, and Task 2: Hope as Expectations. To address
                                                                         Task 1, we implemented and investigated different techniques and strategies. We first investigated the
                                                                         effectiveness of pre-processing steps for social media texts. Second, we employed two data augmentation
                                                                         strategies to tackle the class imbalance issue in the training dataset. Finally, we implemented a fine-tuning
                                                                         approach based on pre-trained language models combined with a simple ensemble technique. The private
                                                                         test results show that our best system achieved a top 5 ranking in Task 1. For Task 2, we achieved 2nd
                                                                         place in the binary classification subtask for Spanish datasets and 1st place for the same subtask on
                                                                         English datasets. Furthermore, our best results ranked 1st in the multi-classification subtask for both
                                                                         languages in the competition.

                                                                         Keywords
                                                                         Hope classification, Spanish language, English language, sentiment analysis, aspect-based sentiment
                                                                         analysis


                                1. Introduction
                                HOPE at IberLEF 2024 [1, 2, 3, 4, 5] is a competition that aims to analyze the multifaceted
                                concept of hope through Natural Language Processing (NLP). HOPE shared-task consists of two
                                different tasks for Equality, Diversity, and Inclusion. This task is to identify the messages that
                                promote hope and acceptance for marginalized groups on social media platforms. The challenge
                                is designed for competitors to develop various NLP models capable of differentiating between
                                messages that uplift and empower these communities. Success hinges on your model’s ability
                                to accurately detect hope-oriented messages within this specific social media context. Task 2 -
                                Hope as Expectations. This second task focuses on hope as it relate to future expectations and
                                desires. The challenge here is to build NLP models proficient in detecting expressions of hope
                                within social media text. These models need to not only identify hope, but also categorize its
                                nature, distinguishing between realistic and unrealistic aspirations, as well as positive hope for

                                IberLEF 2024, September 2024, Valladolid, Spain
                                *
                                 Corresponding author.
                                $ 22521246@gm.uit.edu.vn (B. H. Son); 20520709@gm.uit.edu.vn (L. M. Quan); thindv@uit.edu.vn (D. V. Thin)
                                 https://nlp.uit.edu.vn/ (D. V. Thin)
                                 0000-0001-8340-1405 (D. V. Thin)
                                                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
the future. Participating in HOPE 2024 is a unique opportunity to advance NLP significantly
and address complex problems with real-world impact, pushing the boundaries of NLP tools
and enhancing understanding of hope, social media, and human behavior.
   In the previous year, HOPE at IberLEF 2023 [6] is also organized and focusing on the task of
"Multilingual Hope Speech Detection" Various approaches were proposed and made public by
numerous author. Among these, I2C-Huelva [7] Team applied a transformer model proposed for
Spanish language, BERTuit. This team then achieved the second position and the first position
for Spanish subtask and English subtask respectively. The same main approach is used by
NLP URJC [8]. There is a little difference while this team applied BERT for English subtask
and BETO for Spanish subtasks. With their optained results, they would have ranked 8th for
the Spanish subtask and 1st for the English one. However, they missed the deadline for the
paper submission. Distinct from the two preceding teams, besides testing XLM-R with different
model setups, Zootopi Team [9] proposed two prompting scenarios for Large Language Model
(ChatGPT) for the English and Spanish subtasks respectively. In the end, they achieved the
1st position in the Spanish subtask and ranked 9th in the English subtask. As we supposed,
transformer-based models have been used in both subtasks and majority of the results are
at the top of the competion’s leaderboard. We cannot conclude that using tranformer-based
models resulted in the better result than other approaches, such as traditional machine learning
techniques like KNN (used by Zavira team [10]) or CNN (used be LIDOMA Team [11]), nor
using the ChatGPT as Zootopi Team applied.
   About the dataset for each task. In terms of Task 1, the dataset was collected between 2020
and 2023. It is an improved and extended version of the SpanishHopeEDI dataset [2]. The
version of the dataset for IberLEF 2024 consists of training and dev sets on LGTB-related tweets
and a test set on tweets related to the LGTBI collective and other EDI topics (unknown domains).
A tweet is considered as HS if the text:

    • i) explicitly supports the social integration of minorities;
    • ii) is a positive inspiration for the collective;
    • iii) explicitly encourages people who might find themselves in a situation;
    • iv) unconditionally promotes tolerance

On the contrary, a tweet is marked as NHS if the text:

    • i) expresses negative sentiment towards a community
    • ii) explicitly seeks violence
    • iii) uses gender-based insults

The dataset is composed of 2,000 tweets.
  In terms of Task 2, the data collection commenced by retrieving the most recent 50,000
tweets between January and June 2022. Following this, an additional batch of 50,000 tweets was
acquired within the same temporal scope using keywords associated with sentiments of hope.
The dataset encompassed English and Spanish tweets originating from the first half of 2022,
amounting to an aggregate of approximately 100,000 tweets per language.
                                                             Training
                               Simple preprocess    Data1

                                                                Model       BERT
                                                               Selection    Models

                   Raw Data   Enhanced preprocess   Data2                                  Model     Result
                                                                                          Analysis

                                                               Training    Validation

                               Data Augmentaion                                           Ensemble
                                                    Data3
                                                                                          Learning


                          Figure 1: Our overall pipeline for the HOPE 2024 shared task.


                    Raw text: "A veces si me gusta como salgo en las fotos              #transgirl
                    #transgender #trans #transwoman #transisbeautiful"

                    Preprocessed text: "A veces si me gusta como salgo en las fotos
                    transgirl transgender trans transwoman transisbeautiful"


                                       Figure 2: Simple pre-processing sample.


2. Methodology
To address this challenge, we employ fine-tuning with different pre-trained language models for
two tasks. We also investigate how pre-processing steps affect the models’ performance. This
is because the data originates from a social media platform, where proper pre-processing can
significantly improve overall performance. Furthermore, we utilize various data augmentation
techniques to enrich the training data. Finally, we implement a simple ensemble strategy to
enhance performance for both tasks further. Figure 1 illustrates our overall pipeline for the
HOPE 2024 shared task. Details of our main components are presented below.

2.1. Pre-processing Component
While analyzing the data, we discovered that the dataset contained noise and inconsistencies. To
address this, applying pre-processing steps helped clean and standardize the data. This allowed
the models to understand and context better, ultimately leading to more accurate results. To
demonstrate the importance of pre-processing steps, we compare two strategies, including
simple and specific strategies. We apply this method to both Task 1 and Task 2 to determine
whether these pre-processing methods improve performance.
       • Simple pre-processing steps: For this strategy, we only apply whitespace handling and
         punctuation removal. Figure 2 illustrates the steps in the simple pre-processing strategy.
       • Specific pre-processing steps: For this strategy, we leverage the tweet-processer library1
1
    https://pypi.org/project/tweet-preprocessor/
                    Raw text: "A veces si me gusta como salgo en las fotos        #transgirl
                    #transgender #trans #transwoman #transisbeautiful"


                    Preprocessed text: "A veces si me gusta como salgo en las fotos"


                    Raw text: "#USER# #USER# #USER# Ps, there are Anons who are working on
                    military airports and installations right now. The work takes time?   And
                    even if ruskies expect them, there?s nothing they can do to stop them   "
                    Preprocessed text: "Ps there are Anons who are working on military airports
                    and installations right now The work takes time And even if ruskies expect them
                    theres nothing they can do to stop them"


                                     Figure 3: Specific pre-processing steps.


          because this library offers pre-processing functionalities that include: Emoji Removal,
          Username Removal, Specific Substring Removal, Hyperlink Removal, Text Normalization.
          Figure 3 shows an example of the specific pre-processing strategy.


2.2. Data Augmentation
We observed an imbalance issue between classes in Task 2. To improve overall performance, we
aimed to expand the training data. To achieve this, we applied two data augmentation strategies
to create new samples that can help the model learn more robust features. We briefly introduce
two strategies which only applied for Task 2 as below:

       • Data Combination: In this method, we combine the training datasets for English and
         Spanish into a single final dataset. We employ this strategy because we are utilizing
         multilingual models as our primary classifiers. Combining the datasets increases the
         number of data samples and leverages the strengths of multilingual language models.
       • Data Augmentation through Large Language Model: Our main idea for this approach
         is to utilize the power of a pre-trained large language model to diversify the samples for
         imbalance classes. This work uses the Gemini models to create new samples through
         the prompt engineering with API function 2 . We send a request via API to run iterates
         through each text sample of the train set. With each sample, we order the Gemini to
         generate a distinct text with the same language and structure while still maintaining the
         expressiveness of the text.


2
    https://ai.google.dev/gemini-api/docs/api-overview?hl=vi
2.3. Classification Model
The Hope shared task3 consists of two sub-tasks: Task 1: Hope for Equality, Diversity and
Inclusion, and Task 2: Hope as Expectations. These sub-tasks involve binary classification and
multi-class classification problems, respectively. To address these different tasks, we employ fine-
tuning based on the pre-trained BERTology language models. Since several pre-trained language
models support both English and Spanish, we implemented various models to investigate their
performance on this shared task. A brief description of the models is presented below.

        • XLM-R (Conneau et al. [12]): This powerful language model tackles tasks across 100
          languages. It leverages a technique called self-supervised learning, where it analyzes a
          massive dataset (2.5TB of filtered CommonCrawl data) without any human intervention.
          This allows XLM-RoBERTa to learn from vast amounts of publicly available text, using an
          automated process to create both training examples and labels from the raw data itself.
          In this competition, we used both XLM-R-base and XLM-R-large.
        • DeBERTa (He et al. [13]): We applied DeBERTa-v3-base, an improved version of DeBERTa
          in order to verify whether we get a superior result while using DeBERTa, a transformer-
          based neural language model designed to improve the BERT and RoBERTa models with
          two techniques: a disentangled attention mechanism and an enhanced mask decoder.
        • mDeBERTa-v3 (He et al. [14]): Building upon the success of DeBERTa, mDeBERTa
          V3 extends its capabilities to handle multiple languages. It retains the core structure of
          DeBERTa but leverages a massive dataset known as CC100, containing 2.5 trillion words
          across 100 languages. This base version boasts 12 processing layers and a hidden size of
          768, allowing it to capture complex relationships within text. While the model itself has
          86 million parameters, the vocabulary (the set of words it understands) adds another 190
          million. This extensive vocabulary ensures that mDeBERTa V3 can effectively handle a
          vast range of languages.
        • RoBERTuito (Pérez et al. [15]): A pre-trained model used for Sentiment Analysis in
          Spanish, used 500 milion tweets while training with the RoBERTa guidelines. RoBERTuito
          comes in 3 flavors: cased, uncased, and uncased+deaccented. In our experiments, we use
          base model.
        • Twitter-roBERTa (Barbieri et al. [16]): This RoBERTa-base model specializes in under-
          standing the sentiment of English tweets. Trained on a massive dataset of 58 million
          tweets, it can effectively analyze the emotions conveyed in social media messages. (Tweet-
          Eval benchmark used).
        • Twitter-XLM-roBERTa (Barbieri et al. [17]): This XLM-RoBERTa model goes beyond
          just English. Trained on nearly 200 million tweets in eight languages (Arabic, English,
          French, German, Hindi, Italian, Spanish, and Portuguese), it can identify positive, nega-
          tive, or neutral sentiment in social media posts. While it’s pre-trained in these specific
          languages, it may even understand the sentiment in others. We decided to use this model
          to check whether it is effective while classifying different labels of social media texts.


3
    https://codalab.lisn.upsaclay.fr/competitions/17714
Table 1
The distribution of experimental datasets.
                     Labels                  Training set   Validation set   Test set
                     Hope Speech (hs)             700            100            -
                     Not Hope Speech (nhs)        700            100            -
                     Total                       1400            200           400


    • Bertin-RoBERTa ([18]): A series of BERT-base models for Spanish text. We applied this
      model in order to observe if this model is better than traditional BERT on specific Spanish
      subtasks.

2.4. Ensemble Learning approach
To improve the overall performance of our models for the HOPE at IberLEF 2024 shared task,
we leverage a max voting ensemble method. This technique is commonly used for classification
tasks, which aligns well with the binary and multi-class classification problems in Hope’s
subtasks. In max voting, multiple models make predictions for each data point in the test set.
Each model’s prediction is considered a "vote," and the final prediction is the class label that
receives the most votes from the ensemble.


3. Experimental Setup
3.1. Datasets and Evaluation Metrics
3.1.1. Task 1: Hope for Equality, Diversity and Inclusion
For Task 1, we used the official datasets provided by the organizers to train our models. To
facilitate a comprehensive understanding of the data, we present both a table outlining the data
distribution and a diagram illustrating the sequence lengths.
   Table 1 presents the data distribution for the datasets used in Task 1. Divided into a training
set (1400 samples), a validation set (200 samples) and a test set (400 samples). The data concerns
classifying "Hope Speech" (hs) and "Not Hope Speech" (nhs). A balanced distribution is evident
in the training set (700 samples each for hs and nhs), the same as the validation set (100 samples
for each category).The data in the table indicates that all hope classes have a comparable number
of participants (balanced). However, distribution across the three groups is uneven (different
distribution variations). These balances play a crucial role in training and fine-tuning our
models while also facilitating the resolution of data-related issues.
   Besides, Figure 2 depicts the distribution of sequence length, that is, the number of words
within a sequence, for two distinct categories in the datasets. There appear to be two distinct
clusters of data points, suggesting a possible separation between the sequence lengths of "Hope
Speech" and "Not Hope Speech" samples. Overall, the sequence length distributions for both
categories exhibit a remarkable degree of similarity. However, the "hs" category appears to
have some samples which have shorter sequences. The other category, "nhs" exhibits a broader
distribution, encompassing a wider range of sequence lengths, including a small amount of
longer samples.
     Figure 4: The distribution of sample length for each class in the training and validation sets.


3.1.2. Task 2: Hope as Expectations
In Task 2, we also use the original datasets provided by the organizers to train our models. Table
2 describes the statistics of datasets for Task 2.

Table 2
Statistics of official datasets for Task 2.
        Type of labels                          Spanish                                 English
 Binary     multi-class         Train set     Validation set   Test set   Train set   Validation set   Test set
 Not Hope Not Hope                4701              799           -         3088            502           -
            Generalized Hope      1151              186           -         1726            300           -
 Hope       Unrealistic Hope       546              91            -          648            102           -
            Realistic Hope         505              74            -          730            128           -
            Total                 6903             1150         1152        6192           1032         1032


  As shown in Table 2, it can be seen that the distribution of data cross training, validation
and test sets for binary and multi-class classification subtask in this Task. The hope can
be categorized as either Binary (Hope or Not Hope) or multi-class (Not Hope, Generalized
Hope, Unrealistic Hope, or Realistic Hope). The table separates the data into three sets: Train,
Validation, and Test, showcasing how many instances of each sentiment label are included in
each set.
  In terms of the Spanish corpus, the data is imbalanced across the categories. For both binary
and multi-class classifications, there are significantly more instances of Not Hope compared to
the positive sentiment labels ("Hope" in Binary and "Generalized Hope", "Unrealistic Hope", and
"Realistic Hope" in multi-class). The imbalanced nature of the data can make it difficult for our
model to learn the positive sentiment label accurately. The model might become biased towards
the majority class ("Not Hope") and misclassify positive sentiment instances.

3.2. System Settings
We deployed our main framework with the support of the Hugging Face Transformer library. All
models was set up to train with 10 epochs and the learning rate was set to 2e-4 for base models
and 5e-5 for large models. Considering the size of the pre-trained language models, we chose a
batch size of either 32 or 16. The hyperparameters of models are tuned based on the validation
set. The majority of our training are trained on Kaggle, and the P100 accelerator was selected to
accelerate our training. In terms of the tokenizer, in both tasks, we used the AutoTokenizer from
the pre-trained model we imported from HuggingFace. The maximum length for the sequence
that the Tokenizer will generate is 512. For all our experiments, we set a fixed random seed of
42 to train the models in both Task 1 and Task 2 (English datasets and Spanish datasets). The
datasets used in Task 2 have two different languages, Spanish and English. However, we decided
to apply the same pre-processing methods to all datasets. However, the pre-processing process
included one of our main approaches in the experiments which is discussed it more later.


4. Experiment Results and Discussion
4.1. Task 1: Hope for Equality, Diversity and Inclusion.
In Task 1, we will observe and evaluate whether diversifying the provided datasets improves
the final results. Also, we investigate whether Ensemble Learning results in different or im-
proved results compared to the base model. Table 3 depicts the performance of four machine
learning models (XLM-R-base, RoBERTuito, DeBERTa-v3-base, mDeBERTa-v3-base) on sim-
ple preprocessed-datasets and repeat 2 models (XLM-R-base, mDeBERTa-v3-base) on specific
preprocessed-datasets.

Table 3
Experimental result Task 1: Hope for Equality, Diversity and Inclusion.
                                                                         hs(a)                           nhs(b)
 Datasets              Model                Avg. Macro F1
                                                            Precision   Recall   Macro-F1   Precision   Recall    Macro-F1
                         XLM-R-base            58.79%        80.95%     80.95%    62.58%     73.33%     73.33%     55.00%
                         RoBERTuito            54.81%        73.68%     73.68%    63.64%     49.43%     49.43%     45.99%
 Simple pre-processing
                         DeBERTa-v3-base       56.06%        78.46%     78.46%    61.82%     62.69%     62.69%     50.30%
                         mDeBERTa-v3-base      59.30%        85.00%     85.00%    63.57%     64.00%     64.00%     54.86%
                         XLM-R-base            60.54%        74.36%     74.36%    65.17%     60.47%     60.47%     55.91%
 Specific pre-processing
                         mDeBERTa-v3-base      60.26%        75.31%     75.31%    67.40%     61.04%     61.04%     53.11%
     Ensemble Learning - Max Voting            61.11%        82.35%     82.35%    66.67%     62.50%     62.50%     55.56%


   When trained on data with a simple pre-processing function, a metric used to evaluate models,
at 56.06%. Other models performed with scores ranging from 48.79% to 54.81%. Remarkably, both
models, XLM-R-base and mDeBERTa-v3-base, exhibited a significant improvement when trained
on the dataset with specific pre-processing. The mDeBERTa-v3-base model archived a massive
Macro F1-score in this scenario, reaching 60.54% in terms of F1-score. The remaining models
Table 4
Experimental result of Subtask 2.a: Binary Hope Speech Detection from Spanish datasets.
                                                                                   Spanish
           Datasets                     Model
                                                        M_Pr     M_Re     M_F1     W_Pr      W_Re     W_F1     Acc
                                  XLM-R-base            82.66%   84.24%   83.32%   85.47%    84.90%   85.07%   84.90%
                                  RoBERTuito            83.03%   83.83%   83.40%   85.38%    85.16%   85.25%   85.16%
 Simple preprocess                Bertin-RoBERTa        82.91%   79.50%   80.79%   83.62%    83.85%   83.41%   83.85%
                                  Twitter-XLM-roBERTa   63.06%   84.03%   83.61%   85.63%    85.24%   85.38%   85.24%
                                  mDeBERTa              82.96%   84.30%   83.54%   85.60%    85.16%   85.30%   85.16%
 Specific preprocess              RoBERTuito            82.88%   84.03%   83.39%   85.43%    85.07%   85.20%   85.07%
 Combine 2 languages of datasets RoBERTuito             81.16%   82.80%   81.83%   84.18%    83.51%   83.72%   83.51%
 Generate data using AI           RoBERTuito            81.54%   83.06%   82.17%   84.44%    83.85%   84.04%   83.85%
             Ensemble Learning - Max Voting             83.69%   84.55%   84.09%   86.00%    85.76%   85.85%   85.76%


also witnessed improvements, with scores ranging from 59.30% to 60.26%. Also, Ensemble
learning significantly improves the overall accuracy of the classification, achieving an average
Macro F1 score of 61.11%, the highest among all evaluated models.
   These findings suggest that applying a wider range of pre-processing techniques can signifi-
cantly enhance the performance of sentiment analysis models on social media data. While the
DeBERTa-v3-base model achieved the highest with simple pre-processing, All models exhibited
performance gains thanks to the enhanced dataset with additional processing steps.
   Besides, we explore the application of Ensemble learning, especially Max Voting, to enhance
the performance of sentiment analysis models for social media data. Our findings demonstrate
that while the individual metrics for some models remain suboptimal, they still exhibit improve-
ment compared to several single models. These results underscore the effectiveness of ensemble
learning in boosting sentiment analysis performance and highlight the potential for further
optimization through more sophisticated techniques.

4.2. Task 2: Hope as expectations
To inform the experimental result for Task 2, we have 4 tables. Table 4 and Table 5 represent the
experimental results of binary classification task on both Spanish and English datasets, while
Table 6 and Table 7 describe the result on English datasets.

4.2.1. Subtask 2.a: Binary Hope Speech Detection
Table 4 presents the experimental findings for Task 2 Binary on Spanish datasets. Among the
models trained on the simply preprocessed dataset, Twitter-XLM-roBERTa achieved the best
performance with an M_F1 score of 83.61%. However, we decided to utilize the RoBERTuito
model for further experiments in this task as it is specifically trained for Spanish social media
data. However, despite employing more approaches, the subsequent methods failed to result in
any improvements. Finally, only by implementing Ensemble Learning based on the previously
obtained results did we observe a significant improvement and achieve the highest M_F1 score
of 84.09%.
   Table 5 depicts the influence of various techniques on the performance of our BERT models.
Among the evaluated models, the XLM-R-base exhibited the most promising performance on
the basic dataset with simple pre-processing, achieving the highest F1-score of 86.63%. The
Table 5
Experimental result of Subtask 2.a: Binary Hope Speech Detection from English datasets.
                                                                                   English
 Datasets                        Model
                                                         M_Pr     M_Re    M_F1     W_Pr      W_Re     W_F1     Acc
                                  XLM-R-base            86.65%   86.53%   86.58%   86.64%    86.63%   86.62%   86.63%
                                  DeBERTa-v3-base       85.62%   85.15%   85.26%   85.52%    85.37%   85.32%   85.37%
 Simple preprocess
                                  Twitter-roBERTa       84.65%   84.31%   84.40%   84.58%    84.50%   84.46%   84.50%
                                  Twitter-XLM-roBERTa   85.34%   84.59%   84.73%   85.20%    84.88%   84.80%   84.88%
 Specific preprocess              XLM-R-base            86.08%   85.94%   85.99%   86.06%    86.05%   86.03%   86.05%
 Combine 2 languages of datasets XLM-R-base             85.76%   85.36%   85.46%   85.68%    85.56%   85.52%   85.56%
 Generate data using AI           XLM-R-base            83.25%   83.32%   83.23%   83.37%    83.24%   83.25%   83.25%
              Ensemble Learning: Max Voting             86.46%   86.18%   86.26%   86.40%    86.34%   86.31%   86.34%


Table 6
Experimental result of Subtask 2.b: multi-class Hope Speech Detection from Spanish datasets.
                                                                                   Spanish
 Datasets                        Model
                                                        M_Pr     M_Re     M_F1     W_Pr      W_Re     W_F1     Acc
                                  XLM-R-base            64.42%   66.55%   65.29%   81.57%    80.64%   81.03%   80.64%
                                  RoBERTuito            62.41%   58.95%   60.49%   77.58%    78.65%   78.02%   78.65%
 Simple preprocess                Bertin-RoBERTa        65.64%   62.83%   64.09%   79.92%    80.56%   80.16%   80.56%
                                  Twitter-XLM-roBERTa   61.69%   65.43%   63.21%   80.85%    78.47%   79.34%   78.47%
                                  mDeBERTa              62.63%   66.86%   64.41%   80.79%    78.91%   79.70%   78.91%
 Specific preprocess              Bertin-RoBERTa        61.59%   60.38%   60.94%   78.56%    78.82%   78.65%   78.82%
 Combine 2 languages of datasets Bertin-RoBERTa         59.98%   59.33%   59.38%   77.69%    77.17%   77.26%   77.17%
 Generate data using AI           Bertin-RoBERTa        64.93%   59.76%   61.80%   79.12%    80.21%   79.42%   80.21%
             Ensemble Learning - Max Voting             68.31%   65.43%   66.68%   81.80%    82.03%   81.78%   82.03%


remaining models trained on the same datasets resulted in M_F1 scores ranging from 84.88% to
85.37%. Remarkably, applying additional pre-processing or data augmentation techniques did
not resulted in any significant improvements for these models. In some cases, it even caused
performance decreases compared to simple pre-processing scenarios. Besides, while Ensemble
Learning did not achieve the absolute best results, it demonstrated a notable improvement
compared to individual models’ results.

4.2.2. Subtask 2.b: Multi-class Hope Speech Detection
As described in Table 6, the models performed well on the dataset subjected to basic pre-
processing. Among these, XLM-R-base and Bertin-RoBERTa models achieved the highest and
second-highest M_F1 scores of 65.29% and 64.09%, respectively. However, we decided to employ
additional approaches on Bertin-RoBERTa to obtain more objective results using a model fine-
tuned specifically for the Spanish texts. Consequently, methods such as Specific pre-processing,
training the model using a combined train dataset, or generating more data did not cause any
remarkable results, while applying the Max Voting ensemble technique resulted in the best
performance, with an M_F1 score of 66.68%.
   Table 7 presents the experimental results of Task 2 multi-class Classification on the English
Dataset. Overall, DeBERTa-v3 resulted in a remarkable performance on the simple processed
dataset with an M_F1 score of 69.92% compared to Twitter-XLM-RoBERTa with an M_F1 score
of 69.00%. Nonetheless, we decided to choose Twitter-XLM-RoBERTa for further investigations
because it is a pretrainned model for sentiment analysis with social media text. Upon com-
Table 7
Experimental result of Subtask 2.b: multi-class Hope Speech Detection from English datasets.
                                                                                              English
 Datasets                             Model
                                                              M_Pr      M_Re     M_F1         W_Pr       W_Re     W_F1     Acc
                                   DeBERTa-v3-base            69.19%    70.80%   69.92%       76.33%     75.58%   75.89%   75.58%
 Simple preprocess
                                   Twitter-XLM-roBERTa        68.27%    69.85%   69.00%       75.63%     75.10%   75.32%   75.10%
 Specific preprocess               Twitter-XLM-roBERTa        68.60%    70.21%   69.34%       76.03%     75.39%   75.65%   75.39%
                                   Twitter-XLM-roBERTa        70.10%    72.42%   71.09%       77.52%     76.55%   76.92%   76.55%
 Combine 2 languages of datasets
                                   XLM-R-large                71.38%    72.82%   72.00%       78.05%     77.42%   77.67%   77.42%
 Generate data using ChatBot       Twitter-XLM-roBERTa        66.31%    66.11%   66.16%       72.14%     72.38%   72.21%   72.38%
              Ensemble Learning - Max Voting                  71.03%    72.21%   71.57%       77.66%     77.13%   77.35%   77.13%


Table 8
Ranking of our systems on two sub-tasks.
                                                                                     Score
         Task 2            Datasets    Model/Method                                                                        Ranking
                                                          M_Pr     M_Re     M_F1     W_Pr       W_Re     W_F1     Acc
  Subtask 2.a: Binary      Spanish    Ensemble Learning   83.69%   84.55%   84.09%   86.00%     85.76%   85.85%   85.76%      2
     Classification        English    XLM-R-base          86.65%   86.53%   86.58%   86.64%     86.63%   86.62%   86.63%      1
 Subtask 2.b: Multiclass   Spanish    Ensemble Learning   68.31%   65.43%   66.68%   81.80%     82.03%   81.78%   82.03%      1
     Classification        English    XLM-R-large         71.38%   72.82%   72.00%   78.05%     77.42%   77.67%   77.42%      1


bining two datasets of different languages, Twitter-XLM-RoBERTa exhibited improvement in
performance, achieving an M_F1 score of 71.09%, an increase of 2.09% compared to the original
one. The technique of generating additional data did not result in any noticeable enhancement,
whereas Ensemble Learning caused significant improvement in the overall result, with an M_F1
score of 71.57%. Finally, with the assistance of ample resources, we were able to employ the
XLM-RoBERTa-large model for experiments in this task. Naturally, utilizing the large model
resulted in the highest performance, with an M_F1 score of 72.00%.


5. System Ranking
Concerning Task 1, among the employed models, Ensemble Learning ultimately resulted in the
best Average Macro F1 score is 61.11%. However, the XLM-R-base model caused the highest
Precision score, so we submitted its prediction achieved fifth place in the overall ranking with
Average Macro F1 score is 58.79%.
   In terms of Task 2, the official ranking results are presented in Table 8. For Task 2 - Subtask
2.a: Binary Hope Speech Detection from Spanish datasets, Ensemble Learning emerged as the
most efficacious method, achieving an M_F1 score of 84.09%, which serves as the benchmark
metric for ranking. Our system in this task attained the second position. For Task 2 - Subtask
2.a: Binary Hope Speech Detection from English datasets, we attained the first position with
an M_F1 score of 86.58%, demonstrating a superior outcome compared to the preceding two
tasks, leveraging the XLM-R model. Transitioning to Task 2 - Subtask 2.b: multi-class Hope
Speech Detection from Spanish datasets, our methodology reached an M_F1 score of 66.68%
and secured the best rank utilizing the Ensemble Learning technique. Finally, in the final Task -
Subtask 2.b: multi-class Hope Speech Detection from English datasets, with a M_F1 score of
72.00%, we attained the topmost position employing the XLM-R model.
6. Conclusion
This work presented our system architecture, experimental procedures, and final ranking in the
HOPE 2024 competition. We implemented various techniques to investigate the performance of
this shared task. This included the simple and specific pre-processing steps, dataset combination
across languages, and data augmentation with large language models. We rigorously evaluated
these methodologies using pre-trained models for the sub-tasks. Finally, our approach achieved
the top scores in various sub-tasks. Specifically, our best system ranked in the Top 5 for Task 1,
Top 2 and Top 1 for Task 2 - PolyHope Binary (Spanish and English). For Task 2 - PolyHope
multi-class, we reach the Top 1 for English and Spanish language.


Acknowledgements
This research was supported by The VNUHCM-University of Information Technology’s Scientific
Research Support Fund.


References
 [1] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language
     Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the
     Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference
     of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org,
     2024.
 [2] D. García-Baena, F. Balouchzahi, S. Butt, M. Á. García-Cumbreras, A. Lambebo Tonja,
     J. A. García-Díaz, S. Bozkurt, B. R. Chakravarthi, H. G. Ceballos, V.-G. Rafael, G. Sidorov,
     L. A. Ureña-López, A. Gelbukh, S. M. Jiménez-Zafra, Overview of HOPE at IberLEF 2024:
     Approaching Hope Speech Detection in Social Media from Two Perspectives, for Equality,
     Diversity and Inclusion and as Expectations, Procesamiento del Lenguaje Natural 73
     (2024).
 [3] D. García-Baena, M. Á. García-Cumbreras, S. M. Jiménez-Zafra, J. A. García-Díaz,
     R. Valencia-García, Hope speech detection in Spanish: The LGBT case, Language Resources
     and Evaluation (2023) 1–28.
 [4] F. Balouchzahi, G. Sidorov, A. Gelbukh, PolyHope: Two-level hope speech detection from
     tweets, Expert Systems with Applications 225 (2023) 120078. doi:10.1016/j.eswa.2023.
     120078.
 [5] G. Sidorov, F. Balouchzahi, S. Butt, A. Gelbukh, Regret and hope on transformers: An
     analysis of transformers on regret and hope speech detection datasets, Applied Sciences
     13 (2023) 3983.
 [6] S. M. Jiménez-Zafra, F. Rangel, M. M.-y. Gómez, Overview of iberlef 2023: Natural language
     processing challenges for spanish and other iberian languages, in: Proceedings of the
     Iberian Languages Evaluation Forum (IberLEF 2023), co-located with the 39th Conference
     of the Spanish Society for Natural Language Processing (SEPLN 2023), CEURWS. org, 2023.
 [7] J. L. D. Olmedo, J. M. Vázquez, V. P. Álvarez, I2c-huelva at hope2023@ iberlef: Simple use
     of transformers for automatic hope speech detection (2023).
 [8] M. Á. Rodríguez-García, A. Riaño-Martínez, S. M. Herranz, Urjc-team at hope2023@
     iberlef: Multilingual hope speech detection using transformers architecture (2023).
 [9] A. Ngo, H. T. H. Tran, Zootopi at hope2023iberlef: Is zero-shot chat-gpt the future of hope
     speech detection, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF
     2023), co-located with the 39th Conference of the Spanish Society for Natural Language
     Processing (SEPLN 2023), CEURWS. org, 2023.
[10] Z. Ahani, G. Sidorov, O. Kolesnikova, A. Gelbukh, Zavira at hope2023@ iberlef: Hope
     speech detection from text using tf-idf features and machine learning algorithms (2023).
[11] M. S. Tash, J. Armenta-Segura, O. Kolesnikova, G. Sidorov, A. F. Gelbukh, Lidoma at
     hope2023@iberlef: Hope speech detection using lexical features and convolutional neural
     networks, in: IberLEF@SEPLN, 2023. URL: https://api.semanticscholar.org/CorpusID:
     265309454.
[12] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
     M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
     scale, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual
     Meeting of the Association for Computational Linguistics, Association for Computational
     Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747.
     doi:10.18653/v1/2020.acl-main.747.
[13] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled
     attention, in: International Conference on Learning Representations, 2020.
[14] P. He, J. Gao, W. Chen, Debertav3: Improving deberta using electra-style pre-training with
     gradient-disentangled embedding sharing, 2021. arXiv:2111.09543.
[15] J. M. Pérez, D. A. Furman, L. Alonso Alemany, F. M. Luque, Robertuito: a pre-trained
     language model for social media text in spanish, in: Proceedings of the Language Resources
     and Evaluation Conference, European Language Resources Association, Marseille, France,
     2022, pp. 7235–7243. URL: https://aclanthology.org/2022.lrec-1.785.
[16] F. Barbieri, J. Camacho-Collados, L. Espinosa Anke, L. Neves, TweetEval: Unified bench-
     mark and comparative evaluation for tweet classification, in: Findings of the Association
     for Computational Linguistics: EMNLP 2020, Association for Computational Linguis-
     tics, Online, 2020, pp. 1644–1650. URL: https://aclanthology.org/2020.findings-emnlp.148.
     doi:10.18653/v1/2020.findings-emnlp.148.
[17] F. Barbieri, L. Espinosa Anke, J. Camacho-Collados, XLM-T: Multilingual language models
     in Twitter for sentiment analysis and beyond, in: Proceedings of the Thirteenth Lan-
     guage Resources and Evaluation Conference, European Language Resources Association,
     Marseille, France, 2022, pp. 258–266. URL: https://aclanthology.org/2022.lrec-1.27.
[18] J. D. la Rosa y Eduardo G. Ponferrada y Manu Romero y Paulo Villegas y Pablo González
     de Prado Salas y María Grandury, Bertin: Efficient pre-training of a spanish language
     model using perplexity sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23.
     URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403.