<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Journal of Data Science and Analytics 14 (2022) 389-406. doi:10.1007/s41060</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2023.acl-long.212</article-id>
      <title-group>
        <article-title>Classification of Hope in Textual Data using Transformer-Based Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chukwuebuka Fortunate Ijezue</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fredrick Eneye Tania-Amanda Nkoyo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maaz Amjad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Texas Tech University</institution>
          ,
          <addr-line>Lubbock, Texas</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>389</fpage>
      <lpage>406</lpage>
      <abstract>
<p>This paper presents a transformer-based approach for classifying hope expressions in text. We developed and compared three architectures (BERT, GPT-2, and DeBERTa) for both binary classification (“Hope” vs. “Not Hope”) and multiclass categorization (five hope-related categories). Our initial BERT implementation achieved 83.65% binary and 74.87% multiclass accuracy. In the extended comparison, BERT demonstrated superior performance (84.49% binary, 72.03% multiclass accuracy) while requiring significantly fewer computational resources (443s vs. 704s training time) than newer architectures. GPT-2 showed the lowest overall accuracy (79.34% binary, 71.29% multiclass), while DeBERTa achieved moderate results (80.70% binary, 71.56% multiclass) but at substantially higher computational cost (947s for multiclass training). Error analysis revealed architecture-specific strengths in detecting nuanced hope expressions, with GPT-2 excelling at sarcasm detection (92.46% recall). This study provides a framework for computational analysis of hope, with applications in mental health and social media analysis, while demonstrating that architectural suitability may outweigh model size for specialized emotion detection tasks.</p>
      </abstract>
      <kwd-group>
<kwd>Hope Classification</kwd>
        <kwd>NLP</kwd>
        <kwd>BERT</kwd>
        <kwd>GPT-2</kwd>
        <kwd>DeBERTa</kwd>
        <kwd>Comparative Analysis</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Emotion Detection</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
In natural language processing (NLP), the computational analysis of the emotional and affective
content of text is becoming a prominent area of study. Sentiment analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
], emotion detection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and
toxicity classification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] have all seen substantial research, but the particular field of hope detection
and classification remains relatively unexplored. As a complex psychological concept, hope is essential
to social discourse [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], mental health [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and human communication [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Automatically identifying
and classifying hopeful textual statements has potential applications in crisis response [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], social media
analysis, political discourse analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and mental health monitoring [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>This study presents a comprehensive approach to hope classification using transformer-based deep
learning models. We developed a two-tiered classification system: (1) a binary classifier that distinguishes
between hopeful expressions and those that are not, and (2) a multiclass classifier that categorizes text
into five distinct hope-related categories: Not Hope, Generalized Hope, Realistic Hope, Unrealistic Hope,
and Sarcasm. By differentiating between various forms of hopeful expressions, this granular approach
enables a more thorough understanding of how hope manifests in text.</p>
      <p>
        Our research begins with implementing BERT (Bidirectional Encoder Representations from
Transformers) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for hope classification, leveraging its contextual understanding capabilities that have shown
state-of-the-art performance on various NLP tasks. We then expand our investigation to compare BERT
with more advanced transformer architectures: GPT-2, which employs unidirectional attention and
benefits from a larger pretraining corpus, and DeBERTa, which utilizes a disentangled attention mechanism
designed to better capture semantic nuances.
      </p>
<p>This comprehensive comparison addresses a critical question in affective computing: do newer, more
complex language models provide meaningful performance improvements for specialized emotion
detection tasks like hope classification? Our experimental results reveal interesting patterns, with
BERT achieving the highest accuracy in both binary (84.49%) and multiclass (72.03%) tasks, despite
being architecturally simpler than alternatives. While DeBERTa (80.70% binary, 71.56% multiclass) and
GPT-2 (79.34% binary, 71.29% multiclass) showed competitive performance, they required substantially
higher computational resources, with DeBERTa taking more than twice the training time of BERT
for multiclass classification. Notably, GPT-2 demonstrated particular strength in detecting sarcastic
expressions of hope.</p>
      <p>These findings suggest that model complexity does not necessarily correlate with performance
improvement for hope classification tasks, highlighting the importance of architecture-task alignment
in emotion detection systems. By evaluating model accuracy, training efficiency, and error patterns
across these architectures, we identify the optimal approach for hope detection in practical applications,
balancing performance against computational requirements. This research contributes to the growing
field of affective computing by providing empirical evidence on the relative efficacy of different
transformer architectures for the specialized task of hope classification, while establishing a framework for
computational analysis of hope with applications in mental health and social media analysis.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <sec id="sec-2-1">
        <title>2.1. Hope in Computational Linguistics</title>
        <p>
          Hope speech detection is an emerging research area within Natural Language Processing (NLP) that
seeks to identify encouraging, supportive, and positive content, in contrast to traditional hate
or offensive speech detection. While sentiment analysis has been extensively studied [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], nuanced
emotions like hope remain underexplored. [
          <xref ref-type="bibr" rid="ref10">10</xref>
] defines hope speech as messages that “offer support,
reassurance, suggestions, inspiration and insight” to inspire optimism. Early work by Snyder [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
established psychological frameworks for hope that inform computational approaches. Chakravarthi
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] released the HopeEDI dataset, the first large-scale multilingual corpus comprising 28,451 English,
20,198 Tamil, and 10,705 Malayalam YouTube comments labeled for hope speech. This dataset and its
HopeEDI benchmarks were vital in laying the foundation for subsequent work. The First Workshop
on Language Technology for Equality, Diversity and Inclusion (LT-EDI) in 2021 hosted a shared task
using these HopeEDI comments in English, Tamil, and Malayalam. Participants Saumya and Mishra
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] applied classical machine learning and neural models to demonstrate early baseline results in hope
classification.
        </p>
        <p>
          Subsequent studies and datasets have expanded the scope of computational hope classification. Malik
et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] compiled English and Russian hope-speech data and explored cross-lingual training. Their
RoBERTa-based framework showed that translating English content to Russian achieved
94% accuracy and an F1 score of 80.2% on binary hope classification, demonstrating that transfer
learning and translation techniques can improve performance in lower-resource languages.
        </p>
        <p>
          While some approaches to hope classification and detection relied heavily on classical machine
learning classifiers such as SVM, Logistic Regression, and KNN on TF–IDF features [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], most of the
recent studies take a deep learning approach. Convolutional and recurrent networks were used by
Saumya and Mishra [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], but by 2022–23, pretrained transformers dominated. Baseline experiments on
HopeEDI showed that XLM-RoBERTa substantially outperforms traditional models. For example, on
the English HopeEDI test set, RoBERTa achieved a weighted F1 of 0.93 versus 0.90 for KNN and 0.87 for
SVM [15]. In balanced macro-F1 terms, RoBERTa reached a score of 0.52 on English hope classification
compared to 0.40 for KNN and 0.32 for SVM [15]. Similarly, Malik et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] showed that fine-tuned
RuBERTa (Russian RoBERTa), with a translation-based setup, outperformed all baselines.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Transformer Architectures for Emotion Detection</title>
        <p>
          BERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] introduced bidirectional context modeling and has achieved state-of-the-art results across
various NLP tasks. Its bidirectional attention mechanism allows it to consider the full context when
classifying emotional content. GPT-2 [16] employs unidirectional attention but benefits from a larger
pre-training corpus, potentially capturing more linguistic patterns related to hope expressions. DeBERTa
[17] enhances BERT with disentangled attention, separately computing content and position information,
which theoretically improves contextual understanding of complex emotions.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Comparative Performance Studies</title>
        <p>Comparative analyses of transformer architectures have shown task-dependent performance variations.
While newer models often outperform older ones on general benchmarks, specialized tasks may reveal
different patterns. Gao et al. [18] demonstrated that small pre-trained language models can be fine-tuned
to match larger models’ performance, suggesting that model architecture and fine-tuning strategies are
crucial for task-specific performance.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section details our comprehensive approach to hope classification, covering our dataset
characteristics, pre-processing strategies, model architectures, implementation details, and evaluation framework.
We present both our original BERT implementation and the extended comparison of three transformer
architectures (BERT, GPT-2, and DeBERTa) to provide a thorough analysis of hope detection capabilities.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          This study employs custom datasets for hope classification, obtained from the PolyHope shared task
[
          <xref ref-type="bibr" rid="ref10 ref11">19, 20, 21, 22, 23, 11, 10, 24, 25, 26, 27</xref>
          ] at IberLEF 2025 [28]. The training dataset contained 5,233
samples, while the development/test dataset comprised 1,902 samples. Both datasets maintained
similar class distributions, ensuring consistency between training and evaluation. The dataset supports
two classification schemes: a binary task (“Hope” vs. “Not Hope”) and a multiclass task with five
categories (“Not Hope,” “Generalized Hope,” “Realistic Hope,” “Unrealistic Hope,” and “Sarcasm”). For
the binary classification, the training set contained 2,426 (46.36%) “Hope” samples and 2,807 (53.64%)
“Not Hope” samples, with the test set maintaining a similar distribution of 899 (47.27%) “Hope” and
1,003 (52.73%) “Not Hope” samples. The multiclass distribution was also consistent across both sets,
with the following breakdown in the training data: “Not Hope” (42.90%), “Generalized Hope” (24.54%),
“Sarcasm” (13.22%), “Realistic Hope” (10.32%), and “Unrealistic Hope” (9.02%). The test set maintained
nearly identical proportions. This balanced representation across classes helped ensure the models
could learn to distinguish between all categories effectively. The text samples varied in length from 24
to 886 characters, with an average length of approximately 188 characters. This variation in text length
provided the models with diverse linguistic patterns and expressions of hope across different contexts.
To ensure robust model development, we implemented an 80-20 train-validation split on the training
data, maintaining the same random seed (42) across experiments for consistency and reproducibility.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Text Pre-processing</title>
<p>In this study, our approach to text pre-processing differed between the original implementation and the
extended comparison. In the original BERT implementation, we deliberately minimized pre-processing,
feeding raw text directly into the tokenization pipeline to leverage BERT’s capability to capture
contextual nuances. For the extended comparison between models, we applied basic text cleaning through
a custom function that converted text to lowercase, removed URLs and web links, removed hashtags
and user mentions (patterns like #word and @word), and removed punctuation. This cleaned text
was stored in a separate ‘clean_text’ column and used for tokenization across all three models. This
standardized preprocessing in the extended comparison ensured a fair evaluation across different
transformer architectures while allowing us to assess whether specialized cleaning benefits these pre-trained
models. The contrast between approaches also enabled us to evaluate the impact of pre-processing on
model performance for hope classification tasks.</p>
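The cleaning steps described above (lowercasing, then removing URLs, hashtags, @-mentions, and punctuation) can be sketched as a single function. This is an illustrative reconstruction, not the authors' exact custom function; the specific regex patterns are our assumptions.

```python
import re
import string

def clean_text(text):
    """Basic cleaning as described for the extended comparison:
    lowercase; strip URLs, hashtags, mentions, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs and web links
    text = re.sub(r"[#@]\w+", " ", text)                # hashtags and @-mentions
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                       # collapse extra whitespace
```

Note that this removes punctuation-based emphasis and hashtags, which Section 4.1 later suggests may have cost the models useful signal.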
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Transformer Model Architectures</title>
        <p>Our study implements and compares three state-of-the-art transformer architectures for hope
classification, with BERT used in our original implementation and all three models (BERT, GPT-2, and DeBERTa)
compared in our extended analysis.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. BERT Architecture</title>
          <p>In both our original and extended implementations, we utilized the ‘bert-base-uncased’ variant from
Hugging Face’s Transformers library. This BERT model comprises 12 transformer layers, 12 attention
heads, and 768 hidden dimensions, totaling approximately 110 million parameters. BERT’s bidirectional
attention mechanism enables the model to consider the full context when representing each word,
potentially beneficial for capturing complex hope expressions. In our original implementation, BERT
served as the sole architecture for establishing baseline performance in hope classification.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. GPT-2 Architecture</title>
          <p>For our extended comparison, we incorporated the GPT-2 base model (124M parameters) with its
autoregressive architecture. Unlike BERT’s bidirectional attention, GPT-2 uses unidirectional attention
where each token can only attend to previous tokens in the sequence. While this limitation might affect
classification performance, GPT-2’s larger pre-training corpus potentially provides richer semantic
representations beneficial for hope classification. Special consideration was required for GPT-2
implementation, including setting the pad token to match the EOS token and disabling the cache to avoid
errors during training.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. DeBERTa Architecture</title>
          <p>Also included only in our extended comparison, DeBERTa (base version, 140M parameters) implements
a disentangled attention mechanism that separately computes attention weights for content and position
information. This approach theoretically allows for more nuanced contextual understanding, potentially
beneficial for distinguishing between subtle variations of hope expressions and identifying sarcasm.
DeBERTa represents the most complex architecture in our comparison, offering insights into whether
architectural sophistication translates to improved hope classification performance.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Model Implementation and Training</title>
        <p>Our technical approach evolved across the original and extended studies, with consistent use of the
Hugging Face Transformers library and TensorFlow backend throughout both phases.</p>
        <sec id="sec-3-4-1">
          <title>3.4.1. Original BERT Implementation</title>
          <p>In our initial implementation, we focused exclusively on BERT using
TFBertForSequenceClassification for both binary and multiclass classification tasks. We employed the BERT tokenizer with a
maximum sequence length of 128 tokens, with padding and truncation applied as needed. The model was
compiled with Adam optimizer (learning rate 2e-5) and SparseCategoricalCrossentropy loss function.
We trained the model for 3 epochs with a batch size of 8, using accuracy as our primary evaluation
metric. Model checkpointing was implemented to save the best-performing model based on validation
accuracy.</p>
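The training configuration reported above can be summarized in a short sketch. The hyperparameter values match those stated in the text; the builder function itself is a hypothetical reconstruction using the public Hugging Face and Keras APIs, not the authors' script.

```python
# Hyperparameters as reported for the original BERT implementation
HPARAMS = {
    "model_name": "bert-base-uncased",
    "max_length": 128,      # tokens, with padding/truncation as needed
    "learning_rate": 2e-5,  # Adam
    "batch_size": 8,
    "epochs": 3,
}

def build_and_compile(num_labels):
    """Illustrative builder: TFBertForSequenceClassification compiled with
    Adam (2e-5) and sparse categorical cross-entropy, as described."""
    import tensorflow as tf
    from transformers import TFBertForSequenceClassification
    model = TFBertForSequenceClassification.from_pretrained(
        HPARAMS["model_name"], num_labels=num_labels)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=HPARAMS["learning_rate"]),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model
```

`from_logits=True` is needed because the sequence-classification head returns raw logits rather than probabilities.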
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Extended Implementation Comparison</title>
          <p>For our comparative analysis, we expanded to include all three transformer architectures, implementing
custom setup functions for each model. For tokenization, each model used its corresponding tokenizer
with consistent parameters: maximum sequence length of 128 tokens, padding enabled, and truncation
applied. For GPT-2, which lacks a dedicated pad token, we assigned the EOS token as the pad token
and set use_cache=False to prevent errors with the past_key_values parameter. Additionally,
GPT-2 required inputs structured as dictionaries, necessitating the use of TensorFlow Dataset API for
compatibility.</p>
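The GPT-2 adjustments described above (reusing the EOS token as the pad token and disabling the cache) can be sketched as a small helper. The helper name is ours; the attribute assignments follow standard Hugging Face usage.

```python
def configure_gpt2_for_classification(tokenizer, model_config):
    """Reuse EOS as the pad token and disable the KV cache so training
    does not trip over the past_key_values parameter."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token   # pad token == EOS token
    model_config.pad_token_id = tokenizer.pad_token_id
    model_config.use_cache = False                  # avoid past_key_values errors
    return tokenizer, model_config
```

In practice this would be applied to a `GPT2Tokenizer` and the model's config before fine-tuning; any object exposing these attributes works the same way.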
          <p>Across all models in our extended comparison, we maintained the same optimizer (Adam), loss
function (SparseCategoricalCrossentropy), and learning rate (2e-5) while increasing the batch size to 16.
Each model was trained for 3 epochs with identical ModelCheckpoint callbacks to ensure fair comparison
of architectural differences rather than training hyperparameters. This standardized approach helped
isolate architectural performance differences while mitigating overfitting through validation-based
checkpointing. All models were saved in TensorFlow format for consistency and to facilitate deployment
and further experimentation.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Evaluation Framework</title>
        <p>Our evaluation strategy remained consistent across both the original BERT implementation and extended
model comparison. We primarily relied on accuracy as our main metric for overall performance
assessment, allowing direct comparison between models and with prior research in hope classification.
Additionally, we calculated precision, recall, and F1 scores for both weighted and macro averages
to provide a more nuanced understanding of model performance across classes. For the extended
comparison, we expanded our analysis to include training time as a measure of computational efficiency,
an important consideration for real-world deployment scenarios. We also generated confusion matrices
for each model, revealing specific classification patterns and highlighting each architecture’s strengths
and weaknesses in distinguishing between different hope categories, particularly their ability to identify
subtle distinctions between hope types and sarcasm.</p>
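For concreteness, the macro- and weighted-averaged F1 scores reported alongside accuracy can be computed from label lists as below. In practice these values come from scikit-learn; this stdlib sketch is purely illustrative of what the two averages measure.

```python
from collections import Counter

def per_class_prf(y_true, y_pred):
    """Per-class (precision, recall, F1) from parallel label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    stats = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats[c] = (prec, rec, f1)
    return stats

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (treats rare classes equally)."""
    stats = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in stats.values()) / len(stats)

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (tracks class imbalance)."""
    stats = per_class_prf(y_true, y_pred)
    support, n = Counter(y_true), len(y_true)
    return sum(support[c] / n * stats[c][2] for c in stats)
```

The gap between macro and weighted F1 is what exposes weak minority-class performance (e.g. on “Unrealistic Hope”) that overall accuracy hides.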
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Computational Environment</title>
<p>Our study employed different computational resources across implementation phases. For the original
BERT implementation, we utilized Texas Tech University’s High-Performance Computing Center
(HPCC) with NVIDIA A100 GPUs (40GB memory), providing substantial computational power for
baseline model development. For the extended comparison, we shifted to Google Colab with NVIDIA
T4 GPUs, which offered more accessibility while still providing sufficient capacity for comparative
analysis. This environment change explains some of the training time differences observed between
implementations. Both environments used Python 3.x with TensorFlow as the primary framework,
supplemented by the Hugging Face Transformers library, pandas for data manipulation, and scikit-learn
for evaluation metrics. Despite the different GPU types, we maintained consistent training parameters
and evaluation protocols to ensure meaningful comparisons across models and implementations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Comparison of Original and Extended Implementations</title>
        <p>Table 1 shows the performance metrics of our original BERT implementation and the extended model
comparison study. In our original implementation, BERT achieved 83.65% accuracy for binary
classification and 74.87% for multiclass classification. The extended implementation yielded different results
across models, with the refined BERT implementation showing notable improvement in binary classification (84.49% vs.
83.65%) but a decrease in multiclass performance (72.03% vs. 74.87%).</p>
<p>This performance difference between implementations can be attributed to several factors. First, the
preprocessing approach differed, with the extended study applying more comprehensive text cleaning.
We observed a drop in multiclass classification accuracy for BERT, despite other architectural and
training conditions being the same. This accuracy decline suggests that the additional cleaning may
have removed important linguistic features such as capitalization, punctuation-based emphasis, or
hashtags that contribute to the nuanced expression of hope. This is in line with the findings of Siino
et al. [29], who showed that in some cases even minimal preprocessing, such as lowercasing, can reduce the
performance of transformer-based models.</p>
<p>Second, the computational environments varied (HPCC A100 GPUs vs. Google Colab T4 GPUs),
potentially affecting optimization during training. Finally, the batch size increased from 8 in the original
implementation to 16 in the extended comparison, which may have affected the learning dynamics.
It is particularly interesting that the multiclass performance declined across all models in the extended
implementation (ranging from 71.29% to 72.03%) compared to our original BERT implementation
(74.87%). This consistent decrease suggests that either the original implementation benefited from a
particularly advantageous random initialization or data split, or that the text pre-processing applied in
the extended study may have removed linguistic features valuable for distinguishing between nuanced
hope categories.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Performance Comparison</title>
        <p>Building upon our comparison with the original BERT implementation, we examined the relative
performance of our three models in the extended study as shown in Table 1 and Figure 1. For binary
classification, BERT achieved the highest accuracy at 84.49%, followed by DeBERTa at 80.70% and
GPT-2 at 79.34%. This ranking was somewhat unexpected, as the more complex architectures did not
translate to improved performance on the binary task despite their larger parameter counts and more
sophisticated attention mechanisms. For multiclass classification, BERT again outperformed the other
implementations with 72.03% accuracy, followed closely by DeBERTa at 71.56% and GPT-2 at 71.29%.
Interestingly, all three models in the extended study showed lower multiclass performance compared
to our original BERT implementation (74.87%). This consistent performance gap suggests that the
original implementation may have benefited from different preprocessing, batch size, or computational
environment settings that were altered in our comparative study. The similar performance across models in
multiclass classification (with only a 0.74% difference between best and worst) indicates that architectural
differences had minimal impact on the models’ ability to distinguish between nuanced hope categories.
This finding challenges the assumption that more complex transformer architectures necessarily yield
better performance on specialized classification tasks, at least in the context of hope detection.</p>
      </sec>
      <sec id="sec-4-3">
<title>4.3. Computational Efficiency</title>
<p>As illustrated in Figure 2, the models exhibited significant differences in computational requirements.
BERT demonstrated the highest efficiency for binary classification, requiring only 443 seconds for
training, followed by GPT-2 at 527 seconds. DeBERTa demanded substantially more computational
resources at 704 seconds, approximately 59% longer training time than BERT. For multiclass training,
BERT and GPT-2 showed similar efficiency (539s and 530s respectively), while DeBERTa required
significantly more time at 948 seconds, nearly double the training time of the other models. These
efficiency differences have important implications for deployment scenarios, especially in
resource-constrained environments. The substantially higher computational demands of DeBERTa did not
translate to proportional performance improvements, suggesting that BERT offers the best balance of
accuracy and computational efficiency for hope classification tasks. Figure 3 shows a visual trade-off
between model size, training time, and accuracy across the three transformer models.</p>
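The relative overheads quoted above follow directly from the reported timings; a quick arithmetic check:

```python
# Training times (seconds) as reported in this section
bert_bin, gpt2_bin, deberta_bin = 443, 527, 704   # binary task
bert_mc, gpt2_mc, deberta_mc = 539, 530, 948      # multiclass task

# DeBERTa vs. BERT overhead on the binary task: (704 - 443) / 443 ~ 0.59,
# i.e. "approximately 59% longer training time than BERT"
binary_overhead = (deberta_bin - bert_bin) / bert_bin

# DeBERTa vs. BERT on the multiclass task: 948 / 539 ~ 1.76,
# i.e. "nearly double the training time"
multiclass_ratio = deberta_mc / bert_mc
```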
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Classification Patterns</title>
<p>The confusion matrices (Figures 4-9) reveal distinct classification patterns for each model. For binary
classification, GPT-2 demonstrated the highest sensitivity (93.77%) but the lowest specificity (66.40%),
showing a strong tendency to classify texts as “Hope” more frequently than other models. BERT showed
the most balanced performance with 84.20% sensitivity and 84.75% specificity. DeBERTa exhibited
similar patterns to GPT-2, with high sensitivity (92.55%) but lower specificity (70.09%). For multiclass
classification, DeBERTa showed the strongest performance on “Not Hope” (82.35%) compared to BERT
(74.02%) and GPT-2 (74.14%). GPT-2 significantly outperformed the other models on “Sarcasm” detection
with an impressive 92.46% recall, compared to DeBERTa’s 82.14% and BERT’s 77.38%. This suggests
that GPT-2’s larger pre-training corpus may provide advantages for detecting subtle linguistic patterns
like sarcasm. Across all models, “Unrealistic Hope” proved the most challenging category to classify
correctly, with accuracy rates of 67.25% (BERT), 46.78% (GPT-2), and 50.29% (DeBERTa). This category
was frequently confused with “Generalized Hope” and “Realistic Hope,” likely due to its subjective
nature and semantic overlap with other hope categories.</p>
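The sensitivity and specificity figures above are derived from binary confusion-matrix counts, treating "Hope" as the positive class; a minimal sketch (the example counts are illustrative, not the paper's):

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = recall on the positive ("Hope") class;
    specificity = recall on the negative ("Not Hope") class."""
    sensitivity = tp / (tp + fn)   # true positives over all actual positives
    specificity = tn / (tn + fp)   # true negatives over all actual negatives
    return sensitivity, specificity
```

A high-sensitivity, low-specificity profile like GPT-2's corresponds to many false positives relative to false negatives, matching the error distribution discussed in Section 5.1.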
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Error Analysis</title>
      <sec id="sec-5-1">
        <title>5.1. Binary Classification Errors</title>
        <p>Analysis of the binary confusion matrices reveals error patterns across both our original and extended
implementations. In our original BERT implementation, we observed a relatively balanced error
distribution, with minor bias toward false positives. The extended study provided deeper insights
through comparison of all three architectures. In the extended implementation, BERT (Figure 4)
exhibited the most balanced error distribution, with 153 false negatives and 142 false positives, indicating
no strong bias toward either class. GPT-2 (Figure 5) showed a clear tendency toward false positives
(337) over false negatives (56), suggesting it may be overly sensitive to hope-related language patterns.
DeBERTa (Figure 6) demonstrated a similar trend to GPT-2, with more false positives (300) than
false negatives (67), though less pronounced. These patterns align with the architectural differences
between the models. BERT’s bidirectional attention enables balanced context understanding from both
directions. GPT-2’s unidirectional attention may cause it to overweight certain hope-indicating phrases
once encountered, while DeBERTa’s disentangled attention appears to maintain high recall but with
lower precision for hope classification. The performance gap between our best model (BERT at 84.49%)
and the others suggests that for binary hope classification, simpler architectures may be sufficient,
consistent with findings from our original implementation.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Multiclass Classification Errors</title>
        <p>The multiclass confusion matrices reveal more complex error patterns across implementations. Our
original BERT implementation showed particular strength in distinguishing between hope subtypes
compared to all models in the extended study, which helps explain its higher overall accuracy (74.87%
vs. 72.03% for the best extended model). In the extended implementation, all models struggled with
distinguishing between hope subtypes, particularly between “Generalized Hope” and “Realistic Hope.”
For example, BERT (Figure 7) misclassified 84 instances of “Generalized Hope” as “Realistic Hope,”
while GPT-2 misclassified 44 such instances. DeBERTa (Figure 9) showed similar confusion with 83
such misclassifications. GPT-2 (Figure 8) demonstrated particular difficulty with “Unrealistic Hope,”
misclassifying 34 instances as “Not Hope” and 31 as “Generalized Hope.” Notably, GPT-2 performed
exceptionally well at “Sarcasm” detection (92.46% recall) compared to BERT (77.38%) and DeBERTa
(82.14%), likely because its larger pre-training corpus better captured the linguistic patterns associated
with sarcastic expressions. This specific strength represents a significant finding from our extended
implementation that wasn’t evident in the original BERT-only study.</p>
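        <p>The per-class recall figures quoted above follow directly from the confusion matrices: recall for a
class is the count on the diagonal divided by the row total. A minimal sketch (with made-up counts, not the
matrices from Figures 7-9):</p>

```python
# Per-class recall from a confusion matrix: rows are true classes,
# columns are predicted classes. Counts are illustrative only.
labels = ["Generalized Hope", "Realistic Hope", "Sarcasm"]
confusion = [
    [120, 84, 6],   # true Generalized Hope
    [30, 150, 10],  # true Realistic Hope
    [5, 8, 160],    # true Sarcasm
]

def per_class_recall(matrix):
    """Recall for class i = diagonal count / total true instances of i."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

for label, recall in zip(labels, per_class_recall(confusion)):
    print(f"{label}: {recall:.2%}")
```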
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Error Categories and Contributing Factors</title>
        <p>Several distinct error categories emerged across both our original and extended implementations,
providing comprehensive insights into the challenges of hope classification. Contextual Ambiguity
posed significant challenges in cases where hope expressions required broader context beyond the
model’s token window (128 tokens), affecting 15-20% of misclassifications. The limited context window
often prevented models from capturing the full narrative or conversational flow necessary to accurately
interpret hope expressions.</p>
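        <p>The effect of the fixed window can be made concrete with a minimal sketch, in which whitespace
tokens stand in for the actual subword tokenizer: everything beyond the 128-token limit is discarded, and
the attention mask marks real versus padding positions.</p>

```python
# Simplified sketch of fixed-window truncation and padding. Whitespace
# "tokens" stand in for the real subword tokenizer; 128 matches the
# window used in our implementation.
MAX_LEN = 128

def encode(text, max_len=MAX_LEN, pad_token="[PAD]"):
    tokens = text.split()[:max_len]  # anything past the window is lost
    mask = [1] * len(tokens) + [0] * (max_len - len(tokens))
    tokens = tokens + [pad_token] * (max_len - len(tokens))
    return tokens, mask

long_post = "I hope " * 100          # 200 whitespace-separated tokens
tokens, mask = encode(long_post)
print(len(tokens), sum(mask))        # → 128 128: trailing context dropped
```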
        <p>Beyond these contextual limitations, we observed that Category Boundary Confusion represented
the largest source of errors, particularly between “Generalized Hope" and “Realistic Hope," accounting
for approximately 40% of multiclass errors. This confusion wasn’t surprising given the inherent overlap
and subjective boundaries between hope categories, which revealed fundamental limitations in the
models’ ability to make fine-grained distinctions between semantically similar expressions.</p>
        <p>Related to these boundary issues, our analysis uncovered challenges with Implicit Hope Expressions
across all architectures in both implementations. These subtle, culturally-specific, or figurative hope
expressions represented about 25% of errors, as they often relied on contextual knowledge or cultural
references that extended beyond the linguistic patterns captured during pre-training. This challenge
persisted regardless of model complexity or architecture, suggesting an inherent limitation in current
transformer-based approaches.</p>
        <p>Despite the sophisticated attention mechanisms in our models, Sarcasm Detection remained
particularly problematic. While GPT-2 demonstrated superior performance in this regard in our extended
study (92.46% recall), all models encountered difficulties with sarcasm, especially when contextual cues
were subtle or culture-specific. This challenge highlights how the inherent complexity of sarcasm,
which typically relies on tonal cues absent in text, creates a particularly demanding aspect of hope
classification.</p>
        <p>Taken together, these findings illustrate the complexity of hope as an emotion, with its various
manifestations and linguistic expressions posing inherent challenges for computational detection. Our
comprehensive analysis suggests that while advanced architectures like GPT-2 offer specific strengths
for certain aspects of hope classification (particularly sarcasm detection), BERT consistently provides
the best overall performance with significantly lower computational costs across both our original and
extended implementations.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>6.1. Implications of Results</title>
        <p>The performance of our transformer-based hope classification models provides several important
insights into both the technical aspects of hope detection and the broader implications for affective
computing. Our comparative analysis of BERT, GPT-2, and DeBERTa reveals significant findings
about transformer architecture suitability for hope classification, particularly when compared to our
original BERT implementation. These findings have implications for both model selection and practical
deployment considerations.</p>
        <p>For binary classification, our extended BERT implementation achieved the highest accuracy (84.49%)
among the three architectures tested, outperforming our original implementation (83.65%). DeBERTa
followed with 80.70%, and GPT-2 showed the lowest performance at 79.34%. This pattern suggests that
binary hope classification benefits from BERT’s bidirectional approach, providing sufficient contextual
understanding while demanding fewer computational resources. These results indicate that simpler
architectures may be preferred for binary hope detection tasks, with BERT offering the optimal balance
of performance and efficiency.</p>
        <p>In multiclass classification, a different pattern emerged. While BERT outperformed the other
architectures in our extended study (72.03%), followed by DeBERTa (71.56%) and GPT-2 (71.29%), all three models
fell short of our original BERT implementation (74.87%). This performance gap warrants careful
consideration. It may indicate that our original implementation, with minimal text pre-processing and
a different batch size (8 vs. 16), benefited from a configuration that better preserved linguistic features
important for nuanced hope classification. Alternatively, the difference in computational environments
(HPCC A100 GPUs vs. Google Colab T4 GPUs) may have influenced optimization during training.</p>
        <p>These findings challenge the common assumption that newer, larger models automatically yield
better results for specialized NLP tasks [30]. Despite BERT being an earlier architecture with fewer
parameters than both GPT-2 and DeBERTa, it demonstrated competitive or superior performance for
hope classification. This suggests that architectural fit to the specific task may be more important than
model recency or size for specialized affective computing applications.</p>
        <p>From a computational efficiency perspective, the similar performance across models in multiclass
classification (with only a 0.74% difference between best and worst) makes BERT’s significantly lower
computational requirements particularly notable. DeBERTa required nearly double BERT’s training
time while delivering slightly worse performance, raising questions about the value of such advanced
architectures for this specific task. These efficiency differences have significant implications for
deployment scenarios, especially in resource-constrained environments where BERT’s balance of performance
and efficiency may be optimal.</p>
        <p>GPT-2’s performance, particularly its strength in sarcasm detection (92.46% recall) but overall lower
accuracy, suggests that auto-regressive, unidirectional architectures have specific strengths and
weaknesses for emotion classification tasks. While less suited for overall hope classification, GPT-2’s superior
performance in detecting sarcasm highlights the potential value of hybrid approaches that leverage the
strengths of different architectures for specific subcategories of emotional expression.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Limitations and Challenges</title>
        <p>Despite the promising results, some limitations should be acknowledged. The fixed context window of
transformer models (128 tokens in our implementation) potentially limits the model’s ability to capture
hope expressions that require broader textual context. Hope is often expressed in narratives or extended
discourses, and truncating these contexts may result in lost information. For example, context-dependent
constructs such as sarcasm or unrealistic hope can span numerous clauses or sentences, which may be
trimmed in shorter inputs, potentially omitting crucial semantic signals. Previous research in emotion
detection and sentiment analysis has shown that limiting the context length can significantly impact
model understanding, particularly for complex emotions such as sarcasm and irony [31, 32].</p>
        <p>
Additionally, while our multiclass classifier performed adequately, the boundaries between different
hope categories (particularly between "Generalized Hope" and "Realistic Hope") may be inherently
ambiguous. This ambiguity could contribute to some classification errors and might reflect genuine
conceptual overlap rather than model limitations. Similar challenges with categorical emotion boundaries
have been observed by Demszky et al. [
          <xref ref-type="bibr" rid="ref15">33</xref>
          ] in their work on fine-grained emotion detection.
        </p>
        <p>
          The reliance on text alone also overlooks multi-modal aspects of hope expression, such as tone,
emphasis, or accompanying visual cues that might be present in spoken or video communications.
Future work could explore multi-modal approaches to hope detection that incorporate these additional
signals, following the approach of Soleymani et al. [
          <xref ref-type="bibr" rid="ref16">34</xref>
          ] in multi-modal emotion recognition.
        </p>
        <p>Our implementations used base versions of each model rather than larger variants. Future work could
explore whether larger versions of DeBERTa or successors to GPT-2 (GPT-3 or GPT-4) would overcome the
limitations observed. Moreover, the differences between our original and extended implementations highlight
the sensitivity of these models to preprocessing approaches and training environments, suggesting that
careful ablation studies may be valuable for optimizing hope classification systems. A key challenge
arising from our extended study is the observed decline in multiclass performance (from 74.87% in the
original BERT to 72.03% for BERT in the extended study) despite the inclusion of newer architectures.
This suggests that factors such as the standardized preprocessing applied in the extended comparison,
which differed from the minimal preprocessing in the original BERT implementation, or changes in the
computational environment and batch size, may have inadvertently impacted performance.</p>
        <p>Furthermore, our study fixed key hyperparameters like learning rate and batch size across all models
in the extended comparison to ensure a controlled evaluation of architectural differences. While this
aids in comparing architectures directly, it may not represent the optimal performance achievable by
each model, as individual architectures could benefit from specific hyperparameter tuning.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Efficiency and Deployment Considerations</title>
        <p>Our experiments revealed significant differences in computational efficiency across the three
architectures. BERT demonstrated the highest efficiency, requiring only 443 seconds for binary classification
training and 539 seconds for multiclass training. GPT-2 showed moderate efficiency (527s for binary,
530s for multiclass), while DeBERTa demanded substantially more computational resources, requiring
approximately 59% more time for binary classification (704s) and nearly double the training time for
multiclass classification (948s).</p>
        <p>These efficiency differences have important implications for deployment scenarios, especially in
resource-constrained environments. For hope classification specifically, our results suggest that BERT
offers the optimal balance of performance and efficiency. Not only did BERT achieve the highest
accuracy in both binary and multiclass tasks in our extended study, but it did so with significantly lower
computational requirements than more complex alternatives.</p>
        <p>The performance differences between our original and extended BERT implementations highlight
another crucial point: implementation details can significantly impact results, sometimes more than
architectural changes. Our original implementation with minimal pre-processing achieved better
multiclass performance (74.87% vs. 72.03%), suggesting that extensive text cleaning may remove linguistic
features valuable for distinguishing between nuanced hope categories. Organizations considering hope
classification systems should potentially invest in optimizing pre-processing strategies and training
configurations before transitioning to more computationally expensive models. This observation aligns
with findings by Turc et al. [30], who demonstrated that well-optimized smaller models can match or
exceed the performance of larger models while requiring substantially fewer resources.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Applications and Future Directions</title>
        <p>
          The ability to automatically detect and classify hope expressions has numerous potential applications.
In mental health monitoring, tracking hope patterns over time could provide valuable insights into
psychological well-being and treatment eficacy. In social media analysis, measuring hope levels in
public discourse could serve as an indicator of collective emotional states during crises or social change,
similar to the work of Bollen et al. [
          <xref ref-type="bibr" rid="ref17">35</xref>
          ] on public mood analysis via Twitter.
        </p>
        <p>
Political discourse analysis could benefit from automated hope detection to examine how different
rhetorical strategies employ various forms of hope to persuade or mobilize audiences, extending the
research of Nabi et al. [
          <xref ref-type="bibr" rid="ref18">36</xref>
          ] on emotional appeals in persuasive communications. Similarly, marketing
research could use hope classification to analyze the effectiveness of hope-based appeals in advertising
and consumer communications [
          <xref ref-type="bibr" rid="ref19">37</xref>
          ].
        </p>
        <p>
          Future research could explore several promising directions. Developing domain-specific hope
classifiers for areas like healthcare, politics, or crisis response could improve performance in specialized
contexts, following the domain adaptation approach described by Gururangan et al. [
          <xref ref-type="bibr" rid="ref20">38</xref>
          ]. Investigating
hope expressions across different languages and cultures would provide insights into cultural variations
in how hope is expressed and understood, building on cross-cultural emotion research by Jackson et al.
[
          <xref ref-type="bibr" rid="ref21">39</xref>
          ].
        </p>
        <p>An area for future work is a more detailed investigation into the factors contributing to the decline in
multiclass performance in our extended study. This would involve ablation studies to understand the
effects of preprocessing changes, batch size adjustments, and computational environment variations on
model performance.</p>
        <p>Further research should also incorporate model-specific hyperparameter optimization. While our
study maintained consistent hyperparameters, future efforts should tune parameters like learning rate,
batch size, and optimizer settings for each model to unlock their full potential on hope classification tasks.
Additionally, exploring the capabilities of larger pre-trained language models, or of more computationally
efficient distilled versions, could offer a better understanding of the trade-offs between model size,
performance, and efficiency for this specific task.</p>
        <p>Exploring ensemble approaches combining the strengths of different architectures might yield superior
performance without the full computational cost of the most expensive models. For instance, a two-stage
classification system might use BERT for initial binary classification and leverage GPT-2’s strength
in sarcasm detection when that specific category is suspected. Additionally, exploring knowledge
distillation techniques to transfer the capabilities of larger models like DeBERTa into more efficient
architectures could provide an optimal balance of performance and efficiency.</p>
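        <p>The two-stage routing described above can be sketched as follows; the three classifier functions are
hypothetical stubs standing in for fine-tuned models, not our actual implementations:</p>

```python
# Hypothetical sketch of a two-stage ensemble: a cheap BERT binary screen,
# with GPT-2 consulted only for its demonstrated strength (sarcasm).
# All three functions are toy stubs, not real model calls.
def bert_binary(text):
    return "Hope" if "hope" in text.lower() else "Not Hope"

def bert_multiclass(text):
    return "Generalized Hope"             # stub for the five-way prediction

def gpt2_sarcasm(text):
    return text.rstrip().endswith("...")  # stub for the sarcasm specialist

def classify(text):
    if bert_binary(text) == "Not Hope":   # stage 1: binary screen
        return "Not Hope"
    if gpt2_sarcasm(text):                # stage 2: defer to the specialist
        return "Sarcasm"
    return bert_multiclass(text)          # otherwise keep BERT's label

print(classify("Sure, I 'hope' that works out..."))  # → Sarcasm
```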
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Methodological and Ethical Considerations</title>
        <p>
Our work demonstrates the effectiveness of fine-tuning pre-trained language models for specialized
emotion detection tasks, with performance improvements across epochs indicating successful domain
adaptation [
          <xref ref-type="bibr" rid="ref22">40</xref>
          ]. The small validation-test performance gap suggests good generalization to unseen
data, addressing common concerns about overfitting in deep learning [
          <xref ref-type="bibr" rid="ref23">41</xref>
          ].
        </p>
        <p>Our comparison between the original and extended implementations also highlights the importance of
systematic comparisons under controlled conditions. While our original BERT implementation showed
superior multiclass performance, the extended study enabled a more comprehensive understanding of
architectural trade-offs and specific strengths, such as GPT-2’s superior sarcasm detection capability.</p>
        <p>
          From an ethical perspective, hope detection technologies must be deployed responsibly given hope’s
psychological significance. Key concerns include privacy protection when analyzing personal
communications [
          <xref ref-type="bibr" rid="ref24">42</xref>
          ], potential manipulation based on detected hope patterns, and biases in training data
that could lead to uneven performance across demographic groups [
          <xref ref-type="bibr" rid="ref25">43</xref>
          ]. Transparency about system
capabilities and limitations is essential, particularly when these technologies inform decisions
affecting well-being. Researchers and practitioners should follow established ethical frameworks for AI
development to ensure hope detection systems respect autonomy and promote positive outcomes [
          <xref ref-type="bibr" rid="ref26">44</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This study presented a comparative analysis of transformer-based models for hope classification,
extending our original BERT implementation to include GPT-2 and DeBERTa architectures. We evaluated these
models on both binary hope detection and multiclass hope categorization tasks, assessing performance,
efficiency, and error patterns to determine their suitability for practical applications.</p>
      <p>Our findings reveal several key insights. First, despite being an earlier architecture, BERT
demonstrated superior performance for both binary classification (84.49%) and multiclass classification (72.03%)
while requiring significantly fewer computational resources than newer models. This finding is notable
given that our original BERT implementation achieved 83.65% for binary and 74.87% for multiclass
tasks, suggesting that implementation details like preprocessing and batch size significantly impact
performance. Interestingly, all models in our extended comparison showed lower multiclass performance
than our original implementation, highlighting that architectural sophistication does not necessarily
translate to improved results for nuanced hope detection.</p>
      <p>Second, our error analysis identified consistent challenges across all architectures: contextual
ambiguity, category boundary confusion, implicit hope expressions, and sarcasm detection. While GPT-2
demonstrated remarkable strength in sarcasm detection (92.46% recall), overall performance patterns
suggest that certain challenges in hope classification transcend architectural differences, emphasizing
the complex psychological nature of hope as an emotion.</p>
      <p>Third, the substantial difference in computational requirements—with DeBERTa requiring nearly
double BERT’s training time for multiclass classification (948s vs. 539s)—underscores important
efficiency considerations for real-world deployment. Given BERT’s superior or comparable performance
across tasks, the additional computational cost of more complex architectures appears difficult to justify
for hope classification applications.</p>
      <p>The development of computational methods for hope detection opens new possibilities for applications
in mental health monitoring, social media analysis, and discourse studies. By enabling automatic
identification of hope expressions and their subcategories, our approach contributes to the broader field
of affective computing and extends the range of emotions that can be computationally analyzed.</p>
      <p>Future work could explore ensemble approaches combining the strengths of different architectures
(particularly leveraging GPT-2’s superior sarcasm detection), domain-specific hope classifiers for
applications like healthcare or crisis response, and cross-cultural explorations of hope expression. Additionally,
further investigation into the impact of preprocessing strategies could help explain the performance
differences between our original and extended implementations.</p>
      <p>This study represents an important step toward more nuanced emotional analysis in text, moving
beyond basic sentiment categorization to capture the richness and complexity of human emotional
expression. By empirically evaluating diferent transformer architectures for hope classification, we
provide practical guidance for researchers and practitioners seeking to implement efficient and effective
hope detection systems in real-world applications, demonstrating that established architectures like
BERT may offer the optimal balance of performance and efficiency for specialized emotion detection
tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors used LLM-based tools to improve readability and consistency. All research elements, results,
and conclusions were produced and verified by the authors. AI tools were not used to create or validate
data.</p>
    </sec>
    <sec id="sec-9">
      <title>8. Terminology</title>
      <p>This appendix provides definitions of specialized terms used throughout the paper that may not be
familiar to all readers.</p>
      <p>BERT Bidirectional Encoder Representations from Transformers. A transformer-based machine
learning model for natural language processing pre-trained on a large corpus of text.</p>
      <p>GPT-2 Generative Pre-trained Transformer 2. An autoregressive language model that uses
unidirectional attention (each token can only attend to previous tokens). It contains 124 million parameters
in its base version and was pre-trained on a larger corpus than BERT, but its unidirectional nature
may limit contextual understanding for classification tasks.</p>
      <p>DeBERTa Decoding-enhanced BERT with Disentangled Attention. A transformer model that
implements a novel attention mechanism which separately computes attention weights for content and
position information. This architecture aims to provide more nuanced contextual understanding
by disentangling the content and position information in the self-attention mechanism.</p>
      <p>bert-base-uncased A specific pre-trained variant of BERT that uses a vocabulary of uncased
(lowercase) text. It contains 12 transformer layers, 12 attention heads, and 110 million parameters.</p>
      <p>Generalized Hope A broad, non-specific form of hope that is not tied to a particular outcome,
timeframe, or realistic expectation. Often expressed as general optimism about the future.</p>
      <p>Realistic Hope Hope that is grounded in reality, with reasonable expectations of what could potentially
happen based on evidence, experience, or logical reasoning.</p>
      <p>Unrealistic Hope Hope characterized by expectations that have a very low probability of being
realized, often disregarding evidence or practical limitations.</p>
      <p>Sarcasm In the context of hope classification, expressions that superficially appear hopeful but actually
convey the opposite meaning through irony, often with the intent to mock or criticize.</p>
      <p>Fine-tuning The process of taking a pre-trained model (like BERT) and further training it on a specific
task or domain with a smaller dataset to adapt its knowledge to that particular application.</p>
      <p>Attention Masks Binary tensors used in transformer models to indicate which tokens should be
attended to and which should be ignored (such as padding tokens).</p>
      <p>Transfer Learning A machine learning technique where knowledge gained while solving one problem
is applied to a different but related problem, often allowing models to perform well with less
task-specific data.</p>
      <p>Tokenization The process of breaking text into smaller units called tokens, which could be words,
subwords, or characters, that serve as the input to NLP models.</p>
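      <p>As an illustration, a greedy longest-match subword tokenizer in the spirit of BERT’s WordPiece can
be sketched as follows (the vocabulary is invented for the example):</p>

```python
# Toy greedy longest-match subword tokenizer in the spirit of WordPiece.
# The vocabulary is made up for illustration.
VOCAB = {"hope", "hopeful", "##ful", "##ness", "un", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces get a prefix
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]           # no vocabulary piece matched
        start = end
    return tokens

print(wordpiece("hopefulness"))        # → ['hopeful', '##ness']
```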
      <p>Transformer Architecture A deep learning architecture that uses self-attention mechanisms to
process sequential data, allowing the model to weigh the importance of different words in relation to
each other regardless of their position in the sequence.</p>
      <p>TFBertForSequenceClassification A TensorFlow implementation of BERT specifically designed for
sequence classification tasks, with an additional classification layer on top of the BERT model.</p>
      <p>SparseCategoricalCrossentropy A loss function used in multi-class classification problems when
the target values are represented as integers rather than one-hot encoded vectors.</p>
      <p>Legacy Adam Optimizer A version of the Adam optimization algorithm in TensorFlow that maintains
compatibility with older implementations. Adam (Adaptive Moment Estimation) combines the
benefits of two other extensions of stochastic gradient descent: AdaGrad and RMSProp.</p>
      <p>Learning Rate A hyperparameter that controls how much to change the model in response to the
estimated error each time the model weights are updated. The value 2e-5 (0.00002) is commonly
used for fine-tuning BERT models.</p>
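      <p>The quantity SparseCategoricalCrossentropy computes can be sketched in plain Python as the mean
negative log-probability assigned to each integer label (assuming the predictions are probabilities rather
than logits):</p>

```python
import math

# Sparse categorical cross-entropy, sketched by hand: the mean negative
# log-probability the model assigns to each true (integer) label.
def sparse_categorical_crossentropy(y_true, y_pred):
    return sum(-math.log(probs[label])
               for label, probs in zip(y_true, y_pred)) / len(y_true)

y_true = [2, 0]                              # integer labels, not one-hot
y_pred = [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]  # per-class probabilities
print(round(sparse_categorical_crossentropy(y_true, y_pred), 4))  # → 0.2899
```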
      <p>ModelCheckpoint Callbacks Functions in TensorFlow that save the model’s state at specific points
during training, typically when the model achieves better performance on validation data than it
has previously.</p>
      <p>TensorFlow Format A file format for saving TensorFlow models that preserves the model architecture,
weights, and computational graph, allowing for model reuse and deployment.</p>
      <p>Weighted Metrics Performance metrics (precision, recall, F1-score) that account for class imbalance
by calculating scores for each class and then taking a weighted average based on the number of
samples in each class.</p>
      <p>Macro Metrics Performance metrics that calculate scores for each class independently and then take
an unweighted average, treating all classes equally regardless of their size.</p>
      <p>F1-Score A measure of a model’s accuracy that combines precision and recall. It is the harmonic mean
of precision and recall, providing a balance between the two metrics.</p>
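      <p>The difference between macro and weighted averaging can be illustrated with made-up per-class F1
scores and supports (not our actual results): macro treats every class equally, while weighted lets large
classes dominate.</p>

```python
# Macro vs. weighted averaging of per-class F1. Scores and supports are
# illustrative only.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

per_class = {  # class: (precision, recall, support)
    "Not Hope": (0.90, 0.80, 800),
    "Generalized Hope": (0.60, 0.70, 150),
    "Sarcasm": (0.50, 0.40, 50),
}
f1s = {c: f1(p, r) for c, (p, r, _) in per_class.items()}
macro = sum(f1s.values()) / len(f1s)
total = sum(s for _, _, s in per_class.values())
weighted = sum(f1s[c] * s for c, (_, _, s) in per_class.items()) / total

print(f"macro F1:    {macro:.3f}")     # small classes count equally
print(f"weighted F1: {weighted:.3f}")  # large classes dominate
```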
      <p>Overfitting A modeling error that occurs when a model learns the training data too well, including
its noise and outliers, resulting in poor performance on new, unseen data.</p>
      <p>Epoch One complete pass through the entire training dataset during the training of a machine learning
model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Mining and summarizing customer reviews</article-title>
          ,
          <source>in: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>168</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mohammad</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis: Detecting valence, emotions, and other affectual states from text</article-title>
          , in: Emotion measurement, Elsevier,
          <year>2016</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tepper</surname>
          </string-name>
          ,
          <article-title>Detecting hate speech on twitter using a convolution-gru based deep neural network</article-title>
          , in: A. Gangemi, R. Navigli, M.-E. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, M. Alam (Eds.),
          <source>The Semantic Web</source>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>745</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Herth</surname>
          </string-name>
          ,
          <article-title>Abbreviated instrument to measure hope: development and psychometric evaluation</article-title>
          ,
          <source>Journal of Advanced Nursing</source>
          <volume>17</volume>
          (
          <year>1992</year>
          )
          <fpage>1251</fpage>
          -
          <lpage>1259</lpage>
          . doi:10.1111/j.1365-2648.1992.tb01843.x.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Snyder</surname>
          </string-name>
          ,
          <article-title>Hope theory: Rainbows in the mind</article-title>
          ,
          <source>Psychological inquiry 13</source>
          (
          <year>2002</year>
          )
          <fpage>249</fpage>
          -
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A survey on sentiment analysis and opinion mining for social multimedia</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>78</volume>
          (
          <year>2019</year>
          )
          <fpage>6939</fpage>
          -
          <lpage>6967</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <source>Sentiment Analysis and Opinion Mining</source>
          , Springer Nature,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Coppersmith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Harman</surname>
          </string-name>
          ,
          <article-title>Quantifying mental health signals in Twitter</article-title>
          ,
          <source>in: Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valencia-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>García-Baena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>García-Díaz</surname>
          </string-name>
          ,
          <article-title>Overview of the shared task on hope speech detection for equality, diversity, and inclusion</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>378</fpage>
          -
          <lpage>388</lpage>
          . doi:10.18653/v1/2022.ltedi-1.58.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <article-title>HopeEDI: A multilingual hope speech detection dataset for equality, diversity, and inclusion</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media</source>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>53</lpage>
          . URL: https://aclanthology.org/2020.peoples-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <article-title>IIIT_DWD@LT-EDI-EACL2021: Hope speech detection in YouTube multilingual comments</article-title>
          , in: B. R. Chakravarthi, J. P. McCrae, M. Zarrouk, K. Bali, P. Buitelaar (Eds.),
          <source>Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion</source>
          , Association for Computational Linguistics, Kyiv,
          <year>2021</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          . URL: https://aclanthology.org/2021.ltedi-1.14/.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. S. I.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nazarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Jamjoom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Ignatov</surname>
          </string-name>
          ,
          <article-title>Multilingual hope speech detection: A robust framework using transfer learning of fine-tuning RoBERTa model</article-title>
          ,
          <source>J. King Saud Univ. Comput. Inf. Sci.</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>101736</fpage>
          . URL: https://doi.org/10.1016/j.jksuci.2023.101736.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Yigezu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Y.</given-names>
            <surname>Bade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolesnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sidorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <article-title>Multilingual hope speech detection using machine learning</article-title>
          ,
          <source>in: Proceedings of IberLEF 2023, co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2023)</source>
          , volume TBD of CEUR Workshop Proceedings,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14b">
        <mixed-citation>
          <string-name>
            <given-names>B.</given-names>
            <surname>Felbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mislove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Rahwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm</article-title>
          , in: M. Palmer, R. Hwa, S. Riedel (Eds.),
          <source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>1615</fpage>
          -
          <lpage>1625</lpage>
          . URL: https://aclanthology.org/D17-1169/. doi:10.18653/v1/D17-1169.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D.</given-names>
            <surname>Demszky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Movshovitz-Attias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cowen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Nemade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <article-title>GoEmotions: A dataset of fine-grained emotions</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>4040</fpage>
          -
          <lpage>4054</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pantic</surname>
          </string-name>
          ,
          <article-title>A survey of multimodal sentiment analysis</article-title>
          ,
          <source>Image and Vision Computing</source>
          <volume>65</volume>
          (
          <year>2017</year>
          )
          <fpage>3</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bollen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Twitter mood predicts the stock market</article-title>
          ,
          <source>Journal of Computational Science</source>
          <volume>2</volume>
          (
          <year>2011</year>
          )
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Nabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prestin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <article-title>Facebook friends with (health) benefits? Exploring social network site use and perceptions of social support, stress, and well-being</article-title>
          ,
          <source>Cyberpsychology, Behavior, and Social Networking</source>
          <volume>16</volume>
          (
          <year>2013</year>
          )
          <fpage>721</fpage>
          -
          <lpage>727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>MacInnis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>De Mello</surname>
          </string-name>
          ,
          <article-title>The concept of hope and its relevance to product evaluation and choice</article-title>
          ,
          <source>Journal of Marketing</source>
          <volume>69</volume>
          (
          <year>2005</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>
          , in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>8342</fpage>
          -
          <lpage>8360</lpage>
          . URL: https://aclanthology.org/2020.acl-main.740/. doi:10.18653/v1/2020.acl-main.740.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Watts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Henry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>List</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Forkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Mucha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Greenhill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Lindquist</surname>
          </string-name>
          ,
          <article-title>Emotion semantics show both cultural variation and universal structure</article-title>
          ,
          <source>Science</source>
          <volume>366</volume>
          (
          <year>2019</year>
          )
          <fpage>1517</fpage>
          -
          <lpage>1522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>How to fine-tune BERT for text classification?</article-title>
          ,
          <source>in: China national conference on Chinese computational linguistics</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>194</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Recht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Understanding deep learning (still) requires rethinking generalization</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>107</fpage>
          -
          <lpage>115</lpage>
          . URL: https://doi.org/10.1145/3446776. doi:10.1145/3446776.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Richards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <article-title>Big data ethics</article-title>
          ,
          <source>Wake Forest L. Rev.</source>
          <volume>49</volume>
          (
          <year>2014</year>
          )
          <fpage>393</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>K.</given-names>
            <surname>Crawford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Calo</surname>
          </string-name>
          ,
          <article-title>There is a blind spot in AI research</article-title>
          ,
          <source>Nature</source>
          <volume>538</volume>
          (
          <year>2016</year>
          )
          <fpage>311</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cowls</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beltrametti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chazerand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dignum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luetge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Madelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Pagallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rossi</surname>
          </string-name>
          , et al.,
          <article-title>AI4People: an ethical framework for a good AI society: opportunities, risks, principles, and recommendations</article-title>
          ,
          <source>Minds and Machines</source>
          <volume>28</volume>
          (
          <year>2018</year>
          )
          <fpage>689</fpage>
          -
          <lpage>707</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>