<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Sentiment analysis with fine-tuned large language models in Eastern European languages on Twitter/X</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomáš Filip</string-name>
          <email>tomas.filip@osu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Pavlíček</string-name>
          <email>martin.pavlicek@osu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petr Sosík</string-name>
          <email>petr.sosik@osu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mistral</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>BERTweet</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>XLM-T</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Research Applications of Fuzzy Modeling, University of Ostrava</institution>
          ,
          <addr-line>30. dubna 22, Ostrava, 70200</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Faculty of Philosophy and Science, Silesian University in Opava</institution>
          ,
          <addr-line>Bezručovo náměstí 1150/13</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Opava</institution>
          ,
          <addr-line>746 01</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We address the problem of fine-tuning large language models (LLMs) for sentiment analysis on Twitter/X in underrepresented Eastern European languages (Czech, Slovak, Polish, and Hungarian). We study the influence of a number of experimental settings on the efficiency of fine-tuning in two groups of LLMs: transfer-learning models (BERT, BERTweet or XLM-T, the latter two pre-trained on a Twitter corpus) and popular mid-sized universal models (Llama, Mistral). We show that adapter fine-tuning with as few as ≈ 600 tweets improved scores of our universal models to the level previously reported by Twitter/X-specialised models on popular datasets, while our transfer-learning models performed worse. We also show that, despite previous successful experiments with multilingual models, translating from underrepresented languages into English still improves the results of all models tested. Several other factors that influence the success of fine-tuning are also included in the study.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language model</kwd>
        <kwd>Sentiment analysis</kwd>
        <kwd>Twitter</kwd>
        <kwd>Eastern-European language</kwd>
        <kwd>Russo-Ukraine conflict</kwd>
        <kwd>Llama</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sentiment analysis is one of the most common topics in natural language processing, with rapidly
emerging techniques [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recently, machine learning methods, especially large language models (LLMs),
have been considered the state of the art on sufficiently large training datasets. As end-user deployment
of language models is now common and affordable, their performance in underrepresented languages
is becoming important.
      </p>
      <p>This paper focusses on fine-tuning LLMs for sentiment analysis in Eastern European languages
(Czech, Slovak, Polish, and Hungarian) belonging to the so-called Visegrád (V4) group. As a case study,
we chose the topic of the Ukraine war crisis on Twitter/X, providing a large textual corpus with rich
sentiment polarity. This topic is also the target of intensive cyberbullying attacks and, simultaneously,
a crucial source of Open Source Intelligence (OSINT), further underlining its relevance. The novelty
of the paper lies in the following aspects:
• Twitter/X studies in Eastern European (EE) languages are rare in LLM-based sentiment analysis,
and we are not aware of any studies focussing on the Russo-Ukraine conflict.
• The aspects of the tunability of various LLMs on Twitter/X (or similar) EE data have not been
adequately researched.
• The performance of mid-sized or large models (Llama, Mistral, or GPT-4) versus transfer learning
models (BERT, BERTweet, RoBERTa) in Twitter/X-based tasks has been poorly studied, with very
few exceptions, such as [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>].</p>
      <p>We downloaded and annotated three monolingual datasets (CS/SK, PL, HU) from Twitter/X. The
datasets were used to fine-tune three transfer learning models (BERT, BERTweet, XLM-T) and three
mid-sized LLMs (Llama 2, Llama 3, Mistral) in a number of experimental settings illustrated in Fig. 1.
The training objective was the sentiment polarity towards either Ukraine or Russia. We evaluated the
influence of various settings, such as the size of the dataset, the translation into English, or the presence
of the reference tweet (the one to which the tweet reacted), on the efficiency of fine-tuning. The key
findings are as follows.</p>
      <p>• Fine-tuning with as few as ≈ 600 tweets in underrepresented Eastern European languages
improved the F1 score of the Llama and Mistral models by 30–40%, reaching the level of specialised
models on Twitter/X benchmarks.
• Fine-tuned general mid-sized LLMs such as Llama or Mistral significantly outperformed equally
fine-tuned transfer learning models (BERTweet, XLM-T) pre-trained on a large Twitter/X corpus.
• All models (including multilingual XLM-T or GPT-4) performed best when fine-tuned on a dataset
translated into English by DeepL.
• Unsurprisingly, in-context learning did not help the small- and mid-sized models; neither did the
context of reference tweets improve the fine-tuning.</p>
      <p>The rest of the paper is organised as follows. Section 2 briefly reviews sentiment analysis in texts,
with a focus on Twitter/X datasets. Section 3 describes the construction of our dataset, followed by Sec.
4, which outlines the experimental settings. Section 5 contains an overview of the results, which are then
discussed in more detail in Sec. 6. Section 7 provides an ablation study that focusses on the impact of
selected experimental variables. Finally, Section 8 summarises the results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        With the rapid growth of social networks and e-commerce, sentiment analysis has emerged as one of the
fastest-growing research areas in computer science. To capture sentiment with greater granularity, Hu
and Liu [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced the concept of aspect-based sentiment analysis (ABSA) in their foundational work,
which has since inspired numerous follow-up studies. A comprehensive review of recent developments
in NLP-based sentiment analysis is provided by Jim et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Recent progress in ABSA has been significantly driven by the integration of large language models
(LLMs; see https://paperswithcode.com/task/aspect-based-sentiment-analysis). For example, Zhang et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a generative framework that formulates ABSA as a
text generation problem, offering a flexible alternative to traditional classification approaches. Building
on the strengths of instruction-based learning in LLMs, Scaria et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced the InstructABSA
model, which leverages task instructions to improve performance. Periodic survey studies, such as
that by Brauwers and Frasincar [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], continue to provide structured overviews of the evolving ABSA
landscape.
      </p>
      <p>
        Sentiment classification can be challenging in Twitter/X data due to the lack of explicit context and
the specific style of tweets. The TweetEval benchmark [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] evaluated models that analyse sentiment in tweets on detection
tasks of emotion, irony, hate speech, offensive language, stance, emoji prediction and sentiment analysis.
The TweetEval leaderboard on GitHub lists BERTweet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as the current SoTA model, closely followed
by TimeLM-21. The family of TimeLM models [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] addresses the temporal-context problem by periodic
updates with new tweet datasets, and outperformed BERTweet in many tasks.
      </p>
      <p>
        Barbieri et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] expanded the focus on multilingual tweet analysis and presented a unified tweet
benchmark in eight languages (UMSAB). The paper also introduced the XLM-Twitter model (XLM-T)
developed by pre-training the XLM-R [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] using 198M multilingual tweets. XLM-T was further
fine-tuned on UMSAB, and the resulting model was named XLM-T Sentiment. Barreto et al. [<xref ref-type="bibr" rid="ref13">13</xref>] studied,
among other topics, the performance of BERT, RoBERTa and BERTweet in Twitter ABSC tasks.
      </p>
      <p>
        Krugmann et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] compared the performance of established transfer learning models (BERT,
BERTweet, RoBERTa) with recent LLMs (GPT-3.5, GPT-4, and Llama 2) on Twitter/X data, and reported
the superiority of the latter. In contrast to these results, Stigall et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] presented a fine-tuned model
EmoBERTTiny for emotion and sentiment classification tasks and reported its superiority over non-tuned
Llama-2-7B-chat and Mistral-7B-Instruct across all metrics. These and other authors also reported on
the domain sensitivity of the models.
      </p>
      <p>Finally, of the many existing sentiment studies on the Russia–Ukraine war on social networks, we mention
two. An evaluation of traditional ML models (logistic regression, decision trees, random forests, SVMs,
etc.) on Twitter data was provided in [<xref ref-type="bibr" rid="ref14">14</xref>]. A deep learning approach combining a multi-feature CNN
with BiLSTM was applied in [<xref ref-type="bibr" rid="ref15">15</xref>] to an analogous task. Both studies relied on monolingual English
datasets.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset construction</title>
      <p>Our data were collected using the academic Twitter/X API during the period 4/2/2023 to 20/5/2023.
Filtering by languages (Czech/Slovak, Polish, Hungarian) and keywords (Ukraine, Russia, Zelensky,
Putin) resulted in 34,124 relevant tweets, split into three monolingual parts according to the language.
There was no filter available for Slovak, so it was mixed with Czech. In every monolingual dataset, we
manually annotated a random subset of tweets by their sentiment toward Ukraine or Russia, keeping
the classes roughly balanced. A certain class imbalance resulted from the lack of relevant tweets neutral
to a given aspect. To avoid annotation bias, the annotators followed the principles of the CAMEO
conflicting-topic codebook (http://data.gdeltproject.org/documentation/CAMEO.Manual.1.1b3.pdf), and
the annotated tweets were cross-validated among the annotators. The annotated datasets are not the
same size (see Table 1), in order to study the impact of the size on the models’ performance. Each
annotated dataset was split into a training set (75 %) and a testing set (25 %). The datasets are available
in the supplementary data on GitHub; see the link in Conclusions.</p>
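      <p>As an illustration of this step, the following is a minimal sketch of such a 75/25 split; the file name
and the “sentiment” column are illustrative placeholders, not the published data schema.</p>
      <preformat>
# Hypothetical 75/25 split of one annotated monolingual dataset.
# File name and column names are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("annotated_tweets_cs_sk.csv")
train_df, test_df = train_test_split(
    df,
    test_size=0.25,            # 25 % testing set, as described above
    random_state=42,
    stratify=df["sentiment"],  # keep class ratios similar in both splits
)
      </preformat>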
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <sec id="sec-4-1">
        <title>Language models</title>
        <p>
          The models we tested (Table 2) belong to two categories: (i) transfer learning models popular in the ABSA
literature and in the TweetEval and UMSAB benchmarks: BERT, BERTweet, and XLM-T. The latter two
have been pre-trained on large Twitter/X corpora. As we intended to study the tunability of universal
models, we did not use language-specific variants such as PolBERT (https://github.com/kldarek/polbert),
huBERT (https://huggingface.co/SZTAKI-HLT/hubert-base-cc), or SlovakBERT (https://huggingface.co/gerulata/slovakbert). (ii)
Mid-sized open-source models (up to 10B parameters) which are fine-tunable on limited end-user GPU
hardware: Llama-2 7B, Llama-3 8B, and Mistral 7B. Recent studies such as [
          <xref ref-type="bibr" rid="ref2 ref3 ref16">2, 3, 16</xref>
          ] point out the lack of
studies on ABSA using these and similar models. Furthermore, ChatGPT-4 [<xref ref-type="bibr" rid="ref17">17</xref>] was used as a reference
model for tweet classification.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Translation</title>
        <p>
          When applying pre-trained LLMs to datasets in underrepresented languages, some sources such as
[<xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>] report better results with machine translation to English, while others rely on follow-up training
or fine-tuning in the original languages [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. To compare the effectiveness of both approaches, the
annotated datasets were used for both training and testing in three different language modes (a
translation sketch follows the list):
• translated to English using the Helsinki Neural Machine Translation System (https://huggingface.co/Helsinki-NLP);
• translated to English using the DeepL API (https://www.deepl.com/translator);
• no translation, original languages (CS/SK, PL, HU).
        </p>
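        <p>A minimal sketch of the first mode, using a Helsinki-NLP (MarianMT) Czech-to-English model from
the Hugging Face hub; the exact model variant is an assumption here, not taken from the experiments.</p>
        <preformat>
# Hypothetical sketch: translating tweets to English with a
# Helsinki-NLP MarianMT model; the model name is illustrative.
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-cs-en"  # Czech-to-English
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

def translate(texts):
    batch = tokenizer(texts, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["Ukrajina si zaslouží naši podporu."]))
        </preformat>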
      </sec>
      <sec id="sec-4-3">
        <title>Training</title>
        <p>We trained each decoder model by using a tweet as input and generating a single output token. The loss
function was the cross-entropy between the generated token and the ground-truth label. Each model in
Table 2, in combination with each translation mode, was fine-tuned on each language-specific training
set (not their combination). For Llama 2/3 and Mistral we used the PEFT adapter-based technique [<xref ref-type="bibr" rid="ref24">24</xref>]
via the Python PEFT library (https://huggingface.co/docs/peft). The number of tuned parameters varied
between 3.5–4 million. The training was run for 10 epochs on all models. The learning rate was set to
3 × 10⁻⁴ and the batch size to 4. The learning rate grew linearly to its maximum during warm-up (the
first 100 iterations) and then decreased linearly towards zero. The remaining hyperparameters were
library defaults. All metrics were calculated at the best checkpoint of the model. Both training and
inference were run on a server with 2 × RTX 2060 (8 GB) GPUs for the smaller BERT-derived models,
and on another server with 2 × NVIDIA V100 (32 GB) GPUs for the larger models.</p>
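        <p>The following is a minimal sketch of this adapter-based setup with the PEFT and Transformers
libraries, using the hyperparameters stated above; the base model name, the LoRA configuration details,
and the training set variable are illustrative assumptions, not the exact training script.</p>
        <preformat>
# Hypothetical sketch of PEFT adapter (LoRA) fine-tuning; the model
# name, adapter ranks and dataset are illustrative placeholders.
from transformers import (AutoModelForCausalLM,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Low-rank adapters keep the number of trainable parameters in the
# low millions, comparable to the 3.5-4M reported above.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=10,            # 10 epochs
    per_device_train_batch_size=4,  # batch size 4
    learning_rate=3e-4,             # peak learning rate 3e-4
    warmup_steps=100,               # linear warm-up over 100 iterations
    lr_scheduler_type="linear",     # then linear decay towards zero
)

# Placeholder: tokenized prompts ending in a single label token, so the
# causal-LM cross-entropy on that token matches the objective above.
train_set = ...

Trainer(model=model, args=args, train_dataset=train_set).train()
        </preformat>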
      </sec>
      <sec id="sec-4-4">
        <title>Inference</title>
        <p>After fine-tuning in a specific language, all models in Table 2 were prompted the same way using the
testing set in the same language. The experiments were carried out with and without the use of the
reference tweet (to which the classified tweet reacted). We used a simple English prompt in all experiments:
tweet: {tweet}
The sentiment of the tweet towards {aspect} is…</p>
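        <p>A minimal sketch of this prompting scheme, assuming a fine-tuned causal model and its tokenizer
are already loaded; the label vocabulary and its mapping are assumptions for illustration.</p>
        <preformat>
# Hypothetical inference sketch: classify one tweet by generating a
# single token and mapping it onto a sentiment label.
import torch

PROMPT = "tweet: {tweet}\nThe sentiment of the tweet towards {aspect} is"

def classify(model, tokenizer, tweet, aspect):
    inputs = tokenizer(PROMPT.format(tweet=tweet, aspect=aspect),
                       return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1)
    # The answer is the single token appended after the prompt.
    answer = tokenizer.decode(out[0, -1]).strip().lower()
    # Assumed label vocabulary; the real mapping may differ.
    for label in ("positive", "neutral", "negative"):
        if answer and label.startswith(answer):
            return label
    return "unknown"
        </preformat>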
        <p>For GPT-4 we did not use fine-tuning but instead applied in-context instruction learning (ICL), that
is, expanding the prompt with context information related to the question asked. The expanded prompt
can be found in the online Appendices to the paper; please follow the link in the “Supplementary
material” section.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We conducted an extensive series of tweet sentiment classification experiments that varied in the
following settings (the full grid is sketched after the list):
• sentiment aspect (Russia/Ukraine)
• language of the tweet (CS/SK, HU, PL)
• language model (BERT, BERTweet, XLM-T, Llama 2, Llama 3, Mistral, GPT-4)
• tweet translation (DeepL, Helsinki translator, none)
• positive/neutral/negative classification, or only positive/negative
• the presence of a reference tweet</p>
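      <p>The full series is the Cartesian product of these settings; schematically (setting values abbreviated,
as a sketch rather than the actual experiment driver):</p>
      <preformat>
# Sketch of the experimental grid as a Cartesian product of settings.
from itertools import product

aspects      = ["Russia", "Ukraine"]
languages    = ["CS/SK", "HU", "PL"]
models       = ["BERT", "BERTweet", "XLM-T", "Llama 2",
                "Llama 3", "Mistral", "GPT-4"]
translations = ["DeepL", "Helsinki", "none"]
classes      = ["pos/neu/neg", "pos/neg"]
reference    = [True, False]

grid = list(product(aspects, languages, models,
                    translations, classes, reference))
print(len(grid))  # number of setting combinations
      </preformat>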
      <p>Standard metrics were used to evaluate the results: accuracy and macro-averaged recall, precision,
and F1 score [<xref ref-type="bibr" rid="ref25">25</xref>]. The macro-averaged F1 was chosen as our primary evaluation measure due to
its balanced assessment of model performance across multiple classes (negative,
neutral, positive). Unless stated otherwise, tables and graphs show results for positive/neutral/negative
sentiment classification. With the exception of ChatGPT-4, all results were obtained without using
reference tweets. The complete results are contained in the supplementary data on GitHub; see the link
in Conclusions.</p>
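      <p>Macro-averaging treats all three classes equally regardless of their support; a minimal sketch of the
metric computation, with made-up labels for illustration:</p>
      <preformat>
# Hypothetical illustration of the macro-averaged metrics used above.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["negative", "negative", "positive", "negative", "neutral"]

print(recall_score(y_true, y_pred, average="macro"))
print(precision_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
      </preformat>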
      <p>Figure 2 summarises the main results, organised by language model and type of translation. Concerning
the performance of individual models, surprisingly, Llama 3 scored approx. 6% F1 worse than Llama 2,
and BERTweet large performed worse than BERT base; perhaps pre-training on older tweets could have
affected the tunability of BERTweet to a newer context. Neither did XLM-T reach the level of the larger
models, although it was pre-trained on a large multilingual tweet corpus. The order-of-magnitude larger
model size seems to be the prevailing factor. Finally, all models benefitted from the DeepL translation.
Therefore, the remaining results included in the paper are restricted to DeepL-translated datasets.</p>
      <sec id="sec-5-1">
        <title>Results by languages</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>Relation to the SoTA</title>
        <p>
          Our focus on underrepresented EE languages does not allow a direct comparison with popular Twitter/X
benchmarks, so the following figures provide only an approximate picture. The TweetEval
leaderboard [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] marks TimeLM-21 as the SoTA model, with a macro-averaged recall of 73.7 for three-valued ABSA,
followed by BERTweet with a recall of 73.4. Our best macro-averaged result (Llama 2, translation by DeepL,
averaged over all aspects and languages) was an F1 score of 73.7. On the one hand, our task is much
narrower than TweetEval; on the other hand, TweetEval is monolingual and BERTweet was trained
on 850M English tweets, while we fine-tuned our models using three datasets with a few hundred
tweets in underrepresented languages.
        </p>
        <p>
          The UMSAB Twitter benchmark [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] reports XLM-Tw Multi as the best model with an F1 score of
69.4, macro-averaged in eight languages. Again, this task is wider than ours, but XLM-Tw Multi used a
much larger fine-tuning dataset; therefore, we cannot provide an exact comparison.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>Size of the training sets</title>
        <p>The size of the CZ/SK training set was approximately three times that of HU or PL, which were
almost equal. This imbalance allowed for some interesting observations. In the simpler task of
two-valued classification, almost all fine-tuned models returned scores independent of the language, implying
that training sets with about 600 tweets were sufficient to bridge the language differences. However,
in the case of three-valued classification, the CZ/SK dataset was favoured by all fine-tuned models.
Hence, for this harder task, the smaller HU/PL training sets were insufficient. The effect was stronger for
smaller models (BERT, BERTweet), confirming the multiplicative joint scaling law for LLM fine-tuning
[<xref ref-type="bibr" rid="ref26">26</xref>].</p>
      </sec>
      <sec id="sec-6-3">
        <title>Model and human bias</title>
        <p>In the context of the current situation, where Russia is described as the aggressor, human annotators
who know more about the context may tend to see the situation in terms of cause and effect, and
therefore their sentiment determination is usually biased differently than that of the models [<xref ref-type="bibr" rid="ref27">27</xref>]. In particular,
LLMs struggled with tweets neutral (or positive) to a given aspect but generally negative, for example,
those addressing bombing, war, or attacks. Models such as Llama 2 or Mistral showed significantly lower precision
and recall for the neutral class than for the negative or positive one.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Scores for individual classes</title>
        <p>All experiments in Section 5 used macro-averaged recall, precision and F1 scores, since the scores were
mostly similar for all classes, with a few exceptions. In particular, in Hungarian, the recall of the positive
class was often approximately 10% lower than that of the negative class, and the trend was opposite in
precision, meaning that the models tended to classify Hungarian tweets more negatively than the human
annotators did. This might be because the overall ratio of negative samples in the
Hungarian dataset was somewhat higher than in the other languages.</p>
      </sec>
      <sec id="sec-6-5">
        <title>Ineffective in-context learning</title>
        <p>When employing small, computationally inexpensive models, in-context learning (ICL) often entails
notable trade-offs. Due to their more limited representational capacity, these models may be unable to
leverage ICL effectively. Another contributing factor may be insufficient pre-training alignment with
the target domain or topic. Furthermore, the additional complexity introduced by ICL can increase
task ambiguity in aspect-based sentiment analysis (ABSA). Fine-tuning may also override any marginal
gains that ICL might provide. To rigorously identify the primary factors underlying the lack of ICL
effectiveness, further fine-grained experimental analyses are required.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Ablation study</title>
      <sec id="sec-7-1">
        <title>Reference tweet use</title>
        <p>In this section, we discuss the contribution of several components of the experimental pipeline to the
classification performance.</p>
        <p>The reference tweet was always used in the in-context prompt for GPT-4, as it improved its performance
(data not shown). For all other models, reference tweets slightly worsened the macro-averaged F1
score (e.g., BERT by 4%, XLM-T by 2%, Llama 2 by 0.5%, Llama 3 by 0.8%, Mistral by 2.5% in the case of
positive/neutral/negative classification). Therefore, we agree with [<xref ref-type="bibr" rid="ref28">28</xref>] that, while smaller models rely
substantially on semantic priors from pre-training, large models can override them when the prompt
contains contradicting exemplars.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Fine-tuning and in-context learning</title>
        <p>To compare these two approaches to Twitter/X task adaptation, we evaluated the models Llama 2, Llama 3,
Mistral, and GPT-4 in their vanilla versions, i.e., without fine-tuning and in-context learning, respectively.
The study was restricted to the case of DeepL translation and positive/neutral/negative classification.
Table 3 shows that fine-tuning improved the F1 score of Llama 2/3 and Mistral by 20–40% over
the vanilla versions, while GPT-4 benefited from the ICL by about 10%.</p>
        <p>In the overwhelming majority of settings (see Fig. 2 and the supplementary material), all LLMs performed
better when fine-tuned and tested on English-translated datasets, and the DeepL translator gave better
results than the Helsinki translator. The improvement in the macro-averaged F1 score across all models was
0.8% for the Helsinki translator and 3.1% for DeepL. DeepL translation improved the F1 score by
1.2% even for the multilingual XLM-T sentiment model. In the supplementary material, we also provide
a comparison of the original tweets with both translated versions, to verify that the classification
differences were caused by the quality of the translation and not by a systematic sentiment bias
introduced by the translator.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>We addressed the fine-tuning of large language models for sentiment analysis tasks on Twitter/X in
underrepresented Eastern-European languages. We manually annotated a Twitter/X-based dataset
related to the Russo-Ukrainian conflict, narrowed to the V4 (Czech Republic, Slovakia, Poland, Hungary)
language space. The dataset was used to fine-tune six language models (BERT, BERTweet, XLM-T,
Llama 2/3, Mistral) used frequently for sentiment analysis. The tuning was done separately for each
language in several variants, using either the original tweets or the English translation with the Helsinki
or DeepL translator. Furthermore, GPT-4 (with or without in-context learning) was used as a reference
model. The results were evaluated using standard metrics, mostly F1.</p>
      <p>
        We demonstrated that adapter fine-tuning, even with as few as hundreds of samples in
underrepresented languages, was able to draw the models’ attention to the desired aspects and also to balance
language and cultural differences (at least for most models). Experiments have shown that, despite
previous successful experiments with multilingual models [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], translating from underrepresented
languages into English still improves the fine-tuning of all models tested, in a wide variety of
experimental settings. However, neither instruction-based in-context learning nor the enrichment of fine-tuning
with the context of reference tweets improved the results. Finally, our experiments also confirmed that
the success of fine-tuning depends on the model and the task, as reported by other studies such as [<xref ref-type="bibr" rid="ref26">26</xref>].
      </p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This article was produced with the financial support of the European Union under the
REFRESH – Research Excellence For REgion Sustainability and High-tech Industries project, number
CZ.10.03.01/00/22_003/0000048, via the Operational Programme Just Transition, and under the
Biography of Fake News with a Touch of AI: Dangerous Phenomenon through the Prism of Modern
Human Sciences project, no. CZ.02.01.01/00/23_025/0008724, via the Operational Programme Jan Ámos
Komenský. It was also supported by the Silesian University in Opava under the Student Funding Plan,
project SGS/9/2024.</p>
    </sec>
    <sec id="sec-10">
      <title>Supplementary material and data</title>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT (GPT-4o), Writefull (Overleaf integration)
and DeepL for the following: text translation, paraphrasing and rewording, improving the writing style,
and grammar and spelling checking. After using these tools, the authors reviewed and edited the content
as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Jim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A. R.</given-names>
            <surname>Talukder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Kabir</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Nur</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Mridha</surname>
          </string-name>
          ,
          <article-title>Recent advancements and challenges of NLP-based sentiment analysis: A state-of-the-art review</article-title>
          ,
          <source>Natural Language Processing Journal</source>
          (
          <year>2024</year>
          )
          <fpage>100059</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Krugmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis in the age of generative ai</article-title>
          ,
          <source>Customer Needs and Solutions</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <article-title>3</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Stigall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Al Hafiz Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Attota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nweke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <article-title>Large language models performance comparison of emotion and sentiment classification</article-title>
          ,
          <source>in: Proceedings of the 2024 ACM Southeast Conference</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>60</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Mining and summarizing customer reviews</article-title>
          ,
          <source>in: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>168</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          , W. Lam,
          <article-title>Towards generative aspect-based sentiment analysis</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>504</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Scaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Sawant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          , C. Baral,
          <article-title>InstructABSA: instruction learning for aspect based sentiment analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2302.08624</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brauwers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frasincar</surname>
          </string-name>
          ,
          <article-title>A survey on aspect-based sentiment classification</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          , L. Neves, TweetEval: Unified Benchmark and
          <article-title>Comparative Evaluation for Tweet Classification</article-title>
          ,
          <source>in: Proceedings of Findings of EMNLP</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1644</fpage>
          --
          <lpage>1650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          , A. T. Nguyen,
          <article-title>BERTweet: a pre-trained language model for English tweets</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Loureiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Anke</surname>
          </string-name>
          , J. Camacho-Collados,
          <article-title>TimeLMs: Diachronic language models from Twitter</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , É. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>