<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Cagliari, Italy
* Corresponding author.
$ iftikhar.muhammad@univr.it (I. Muhammad);
marco.rospocher@univr.it (M. Rospocher);
timotej.knez@fri.uni-lj.si (T. Knez); slavko.zitnik@fri.uni-lj.si
(S. Žitnik)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Benchmarking Large Language Models for Target-Based Financial Sentiment Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iftikhar Muhammad</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Rospocher</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timotej Knez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Slavko Žitnik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Ljubljana</institution>
          ,
          <addr-line>1000 Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Verona</institution>
          ,
          <addr-line>37129 Verona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Sentiment analysis is vital for understanding market dynamics and formulating informed investing strategies, especially in volatile financial conditions. This study advances target-based financial sentiment analysis (TBFSA) by rigorously evaluating the efficacy of Large Language Models (LLMs) in zero-shot and few-shot learning contexts. We compare cutting-edge generative LLMs, such as ChatGPT-4o, ChatGPT-4, ChatGPT-o1, DeepSeek-R1, Llama-3-8B, Gemma-2-9B, and Gemma2-27B, with conventional lexicon-based tools (VADER, TextBlob) and discriminative transformer-based models (FinBERT, FinBERT-Tone, DistilFinRoBERTa, Deberta-v3-base-absa-v1.1). Our analysis utilizes a newly curated dataset of 1,162 manually annotated Bloomberg news articles, designed explicitly for TBFSA (due to copyright constraints, only URLs are publicly released, with full news content accessible through a Bloomberg Terminal). The findings indicate that LLMs, particularly DeepSeek-R1 and ChatGPT variants (especially ChatGPT-o1), outperform lexicon-based approaches and discriminative transformer-based models across all evaluation metrics, without requiring additional training or task-specific fine-tuning. The study establishes generative LLMs as a scalable and cost-effective method for target-level sentiment analysis, relieving the need for expensive, rigorous fine-tuning. The research provides valuable insights, enabling institutions to use unstructured textual data effectively for improved real-time risk assessment, portfolio management, and algorithmic trading.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Target-Based Sentiment Analysis</kwd>
        <kwd>Financial Sector</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The financial sector, a pivotal pillar of the global economy, is increasingly influenced by vast amounts of unstructured textual data, including news articles, earnings call transcripts, regulatory filings, and analyst reports [1]. These textual sources significantly impact investor decisions, market volatility, and strategic financial activities [2]. The inadequacy of traditional manual methods for processing such extensive data has led to the adoption of automated procedures based on Natural Language Processing (NLP) techniques [3]. Sentiment analysis, a crucial NLP tool, evaluates the emotional tone of text, providing valuable predictive insights into investor sentiment and market movements [2].</p>
      <p>Financial Sentiment Analysis (FSA), a specific subtask of NLP, identifies subjective tones in financial texts, offering insights for market forecasting, risk management, and the development of trading strategies [4]. Methods for FSA range from conventional lexicon-based techniques and machine learning algorithms to advanced deep learning models, particularly transformer architectures [5]. Recently, generative large language models (LLMs) such as Llama, Gemma, ChatGPT, and DeepSeek have exhibited considerable promise in NLP tasks, especially in zero-shot and few-shot learning contexts, owing to their ability to reduce reliance on extensive manual annotations [6]. However, the efficacy of these models in specialized fields, such as finance, is still inadequately examined, underscoring the necessity for thorough assessment before their incorporation into practical applications like financial reporting software and trading algorithms.</p>
      <p>A notably complex facet of sentiment analysis in financial texts is the recurrent presence of conflicting sentiments towards multiple entities within a single narrative [7]. For example, the statement “Nvidia’s AI-driven growth overshadows Netflix’s subscriber stagnation” concurrently expresses positive and negative sentiments regarding two distinct entities. Conventional sentiment analysis methods at the sentence or document level frequently conflate these subtle perspectives, obscuring critical insights necessary for precise decision-making. To overcome this constraint, Target-Based Financial Sentiment Analysis (TBFSA) disaggregates sentiment at the entity level, facilitating a more detailed examination of specific financial instruments, business entities, or market segments [8]. Nonetheless, the capacity of LLMs to execute zero-shot and few-shot TBFSA tasks in financial markets remains insufficiently investigated. Furthermore, rigorous comparative analyses of lexicon-based tools, discriminative transformer-based approaches, and generative LLMs in this particular setting remain scarce.</p>
      <p>The current study aims to fill these significant gaps by evaluating the potential of LLMs to conduct target-specific sentiment analysis in financial news articles. Specifically, we seek to answer the following research questions:</p>
      <list list-type="order">
        <list-item>
          <p>How do zero-shot and few-shot generative LLMs perform in TBFSA compared to lexicon-based and discriminative transformer-based models?</p>
        </list-item>
        <list-item>
          <p>Does few-shot learning substantially improve the performance of LLMs compared to zero-shot methods in TBFSA?</p>
        </list-item>
      </list>
      <p>Our contributions can be summarized as follows:</p>
      <list list-type="order">
        <list-item>
          <p>We develop and publicly release a novel, manually annotated TBFSA dataset comprising 1,162 financial news articles categorized by target-specific sentiments. In contrast to current financial datasets (e.g., FiQA-2018,1 Financial PhraseBank [9]), our dataset distinctly encapsulates sophisticated entity-level opinions within intricate financial narratives that exhibit conflicting sentiments.</p>
        </list-item>
        <list-item>
          <p>Utilizing this dataset, we systematically evaluate generative LLMs (ChatGPT, Llama, Gemma, DeepSeek), conventional lexicon-based instruments (VADER, TextBlob), and discriminative transformer-based models (FinBERT, DistilFinRoBERTa, FinBERT-Tone, DeBERTa-v3-base-absa-v1.1), emphasizing the strengths and limitations of each approach specifically in the context of TBFSA. This extensive comparative investigation is among the first to critically evaluate advanced LLMs’ performance in zero-shot and few-shot frameworks for target-level financial sentiment analysis.</p>
        </list-item>
      </list>
      <p>The subsequent sections of this research are structured as follows: Section 2 presents relevant literature on financial sentiment analysis. Section 3 delineates the establishment of our dataset, annotation processes, and methodological techniques. Section 4 presents empirical findings and discussion, while Section 5 concludes the study and provides key implications and avenues for future research.</p>
      <p>1https://sites.google.com/view/fiqa/home</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Lexicon-Based Methods</title>
        <p>Lexicon-based approaches, which form the foundation of financial sentiment analysis, initially drew from general-purpose instruments such as LIWC and SentiWordNet. However, these tools lacked domain-specific accuracy and contextual nuance [10]. Frameworks like VADER and TextBlob were then developed to incorporate contextual scoring and automatic lexicon enhancement [11, 12]. Numerous scholars have utilized VADER in the financial domain [13, 14, 15]. However, it struggles to handle sector-specific terminology [16]. Similarly, TextBlob, which integrates predefined lexicons with a classifier trained on film reviews, allows for swift implementation in initial analyses. However, it falls short in complex financial scenarios due to its inadequate domain adaptation [16].</p>
        <p>While lexicon-based methods have been practical, they face significant challenges in deciphering complex linguistic patterns, domain-specific vocabulary, and contextual nuances [17]. These limitations have led to transformer-based models that leverage deep learning to capture semantic and contextual subtleties more effectively in financial texts.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Discriminative Transformer-Based Models</title>
        <p>Transformer-based architectures, particularly BERT [18], transformed NLP by employing a self-attention mechanism that effectively captures contextual relationships. Although general transformers excel at conventional NLP tasks, their effectiveness declines in financial contexts due to specialized lexicons and nuanced tone differences.</p>
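        <p>The scaled dot-product self-attention at the core of these architectures can be illustrated with a minimal sketch in plain Python over toy embedding lists (an illustration of the general mechanism only, not the actual BERT implementation):</p>
        <preformat>
```python
import math

def softmax(scores):
    # numerically stable softmax over a list of floats
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends to every key,
    and the output is the attention-weighted sum of the value vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```
        </preformat>
        <p>Because the weights are a softmax over all positions, each output token mixes information from the whole sequence, which is what lets such models resolve context-dependent financial wording.</p>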
        <p>As a result, domain-specific models fine-tuned on
financial data have developed an increased sensitivity to the
subtleties of financial language and numerical settings
[19].</p>
        <p>FinBERT [20], trained initially on financial documents like SEC filings and subsequently fine-tuned with the FiQA dataset, represented a notable progression in financial sentiment analysis. Studies conducted by [19, 21] confirmed FinBERT’s superiority compared to general-purpose models, especially in analyzing earnings transcripts. Expanding on this, FinBERT-Tone [22] implemented tonal analysis to discern subtle sentiment indications essential for market forecasting. Initiatives to improve efficiency, shown by DistilFinRoBERTa [23], tailored for real-time applications, have also garnered attention. Furthermore, sophisticated models like DeBERTa-v3-base-absa-v1.1 exhibited accuracy in aspect- and target-oriented sentiment analysis, adeptly interpreting intricate narratives in financial documents [17].</p>
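        <p>Aspect-based models of this kind typically score sentiment toward an explicit target by encoding the text and the target term together as a sentence pair. A minimal sketch of that input format (the helper name and special-token layout are illustrative assumptions, not the exact preprocessing of DeBERTa-v3-base-absa-v1.1):</p>
        <preformat>
```python
def build_absa_input(text, target, sep_token="[SEP]", cls_token="[CLS]"):
    """Assemble a sentence-pair input for aspect-based sentiment models:
    text and aspect/target are encoded together so that attention can
    condition on the target entity."""
    return f"{cls_token} {text} {sep_token} {target} {sep_token}"

# one input per target lets a single article yield several predictions
pairs = [
    build_absa_input(
        "Nvidia's AI-driven growth overshadows Netflix's subscriber stagnation.",
        target,
    )
    for target in ("Nvidia", "Netflix")
]
```
        </preformat>
        <p>Feeding the same sentence once per target is what allows one article to receive opposite labels for Nvidia and Netflix.</p>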
        <p>Comparative assessments consistently demonstrate that fine-tuned transformer-based models exceed traditional lexicon-based and machine-learning methodologies [19, 24]. Nevertheless, their demand for processing resources and extensive labelled datasets has initiated the exploration of generative LLMs as viable alternatives that scale more effectively with fewer task-specific labels.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Generative Large Language Models</title>
        <p>Recent developments in LLMs have shown exceptional proficiency in FSA, surpassing conventional lexicon-based and discriminative transformer-based methodologies [21]. The intricate linguistic characteristics of financial texts have prompted the creation of specialized LLMs, such as BloombergGPT [25] and FinVis-GPT [26], specifically tailored for the financial sector. Models such as InvestLM [27], especially fine-tuned for investing environments, have demonstrated effectiveness equivalent to commercial advice systems.</p>
        <p>Furthermore, recent research highlights the efficacy of smaller, computationally efficient models, attaining performance akin to larger LLMs via focused fine-tuning. Methods like parameter-efficient tuning (e.g., LoRA) have enhanced their utilization in practical financial scenarios [28]. Significantly, even general-purpose models such as ChatGPT have exhibited remarkable proficiency in financial sentiment analysis without the necessity for domain-specific fine-tuning [29].</p>
        <p>Despite significant progress, previous studies have primarily focused on generic sentiment analysis, with limited investigation into target-based sentiment analysis within financial contexts. While [17] examined the zero-shot efficacy of LLMs on financial headlines, our research expands this investigation by evaluating full-text articles to provide more extensive contextual insights. Additionally, we extend the evaluation framework to encompass few-shot scenarios and a varied array of models—such as Llama 3-8B, Gemma 2 (9B and 27B), DeepSeek-R1, and ChatGPT variants—benchmarked against conventional lexicon-based and discriminative transformer-based models. Unlike [17], which examined sentiment toward a single target per headline, our study investigates multiple targets within each article, enabling more granular and comprehensive financial sentiment analysis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section delineates the methodological framework utilized to assess the performance of generative LLMs—specifically, Gemma, Llama, ChatGPT, and DeepSeek—in executing TBFSA. To effectively benchmark these LLMs, we utilized various lexicon-based sentiment analysis tools, specifically VADER and TextBlob, in conjunction with discriminative transformer-based models, including FinBERT, DistilFinRoBERTa, FinBERT-Tone, and DeBERTa-v3-base-absa-v1.1. We began by outlining our methodology for dataset collection and annotation, a meticulous process that ensured high reliability and validity criteria. Subsequently, we fine-tuned the benchmark discriminative transformer-based models utilizing this dataset to achieve optimal alignment with the specific requirements of financial sentiment analysis. To thoroughly assess the generative LLMs, we designed precise, task-oriented prompts appropriate for TBFSA. Finally, we conducted a comprehensive comparative study to evaluate the efficacy and robustness of LLMs compared to the benchmark models.</p>
      <sec id="sec-3-0">
        <title>3.1. Dataset Construction and Annotation</title>
        <p>To establish a thorough evaluation framework, we obtained news articles from the Bloomberg Terminal regarding four prominent stock companies—Alphabet, Amazon, Netflix, and Nvidia. The assembled dataset comprises 1,170 articles dated from September 4, 2023, to January 30, 2024. Each article was systematically analyzed to extract critical information, including the timestamps, news text (excluding headlines), and URLs, which were then organized in a structured database (as depicted in Figure 1).</p>
        <p>Each article was meticulously annotated for sentiment concerning the target companies to ensure data quality and confirm the experimental evaluation. The annotation was carried out by three annotators with extensive expertise in finance and economics, all possessing advanced English competence (CEFR level C1). Their annotations were guided by comprehensive guidelines aimed at standardizing target identification and sentiment assessment. A concise summary of these guidelines entails:</p>
        <list list-type="bullet">
          <list-item>
            <p>A thorough examination of each article to identify direct references to the target entities: Alphabet, Amazon, Netflix, and Nvidia.</p>
          </list-item>
          <list-item>
            <p>Identification of multiple target entities within a single article, where applicable.</p>
          </list-item>
          <list-item>
            <p>Labelling articles devoid of explicit target references as “no target.”</p>
          </list-item>
          <list-item>
            <p>Evaluation of sentiment from an investor’s viewpoint, relying exclusively on the textual content.</p>
          </list-item>
          <list-item>
            <p>Sentiment classification as positive (1), negative (-1), or neutral (0).</p>
          </list-item>
          <list-item>
            <p>Identification of prevailing sentiment in instances of mixed expressions.</p>
          </list-item>
          <list-item>
            <p>Neutral labelling for vague, ambiguous, or passing references.</p>
          </list-item>
        </list>
      </sec>
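      <p>The label and target constraints in the guidelines above can be expressed as a small validation routine (an illustrative sketch; the record field names are hypothetical and do not reflect the released dataset schema):</p>
      <preformat>
```python
TARGETS = {"Alphabet", "Amazon", "Netflix", "Nvidia"}
SENTIMENTS = {1, -1, 0}  # positive, negative, neutral

def validate_annotation(record):
    """Check one annotation case: the target must be one of the four
    companies (or the 'no target' label), and the sentiment must come
    from the three-point scale used by the annotators."""
    ok_target = record["target"] in TARGETS or record["target"] == "no target"
    ok_label = record["sentiment"] in SENTIMENTS
    return ok_target and ok_label
```
      </preformat>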
      <p>The annotation procedure was organized into two separate phases. The annotators initially conducted target identification individually across all 1,170 articles. Eight articles were excluded as having "no target" by consensus. Inter-annotator reliability for target identification yielded a Krippendorff’s alpha [30] of 0.96 and a percentage agreement [31] of 98.95% for the remaining 1,162 articles, signifying consistent annotations. Texts with majority-agreed targets were forwarded for sentiment annotation, yielding 1,334 unique annotation cases due to multiple target references within specific articles.</p>
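      <p>The percentage-agreement statistic can be sketched as average pairwise agreement across the three annotators (a simplified illustration; the study follows the formal definitions in [30, 31], and Krippendorff’s alpha additionally corrects for chance agreement):</p>
      <preformat>
```python
from itertools import combinations

def percentage_agreement(annotations):
    """Average pairwise percentage agreement.

    `annotations` maps each annotator to a list of labels, one per item,
    in the same item order for every annotator.
    """
    coders = list(annotations.values())
    n_items = len(coders[0])
    pair_scores = []
    for a, b in combinations(coders, 2):
        matches = sum(1 for x, y in zip(a, b) if x == y)
        pair_scores.append(matches / n_items)
    return sum(pair_scores) / len(pair_scores)
```
      </preformat>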
      <p>In the second phase, sentiment annotation was performed for all identified target entities. Annotators used a defined scale to assign sentiments: ‘1’ for positive, ‘-1’ for negative, and ‘0’ for neutral sentiment. To ensure consistency, annotators collaboratively annotated a shared subset of 150 texts, resulting in satisfactory inter-annotator reliability (Krippendorff’s alpha of 0.81; percentage agreement of 83%). The sentiment labels for the 150 texts were established by majority consensus, and the remaining 1,184 texts were allocated evenly among annotators for individual sentiment labelling.</p>
      <p>Figure 2: Sentiment distribution across the targets.</p>
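      <p>The majority-consensus step can be sketched as a vote over the three annotators’ labels (an illustrative helper; the tie-handling policy shown is an assumption, since ties among three annotators on a three-point scale require adjudication):</p>
      <preformat>
```python
from collections import Counter

def majority_label(labels):
    """Majority consensus over annotator labels on the {-1, 0, 1} scale.

    Returns the winning label, or None on a tie (illustrative policy:
    tied cases would be passed to adjudication rather than auto-labelled).
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # three-way or two-way tie: no majority
    return counts[0][0]
```
      </preformat>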
      <p>The final annotated dataset consists of 1,334 texts, each explicitly associated with a target entity and an annotated sentiment label. The dataset demonstrates a moderate class imbalance, with positive sentiments accounting for 45%, negative sentiments for 27%, and neutral sentiments for 28%. Table 1 presents annotated instances, whereas Figure 2 represents the sentiment distribution. Additional quantitative parameters, including the total number of news texts, average daily texts, average text length (measured in tokens), and average target mentions, are outlined in Table 2.</p>
      <p>We publicly release our curated dataset2 to assist the academic community and guarantee methodological transparency and reproducibility. Due to copyright restrictions, we cannot disseminate the complete content of the news articles. However, we provide comprehensive metadata, encompassing publication dates, timestamps, specified target entities, and Bloomberg article URLs, facilitating the retrieval of original articles via the Bloomberg Terminal, a subscription-based platform widely accessible in academic and financial institutions.</p>
      <p>2https://github.com/iftikharm895/Target-Based_Sentiment_Analysis_in_Financial_News</p>
      <p>Table 1 (Text column): sample annotated news texts.</p>
      <p>Alphabet Inc. shares tumbled the most in a year on Wednesday after the Google parent reported a smaller than expected profit in cloud computing, raising concerns about its position in a market critical to its future. Ed Ludlow reports.</p>
      <p>Amazon Japan says it will build its first “sort center” in Japan in Shinagawa, Tokyo, located ∼3.5km from Haneda International Airport. Expects to create ∼1,000 new jobs. Will handle as many as 750,000 items/day.</p>
      <p>Netflix co-CEO Ted Sarandos says talks with striking actors broke down after the union asked for a “levy” on streaming customers. Sarandos speaks at the first-ever Bloomberg Screentime conference in Los Angeles.</p>
      <p>The projected ex-date for Nvidia’s dividend moved to Dec. 6 from Nov. 30, according to an updated Bloomberg Dividend Forecast. The new ex-date falls after the Dec. 1 option expiry.</p>
      <sec id="sec-3-2">
        <title>3.2. Baseline Models</title>
        <p>To meticulously assess generative LLMs in TBFSA, we have conducted a comprehensive comparison of their efficacy with established benchmarks: lexicon-based instruments (TextBlob, VADER) and discriminative transformer architectures (FinBERT, FinBERT-Tone, DistilFinRoBERTa, DeBERTa-v3-absa-v1.1).</p>
        <p>TextBlob,3 an open-source Python library developed on the Natural Language Toolkit (NLTK) and Pattern libraries, assigns sentiment polarity scores ranging from −1 to +1 and has been widely utilized for financial texts [16, 32, 33]. VADER,4 developed by [11], a rule-based framework, incorporates lexical, grammatical, and syntactic heuristics—validated against LIWC and ANEW—and has also been extensively employed in financial contexts [17, 34, 35].</p>
        <p>Discriminative transformer-based baselines comprise:</p>
        <list list-type="order">
          <list-item>
            <p>DistilFinRoBERTa,5 a distilled variant of RoBERTa fine-tuned on financial datasets for three-class sentiment analysis [23];</p>
          </list-item>
          <list-item>
            <p>FinBERT,6 [20] a BERT adaptation pre-trained on earnings calls, news articles, and regulatory filings and fine-tuned on Financial PhraseBank [9];</p>
          </list-item>
          <list-item>
            <p>FinBERT-Tone,7 [19] which enhances FinBERT to identify tonal nuances, fine-tuned on SEC filings, earning reports, and financial news; and</p>
          </list-item>
          <list-item>
            <p>DeBERTa-v3-absa-v1.1,8 which builds upon the DeBERTa-v3 architecture [36] and has been fine-tuned for Aspect-Based Sentiment Analysis (ABSA) through the FAST-LCF-BERT framework [37]. It is trained on an extensive dataset comprising 30,000 ABSA-specific samples and further fine-tuned on an additional 180,000 annotated examples from a variety of datasets.</p>
          </list-item>
        </list>
        <p>These discriminative transformer-based models have been extensively employed in financial sentiment research [23, 38, 39].</p>
        <p>The current study involved fine-tuning DistilFinRoBERTa, FinBERT, and FinBERT-Tone using a learning rate of 3 × 10−5, 10 training epochs, and a batch size of 32. For DeBERTa-v3-absa-v1.1, we utilized a 5-fold cross-validation approach to enhance robustness, training each fold for 10 epochs using default hyperparameters on an NVIDIA RTX 4090 GPU.</p>
        <p>3https://textblob.readthedocs.io/en/dev/ 4https://github.com/cjhutto/vaderSentiment 5https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis 6https://huggingface.co/ProsusAI/finbert 7https://huggingface.co/yiyanghkust/finbert-tone 8https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluated Generative LLMs</title>
        <p>Recent improvements in LLMs have garnered significant academic interest owing to their proven effectiveness in several text-based tasks [40]. Notable and widely utilized models include OpenAI’s ChatGPT,9 Gemma10—a series of open models based on Google’s Gemini architecture—Meta’s LLaMA,11 and DeepSeek.12</p>
        <p>The current study assessed the efficacy of various advanced generative LLMs within the framework of TBFSA.</p>
        <p>The evaluated models include ChatGPT-4, ChatGPT-4o, ChatGPT-o1, LLaMA 3 8B, Gemma 2 9B, Gemma 2 27B, and DeepSeek-R1. All models were assessed in their default configurations, without any additional fine-tuning, to evaluate their zero-shot and few-shot capabilities in executing the specified task. Interactions with ChatGPT variants were executed via OpenAI’s standard web interface, utilizing a temperature setting of 0.7. The Gemma models were accessed via the Gemini API, which suggests a temperature setting of 1.0 for both the 9B and 27B variants. DeepSeek-R1 was accessed via its public chat interface, employing its standard temperature setting. To interact with the LLaMA model, we utilized a local instance of the Meta-Llama-3-8B-Instruct model, running under the Ollama13 application. For testing purposes, we used default hyperparameters and advanced optimization techniques, including 4-bit quantization, to efficiently execute this model on consumer-grade GPU systems.</p>
        <p>To assess the performance of generative LLMs in TBFSA, we employed zero-shot and few-shot prompting strategies using manually designed, fixed prompts without task-specific tuning. The prompt used in the zero/few-shot learning approach is presented in Figure 3. In the zero-shot context, models were given task instructions without illustrative examples. In few-shot contexts, prompts were augmented by either one (1-shot) or five (5-shot) additionally annotated examples, to provide contextual grounding.</p>
        <p>9https://chatgpt.com/. The ChatGPT variants analyzed in this study are limited to those available during the research period. Newer versions released during manuscript preparation will be examined in future work. 10https://gemini.google.com/app 11https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct 12https://www.deepseek.com/</p>
        <p>This approach utilizes LLMs’ inherent language and contextual reasoning abilities, enabling performance evaluation without requiring task-specific training or model adaptation. The method offers a clear assessment of model generality and adaptability, enhancing their suitability for effortless implementation in diverse practical applications.</p>
        <p>The evaluation of model performance utilized recognized criteria for sentiment categorization, including precision, accuracy, recall, and F1-score [41]. The metrics were calculated across three sentiment categories—negative, neutral, and positive—utilizing both macro-averaging (equal weight across classes) and weighted averaging (weighted by class sample size) to ensure robustness amid moderately imbalanced class distributions, as done in analogous situations (e.g., [42]).</p>
        <p>Table 3: performance of all evaluated models (TextBlob, VADER, FinBERT, DistilFinRoBERTa, FinBERT-Tone, Deberta-v3-absa-v1.1, and the generative LLMs Llama 3 8B, Gemma 2 9B, Gemma 2 27B, ChatGPT-4, ChatGPT-4o, ChatGPT-o1, and DeepSeek-R1 under zero-shot, 1-shot, and 5-shot settings).</p>
        <p>4. Results and Discussions</p>
        <p>Table 3 presents the outcomes for all models evaluated on the novel dataset introduced in this research. Lexicon-based approaches, such as VADER and TextBlob, exhibit consistently subpar performance across all evaluation metrics, with macro-F1 scores below 0.37. These models are limited by their dependence on static, general-purpose sentiment lexicons that do not incorporate domain-specific financial language, in addition to their document-level emphasis and rigid rule-based architecture. As a result, they fail to capture the contextual intricacies and entity-specific sentiment differentiations necessary for effective TBFSA.</p>
        <p>Conversely, discriminative transformer-based models optimized for FSA tasks substantially exceed the performance of lexicon-based models. FinBERT, DistilFinRoBERTa, and FinBERT-Tone attain increasingly higher macro-F1 scores (ranging from 0.54 to 0.62), demonstrating the advantages of domain-specific pretraining and contextualized embeddings. Nonetheless, these models operate at the sentence or document level and fail to assign sentiment to specific entities, hence constraining their efficacy in multi-entity financial texts. Conversely, DeBERTa-v3-base-ABSA-v1.1, tailored for target/aspect-based sentiment analysis, attains the highest macro-F1 score (0.66) among fine-tuned transformer models. Its disentangled attention mechanism and structured input encoding provide fine-grained, token-level sentiment attribution, rendering it more suitable for intricate, entity-aware financial analysis.</p>
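        <p>The macro- and weighted-averaging schemes used in the evaluation can be made concrete with a small from-scratch computation (a sketch for illustration; the study reports the metrics per [41], for which standard library implementations exist):</p>
        <preformat>
```python
def f1_scores(y_true, y_pred, classes=(-1, 0, 1)):
    """Per-class F1 plus macro (equal class weight) and weighted
    (class-support weight) averages over the three sentiment labels."""
    per_class, supports = {}, {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        supports[c] = sum(1 for t in y_true if t == c)
    macro = sum(per_class.values()) / len(classes)
    total = sum(supports.values())
    weighted = sum(per_class[c] * supports[c] / total for c in classes)
    return per_class, macro, weighted
```
        </preformat>
        <p>With the dataset’s 45/27/28 class split, the two averages diverge whenever the majority (positive) class is classified more reliably than the minority classes, which is why both are reported.</p>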
        <p>Among the generative LLMs evaluated under zero- SEC Rule 15c3-5 that necessitate model interpretability
shot settings, DeepSeek-R1 and the ChatGPT models for audit and risk governance. These limitations
under(ChatGPT-o1, ChatGPT-4, and ChatGPT-4o) consistently score the need for transdisciplinary innovation. The
efsurpass baseline models. DeepSeek-R1 attains the high- fective incorporation of LLMs into financial analytics
est zero-shot macro-F1 score (0.82), closely followed by will likely rely on hybrid architectures that combine
lanChatGPT-o1 (0.80). Performance enhances with few-shot guage capabilities with conventional econometric models.
prompting: in the 1-shot setting, ChatGPT-o1 slightly These hybrid architectures hold the potential to
revoluoutperforms DeepSeek-R1 with a macro-F1 score of 0.84 tionize financial analytics, balancing traditional financial
compared to 0.83. The highest scores are recorded in the metrics’ interpretability with AI’s adaptive learning
capa5-shot setting, with DeepSeek-R1 achieving 0.87, slightly bilities, and thereby mitigating the risks linked to opaque
above ChatGPT-o1’s score of 0.86. These findings high- algorithmic decision-making. Resolving these
complexlight the eficacy of few-shot learning in improving con- ities necessitates collaboration among AI researchers,
textual comprehension and sentiment categorization out- economists, and regulatory authorities to ensure that
incomes. Nonetheless, smaller models such as LLaMA 3 novations, such as federated learning for data privacy and
8B exhibit significant sensitivity to few-shot prompting. synthetic financial text generation for enhanced training
While it attains a zero-shot macro-F1 score of 0.63, perfor- robustness, are implemented ethically and efectively.
mance significantly declines to 0.44 in the 1-shot scenario,
with only a modest recovery to 0.63 at the 5-shot level.</p>
        <p>In summary, lexicon-based sentiment analysis methods like VADER and TextBlob are insufficient for TBFSA because they fail to capture contextual financial semantics. Discriminative transformer-based models such as DistilFinRoBERTa, FinBERT, and FinBERT-Tone provide quantifiable improvements but remain inadequate in precision and entity-level interpretability. Domain-adapted models like DeBERTa-v3-absa-v1.1, although tailored for target/aspect-based tasks, are surpassed by generative LLMs such as the ChatGPT variants and DeepSeek-R1.</p>
        <p>The consistent success of ChatGPT-4, ChatGPT-4o, ChatGPT-o1, and DeepSeek-R1 on the TBFSA task demonstrates the efficacy of comprehensive pre-training, which equips these LLMs to perform exceptionally in zero/few-shot scenarios and to generalize across several domains without requiring task-specific fine-tuning. Their consistent superiority over conventional lexicon-based systems and discriminative transformer-based models underscores a significant transition towards generative LLMs that combine high adaptability with robust domain-agnostic generalization, providing an efficient substitute for resource-intensive supervised methods in specialized tasks such as TBFSA. Such granular, entity-specific sentiment interpretation holds substantial implications for investors, financial analysts, and algorithmic trading systems: these models allow stakeholders to make more informed decisions, potentially improving portfolio management techniques and optimizing market timing.</p>
        <p>However, deploying LLMs in financial markets presents obstacles. Significant computational complexity and inference latency limit their applicability in ultra-high-frequency trading, where execution times are measured in milliseconds. Moreover, regulatory issues arise from the intrinsic opacity of LLM decision-making, which contradicts compliance requirements such as MiFID II.</p>
      </sec>
      <sec>
        <title>5. Conclusions</title>
        <p>This study offers a comprehensive evaluation of target-based financial sentiment analysis (TBFSA) by systematically comparing the effectiveness of cutting-edge generative large language models (LLMs), including ChatGPT, DeepSeek, LLaMA, and Gemma, with conventional lexicon-based methods (VADER, TextBlob) and discriminative transformer-based models (FinBERT, DistilFinRoBERTa, FinBERT-Tone, and DeBERTa-v3-base-ABSA-v1.1).</p>
        <p>The findings indicate that LLMs, especially the ChatGPT variants (notably ChatGPT-o1) and DeepSeek-R1, surpass all baseline models in target-level sentiment analysis. Their capacity to deduce implicit sentiment, adapt to financial terminology, and function efficiently without task-specific fine-tuning makes them scalable, ready-to-deploy solutions for practical applications like algorithmic trading and real-time risk assessment. These findings bear immediate implications for financial institutions, fintech developers, and analysts seeking to incorporate sentiment-driven insights into investing and risk management processes.</p>
        <p>Despite the promising findings, the study acknowledges several limitations. The investigation is confined to news articles from four prominent technology firms (Alphabet, Amazon, Netflix, and Nvidia), potentially constraining the generalizability of the findings to other industries or smaller market-cap companies with possibly distinct sentiment patterns. Furthermore, the study covers a short time frame (Sep 4, 2023, to Jan 30, 2024), offering short-term insights while potentially neglecting long-term patterns, seasonal fluctuations, and macroeconomic changes. In addition, the sole reliance on news articles neglects other vital data sources, such as social media sentiment, earnings reports, and macroeconomic indicators, which could enhance the research. To address these constraints, future research could broaden the analysis to various sectors and global markets, integrate additional data sources, and extend the study over several years to assess LLM performance across market regimes, including bull and bear cycles. Moreover, enhancing prompt designs via automated techniques, investigating time-lagged sentiment effects, and improving the interpretability of LLM outputs are promising avenues towards more robust, comprehensible, and sector-agnostic applications of LLM-driven financial sentiment research.</p>
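        <p>The insufficiency of sentence-level lexicon aggregation for target-level analysis noted above can be made concrete with a toy scorer: when one headline carries opposing sentiment towards two entities, a single aggregate score cannot separate them. The four-entry lexicon and the headline below are hypothetical illustrations, not the actual VADER or TextBlob resources.</p>
        <preformat>
```python
# Toy lexicon scorer: sums word polarities over the whole sentence, the way
# sentence-level lexicon methods aggregate scores (illustrative only).
TOY_LEXICON = {"surges": 1.0, "record": 0.5, "slumps": -1.0, "misses": -0.5}

def sentence_score(sentence: str) -> float:
    """Return one aggregate polarity score for the whole sentence."""
    return sum(TOY_LEXICON.get(word.strip(".,").lower(), 0.0)
               for word in sentence.split())

headline = ("Nvidia surges on record demand while "
            "Netflix slumps and misses estimates")
print(sentence_score(headline))  # prints 0.0
# The positive cues (Nvidia) and negative cues (Netflix) cancel out into one
# near-zero score, so the two targets are indistinguishable at this level.
```
        </preformat>
        <p>A target-aware model, by contrast, must assign a positive label to Nvidia and a negative one to Netflix for the same sentence, which is precisely what the TBFSA task requires.</p>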
      </sec>
      <sec>
        <title>Data Availability</title>
        <p>The dataset developed with this research is available at https://github.com/iftikharm895/Target-Based_Sentiment_Analysis_in_Financial_News. Due to copyright constraints, only the URLs with manual annotations are publicly released; the full news content is accessible through a Bloomberg Terminal.</p>
      </sec>
      <sec>
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to: paraphrase and reword text, improve writing style, and check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>