<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Cagliari, Italy
* Corresponding author.
$ iftikhar.muhammad@univr.it (I. Muhammad);
marco.rospocher@univr.it (M. Rospocher);
timotej.knez@fri.uni-lj.si (T. Knez); slavko.zitnik@fri.uni-lj.si
(S. Žitnik)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Benchmarking Large Language Models for Target-Based Financial Sentiment Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iftikhar Muhammad</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Rospocher</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Timotej Knez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Slavko Žitnik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Ljubljana</institution>
          ,
          <addr-line>1000 Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Verona</institution>
          ,
          <addr-line>37129 Verona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Sentiment analysis is vital for understanding market dynamics and formulating informed investing strategies, especially in volatile financial conditions. This study advances target-based financial sentiment analysis (TBFSA) by rigorously evaluating the efficacy of Large Language Models (LLMs) in zero-shot and few-shot learning contexts. We compare cutting-edge generative LLMs, such as ChatGPT-4o, ChatGPT-4, ChatGPT-o1, DeepSeek-R1, Llama-3-8B, Gemma-2-9B, and Gemma2-27B, with conventional lexicon-based tools (VADER, TextBlob) and discriminative transformer-based models (FinBERT, FinBERT-Tone, DistilFinRoBERTa, Deberta-v3-base-absa-v1.1). Our analysis utilizes a newly curated dataset of 1,162 manually annotated Bloomberg news articles, designed explicitly for TBFSA (due to copyright constraints, only URLs are publicly released, with full news content accessible through a Bloomberg Terminal). The findings indicate that LLMs, particularly DeepSeek-R1 and ChatGPT variants (especially ChatGPT-o1), outperform lexicon-based approaches and discriminative transformer-based models across all evaluation metrics, without requiring additional training or task-specific fine-tuning. The study establishes generative LLMs as a scalable and cost-effective method for target-level sentiment analysis, relieving the need for expensive, rigorous fine-tuning. The research provides valuable insights, enabling institutions to use unstructured textual data effectively for improved real-time risk assessment, portfolio management, and algorithmic trading.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Target-Based Sentiment Analysis</kwd>
        <kwd>Financial Sector</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The financial sector, a pivotal pillar of the global economy, is increasingly influenced by vast amounts of unstructured textual data, including news articles, earnings call transcripts, regulatory filings, and analyst reports [1]. These textual sources significantly impact investor decisions, market volatility, and strategic financial activities [2]. The inadequacy of traditional manual methods for processing such extensive data has led to the adoption of automated procedures based on Natural Language Processing (NLP) techniques [3]. Sentiment analysis, a crucial NLP tool, evaluates the emotional tone of text, providing valuable predictive insights into investor sentiment and market movements [2].</p>
      <p>Financial Sentiment Analysis (FSA), a specific subtask of NLP, identifies subjective tones in financial texts, offering insights for market forecasting, risk management, and the development of trading strategies [4]. Methods for FSA range from conventional lexicon-based techniques and machine learning algorithms to advanced deep learning models, particularly transformer architectures [5]. Recently, generative large language models (LLMs) such as Llama, Gemma, ChatGPT, and DeepSeek have exhibited considerable promise in NLP tasks, especially in zero-shot and few-shot learning contexts, owing to their ability to reduce reliance on extensive manual annotations [6]. However, the efficacy of these models in specialized fields, such as finance, is still inadequately examined, underscoring the necessity for thorough assessment before their incorporation into practical applications like financial reporting software and trading algorithms.</p>
      <p>A notably complex facet of sentiment analysis in financial texts is the recurrent presence of conflicting sentiments towards multiple entities within a single narrative [7]. For example, the statement “Nvidia’s AI-driven growth overshadows Netflix’s subscriber stagnation” concurrently expresses positive and negative sentiments regarding two distinct entities. Conventional sentiment analysis methods at the sentence or document level frequently conflate these subtle perspectives, obscuring critical insights necessary for precise decision-making. To overcome this constraint, Target-Based Financial Sentiment Analysis (TBFSA) disaggregates sentiment at the entity level, facilitating a more detailed examination of specific financial instruments, business entities, or market segments [8]. Nonetheless, the capacity of LLMs to execute zero-shot and few-shot TBFSA tasks in financial markets remains insufficiently investigated. Furthermore, rigorous comparative analyses of lexicon-based tools, discriminative transformer-based approaches, and generative LLMs in this particular setting remain scarce.</p>
      <p>The current study aims to fill these significant gaps by evaluating the potential of LLMs to conduct target-specific sentiment analysis in financial news articles. Specifically, we seek to answer the following research questions:</p>
      <list list-type="order">
        <list-item>
          <p>How do zero-shot and few-shot generative LLMs perform in TBFSA compared to lexicon-based and discriminative transformer-based models?</p>
        </list-item>
        <list-item>
          <p>Does few-shot learning substantially improve the performance of LLMs compared to zero-shot methods in TBFSA?</p>
        </list-item>
      </list>
      <p>Our contributions can be summarized as follows:</p>
      <list list-type="order">
        <list-item>
          <p>We develop and publicly release a novel, manually annotated TBFSA dataset comprising 1,162 financial news articles categorized by target-specific sentiments. In contrast to current financial datasets (e.g., FiQA-2018,1 Financial PhraseBank [9]), our dataset distinctly encapsulates sophisticated entity-level opinions within intricate financial narratives that exhibit conflicting sentiments.</p>
        </list-item>
        <list-item>
          <p>Utilizing this dataset, we systematically evaluate generative LLMs (ChatGPT, Llama, Gemma, DeepSeek), conventional lexicon-based instruments (VADER, TextBlob), and discriminative transformer-based models (FinBERT, DistilFinRoBERTa, FinBERT-Tone, DeBERTa-v3-base-absa-v1.1), emphasizing the strengths and limitations of each approach specifically in the context of TBFSA. This extensive comparative investigation is among the first to critically evaluate advanced LLMs’ performance in zero-shot and few-shot frameworks for target-level financial sentiment analysis.</p>
        </list-item>
      </list>
      <p>The subsequent sections of this research are structured as follows: Section 2 presents relevant literature on financial sentiment analysis. Section 3 delineates the establishment of our dataset, annotation processes, and methodological techniques. Section 4 presents empirical findings and discussion, while Section 5 concludes the study and provides key implications and avenues for future research.</p>
      <p>1https://sites.google.com/view/fiqa/home</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Lexicon-Based Methods</title>
        <p>Lexicon-based approaches, which form the foundation of financial sentiment analysis, initially drew from general-purpose instruments such as LIWC and SentiWordNet. However, these tools lacked domain-specific accuracy and contextual nuance [10]. Frameworks like VADER and TextBlob were then developed to incorporate contextual scoring and automatic lexicon enhancement [11, 12]. Numerous scholars have utilized VADER in the financial domain [13, 14, 15]. However, it struggles to handle sector-specific terminology [16]. Similarly, TextBlob, which integrates predefined lexicons with a classifier trained on film reviews, allows for swift implementation in initial analyses. However, it falls short in complex financial scenarios due to its inadequate domain adaptation [16].</p>
        <p>While lexicon-based methods have been practical, they face significant challenges in deciphering complex linguistic patterns, domain-specific vocabulary, and contextual nuances [17]. These limitations have led to transformer-based models that leverage deep learning to capture semantic and contextual subtleties more effectively in financial texts.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Discriminative Transformer-Based Models</title>
        <p>Transformer-based architectures, particularly BERT [18], transformed NLP by employing a self-attention mechanism that effectively captures contextual relationships. Although general transformers excel at conventional NLP tasks, their effectiveness declines in financial contexts due to specialized lexicons and nuanced tone differences.</p>
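        <p>The scaled dot-product self-attention at the core of these architectures can be illustrated with a minimal sketch in plain Python over toy embedding lists (an illustration of the general mechanism only, not the actual BERT implementation):</p>
        <preformat>
```python
import math

def softmax(scores):
    # numerically stable softmax over a list of floats
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends to every key,
    and the output is the attention-weighted sum of the value vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```
        </preformat>
        <p>Because the weights are a softmax over all positions, each output token mixes information from the whole sequence, which is what lets such models resolve context-dependent financial wording.</p>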
        <p>As a result, domain-specific models fine-tuned on
financial data have developed an increased sensitivity to the
subtleties of financial language and numerical settings
[19].</p>
        <p>FinBERT [20], trained initially on financial documents like SEC filings and subsequently fine-tuned with the FiQA dataset, represented a notable progression in financial sentiment analysis. Studies conducted by [19, 21] confirmed FinBERT’s superiority compared to general-purpose models, especially in analyzing earnings transcripts. Expanding on this, FinBERT-Tone [22] implemented tonal analysis to discern subtle sentiment indications essential for market forecasting. Initiatives to improve efficiency, shown by DistilFinRoBERTa [23], tailored for real-time applications, have also garnered attention. Furthermore, sophisticated models like DeBERTa-v3-base-absa-v1.1 exhibited accuracy in aspect- and target-oriented sentiment analysis, adeptly interpreting intricate narratives in financial documents [17].</p>
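        <p>Aspect-based models of this kind typically score sentiment toward an explicit target by encoding the text and the target term together as a sentence pair. A minimal sketch of that input format (the helper name and special-token layout are illustrative assumptions, not the exact preprocessing of DeBERTa-v3-base-absa-v1.1):</p>
        <preformat>
```python
def build_absa_input(text, target, sep_token="[SEP]", cls_token="[CLS]"):
    """Assemble a sentence-pair input for aspect-based sentiment models:
    text and aspect/target are encoded together so that attention can
    condition on the target entity."""
    return f"{cls_token} {text} {sep_token} {target} {sep_token}"

# one input per target lets a single article yield several predictions
pairs = [
    build_absa_input(
        "Nvidia's AI-driven growth overshadows Netflix's subscriber stagnation.",
        target,
    )
    for target in ("Nvidia", "Netflix")
]
```
        </preformat>
        <p>Feeding the same sentence once per target is what allows one article to receive opposite labels for Nvidia and Netflix.</p>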
        <p>Comparative assessments consistently demonstrate that fine-tuned transformer-based models exceed traditional lexicon-based and machine-learning methodologies [19, 24]. Nevertheless, their demand for processing resources and extensive labelled datasets has initiated the exploration of generative LLMs as viable alternatives that scale more effectively with fewer task-specific labels.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Generative Large Language Models</title>
        <p>Recent developments in LLMs have shown exceptional proficiency in FSA, surpassing conventional lexicon-based and discriminative transformer-based methodologies [21]. The intricate linguistic characteristics of financial texts have prompted the creation of specialized LLMs, such as BloombergGPT [25] and FinVis-GPT [26], specifically tailored for the financial sector. Models such as InvestLM [27], especially fine-tuned for investing environments, have demonstrated effectiveness equivalent to commercial advice systems.</p>
        <p>Furthermore, recent research highlights the efficacy of smaller, computationally efficient models, attaining performance akin to larger LLMs via focused fine-tuning. Methods like parameter-efficient tuning (e.g., LoRA) have enhanced their utilization in practical financial scenarios [28]. Significantly, even general-purpose models such as ChatGPT have exhibited remarkable proficiency in financial sentiment analysis without the necessity for domain-specific fine-tuning [29].</p>
        <p>Despite significant progress, previous studies have primarily focused on generic sentiment analysis, with limited investigation into target-based sentiment analysis within financial contexts. While [17] examined the zero-shot efficacy of LLMs on financial headlines, our research expands this investigation by evaluating full-text articles to provide more extensive contextual insights. Additionally, we extend the evaluation framework to encompass few-shot scenarios and a varied array of models—such as Llama 3-8B, Gemma 2 (9B and 27B), DeepSeek-R1, and ChatGPT variants—benchmarked against conventional lexicon-based and discriminative transformer-based models. Unlike [17], which examined sentiment toward a single target per headline, our study investigates multiple targets within each article, enabling more granular and comprehensive financial sentiment analysis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section delineates the methodological framework utilized to assess the performance of generative LLMs—specifically, Gemma, Llama, ChatGPT, and DeepSeek—in executing TBFSA. To effectively benchmark these LLMs, we utilized various lexicon-based sentiment analysis tools, specifically VADER and TextBlob, in conjunction with discriminative transformer-based models, including FinBERT, DistilFinRoBERTa, FinBERT-Tone, and DeBERTa-v3-base-absa-v1.1. We began by outlining our methodology for dataset collection and annotation, a meticulous process that ensured high reliability and validity criteria. Subsequently, we fine-tuned the benchmark discriminative transformer-based models utilizing this dataset to achieve optimal alignment with the specific requirements of financial sentiment analysis. To thoroughly assess the generative LLMs, we designed precise, task-oriented prompts appropriate for TBFSA. Finally, we conducted a comprehensive comparative study to evaluate the efficacy and robustness of LLMs compared to the benchmark models.</p>
      <sec id="sec-3-0">
        <title>3.1. Dataset Construction and Annotation</title>
        <p>To establish a thorough evaluation framework, we obtained news articles from the Bloomberg Terminal regarding four prominent stock companies—Alphabet, Amazon, Netflix, and Nvidia. The assembled dataset comprises 1,170 articles dated from September 4, 2023, to January 30, 2024. Each article was systematically analyzed to extract critical information, including the timestamps, news text (excluding headlines), and URLs, which were then organized in a structured database (as depicted in Figure 1).</p>
        <p>Each article was meticulously annotated for sentiment concerning the target companies to ensure data quality and confirm the experimental evaluation. The annotation was carried out by three annotators with extensive expertise in finance and economics, all possessing advanced English competence (CEFR level C1). Their annotations were guided by comprehensive guidelines aimed at standardizing target identification and sentiment assessment. A concise summary of these guidelines entails:</p>
        <list list-type="bullet">
          <list-item>
            <p>A thorough examination of each article to identify direct references to the target entities: Alphabet, Amazon, Netflix, and Nvidia.</p>
          </list-item>
          <list-item>
            <p>Identification of multiple target entities within a single article, where applicable.</p>
          </list-item>
          <list-item>
            <p>Labelling articles devoid of explicit target references as “no target.”</p>
          </list-item>
          <list-item>
            <p>Evaluation of sentiment from an investor’s viewpoint, relying exclusively on the textual content.</p>
          </list-item>
          <list-item>
            <p>Sentiment classification as positive (1), negative (-1), or neutral (0).</p>
          </list-item>
          <list-item>
            <p>Identification of prevailing sentiment in instances of mixed expressions.</p>
          </list-item>
          <list-item>
            <p>Neutral labelling for vague, ambiguous, or passing references.</p>
          </list-item>
        </list>
      </sec>
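      <p>The label and target constraints in the guidelines above can be expressed as a small validation routine (an illustrative sketch; the record field names are hypothetical and do not reflect the released dataset schema):</p>
      <preformat>
```python
TARGETS = {"Alphabet", "Amazon", "Netflix", "Nvidia"}
SENTIMENTS = {1, -1, 0}  # positive, negative, neutral

def validate_annotation(record):
    """Check one annotation case: the target must be one of the four
    companies (or the 'no target' label), and the sentiment must come
    from the three-point scale used by the annotators."""
    ok_target = record["target"] in TARGETS or record["target"] == "no target"
    ok_label = record["sentiment"] in SENTIMENTS
    return ok_target and ok_label
```
      </preformat>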
      <p>The annotation procedure was organized into two separate phases. The annotators initially conducted target identification individually across all 1,170 articles. Eight articles were excluded as having "no target" by consensus. Inter-annotator reliability for target identification yielded a Krippendorff’s alpha [30] of 0.96 and a percentage agreement [31] of 98.95% for the remaining 1,162 articles, signifying consistent annotations. Texts with majority-agreed targets were forwarded for sentiment annotation, yielding 1,334 unique annotation cases due to multiple target references within specific articles.</p>
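      <p>The percentage-agreement statistic can be sketched as average pairwise agreement across the three annotators (a simplified illustration; the study follows the formal definitions in [30, 31], and Krippendorff’s alpha additionally corrects for chance agreement):</p>
      <preformat>
```python
from itertools import combinations

def percentage_agreement(annotations):
    """Average pairwise percentage agreement.

    `annotations` maps each annotator to a list of labels, one per item,
    in the same item order for every annotator.
    """
    coders = list(annotations.values())
    n_items = len(coders[0])
    pair_scores = []
    for a, b in combinations(coders, 2):
        matches = sum(1 for x, y in zip(a, b) if x == y)
        pair_scores.append(matches / n_items)
    return sum(pair_scores) / len(pair_scores)
```
      </preformat>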
      <p>In the second phase, sentiment annotation was performed for all identified target entities. Annotators used a defined scale to assign sentiments: ‘1’ for positive, ‘-1’ for negative, and ‘0’ for neutral sentiment. To ensure consistency, annotators collaboratively annotated a shared subset of 150 texts, resulting in satisfactory inter-annotator reliability (Krippendorff’s alpha of 0.81; percentage agreement of 83%). The sentiment labels for the 150 texts were established by majority consensus, and the remaining 1,184 texts were allocated evenly among annotators for individual sentiment labelling.</p>
      <p>Figure 2: Sentiment distribution across the targets.</p>
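      <p>The majority-consensus step can be sketched as a vote over the three annotators’ labels (an illustrative helper; the tie-handling policy shown is an assumption, since ties among three annotators on a three-point scale require adjudication):</p>
      <preformat>
```python
from collections import Counter

def majority_label(labels):
    """Majority consensus over annotator labels on the {-1, 0, 1} scale.

    Returns the winning label, or None on a tie (illustrative policy:
    tied cases would be passed to adjudication rather than auto-labelled).
    """
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # three-way or two-way tie: no majority
    return counts[0][0]
```
      </preformat>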
      <p>The final annotated dataset consists of 1,334 texts, each explicitly associated with a target entity and an annotated sentiment label. The dataset demonstrates a moderate class imbalance, with positive sentiments accounting for 45%, negative sentiments for 27%, and neutral sentiments for 28%. Table 1 presents annotated instances, whereas Figure 2 represents the sentiment distribution. Additional quantitative parameters, including the total number of news texts, average daily texts, average text length (measured in tokens), and average target mentions, are outlined in Table 2.</p>
      <p>We publicly release our curated dataset2 to assist the academic community and guarantee methodological transparency and reproducibility. Due to copyright restrictions, we cannot disseminate the complete content of the news articles. However, we provide comprehensive metadata, encompassing publication dates, timestamps, specified target entities, and Bloomberg article URLs, facilitating the retrieval of original articles via the Bloomberg Terminal, a subscription-based platform widely accessible in academic and financial institutions.</p>
      <p>2https://github.com/iftikharm895/Target-Based_Sentiment_Analysis_in_Financial_News</p>
      <p>Table 1 (Text column): sample annotated news texts.</p>
      <p>Alphabet Inc. shares tumbled the most in a year on Wednesday after the Google parent reported a smaller than expected profit in cloud computing, raising concerns about its position in a market critical to its future. Ed Ludlow reports.</p>
      <p>Amazon Japan says it will build its first “sort center” in Japan in Shinagawa, Tokyo, located ∼3.5km from Haneda International Airport. Expects to create ∼1,000 new jobs. Will handle as many as 750,000 items/day.</p>
      <p>Netflix co-CEO Ted Sarandos says talks with striking actors broke down after the union asked for a “levy” on streaming customers. Sarandos speaks at the first-ever Bloomberg Screentime conference in Los Angeles.</p>
      <p>The projected ex-date for Nvidia’s dividend moved to Dec. 6 from Nov. 30, according to an updated Bloomberg Dividend Forecast. The new ex-date falls after the Dec. 1 option expiry.</p>
      <sec id="sec-3-2">
        <title>3.2. Baseline Models</title>
        <p>To meticulously assess generative LLMs in TBFSA, we have conducted a comprehensive comparison of their efficacy with established benchmarks: lexicon-based instruments (TextBlob, VADER) and discriminative transformer architectures (FinBERT, FinBERT-Tone, DistilFinRoBERTa, DeBERTa-v3-absa-v1.1).</p>
        <p>TextBlob,3 an open-source Python library developed on the Natural Language Toolkit (NLTK) and Pattern libraries, assigns sentiment polarity scores ranging from −1 to +1 and has been widely utilized for financial texts [16, 32, 33]. VADER,4 developed by [11], a rule-based framework, incorporates lexical, grammatical, and syntactic heuristics—validated against LIWC and ANEW—and has also been extensively employed in financial contexts [17, 34, 35].</p>
        <p>Discriminative transformer-based baselines comprise:</p>
        <list list-type="order">
          <list-item>
            <p>DistilFinRoBERTa,5 a distilled variant of RoBERTa fine-tuned on financial datasets for three-class sentiment analysis [23];</p>
          </list-item>
          <list-item>
            <p>FinBERT,6 [20] a BERT adaptation pre-trained on earnings calls, news articles, and regulatory filings and fine-tuned on Financial PhraseBank [9];</p>
          </list-item>
          <list-item>
            <p>FinBERT-Tone,7 [19] which enhances FinBERT to identify tonal nuances, fine-tuned on SEC filings, earning reports, and financial news; and</p>
          </list-item>
          <list-item>
            <p>DeBERTa-v3-absa-v1.1,8 which builds upon the DeBERTa-v3 architecture [36] and has been fine-tuned for Aspect-Based Sentiment Analysis (ABSA) through the FAST-LCF-BERT framework [37]. It is trained on an extensive dataset comprising 30,000 ABSA-specific samples and further fine-tuned on an additional 180,000 annotated examples from a variety of datasets.</p>
          </list-item>
        </list>
        <p>These discriminative transformer-based models have been extensively employed in financial sentiment research [23, 38, 39].</p>
        <p>The current study involved fine-tuning DistilFinRoBERTa, FinBERT, and FinBERT-Tone using a learning rate of 3 × 10−5, 10 training epochs, and a batch size of 32. For DeBERTa-v3-absa-v1.1, we utilized a 5-fold cross-validation approach to enhance robustness, training each fold for 10 epochs using default hyperparameters on an NVIDIA RTX 4090 GPU.</p>
        <p>3https://textblob.readthedocs.io/en/dev/ 4https://github.com/cjhutto/vaderSentiment 5https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis 6https://huggingface.co/ProsusAI/finbert 7https://huggingface.co/yiyanghkust/finbert-tone 8https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluated Generative LLMs</title>
        <p>Recent improvements in LLMs have garnered significant academic interest owing to their proven effectiveness in several text-based tasks [40]. Notable and widely utilized models include OpenAI’s ChatGPT,9 Gemma10—a series of open models based on Google’s Gemini architecture—Meta’s LLaMA,11 and DeepSeek.12</p>
        <p>The current study assessed the efficacy of various advanced generative LLMs within the framework of TBFSA.</p>
        <p>The evaluated models include ChatGPT-4, ChatGPT-4o, ChatGPT-o1, LLaMA 3 8B, Gemma 2 9B, Gemma 2 27B, and DeepSeek-R1. All models were assessed in their default configurations, without any additional fine-tuning, to evaluate their zero-shot and few-shot capabilities in executing the specified task. Interactions with ChatGPT variants were executed via OpenAI’s standard web interface, utilizing a temperature setting of 0.7. The Gemma models were accessed via the Gemini API, which suggests a temperature setting of 1.0 for both the 9B and 27B variants. DeepSeek-R1 was accessed via its public chat interface, employing its standard temperature setting. To interact with the LLaMA model, we utilized a local instance of the Meta-Llama-3-8B-Instruct model, running under the Ollama13 application. For testing purposes, we used default hyperparameters and advanced optimization techniques, including 4-bit quantization, to efficiently execute this model on consumer-grade GPU systems.</p>
        <p>To assess the performance of generative LLMs in TBFSA, we employed zero-shot and few-shot prompting strategies using manually designed, fixed prompts without task-specific tuning. The prompt used in the zero/few-shot learning approach is presented in Figure 3. In the zero-shot context, models were given task instructions without illustrative examples. In few-shot contexts, prompts were augmented by either one (1-shot) or five (5-shot) additionally annotated examples, to provide contextual grounding.</p>
        <p>9https://chatgpt.com/. The ChatGPT variants analyzed in this study are limited to those available during the research period. Newer versions released during manuscript preparation will be examined in future work. 10https://gemini.google.com/app 11https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct 12https://www.deepseek.com/</p>
        <p>This approach utilizes LLMs’ inherent language and contextual reasoning abilities, enabling performance evaluation without requiring task-specific training or model adaptation. The method offers a clear assessment of model generality and adaptability, enhancing their suitability for effortless implementation in diverse practical applications.</p>
        <p>The evaluation of model performance utilized recognized criteria for sentiment categorization, including precision, accuracy, recall, and F1-score [41]. The metrics were calculated across three sentiment categories—negative, neutral, and positive—utilizing both macro-averaging (equal weight across classes) and weighted averaging (weighted by class sample size) to ensure robustness amid moderately imbalanced class distributions, as done in analogous situations (e.g., [42]).</p>
        <p>Table 3: performance of all evaluated models (TextBlob, VADER, FinBERT, DistilFinRoBERTa, FinBERT-Tone, Deberta-v3-absa-v1.1, and the generative LLMs Llama 3 8B, Gemma 2 9B, Gemma 2 27B, ChatGPT-4, ChatGPT-4o, ChatGPT-o1, and DeepSeek-R1 under zero-shot, 1-shot, and 5-shot settings).</p>
        <p>4. Results and Discussions</p>
        <p>Table 3 presents the outcomes for all models evaluated on the novel dataset introduced in this research. Lexicon-based approaches, such as VADER and TextBlob, exhibit consistently subpar performance across all evaluation metrics, with macro-F1 scores below 0.37. These models are limited by their dependence on static, general-purpose sentiment lexicons that do not incorporate domain-specific financial language, in addition to their document-level emphasis and rigid rule-based architecture. As a result, they fail to capture the contextual intricacies and entity-specific sentiment differentiations necessary for effective TBFSA.</p>
        <p>Conversely, discriminative transformer-based models optimized for FSA tasks substantially exceed the performance of lexicon-based models. FinBERT, DistilFinRoBERTa, and FinBERT-Tone attain increasingly higher macro-F1 scores (ranging from 0.54 to 0.62), demonstrating the advantages of domain-specific pretraining and contextualized embeddings. Nonetheless, these models operate at the sentence or document level and fail to assign sentiment to specific entities, hence constraining their efficacy in multi-entity financial texts. Conversely, DeBERTa-v3-base-ABSA-v1.1, tailored for target/aspect-based sentiment analysis, attains the highest macro-F1 score (0.66) among fine-tuned transformer models. Its disentangled attention mechanism and structured input encoding provide fine-grained, token-level sentiment attribution, rendering it more suitable for intricate, entity-aware financial analysis.</p>
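        <p>The macro- and weighted-averaging schemes used in the evaluation can be made concrete with a small from-scratch computation (a sketch for illustration; the study reports the metrics per [41], for which standard library implementations exist):</p>
        <preformat>
```python
def f1_scores(y_true, y_pred, classes=(-1, 0, 1)):
    """Per-class F1 plus macro (equal class weight) and weighted
    (class-support weight) averages over the three sentiment labels."""
    per_class, supports = {}, {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        supports[c] = sum(1 for t in y_true if t == c)
    macro = sum(per_class.values()) / len(classes)
    total = sum(supports.values())
    weighted = sum(per_class[c] * supports[c] / total for c in classes)
    return per_class, macro, weighted
```
        </preformat>
        <p>With the dataset’s 45/27/28 class split, the two averages diverge whenever the majority (positive) class is classified more reliably than the minority classes, which is why both are reported.</p>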
        <p>Among the generative LLMs evaluated under zero- SEC Rule 15c3-5 that necessitate model interpretability
shot settings, DeepSeek-R1 and the ChatGPT models for audit and risk governance. These limitations
under(ChatGPT-o1, ChatGPT-4, and ChatGPT-4o) consistently score the need for transdisciplinary innovation. The
efsurpass baseline models. DeepSeek-R1 attains the high- fective incorporation of LLMs into financial analytics
est zero-shot macro-F1 score (0.82), closely followed by will likely rely on hybrid architectures that combine
lanChatGPT-o1 (0.80). Performance enhances with few-shot guage capabilities with conventional econometric models.
prompting: in the 1-shot setting, ChatGPT-o1 slightly These hybrid architectures hold the potential to
revoluoutperforms DeepSeek-R1 with a macro-F1 score of 0.84 tionize financial analytics, balancing traditional financial
compared to 0.83. The highest scores are recorded in the metrics’ interpretability with AI’s adaptive learning
capa5-shot setting, with DeepSeek-R1 achieving 0.87, slightly bilities, and thereby mitigating the risks linked to opaque
above ChatGPT-o1’s score of 0.86. These findings high- algorithmic decision-making. Resolving these
complexlight the eficacy of few-shot learning in improving con- ities necessitates collaboration among AI researchers,
textual comprehension and sentiment categorization out- economists, and regulatory authorities to ensure that
incomes. Nonetheless, smaller models such as LLaMA 3 novations, such as federated learning for data privacy and
8B exhibit significant sensitivity to few-shot prompting. synthetic financial text generation for enhanced training
While it attains a zero-shot macro-F1 score of 0.63, perfor- robustness, are implemented ethically and efectively.
mance significantly declines to 0.44 in the 1-shot scenario,
with only a modest recovery to 0.63 at the 5-shot level.</p>
        <p>In summary, lexicon-based sentiment analysis methods like VADER and TextBlob are insufficient for TBFSA because they fail to capture contextual financial semantics. Discriminative transformer-based models such as DistilFinRoBERTa, FinBERT, and FinBERT-Tone provide quantifiable improvements but remain inadequate in precision and entity-level interpretability. Domain-adapted models like DeBERTa-v3-absa-v1.1, although tailored for target/aspect-based tasks, are surpassed by generative LLMs such as the ChatGPT variants and DeepSeek-R1.</p>
        <p>The consistent success of ChatGPT-4, ChatGPT-4o, ChatGPT-o1, and DeepSeek-R1 on the TBFSA task demonstrates the efficacy of comprehensive pre-training, which equips these LLMs to perform exceptionally in zero/few-shot scenarios and to generalize across several domains without requiring task-specific fine-tuning. Their consistent superiority over conventional lexicon-based systems and discriminative transformer-based models underscores a significant transition towards generative LLMs that combine high adaptability with robust domain-agnostic generalization, providing an efficient substitute for resource-intensive supervised methods in specialized tasks such as TBFSA. Such granular, entity-specific sentiment interpretation holds substantial implications for investors, financial analysts, and algorithmic trading systems: these models allow stakeholders to make more informed decisions, potentially improving portfolio management techniques and optimizing market timing.</p>
        <p>However, deploying LLMs in financial markets presents obstacles. Significant computational complexity and inference latency limit their applicability in ultra-high-frequency trading, where execution times are measured in milliseconds. Moreover, regulatory issues arise from the intrinsic opacity of LLM decision-making, which contradicts compliance requirements such as MiFID II.</p>
      </sec>
      <sec>
        <title>5. Conclusions</title>
        <p>This study offers a comprehensive evaluation of target-based financial sentiment analysis (TBFSA) by systematically comparing the effectiveness of cutting-edge generative large language models (LLMs), including ChatGPT, DeepSeek, LLaMA, and Gemma, with conventional lexicon-based methods (VADER, TextBlob) and discriminative transformer-based models (FinBERT, DistilFinRoBERTa, FinBERT-Tone, and DeBERTa-v3-base-ABSA-v1.1).</p>
        <p>The findings indicate that LLMs, especially the ChatGPT variants (notably ChatGPT-o1) and DeepSeek-R1, surpass all baseline models in target-level sentiment analysis. Their capacity to deduce implicit sentiment, adapt to financial terminology, and function efficiently without task-specific fine-tuning makes them scalable, ready-to-deploy solutions for practical applications like algorithmic trading and real-time risk assessment. These findings bear immediate implications for financial institutions, fintech developers, and analysts seeking to incorporate sentiment-driven insights into investing and risk management processes.</p>
        <p>Despite the promising findings, the study acknowledges several limitations. The investigation is confined to news articles from four prominent technology firms (Alphabet, Amazon, Netflix, and Nvidia), potentially constraining the generalizability of the findings to other industries or smaller market-cap companies with possibly distinct sentiment patterns. Furthermore, the study covers a short time frame (Sep 4, 2023, to Jan 30, 2024), offering short-term insights while potentially neglecting long-term patterns, seasonal fluctuations, and macroeconomic changes. In addition, the sole reliance on news articles neglects other vital data sources, such as social media sentiment, earnings reports, and macroeconomic indicators, which could enhance the research. To address these constraints, future research could broaden the analysis to various sectors and global markets, integrate additional data sources, and extend the study over several years to assess LLM performance across market regimes, including bull and bear cycles. Moreover, enhancing prompt designs via automated techniques, investigating time-lagged sentiment effects, and improving the interpretability of LLM outputs are promising avenues towards more robust, comprehensible, and sector-agnostic applications of LLM-driven financial sentiment research.</p>
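        <p>The insufficiency of sentence-level lexicon aggregation for target-level analysis noted above can be made concrete with a toy scorer: when one headline carries opposing sentiment towards two entities, a single aggregate score cannot separate them. The four-entry lexicon and the headline below are hypothetical illustrations, not the actual VADER or TextBlob resources.</p>
        <preformat>
```python
# Toy lexicon scorer: sums word polarities over the whole sentence, the way
# sentence-level lexicon methods aggregate scores (illustrative only).
TOY_LEXICON = {"surges": 1.0, "record": 0.5, "slumps": -1.0, "misses": -0.5}

def sentence_score(sentence: str) -> float:
    """Return one aggregate polarity score for the whole sentence."""
    return sum(TOY_LEXICON.get(word.strip(".,").lower(), 0.0)
               for word in sentence.split())

headline = ("Nvidia surges on record demand while "
            "Netflix slumps and misses estimates")
print(sentence_score(headline))  # prints 0.0
# The positive cues (Nvidia) and negative cues (Netflix) cancel out into one
# near-zero score, so the two targets are indistinguishable at this level.
```
        </preformat>
        <p>A target-aware model, by contrast, must assign a positive label to Nvidia and a negative one to Netflix for the same sentence, which is precisely what the TBFSA task requires.</p>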
      </sec>
      <sec>
        <title>Data Availability</title>
        <p>The dataset developed with this research is available at https://github.com/iftikharm895/Target-Based_Sentiment_Analysis_in_Financial_News. Due to copyright constraints, only the URLs with manual annotations are publicly released; the full news content is accessible through a Bloomberg Terminal.</p>
      </sec>
      <sec>
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to: paraphrase and reword text, improve writing style, and check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>