<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>XplaiNLP at CheckThat! 2025: Multilingual Subjectivity Detection with Finetuned Transformers and Prompt-Based Inference with Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ariana Sahitaj</string-name>
          <email>sahitaj@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiaao Li</string-name>
          <email>jiaao.li@campus.tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pia Wenzel Neves</string-name>
          <email>p.wenzel.2@campus.tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fedor Splitt</string-name>
          <email>splitt@campus.tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Premtim Sahitaj</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charlott Jakob</string-name>
          <email>c.jakob@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronika Solopova</string-name>
          <email>veronika.solopova@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vera Schmitt</string-name>
          <email>vera.schmitt@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI)</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Quality and Usability Lab, Technische Universität Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This notebook reports the XplaiNLP submission to the CheckThat! 2025 shared task [1] on multilingual subjectivity detection. We evaluate two approaches: (1) supervised fine-tuning of transformer encoders, EuroBERT, XLM-RoBERTa, and German-BERT, on monolingual and machine-translated training data; and (2) zero-shot prompting using two LLMs: o3-mini for Annotation (rule-based labelling) and gpt-4.1-mini for DoubleDown (contrastive rewriting) and Perspective (comparative reasoning). The Annotation Approach achieves 1st place in the Italian monolingual subtask with an F1 score of 0.8104, outperforming the baseline of 0.6941. In the Romanian zero-shot setting, the fine-tuned XLM-RoBERTa model obtains an F1 score of 0.7917, ranking 3rd and exceeding the baseline of 0.6461. The same model also performs reliably in the multilingual task and improves over the baseline in Greek. For German, a German-BERT model fine-tuned on translated training data from typologically related languages yields competitive performance over the baseline. In contrast, performance in the Ukrainian and Polish zero-shot settings falls slightly below the respective baselines, reflecting the challenge of generalization in low-resource cross-lingual scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Subjectivity Detection</kwd>
        <kwd>Multilingual NLP</kwd>
        <kwd>Zero-Shot Learning</kwd>
        <kwd>Prompt-Based Inference</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Understanding whether a sentence expresses a personal opinion or presents information in a neutral
and therefore objective way is important in many natural language processing tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This distinction
is particularly relevant in the context of news reporting, where objectivity is traditionally considered a
core principle. Yet, subjective or evaluative language is often embedded in news texts through stylistic
choices and subtle dialogic elements that influence how readers interpret information [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This effect
is especially strong when opinionated language is presented in the style of factual reporting, causing
evaluative statements to appear as objective observations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A precise distinction between subjective
and objective language is important for tasks such as sentiment analysis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], stance detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
automated fact-checking [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], propaganda detection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], argument mining [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and bias identification
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. These applications rely on the ability to detect whether a statement reflects personal opinion,
emotional language, or evaluative framing, or whether it is intended to convey factual content.
Subjective sentences commonly include emotional terms, value judgments, or rhetorical elements such as
irony or exaggeration [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, even for human readers, it is not always simple to decide whether
a sentence is subjective or not. Interpretations often depend on context and background knowledge,
making critical thinking essential for distinguishing between evaluative language and factual reporting
[
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. This issue becomes more apparent in multilingual settings, as different languages signal
subjectivity in diverse ways, through verb forms, word order, lexical choices, or stylistic conventions
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. At the same time, many languages lack annotated resources for subjectivity detection, which
poses an additional challenge for training reliable models [16].
      </p>
      <p>This notebook describes our submission to the CheckThat! Lab at CLEF 2025 [17], which focused on
sentence-level subjectivity detection across multiple languages. We investigate two
complementary approaches: (1) supervised fine-tuning of multilingual and monolingual transformer-based
encoders on annotated datasets, and (2) zero-shot prompting with LLMs using natural language
inference guided by explicit instructions. While the first approach relies on parameter-efficient adaptation of
pretrained models, the second uses the contextual reasoning capabilities of LLMs to classify subjectivity
without additional training.</p>
      <p>The paper is structured as follows: In Section 2 we review related work on subjectivity detection.
Section 3 introduces the dataset used in the shared task. Section 4 details our modeling approaches,
including fine-tuned transformer models and zero-shot prompting strategies. Section 5 presents our
evaluation results before we finally summarize our findings in Section 6 and outline directions for future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The aim of subjectivity detection is to distinguish language that conveys private states, such as opinions,
from language that presents information in a way that seems factual or neutral, regardless
of whether the information is actually true [18]. Yu and Hatzivassiloglou proposed one of the earliest
computational models for sentence-level subjectivity detection, using Bayesian classification. Riloff
and Wiebe and Riloff et al. explored how subjectivity detection can improve information extraction
by reducing false positives, especially in metaphorical or emotional contexts. These studies marked
a shift toward integrating subjectivity classification into practical NLP pipelines. In 2006, Esuli et al.
introduced SentiWordNet, a lexical resource assigning polarity and objectivity scores to WordNet
synsets. While primarily intended for polarity detection, it also provides objectivity measures, implicitly
supporting subjectivity detection tasks. Banea et al. addressed the scarcity of resources in non-English
languages by developing a bootstrapping approach to build subjectivity lexicons using only seed lists
and a basic corpus. This method made subjectivity analysis feasible for low-resource settings. Later,
Chaturvedi et al. provided a comprehensive survey of both rule-based and automatic models. They
emphasized that subjectivity detection is an important preprocessing step for sentiment analysis, as
polarity classifiers may otherwise incorrectly label neutral statements as emotional content. Antici
et al. introduced NewsSD-ENG, a sentence-level dataset with human-annotated subjectivity labels.
Their experiments showed that multilingual transformer models (M-BERT and M-SBERT) clearly
outperformed traditional classifiers (SVM and LR), with multilingual training improving performance
and enabling robust cross-lingual subjectivity detection. Savinova and Del Prado reframed subjectivity
detection as a regression task and fine-tuned a RoBERTa model to predict degrees of subjectivity in news
texts. Their model aligned closely with human judgments and significantly outperformed a widely used
rule-based regressor based on lexical patterns. Recent work by Shokri et al. evaluated large language
models like GPT-3.5, GPT-4, and Gemini for sentence-level subjectivity detection in English news. They
showed that while fine-tuned models perform well in-distribution, zero-shot and chain-of-thought
prompting yield more robust generalization across diverse datasets. Also, Suwaileh et al. introduced
ThatiAR, the first large-scale dataset for sentence-level subjectivity detection in Arabic news. They
demonstrated that GPT-4, especially in few-shot settings, outperformed traditional and fine-tuned
Arabic language models, highlighting the potential of LLMs in morphologically-rich and culturally
diverse contexts.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset employed in this study originates from the shared task on subjectivity detection from the
CheckThat! Lab 2025. It is designed to evaluate the ability of computational systems to classify whether
a sentence or short paragraph extracted from a news article is subjective (SUBJ) or objective
(OBJ). The dataset comprises textual instances drawn from news sources in five languages:
English, Italian, German, Bulgarian, and Arabic. For each language, the data is divided into four subsets:
training, development, development-test, and test (with test labels withheld for evaluation purposes).</p>
      <p>Token-level statistics for the test set are summarized in Table 2, using the xlm-roberta-base
tokenizer [26]. Arabic and Greek test sets have the longest sequences on average, while English, Italian,
and Polish are more concise. These differences may impact model robustness across languages. During
preprocessing, we identified several anomalous cases, particularly in the Bulgarian, German, and Italian
splits, where open-ended quotation marks led to excessively long token sequences (often over 500
tokens). These likely stemmed from the tokenizer’s handling of unmatched punctuation and were
manually corrected to avoid distortions in length statistics.</p>
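      <p>For reproducibility, the following minimal sketch shows how such token-level statistics can be computed with the xlm-roberta-base tokenizer [26]. The file paths and the "sentence" column name are illustrative assumptions, not the shared task's actual layout.</p>
      <preformat>
import statistics
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def token_length_stats(path):
    # one sentence per row; the column name "sentence" is an assumption
    sentences = pd.read_csv(path, sep="\t")["sentence"]
    lengths = [len(tokenizer.encode(s)) for s in sentences]
    return {
        "mean": round(statistics.mean(lengths), 1),
        "max": max(lengths),
        # flag the anomalously long sequences discussed above
        "over_500": sum(1 for n in lengths if n &gt; 500),
    }

for lang in ["arabic", "greek", "english", "italian", "polish"]:
    print(lang, token_length_stats(f"data/{lang}_test.tsv"))
      </preformat>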
      <sec id="sec-3-1">
        <title>3.1. Ambiguities and Hard-to-Translate Cases</title>
        <p>To better understand common sources of labeling disagreement, we manually examined 20 sentences
with annotation conflicts in the English development set. Five recurring themes emerged, each
highlighting linguistic or contextual features that challenge binary subjectivity classification:</p>
        <list list-type="bullet">
          <list-item>
            <p>Immigration Discourse: Statements such as “Mr. Buchanan’s criticism of immigration” may appear factual but are often ideologically charged, subtly framing the topic in ways that evoke subjective interpretation.</p>
          </list-item>
          <list-item>
            <p>Race and Social Commentary: Sentences referencing phrases such as “CRT anti-white curricula” or “diversity, equity, and inclusion” are lexically neutral but semantically charged. The underlying ideological associations can trigger differing interpretations, often reflecting the annotator’s sociopolitical context.</p>
          </list-item>
          <list-item>
            <p>Media and Political Rhetoric: Labels such as “Lügenpresse” or “Treason Lobby” embed explicit bias or contempt within declarative syntax, complicating detection by surface-level classifiers.</p>
          </list-item>
          <list-item>
            <p>Sarcasm and Pragmatic Devices: Utterances like “What could possibly go wrong?” rely on irony or context-based inference. Lacking overt opinion markers, they remain difficult to detect using standard lexical cues.</p>
          </list-item>
          <list-item>
            <p>Framing in Economic and Environmental Topics: Sentences such as “bribed by a globalist billionaire” combine factual assertions with emotionally charged language, blurring the line between reporting and commentary.</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>We explore two complementary approaches to multilingual subjectivity detection: supervised
fine-tuning of transformer-based classifiers and zero-shot prompting with LLMs. The former trains
task-specific classifiers on available annotated data, while the latter uses the instruction-following capabilities
of LLMs to perform inference without parameter updates. This section details the setup, training
procedures, and reasoning strategies employed in both directions.</p>
      <sec id="sec-4-1">
        <title>4.1. Fine-Tuned Transformers</title>
        <sec id="sec-4-1-1">
          <title>Fine-tuning German-BERT with Translated Training Data</title>
          <p>The German training dataset was expanded by translating the other provided datasets into German, followed by fine-tuning a BERT
model [27]. We ordered the languages from most to least similar to German, assuming that translated
training data from more similar languages would yield better results: first English (West
Germanic, Indo-European), then Italian (Romance, Indo-European), Bulgarian (South Slavic,
Indo-European), and finally Arabic (Semitic, non-Indo-European) [28]. Even though Italian and Bulgarian are both
Indo-European languages, Bulgarian is more distant from German than Italian in terms of Levenshtein
distance [29]. By gradually adding more translated training data we monitored which additions improved
the performance of the model as shown in Table 3.</p>
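          <p>The incremental procedure can be summarized by the following sketch. The dataset paths are illustrative, and fine_tune() and macro_f1() are hypothetical helpers standing in for a standard transformers training loop, not our exact implementation.</p>
          <preformat>
from datasets import concatenate_datasets, load_dataset

ORDER = ["de", "en", "it", "bg", "ar"]  # most to least similar to German

# hypothetical file layout: one CSV of translated training data per language
splits = {
    lang: load_dataset("csv", data_files=f"translated/{lang}_train.csv")["train"]
    for lang in ORDER
}

for k in range(1, len(ORDER) + 1):
    train_data = concatenate_datasets([splits[lang] for lang in ORDER[:k]])
    # fine_tune() and macro_f1() are hypothetical helpers wrapping a standard
    # transformers Trainer run on bert-base-german-cased; dev_de stands for
    # the German development split (loading omitted)
    model = fine_tune("bert-base-german-cased", train_data)
    print("+".join(ORDER[:k]), macro_f1(model, dev_de))
          </preformat>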
          <p>In all training settings for the fine-tuning process, the training data was shuffled, weight decay was set
to 0.01, the batch size was 32, and truncation and padding were applied. The number of epochs
and the learning rate were adjusted to the size of the dataset, namely the number of sentences. All
experiments were conducted on a remote server equipped with an NVIDIA Tesla T4 GPU and 66 GB
of RAM. The tokenizer and dataloader used for testing were the same as those used during training.
The different fine-tuned models were compared with respect to their macro F1-scores. The addition of the
English, Italian, and Bulgarian translated training data increased the F1-score, but it dropped when the
Arabic data was included. The F1-scores of the de-en-it model and the de-en-it-bg model were very close,
so we also ran both fine-tuned models on the dev-test dataset, which showed a clear preference for
including Bulgarian.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Monolingual Fine-tuning of EuroBERT and XLM-RoBERTa-base</title>
          <p>To identify an effective
architecture for monolingual subjectivity detection, we fine-tuned two transformer-based models: EuroBERT
[30] and XLM-RoBERTa-base [31]. Both models are well-suited for sentence-level classification tasks,
capable of capturing nuanced semantic and syntactic patterns. EuroBERT, a recently released
multilingual model, is pre-trained primarily on European languages, aligning well with the linguistic coverage
of our datasets [30]. XLM-RoBERTa-base was also selected for comparison due to its consistent
cross-lingual performance and demonstrated effectiveness in prior sentence-level classification tasks [31, 32].
Experiments in the monolingual setting were conducted on a remote server equipped with an NVIDIA
H100 GPU with 80 GB of memory. On the provided datasets (excluding Arabic), we fine-tuned both
models using a batch size of 16, a learning rate of 2e-5, and 15 training epochs. In addition, we
employed early stopping with a patience value of 3 based on the macro-F1 score on the development
set and applied temperature scaling post-training to calibrate prediction confidence [33]. We used the
AdamW optimizer with a weight decay of 0.01 and employed Focal Loss [34] with class weighting
to address label imbalance. To improve training efficiency and stability, we adopted mixed-precision
training using PyTorch’s AMP framework and applied gradient clipping with a maximum norm of 1.0
to prevent exploding gradients. After training, we evaluated both models on the dev-test datasets using
a structured inference pipeline. Input sentences were tokenized using the same configuration as during
training and passed through the model to obtain raw logits. These logits were optionally calibrated
using temperature scaling and then converted to class probabilities via the softmax function. Due to
the observed class imbalance in the training data across all languages, where the SUBJ class occupies
only around 37% of instances on average and is consistently under-represented compared to OBJ, we
applied a reduced classification threshold of 0.45 (instead of the standard 0.5) for predicting the
SUBJ label. Final predictions were mapped to their corresponding labels (OBJ or SUBJ) and compared
to gold-standard labels to compute accuracy and macro-F1, as specified by the shared task organizers.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. LLM-Based Inference Strategies</title>
        <p>In addition to our fine-tuned classifiers, we implement three zero-shot prompting strategies using LLMs.
Each strategy frames subjectivity detection as an inference-only task and generates natural-language
responses from one or more prompts per sentence. The three methods differ in reasoning style but
operate directly on the input text without task-specific training, as seen in the overview in Figure 1.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Annotation Approach</title>
          <p>In this inference strategy, each sentence is classified with a single API call to the OpenAI o3-mini
model [35]. We supply the model with the full set of 14 decision rules [22], which define linguistic
and rhetorical cues, to differentiate subjective from objective language. Upon receiving the model’s
natural-language response, we extract both the verdict and the supporting explanation in two stages:
first, by parsing the response as JSON and reading the verdict and explanation fields; and, if JSON
parsing fails, by performing a simple keyword match for “subjective” or “objective” within the raw text.
In all cases, we archive both the generated explanation and the final label. The following example
illustrates how each of our prompting strategies processes the same input sentence.</p>
          <p>Input Sentence: "Blanco established himself earlier in his career working for Dr. Luke’s Kasz
Money Productions."
Explanation: The sentence provides factual information about Blanco’s career and his affiliation
with a production company. It does not include any indications of personal opinion, sarcastic
remarks, or evaluative language by the author. Instead, it merely states a historical fact, which
aligns with the criteria for an objective sentence. Label: OBJ</p>
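          <p>A minimal sketch of the two-stage response parsing is given below. The prompt wording is abbreviated and the handling of fields is illustrative rather than our exact implementation.</p>
          <preformat>
import json
from openai import OpenAI

client = OpenAI()

def annotate(sentence):
    # single call to o3-mini with the 14 decision rules (abbreviated here)
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{
            "role": "user",
            "content": (
                "Apply the 14 decision rules for subjectivity detection and "
                'return JSON with "verdict" ("SUBJ" or "OBJ") and '
                '"explanation".\n\nSentence: ' + sentence
            ),
        }],
    )
    text = response.choices[0].message.content
    try:
        # stage 1: parse the response as JSON
        data = json.loads(text)
        return data["verdict"], data["explanation"]
    except (json.JSONDecodeError, KeyError):
        # stage 2: fall back to a naive keyword match on the raw text
        label = "SUBJ" if "subjective" in text.lower() else "OBJ"
        return label, text
          </preformat>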
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. DoubleDown Approach</title>
          <p>The DoubleDown approach reframes classification as a two-way comparison instead of a direct label.
Rather than asking the model to judge the original sentence outright, it first generates two rewrites,
one that leans into a subjective style (opinion, personal evaluation or emotional language) and a second
that adopts a more objective tone (factual content, neutral phrasing). Both rewrites come from separate
gpt-4.1-mini prompts [36]. In the final step, the original sentence plus both rewrites are submitted
to a third prompt, which determines which version best preserves the original meaning, and on that
basis classifies the input as subjective or objective. Below, you can see how the sentence is reformulated
in both its subjective and objective versions.</p>
          <p>Subjective rewrite: In my view, Blanco really made a name for himself early on thanks to
his work with Dr. Luke’s Kasz Money Productions, a collaboration that, to me, marked a crucial
turning point in his career.</p>
          <p>Objective rewrite: Blanco worked earlier in his career at Dr. Luke’s Kasz Money Productions.</p>
          <p>Since the objective rewrite more closely reflects the original sentence and presents it as a factual
career statement with only minor evaluative elements, the model classifies it as OBJ.</p>
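          <p>A compact sketch of the three-call pipeline follows. The prompt wordings are paraphrased, and ask() is a hypothetical wrapper around a gpt-4.1-mini chat completion.</p>
          <preformat>
from openai import OpenAI

client = OpenAI()

def ask(prompt):
    # thin wrapper around a single gpt-4.1-mini chat completion
    out = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content

def double_down(sentence):
    # two independent rewrites of the same input
    subj = ask("Rewrite this sentence in an openly subjective, opinionated "
               "style: " + sentence)
    obj = ask("Rewrite this sentence in a neutral, strictly factual style: "
              + sentence)
    # final comparison: which rewrite preserves the original meaning best?
    verdict = ask(
        "Original: " + sentence
        + "\nSubjective rewrite: " + subj
        + "\nObjective rewrite: " + obj
        + "\nWhich rewrite best preserves the original meaning? "
        "Answer SUBJ or OBJ."
    )
    return "SUBJ" if "SUBJ" in verdict.upper() else "OBJ"
          </preformat>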
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Perspective Approach</title>
          <p>The Perspective approach frames the classification task as a stepwise evaluation of contrasting
interpretations. Rather than directly predicting a label, the model is prompted to analyze the input from two
distinct angles, subjective and objective, before making a decision. All prompts in this approach are
processed by gpt-4.1-mini [36]. The process is divided into three steps. First, the model receives
a prompt asking it to explain why the sentence might be considered subjective, focusing on aspects
such as evaluative language or personal judgment. In the second step, the same sentence is analyzed
from an objective perspective, highlighting features like factuality, neutrality, or verifiability. These
two prompts are issued separately to ensure that the analyses are independently reasoned. In the third
step, the outputs from the previous prompts are provided as input to a final classification prompt. The
model is asked to compare both analyses and determine which interpretation is more convincing. It
then outputs a classification along with a justification. This approach prioritizes analytical comparison
over the generative rewriting used in the DoubleDown approach, as seen below.</p>
          <p>Explanation: The statement contains elements that can be viewed both subjectively and
objectively. The subjective analysis points out that the phrase "established himself" is open to
interpretation, as what qualifies as "established" can vary by individual perception, making it a
somewhat evaluative judgment. The objective analysis highlights that the statement refers to a
verifiable fact: Blanco worked for Dr. Luke’s Kasz Money Productions earlier in his career. This
part can be independently confirmed. However, the key phrase "established himself" goes beyond
merely stating a fact about employment; it implies a level of success, recognition, or impact,
which is inherently subjective because these concepts differ across perspectives. Therefore,
while the statement contains a factual component, the primary assertion involves a subjective
judgment. Given this, the subjective analysis is more convincing because the core claim revolves
around the idea of "establishing oneself", which is not a strictly objective measure. Label: SUBJ</p>
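          <p>Sketched below, reusing the hypothetical ask() helper from the DoubleDown sketch, are the three Perspective steps; the prompts are paraphrased, not our exact wording.</p>
          <preformat>
def perspective(sentence):
    # step 1: argue for a subjective reading
    pro_subj = ask("Explain why this sentence might be subjective, focusing "
                   "on evaluative language or personal judgment: " + sentence)
    # step 2: argue for an objective reading, issued as a separate prompt
    pro_obj = ask("Explain why this sentence might be objective, focusing on "
                  "factuality, neutrality, and verifiability: " + sentence)
    # step 3: compare both analyses and decide
    verdict = ask(
        "Sentence: " + sentence
        + "\nSubjective analysis: " + pro_subj
        + "\nObjective analysis: " + pro_obj
        + "\nDecide which analysis is more convincing and answer SUBJ or OBJ "
        "with a short justification."
    )
    return "SUBJ" if "SUBJ" in verdict.upper() else "OBJ"
          </preformat>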
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <p>Table 4 summarizes the official evaluation results of our submitted systems on the final test set, as
provided by the shared task organizers. We submitted one system per subtask, using a combination
of zero-shot prompting and fine-tuned transformer models. The approach and model were chosen
individually for each language, based on preliminary development set results and practical observations.
In the Italian monolingual subtask, our system reached the highest macro-F1 score on the leaderboard
(0.8104) using the Annotation Approach with the o3-mini model. This suggests that prompt-based
classification can work well when the input data closely follows the structure of the rules described
in the prompt. For Romanian, we used a fine-tuned XLM-RoBERTa model and achieved an F1 score
of 0.7917. This result placed us third overall and indicates that the model was able to generalize
reasonably well, even though labeled subjectivity data in Romanian was not part of our fine-tuning. In
the multilingual setting, we again used a fine-tuned XLM-RoBERTa, which reached 0.7186 macro-F1,
clearly outperforming the baseline. The model showed stable results across several languages, including
Greek and Polish. For German, we fine-tuned a German-BERT model using translated training data from
related languages. This approach led to solid performance (F1 = 0.7269) and confirms that adding training
data from similar languages can be helpful when working with limited resources, as also observed by
Solopova et al. [37] for German. In English, we used the Annotation Approach, which achieved
0.7228 and outperformed the baseline. This result supports the idea that rule-based prompting can be
effective in high-resource settings where the classification cues are well captured by the guidelines.</p>
      <p>Performance in the zero-shot subtasks varied. In Ukrainian, our system scored 0.6124, slightly below the
baseline. In Greek, it reached 0.4750, showing a moderate improvement over the baseline. For Polish, the
model scored 0.5665, which was slightly below the baseline (0.5719). These results suggest that zero-shot
performance depends not only on the model itself, but also on the similarity between the training
and test languages, and the phrasing patterns in the input data.</p>
      <p>Due to the shared task submission protocol, only one system could be submitted per language. This restriction limited our ability to
systematically compare multiple approaches across all languages. Consequently, the selected system
for each subtask reflects a pragmatic decision based on development performance and informal testing,
rather than a globally optimal configuration. This was especially relevant for zero-shot settings, where
generalization is influenced by a combination of linguistic similarity, domain coverage, and how well
task framing aligns with the model’s training data. Among the prompt-based methods, the Annotation
strategy proved more robust than the more complex comparative prompting variants. XLM-RoBERTa
consistently outperformed EuroBERT across all settings tested during development, particularly in
multilingual and cross-lingual tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>We presented a multilingual system for subjectivity detection using two main approaches: fine-tuned
transformer models and zero-shot prompting with LLMs. Our results in the CheckThat! 2025 shared
task show that both directions can be effective, depending on language and resource availability. For
fine-tuned models, XLM-RoBERTa delivered the most consistent performance and was used in several
subtasks, including multilingual and zero-shot settings. For German, we observed that fine-tuning a
German-BERT model with translated training data led to competitive results. On the prompting side,
the Annotation Approach with the o3-mini model performed well in high-resource languages, such as
Italian and English, where classification rules were clearly reflected in the data. Due to the submission
constraint of only one system per language, we could not test all combinations of models and approaches
systematically. Our choices were based on limited development set results and informal comparisons.
This affected our ability to fully explore the strengths and weaknesses of each approach across languages,
especially for zero-shot cases. Another important limitation was the lack of broader context for each
sentence. Since the task involved classifying isolated sentences, it was often difficult to judge subjectivity
accurately without context. This made the task especially challenging when subjective language relied
on surrounding sentences. Another issue is the imbalance of label distributions across languages,
most notably in the Bulgarian development set, where subjective sentences dominate. These imbalances
can lead models to internalize and amplify misleading associations, potentially reinforcing biases and
over-predicting subjectivity in certain languages or cultural contexts. For future work, it would be
valuable to systematically compare prompting and fine-tuned approaches across languages and subtasks
under controlled conditions. In particular, we aim to better understand which types of tasks or linguistic
features favor instruction-based inference over supervised training. Additionally, exploring more
flexible combinations of prompting and fine-tuning, e.g., via model ensembling or fallback strategies
such as few-shot prompting or confidence-based model switching, could help improve performance,
especially in low-resource or zero-shot settings.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research is funded by the Federal Ministry of Research, Technology and Space (BMFTR, reference:
03RU2U151C) in the scope of the research project news-polygraph.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <surname>V. V.</surname>
          </string-name>
          ,
          <article-title>The clef-2025 checkthat! lab: Subjectivity, fact-checking, claim normalization, and retrieval</article-title>
          , in: C.
          <string-name>
            <surname>Hauff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Jannach</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          <string-name>
            <surname>Nardini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Pinelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Silvestri</surname>
          </string-name>
          , N. Tonellotto (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hatzivassiloglou</surname>
          </string-name>
          ,
          <article-title>Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences</article-title>
          ,
          <source>in: Proceedings of the 2003 conference on Empirical methods in natural language processing</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruotsalainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hujanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villi</surname>
          </string-name>
          ,
          <article-title>A future of journalism beyond the objectivity-dialogue divide? hybridity in the news of entrepreneurial journalists</article-title>
          ,
          <source>Journalism</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <fpage>2240</fpage>
          -
          <lpage>2258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.-J.</given-names>
            <surname>Rodrigo-Ginés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <article-title>A systematic review on media bias detection: What is media bias, how it is expressed, and how to detect it</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>237</volume>
          (
          <year>2024</year>
          )
          <fpage>121641</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chaturvedi</surname>
          </string-name>
          , E. Cambria,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Welsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>Distinguishing between facts and opinions for sentiment analysis: Survey and challenges</article-title>
          ,
          <source>Information Fusion</source>
          <volume>44</volume>
          (
          <year>2018</year>
          )
          <fpage>65</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kasnesis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Toumanidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. Z.</given-names>
            <surname>Patrikakis</surname>
          </string-name>
          ,
          <article-title>Combating fake news with transformers: a comparative analysis of stance detection and subjectivity analysis</article-title>
          ,
          <source>Information</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>409</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlichtkrull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <article-title>A survey on automated fact-checking</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>178</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahitaj</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Maab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yamagishi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kolanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <article-title>Towards automated factchecking of real-world claims: Exploring task formulation and assessment with llms</article-title>
          ,
          <source>arXiv preprint arXiv:2502.08909</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Scheffler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Solopova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popa-Wyatt</surname>
          </string-name>
          ,
          <article-title>The telegram chronicles of online harm</article-title>
          ,
          <source>Journal of Open Humanities Data</source>
          <volume>7</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahitaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ruiz-Dolz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sahitaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nizamoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohtaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <article-title>From construction to application: Advancing argument mining with the large-scale kialoprime dataset</article-title>
          ,
          <source>in: Computational Models of Argument</source>
          , IOS Press,
          <year>2024</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Riloff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wiebe</surname>
          </string-name>
          ,
          <article-title>Learning extraction patterns for subjective expressions</article-title>
          ,
          <source>in: Proceedings of the 2003 conference on Empirical methods in natural language processing</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Palshikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Apte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pandita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Learning to identify subjective sentences</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Natural Language Processing</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>239</fpage>
          -
          <lpage>248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Sentiment analysis and subjectivity</article-title>
          .,
          <source>Handbook of natural language processing 2</source>
          (
          <year>2010</year>
          )
          <fpage>627</fpage>
          -
          <lpage>666</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sahitaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahitaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohtaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <article-title>Towards a computational framework for distinguishing critical and conspiratorial texts by elaborating on the context and argumentation with llms</article-title>
          ,
          <source>Working Notes of CLEF</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F. M. S.</given-names>
            <surname>Eid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S. A.</given-names>
            <surname>Mutahar</surname>
          </string-name>
          ,
          <article-title>Bridging rhetorical differences: Arabic textual metaphors in academic writing and translation</article-title>
          ,
          <source>European Journal of Arts, Humanities and Social Sciences</source>
          <volume>2</volume>
          (
          <year>2025</year>
          )
          <fpage>183</fpage>
          -
          <lpage>201</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>