=Paper=
{{Paper
|id=Vol-3740/paper-55
|storemode=property
|title=JK_PCIC_UNAM at CheckThat! 2024: Analysis of Subjectivity in News Sentences Using Transformers-Based Models
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-55.pdf
|volume=Vol-3740
|authors=Karla Salas-Jimenez,Iván Díaz,Helena Gómez-Adorno,Gemma Bel-Enguix,Gerardo Sierra
|dblpUrl=https://dblp.org/rec/conf/clef/Salas-JimenezDG24
}}
==JK_PCIC_UNAM at CheckThat! 2024: Analysis of Subjectivity in News Sentences Using Transformers-Based Models==
Notebook for the CheckThat! Lab at CLEF 2024
Karla Salas-Jimenez1,†, Iván Díaz1,†, Helena Gómez-Adorno2, Gemma Bel-Enguix3,4 and Gerardo Sierra3
1 Posgrado en Ciencias e Ingeniería de la Computación, Universidad Nacional Autónoma de México, Ciudad de México 04510, México
2 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Ciudad de México 04510, México
3 Instituto de Ingeniería, Universidad Nacional Autónoma de México, Ciudad de México 04510, México
4 Departament de Filologia Catalana i Lingüística General, Universitat de Barcelona, Barcelona, Spain
Abstract
Recognizing subjectivity in online content is essential for understanding public opinion, detecting bias, and
managing misinformation. This year’s CheckThat! 2024 Task 2 emphasized the identification of subjective and
objective news sentences. Transformer models, particularly BERT, have demonstrated high efficacy for this task.
In our study, we trained and evaluated our methodologies on the English and Italian sub-tasks of the challenge. A
thorough data analysis was conducted, emphasizing the importance of extracting relevant features for accurate
classification. Although traditional machine learning algorithms were also applied to this task, the BERT
models significantly outperformed them. Specifically, our BERT-based classifiers
achieved a macro F1 score of 0.82 on the English development dataset and 0.81 on the Italian development dataset.
These results underscore the effectiveness of transformer models in distinguishing subjective content.
Keywords
Subjectivity, News sentences, Transformer Models, BERT
1. Introduction
A subjective sentence expresses the position, attitude, or feelings of its author [1]. The detection
of subjectivity is a challenging task for computers due to the intricate nature of human language.
Subjective statements rely on personal opinions and emotions, which are difficult to quantify and
interpret accurately. Context and cultural references further complicate the task, as words can have
different meanings in different situations. This complexity requires the application of advanced natural
language processing techniques, which still struggle to reliably distinguish between subjective and
objective content.
Detecting subjectivity in news articles is essential for numerous applications, including sentiment
analysis, opinion mining, fact-checking, understanding public opinion, identifying bias, and combating
misinformation. In the realm of journalism, where articles are widely disseminated and opinions are
often intertwined with facts, differentiating between subjective and objective tones is a critical task.
The 2024 edition of the CheckThat! shared task [2] included six subtasks. Subtask 2 focused on evaluating
whether a sentence within a news article is presented with an objective or subjective tone. This task seeks
to address the challenge of discerning whether online news articles are composed of subjective opinions
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† These authors contributed equally.
karla_dsj@ciencias.unam.mx (K. Salas-Jimenez); diazrysivan@gmail.com (I. Díaz); helena.gomez@iimas.unam.mx (H. Gómez-Adorno); gbele@iingen.unam.mx (G. Bel-Enguix); gsierram@iingen.unam.mx (G. Sierra)
https://github.com/KarlaDSJ (K. Salas-Jimenez); https://github.com/JuanIvanDiazReyes (I. Díaz); https://helenagomez-adorno.github.io (H. Gómez-Adorno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
or objective statements. Datasets were offered in five languages: Arabic, Bulgarian, English, German, and
Italian, along with a multilingual dataset mixing all of the above. This linguistic diversity highlights the
importance of language-agnostic approaches to subjectivity detection, enabling broader applicability across
different linguistic contexts. The official evaluation metric for the shared task is the macro-averaged F1
score between the two classes (subjective and objective). We participated in the English and Italian subtasks.
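For reference, the macro-averaged F1 is simply the unweighted mean of the per-class F1 scores; a minimal sketch in plain Python (the SUBJ/OBJ class names follow the task labels):

```python
def f1_score(tp, fp, fn):
    """F1 for a single class from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f1(labels, preds, classes=("SUBJ", "OBJ")):
    """Unweighted mean of per-class F1: both classes count equally, regardless of size."""
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)
```

Because the mean is unweighted, a classifier that ignores the minority subjective class is penalized even when its overall accuracy looks high.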
The sentences in each dataset come from news articles dealing with controversial topics such as
political issues, COVID-19, civil rights, and economics. In addition to annotating the data, the organizers
developed a set of guidelines [1] that can be applied to any language to generate corpora in other
languages, which also helps resolve disagreements that may arise between annotators.
The guidelines [1] include the following cases:
• A sentence is subjective if it contains:
– Speculations that draw conclusions that are considered opinions
– Sarcastic or ironic expressions
– Exhortations or personal auspices
– Discriminating or downgrading expressions
– Rhetorical figures explicitly made by its author to convey their opinion
– A conclusion made by its author despite insufficient factual information
– Intensifiers that can be attributed to its author to express their opinion
• A sentence is objective when it:
– Describes the personal feelings, emotions, or moods of its author without conveying opinions
on other matters
– Expresses an opinion, claim, emotion, or point of view that is explicitly attributable to a
third party
– Contains quotation marks used to quote a third person
We used these guidelines to select the features that help determine whether a sentence is subjective,
as described in Section 3.1.
We first trained our subjectivity detection models on handcrafted features using classical machine
learning algorithms. Additionally, we fine-tuned BERT-based models for comparison.
During the development phase, our best-performing models were those based on BERT.
The remainder of the paper is organized as follows. Section 2 discusses related work, introducing
transformer-based models and recent applications in subjectivity classification. Section 3 describes
the methodology, detailing the dataset provided and used throughout the competition, as well as the
models employed. The results of the experiments performed and an analysis are presented in Section 4.
The paper concludes with a discussion of the findings and potential future work.
2. Related Work
In the last few years, special attention has been paid to detecting subjectivity in texts, and
transformer models have been applied to text classification tasks involving subjectivity detection;
examples include the work of Spinde et al. [3] and the Python package Dbias [4]. These works use the Bias
Annotations By Experts (BABE) corpus, which consists of 3,700 sentences from news on controversial
topics extracted from 14 US news platforms between January 2017 and June 2020. In both cases, the authors
addressed the problem of classifying whether a sentence is subjective using attention-based models,
with RoBERTa (F1 = 0.804) and DistilBERT (F1 = 0.75) obtaining the best results.
In previous years, the work of DWReCO at CheckThat! 2023 [5] utilized BERT-based models fine-tuned
on the competition dataset, augmented with the help of ChatGPT. To encode texts and train
subjectivity classifiers, language-specific transformers were employed: ‘Roberta-base’ for English,
‘German BERT’ for German, and ‘BERTurk’ for Turkish, as these models have demonstrated strong
performance on the tasks in their respective languages.
These approaches highlight the flexibility and effectiveness of transformer models in handling various
NLP tasks. By adapting pre-trained models through fine-tuning on task-specific data, they achieve
state-of-the-art results in subjectivity classification.
3. Methodology
3.1. Analysis of the dataset
Since we participated only in the English and Italian subtasks, we performed the analysis only for
these languages.
The first thing we observed is that the provided training dataset is imbalanced: the number of
subjective sentences is much lower than the number of objective sentences, as shown in Table 1.
Table 1
Data Split and Distribution

| Split    | English Total | OBJ | SUBJ | Italian Total | OBJ  | SUBJ |
|----------|---------------|-----|------|---------------|------|------|
| Train    | 830           | 532 | 298  | 1613          | 1231 | 232  |
| Dev      | 219           | 106 | 113  | 227           | 167  | 60   |
| dev-test | 243           | 116 | 127  | 440           | 323  | 117  |
We decided not to augment the dataset to balance it, since the selected models captured the class
differences despite the unequal amount of data in each class, as shown later in the analysis of the
model results.
When analyzing the data, we noted that the guidelines mentioned in the previous section proved helpful
for the task. For example, Table 2 shows that quotation marks are markedly more frequent in
objective sentences.
Table 2
Analysis of the Training Dataset Features Considered for Classic Supervised Classifiers.

| Feature               | English OBJ | English SUBJ | Italian OBJ | Italian SUBJ |
|-----------------------|-------------|--------------|-------------|--------------|
| # of words            | 11,745      | 7,358        | 24,138      | 7,746        |
| # of different lemmas | 2,508       | 1,826        | 4,722       | 2,372        |
| % of nouns            | 47.67       | 45.31        | 50.98       | 45.98        |
| % of adjectives       | 18.28       | 20.10        | 15.38       | 16.35        |
| % of verbs            | 24.69       | 23.13        | 15.38       | 16.35        |
| % of adverbs          | 9.34        | 11.34        | 10.47      | 14.64        |
| # of quotation marks  | 110         | 28           | 211         | 59           |
We inspected the vocabulary of each class and observed that although the classes share many words,
many others differ. We generated word clouds to visualize this, and we observed that the objective
class tends to discuss statistics, studies, and reports, frequently mentioning terms like "infected,"
"schools," and "teachers." In contrast, the subjective class uses words such as "thought," "consider,"
"indigent," and "perfectly," among others. This indicates a clear difference in the vocabulary used by
the two classes. It is also important to note that not only the vocabulary differs, but also the
number of words: the objective class contains more words.
The differences between the features of each class are not as large as we expected, as can be observed
for English and Italian in Table 2, except for the 'quotation marks' feature. We expected subjective
sentences to contain more adjectives and adverbs, since these usually provide more detail on the
characteristics or attributes of nouns and can, in certain contexts, introduce subjectivity into a sentence.
Finally, we counted the number of quotation mark pairs and observed a clear difference: as expected,
objective sentences contain more quotation marks, since these indicate the opinion of a third party,
which the annotators considered an objective feature.
3.2. Machine Learning Models
The analysis conducted in Section 3.1 indicates that the main features are:
• the number of quotation marks;
• the probabilities of the sentence being positive and negative, obtained with the Python
pysentimiento package [6];
• the numbers of nouns, adjectives, verbs, and adverbs, each divided by the word length of the sentence;
• a multilingual Sentence-BERT embedding [7], to capture the semantics of each sentence;
• a bag-of-words vector, since the vocabulary not shared between classes proved to be an important feature.
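A minimal sketch of the handcrafted part of this feature vector, assuming the sentiment probabilities and POS counts are precomputed upstream (in the paper, by pysentimiento and a POS tagger); the sentence embedding and bag-of-words vector would be concatenated afterwards:

```python
def handcrafted_features(sentence, pos_counts, pos_prob, neg_prob):
    """Build the handcrafted part of the feature vector for one sentence.

    pos_counts: dict with counts of nouns/adjectives/verbs/adverbs (from a POS tagger)
    pos_prob, neg_prob: sentiment probabilities (from pysentimiento in the paper)
    """
    n_words = max(len(sentence.split()), 1)
    # Straight and typographic quotation marks (objectivity cue per the guidelines)
    quotes = sum(sentence.count(q) for q in ('"', '\u201c', '\u201d', '\u00ab', '\u00bb'))
    return [
        quotes,                                # quotation-mark count
        pos_prob, neg_prob,                    # sentiment probabilities
        pos_counts.get("NOUN", 0) / n_words,   # POS ratios, normalized by sentence length
        pos_counts.get("ADJ", 0) / n_words,
        pos_counts.get("VERB", 0) / n_words,
        pos_counts.get("ADV", 0) / n_words,
    ]

feats = handcrafted_features('He said it was "unacceptable".',
                             {"NOUN": 0, "ADJ": 1, "VERB": 2, "ADV": 0},
                             pos_prob=0.10, neg_prob=0.70)
```

The `pos_counts` keys and the example probabilities are illustrative stand-ins, not values from the paper's pipeline.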
We tested these features with Logistic Regression (LR), Support Vector Classification (SVC),
Support Vector Regression (SVR), Random Forest (RF), and Naive Bayes (NB), methods that the literature
has shown to work well for learning from text features.
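This comparison can be sketched with scikit-learn's standard estimators on synthetic stand-in data (the real feature matrices come from the pipeline above; SVR is omitted here since, being a regressor, it needs an extra thresholding step):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in for the real feature matrix (handcrafted features + embeddings + BoW),
# with an OBJ/SUBJ imbalance roughly matching the English training split.
X, y = make_classification(n_samples=830, n_features=20,
                           weights=[0.64, 0.36], random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "RF": RandomForestClassifier(random_state=0),
    "NB": GaussianNB(),
}

# Macro-averaged F1 with 5-fold cross-validation, as in the shared task.
scores = {name: cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean()
          for name, clf in models.items()}
```

On the real features, logistic regression was the strongest of these, as reported in Section 4.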
3.3. Transformer Training
We employed BERT (Bidirectional Encoder Representations from Transformers) as our primary classifier.
BERT models are pre-trained on a vast corpus of text and are well suited to sequence classification
tasks, making them ideal for our needs. We utilized language-specific transformers [8]:
BERT-base-uncased [9] for English and BERT-base-italian-cased-sentiment [10] for Italian. Both models
were fine-tuned on the provided dataset to adapt them to the subjectivity detection task. Our primary
focus was on tuning the parameters of the supervised classifier. We trained the models for 4 epochs
with a batch size of 16 and limited the input size to a maximum of 256 tokens. We trained and ran our
system on Google Colab, using GPUs to speed up the large-scale computations involved in fine-tuning
BERT models. The hyperparameters were tuned on the training dataset provided by the organizers.
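A minimal sketch of this fine-tuning setup using the Hugging Face transformers Trainer API; the hyperparameters are the ones reported above, while dataset preparation and tokenization (with max_length = 256) are left to the caller, and the exact argument set is an assumption about the setup rather than the authors' script:

```python
# Hyperparameters reported in the paper.
HPARAMS = {"epochs": 4, "batch_size": 16, "max_length": 256}

def build_trainer(train_ds, eval_ds, model_name="bert-base-uncased"):
    """Fine-tune a language-specific BERT for binary subjectivity classification.

    train_ds / eval_ds: datasets already tokenized with
    max_length=HPARAMS["max_length"], carrying 'input_ids' and 'labels'.
    For Italian, the paper uses neuraly/bert-base-italian-cased-sentiment instead.
    """
    from transformers import (AutoModelForSequenceClassification, Trainer,
                              TrainingArguments)

    model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                               num_labels=2)
    args = TrainingArguments(
        output_dir="subjectivity-bert",
        num_train_epochs=HPARAMS["epochs"],
        per_device_train_batch_size=HPARAMS["batch_size"],
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_ds, eval_dataset=eval_ds)
```

Calling `build_trainer(...).train()` would then run the 4-epoch fine-tuning pass described above.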
4. Results and Analysis
To obtain the results for the classical methods, we applied 5-fold cross-validation. The results can
be seen in Tables 3 and 4, which provide the scores for English and Italian, respectively, obtained by
each of the machine learning models on the dev-test set. These tables show the effectiveness of the
machine learning models in capturing relevant text features.
Table 3
Results of the Classical Machine Learning Models on the English Development Dataset.

| Model | Accuracy | Precision | Recall | Macro F1 |
|-------|----------|-----------|--------|----------|
| LR    | 0.700    | 0.760     | 0.622  | 0.699    |
| SVC   | 0.601    | 0.841     | 0.291  | 0.562    |
| SVR   | 0.682    | 0.791     | 0.535  | 0.659    |
| RF    | 0.663    | 0.747     | 0.536  | 0.659    |
| NB    | 0.572    | 0.604     | 0.528  | 0.572    |
For the transformer-based models, results are shown in Tables 5 and 6. To further evaluate the
performance of our models, we examined the results under different settings for both the English and
Italian dev-test datasets provided by the organizers. These analyses compare the performance metrics
of various configurations of the batch size and maximum length parameters.
Table 4
Results of the Classical Machine Learning Models on the Italian Development Dataset.

| Model | Accuracy | Precision | Recall | Macro F1 |
|-------|----------|-----------|--------|----------|
| LR    | 0.718    | 0.465     | 0.402  | 0.662    |
| SVC   | 0.750    | 0.621     | 0.154  | 0.548    |
| SVR   | 0.743    | 0.530     | 0.299  | 0.610    |
| RF    | 0.730    | 0.484     | 0.265  | 0.586    |
| NB    | 0.650    | 0.301     | 0.239  | 0.518    |
Table 5
Analysis of Results with Different Settings on the English Development Dataset

| Settings                       | Precision | Recall | F1    | Macro F1 |
|--------------------------------|-----------|--------|-------|----------|
| Batch size = 32, Max len = 128 | 0.834     | 0.761  | 0.796 | 0.799    |
| Batch size = 16, Max len = 128 | 0.798     | 0.769  | 0.783 | 0.780    |
| Batch size = 32, Max len = 256 | 0.834     | 0.761  | 0.796 | 0.799    |
| Batch size = 16, Max len = 256 | 0.849     | 0.796  | 0.821 | 0.821    |
Table 6
Analysis of Results with Different Settings on the Italian Development Dataset

| Settings                       | Precision | Recall | F1    | Macro F1 |
|--------------------------------|-----------|--------|-------|----------|
| Batch size = 32, Max len = 128 | 0.760     | 0.633  | 0.690 | 0.796    |
| Batch size = 16, Max len = 128 | 0.677     | 0.700  | 0.688 | 0.787    |
| Batch size = 32, Max len = 256 | 0.730     | 0.633  | 0.678 | 0.786    |
| Batch size = 16, Max len = 256 | 0.733     | 0.733  | 0.733 | 0.818    |
As we can see, in both English and Italian the transformer-based models are ahead of the classical
models by almost a tenth of a point in macro F1, a substantial margin. Something that surprised us was
the performance of these models in Italian, since they work almost as well as in English. Transformers
usually perform better in English, the language on which most of them were trained, but we can observe
that the transfer of these models to other languages, such as Italian, keeps improving. Moreover, the
Italian model we used had been fine-tuned specifically for sentiment analysis on Italian texts, which
may have helped obtain better results in Italian.
Note also that in both English and Italian, increasing the maximum token length to 256 and reducing the
batch size to 16 helped. This may be because a shorter maximum length truncates sentences and discards
parts that could carry the bias signal.
Something similar happens with Sentence-BERT, which we used for the machine learning models: it
performs better in English because, for Italian, we used a multilingual model rather than one focused
on that language. Among the classical models, logistic regression achieved the best performance. It is
also worth mentioning that the most helpful features were the bag of words, since the words in which
the classes differ are very significant, followed by the Sentence-BERT embeddings, which improved the
results by almost a tenth of a point.
5. Conclusion and Future Work
In this research, we used transformer-based models: we fine-tuned BERT-base-uncased for English and an
Italian BERT-base model to analyze the subjectivity of news sentences. We also employed classical
methods to compare performance on this task.
This approach achieved 5th place in English and 1st place in Italian, with macro F1 scores of 0.7079
and 0.7917, respectively. In the case of English, the results are above the baseline, with a difference
of only 0.04 from first place. Our findings show that transformer-based models are effective at
detecting subjectivity in sentences.
Future experiments with the classical machine learning models include adding features that capture
more semantic elements of subjectivity: detecting sarcastic or ironic expressions, detecting whether a
sentence expresses a conclusion, and identifying intensifiers more precisely, since counting adjectives
and adverbs is not enough. For the transformers, domain adaptation could be applied before
classification, as well as an ensemble of models: the state of the art already includes models that
detect sentiment, sarcasm, hate speech, and more, and a way could be found to combine them to enrich
the subjectivity detection model.
In future work, several strategies can be employed to enhance the performance of our model. One
potential improvement involves implementing a class-weighted loss to address class imbalance.
This technique adjusts the loss function to assign higher weights to underrepresented classes, thereby
improving the model’s ability to learn from these examples. Additionally, freezing model parameters
can be beneficial. Specifically, freezing the lower layers of the model while training only the upper
layers and the classification layer can lead to more efficient and targeted learning. Combined with
further hyperparameter tuning and advanced regularization techniques, these approaches hold promise
for achieving better performance metrics in subsequent experiments.
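The class-weighting idea can be made concrete with one common scheme (an assumption here, not something the paper specifies; it matches scikit-learn's "balanced" mode), which sets each class weight to N / (k · n_c) so that the minority SUBJ class receives a larger weight:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by N / (k * n_c): rarer classes get larger weights,
    so their errors contribute more to a weighted loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# English training split: 532 OBJ vs 298 SUBJ sentences (Table 1).
weights = balanced_class_weights(["OBJ"] * 532 + ["SUBJ"] * 298)
```

These weights would then be passed to the loss function (e.g. as per-class weights in a cross-entropy loss) during fine-tuning.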
Acknowledgments
K. Salas-Jimenez thanks CONAHCYT scholarship program (CVU: 1291359). J. Díaz-Reyes thanks
CONAHCYT scholarship program (CVU: 923309).
This research was funded by CONAHCYT (CF-2023-G-64) and PAPIIT project IT100822, IN104424.
G.B.E. is supported by a grant for the requalification of the Spanish university system from the Ministry
of Universities of the Government of Spain, financed by the European Union, NextGeneration EU (María
Zambrano program, Universitat de Barcelona).
References
[1] F. Ruggeri, F. Antici, A. Galassi, K. Korre, A. Muti, A. Barrón-Cedeño, On the definition of
prescriptive annotation guidelines for language-agnostic subjectivity detection, Text2Story@ECIR 3370 (2023) 103–111.
[2] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari,
M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-Worthiness,
Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness, in: N. Goharian, N. Tonel-
lotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information
Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[3] T. Spinde, M. Plank, J.-D. Krieger, T. Ruas, B. Gipp, A. Aizawa, Neural media bias detection using
distant supervision with BABE – Bias Annotations By Experts, arXiv preprint arXiv:2209.14557 (2022).
[4] S. Raza, D. J. Reji, C. Ding, Dbias: detecting biases and ensuring fairness in news articles,
International Journal of Data Science and Analytics (2022) 1–21.
[5] I. B. Schlicht, L. Khellaf, D. Altiok, DWReCO at CheckThat! 2023: Enhancing Subjectivity
Detection through Style-based Data Sampling, in: Notebook for the CheckThat! Lab at CLEF 2023,
CLEF 2023: Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 2023.
[6] J. M. Pérez, M. Rajngewerc, J. C. Giudici, D. A. Furman, F. Luque, L. A. Alemany, M. V. Martínez,
pysentimiento: A Python toolkit for opinion mining and social NLP tasks, 2023. arXiv:2106.09462.
[7] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, CoRR
abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.
[8] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,
2020, pp. 38–45.
[9] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers
for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805.
arXiv:1810.04805.
[10] NeuralyIA, neuraly/bert-base-italian-cased-sentiment, https://huggingface.co/neuraly/
bert-base-italian-cased-sentiment, 2021. Accessed: 2024-05-24.