JK_PCIC_UNAM at CheckThat! 2024: Analysis of Subjectivity in News Sentences Using Transformers-Based Models
Notebook for the CheckThat! Lab at CLEF 2024

Karla Salas-Jimenez¹,†, Iván Díaz¹,†, Helena Gómez-Adorno², Gemma Bel-Enguix³,⁴ and Gerardo Sierra³

¹ Posgrado en Ciencias e Ingeniería de la Computación, Universidad Nacional Autónoma de México, Ciudad de México 04510, México.
² Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Ciudad de México 04510, México.
³ Instituto de Ingeniería, Universidad Nacional Autónoma de México, Ciudad de México 04510, México.
⁴ Departament de Filologia Catalana i Lingüística General, Universitat de Barcelona, Barcelona, España.

Abstract
Recognizing subjectivity in online content is essential for understanding public opinion, detecting bias, and managing misinformation. Task 2 of CheckThat! 2024 focused on distinguishing subjective from objective news sentences. Transformer models, particularly BERT, have demonstrated high efficacy for this task. In our study, we trained and evaluated our methods on the English and Italian subtasks of the challenge. We conducted a thorough data analysis, emphasizing the importance of extracting relevant features for accurate classification. We also applied traditional machine learning algorithms to the task, but the BERT models clearly outperformed them. Specifically, our BERT-based classifiers achieved a macro F1 score of 0.82 on the English development dataset and 0.81 on the Italian development dataset. These results underscore the effectiveness of transformer models in distinguishing subjective content.

Keywords
Subjectivity, News sentences, Transformer Models, BERT

1. Introduction

A subjective sentence expresses the position, attitude, or feelings of its author [1].
The detection of subjectivity is a challenging task for computers due to the intricate nature of human language. Subjective statements rely on personal opinions and emotions, which are difficult to quantify and interpret accurately. Context and cultural references further complicate the task, as words can have different meanings in different situations. This complexity requires advanced natural language processing techniques, which still struggle to reliably distinguish between subjective and objective content.

Detecting subjectivity in news articles is essential for numerous applications, including sentiment analysis, opinion mining, fact-checking, understanding public opinion, identifying bias, and combating misinformation. In journalism, where articles are widely disseminated and opinions are often intertwined with facts, differentiating between subjective and objective tones is a critical task.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
† These authors contributed equally.
Email: karla_dsj@ciencias.unam.mx (K. Salas-Jimenez); diazrysivan@gmail.com (I. Díaz); helena.gomez@iimas.unam.mx (H. Gómez-Adorno); gbele@iingen.unam.mx (G. Bel-Enguix); gsierram@iingen.unam.mx (G. Sierra)
URL: https://github.com/KarlaDSJ (K. Salas-Jimenez); https://github.com/JuanIvanDiazReyes (I. Díaz); https://helenagomez-adorno.github.io (H. Gómez-Adorno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

The 2024 edition of the CheckThat! shared task [2] included six tasks. Task 2 focused on evaluating whether a sentence within a news article is presented with an objective or subjective tone. This task seeks to address the challenge of discerning whether online news articles are composed of subjective opinions or objective statements.
Datasets were offered in five languages: Arabic, Bulgarian, English, German, and Italian, in addition to a multilingual dataset mixing all of them. This diversity highlights the importance of language-agnostic approaches to subjectivity detection, enabling broader applicability across different linguistic contexts. We participated in the English and Italian subtasks. The official evaluation metric for the shared task is the macro-averaged F1 score between the two classes (subjective and objective). The sentences in each dataset come from news articles dealing with controversial topics such as political issues, COVID-19, civil rights, and economics.

In addition to annotating the data, the organizers developed a set of guidelines [1] that can be applied to any language to build new corpora and that help resolve disagreements between annotators. The guidelines [1] include the following cases:

• A sentence is subjective if it contains:
  – Speculations that draw conclusions that are considered opinions
  – Sarcastic or ironic expressions
  – Exhortations or personal auspices
  – Discriminating or downgrading expressions
  – Rhetorical figures explicitly made by its author to convey their opinion
  – A conclusion made by its author that is drawn despite insufficient factual information
  – Intensifiers that can be attributed to its author to express their opinion
• A sentence is objective when it:
  – Describes the personal feelings, emotions, or moods of its author without conveying opinions on other matters
  – Expresses an opinion, claim, emotion, or point of view that is explicitly attributable to a third party
  – Contains quotation marks used to quote a third person

We used these guidelines to select the features that help determine whether a sentence is subjective, as described in Section 3.1.
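To make the official metric concrete, the following is a minimal sketch of macro-averaged F1 over the two classes. The function names and the label strings "OBJ" and "SUBJ" (taken from the column headers of Table 1) are illustrative assumptions, not the organizers' scoring script.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = ("OBJ", "SUBJ")) -> float:
    """Unweighted mean of the per-class F1 scores (assumed label names)."""
    return sum(f1_for_class(gold, pred, label) for label in labels) / len(labels)
```

Because the per-class scores are averaged without weighting, a classifier that ignores the minority class is penalized even when its accuracy is high, which matters for the unbalanced data described in Section 3.1.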
We first extracted handcrafted features and used them to train classical machine learning models for subjectivity detection. Additionally, we fine-tuned BERT-based models for comparison purposes. During the development phase, our best-performing models were those based on BERT.

The remainder of the paper is organized as follows. Section 2 discusses related work, introducing transformer-based models and recent applications in subjectivity classification. Section 3 describes the methodology, detailing the dataset provided and used throughout the competition, as well as the models employed. The results of the experiments and their analysis are presented in Section 4. The paper concludes with a discussion of the findings and potential future work.

2. Related Work

In the last few years, special attention has been paid to detecting subjectivity in texts. Recently, transformer models have been applied to text classification tasks involving subjectivity detection. Examples include the work of Spinde et al. [3] and the Python package DBias [4]. These works use the Bias Annotations By Experts (BABE) corpus, which consists of 3,700 sentences from news on controversial topics, extracted from 14 US news platforms between January 2017 and June 2020. In both cases, the authors addressed the problem of classifying whether a sentence is subjective using attention-based models such as RoBERTa (F1 0.804) and DistilBERT (F1 0.75), which obtained the best results.

In the previous edition, DWReCO at CheckThat! 2023 [5] used BERT-based models fine-tuned on the competition dataset, which was augmented with the help of ChatGPT. To encode texts and train subjectivity classifiers, language-specific transformers were employed: ‘Roberta-base’ for English, ‘German BERT’ for German, and ‘BERTurk’ for Turkish, as these models have demonstrated strong performance on tasks in their respective languages.
These approaches highlight the flexibility and effectiveness of transformer models in handling various NLP tasks. By adapting pre-trained models through fine-tuning on task-specific data, they achieve state-of-the-art results in subjectivity classification.

3. Methodology

3.1. Analysis of the dataset

Since we only participated in the English and Italian subtasks, we only analyzed these two languages. The first observation is that the training dataset is unbalanced: the number of subjective sentences is considerably lower than the number of objective sentences, as shown in Table 1.

Table 1
Data Split and Distribution

                   English                 Italian
            Total    OBJ   SUBJ     Total    OBJ   SUBJ
Train         830    532    298      1613   1231    232
Dev           219    106    113       227    167     60
dev-test      243    116    127       440    323    117

We decided not to augment the dataset to balance it, because the selected models captured the differences between the classes despite the imbalance, as discussed in the analysis of the model results in Section 4.

When analyzing the data, we noted that the guidelines mentioned in Section 1 were helpful for the task. For example, Table 2 shows that quotation marks are markedly more frequent in objective sentences.

Table 2
Analysis of the Training Dataset Features Considered for Classic Supervised Classifiers.

Feature                       English              Italian
                            OBJ     SUBJ         OBJ     SUBJ
# of words               11,745    7,358      24,138    7,746
# of different lemmas     2,508    1,826       4,722    2,372
% of nouns                47.67    45.31       50.98    45.98
% of adjectives           18.28    20.10       15.38    16.35
% of verbs                24.69    23.13       15.38    16.35
% of adverbs               9.34    11.34       10.47    14.64
# of quotation marks        110       28         211       59

We inspected the vocabulary of each class and observed that, although the classes share many words, they also differ in many others.
We generated word clouds to visualize this more clearly, and we observed that the objective class tends to discuss statistics, studies, and reports and frequently mentions terms like "infected," "schools," and "teachers." In contrast, the subjective class uses words such as "thought," "consider," "indigent," and "perfectly," among others. This indicates a clear difference in the vocabulary used in the two classes. It is also important to note that the classes differ not only in vocabulary but also in the number of words: the objective class has more words.

The difference between the features of each class is smaller than we expected, as can be observed for English and Italian in Table 2, except for the feature "quotation marks." We expected the number of adjectives and adverbs to be higher in subjective sentences, since these words usually provide more detail about the characteristics or attributes of nouns, which could, in certain contexts, introduce a kind of subjectivity into the sentence.

Finally, we counted the number of quotation mark pairs and found a difference. As expected, objective sentences have a higher number of quotation marks, since these indicate the opinion of a third person, which the annotators considered an objective feature.

3.2. Machine Learning Models

Based on the analysis in Section 3.1, we selected the following features:

• the number of quotation marks;
• the probabilities of the sentence being positive and negative, obtained with the Python package pysentimiento [6];
• the numbers of nouns, adjectives, verbs, and adverbs, each divided by the length of the sentence in words;
• a sentence embedding generated with a multilingual Sentence-BERT model [7], intended to capture the semantics of each sentence;
• a bag-of-words vector, since the words that the two classes do not share proved to be an important feature.
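As an illustration of the simplest of these features, the sketch below computes the quotation-mark count and the part-of-speech proportions for one sentence. It is a minimal sketch under stated assumptions, not the implementation used in our experiments: the function names are hypothetical, the part-of-speech tags are assumed to come from an external tagger using Universal POS labels with one tag per word, and the sentiment probabilities, sentence embedding, and bag-of-words vector are left out.

```python
from collections import Counter
from typing import List, Sequence

# Quotation characters considered here (an assumption; extend as needed).
QUOTE_CHARS = "\"“”«»"
# Universal POS tags for nouns, adjectives, verbs and adverbs.
CONTENT_TAGS = ("NOUN", "ADJ", "VERB", "ADV")


def count_quotation_pairs(sentence: str) -> int:
    """Count quotation characters and halve the total to approximate pairs."""
    n_marks = sum(sentence.count(ch) for ch in QUOTE_CHARS)
    return n_marks // 2


def handcrafted_features(sentence: str, pos_tags: Sequence[str]) -> List[float]:
    """Return [quotation pairs, noun, adjective, verb, adverb proportions].

    `pos_tags` holds one tag per word, produced by a tagger of your choice.
    """
    n_words = len(pos_tags)
    counts = Counter(pos_tags)
    ratios = [counts[tag] / n_words if n_words else 0.0 for tag in CONTENT_TAGS]
    return [float(count_quotation_pairs(sentence))] + ratios


# Hypothetical example sentence with hand-assigned tags.
example_sentence = 'The minister said the plan was "clearly a disaster".'
example_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "AUX", "ADV", "DET", "NOUN"]
features = handcrafted_features(example_sentence, example_tags)
```

The resulting vector can be concatenated with the remaining features before being passed to a classifier.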
We tested these features with Logistic Regression (LR), Support Vector Classification (SVC), Support Vector Regression (SVR), Random Forest (RF) and Naive Bayes (NB). These methods have been shown in the literature to work well for learning text features.

3.3. Transformer Training

We employed BERT (Bidirectional Encoder Representations from Transformers) as our primary classifier. BERT models are pre-trained on a vast corpus of text and can be fine-tuned for sequence classification tasks, making them well suited to our needs. We used language-specific transformers [8]: BERT-base-uncased [9] for English and BERT-base-italian-cased-sentiment [10] for Italian. Both models were fine-tuned on the provided dataset to adapt them to the subjectivity detection task.

Our primary focus was on tuning the parameters of the supervised classifier. We trained the models for 4 epochs with a batch size of 16 and limited the input to a maximum of 256 tokens. We trained and ran our system on Google Colab, using GPUs to speed up the computations involved in fine-tuning BERT models. The dataset used to tune the hyperparameters was the training dataset provided by the organizers.

4. Results and Analysis

To obtain the results of the classical methods, we applied 5-fold cross-validation. The results can be seen in Tables 3 and 4, which provide the scores for English and Italian, respectively, obtained by each of the machine learning models on the dev-test set. These tables show the effectiveness of the machine learning models in capturing relevant text features.

Table 3
Results of the Classical Machine Learning Models on the English Development Dataset.
Model   Accuracy   Precision   Recall   Macro F1
LR         0.700       0.760    0.622      0.699
SVC        0.601       0.841    0.291      0.562
SVR        0.682       0.791    0.535      0.659
RF         0.663       0.747    0.536      0.659
NB         0.572       0.604    0.528      0.572

Table 4
Results of the Classical Machine Learning Models on the Italian Development Dataset.

Model   Accuracy   Precision   Recall   Macro F1
LR         0.718       0.465    0.402      0.662
SVC        0.750       0.621    0.154      0.548
SVR        0.743       0.530    0.299      0.610
RF         0.730       0.484    0.265      0.586
NB         0.650       0.301    0.239      0.518

For the transformer models, results are shown in Tables 5 and 6. To further evaluate the performance of our models, we examined the results under different settings on the English and Italian dev-test datasets provided by the organizers. These analyses compare the performance metrics of various configurations of the batch size and maximum length parameters.

Table 5
Analysis of Results with Different Settings on the English Development Dataset

Settings                          Precision   Recall      F1   Macro F1
Batch size = 32, Max len = 128        0.834    0.761   0.796      0.799
Batch size = 16, Max len = 128        0.798    0.769   0.783      0.780
Batch size = 32, Max len = 256        0.834    0.761   0.796      0.799
Batch size = 16, Max len = 256        0.849    0.796   0.821      0.821

Table 6
Analysis of Results with Different Settings on the Italian Development Dataset

Settings                          Precision   Recall      F1   Macro F1
Batch size = 32, Max len = 128        0.760    0.633   0.690      0.796
Batch size = 16, Max len = 128        0.677    0.700   0.688      0.787
Batch size = 32, Max len = 256        0.730    0.633   0.678      0.786
Batch size = 16, Max len = 256        0.733    0.733   0.733      0.818

In both English and Italian, the transformer-based models are ahead of the best classical model by more than 0.1 in macro F1 (0.821 vs. 0.699 in English and 0.818 vs. 0.662 in Italian), which is a substantial margin. Something that surprised us was the performance of these models in Italian, since they work almost as well as in English.
Transformers usually perform better in English because this is the language on which they were trained, but we can observe that the transfer of these models to other languages, such as Italian, keeps improving. The Italian model we used [10] had been fine-tuned specifically for sentiment analysis of Italian texts, which may have helped to obtain better results in Italian. Note also that, in both English and Italian, increasing the maximum number of tokens and reducing the batch size to 16 helped. This may be because a shorter maximum length truncates sentences and loses information that could contain bias.

Something similar happens with Sentence-BERT, which we used for the classical machine learning models: it performs better in English, because for Italian we used a multilingual model rather than one focused on that language. Among the classical models, logistic regression performed best. It is also important to mention that the bag of words was among the features that helped the most, since the words in which the classes differ are very significant, followed by the embeddings generated by Sentence-BERT, which increased the results by almost 0.1.

5. Conclusion and Future Work

In this research, we used transformer-based models. We fine-tuned BERT-base for English and an Italian BERT-base model to analyze the subjectivity of sentences from news articles. We also employed classical methods to compare performance on this task. Our approach achieved 5th place in English and 1st place in Italian, with macro F1 scores of 0.7079 and 0.7917, respectively. In the case of English, the results are above the baseline, with a difference of only 0.04 from first place. Our findings show that transformer-based models are effective at detecting subjectivity in sentences.
Future experiments with the classical machine learning models include adding features that capture elements of subjectivity more closely related to semantics, for example detecting sarcastic or ironic expressions, detecting whether a sentence expresses a conclusion, and identifying intensifiers more reliably, since counting adjectives and adverbs is not enough. For the transformers, domain adaptation could be performed before classification, and an ensemble of models could be built: the state of the art already includes models that detect sentiment, sarcasm, hate speech, and so on, and these could be combined to enrich the subjectivity detection model.

Several further strategies could enhance the performance of our model. One potential improvement is a class-weighted loss to address class imbalance. This technique adjusts the loss function to assign higher weights to underrepresented classes, thereby improving the model's ability to learn from these examples. Additionally, freezing model parameters can be beneficial. Specifically, freezing the lower layers of the model while training only the upper layers and the classification layer can lead to more efficient and targeted learning. Combined with further hyperparameter tuning and advanced regularization techniques, these approaches hold promise for achieving better performance metrics in subsequent experiments.

Acknowledgments

K. Salas-Jimenez thanks the CONAHCYT scholarship program (CVU: 1291359). J. Díaz-Reyes thanks the CONAHCYT scholarship program (CVU: 923309). This research was funded by CONAHCYT (CF-2023-G-64) and PAPIIT project IT100822, IN104424. G.B.E. is supported by a grant for the requalification of the Spanish university system from the Ministry of Universities of the Government of Spain, financed by the European Union, NextGeneration EU (María Zambrano program, Universitat de Barcelona).

References

[1] F. Ruggeri, F. Antici, A. Galassi, K. Korre, A. Muti, A. Barrón-Cedeño, On the definition of prescriptive annotation guidelines for language-agnostic subjectivity detection, Text2Story@ECIR 3370 (2023) 103–111.
[2] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[3] T. Spinde, M. Plank, J.-D. Krieger, T. Ruas, B. Gipp, A. Aizawa, Neural media bias detection using distant supervision with BABE – Bias Annotations By Experts, arXiv preprint arXiv:2209.14557 (2022).
[4] S. Raza, D. J. Reji, C. Ding, Dbias: detecting biases and ensuring fairness in news articles, International Journal of Data Science and Analytics (2022) 1–21.
[5] L. K. Ipek Baris Schlicht, D. Altiok, DWReCO at CheckThat! 2023: Enhancing Subjectivity Detection through Style-based Data Sampling, in: Notebook for the CheckThat! Lab at CLEF 2023, CLEF 2023: Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, 2023.
[6] J. M. Pérez, M. Rajngewerc, J. C. Giudici, D. A. Furman, F. Luque, L. A. Alemany, M. V. Martínez, pysentimiento: A Python toolkit for opinion mining and social NLP tasks, 2023. arXiv:2106.09462.
[7] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.
[8] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.
[9] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[10] NeuralyIA, neuraly/bert-base-italian-cased-sentiment, https://huggingface.co/neuraly/bert-base-italian-cased-sentiment, 2021. Accessed: 2024-05-24.