<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Nullpointer at CheckThat! 2024: Identifying Subjectivity from Multilingual Text Sequence</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Md</forename><forename type="middle">Rafiul</forename><surname>Biswas</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Hamad Bin Khalifa University</orgName>
								<address>
									<settlement>Doha</settlement>
									<country key="QA">Qatar</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Abrar</forename><surname>Tasneem Abir</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Carnegie Mellon University in Qatar</orgName>
								<address>
									<settlement>Education City, Doha</settlement>
									<country key="QA">Qatar</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Wajdi</forename><surname>Zaghouani</surname></persName>
							<email>wajdi.zaghouani@northwestern.edu</email>
							<affiliation key="aff2">
								<orgName type="institution">Northwestern University in Qatar</orgName>
								<address>
									<settlement>Education City, Doha</settlement>
									<country key="QA">Qatar</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Nullpointer at CheckThat! 2024: Identifying Subjectivity from Multilingual Text Sequence</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">ED92825808FFB7408119DA18B6E65E0C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>subjectivity</term>
					<term>natural language processing</term>
					<term>sentiment</term>
					<term>fact checking</term>
					<term>news articles</term>
					<term>text sequence</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This study addresses a binary classification task to determine whether a text sequence, either a sentence or a paragraph, is subjective or objective. The task spans five languages (Arabic, Bulgarian, English, German, and Italian) along with a multilingual category. Our approach involved several key techniques. We first preprocessed the data through part-of-speech (POS) tagging, identification of question marks, and application of attention masks. We fine-tuned the sentiment-based Transformer model 'MarieAngeA13/Sentiment-Analysis-BERT' on our dataset. Given the imbalance toward objective data, we implemented a custom classifier that assigned greater weight to objective data. Additionally, we translated non-English data into English to maintain consistency across the dataset. Our model achieved notable results, ranking first for the multilingual dataset (Macro F1 = 0.7121) and German (Macro F1 = 0.7908), second for Arabic (Macro F1 = 0.4908) and Bulgarian (Macro F1 = 0.7169), third for Italian (Macro F1 = 0.7430), and ninth for English (Macro F1 = 0.6893).</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The concepts of objectivity and subjectivity are crucial in shaping methodologies, interpretations, and the perceived validity of findings in many natural language processing (NLP) applications, such as sentiment analysis and information extraction <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. Objectivity analysis relies on data that can be measured, observed, and verified by others and is achieved through careful experimental designs, standard procedures, and statistical analysis. In an ideal sense, objective analysis is supposed to be free from individual biases, emotions, and personal judgments, thereby ensuring that the results are universally valid and replicable <ref type="bibr" target="#b2">[3]</ref>.</p><p>Subjectivity, on the other hand, refers to perspectives, interpretations, or analyses that are influenced by personal experiences, feelings, beliefs, or biases <ref type="bibr" target="#b3">[4]</ref>. Subjective analysis is inherently shaped by the individual's background, cultural context, and personal viewpoints. While often perceived as less reliable or credible in scientific contexts, subjectivity is an unavoidable aspect of human cognition and can provide valuable insights, particularly in fields such as humanities, social sciences, and qualitative research where personal interpretation and contextual understanding are essential <ref type="bibr" target="#b4">[5]</ref>.</p><p>Identifying whether a text sequence expresses personal opinions, emotions, or factual information is essential for enhancing the accuracy and relevance of automated systems in diverse fields such as social media monitoring, customer feedback analysis, and news content categorization. In data analysis, the tension that arises from the interaction of objectivity and subjectivity frequently affects decision-making procedures and the dissemination of findings. 
The challenge lies in creating systems that can accurately classify text sequences, whether sentences or paragraphs, as either subjective, reflecting personal opinions or sentiments, or objective, presenting factual information devoid of personal bias <ref type="bibr" target="#b5">[6]</ref>. In an effort to improve the acceptability and credibility of their work, researchers may strive for objectivity, occasionally avoiding or concealing choices that might be viewed as subjective. Subjective opinions can, for example, subtly skew an analysis's ostensibly objective results during data selection, the choice of analytical methods, and the interpretation of results. Datasets are therefore likely to contain considerably more objective instances than subjective ones.</p><p>Task 2 of the CheckThat! Lab at CLEF 2024 <ref type="bibr" target="#b6">[7]</ref> classifies text as either subjective or objective. This binary classification task requires systems to accurately identify the nature of a text sequence. The task is offered in multiple languages (Arabic, Bulgarian, English, German, and Italian), providing a comprehensive multilingual evaluation of the systems' capabilities. The challenge of multilingual and cross-lingual text classification is compounded by the inherent linguistic and cultural differences that influence how subjectivity and objectivity are expressed.</p><p>This study presents an approach to a binary classification task aimed at discerning subjective from objective text across multiple languages. By leveraging advanced NLP techniques and Transformer models, we aim to enhance the accuracy and robustness of subjective-objective text classification. The implications of this research extend to improving automated news analysis, enhancing content recommendation systems, and promoting a more comprehensive understanding of subjectivity across languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>The task of classifying text as subjective or objective has been studied extensively in natural language processing. Early work by Wiebe et al. <ref type="bibr" target="#b7">[8]</ref> laid the foundations for subjectivity analysis, proposing a scheme for annotating subjective elements in text. They developed a system called OpinionFinder <ref type="bibr" target="#b0">[1]</ref>, which performed subjectivity analysis using various lexical and syntactic features. More recently, deep learning approaches have been applied to this task with great success. Nakov et al. <ref type="bibr" target="#b8">[9]</ref> provide a thorough overview of modern approaches to sentiment analysis, including detecting subjectivity. They highlight the effectiveness of leveraging pre-trained language models like BERT <ref type="bibr" target="#b9">[10]</ref> and fine-tuning them for the target task. Several studies have specifically examined subjectivity classification in a multilingual setting. Balahur et al. <ref type="bibr" target="#b10">[11]</ref> constructed a multilingual dataset for subjectivity classification in English, Spanish, French, and German. They experimented with various machine translation approaches to make the problem cross-lingual. Similarly, Mihalcea et al. <ref type="bibr" target="#b11">[12]</ref> generated subjectivity datasets for English and Romanian, using English tools and manually translating the subjective sentences into Romanian. CLEF <ref type="bibr" target="#b12">[13]</ref> (Conference and Labs of the Evaluation Forum) has run workshops on automatic identification and verification of claims in political debates, speeches, and news articles since 2018 <ref type="bibr" target="#b13">[14]</ref>. The CheckThat! 
shared task at CLEF focuses on detecting checkworthy claims across various languages including Arabic <ref type="bibr" target="#b14">[15]</ref>, which is one of the languages in the current study. In terms of methodology, fine-tuning pre-trained Transformer models has proven very effective for subjectivity and sentiment tasks. Xu et al. <ref type="bibr" target="#b15">[16]</ref> fine-tuned BERT for sentiment classification and demonstrated its strong performance on multiple benchmarks. Exploring multi-task learning, Yu and Jiang <ref type="bibr" target="#b16">[17]</ref> showed that jointly learning sentiment and subjectivity through a shared BERT encoder led to improvements on both tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">System Overview</head><p>Our subjectivity classification system comprises several key components: data preprocessing, model selection, and training strategies. This section provides an overview of each component and the techniques employed (see Figure <ref type="figure" target="#fig_0">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data Preprocessing</head><p>The first step in the pipeline is data preprocessing, which involves cleaning and transforming the raw text data into a format suitable for the model. We also experiment with additional preprocessing techniques, such as part-of-speech (POS) tagging and attention masking, but find that they do not significantly improve the performance of the model.</p></div>
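As a minimal illustration (not the paper's exact pipeline; function and field names here are our own), the cleaning and question-mark identification steps can be sketched in plain Python:

```python
import re


def preprocess(text: str) -> dict:
    """Sketch of a cleaning step: normalize whitespace and flag question marks.

    This is an illustrative simplification; the full pipeline also applied
    POS tagging and attention masks, which are omitted here.
    """
    cleaned = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    has_question = "?" in cleaned                # question-mark feature
    return {"text": cleaned, "has_question": has_question}


sample = "Is  this   really \n objective?"
print(preprocess(sample))
```

The resulting features can then be attached to each example before tokenization.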
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Model Selection</head><p>For the subjectivity classification task, we fine-tune a pre-trained Transformer model that has previously been trained on sentiment analysis. Specifically, we use the 'MarieAngeA13/Sentiment-Analysis-BERT' model, a BERT-based model fine-tuned for sentiment analysis. We find that starting from a model already fine-tuned on a related task (a form of transfer learning) yields better results than fine-tuning a general-purpose pre-trained model directly. The code and data can be found in the GitHub repository https://github.com/Abrar-Abir/CLEF2024task02.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Training Strategies</head><p>The training was conducted on a remote Dell server running Ubuntu 22 with 512 GB of RAM and a 24-core CPU, equipped with an NVIDIA A100 GPU with 80 GB of GPU memory. We employ the following training strategies to improve the performance of the model.</p><p>• Label mapping: The pre-trained sentiment analysis model is designed for three-class prediction (positive, neutral, negative), while our subjectivity classification task requires only two classes (subjective and objective). We experimented with different label mappings and found that mapping subjective to negative sentiment and objective to positive sentiment yielded the best results. • Confidence weighting: For the English dataset, we incorporate the confidence information provided in the 'solved_conflict' column, where true means an annotation conflict was resolved, i.e., higher annotation confidence. We assign a 20% higher weight (i.e., 1.2) to training losses coming from higher-confidence annotations before passing them to the loss function, so that backpropagation prioritizes minimizing the loss for higher-confidence annotations. • Hyperparameter tuning: We experiment with different hyperparameter settings and find that a batch size of 16, a learning rate of 2e-5, and training for 20 epochs yield the best performance.</p></div>
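The confidence-weighting idea can be sketched as follows; this is an illustrative stand-alone function (hypothetical names, per-example cross-entropy instead of the full Transformer training loop) showing how a 1.2x factor on higher-confidence examples scales the loss before aggregation:

```python
import math


def weighted_ce(prob_subj: float, label: int, high_confidence: bool,
                weight: float = 1.2) -> float:
    """Per-example cross-entropy, up-weighted by 20% when the
    'solved_conflict' flag marks a higher-confidence annotation.

    Sketch only: in training this scaling would be applied to the model's
    per-example losses before backpropagation.
    """
    p = prob_subj if label == 1 else 1.0 - prob_subj
    loss = -math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return weight * loss if high_confidence else loss


# A high-confidence annotation contributes 1.2x the loss of a low-confidence one.
l_low = weighted_ce(0.6, 1, high_confidence=False)
l_high = weighted_ce(0.6, 1, high_confidence=True)
print(round(l_high / l_low, 2))  # 1.2
```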
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Language Adaptation</head><p>To handle the multilingual nature of the task, we employ machine translation to convert non-English data into English, using the Google Translate API through the deep-translator library.</p><p>While we also experiment with fine-tuning language-specific pre-trained models for the non-English languages, we find that translating the training and test datasets into English and using the English model yields better performance. These preprocessing, model selection, and training strategies form the core of our subjectivity classification system. In the following sections, we detail our experimental setup and present the results of the approach.</p></div>
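A sketch of the translation step, assuming any str-to-str `translate` callable (in the setup above this would be deep-translator's GoogleTranslator, which requires network access, so a stub is used here); caching avoids repeated API calls for duplicate sentences:

```python
def translate_corpus(texts, translate, cache=None):
    """Translate a list of sentences to English, reusing cached results.

    `translate` is any callable str -> str; swap in the real machine
    translation API in practice. Sketch with illustrative names.
    """
    cache = {} if cache is None else cache
    out = []
    for t in texts:
        if t not in cache:
            cache[t] = translate(t)  # one API call per unique sentence
        out.append(cache[t])
    return out


# Stub translator for illustration only.
fake = lambda s: f"EN({s})"
print(translate_corpus(["hallo", "hallo", "welt"], fake))
# ['EN(hallo)', 'EN(hallo)', 'EN(welt)']
```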
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>This section presents the results of our subjectivity classification system across languages and datasets. We first describe the dataset characteristics and then provide a detailed analysis of the model's performance using different evaluation metrics. Finally, we compare our results with those of other participating teams in the CheckThat! Lab at CLEF 2024.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Dataset Description</head><p>The dataset for the Subjectivity Subtask consists of sentences from news articles in five languages: Arabic, Bulgarian, English, German, and Italian. Additionally, there is a multilingual dataset that combines all five languages. Table <ref type="table" target="#tab_1">1</ref> shows the distribution of objective and subjective sentences in the training and test sets for each language. Across all languages, the percentage of objective sentences is higher than that of subjective sentences, with the imbalance being more pronounced in the training sets. This imbalance poses a challenge for subjectivity classification systems, as they must learn from skewed data distributions. The multilingual test set contains 500 sentences, evenly split between 250 objective (50%) and 250 subjective (50%) sentences. This comprehensive dataset provides a robust foundation for developing and evaluating systems that distinguish between subjective and objective statements in news articles across multiple languages. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Performance Metrics</head><p>We evaluate our subjectivity classification model using several performance metrics, including the macro-averaged F1-score, precision, recall, and accuracy. Table <ref type="table" target="#tab_2">2</ref> presents the results for each language and the multilingual dataset. Our model achieves its best performance on the German dataset, with an F1 Macro score of 0.79 and an accuracy of 0.81, indicating high prediction correctness. On the multilingual dataset, the model obtains good performance, with an F1 Macro score of 0.71, an F1 SUBJ of 0.69, and an accuracy of 0.71. The model also performs well for Italian, with an F1 Macro score of 0.74 and solid subjective-class metrics, including an F1 SUBJ of 0.64. On the other hand, the model struggles most with the Arabic dataset, obtaining an F1 Macro score of 0.49 and an accuracy of 0.52; this comparatively low performance reflects the difficulty of identifying subjectivity in Arabic. The model performs well in Bulgarian, achieving an F1 Macro score of 0.72 and strong subjective-class performance with an F1 SUBJ of 0.69. For English, performance is moderate, with an F1 Macro score of 0.68; the model handles subjective data relatively well, with an F1 SUBJ of 0.54, precision (P SUBJ) of 0.52, recall (R SUBJ) of 0.64, and an overall accuracy of 0.64. In summary, the model performs best in German, followed by Italian and Bulgarian, with Arabic being the most challenging language. Performance in English is moderate, and overall multilingual performance is strong, suggesting the model's effectiveness across multiple languages, with some variability by language. The reported metrics are defined as follows. • F1 Macro: The macro-averaged F1 score, i.e., the unweighted mean over classes of each class's F1 score (the harmonic mean of precision and recall). 
• P Macro: The macro-averaged precision.</p><p>• R Macro: The macro-averaged recall.</p><p>• F1 SUBJ: The F1 score for the subjective class.</p><p>• P SUBJ: The precision for the subjective class.</p><p>• R SUBJ: The recall for the subjective class.</p><p>• Accuracy: The overall accuracy of the model.</p></div>
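The macro-averaged F1 used for ranking can be computed as in this self-contained sketch (illustrative, not the official task scorer); macro averaging gives the minority SUBJ class the same weight as OBJ despite the skewed label distribution:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for one class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_f1(y_true, y_pred, classes=("OBJ", "SUBJ")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c == p for t, p in zip(y_true, y_pred))
        fp = sum(p == c != t for t, p in zip(y_true, y_pred))
        fn = sum(t == c != p for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)
```

In practice this matches `sklearn.metrics.f1_score(..., average="macro")`.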
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Comparison with Other Teams</head><p>We compare the performance of our subjectivity classification model with that of the other participating teams in the CheckThat! Lab at CLEF 2024. Table <ref type="table" target="#tab_3">3</ref> shows the official results for each language and the multilingual dataset. Our team achieves the highest rank in the German and multilingual categories, with Macro F1 scores of 0.7908 and 0.7121, respectively. We also secure second position in Arabic (Macro F1 of 0.4908, SUBJ F1 of 0.37) and in Bulgarian (Macro F1 of 0.7169, SUBJ F1 of 0.69). In Italian, our model ranks third with a Macro F1 score of 0.7430 and a SUBJ F1 score of 0.64, and in English it ranks ninth with a Macro F1 score of 0.6893 and a SUBJ F1 score of 0.54.</p><p>These results showcase the competitiveness of our approach, especially in the German and multilingual categories. They also indicate areas for improvement, particularly in English, where our model's performance is lower than that of several other teams. Overall, our participation in the CheckThat! shared task demonstrated strong performance across multiple languages, securing top ranks in several categories and showcasing our model's capabilities on multilingual and subjective data. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>Our system leveraged a state-of-the-art pre-trained language model, specifically BERT, which we fine-tuned for the subjectivity classification task. Through extensive experiments, we demonstrated the effectiveness of our approach, achieving competitive performance across languages. Our system ranked first in the German and multilingual categories, second in Arabic and Bulgarian, and third in Italian. These results highlight the robustness of our model and its ability to generalize across different languages. We also investigated the impact of various preprocessing techniques, such as part-of-speech tagging and attention masking, on the performance of our system. Furthermore, our analysis of the dataset characteristics revealed the challenges posed by the imbalance between objective and subjective sentences across all languages. This imbalance underscores the need for strategies that handle skewed data distributions effectively.</p><p>Our work contributes to the growing body of research on subjectivity classification and multilingual natural language processing. The insights gained from our experiments can inform future research directions and help develop more robust and accurate systems for subjectivity analysis across diverse languages.</p><p>However, our study also has some limitations. The performance of our system in English was relatively low compared to other languages, indicating room for improvement. Future work could explore more advanced techniques, such as domain adaptation and transfer learning, to enhance the model's performance in English and other languages. Moreover, the scope of our study was limited to the dataset provided by the CheckThat! Lab. Further research could investigate the generalizability of our approach to other datasets and domains, such as social media and customer reviews.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In conclusion, our subjectivity classification system, Nullpointer, demonstrates the potential of leveraging pre-trained language models and multilingual approaches for identifying subjective and objective statements in news articles. As the volume of online content continues to grow, the ability to automatically distinguish between subjective and objective information becomes increasingly crucial. Our work contributes to this important research area and paves the way for more advanced and reliable subjectivity analysis systems in the future.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Diagram for classification of subjectivity in text sequence</figDesc><graphic coords="3,128.41,65.61,338.45,323.48" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Training and Test Data Distribution</figDesc><table><row><cell>Language</cell><cell>Dataset</cell><cell cols="2">OBJ (N) (%) SUBJ (N) (%)</cell></row><row><cell>Arabic</cell><cell cols="2">Train (1185) 905 (76.37)</cell><cell>280 (23.63)</cell></row><row><cell></cell><cell>Test (748)</cell><cell>425 (56.81)</cell><cell>323 (43.18)</cell></row><row><cell>Bulgarian</cell><cell>Train (729)</cell><cell>406 (55.69)</cell><cell>323 (44.23)</cell></row><row><cell></cell><cell>Test (250)</cell><cell>143 (57.2)</cell><cell>107 (42.8)</cell></row><row><cell>English</cell><cell>Train (830)</cell><cell>532 (64.09)</cell><cell>298 (35.9)</cell></row><row><cell></cell><cell>Test (484)</cell><cell>362 (74.79)</cell><cell>122 (25.2)</cell></row><row><cell>German</cell><cell>Train (800)</cell><cell>492 (61.5)</cell><cell>308 (38.5)</cell></row><row><cell></cell><cell>Test (337)</cell><cell>226 (67.07)</cell><cell>111 (32.93)</cell></row><row><cell>Italian</cell><cell cols="2">Train (1613) 1231 (76.31)</cell><cell>382 (23.68)</cell></row><row><cell></cell><cell>Test (513)</cell><cell>377 (73.4)</cell><cell>136 (26.5)</cell></row><row><cell cols="4">Multilingual Train (5159) 3568 (69.16) 1591 (30.83)</cell></row><row><cell></cell><cell>Test (500)</cell><cell>250 (50)</cell><cell>250 (50)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Performance metrics across different languages</figDesc><table><row><cell>Language</cell><cell cols="7">F1 Macro P Macro R Macro F1 SUBJ P SUBJ R SUBJ Accuracy</cell></row><row><cell>Arabic</cell><cell>0.49</cell><cell>0.49</cell><cell>0.50</cell><cell>0.37</cell><cell>0.43</cell><cell>0.33</cell><cell>0.52</cell></row><row><cell>Bulgarian</cell><cell>0.72</cell><cell>0.72</cell><cell>0.72</cell><cell>0.69</cell><cell>0.66</cell><cell>0.72</cell><cell>0.72</cell></row><row><cell>English</cell><cell>0.68</cell><cell>0.43</cell><cell>0.50</cell><cell>0.54</cell><cell>0.52</cell><cell>0.64</cell><cell>0.64</cell></row><row><cell>German</cell><cell>0.79</cell><cell>0.78</cell><cell>0.81</cell><cell>0.73</cell><cell>0.67</cell><cell>0.80</cell><cell>0.81</cell></row><row><cell>Italian</cell><cell>0.74</cell><cell>0.73</cell><cell>0.77</cell><cell>0.64</cell><cell>0.57</cell><cell>0.73</cell><cell>0.78</cell></row><row><cell>Multilingual</cell><cell>0.71</cell><cell>0.72</cell><cell>0.71</cell><cell>0.69</cell><cell>0.76</cell><cell>0.63</cell><cell>0.71</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Official results for six test languages in Subtask2 CheckThat! Lab at CLEF 2024</figDesc><table><row><cell>Language</cell><cell>Team</cell><cell cols="3">Rank Macro F1 SUBJ F1</cell></row><row><cell>Arabic</cell><cell>IAI Group</cell><cell>1</cell><cell>0.4947</cell><cell>0.46</cell></row><row><cell></cell><cell>Nullpointer</cell><cell>2</cell><cell>0.4908</cell><cell>0.37</cell></row><row><cell></cell><cell>Baseline</cell><cell>3</cell><cell>0.4852</cell><cell>0.40</cell></row><row><cell></cell><cell>JUNLP (last)</cell><cell>7</cell><cell>0.3623</cell><cell>0.00</cell></row><row><cell>Bulgarian</cell><cell>Baseline</cell><cell>1</cell><cell>0.7531</cell><cell>0.73</cell></row><row><cell></cell><cell>Nullpointer</cell><cell>2</cell><cell>0.7169</cell><cell>0.69</cell></row><row><cell></cell><cell>Hybrinfox</cell><cell>3</cell><cell>0.7147</cell><cell>0.65</cell></row><row><cell></cell><cell>JUNLP (last)</cell><cell>5</cell><cell>0.3639</cell><cell>0.00</cell></row><row><cell>English</cell><cell>Hybrinfox</cell><cell>1</cell><cell>0.7442</cell><cell>0.60</cell></row><row><cell></cell><cell>Nullpointer</cell><cell>9</cell><cell>0.6893</cell><cell>0.54</cell></row><row><cell></cell><cell>Baseline</cell><cell>11</cell><cell>0.6346</cell><cell>0.45</cell></row><row><cell></cell><cell>IAI Group (last)</cell><cell>15</cell><cell>0.4491</cell><cell>0.39</cell></row><row><cell>German</cell><cell>Nullpointer</cell><cell>1</cell><cell>0.7908</cell><cell>0.73</cell></row><row><cell></cell><cell>IAI Group</cell><cell>2</cell><cell>0.7302</cell><cell>0.66</cell></row><row><cell></cell><cell>Baseline</cell><cell>3</cell><cell>0.6994</cell><cell>0.63</cell></row><row><cell></cell><cell>Hybrinfox 
(last)</cell><cell>4</cell><cell>0.6968</cell><cell>0.57</cell></row><row><cell>Italian</cell><cell>JK_PCIC_UNAM</cell><cell>1</cell><cell>0.7917</cell><cell>0.69</cell></row><row><cell></cell><cell>Nullpointer</cell><cell>3</cell><cell>0.7430</cell><cell>0.64</cell></row><row><cell></cell><cell>Baseline</cell><cell>4</cell><cell>0.6503</cell><cell>0.52</cell></row><row><cell></cell><cell>IAI Group (last)</cell><cell>5</cell><cell>0.5862</cell><cell>0.49</cell></row><row><cell>Multilingual</cell><cell>Nullpointer</cell><cell>1</cell><cell>0.7121</cell><cell>0.69</cell></row><row><cell></cell><cell>Hybrinfox</cell><cell>2</cell><cell>0.6849</cell><cell>0.63</cell></row><row><cell></cell><cell>Baseline</cell><cell>3</cell><cell>0.6697</cell><cell>0.66</cell></row><row><cell></cell><cell>IAI Group (last)</cell><cell>4</cell><cell>0.6292</cell><cell>0.67</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://carpedm20.github.io/emoji/docs/api.html#emoji.demojize</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Acknowledgments</head><p>We acknowledge Qatar National Research Fund grant NPRP14C0916-210015 from the Qatar Research Development and Innovation Council (QRDI) for funding this research.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Opinionfinder: A system for subjectivity analysis</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hoffmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Somasundaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kessler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wiebe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cardie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Riloff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Patwardhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of HLT/EMNLP 2005 Interactive Demonstrations</title>
				<meeting>HLT/EMNLP 2005 Interactive Demonstrations</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="34" to="35" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Beyond subjective and objective in statistics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hennig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Royal Statistical Society Series A: Statistics in Society</title>
		<imprint>
			<biblScope unit="volume">180</biblScope>
			<biblScope unit="page" from="967" to="1033" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Decline of a paradigm? bias and objectivity in news media studies</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Hackett</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Critical Studies in Media Communication</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="229" to="259" />
			<date type="published" when="1984">1984</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Learning personal human biases and representations for subjective tasks in natural language processing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kocoń</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gruza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bielaniewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grimling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kanclerz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Miłkowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kazienko</surname></persName>
		</author>
	</analytic>
	<monogr>
	<title level="m">IEEE International Conference on Data Mining (ICDM)</title>
			<imprint>
		<publisher>IEEE</publisher>
		<date type="published" when="2021">2021</date>
		<biblScope unit="page" from="1168" to="1173" />
	</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Subjectivity, identification and differentiation: Key issues in early social development</title>
		<author>
			<persName><forename type="first">U</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carpendale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bibok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Racine</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Monographs of the Society for Research in Child Development</title>
		<imprint>
			<biblScope unit="page" from="167" to="179" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
	<title level="m" type="main">Using NLP approach for opinion types classifier</title>
		<author>
			<persName><forename type="first">M</forename><surname>Othman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Moawad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Idrees</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dimitrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Galassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Siegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegand</surname></persName>
		</author>
		<title level="m">Overview of the CLEF-2024 CheckThat! lab task 2 on subjectivity in news articles</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Development and use of a gold-standard data set for subjectivity classifications</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wiebe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bruce</surname></persName>
		</author>
		<author>
		<persName><forename type="first">T</forename><surname>O'Hara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 37th annual meeting of the Association for Computational Linguistics</title>
				<meeting>the 37th annual meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="246" to="253" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
	<title level="a" type="main">SemEval-2016 Task 4: Sentiment analysis in Twitter</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ritter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rosenthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sebastiani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th international workshop on semantic evaluation</title>
			<meeting>the 10th international workshop on semantic evaluation (SemEval-2016)</meeting>
		<imprint>
		<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
	<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Opinion mining on newspaper quotations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Balahur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Steinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Van Der Goot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pouliquen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kabadjov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 03</title>
				<meeting>the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 03</meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="523" to="526" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Learning multilingual subjective language via cross-lingual projections</title>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Banea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wiebe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics</title>
				<meeting>the 45th Annual Meeting of the Association of Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="976" to="983" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
	<title level="a" type="main">The CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness</title>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Przybyła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lipani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Mcdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</editor>
	<meeting><address><addrLine>Cham</addrLine></address></meeting>
	<imprint>
		<publisher>Springer Nature Switzerland</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="449" to="458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
	<title level="a" type="main">The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
		<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
		<persName><forename type="first">R</forename><surname>Míguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kutlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaghouani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shaar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Information Retrieval</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="416" to="428" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
	<title level="a" type="main">Fighting the COVID-19 infodemic in social media: a holistic perspective and a call to arms</title>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Durrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mubarak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International AAAI Conference on Web and Social Media</title>
				<meeting>the International AAAI Conference on Web and Social Media</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="913" to="922" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
	<title level="a" type="main">BERT post-training for review reading comprehension and aspect-based sentiment analysis</title>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2324" to="2335" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
	<title level="a" type="main">Adapting BERT for target-oriented multimodal sentiment classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Joint Conference on Artificial Intelligence</title>
				<meeting>the 28th International Joint Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="5408" to="5414" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
