<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Tonirodriguez at CheckThat!2024: Is it Possible to Use Zero-Shot Cross-Lingual Methods for Subjectivity Detection in Low-Resources Languages? Notebook for the CheckThat! Lab Task2 at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Antonio</forename><surname>Rodríguez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">La Salle Engineering</orgName>
								<orgName type="institution">Universitat Ramon Llull</orgName>
								<address>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Elisabet</forename><surname>Golobardes</surname></persName>
							<email>elisabet.golobardes@salle.url.edu</email>
							<affiliation key="aff1">
								<orgName type="department">La Salle Engineering</orgName>
								<orgName type="institution">Universitat Ramon Llull</orgName>
								<address>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jaume</forename><surname>Suau</surname></persName>
							<email>jsuau@blanquerna.url.edu</email>
							<affiliation key="aff2">
								<orgName type="department">Blanquerna</orgName>
								<orgName type="institution">Universitat Ramon Llull</orgName>
								<address>
									<settlement>Barcelona</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Tonirodriguez at CheckThat!2024: Is it Possible to Use Zero-Shot Cross-Lingual Methods for Subjectivity Detection in Low-Resources Languages? Notebook for the CheckThat! Lab Task2 at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">CBAB874EEC582B7D3A92AD295C2A8710</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Subjectivity Detection</term>
					<term>Natural Language Processing</term>
					<term>Fake News</term>
					<term>Journalism</term>
					<term>Misinformation</term>
					<term>Transformers</term>
					<term>Cross-lingual</term>
					<term>Transfer Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Subjectivity detection is a key task within natural language processing due to the challenges generated by new forms of journalism, the proliferation of misinformation and fake news, and existing concerns about the quality and integrity of journalism. Although subjectivity detection is an existing challenge in all languages, the amount of resources available to build these types of applications varies greatly among languages. In this paper, we present our participation in the CLEF2024 CheckThat! Lab Task2 [1], where we have attempted to apply Zero-Shot Cross-Lingual transfer techniques using the datasets for the five languages provided in Task2 (English, German, Italian, Bulgarian, and Arabic). For this, we have fine-tuned two multilingual models, mDeBERTa v3 and XLM-RoBERTa, on a subset of the dataset consisting of three of the languages provided in Task2, specifically English, German, and Italian, and we have applied Zero-Shot Cross-Lingual transfer to the other two languages available in Task2, Arabic and Bulgarian.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Currently, the proliferation of news sites and the widespread use of social networks have revolutionized the way news is consumed, giving rise to new forms of journalism <ref type="bibr" target="#b1">[2]</ref>. However, these changes have introduced several challenges, including the proliferation of misinformation and fake news, the formation of "echo chambers" where news consumers limit their exposure to different points of view, and emerging concerns about the quality and integrity of journalism <ref type="bibr" target="#b2">[3]</ref>. A common element in many of the identified challenges is the need to distinguish whether a news author is sharing objective information or expressing their own opinions, desires, or biases <ref type="bibr" target="#b3">[4]</ref>  <ref type="bibr" target="#b4">[5]</ref>. The goal of Subjectivity Detection (SD) is to develop computational systems capable of implementing a binary classifier that can determine whether a text is objective or subjective.</p><p>CLEF2024 CheckThat! Lab Task2 <ref type="bibr" target="#b0">[1]</ref> provides an opportunity to work on the challenges associated with subjectivity detection. This task aims to construct a binary classifier that can identify whether a text sequence, in the form of a sentence, is subjective or objective <ref type="bibr" target="#b5">[6]</ref>. For the execution of Task2, the organizers have published five datasets in different languages (English, German, Italian, Bulgarian, and Arabic), plus an additional dataset that combines the previous five languages for the multilingual version of the task. The evaluation of the results presented will be carried out through the macro-averaged F1 between the two classes. This paper begins with the "Related Work" section, where a comprehensive review of previous research and studies relevant to the topic is conducted. 
This is followed by the "Data" section, which provides a detailed description of the structure and characteristics of the datasets provided for Task2. The "Approach" section outlines the phases and techniques employed to conduct the research. In the "Results" section, the findings obtained from the implementation of the models used are presented and analysed using the macro F1 metric. Finally, the "Conclusions" section provides a summary of the results, discusses the implications of the research, and suggests possible directions for future research.</p></div>
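Since the task ranks systems by the macro-averaged F1 between the OBJ and SUBJ classes, it may help to make the metric concrete. Below is a minimal pure-Python sketch of macro F1 (not the organizers' official scorer): it computes precision, recall, and F1 per class, then averages the per-class F1 values so that both classes count equally regardless of their frequency.

```python
def macro_f1(gold, pred, labels=("OBJ", "SUBJ")):
    """Macro-averaged F1: the unweighted mean of the per-class F1 scores."""
    f1s = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally to the average, a classifier that predicts only the majority OBJ class is penalized heavily on the SUBJ component, which matters given the class imbalance described in the Data section.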
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>According to Liu <ref type="bibr" target="#b6">[7]</ref>, Subjectivity Detection (SD) is a field of study traditionally encompassed within a broader field known as Sentiment Analysis (SA), also referred to as opinion mining. Sentiment analysis is the field of study that analyses people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. Sentiment Analysis has been studied extensively over the last two decades.</p><p>Chaturvedi <ref type="bibr" target="#b7">[8]</ref> categorizes methods for subjectivity detection into two main types: traditional syntax-centered NLP methods and semantics-based NLP approaches. Syntax-centered NLP can be broadly divided into three main categories: keyword spotting, lexical affinity, and statistical methods. The major issue with these methods is that they are highly language-specific and require the existence of databases and resources for each language in which they are to be applied. To address this issue, solutions such as translating content from languages lacking these resources into resource-rich languages like English have been adopted. However, the translation of sentences can lead to the loss of lexical information, such as word sense, resulting in low accuracy <ref type="bibr" target="#b7">[8]</ref>.</p><p>On the other hand, semantic methods based on embeddings, RNNs, Convolutional Networks, and Transformers have gained significant relevance recently. They offer more accurate results than methods based on syntactic features, but they present their own challenges, as they require large datasets for each language in which we want to work. 
The creation of these datasets is complex and can generate problems such as ambiguity when classifying sentences <ref type="bibr" target="#b7">[8]</ref> or annotator bias <ref type="bibr" target="#b8">[9]</ref>. To avoid these problems, a recent paper published by F. Antici et al. <ref type="bibr" target="#b9">[10]</ref> proposes annotation guidelines with the aim of unifying criteria and avoiding the aforementioned problems while experimenting with monolingual, multilingual, and cross-lingual Transformer scenarios between the English and Italian languages.</p><p>Schumacher <ref type="bibr" target="#b10">[11]</ref>, starting with a multilingual BERT model, achieves good results for cross-language entity linking. From there, he explores Zero-Shot Cross-Lingual transfer between different languages and obtains robust results, with a slight degradation when the model is applied to a language for which fine-tuning has not been performed. He concludes that although multilingual Transformer models transfer well between languages, issues remain in disambiguating similar entities unseen in training.</p><p>The objective of this paper is to address the question of the viability of using Zero-Shot Cross-Lingual transfer for subjectivity detection. To this end, we will fine-tune two multilingual Transformer models and analyze the results obtained within the framework of the CLEF2024 CheckThat! Lab Task2 <ref type="bibr" target="#b0">[1]</ref>. To achieve this goal, we will employ DeBERTa <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref> and RoBERTa <ref type="bibr" target="#b13">[14]</ref> for the monolingual approach and their multilingual versions, MDeBERTa <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref> and XLM-RoBERTa <ref type="bibr" target="#b13">[14]</ref>, respectively, for the multilingual approach. 
These models are evolutions built upon BERT that significantly enhance the results achieved by multilingual BERT, particularly in low-resource languages <ref type="bibr" target="#b14">[15]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data</head><p>The six datasets provided for the execution of Task2 exhibit varying characteristics in terms of size and distribution of objective and subjective sentences. In all datasets, objective sentences are labeled with the tag "OBJ", while subjective sentences are labeled as "SUBJ". As shown in Table <ref type="table" target="#tab_0">1</ref>, the Bulgarian dataset, which is the smallest, comprises a total of 1043 texts, 729 of which are included in the training dataset. In contrast, the Italian dataset contains a total of 2280 sentences, 1613 of which are in the training dataset. Furthermore, an examination of the datasets reveals a distribution bias in favour of the "OBJ" class across all datasets, although the extent of this bias varies depending on the language. For instance, while the bias is only 55.69% in favour of "OBJ" sentences in Bulgarian, this bias increases to 76.32% and 76.37% for Italian and Arabic, respectively. The multilingual dataset, the largest among all, is composed of a subset of sentences provided in each of the other datasets across all subsets (training, validation and test). However, due to its composition, it also exhibits a bias in favour of the "OBJ" class, accounting for 69.16% of the dataset.</p></div>
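The class-share percentages quoted above follow directly from the per-split counts in Table 1. As a small stdlib sketch of that computation (the label lists below are reconstructed from the Table 1 counts, not the actual datasets):

```python
from collections import Counter

def class_shares(labels):
    """Percentage share of each label in a split, rounded to two decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {lab: round(100 * n / total, 2) for lab, n in counts.items()}

# Bulgarian training split from Table 1: 406 OBJ vs. 323 SUBJ sentences.
bg_train = ["OBJ"] * 406 + ["SUBJ"] * 323
# Multilingual training split from Table 1: 3568 OBJ vs. 1591 SUBJ sentences.
multi_train = ["OBJ"] * 3568 + ["SUBJ"] * 1591
```

Running `class_shares` on these reconstructions reproduces the 55.69% (Bulgarian) and 69.16% (multilingual) OBJ shares cited in the text.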
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Approach</head><p>In our research, we adopted a dual approach. Initially, we employed a monolingual approach that leveraged Transformers, placing the focus on the English language. Subsequently, we implemented a second phase, utilizing multilingual Transformers with a dual purpose: to enhance the results obtained in the first phase with the monolingual Transformers by increasing the size of the training set, and to verify the Zero-Shot Cross-Lingual transfer capabilities of the model. This means that a model that is fine-tuned in certain languages can be applied to other languages without any specific training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Monolingual Models</head><p>The primary objective of the monolingual phase was to improve upon the baseline results provided by Task2. The baseline is based on a two-step approach. First, Sentence-BERT <ref type="bibr" target="#b16">[16]</ref> is used to transform each sentence into a high-dimensional vector representation capable of capturing its semantic meaning. In the second step, a classifier is constructed by training a Logistic Regression model on the vectors generated in the previous step. To improve the results provided by the baseline, we utilized various Transformers such as DeBERTa v3 Large <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>, RoBERTa Large <ref type="bibr" target="#b13">[14]</ref>, and BART Large MNLI <ref type="bibr" target="#b17">[17]</ref>, which uses the entailment approach <ref type="bibr" target="#b18">[18]</ref>.</p><p>BART Large MNLI <ref type="bibr" target="#b17">[17]</ref> is a Transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder, pretrained on English. BART is pretrained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. BART is particularly effective when fine-tuned for text generation tasks (e.g., summarization, translation) but also performs well on comprehension tasks (e.g., text classification, question answering). In this study, we selected the checkpoint for bart-large after it had been trained on the MultiNLI (MNLI) dataset. Yin et al. <ref type="bibr" target="#b18">[18]</ref> proposed a method for using pre-trained NLI models as ready-made Zero-Shot sequence classifiers. The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label.</p></div>
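The entailment method of Yin et al. [18] turns classification into natural language inference: each candidate label is verbalized into a hypothesis and scored for entailment against the sentence. In practice the premise-hypothesis pairs are scored by an MNLI-trained model (e.g., via the Hugging Face zero-shot-classification pipeline); the sketch below shows only the input construction and label selection. The hypothesis template and label names are illustrative assumptions, and `entailment_scores` stands in for the model's output.

```python
def build_nli_inputs(sentence, candidate_labels, template="This example is {}."):
    """Pose the sentence as the NLI premise and each verbalized label as a hypothesis."""
    return [(sentence, template.format(label)) for label in candidate_labels]

def pick_label(candidate_labels, entailment_scores):
    """Choose the label whose hypothesis received the highest entailment score."""
    best = max(range(len(candidate_labels)), key=lambda i: entailment_scores[i])
    return candidate_labels[best]
```

This is why an NLI checkpoint can classify without task-specific training: the label set is supplied at inference time through the hypotheses rather than learned output heads.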
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Multilingual Models</head><p>In the second phase of our study, we utilized multilingual Transformers. Although these models have architectures and training procedures similar to their monolingual counterparts, they differ in that the corpus used for their pretraining consists of documents in many languages. The multilingual Transformer models used in this study were MDeBERTa Base and XLM-RoBERTa Base. These models use masked language modeling as a pretraining objective and are trained jointly on texts in over one hundred languages. By pretraining on vast corpora across numerous languages, these multilingual Transformers enable Zero-Shot Cross-Lingual transfer. This implies that a model fine-tuned on one language can be applied to others without any additional training. The characteristics of these models are as follows:</p><p>MDeBERTa V3 Base <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>: mDeBERTa is a multilingual version of DeBERTa which uses the same structure as DeBERTa and was trained with CC100 multilingual data. The mDeBERTa V3 base model comes with 12 layers and a hidden size of 768. It has 86M backbone parameters, with a vocabulary containing 250K tokens, which introduces 190M parameters in the Embedding layer. This model was trained using the same 2.5TB CC100 data as XLM-R.</p><p>XLM-RoBERTa: XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. Following the work of XLM and RoBERTa, the XLM-RoBERTa (XLM-R) model takes multilingual pretraining one step further by massively upscaling the training data <ref type="bibr" target="#b19">[19]</ref>. Using the Common Crawl corpus, its developers created a dataset with 2.5 terabytes of text; they then trained an encoder with MLM on this dataset. Since the dataset only contains data without parallel texts (i.e., translations), the TLM objective of XLM was dropped. 
This approach outperforms XLM and multilingual BERT variants by a large margin, especially on low-resource languages <ref type="bibr" target="#b14">[15]</ref>.</p><p>The objective pursued through this cross-lingual approach is to utilize the same model across different languages, as the resulting linguistic representations generalize well across languages for various downstream tasks, such as classification in our case. To this end, we have fine-tuned the multilingual models in English, German, and Italian, and applied them to the rest of the languages available in Task2, Arabic and Bulgarian.</p></div>
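To make the masked language modeling objective mentioned above concrete, here is a deliberately simplified sketch of the token-corruption step: a fraction of tokens is replaced by a mask symbol and the model must reconstruct the originals. (Real BERT-style MLM also keeps or randomizes some of the selected tokens rather than always masking them; the 15% rate and the `[MASK]` symbol are the conventional choices, assumed here.)

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Randomly replace ~15% of tokens with a mask symbol, as in the MLM
    pretraining objective; returns the corrupted sequence and a mapping
    from masked positions to the original tokens the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets
```

Because this objective needs only raw monolingual text, it can be applied jointly to a corpus spanning over one hundred languages, which is what makes the shared multilingual representations (and thus Zero-Shot Cross-Lingual transfer) possible without parallel data.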
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>In the initial phase of this research, we focused on the English language, applying fine-tuning to various monolingual models with the aim of achieving optimal results as measured by the macro F1 metric, as outlined in the guidelines for Task2. We selected three distinct Transformer-based models for this purpose: DeBERTa Large, RoBERTa Large, and BART Large MNLI. We used Kaggle as the platform for training these models. The results of this process are presented in Table <ref type="table" target="#tab_1">3</ref>.</p><p>The models DeBERTa v3 Large and RoBERTa Large yield very similar results for the English language, with the best result being achieved by RoBERTa Large, scoring 0.74 on the test dataset. A much larger model, BART Large MNLI, which in principle should be capable of a greater understanding of language, performs worse, likely due to the dataset size not allowing it to generalize the characteristics of subjective language. As this model does not have an equivalent multilingual model, we have discarded it for the subsequent phases of the research. In any case, all trained models significantly outperform the baseline result provided for Task2 in English.</p><p>In the second phase of the research, we fine-tuned the multilingual models equivalent to the models selected in Phase 1 on a training dataset composed of the union of the data provided in Task2 for English, Italian, and German languages. Given the increased size of the training dataset, we used the base models, which are smaller in size, instead of the large models. Therefore, we replaced DeBERTa v3 Large with MDeBERTa v3 Base, and instead of RoBERTa Large, we used XLM-RoBERTa Base. As we can observe in Table <ref type="table" target="#tab_2">4</ref>, in all cases, the MDeBERTa v3 Base model outperforms the XLM-RoBERTa Base by a wide margin. 
In the case of the English language, we narrowly missed surpassing the result obtained by RoBERTa Large in the previous phase, but we matched the result obtained by DeBERTa v3 Large with a base model. The results obtained in the German and Italian languages are noteworthy, where we achieved scores of 0.85 and 0.83 respectively, significantly surpassing the baseline provided by Task2 for these languages.</p><p>In order to ensure the reproducibility of the results obtained with both the monolingual and multilingual approaches, Table <ref type="table" target="#tab_4">6</ref> displays the models, training dataset, and hyperparameters used to train the models that achieved the best results when applied to the Final Test Dataset.</p><p>Finally, we sought to verify the Zero-Shot Cross-Lingual properties of both models by applying the models trained with the English, Italian, and German language datasets to the test datasets for the Bulgarian and Arabic languages without any specific fine-tuning for them.</p><p>We can observe in Table <ref type="table" target="#tab_3">5</ref> that for both Arabic and Bulgarian languages, the results obtained in each case are worse than the baseline provided for both languages by Task2. Therefore, we must conclude that for subjectivity detection, there is no significant transfer of learning from one language to others without having seen examples in the second language during training. Consequently, we cannot rely on this feature of multilingual models for subjectivity detection in low-resource languages.</p><p>We believe that there could be several reasons why cross-lingual transfer has not worked, which should be analyzed in greater depth in subsequent studies. 
Lauscher <ref type="bibr" target="#b20">[20]</ref> highlights the pretraining corpus size of the target language and the structural similarity between languages as the main factors for the success of cross-lingual transfer.</p><p>In the final ranking for Task2, we achieved second position out of a total of 15 participating teams in the English language, with a final Macro F1 score of 0.7372 and a SUBJ F1 score of 0.58. In Arabic, we obtained fifth position out of a total of 7 participating teams, with a Macro F1 score of 0.4551 and a SUBJ F1 score of 0.25.</p></div>
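For reproducibility, the Table 6 settings can also be written down as configuration dicts. The key names below follow Hugging Face TrainingArguments naming conventions as an assumption; Table 6 reports the values but not the exact training API used.

```python
# Training setups from Table 6 for the best-performing runs, expressed as plain
# dicts that could be fed to a Trainer-style API (key names are assumed, values
# are as reported in Table 6).
best_monolingual = {
    "model": "FacebookAI/roberta-large",
    "training_dataset": "Train_EN",
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
    "warmup_steps": 500,
}

best_multilingual = {
    "model": "microsoft/mdeberta-v3-base",
    "training_dataset": "Train_EN+Train_IT+Train_DE",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "warmup_steps": 200,
}
```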
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>Our contribution to the CheckThat!2024 Lab Task2 on Subjectivity <ref type="bibr" target="#b0">[1]</ref> aimed to determine, based on the provided datasets, whether it is possible to use the Zero-Shot Cross-Lingual feature of multilingual models to detect subjectivity in low-resource languages. The conclusion we reached is that it is not possible. However, given that this is a widespread problem that applies to all languages, we believe it would be interesting to continue investigating other non-multilingual Transformer-based approaches to help detect subjectivity in low-resource languages. Although the answer to our research question was negative, during the research process, we fine-tuned an MDeBERTa v3 Base model that achieved second place for English in Task2, with a score of 0.7372. It also achieved excellent results for German and Italian, with scores of 0.85 and 0.83, respectively, although we did not actively participate in the competition for these languages. As future lines of work, we propose adding the Bulgarian and Arabic datasets, which we did not use to train the MDeBERTa v3 Base model, to see whether adding more languages improves the model. It would also be relevant to analyze the use of downsampling and oversampling techniques to mitigate the class imbalance between objective and subjective sentences present in the available datasets.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Distribution of the number of words per sentence for each of the languages considered in Task2.</figDesc><graphic coords="4,159.94,170.69,275.40,206.55" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Datasets and Distribution of classes</figDesc><table><row><cell>English:</cell><cell>Objective</cell><cell>Subjective</cell><cell>Total</cell></row><row><cell>Train</cell><cell>532 (64.10%)</cell><cell>298 (35.90%)</cell><cell>830</cell></row><row><cell>Dev</cell><cell>106 (48.40%)</cell><cell>113 (51.60%)</cell><cell>219</cell></row><row><cell>Dev Test</cell><cell>116 (47.74%)</cell><cell>127 (52.26%)</cell><cell>243</cell></row><row><cell>Italian:</cell><cell>Objective</cell><cell>Subjective</cell><cell>Total</cell></row><row><cell>Train</cell><cell cols="2">1231 (76.32%) 382 (23.68%)</cell><cell>1613</cell></row><row><cell>Dev</cell><cell>167 (73.57%)</cell><cell>60 (26.43%)</cell><cell>227</cell></row><row><cell>Dev Test</cell><cell>323 (73.41%)</cell><cell>117 (26.59%)</cell><cell>440</cell></row><row><cell>German:</cell><cell>Objective</cell><cell>Subjective</cell><cell>Total</cell></row><row><cell>Train</cell><cell>492 (61.50%)</cell><cell>308 (38.50%)</cell><cell>800</cell></row><row><cell>Dev</cell><cell>123 (61.50%)</cell><cell>77 (38.50%)</cell><cell>200</cell></row><row><cell>Dev Test</cell><cell>194 (66.67%)</cell><cell>97 (33.33%)</cell><cell>291</cell></row><row><cell>Bulgarian:</cell><cell>Objective</cell><cell>Subjective</cell><cell>Total</cell></row><row><cell>Train</cell><cell>406 (55.69%)</cell><cell>323 (44.31%)</cell><cell>729</cell></row><row><cell>Dev</cell><cell>59 (55.66%)</cell><cell>47 (44.34%)</cell><cell>106</cell></row><row><cell>Dev Test</cell><cell>116 (55.77%)</cell><cell>92 (44.23%)</cell><cell>208</cell></row><row><cell>Arabic:</cell><cell>Objective</cell><cell>Subjective</cell><cell>Total</cell></row><row><cell>Train</cell><cell>905 (76.37%)</cell><cell>280 (23.63%)</cell><cell>1185</cell></row><row><cell>Dev</cell><cell>227 (76.43%)</cell><cell>70 (23.57%)</cell><cell>297</cell></row><row><cell>Dev 
Test</cell><cell>363 (81.57%)</cell><cell>82 (18.43%)</cell><cell>445</cell></row><row><cell cols="2">Multilingual: Objective</cell><cell>Subjective</cell><cell>Total</cell></row><row><cell>Train</cell><cell cols="3">3568 (69.16%) 1591 (30.84%) 5159</cell></row><row><cell>Dev</cell><cell>250 (50.00%)</cell><cell>250 (50.00%)</cell><cell>500</cell></row><row><cell>Dev Test</cell><cell>250 (50.00%)</cell><cell>250 (50.00%)</cell><cell>500</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Results of the Monolingual Models trained in EN and applied to the Final Test dataset for EN.</figDesc><table><row><cell></cell><cell cols="2">F1 Macro SUBJ F1</cell></row><row><cell>Baseline EN</cell><cell>0.63</cell><cell>0.45</cell></row><row><cell>DeBERTa V3 Large</cell><cell>0.73</cell><cell>0.60</cell></row><row><cell>RoBERTa Large</cell><cell>0.74</cell><cell>0.59</cell></row><row><cell cols="2">BART Large MNLI 0.69</cell><cell>0.51</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>Results of the Multilingual Models trained in EN+IT+DE and applied to the Final Test datasets for EN,IT,DE.</figDesc><table><row><cell></cell><cell></cell><cell>Baseline</cell><cell cols="4">MDeBERTa V3 Base XLM-RoBERTa Base</cell></row><row><cell></cell><cell cols="6">F1 Macro SUBJ F1 F1 Macro SUBJ F1 F1 Macro SUBJ F1</cell></row><row><cell cols="2">English 0.63</cell><cell>0.45</cell><cell>0.73</cell><cell>0.58</cell><cell>0.69</cell><cell>0.50</cell></row><row><cell cols="2">German 0.69</cell><cell>0.63</cell><cell>0.85</cell><cell>0.80</cell><cell>0.82</cell><cell>0.75</cell></row><row><cell>Italian</cell><cell>0.63</cell><cell>0.50</cell><cell>0.83</cell><cell>0.74</cell><cell>0.65</cell><cell>0.43</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5</head><label>5</label><figDesc>Results of the Multilingual Models trained in EN+IT+DE and applied to the Final Test datasets for AR, BG.</figDesc><table><row><cell></cell><cell></cell><cell>Baseline</cell><cell cols="4">MDeBERTa V3 Base XLM-RoBERTa Base</cell></row><row><cell></cell><cell cols="6">F1 Macro SUBJ F1 F1 Macro SUBJ F1 F1 Macro SUBJ F1</cell></row><row><cell>Arabic</cell><cell>0.49</cell><cell>0.40</cell><cell>0.48</cell><cell>0.29</cell><cell>0.45</cell><cell>0.23</cell></row><row><cell cols="2">Bulgarian 0.75</cell><cell>0.72</cell><cell>0.69</cell><cell>0.61</cell><cell>0.64</cell><cell>0.53</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6</head><label>6</label><figDesc>Hyperparameters for the best performing models</figDesc><table><row><cell cols="2">Hyperparameter Best Monolingual Model</cell><cell>Best Multilingual Model</cell></row><row><cell>Model</cell><cell>FacebookAI/roberta-large</cell><cell>microsoft/mdeberta-v3-base</cell></row><row><cell>Training Dataset</cell><cell>Train_EN</cell><cell>Train_EN+Train_IT+Train_DE</cell></row><row><cell>Num Train Epochs</cell><cell>5</cell><cell>3</cell></row><row><cell>Train Batch Size</cell><cell>8</cell><cell>16</cell></row><row><cell>Eval Batch Size</cell><cell>8</cell><cell>8</cell></row><row><cell>Learning Rate</cell><cell>5e-5</cell><cell>2e-5</cell></row><row><cell>Weight Decay</cell><cell>0.01</cell><cell>0.01</cell></row><row><cell>Warmup Steps</cell><cell>500</cell><cell>200</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research was made possible through the funding of Project TED2021-130810B-C22 by the Ministry of Science and Innovation of the Government of Spain.</p><p>The authors express their gratitude to the Smart Society Research Group at La Salle Engineering, Universitat Ramon Llull, the Digilab Research Group at Blanquerna, Universitat Ramon Llull and the anonymous reviewers for their insightful comments. They also extend their thanks to the organizers of the CheckThat!2024 Lab Task2 <ref type="bibr" target="#b0">[1]</ref> for making this lab possible.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the CLEF-2024 CheckThat! lab task 2 on subjectivity in news articles</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dimitrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Galassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pachov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Siegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaghouani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>García Seco de Herrera</surname></persName>
		</editor>
		<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Networked Communication. People are the Message</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cardoso</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>Editora Mundos Sociais</publisher>
			<pubPlace>Lisboa</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Nielsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganter</surname></persName>
		</author>
		<idno type="DOI">10.1093/oso/9780190908850.001.0001</idno>
		<title level="m">The Power of Platforms: Shaping Media and Society</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Journalistic power: Constructing the &quot;truth&quot; and the economics of objectivity</title>
		<author>
			<persName><forename type="first">G</forename><surname>Canella</surname></persName>
		</author>
		<idno type="DOI">10.1080/17512786.2021.1914708</idno>
		<ptr target="https://doi.org/10.1080/17512786.2021.1914708" />
	</analytic>
	<monogr>
		<title level="j">Journalism Practice</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="209" to="225" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Evolving journalism norms: objective, interpretive and fact-checking journalism</title>
		<author>
			<persName><forename type="first">J</forename><surname>Birks</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Routledge companion to political journalism</title>
				<meeting><address><addrLine>London</addrLine></address></meeting>
		<imprint>
			<publisher>Routledge</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="62" to="71" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness</title>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Przybyła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lipani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>McDonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham, Switzerland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="449" to="458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<title level="m">Sentiment Analysis and Opinion Mining</title>
				<imprint>
			<publisher>Morgan and Claypool Publishers</publisher>
			<date type="published" when="2012-05">May 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Distinguishing between facts and opinions for sentiment analysis: Survey and challenges</title>
		<author>
			<persName><forename type="first">I</forename><surname>Chaturvedi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cambria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>Welsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Herrera</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:46764901" />
	</analytic>
	<monogr>
		<title level="j">Inf. Fusion</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="65" to="77" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets</title>
		<author>
			<persName><forename type="first">M</forename><surname>Geva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Berant</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1107</idno>
		<ptr target="https://aclanthology.org/D19-1107" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</editor>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1161" to="1166" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A corpus for sentence-level subjectivity detection on English news articles</title>
		<author>
			<persName><forename type="first">F</forename><surname>Antici</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Galassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Korre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Muti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fedotova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.25" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="273" to="285" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Cross-lingual transfer in zero-shot cross-language entity linking</title>
		<author>
			<persName><forename type="first">E</forename><surname>Schumacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mayfield</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dredze</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-acl.52</idno>
		<ptr target="https://aclanthology.org/2021.findings-acl.52" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Zong</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="583" to="595" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">DeBERTa: Decoding-enhanced BERT with disentangled attention</title>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=XPZIaotutsD" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.09543</idno>
		<title level="m">DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:198953378" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Natural Language Processing with Transformers</title>
		<author>
			<persName><forename type="first">L</forename><surname>Tunstall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Von Werra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>O&apos;Reilly Media, Inc</publisher>
			<pubPlace>Sebastopol, CA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Sentence-BERT: Sentence embeddings using Siamese BERT-networks</title>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1410</idno>
		<ptr target="https://aclanthology.org/D19-1410" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</editor>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3982" to="3992" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.703</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.703" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7871" to="7880" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.00161</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:202540839" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Unsupervised cross-lingual representation learning at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.747</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.747" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="8440" to="8451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lauscher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ravishankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Vulić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Glavaš</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.363</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.emnlp-main.363" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
