OpenFact at CheckThat! 2024: Cross-Lingual Transfer Learning for Check-Worthiness Detection
Notebook for the CheckThat! Lab at CLEF 2024

Marcin Sawiński1,*, Krzysztof Węcel1 and Ewelina Księżniak1
1 Department of Information Systems, Poznań University of Economics and Business, Al. Niepodległości 10, 61-875 Poznań, Poland

Abstract
This paper presents the results of the OpenFact team’s experiments in the CLEF 2024 CheckThat! Lab Task 1 competition for multilingual, unimodal check-worthiness detection. Several mono- and multilingual pre-trained language models were fine-tuned using different variants of the training datasets. Cross-lingual transfer learning was applied without instance transfer and proved to be effective for Arabic and Dutch. Additionally, we tested the effectiveness of class balancing using several under-sampling methods, which, when combined with appropriate model selection and cross-lingual transfer learning, produced the second-best results for Arabic and English.

Keywords
check-worthiness, fact-checking, fake news detection, language models, cross-lingual transfer learning, BERT

1. Introduction
Check-worthiness detection refers to the process of determining which statements should be fact-checked based on their potential influence and the probability of being incorrect. This paper describes the experiments conducted as part of the preparations for the CheckThat! Lab, Task 1 for Arabic, English, and Dutch at CLEF 2024, where the task was framed as a binary text classification problem. The study predominantly investigated the effectiveness of cross-lingual transfer learning applied through multilingual pre-trained language models, together with optimal dataset preparation. The comparison of multilingual pre-trained language models with monolingual models revealed that multilingual models can perform equally well or even outperform monolingual models when fine-tuned on monolingual training datasets, and can additionally improve performance when fine-tuned with multilingual datasets.
Our results for check-worthiness detection at CheckThat! Lab in 2023 showed a significant impact of dataset sampling. Previous experiments demonstrated that the under-sampling method, which boosted the performance of a fine-tuned GPT-3 model from an F1 score of 0.826 to 0.898 for the positive class, did not consistently yield the same improvements for BERT models. This study collected more observations with the aim of analyzing this phenomenon. The dataset preparation experiments focused on attempts to improve upon random under-sampling by introducing additional methods based on training dynamics [1]. The experiments resulted in check-worthiness detection methods that were ranked as the second-best for Arabic and English on the leaderboard for CheckThat! Lab, Task 1 in 2024.1

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
marcin.sawinski@ue.poznan.pl (M. Sawiński); krzysztof.wecel@ue.poznan.pl (K. Węcel); ewelina.ksiezniak@ue.poznan.pl (E. Księżniak)
https://kie.ue.poznan.pl/en/ (M. Sawiński)
ORCID: 0000-0002-1226-4850 (M. Sawiński); 0000-0001-5641-3160 (K. Węcel); 0000-0003-1953-8014 (E. Księżniak)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://checkthat.gitlab.io/clef2024/task1/
2. Related Work
In previous editions of CheckThat! Lab [2][3], many methods were proposed to solve the check-worthiness detection task on text data. In 2023, the dominant method involved the application of pre-trained language models fine-tuned for the classification task. For English, the best score was achieved by team OpenFact [4], using the GPT-3 curie model fine-tuned on an under-sampled training dataset. However, DeBERTa V3 performed only marginally worse. Under-sampling was performed using an additional annotation quality flag derived from the ClaimBuster dataset.2
Other teams used monolingual models: BERT (Fraunhofer SIT [5], CSECU-DSG [6]), RoBERTa (Accenture [7]), GigaBERT (Accenture), MARBERT (ES-VRAI [8]), a feed-forward neural network trained on embeddings (Z-Index [9]), and multilingual models: XLM-RoBERTa (DSHacker [10]) and Twitter XLM-RoBERTa (CSECU-DSG). The models were mostly used for sequence classification, but other methods were also applied: ensemble learning with model souping (Fraunhofer SIT [5]) and a BiLSTM module handling long-term contextual dependencies combined with multisample dropout (CSECU-DSG [6]). Dataset curation included back-translation (Accenture [7]), under-sampling (ES-VRAI [8], OpenFact [4]), instance transfer (DSHacker [10]), and paraphrasing with GPT-3.5 (DSHacker [10]).
We drew the conclusion that complex model setups were not critical to achieving the best results and that a well-performing BERT-family model could achieve top results provided a sufficiently good training dataset. Another observation was that dataset augmentations, despite showing improvements over the baseline, might be outperformed by under-sampling. The last finding from the analysis of previous submissions was that multilingual models could perform equally well or better than single-language models.
A survey on offensive language detection [11], a task that shares some similarities with check-worthiness detection, presents many options for leveraging domain knowledge from high-resource languages for low-resource languages by using Cross-Lingual Transfer Learning (CLTL). The first category of CLTL, Instance Transfer, includes the transfer of text or label information between source and target languages. In Text-Based Transfer (applied by DSHacker and Accenture in 2023), machine translation is most often used. For the purposes of this research, neither Label-Based Transfer (annotation projection and pseudo-labeling) nor Text Alignment methods are relevant because the data for all languages included in the competition, although scarce, come with labels. The next category, Feature Transfer, covers methods that extract linguistic features from source and target languages (e.g., using Multilingual Word Embeddings) and align them into a shared feature space. These methods are applicable to the check-worthiness detection task, but they were not used in our experiments. Parameter Transfer relies on transferring distributions of parameters between languages within one model or across separate models. Multilingual pre-trained language models are fundamental for this method, as they are pre-trained on vast datasets in many languages, sharing semantic representations across languages. We decided to focus our experiments on this CLTL method to analyze the performance of multilingual models fine-tuned on the multilingual datasets provided by the CheckThat! Lab organizers.

2 https://zenodo.org/record/3836810
3. Methodology
The study focused on the application of cross-lingual transfer learning to find the best-performing solution for check-worthiness detection in Arabic, Dutch, and English. Specific research questions were formulated:
• RQ1. What was the contribution to the final score of specific features of the ClaimBuster 1:2 dataset used to create the best-performing method in the 2023 CheckThat! Lab Task 1b?
• RQ2. How effective are multilingual pre-trained language models compared to monolingual models?
• RQ3. How can cross-lingual transfer be leveraged to improve check-worthiness detection using training data in multiple languages?
• RQ4. Is it possible to outperform random under-sampling with methods informed by annotation quality or training dynamics?
The first research question stems from the uncertainty surrounding the root causes of the effectiveness of the dataset curation applied in the winning method for English in the 2023 CheckThat! Lab Task 1b. The dataset reduced the class imbalance but did not completely eliminate it (with a 1:2 ratio of positive to negative examples), and some lower-quality examples were filtered out. We observed an inconsistent impact during the training process: some models produced much better results (e.g., the F1 score of the winning method, a fine-tuned GPT-3 curie model, increased by 0.072), while others remained unchanged or even worsened. The experiments were planned to isolate the impact of class balancing, the removal of low-quality examples, and the variability arising from random model parameter initialization.
The second research question aims at measuring the gap between monolingual and multilingual models, highlighting any potential performance loss when using the latter. The third research question focuses on utilizing cross-lingual transfer not only for low-resource languages but also for improving high-resource languages by combining data from around the globe. Check-worthiness detection is part of the fact-checking process, which in many cases is global. Fake news and narratives cross geographical and language barriers. Consequently, models trained on multilingual data could potentially outperform monolingual models, even for high-resource languages. The goal of the fourth research question is to design proxy measures that would allow for the creation of a high-quality training dataset even when an explicit annotation quality feature is not available.
The study consists of three parts:
1. Finding the best monolingual model to use as a baseline.
2. Preparing multilingual training dataset variants.
3. Training and evaluating mono- and multilingual models on the prepared datasets.
The study required multiple model training runs for various models, dataset preparation variants, and different random seeds to allow for more accurate comparisons of results (see the sketch below). Each training run was evaluated using the loss metric or the F1 score for the positive class, and tested using the F1 score for the positive class. The phases of the experiments included:
1. Testing single-language models using unaltered datasets.
2. Testing cross-lingual transfer learning using various concatenations of datasets.
3. Testing the impact of various structural changes to the training datasets.
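The comparison protocol across these runs can be summarized with the following minimal sketch. It assumes a hypothetical train_and_evaluate(model_name, dataset_variant, seed) helper (not part of this paper) that fine-tunes one configuration and returns the positive-class F1 score on dev_test; the model names and dataset variant codes are taken from Sections 4 and 5, while the seed values are purely illustrative.

    # Sketch of the experiment grid; train_and_evaluate is a hypothetical helper.
    from statistics import mean

    MODELS = ["microsoft/mdeberta-v3-base", "FacebookAI/xlm-roberta-base"]
    VARIANTS = ["ar", "ar+nl", "ar+en+es+nl(x5)"]   # dataset variants (Section 5)
    SEEDS = [13, 42, 1234, 2024]                    # illustrative seed values only

    results = {}
    for model_name in MODELS:
        for variant in VARIANTS:
            scores = [train_and_evaluate(model_name, variant, seed) for seed in SEEDS]
            # The result tables later report both the best and the average run
            # per configuration (max / mean columns).
            results[(model_name, variant)] = {"max": max(scores), "mean": mean(scores)}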
Our team achieved the best score in CheckThat! Lab Subtask-1B in English in 2023 [4] using a fine-tuned GPT-3 model; however, the results obtained by the DeBERTa V3 model were only marginally worse. Considering that the end goal of check-worthiness detection is large-scale application, resource consumption is a critical factor for the actual method selection. Given the significantly lower resources needed to run BERT models compared to GPT-3, we decided to limit this study to BERT models and to maximize model performance within this constraint.

4. Models
We made an initial selection of BERT-family models to use for the sequence classification task. We were not able to test all available models, nor to establish an objectively verifiable ranking; instead, we included a selection of mono- and multilingual models. This subjective selection was based on a preference for the largest, most recent, or best-performing models according to benchmarks or previous editions of the CLEF CheckThat! Lab.
For the English subtask, we tested two English models:
• DeBERTa V3 base (microsoft/deberta-v3-base),3
• DeBERTa V3 large (microsoft/deberta-v3-large).4
DeBERTa V3 base scored 0.894 in CheckThat! Lab Subtask-1B in English in 2023 [4], only 0.004 less than the winning GPT-3 model but still 0.006 better than the second-best team [12]. Adding a larger version of the same model was expected to yield even better results.
For the Arabic subtask, we tested three variants of CAMeLBERT in order to choose the model best suited to the dataset – Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA):
• CAMeLBERT MSA (CAMeL-Lab/bert-base-arabic-camelbert-msa),5
• CAMeLBERT DA (CAMeL-Lab/bert-base-arabic-camelbert-da),
• CAMeLBERT CA (CAMeL-Lab/bert-base-arabic-camelbert-ca).
For the Dutch subtask, we selected two models:
• RobBERT 2023 large (DTAI-KULeuven/robbert-2023-dutch-large),6
• BERTje (GroNLP/bert-base-dutch-cased).7
Results from CheckThat! Lab Subtask-1B [3] indicated that multilingual models also have the potential to achieve top results. We decided to include two multilingual models in our experiments:
• mDeBERTa V3 base (microsoft/mdeberta-v3-base),8
• XLM-RoBERTa base (FacebookAI/xlm-roberta-base).9

3 https://huggingface.co/microsoft/deberta-v3-base
4 https://huggingface.co/microsoft/deberta-v3-large
5 https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa
6 https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-large
7 https://huggingface.co/GroNLP/bert-base-dutch-cased
8 https://huggingface.co/microsoft/mdeberta-v3-base
9 https://huggingface.co/FacebookAI/xlm-roberta-base

Due to time and resource constraints, we were not able to extensively search for optimal hyperparameter values. We decided to use preselected values and instead tested multiple variants of the training dataset. We monitored the learning curves to ensure that the models did not under-fit and applied early stopping to avoid overfitting. We used a step-wise evaluation strategy (rather than epoch-based evaluation) with a maximum of 5000 training steps; a sketch of this training setup is shown below.
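A minimal sketch of such a fine-tuning run with the Hugging Face transformers Trainer is given below. The learning rate (1e-5), the 5000-step budget, step-wise evaluation, early stopping, and the positive-class F1 selection metric follow the description above; the batch size, evaluation interval, and early-stopping patience are not reported in this paper and are illustrative values only, and the two-example toy dataset merely stands in for the real tokenized splits.

    import numpy as np
    from datasets import Dataset
    from sklearn.metrics import f1_score
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              EarlyStoppingCallback, Trainer, TrainingArguments)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        # Positive-class F1 is used for model selection and testing.
        return {"f1_pos": f1_score(labels, preds, pos_label=1)}

    model_name = "microsoft/mdeberta-v3-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Toy stand-in for the tokenized train/dev splits prepared from the Lab data.
    raw = Dataset.from_dict({"text": ["Example claim one.", "Another sentence."],
                             "label": [1, 0]})
    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    train_ds = dev_ds = raw.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="out",
        learning_rate=1e-5,              # value used for the best runs reported below
        max_steps=5000,                  # step-wise training budget instead of epochs
        evaluation_strategy="steps",     # "eval_strategy" in newer transformers releases
        eval_steps=250,                  # illustrative value
        save_strategy="steps",
        save_steps=250,
        load_best_model_at_end=True,
        metric_for_best_model="f1_pos",
        per_device_train_batch_size=16,  # illustrative value
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # illustrative patience
    )
    trainer.train()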
5. Datasets
5.1. Datasets Overview
CheckThat! Lab in 2024 provided participants with four datasets in Arabic, Dutch, English, and Spanish. Each dataset contained train, dev, and dev_test splits. For Arabic, Dutch, and English, a test split was also provided for use in the submission. The count of examples revealed that the Dutch dataset contained significantly fewer examples than the others (see Table 1) and that the positive class is underrepresented in all datasets (see Table 2). Results in previous editions of CheckThat! Lab inspired us to explore various sampling methods informed by data quality and training dynamics measures.
English Dataset. Analysis revealed that examples in the train and dev splits originated from the ClaimBuster dataset. A lookup on the ClaimBuster files indicated that the train split was fully annotated by crowd-sourcing, while the dev split was annotated by experts (the so-called ground-truth dataset in ClaimBuster). The dev_test split was equal to the test split delivered in the 2023 edition of CheckThat! Lab, but its origins are unknown. The test split was not matched with any existing dataset.
Arabic, Dutch, Spanish Datasets. The data structure revealed that examples were collected from Twitter, but the datasets were not matched with any existing datasets.

Table 1
Number of examples per language and dataset split
Language    train   dev    dev_test   test    Total
Arabic       7333   1093        500    610     9536
Dutch         995    252        666   1000     2913
English     22501   1032        318    341    24192
Spanish     19948   5000       5000      -    29948
Total       50777   7377       6484   1951    66589

Table 2
Positive class ratios per language and dataset split
Language    train   dev    dev_test   test    Total
Arabic       0.31   0.38       0.75   0.36     0.34
Dutch        0.41   0.40       0.47   0.40     0.42
English      0.24   0.23       0.34   0.26     0.24
Spanish      0.16   0.14       0.10      -     0.14
Total        0.22   0.20       0.20   0.36     0.22

5.2. Dataset Variants
5.2.1. Monolingual Dataset Variants
In the first phase, the original dataset splits were used to train language-specific models. The three main baseline dataset variants were:
• Arabic train,
• Dutch train,
• English train.
For evaluation, the original dev splits were used, and the dev_test splits were used to calculate the F1 score (positive class) of each trained model.
5.2.2. Multilingual Dataset Variants
In the second phase, we planned experiments with cross-lingual transfer learning using six multilingual train datasets. New dataset variants were created by concatenating the train splits of the single-language datasets. Similarly, the dev splits were concatenated to create multilingual evaluation datasets. All dev_test splits were used individually to calculate the F1 score. The concatenation variants included:
• Full multilingual – concatenation of Arabic, Dutch, English, and Spanish (later referred to as ar+en+es+nl).
• Twitter multilingual – concatenation of Arabic, Dutch, and Spanish (later referred to as ar+es+nl).
• Twitter bilingual – concatenation of Arabic and Dutch (later referred to as ar+nl).
We noticed a significant disproportion in the size of the train datasets: Dutch (995 examples) vs Arabic (7333), English (22501), and Spanish (19948). To address this issue, we created over-sampled versions of the datasets with the Dutch examples sampled three times for ar+nl(x3) and ar+es+nl(x3), and five times for ar+en+es+nl(x5), as sketched below.
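A minimal sketch of how such concatenated variants can be assembled with the Hugging Face datasets library is given below. The variant codes and over-sampling factors follow the description above; the per-language splits are represented here by toy stand-ins, since loading the actual Lab files is not shown in this paper.

    from datasets import Dataset, concatenate_datasets

    # Toy stand-ins for the per-language train splits (the real ones come from the Lab files).
    splits = {
        lang: Dataset.from_dict({"text": [f"{lang} example 1", f"{lang} example 2"],
                                 "label": [1, 0]})
        for lang in ("ar", "nl", "en", "es")
    }

    def oversample(ds, factor):
        # Repeat a (small) dataset `factor` times, e.g. Dutch x3 or x5.
        return concatenate_datasets([ds] * factor)

    variants = {
        "ar+nl":           concatenate_datasets([splits["ar"], splits["nl"]]),
        "ar+nl(x3)":       concatenate_datasets([splits["ar"], oversample(splits["nl"], 3)]),
        "ar+es+nl":        concatenate_datasets([splits["ar"], splits["es"], splits["nl"]]),
        "ar+es+nl(x3)":    concatenate_datasets([splits["ar"], splits["es"], oversample(splits["nl"], 3)]),
        "ar+en+es+nl":     concatenate_datasets([splits[l] for l in ("ar", "en", "es", "nl")]),
        "ar+en+es+nl(x5)": concatenate_datasets([splits["ar"], splits["en"], splits["es"],
                                                 oversample(splits["nl"], 5)]),
    }
    # The dev splits are concatenated in the same way for evaluation, while each
    # language's dev_test split is kept separate for per-language F1 scoring.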
5.2.3. Filtering by Annotation Quality, Correctness, and Random Under-Sampling
Previously, we observed a significant improvement from balancing class counts, so we added a train dataset variant with random under-sampling applied. Another variant involved reshuffling examples in the train and dev splits before applying random under-sampling. Our previous research [4] showed that for English, the annotation quality differed between train (crowd-sourced labels) and dev (ground truth annotated by experts), and this difference could impact the training process. For English, the aim of reshuffling was to test whether adding some higher-quality examples to the train set from dev, combined with adding some lower-quality examples to dev from train, would affect the results. The preparation process consisted of three steps:
1. Concatenation of the train and dev splits into a single dataset.
2. Random split into new train and dev subsets with an 8:2 ratio.
3. Random under-sampling of the new train and dev sets to achieve equal class counts.
As a result, three sets of dataset variants were created: Original (full training datasets), RUS (random under-sampling applied), and RUS & new split (random under-sampling applied after joining train and dev and splitting again). The total number of available training datasets for cross-lingual transfer learning was 30 (4 monolingual datasets and 6 multilingual, each in Original, RUS, and RUS & new split versions). Not all variants were used in the experiments due to resource considerations and their limited potential for improving the results.
In the third phase, seven additional train dataset variants were created for the English dataset. Leveraging additional information about annotation quality derived from the ClaimBuster dataset, individual examples were assigned a High or Low quality flag. The authors of the ClaimBuster dataset, used for creating the English train dataset, introduced screening criteria to exclude low-quality labels and published three filtered datasets with class ratios of 1:2, 1:2.5, and 1:3.10 The most balanced, 1:2 dataset was used directly in the experiment (referred to as High Quality 1:2). Additionally, we derived a new High Quality flag that was assigned to all examples included in any of the three mentioned ClaimBuster datasets. Analogously, examples not included in any of the aforementioned datasets were flagged as Low Quality. On top of that, we used a separate Ground Truth flag indicating examples annotated by experts, while all other examples were annotated using a crowd-sourcing approach. As a result, eight English train datasets based on quality were created and are later referred to as:
• Original – Unmodified English train dataset from CheckThat! Lab 2024.
• Ground Truth – Selected examples annotated by experts.
• High Quality – Examples included in ClaimBuster files screened for quality.
• Low Quality – Examples excluded from ClaimBuster files screened for quality.
• Original and GT (Ground Truth) – Concatenation of 0.8 of Original and Ground Truth examples (0.2 hold-out for evaluation).
• High Quality and GT (Ground Truth) – Concatenation of 0.8 of High Quality and Ground Truth examples (0.2 hold-out for evaluation).
• Low Quality and GT (Ground Truth) – Concatenation of 0.8 of Low Quality and Ground Truth examples (0.2 hold-out for evaluation).
• High Quality 1:2 – 0.8 of examples included in the ClaimBuster 1:2 file (0.2 hold-out for evaluation).
Additionally, we trained the DeBERTa V3 base model on all examples (concatenated train and dev) for 5 epochs and collected logits after each epoch to calculate training dynamics metrics: variability, confidence, and correctness, as described by Swayamdipta et al. [1] (a sketch of this computation is given at the end of this subsection). We used the correctness measure to further filter the data: examples were classified as correct (correctness equal to five) or not (correctness less than five). The correctness flag was used to generate an additional set of train datasets by removing examples with correctness less than five. As a final step, we applied random under-sampling (Random US, RUS) to all 16 datasets (eight splits by quality times two by the correctness-equal-to-five flag), producing 32 new final datasets (the order of filtering was quality > correctness > random under-sampling).

10 https://zenodo.org/record/3836810
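A minimal sketch of how these training-dynamics metrics can be derived from the collected logits is shown below, following the definitions of Swayamdipta et al. [1]: confidence is the mean probability assigned to the gold label across epochs, variability is its standard deviation, and correctness is the number of epochs in which the prediction matches the gold label. The array shapes and variable names are assumptions for illustration.

    import numpy as np

    def training_dynamics(epoch_logits, gold_labels):
        """Compute per-example confidence, variability, and correctness.

        epoch_logits : array of shape (n_epochs, n_examples, n_classes),
                       logits collected after each of the 5 training epochs
        gold_labels  : array of shape (n_examples,) with 0/1 labels
        """
        # Softmax over the class dimension to obtain probabilities.
        exp = np.exp(epoch_logits - epoch_logits.max(axis=-1, keepdims=True))
        probs = exp / exp.sum(axis=-1, keepdims=True)

        # Probability of the gold label at each epoch: shape (n_epochs, n_examples).
        gold_probs = probs[:, np.arange(len(gold_labels)), gold_labels]

        confidence = gold_probs.mean(axis=0)    # mean gold-label probability over epochs
        variability = gold_probs.std(axis=0)    # std of gold-label probability over epochs
        predictions = probs.argmax(axis=-1)     # predicted class per epoch
        correctness = (predictions == gold_labels).sum(axis=0)  # 0..5 correct epochs
        return confidence, variability, correctness

    # correctness == 5 marks examples predicted correctly in every epoch; this flag
    # was used to filter the train datasets described above.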
5.2.4. Additional Under-Sampling Methods
We created additional dataset variants using under-sampling methods informed by additional measures. These variants were assigned the following codes:
• DUS – Symmetrically removing the most easy-to-learn and hard-to-learn examples. All majority class examples were sorted in descending order by their ℓ2 distance from the reference point (variability, confidence)=(0.5, 0.5) and removed until the desired class count was reached.
• HUS – First removing all hard-to-learn examples (defined as examples having an ℓ2 distance from (variability, confidence)=(0.5, 0.5) greater than 0.35 while having a confidence < 0.5), and then removing easy-to-learn examples sorted by descending distance from (variability, confidence)=(0.5, 0.5) until the desired class count was reached.
• CUS – First removing all examples from the majority class with correctness less than five, and later, if necessary, randomly choosing examples with correctness equal to five until the desired class count was reached.
The calculation formulas (variability, confidence, and correctness) and the definitions of the regions (easy-to-learn, hard-to-learn, ambiguous) follow [1]. The results were compared to the original dataset (Original) and to random under-sampling (RUS).

6. Experimental Results
6.1. Monolingual Model Selection
Several training runs for the Arabic dataset revealed that the MSA variant of the CAMeLBERT family of models is best suited for the task. The best F1 score (Positive Class) was 0.832 with a learning rate of 1e-05, and this configuration was used for the other experiments (see Figure 1).
[Plot omitted: dev_test F1 (Positive Class) for the three CAMeLBERT variants (msa, da, ca).]
Figure 1: Results of hyperparameter sweep for Arabic models - F1 score (Positive Class).
Training runs for the Dutch dataset revealed that the RobBERT 2023 large model outperformed the BERTje model for the given task. The best F1 score (Positive Class) was 0.671 with a learning rate of 1e-05; however, we decided to include both models in the other experiments (see Figure 2).
[Plot omitted: dev_test F1 (Positive Class) for DTAI-KULeuven/robbert-2023-dutch-large and GroNLP/bert-base-dutch-cased.]
Figure 2: Results of hyperparameter sweep for Dutch models - F1 score (Positive Class).
Training runs for the English dataset revealed that the DeBERTa V3 large model outperformed the DeBERTa V3 base model for the task. The best F1 score (Positive Class) was 0.926 with a learning rate of 1e-05. We decided to mainly use the DeBERTa V3 large model for the other experiments; however, we made some further comparisons with the base model as well (see Figure 3).
[Plot omitted: dev_test F1 (Positive Class) for microsoft/deberta-v3-base and microsoft/deberta-v3-large.]
Figure 3: Results of hyperparameter sweep for English models - F1 score (Positive Class).

6.2. Cross-Lingual Transfer Learning
Experiments for cross-lingual transfer learning followed a similar pattern for all languages. We first tested and compared the performance of monolingual model training on the full datasets (Original), Random Under-Sampling (RUS), and Random Under-Sampling with joined and newly split train and dev (RUS & new split); the construction of the two sampled variants is sketched below. In most cases, the F1 score was higher for RUS and RUS & new split, so we dropped some of the Original variants from the subsequent study to save on compute power.
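For reference, a minimal pandas sketch of the two sampling variants compared here (RUS and RUS & new split, as defined in Section 5.2.3) is given below; the column names, the binary 0/1 labels, and the random seed are assumptions for illustration.

    import pandas as pd

    def random_undersample(df, label_col="label", seed=0):
        # RUS: downsample the majority class to the minority-class count.
        minority_n = df[label_col].value_counts().min()
        parts = [
            group.sample(n=minority_n, random_state=seed) if len(group) > minority_n else group
            for _, group in df.groupby(label_col)
        ]
        return pd.concat(parts).sample(frac=1.0, random_state=seed)  # shuffle

    def rus_new_split(train_df, dev_df, label_col="label", seed=0):
        # "RUS & new split": join train and dev, re-split 8:2, then under-sample both parts.
        full = pd.concat([train_df, dev_df]).sample(frac=1.0, random_state=seed)
        cut = int(0.8 * len(full))
        new_train, new_dev = full.iloc[:cut], full.iloc[cut:]
        return (random_undersample(new_train, label_col, seed),
                random_undersample(new_dev, label_col, seed))

    # Toy usage with made-up data:
    toy = pd.DataFrame({"text": [f"s{i}" for i in range(10)],
                        "label": [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]})
    train_balanced = random_undersample(toy)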
Analysis of the performance of training the monolingual models shows that the results for English (see Table 5) were significantly higher than for Arabic and Dutch (0.932 vs 0.873 and 0.671, respectively; see Tables 3 and 4). For cross-lingual transfer, we decided to test multilingual models trained solely on the English dataset to predict on the dev_test datasets in Arabic and Dutch. Bearing in mind resource utilization, we excluded other possibilities (e.g., testing predictions for English using a model trained solely on Arabic or Dutch) as this was not likely to improve the F1 score. The remaining combinations of dataset variants, models, and sampling methods were applied in model training. We planned four runs with different random seeds for each combination but, due to compute constraints, not all seed values were tested. The result tables present the highest F1 score achieved (max) and the mean (mean) calculated from multiple runs of the same configuration with different random seed values.
6.2.1. Arabic
For Arabic, the highest F1 score for the positive class was achieved by the mDeBERTa V3 base model trained on the largest dataset, which concatenated the Arabic, Dutch, English, and Spanish datasets after applying random under-sampling on the individual datasets and over-sampling the Dutch data five times (ar+en+es+nl(x5)). The maximum F1 score was 0.901, with the mean F1 score from all runs only slightly lower at 0.894. It surpassed the best monolingual Arabic model by 0.028 for the maximum and 0.042 for the mean F1 score (0.873 and 0.852 for CAMeLBERT MSA, see Table 3). It is worth noting that using only Arabic data for training the multilingual model also produced a higher F1 score than the dedicated Arabic model (by 0.021 for the maximum and 0.025 for the mean).
[Box plots omitted: Arabic dev_test F1 (Positive Class) per dataset variant and sampling method; panels for mDeBERTa V3 base, XLM-RoBERTa base, and CAMeLBERT MSA.]
Figure 4: Results of Cross-Lingual Transfer experiments - x-axis presents F1 score (Positive Class) for the Arabic dev_test dataset.
6.2.2. Dutch
For Dutch, the highest F1 score (Positive Class) was also achieved by the mDeBERTa V3 base model, but the optimal training dataset was different. The best performance was achieved using a randomly under-sampled and reshuffled train and dev (RUS & new split) dataset, concatenated from Arabic and Dutch data (ar+nl). This configuration provided the model with the optimal training examples and resulted in both the highest maximum F1 score of 0.714 and the highest mean of all runs at 0.684. Surprisingly, adding more data (English, Spanish) or over-sampling the Dutch examples lowered the F1 score. In this case, cross-lingual transfer surpassed the best monolingual model by 0.036 for the maximum and 0.016 for the mean F1 score (0.678 and 0.668 for RobBERT 2023 large, see Table 4). It is worth noting that using only Dutch data for training the multilingual model yielded lower results than the dedicated Dutch models (by 0.017 for the maximum and 0.018 for the mean).
[Box plots omitted: Dutch dev_test F1 (Positive Class) per dataset variant and sampling method; panels for mDeBERTa V3 base, XLM-RoBERTa base, RobBERT 2023 large, and BERTje.]
Figure 5: Results of Cross-Lingual Transfer experiments - x-axis presents F1 score (Positive Class) for Dutch dev_test dataset.
6.2.3. English
For English, the single highest F1 score (Positive Class) was achieved by the monolingual DeBERTa V3 large model on the randomly under-sampled English dataset (0.932), but the highest mean of all runs was equal for both DeBERTa V3 large and the multilingual mDeBERTa V3 base (0.899). Both results were achieved using only English examples in training. Cross-lingual transfer was not effective in this case (see Table 5 in the appendix).
[Box plots omitted: English dev_test F1 (Positive Class) per dataset variant and sampling method; panels for mDeBERTa V3 base, XLM-RoBERTa base, and DeBERTa V3 large.]
Figure 6: Results of Cross-Lingual Transfer experiments - x-axis presents F1 score (Positive Class) for English dev_test dataset.

6.3. Filtering by Quality and Correctness
Experiments performed on the English dataset variants, filtered by annotation quality and under-sampled, showed a greater impact of class balancing than of structural changes. While filtering by annotation quality was able to improve the F1 score compared to the original dataset, the improvements from class balancing were much more pronounced. The best overall score was achieved by DeBERTa V3 large with random under-sampling on the original dataset, reaching the highest maximum score of 0.95 and a mean score of 0.939. The highest maximum score without random under-sampling was also achieved by DeBERTa V3 large on the original dataset. The highest mean score without random under-sampling was 0.90, achieved by the same model using the High Quality 1:2 dataset. It is important to note that this dataset is more balanced than the original dataset (1:2 vs 1:3.17). Random under-sampling combined with filtering out examples with correctness less than five produced worse results than random under-sampling alone (see Figure 7). Complete results are presented in Table 6 in the appendix.

6.4. Additional Under-Sampling Methods
Experiments with under-sampling methods continued after the submission to the CheckThat! Lab 2024 competition, and many training runs were performed when the test file with labels was already available. In contrast to the previously reported results, this experiment reports the F1 score (Positive Class) on both the dev_test and test datasets. The application of filtering by quality and correctness did not yield improvements when applied as the first step of the processing pipeline, before random under-sampling. In this phase of the experiment, the processing order was changed: all minority (positive) class examples were included in all training runs, and only the majority (negative) class examples were filtered out based on various conditions (referred to as RUS, QUS, DUS, HUS, and CUS, see Section 5.2.4); a sketch of this majority-class filtering is given below.
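The sketch below illustrates how this majority-class filtering could be implemented. It assumes a pandas DataFrame with a binary label column (1 = positive/minority, 0 = negative/majority) and the per-example training-dynamics columns (confidence, variability, correctness) computed as in Section 5.2.3; thresholds follow Section 5.2.4, everything else is illustrative. QUS is not spelled out in Section 5.2.4 and is only noted in a comment as quality-based under-sampling.

    import numpy as np
    import pandas as pd

    REF = (0.5, 0.5)  # reference point (variability, confidence) = (0.5, 0.5)

    def distance(df):
        # l2 distance of each example from the reference point.
        return np.sqrt((df["variability"] - REF[0]) ** 2 + (df["confidence"] - REF[1]) ** 2)

    def undersample_majority(df, method="RUS", label_col="label", seed=0):
        """Keep all minority (positive) examples; filter the majority (negative) class."""
        minority = df[df[label_col] == 1]
        majority = df[df[label_col] == 0]
        target_n = len(minority)  # balance classes 1:1

        if method == "RUS":        # random under-sampling
            kept = majority.sample(n=target_n, random_state=seed)
        elif method == "DUS":      # drop majority examples with the largest distance first
            kept = (majority.assign(d=distance(majority))
                            .sort_values("d").head(target_n).drop(columns="d"))
        elif method == "HUS":      # drop hard-to-learn first, then by descending distance
            hard = (distance(majority) > 0.35) & (majority["confidence"] < 0.5)
            rest = majority[~hard]
            kept = (rest.assign(d=distance(rest))
                        .sort_values("d").head(target_n).drop(columns="d"))
        elif method == "CUS":      # prefer majority examples with correctness == 5
            correct = majority[majority["correctness"] == 5]
            kept = correct if len(correct) <= target_n else correct.sample(n=target_n, random_state=seed)
        # QUS (quality-based under-sampling) would analogously prefer majority examples
        # carrying the ClaimBuster High Quality flag; this is an assumption, not shown here.
        else:
            raise ValueError(method)

        return pd.concat([minority, kept]).sample(frac=1.0, random_state=seed)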
[Box plots omitted: English F1 (Positive Class) per quality-based dataset variant, for Original, RUS, and RUS & Correctness=5; panels for DeBERTa V3 base and DeBERTa V3 large.]
Figure 7: Results of Cross-Lingual Transfer experiments - F1 score (Positive Class) for English train dataset tested on dev_test dataset.
The distribution of the results in this experiment differed from the previous one (Section 6.3) due to changes in hyper-parameter values; nevertheless, similar patterns emerged. For Arabic, models trained on datasets with random under-sampling outperformed models trained on the original dataset when the F1 score was measured on dev_test. This was not the case when measured on test: random under-sampling (RUS) performed slightly worse than the Original dataset, although under-sampling informed by correctness (CUS) improved the results on average. The differences, however, were insignificant (0.001 to 0.002 between the Original and CUS mean F1 scores). Complete results are presented in Table 7 in the appendix.
[Box plots omitted: Arabic F1 (Positive Class) on dev_test and test per under-sampling method; panels for mDeBERTa V3 base trained on ar+nl and on ar.]
Figure 8: Results of under-sampling experiments - F1 score (Positive Class) for Arabic dev_test dataset.
For Dutch, we did not observe a systematic improvement from under-sampling. Similar to the experiment on training the English model on the Ground Truth portion of the data (similar in size to the Dutch dataset, approximately 1,000 examples, see Section 5.2.3), any further reduction lowered the F1 score. Complete results are presented in Table 8 in the appendix.
[Box plot omitted: Dutch F1 (Positive Class) on dev_test and test per under-sampling method; mDeBERTa V3 base trained on ar+nl.]
Figure 9: Results of under-sampling experiments - F1 score (Positive Class) for Dutch dev_test dataset.
For English, models trained on datasets with random under-sampling outperformed models trained on the original dataset when comparing both the dev_test and test F1 scores. An even higher increase in F1 scores was observed when under-sampling was performed based on the annotation quality criteria (QUS). The highest maximum F1 score with DeBERTa V3 base was 0.942, with a mean of 0.9 (a 0.03 and 0.035 increase versus the Original baseline). This contrasts with the quality-based filtering experiment results. Unfortunately, the DUS, HUS, and CUS methods generated mostly inferior results (see Figure 10). Complete results are presented in Table 9 in the appendix.
[Box plots omitted: English F1 (Positive Class) on dev_test and test per under-sampling method; panels for DeBERTa V3 base and DeBERTa V3 large trained on en.]
Figure 10: Results of under-sampling experiments - F1 score (Positive Class) for English dev_test dataset.

6.5. Result Submission
The following set-ups were used for the result submission:
• For Arabic, we submitted results generated by the mDeBERTa V3 base model, trained on a randomly under-sampled and concatenated dataset comprising the Arabic, Dutch, English, and Spanish training data.
• For Dutch, we submitted results generated by the mDeBERTa V3 base model, trained on a randomly under-sampled and concatenated dataset comprising the Arabic and Dutch training data.
• For English, we submitted results generated by the DeBERTa V3 large model. The preparation of the training dataset included the concatenation of the train and dev datasets, followed by a split in an 8:2 ratio and subsequent under-sampling. The annotation quality features derived from the ClaimBuster dataset were not used for training the model chosen for submission.

7. Conclusions and Future Work
The application of cross-lingual transfer learning allowed us to achieve a 0.557 F1 score for Arabic, securing second place on the leaderboard. Conversely, for Dutch, the method achieved a 0.590 F1 score, placing only seventh in the competition. For English, we submitted predictions generated with a monolingual model trained on a randomly under-sampled dataset and achieved an F1 score of 0.796, earning second place on the leaderboard.
The results of the conducted experiments shed light on the research questions.
RQ1. What was the contribution to the final score of specific features of the dataset used to create the best-performing method in the 2023 CheckThat! Lab Task 1b? The best results achieved in the CheckThat! Lab 2023 for English, using the ClaimBuster 1:2 dataset, can be attributed to addressing the class imbalance problem rather than purely to the quality of annotation.
RQ2. How effective are multilingual pre-trained language models compared to monolingual models? We demonstrated the efficacy of multilingual models in classification tasks. The results were comparable to or better than those of dedicated monolingual models, even when fine-tuned on a single-language training dataset.
RQ3. How can cross-lingual transfer be leveraged to improve check-worthiness detection using training data in multiple languages? In the case of the Arabic and Dutch subtasks, training on concatenated multilingual datasets led to superior results. The English dataset, on its own, was sufficient to train the best model.
RQ4. Is it possible to outperform random under-sampling with methods informed by annotation quality or training dynamics? Although the removal of lower-quality examples did not contribute to improvements in the F1 score, the inclusion of the annotation quality feature in the under-sampling process has the potential to outperform random under-sampling. An important limitation of annotation-quality-based under-sampling is the availability of a quality measure. As an alternative, we proposed measures based on model training dynamics; however, the three methods for enhancing under-sampling with measures calculated from training dynamics did not outperform random under-sampling. Despite this, we believe that future work should investigate other possibilities for defining measures that support the identification of mislabeled examples and thus inform dataset balancing methods.

Acknowledgments
The research is supported by the project “OpenFact – artificial intelligence tools for verification of veracity of information sources and fake news detection” (INFOSTRATEG-I/0035/2021-00), granted within the INFOSTRATEG I program of the National Center for Research and Development, under the topic: Verifying information sources and detecting fake news.

References
[1] S. Swayamdipta, R. Schwartz, N. Lourie, Y. Wang, H. Hajishirzi, N. A. Smith, Y. Choi, Dataset cartography: Mapping and diagnosing datasets with training dynamics, arXiv preprint arXiv:2009.10795 (2020).
[2] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Overview of the clef-2022 checkthat! lab task 1 on identifying relevant claims in tweets, CLEF ’2022, Bologna, Italy, 2022.
[3] A. Barrón-Cedeño, F. Alam, A. Galassi, G. Da San Martino, P. Nakov, T. Elsayed, D. Azizov, T. Caselli, G. S. Cheema, F. Haouari, M. Hasanain, M. Kutlu, C. Li, F. Ruggeri, J. M. Struß, W. Zaghouani, Overview of the clef–2023 checkthat! lab on checkworthiness, subjectivity, political bias, factuality, and authority of news articles and their source, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer Nature Switzerland, Cham, 2023, pp. 251–275.
[4] M. Sawiński, K. Węcel, E. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz, Openfact at checkthat! 2023: Head-to-head gpt vs. bert - a comparative study of transformers language models for the detection of check-worthy claims, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF ’2023, Thessaloniki, Greece, 2023.
[5] R. Frick, I. Vogel, I. Nunes Grieser, Fraunhofer sit at checkthat! 2022: semi-supervised ensemble classification for detecting check-worthy tweets, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF ’2022, Bologna, Italy, 2022.
[6] A. Aziz, M. Hossain, A. Chy, Csecu-dsg at checkthat! 2023: transformer-based fusion approach for multimodal and multigenre check-worthiness, Working Notes of CLEF (2023).
[7] S. Tran, P. Rodrigues, B. Strauss, E. Williams, Accenture at checkthat! 2023: Identifying claims with societal impact using nlp data augmentation, Working Notes of CLEF (2023).
[8] H. T. Sadouk, F. Sebbak, H. E. Zekiri, Es-vrai at checkthat! 2023: Analyzing checkworthiness in multimodal and multigenre (2023).
[9] P. Tarannum, M. A. Hasan, F. Alam, S. R. H. Noori, Z-index at checkthat! 2023: Unimodal and multimodal checkworthiness classification, Working Notes of CLEF (2023).
[10] A. Modzelewski, W. Sosnowski, A. Wierzbicki, Dshacker at checkthat! 2023: Check-worthiness in multigenre and multilingual content with gpt-3.5 data augmentation, Working Notes of CLEF (2023).
[11] A. Jiang, A. Zubiaga, Cross-lingual offensive language detection: A systematic review of datasets, transfer approaches and challenges, 2024. arXiv:2401.09244.
[12] R. Frick, I. Vogel, J. Choi, Fraunhofer sit at checkthat! 2023: enhancing the detection of multimodal and multigenre check-worthiness using optical character recognition and model souping, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF ’2023, Thessaloniki, Greece, 2023.

Appendices
A. Cross-Lingual Transfer Learning

Table 3
Results for the Arabic Test Dataset measured with the F1 score (Positive Class) for models trained with full datasets (Original), Random Under-Sampling (RUS), and Random Under-Sampling with a combined and new split of train and dev (RUS & new split).
Arabic – F1 (Positive Class)
Sampling                              Original        RUS             RUS & new split
Model               Dataset           max     mean    max     mean    max     mean
mDeBERTa V3 base    ar                0.85    0.833   0.899   0.886   0.894   0.877
                    en                0.769   0.769   0.881   0.881   0.842   0.842
                    ar+nl             -       -       0.885   0.873   0.886   0.876
                    ar+nl(x3)         -       -       0.852   0.846   0.871   0.846
                    ar+en             0.839   0.839   -       -       -       -
                    ar+es+nl          0.767   0.767   0.875   0.858   0.823   0.818
                    ar+es+nl(x3)      -       -       0.892   0.889   0.846   0.834
                    ar+en+es+nl       0.783   0.783   0.899   0.885   0.891   0.885
                    ar+en+es+nl(x5)   -       -       0.901   0.894   0.89    0.879
XLM-RoBERTa base    ar                0.863   0.785   0.885   0.876   0.885   0.869
                    en                0.771   0.771   0.775   0.775   0.873   0.873
                    ar+nl             0.76    0.76    0.885   0.871   0.882   0.862
                    ar+nl(x3)         -       -       0.85    0.821   0.872   0.822
                    ar+en             0.843   0.843   -       -       -       -
                    ar+es+nl          0.689   0.689   0.891   0.882   0.847   0.827
                    ar+es+nl(x3)      -       -       0.889   0.858   0.808   0.791
                    ar+en+es+nl       0.766   0.766   0.883   0.86    0.847   0.844
                    ar+en+es+nl(x5)   -       -       0.867   0.837   0.857   0.787
CAMeLBERT MSA       ar                0.761   0.741   0.864   0.852   0.873   0.841

Table 4
Results for the Dutch Test Dataset measured with the F1 score (Positive Class) for models trained with full datasets (Original), Random Under-Sampling (RUS), and Random Under-Sampling with a combined and new split of train and dev (RUS & new split).

Dutch – F1 (Positive Class)
Sampling                              Original        RUS             RUS & new split
Model               Dataset           max     mean    max     mean    max     mean
mDeBERTa V3 base    en                0.413   0.413   0.51    0.51    0.487   0.487
                    nl                0.663   0.636   0.656   0.642   0.672   0.666
                    ar+nl             -       -       0.706   0.677   0.714   0.684
                    ar+nl(x3)         -       -       0.687   0.665   0.688   0.672
                    en+nl             0.573   0.573   -       -       -       -
                    ar+es+nl          0.606   0.606   0.664   0.65    0.578   0.565
                    ar+es+nl(x3)      -       -       0.643   0.624   0.609   0.599
                    ar+en+es+nl       0.621   0.621   0.651   0.639   0.656   0.635
                    ar+en+es+nl(x5)   -       -       0.662   0.624   0.629   0.619
XLM-RoBERTa base    en                0.458   0.458   0.487   0.487   0.561   0.561
                    nl                0.649   0.629   0.661   0.65    0.656   0.631
                    ar+nl             0.529   0.529   0.664   0.643   0.671   0.64
                    ar+nl(x3)         -       -       0.665   0.635   0.65    0.616
                    en+nl             0.507   0.507   -       -       -       -
                    ar+es+nl          0.545   0.545   0.62    0.61    0.624   0.597
                    ar+es+nl(x3)      -       -       0.629   0.593   0.632   0.591
                    ar+en+es+nl       0.561   0.561   0.636   0.601   0.655   0.628
                    ar+en+es+nl(x5)   -       -       0.652   0.616   0.598   0.586
RobBERT 2023 large  nl                0.671   0.65    0.678   0.668   0.667   0.657
BERTje              nl                0.613   0.594   0.641   0.634   0.652   0.639

Table 5
Results for the English Test Dataset measured with the F1 score (Positive Class) for models trained with full datasets (Original), Random Under-Sampling (RUS), and Random Under-Sampling with a combined and new split of train and dev (RUS & new split).

English – F1 (Positive Class)
Sampling                              Original        RUS             RUS & new split
Model               Dataset           max     mean    max     mean    max     mean
mDeBERTa V3 base    en                0.724   0.724   0.825   0.825   0.899   0.899
                    ar+nl             -       -       0.726   0.633   0.714   0.648
                    ar+nl(x3)         -       -       0.608   0.58    0.627   0.591
                    en+nl             0.819   0.819   -       -       -       -
                    ar+en             0.861   0.861   -       -       -       -
                    ar+es+nl          -       -       0.726   0.723   0.667   0.655
                    ar+es+nl(x3)      -       -       0.754   0.712   0.647   0.621
                    ar+en+es+nl       0.778   0.778   0.898   0.878   0.892   0.876
                    ar+en+es+nl(x5)   -       -       0.878   0.864   0.893   0.878
XLM-RoBERTa base    en                0.811   0.811   0.777   0.777   0.873   0.873
                    ar+nl             -       -       0.607   0.567   0.721   0.639
                    ar+nl(x3)         -       -       0.663   0.618   0.667   0.656
                    en+nl             0.743   0.743   -       -       -       -
                    ar+en             0.773   0.773   -       -       -       -
                    ar+es+nl          -       -       0.677   0.654   0.686   0.68
                    ar+es+nl(x3)      -       -       0.732   0.641   0.686   0.657
                    ar+en+es+nl       0.75    0.75    0.874   0.856   0.896   0.883
                    ar+en+es+nl(x5)   -       -       0.855   0.849   0.882   0.873
DeBERTa V3 large    en                0.888   0.848   0.932   0.899   0.908   0.897

B. Filtering by Quality and Correctness

Table 6
Results for the English dataset measured with the F1 score (Positive Class) for models trained with several pre-configured datasets filtered by quality and correctness.
English – F1 (Positive Class)
Data preparation                          Original        RUS             RUS & Correctness=5
Model               Dataset               max     mean    max     mean    max     mean
DeBERTa V3 base     Original              0.86    0.828   0.906   0.894   0.899   0.879
                    Ground Truth (GT)     0.784   0.77    0.76    0.728   -       -
                    High Quality          0.851   0.818   0.894   0.883   0.846   0.846
                    Low Quality           0.9     0.887   0.909   0.906   0.913   0.886
                    Original and GT       0.866   0.855   0.925   0.915   0.882   0.872
                    High Quality and GT   0.838   0.794   0.913   0.896   0.892   0.858
                    Low Quality and GT    -       -       0.909   0.894   0.898   0.882
                    High Quality 1:2      0.879   0.857   0.911   0.893   -       -
DeBERTa V3 large    Original              0.912   0.836   0.95    0.939   0.938   0.914
                    Ground Truth (GT)     0.837   0.811   0.71    0.696   -       -
                    High Quality          0.876   0.83    0.913   0.913   0.893   0.878
                    Low Quality           0.87    0.843   0.927   0.913   0.916   0.916
                    Original and GT       0.907   0.871   0.937   0.921   0.917   0.905
                    High Quality and GT   0.854   0.836   0.94    0.914   0.913   0.896
                    Low Quality and GT    -       -       0.935   0.919   0.937   0.915
                    High Quality 1:2      0.91    0.9     0.922   0.912   -       -

C. Additional Under-Sampling Methods

Table 7
F1 scores (Positive Class) for Arabic using the mDeBERTa V3 base model trained with different under-sampling methods.

F1 (Positive Class)              max                   mean
Train data     Dataset           dev_test    test      dev_test    test
ar             Original          0.87        0.553     0.833       0.549
               RUS               0.87        0.542     0.859       0.538
               DUS               0.824       0.533     0.797       0.529
               HUS               0.825       0.535     0.794       0.534
               CUS               0.842       0.555     0.832       0.549
ar+nl          Original          0.858       0.551     0.834       0.548
               RUS               0.854       0.547     0.854       0.547
               DUS               0.851       0.53      0.811       0.528
               HUS               0.854       0.54      0.803       0.531
               CUS               0.848       0.555     0.84        0.55

Table 8
F1 scores (Positive Class) for Dutch using the mDeBERTa V3 base model trained with different under-sampling methods on the ar+nl dataset.

F1 (Positive Class)      max                   mean
Dataset                  dev_test    test      dev_test    test
Original                 0.693       0.675     0.573       0.532
RUS                      0.65        0.707     0.55        0.663
DUS                      0.663       0.608     0.646       0.561
HUS                      0.665       0.623     0.624       0.563
CUS                      0.687       0.685     0.639       0.612

Table 9
F1 scores (Positive Class) for English using a model trained with different under-sampling methods on the en dataset.

F1 (Positive Class)                 max                   mean
Model               Dataset         dev_test    test      dev_test    test
DeBERTa V3 base     Original        0.912       0.795     0.865       0.749
                    RUS             0.934       0.798     0.884       0.749
                    QUS             0.942       0.8       0.9         0.773
                    DUS             0.905       0.785     0.865       0.731
                    HUS             0.913       0.797     0.882       0.756
                    CUS             0.924       0.781     0.892       0.764
DeBERTa V3 large    Original        0.926       0.807     0.903       0.79
                    RUS             0.942       0.814     0.905       0.771
                    QUS             0.937       0.814     0.915       0.795
                    DUS             0.918       0.792     0.883       0.746
                    HUS             0.928       0.831     0.899       0.779
                    CUS             0.919       0.807     0.908       0.785