<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OpenFact at CheckThat! 2024: Cross-Lingual Transfer Learning for Check-Worthiness Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcin Sawiński</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krzysztof Węcel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ewelina Księżniak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems, Poznań University of Economics and Business</institution>
          ,
          <addr-line>Al. Niepodległości 10, 61-875 Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the results of the OpenFact team's experiments in the CLEF 2024 CheckThat! Lab Task 1 competition for multilingual, unimodal check-worthiness detection. Several mono- and multilingual pre-trained language models were fine-tuned using different variants of the training datasets. Cross-lingual transfer learning was applied without instance transfer and proved to be effective for Arabic and Dutch. Additionally, we tested the effectiveness of class balancing using several under-sampling methods, which, when combined with appropriate model selection and cross-lingual transfer learning, produced the second-best results for Arabic and English.</p>
      </abstract>
      <kwd-group>
        <kwd>check-worthiness</kwd>
        <kwd>fact-checking</kwd>
        <kwd>fake news detection</kwd>
        <kwd>language models</kwd>
        <kwd>cross-lingual transfer learning</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In previous editions of CheckThat! Lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], many methods were proposed to solve the
check-worthiness detection task on text data. In 2023, the dominant method involved the application of
pre-trained language models fine-tuned for the classification task.
      </p>
      <p>
        For English, the best score was achieved by team OpenFact [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], using the GPT-3 curie model fine-tuned
on an under-sampled training dataset. However, DeBERTa V3 performed only marginally worse.
Under-sampling was performed using an additional annotation quality flag derived from the ClaimBuster
dataset<sup>2</sup>.
      </p>
      <p>
        Other teams used monolingual models: BERT (Fraunhofer SIT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], CSECU-DSG [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]), RoBERTa
(Accenture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), GigaBERT (Accenture), MARBERT (ES-VRAI [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]), a feed-forward neural network trained
on embeddings (Z-Index [9]), and multilingual models: XLM-RoBERTa (DSHacker [10]) and Twitter
XLM-RoBERTa (CSECU-DSG). The models were mostly used for sequence classification, but other methods
were also applied: ensemble learning with model souping (Fraunhofer SIT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]), and a BiLSTM module to handle
long-term contextual dependencies combined with multi-sample dropout (CSECU-DSG [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
      <p>
        The dataset curation included back-translation (Accenture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), under-sampling (ES-VRAI [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
OpenFact [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), instance transfer (DSHacker [10]), and paraphrasing with GPT-3.5 (DSHacker [10]).
      </p>
      <p>We concluded that complex model setups were not critical to achieving the best results
and that a well-performing BERT-family model could achieve top results given a sufficient dataset.
Another observation was that dataset augmentations, despite showing improvements over the baseline,
might be outperformed by under-sampling. The last finding from the analysis of previous submissions
was that multilingual models could perform as well as or better than single-language models.</p>
      <p>A survey on offensive language detection [11], a task that shares some similarities with check-worthiness
detection, presents many options for transferring domain knowledge from high-resource languages to
low-resource languages by using Cross-Lingual Transfer Learning (CLTL). The first category of CLTL,
Instance Transfer, includes the transfer of text or label information between source and target languages.
In Text-Based Transfer (applied by DSHacker and Accenture in 2023), machine translation is most
often used. For the purposes of this research, neither Label-Based Transfer (annotation projection and
pseudo-labeling) nor Text Alignment methods are relevant because the data for all languages included
in the competition, although scarce, come with labels. The next category, Feature Transfer, comprises
methods that extract linguistic features from source and target languages (e.g., using Multilingual Word
Embeddings) and align them into a shared feature space. Those methods are applicable to the
check-worthiness detection task, but we did not use them in our experiments. The last category, Parameter
Transfer, relies on transferring distributions of parameters between languages within one model or across
separate models. Multilingual pre-trained language models are fundamental for this method, as they are
pre-trained on vast datasets in many languages, sharing semantic representations across languages.</p>
      <p>We decided to focus our experiments on Parameter Transfer to analyze the performance of multilingual
models fine-tuned on the multilingual datasets provided by the CheckThat! Lab organizers.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The study focused on the application of cross-lingual transfer learning to find the best-performing
solution for check-worthiness detection in Arabic, Dutch, and English. Specific research questions were
formulated:
• RQ1. What was the contribution to the final score of specific features of the ClaimBuster 1:2
dataset used to create the best-performing method in the 2023 CheckThat! Lab Task 1b?
• RQ2. How effective are multilingual pre-trained language models compared to monolingual
models?
• RQ3. How can cross-lingual transfer be leveraged to improve check-worthiness detection using
training data in multiple languages?
• RQ4. Is it possible to outperform random under-sampling with methods informed by annotation
quality or training dynamics?</p>
      <p>The first research question stems from uncertainty about why the dataset curation applied in the
winning method for English in the 2023 CheckThat! Lab Task 1b was effective.
The curated dataset reduced the class imbalance but did not eliminate it (it had a 1:2 ratio of positive
to negative examples), and some lower-quality examples were filtered out. We observed an inconsistent
impact during training: some models produced much better results (e.g., the F1 score of the
winning method, fine-tuned GPT-3 curie, increased by 0.072), while others remained unchanged or even
worsened. The experiments were planned to isolate the impact of class balancing, removal of low-quality
examples, and variability arising from random model parameter initialization.</p>
      <p>The second research question aims at measuring the gap between monolingual and multilingual models,
highlighting any potential performance loss when using the latter.</p>
      <p>The third research question focuses on utilizing cross-lingual transfer not only for low-resource
languages but also for improving high-resource languages by combining data from around the globe.
Check-worthiness detection is part of the fact-checking process, which in many cases is global. Fake
news and narratives cross geographical and language barriers. Consequently, models trained on
multilingual data could potentially outperform monolingual models, even for high-resource languages.</p>
      <p>The goal of the fourth research question is to design proxy measures that would allow for the creation
of a high-quality training dataset even when an explicit annotation quality feature is not available.</p>
      <p>The study contains three parts:
1. Finding the best monolingual model to use as a baseline.
2. Preparing multilingual training dataset variants.
3. Training and evaluating mono- and multilingual models on the prepared datasets.</p>
      <p>The study required multiple training runs across models, dataset preparation variants, and
different random seeds to allow for more accurate comparisons of results. During training, each run was
evaluated using either the loss or the F1 score for the positive class; every trained model was then tested
using the F1 score for the positive class.</p>
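      <p>As a point of reference, the F1 score for the positive class can be computed from true positives, false positives, and false negatives. The following is a minimal sketch, assuming binary labels where 1 marks a check-worthy example; the function name and the toy labels are illustrative and are not taken from our pipeline.</p>
      <preformat>
```python
def f1_positive(y_true, y_pred, positive=1):
    """F1 score of the positive class for two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Without true positives, precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = check-worthy, 0 = not check-worthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_positive(y_true, y_pred)
```
      </preformat>
      <p>Because true negatives do not enter the formula, the score is insensitive to how many negative examples are classified correctly, which matters for datasets where the positive class is underrepresented.</p>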
      <p>Phases of the experiments included:
1. Testing single language models using unaltered datasets.
2. Testing cross-lingual transfer learning using various concatenations of datasets.
3. Testing the impact of various structural changes to the training datasets.</p>
      <p>
        Our team achieved the best score in CheckThat! Lab Subtask-1B in English in 2023 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] using a
fine-tuned GPT-3 model; however, results obtained by the DeBERTa V3 model were only marginally
worse. Considering that the end goal of check-worthiness detection is large-scale application, resource
consumption is a critical factor for the actual method selection. Given the significantly lower resources
needed to run BERT models compared to GPT-3, we decided to limit this study to BERT models and to
maximize the model performance within this constraint.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Models</title>
      <p>We made an initial selection of BERT-family models to use for the sequence classification
task. We were not able to test all available models, nor could we establish an objectively
verifiable ranking. Instead, we included selected mono- and multilingual models. This
subjective selection favored the largest, most recent, or best-performing models
according to benchmarks or previous editions of the CLEF CheckThat! Lab.</p>
      <p>For the English subtask, we tested two English models:
• DeBERTa V3 base (microsoft/deberta-v3-base),<sup>3</sup>
• DeBERTa V3 large (microsoft/deberta-v3-large).<sup>4</sup></p>
      <p>
        DeBERTa V3 base scored 0.894 in CheckThat! Lab Subtask-1B in English in 2023 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], only 0.004 less
than the winning GPT-3 but still 0.006 better than the second team [12]. Adding a larger version of the
same model was expected to yield even better results.
      </p>
      <p>For the Arabic subtask, we tested three variants of CAMeLBERT to choose the model best suited to
the dataset: Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA):
• CAMeLBERT MSA (CAMeL-Lab/bert-base-arabic-camelbert-msa),<sup>5</sup>
• CAMeLBERT DA (CAMeL-Lab/bert-base-arabic-camelbert-da),
• CAMeLBERT CA (CAMeL-Lab/bert-base-arabic-camelbert-ca).</p>
      <p>For the Dutch subtask, we selected two models:
• RobBERT 2023 large (DTAI-KULeuven/robbert-2023-dutch-large),<sup>6</sup>
• BERTje (GroNLP/bert-base-dutch-cased).<sup>7</sup></p>
      <p>
        Results from CheckThat! Lab Subtask-1B [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] indicated that multilingual models also have the potential
to achieve top results. We decided to include two multilingual models in our experiments:
• mDeBERTa V3 base (microsoft/mdeberta-v3-base),<sup>8</sup>
• XLM-RoBERTa base (FacebookAI/xlm-roberta-base).<sup>9</sup>
      </p>
      <p>Due to time and resource constraints, we were not able to extensively search for optimal
hyperparameter values. We decided to use preselected values and tested multiple variants of the training dataset.
We monitored the learning curves to ensure that the models did not under-fit and applied early stopping
to avoid overfitting. We used a step-wise evaluation strategy instead of an epoch-wise one, with a maximum of 5000 training steps.</p>
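      <p>The interaction between step-wise evaluation, early stopping, and the step limit can be sketched as follows. Only the 5000-step maximum comes from the text; the evaluation interval, the patience value, and the function name are hypothetical, and the metric is assumed to be one where higher is better (such as F1).</p>
      <preformat>
```python
MAX_STEPS = 5000   # maximum number of training steps (from the text)
EVAL_EVERY = 100   # hypothetical evaluation interval, in steps
PATIENCE = 3       # hypothetical number of non-improving evaluations tolerated


def stop_step(eval_scores, eval_every=EVAL_EVERY, patience=PATIENCE,
              max_steps=MAX_STEPS):
    """Return (step at which training stops, step of the best evaluation).

    eval_scores[i] is the metric measured after (i + 1) * eval_every steps.
    """
    best_score, best_step, bad_evals = float("-inf"), 0, 0
    step = 0
    for i, score in enumerate(eval_scores):
        step = (i + 1) * eval_every
        if step > max_steps:
            # The step budget ran out before early stopping triggered.
            return max_steps, best_step
        if score > best_score:
            best_score, best_step, bad_evals = score, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_step
    return step, best_step


# Toy evaluation curve (hypothetical values).
scores = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70]
stopped_at, best_at = stop_step(scores)
```
      </preformat>
      <p>In this toy run the best checkpoint is the one from step 300, and training stops at step 600 after three evaluations without improvement.</p>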
    </sec>
    <sec id="sec-5">
      <title>5. Datasets</title>
      <sec id="sec-5-1">
        <title>5.1. Datasets Overview</title>
        <p>CheckThat! Lab in 2024 provided participants with four datasets in Arabic, Dutch, English, and Spanish.
Each dataset contained train, dev, and dev_test splits. For Arabic, Dutch, and English, a test split was
also provided for use in submission.</p>
        <p>The example counts revealed that the Dutch dataset contained significantly fewer examples than
the others (see Table 1) and that the positive class was underrepresented in all datasets (see Table 2). Results
from previous editions of CheckThat! Lab inspired us to explore various sampling methods informed by
data quality and training dynamics measures.</p>
        <p>English Dataset. Analysis revealed that examples in the train and dev splits originated from the
ClaimBuster dataset. A lookup on ClaimBuster files indicated that the train data split was fully annotated
by crowd-sourcing, while the dev split was annotated by experts (the so-called ground-truth dataset in
ClaimBuster). The dev_test split was equal to the test split delivered in the 2023 edition of CheckThat!
Lab, but its origins are unknown. The test split was not matched with any existing dataset.</p>
        <p>Arabic, Dutch, Spanish Datasets. The data structure revealed that examples were collected from
Twitter, but the datasets were not matched with any existing datasets.</p>
        <p><sup>3</sup> https://huggingface.co/microsoft/deberta-v3-base
<sup>4</sup> https://huggingface.co/microsoft/deberta-v3-large
<sup>5</sup> https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa
<sup>6</sup> https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-large
<sup>7</sup> https://huggingface.co/GroNLP/bert-base-dutch-cased
<sup>8</sup> https://huggingface.co/microsoft/mdeberta-v3-base
<sup>9</sup> https://huggingface.co/FacebookAI/xlm-roberta-base</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Dataset Variants</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Monolingual Dataset Variants</title>
          <p>In the first phase, the original dataset splits were used to train language-specific models. Three main
baseline variants of datasets were:
• Arabic train,
• Dutch train,
• English train.</p>
          <p>For evaluation, the original dev dataset splits were used, and dev_test splits were used to calculate the
F1 score (positive class) of each trained model.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Multilingual Dataset Variants</title>
          <p>In the second phase, we planned experiments with cross-lingual transfer learning using six multilingual
train datasets. New dataset variants were created by concatenating the train splits of the single-language
datasets. Similarly, the dev splits were concatenated to create multilingual evaluation datasets. All
dev_test splits were used individually to calculate the F1 score. The concatenation variants included:
• Full multilingual – concatenation of Arabic, Dutch, English, and Spanish (later referred to as
ar+en+es+nl).
• Twitter multilingual – concatenation of Arabic, Dutch, and Spanish (later referred to as
ar+es+nl).
• Twitter bilingual – concatenation of Arabic and Dutch (later referred to as ar+nl).</p>
          <p>We noticed a significant disproportion in the size of the train datasets: Dutch (995 examples) vs
Arabic (7333), English (22501), and Spanish (19948). To address this issue, we created over-sampled
versions of the datasets with Dutch examples sampled three times for ar+nl(x3) and ar+es+nl(x3), and
five times for ar+en+es+nl(x5).</p>
          <p>Previously, we observed a significant improvement from balancing class counts, so we added a train
dataset variant with random under-sampling applied. Another variant involved reshuffling examples in
the train and dev splits before applying random under-sampling.</p>
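          <p>The over-sampled concatenations such as ar+nl(x3) can be illustrated with a short sketch. The train sizes are the ones quoted above; the helper names and the toy examples are hypothetical, and the shares are computed on the full train splits, before any under-sampling.</p>
          <preformat>
```python
# Train split sizes quoted in the text.
TRAIN_SIZES = {"ar": 7333, "en": 22501, "es": 19948, "nl": 995}


def concat_with_repeats(datasets, repeats):
    """Concatenate per-language example lists, repeating a language
    repeats[lang] times (default: once)."""
    combined = []
    for lang, examples in datasets.items():
        combined.extend(examples * repeats.get(lang, 1))
    return combined


def dutch_share(langs, nl_repeats=1):
    """Fraction of Dutch examples in a concatenation of the given languages."""
    total = sum(
        TRAIN_SIZES[lang] * (nl_repeats if lang == "nl" else 1) for lang in langs
    )
    return TRAIN_SIZES["nl"] * nl_repeats / total


# Toy stand-in for the real splits (hypothetical examples).
toy = {"ar": ["ar-1", "ar-2", "ar-3"], "nl": ["nl-1"]}
ar_nl_x3 = concat_with_repeats(toy, {"nl": 3})   # ar+nl(x3)

share_plain = dutch_share(["ar", "nl"])              # ar+nl
share_x3 = dutch_share(["ar", "nl"], nl_repeats=3)   # ar+nl(x3)
```
          </preformat>
          <p>With these sizes, Dutch makes up roughly 12% of ar+nl and roughly 29% of ar+nl(x3).</p>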
          <p>
            Our previous research [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] showed that for English, the annotation quality differed between train
(crowd-sourced labels) and dev (ground-truth annotated by experts), and this difference could impact the
training process. For English, the aim of reshuffling was to test whether adding some higher-quality examples
to the train set from dev, combined with adding some lower-quality examples to dev from train, would
affect the results. The preparation process consisted of three steps:
1. Concatenation of train and dev splits into a single dataset.
2. Random split into new train and dev subsets with an 8:2 ratio.
3. Random under-sampling of the new train and dev sets to achieve equal class counts.
          </p>
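          <p>The three preparation steps above can be sketched as follows. This is a minimal illustration, assuming each example is a dictionary with a binary label key; the function names, the seed, and the toy data are hypothetical.</p>
          <preformat>
```python
import random


def random_under_sample(examples, rng):
    """Randomly drop majority-class examples until both classes are equal."""
    pos = [e for e in examples if e["label"] == 1]
    neg = [e for e in examples if e["label"] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced


def rus_new_split(train, dev, train_fraction=0.8, seed=0):
    """Build the 'RUS and new split' variant from the original train and dev."""
    rng = random.Random(seed)
    pooled = train + dev                     # step 1: concatenate train and dev
    rng.shuffle(pooled)
    cut = int(len(pooled) * train_fraction)  # step 2: random 8:2 split
    new_train, new_dev = pooled[:cut], pooled[cut:]
    # Step 3: balance each new subset independently.
    return random_under_sample(new_train, rng), random_under_sample(new_dev, rng)


# Toy data (hypothetical): one positive example in every four.
train = [{"id": i, "label": 1 if i % 4 == 0 else 0} for i in range(80)]
dev = [{"id": 100 + i, "label": 1 if i % 4 == 0 else 0} for i in range(20)]
new_train, new_dev = rus_new_split(train, dev)
```
          </preformat>
          <p>Splitting before under-sampling keeps the new train and dev subsets disjoint, and balancing each subset separately means the evaluation data is class-balanced as well.</p>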
          <p>As a result, three sets of dataset variants were created: Original (full training datasets), RUS (random
under-sampling applied), and RUS &amp; new split (random under-sampling applied after joining train and
dev and splitting again).</p>
          <p>The total number of available training datasets for cross-lingual transfer learning was 30 (4
monolingual datasets and 6 multilingual, each in Original, RUS, and RUS &amp; new split versions). Not all variants
were used in the experiments; the selection was guided by resource constraints and the expected potential
for improving results.</p>
          <p>In the third phase, seven additional train dataset variants were created for the English dataset.
Leveraging additional information about annotation quality derived from the ClaimBuster dataset,
individual examples were assigned a High or Low quality flag.</p>
          <p>The authors of the ClaimBuster dataset, used for creating the English train dataset, introduced
screening criteria to exclude low-quality labels and published three filtered datasets with class ratios of
1:2, 1:2.5, and 1:3<sup>10</sup>. The most balanced, 1:2 dataset was used directly in the experiment (referred to as
High Quality 1:2). Additionally, we derived a new High Quality flag that was assigned to all examples
included in any of the three mentioned ClaimBuster datasets. Analogously, examples not included in
any of the aforementioned datasets were flagged as Low Quality. On top of that, we used a separate flag
for Ground Truth indicating examples annotated by experts, while all other examples were annotated
using a crowd-sourcing approach.</p>
          <p>As a result, eight English train datasets based on quality were created and later referred to as:
• Original – Unmodified English train dataset from CheckThat! Lab 2024.
• Ground Truth – Selected examples annotated by experts.
• High Quality – Examples included in ClaimBuster files screened for quality.
• Low Quality – Examples excluded from ClaimBuster files screened for quality.
• Original and GT (Ground Truth) – Concatenation of 0.8 of Original and Ground Truth examples
(0.2 hold-out for evaluation).
• High Quality and GT (Ground Truth) – Concatenation of 0.8 of High Quality and Ground
Truth examples (0.2 hold-out for evaluation).
• Low Quality and GT (Ground Truth) – Concatenation of 0.8 of Low Quality and Ground Truth
examples (0.2 hold-out for evaluation).
• High Quality 1:2 – 0.8 of examples included in the ClaimBuster 1:2 file (0.2 hold-out for
evaluation).</p>
          <p>
            Additionally, we trained the DeBERTa V3 base model on all examples (concatenated train and dev)
for 5 epochs and collected logits after each epoch to calculate training dynamics metrics: variability,
confidence, and correctness as described by Swayamdipta et al. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. We used the correctness measure to
further filter the data: examples were classified as correct (correctness equal to five) or not (correctness
less than five). The correctness flag was used to generate an additional set of train datasets by removing
examples with correctness less than five.
          </p>
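          <p>A minimal sketch of how the three training dynamics metrics can be derived from per-epoch logits of a single example is given below. It assumes that confidence is the mean probability assigned to the gold label across epochs, variability is the standard deviation of that probability, and correctness is the number of epochs in which the gold label was predicted (so it ranges from zero to five for five epochs). The function names and toy logits are hypothetical.</p>
          <preformat>
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_dynamics(epoch_logits, gold):
    """Return (confidence, variability, correctness) for one example.

    epoch_logits: one logit vector per epoch; gold: index of the true class.
    """
    probs = [softmax(logits) for logits in epoch_logits]
    gold_probs = [p[gold] for p in probs]
    confidence = sum(gold_probs) / len(gold_probs)
    variability = math.sqrt(
        sum((p - confidence) ** 2 for p in gold_probs) / len(gold_probs)
    )
    # Count the epochs in which the arg-max prediction equals the gold label.
    correctness = sum(1 for p in probs if p.index(max(p)) == gold)
    return confidence, variability, correctness


# Toy logits for five epochs (hypothetical), gold class = 1.
easy = [[0.0, 2.0]] * 5
mixed = [[0.0, 2.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
easy_conf, easy_var, easy_corr = training_dynamics(easy, gold=1)
mixed_conf, mixed_var, mixed_corr = training_dynamics(mixed, gold=1)
```
          </preformat>
          <p>Under this convention the first toy example would be kept by the correctness filter (correctness equal to five), while the second would be removed (correctness equal to four).</p>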
          <p>As a final step, we applied random under-sampling (Random US, RUS) to all 16 datasets (eight
quality-based variants, each with and without the correctness-equal-to-five filter), producing 32 new final
datasets (the order of filtering was quality &gt; correctness &gt; random under-sampling).</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.4. Additional Under-Sampling Methods</title>
          <p>We created additional dataset variants using under-sampling methods informed by additional measures.
These variants were assigned the following codes:
• DUS – Symmetrically removing the most easy-to-learn and hard-to-learn examples. All majority
class examples were sorted in descending order by their ℓ2 distance from the reference point
(variability, confidence) = (0.5, 0.5) and removed until the desired class count was reached.
• HUS – First removing all hard-to-learn examples (defined as examples having an ℓ2 distance from
(variability, confidence) = (0.5, 0.5) greater than 0.35 while having a confidence &lt; 0.5), and then
removing easy-to-learn examples sorted by descending distance from (variability, confidence) = (0.5,
0.5) until the desired class count was reached.
• CUS – First removing all examples from the majority class with correctness less than five, and
later, if necessary, randomly choosing examples with correctness equal to five until the desired
class count was reached.</p>
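          <p>The three selection rules above can be sketched as filters over the majority-class examples. This is an illustrative reading of the descriptions, assuming each example carries precomputed variability, confidence, and correctness values; the function names and the toy examples are hypothetical. In this sketch HUS trims the remaining examples by distance regardless of region, which coincides with removing easy-to-learn examples as long as the target count is not too small.</p>
          <preformat>
```python
import math
import random

REF = (0.5, 0.5)  # reference point (variability, confidence) from the text


def dist(ex):
    """Euclidean distance of an example from the reference point."""
    return math.hypot(ex["variability"] - REF[0], ex["confidence"] - REF[1])


def dus(majority, target):
    """Remove the farthest examples first, keeping the `target` closest ones."""
    return sorted(majority, key=dist)[:target]


def hus(majority, target, radius=0.35):
    """Drop hard-to-learn examples, then trim the farthest remaining ones."""
    kept = [
        ex for ex in majority
        if not (dist(ex) > radius and 0.5 > ex["confidence"])
    ]
    return sorted(kept, key=dist)[:target]


def cus(majority, target, rng, max_correctness=5):
    """Keep only fully correct examples, sampling randomly if too many remain."""
    kept = [ex for ex in majority if ex["correctness"] == max_correctness]
    if len(kept) > target:
        kept = rng.sample(kept, target)
    return kept


# Toy majority-class examples (hypothetical values).
majority = [
    {"id": "easy", "variability": 0.05, "confidence": 0.95, "correctness": 5},
    {"id": "hard", "variability": 0.10, "confidence": 0.10, "correctness": 0},
    {"id": "ambiguous", "variability": 0.45, "confidence": 0.50, "correctness": 3},
    {"id": "mid", "variability": 0.30, "confidence": 0.70, "correctness": 5},
]
dus_ids = {ex["id"] for ex in dus(majority, 3)}
hus_ids = {ex["id"] for ex in hus(majority, 3)}
cus_ids = {ex["id"] for ex in cus(majority, 2, random.Random(0))}
```
          </preformat>
          <p>On the toy data, DUS drops the most extreme example on either side, HUS drops the hard-to-learn example first, and CUS keeps only the examples that were predicted correctly in every epoch.</p>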
          <p>
            The calculation formulas (variability, confidence, and correctness) and definitions of regions
(easy-to-learn, hard-to-learn, ambiguous) follow [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. The results were compared to the original dataset (Original)
and random under-sampling (RUS).
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental Results</title>
      <sec id="sec-6-1">
        <title>6.1. Monolingual Model Selection</title>
        <p>Several training runs for the Arabic dataset revealed that the MSA variant of the CAMeLBERT family
of models is best suited for the task. The best F1 score (Positive Class) was 0.832 with a learning rate of
1e-05, and this configuration was used for other experiments (see Figure 1).</p>
        <p>[Figure 1: dev_test F1 (Positive Class) by model for CAMeL-Lab/bert-base-arabic-camelbert-da, CAMeL-Lab/bert-base-arabic-camelbert-ca, and CAMeL-Lab/bert-base-arabic-camelbert-msa.]</p>
        <p>Training runs for the Dutch dataset revealed that the RobBERT 2023 large model outperformed the
BERTje model for the given task. The best F1 score (Positive Class) was 0.671 with a learning rate of
1e-05; however, we decided to include both models in other experiments (see Figure 2).</p>
        <p>[Figure 2: results by model for DTAI-KULeuven/robbert-2023-dutch-large and GroNLP/bert-base-dutch-cased.]</p>
        <p>Training runs for the English dataset revealed that the DeBERTa V3 large model outperformed the
DeBERTa V3 base model for the task. The best F1 score (Positive Class) was 0.926 with a learning rate
of 1e-05. We decided to mainly use the DeBERTa V3 large model for other experiments; however, we
made some further comparisons with the base model as well (see Figure 3).</p>
        <p>[Figure 3: dev_test F1 (Positive Class) by model for microsoft/deberta-v3-base and microsoft/deberta-v3-large.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Cross-Lingual Transfer Learning</title>
        <p>Experiments for cross-lingual transfer learning followed a similar pattern for all languages. We first
tested and compared the performance of monolingual model training on full datasets (Original), Random
Under-Sampling (RUS), and Random Under-Sampling with joined and new splits of train and dev (RUS
&amp; new split).</p>
        <p>In most cases, the F1 score was higher for RUS and RUS &amp; new split, so we dropped some of the
Original variants for subsequent study to save on compute power. Analysis of the performance of
training the monolingual models shows that the results for English (see Table 5) were significantly
higher than for Arabic and Dutch (0.932 vs 0.873 and 0.671 respectively — see Tables 3 and 4).</p>
        <p>For cross-lingual transfer, we decided to test multilingual models trained solely on the English dataset
to predict on dev_test datasets in Arabic and Dutch. Bearing in mind resource utilization, we excluded
other possibilities (e.g., testing predictions for English using a model trained solely in Arabic or Dutch)
as this was not likely to improve the F1 score.</p>
        <p>The remaining combinations of dataset variants, models, and sampling methods were applied in
model training. We planned four runs with different random seeds for each combination but, due to
compute constraints, not all seed values were tested. The result tables present the highest F1 score
achieved (max) and the mean (mean) calculated from multiple runs of the same configuration with
different random seed values.</p>
        <sec id="sec-6-2-1">
          <title>6.2.1. Arabic</title>
          <p>For Arabic, the highest F1 score for the positive class was achieved by the mDeBERTa V3 base model
trained on the largest dataset, which concatenated Arabic, Dutch, English, and Spanish datasets after
applying random under-sampling on individual datasets and over-sampling Dutch data five times
(ar+en+es+nl(x5)). The maximum F1 score was 0.901, with the mean F1 score from all runs only slightly
lower at 0.894. It surpassed the best monolingual Arabic model by 0.028 for the maximum and 0.042 for
the mean F1 score (0.873 and 0.852 for CAMeLBERT MSA, see Table 3). It is worth noting that using
only Arabic data for training the multilingual model also produced a higher F1 score than the dedicated
Arabic model (by 0.021 for the maximum and 0.025 for the mean).</p>
          <p>[Figure: panels with a Dataset axis listing the training variants ar, en, nl, ar+nl, ar+nl(x3), ar+es+nl, ar+es+nl(x3), ar+en+es+nl, and ar+en+es+nl(x5).]</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>6.2.2. Dutch</title>
          <p>For Dutch, the highest F1 score (Positive Class) was also achieved by the mDeBERTa V3 base model, but
the optimal training dataset was different. The best performance was achieved using the randomly
under-sampled and reshuffled train and dev dataset (RUS &amp; new split) that concatenated Arabic and Dutch
data (ar+nl). This configuration resulted in
both the highest maximum F1 score of 0.714 and the highest mean of all runs at 0.684. Surprisingly,
adding more data (English, Spanish) or over-sampling Dutch examples lowered the F1 score. In this
case, cross-lingual transfer surpassed the best monolingual model by 0.036 for the maximum and 0.016
for the mean F1 score (0.678 and 0.668 for RobBERT 2023 large, see Table 4). It is worth noting that
using only Dutch data for training the multilingual model yielded lower results than the dedicated Dutch
models (by 0.017 for the maximum and 0.018 for the mean).</p>
          <p>[Figure: panels for the mDeBERTa V3 base, XLM-RoBERTa base, RobBERT 2023 large, and BERTje models; legend (Sampling): Original; Random Under-sampling; Random Under-sampling, new TRAIN / DEV split.]</p>
        </sec>
        <sec id="sec-6-2-3">
          <title>6.2.3. English</title>
          <p>For English, the single highest F1 score (Positive Class) was achieved by the monolingual DeBERTa
V3 large on the randomly under-sampled English dataset (0.932), but the highest mean of all runs
was equal for both DeBERTa V3 large and multilingual mDeBERTa V3 base (0.899). Both results were
achieved using only English examples in training. Cross-lingual transfer was not effective in this case
(see Table 5 in the appendix).</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Filtering by Quality and Correctness</title>
        <p>Experiments performed on English dataset variants, filtered by annotation quality and under-sampled,
showed a greater impact of class balancing over structural changes. While filtering by annotation
quality was able to improve the F1 score compared to the original dataset, the improvements from class
balancing were much more pronounced.</p>
        <p>The best overall score was achieved by DeBERTa V3 large with random under-sampling on the
original dataset, achieving the highest maximum score of 0.95 and a mean score of 0.939. The highest
maximum score without random under-sampling was also achieved by DeBERTa V3 large on the
original dataset. The highest mean score without random under-sampling was 0.90, achieved by the
same model using the High Quality 1:2 dataset. It is important to note that this dataset is more balanced
than the original dataset (1:2 vs 1:3.17).</p>
        <p>Random under-sampling combined with filtering of examples with correctness less than five produced
worse results than random under-sampling alone (see Figure 7). Complete results are presented in
Table 6 in the appendix.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Additional Under-Sampling Methods</title>
        <p>Experiments with under-sampling methods continued after submission to the CheckThat! Lab 2024
competition, and many training runs were performed when the test file with labels was already available.
In contrast to previously reported results, this experiment reports the F1 score (Positive Class) on both
dev_test and test datasets.</p>
        <p>The application of filtering by quality and correctness did not yield improvements when applied as
the first step of the processing pipeline before random under-sampling. In this phase of the experiment,
the processing order was changed: all minority (positive) class examples were included in all training
runs, and only the majority (negative) class examples were filtered out based on various conditions
(referred to as RUS, QUS, DUS, HUS, and CUS, see Section 5.2.4).</p>
        <p>[Figure: DeBERTa V3 base Model; Dataset axis: Original, Ground Truth (GT), High Quality, Low Quality, Original and GT, High Quality and GT, Low Quality and GT, High Quality 1:2.]</p>
        <p>The distribution of the results in this experiment differed from the previous one (Section 6.3) due to changes
in hyper-parameter values; nevertheless, similar patterns emerged.</p>
        <p>For Arabic, models trained on datasets with random under-sampling outperformed models trained on
the original dataset when the F1 score was measured against dev_test. This did not hold when measured
against test: random under-sampling (RUS) performed slightly worse than the Original dataset, although
the use of correctness (CUS) improved results on average. The differences, however, were insignificant (a 0.001
to 0.002 difference between the Original mean and the CUS mean F1 score). Complete results are presented in
Table 7 in the appendix.</p>
        <p>[Figure: panels "mDeBERTa V3 base Model | Trained on ar+nl" and "mDeBERTa V3 base Model | Trained on ar".]</p>
        <p>For Dutch, we did not observe a systematic improvement from under-sampling. As in the
experiment that trained the English model on the Ground Truth portion of the data (similar in size to the
Dutch dataset, approximately 1,000 examples, see Section 5.2.4), any further reduction in dataset size
lowered the F1 score. Complete results are presented in Table 8 in the appendix.</p>
        <p>[Figure: panel "Trained on ar+nl | mDeBERTa V3 base Model"; y-axis: Dataset (Original, RUS, DUS, HUS, CUS), listed for two panels.]</p>
        <p>For English, models trained on datasets with random under-sampling outperformed models trained
on the original dataset on both dev_test and test F1 scores. An even larger increase
in F1 scores was observed when under-sampling was performed based on annotation quality criteria
(QUS). The highest maximum F1 score with DeBERTa V3 base was 0.942, with a mean of 0.9 (increases of
0.03 and 0.035, respectively, over the Original baseline). This contrasts with the results of the
quality-based filtering experiment. Unfortunately, the DUS, HUS, and CUS methods generated mostly
inferior results (see Figure 10). Complete results are presented in Table 9 in the appendix.</p>
        <p>[Figure: two panels, "Trained on en | DeBERTa V3 base Model" and "Trained on en | DeBERTa V3 large Model"; x-axis: English F1 (Positive Class), 0.70 to 0.95.]</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Result Submission</title>
        <p>The following set-ups were used for result submission:
• For Arabic, we submitted results generated by the mDeBERTa V3 base model, trained on a
randomly under-sampled and concatenated dataset comprising Arabic, Dutch, English, and
Spanish training data.
• For Dutch, we submitted results generated by the mDeBERTa V3 base model, trained on a
randomly under-sampled and concatenated dataset comprising Arabic and Dutch training data.
• For English, we submitted results generated by the DeBERTa V3 large model. The preparation of
the training dataset included concatenation of the train and dev datasets, followed by a split in
an 8:2 ratio and subsequent under-sampling. The annotation quality features derived from the
ClaimBuster dataset were not used for training the model chosen for submission.</p>
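        <p>The set-ups listed above could be assembled along the following lines. This is a minimal sketch under several assumptions that the text does not confirm: under-sampling is applied per language before concatenation, the re-split is a plain shuffle without stratification, only the new training portion is under-sampled, and the ratio of two negatives per positive is a placeholder. All function names and the toy data are hypothetical.</p>
        <preformat>
```python
import random


def random_undersample(examples, neg_per_pos, rng):
    """Keep all positives and at most neg_per_pos negatives per positive."""
    pos = [ex for ex in examples if ex["label"] == 1]
    neg = [ex for ex in examples if ex["label"] == 0]
    keep = min(len(neg), neg_per_pos * len(pos))
    return pos + rng.sample(neg, keep)


def build_training_set(by_lang, langs, neg_per_pos=2, seed=0):
    """Under-sample each selected language, then concatenate and shuffle."""
    rng = random.Random(seed)
    merged = []
    for lang in langs:
        merged.extend(random_undersample(by_lang[lang], neg_per_pos, rng))
    rng.shuffle(merged)
    return merged


def resplit(train, dev, train_share=0.8, seed=0):
    """Pool train and dev, shuffle, and cut into an 8:2 split."""
    rng = random.Random(seed)
    pooled = list(train) + list(dev)
    rng.shuffle(pooled)
    cut = int(len(pooled) * train_share)
    return pooled[:cut], pooled[cut:]


def toy(lang, n_pos, n_neg):
    """Made-up examples carrying only a language tag and a label."""
    rows = [{"lang": lang, "label": 1} for _ in range(n_pos)]
    return rows + [{"lang": lang, "label": 0} for _ in range(n_neg)]


by_lang = {"ar": toy("ar", 20, 80), "nl": toy("nl", 10, 15),
           "en": toy("en", 30, 200), "es": toy("es", 10, 50)}

arabic_setup = build_training_set(by_lang, ["ar", "nl", "en", "es"])
dutch_setup = build_training_set(by_lang, ["ar", "nl"])

# English: pool train and dev, re-split 8:2, then under-sample the new train part.
new_train, new_dev = resplit(by_lang["en"], toy("en", 5, 20))
english_setup = random_undersample(new_train, 2, random.Random(0))
```
        </preformat>
        <p>Swapping the order of under-sampling and concatenation would change the per-language class balance, so the order chosen here should be treated as one option rather than the procedure actually used.</p>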
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>Cross-lingual transfer learning allowed us to achieve an F1 score of 0.557 for Arabic, securing
second place on the leaderboard. For Dutch, the same approach achieved an F1 score of 0.590, which placed
only seventh in the competition. For English, we submitted predictions generated with a monolingual
model trained on a randomly under-sampled dataset and achieved an F1 score of 0.796, earning second
place on the leaderboard.</p>
      <p>The results of the conducted experiments shed light on the research questions.</p>
      <p>RQ1. What was the contribution to the final score of specific features of the dataset used to
create the best-performing method in the 2023 CheckThat! Lab Task 1b?</p>
      <p>The best results achieved in the CheckThat! Lab 2023 for English, using a ClaimBuster 1:2 dataset, can
be attributed to addressing the class imbalance problem rather than purely the quality of annotation.</p>
      <p>RQ2. How effective are multilingual pre-trained language models compared to monolingual
models?</p>
      <p>We demonstrated the efficacy of multilingual models in classification tasks. The results were
comparable to or better than those of dedicated monolingual models, even when fine-tuned on a single-language
training dataset.</p>
      <p>RQ3. How can cross-lingual transfer be leveraged to improve check-worthiness detection
using training data in multiple languages?</p>
      <p>In the case of the Arabic and Dutch subtasks, training on concatenated multilingual datasets led to
superior results. The English dataset, on its own, was sufficient to train the best model.</p>
      <p>RQ4. Is it possible to outperform random under-sampling with methods informed by
annotation quality or training dynamics?</p>
      <p>Although removing lower-quality examples did not improve the F1 score,
including the annotation quality feature in the under-sampling process has the potential to
outperform random under-sampling. An important limitation of annotation-quality
under-sampling is that it depends on the availability of a quality measure. As an alternative, we
proposed three methods that enhance under-sampling with measures calculated from model training
dynamics; none of them outperformed random under-sampling.</p>
      <p>Despite the failure of the training dynamics measures proposed in this paper, we believe that future
work should investigate other possibilities for defining measures to support the identification of
mislabeled examples to inform dataset balancing methods.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The research is supported by the project “OpenFact – artificial intelligence tools for verification of
veracity of information sources and fake news detection” (INFOSTRATEG-I/0035/2021-00), granted
within the INFOSTRATEG I program of the National Center for Research and Development, under the
topic: Verifying information sources and detecting fake news.</p>
      <p>[9] P. Tarannum, M. A. Hasan, F. Alam, S. R. H. Noori, Z-index at checkthat! 2023: Unimodal and multimodal checkworthiness classification, Working Notes of CLEF (2023).</p>
      <p>[10] A. Modzelewski, W. Sosnowski, A. Wierzbicki, Dshacker at checkthat! 2023: Check-worthiness in multigenre and multilingual content with gpt-3.5 data augmentation, Working Notes of CLEF (2023).</p>
      <p>[11] A. Jiang, A. Zubiaga, Cross-lingual offensive language detection: A systematic review of datasets, transfer approaches and challenges, 2024. arXiv:2401.09244.</p>
      <p>[12] R. Frick, I. Vogel, J. Choi, Fraunhofer sit at checkthat! 2023: enhancing the detection of multimodal and multigenre check-worthiness using optical character recognition and model souping, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF '2023, Thessaloniki, Greece, 2023.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Cross-Lingual Transfer Learning</title>
      <p>[Table residue; numeric values were not recovered. Models: mDeBERTa V3 base, XLM-RoBERTa base, RobBERT 2023 large, BERTje. Training-set language combinations: nl, en, ar+en, ar+nl, ar+nl(x3), en+nl, ar+es+nl, ar+es+nl(x3), ar+en+es+nl, ar+en+es+nl(x5). Recovered column headers: Original (max, mean), RUS (max, mean), RUS &amp; new split (max, mean).]</p>
    </sec>
    <sec id="sec-10">
      <title>B. Filtering by Quality and Correctness</title>
      <p>[Table residue; numeric values were not recovered. Recovered column header: RUS &amp; Correctness=5 (max, mean).]</p>
    </sec>
    <sec id="sec-11">
      <title>C. Additional Under-Sampling Methods</title>
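      <p>[Table residue; row and column assignments were not recovered. Headers: DeBERTa V3 base, DeBERTa V3 large; max and mean; dev_test and test. Recovered values, in extraction order: max, test: 0.795, 0.798, 0.8, 0.785, 0.797, 0.781; mean, dev_test: 0.865, 0.884, 0.9, 0.865, 0.882, 0.892; mean, test: 0.749, 0.749, 0.773, 0.731, 0.756, 0.764.]</p>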
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lourie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Dataset cartography: Mapping and diagnosing datasets with training dynamics</article-title>
          , arXiv preprint arXiv:2009.10795 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <article-title>Overview of the clef-2022 checkthat! lab task 1 on identifying relevant claims in tweets</article-title>
          ,
          <source>CLEF '</source>
          <year>2022</year>
          , Bologna, Italy,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          , G. Da San Martino, P. Nakov,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Azizov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Cheema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          , W. Zaghouani,
          <article-title>Overview of the clef-2023 checkthat! lab on checkworthiness, subjectivity, political bias, factuality, and authority of news articles and their source</article-title>
          , in: A.
          <string-name>
            <surname>Arampatzis</surname>
            , E. Kanoulas,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Vrochidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Giachanou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Aliannejadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Vlachos</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sawiński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Węcel</surname>
          </string-name>
          , E. Księżniak,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stróżyna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lewoniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stolarski</surname>
          </string-name>
          , W. Abramowicz, Openfact at checkthat! 2023:
          <article-title>Head-to-head gpt vs. bert - a comparative study of transformers language models for the detection of check-worthy claims</article-title>
          , in: Working Notes of CLEF 2023 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CLEF '
          <year>2023</year>
          , Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Frick</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Vogel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Nunes</given-names>
            <surname>Grieser</surname>
          </string-name>
          , Fraunhofer sit at checkthat!
          <year>2022</year>
          <article-title>: semi-supervised ensemble classification for detecting check-worthy tweets</article-title>
          , in: Working Notes of CLEF 2022-
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , CLEF '
          <year>2022</year>
          , Bologna, Italy,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aziz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chy</surname>
          </string-name>
          , Csecu-dsg at checkthat!
          <year>2023</year>
          <article-title>: transformer-based fusion approach for multimodal and multigenre check-worthiness</article-title>
          , Working Notes of CLEF (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Strauss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Williams</surname>
          </string-name>
          , Accenture at checkthat! 2023:
          <article-title>Identifying claims with societal impact using nlp data augmentation</article-title>
          , Working Notes of CLEF (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Sadouk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebbak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Zekiri</surname>
          </string-name>
          , Es-vrai at checkthat! 2023:
          <article-title>Analyzing checkworthiness in multimodal and multigenre (</article-title>
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>