<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mirela at CheckThat! 2024: Check-Worthiness of Tweets with Multilingual Embeddings and Adversarial Training</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mirela Dryankova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitar Dimitrov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Koychev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Preslav Nakov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mohamed bin Zayed University of Artificial Intelligence</institution>
          ,
          <addr-line>UAE</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sofia University “St. Kliment Ohridski”</institution>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Accurately assessing the credibility and significance of texts is crucial in today's digital age, where misinformation and disinformation abound, especially on social media. In this paper, we propose an approach to estimating the check-worthiness of tweets that integrates adversarial learning techniques to optimize classification accuracy and language identification simultaneously. We fine-tune DistilBERT-multilingual and XLM-RoBERTa-base for English, Dutch, Spanish, and Arabic to allow the models to adapt to the intricacies of different languages. Furthermore, we introduce an adversarial training approach to enhance the performance of multilingual sentence transformers, ensuring their effectiveness across linguistic contexts. The proposed approach ranks 4th in Dutch, 11th in Arabic, and 16th in English with an F1-score (positive class) of 0.65, 0.48, and 0.66, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Check-worthiness</kwd>
        <kwd>Misinformation</kwd>
        <kwd>Disinformation</kwd>
        <kwd>Social Media</kwd>
        <kwd>Multilingual Classification</kwd>
        <kwd>Sentence Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The rapid development of social media platforms in recent years has greatly accelerated information
dissemination. This advancement allows society to stay up-to-date with emerging news and follow the latest
trends, fostering a more informed and connected global community. For instance, people can access
real-time events, participate in online discussions and seminars, and engage with content from all over
the world, encouraging public sharing of opinions. Thus, social media have become one of the main
communication channels for information dissemination and consumption, and nowadays many people
rely on them as their primary source of news [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, the ease with which information can be
shared and the often unchecked nature of user-generated content have led to the wide and rapid spread
of false or misleading information, which can have negative societal consequences [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        This dual-edged situation highlights the need for effective fact-checking mechanisms to distinguish
between reliable and dubious sources. The CheckThat! Lab Task 1 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] at the Conference and Labs of the
Evaluation Forum (CLEF) 2024 focuses on developing models that automatically determine the
check-worthiness of tweets. This task is designed to assist fact-checkers by identifying tweets that contain potentially
false claims and could have a significant impact if left unchecked, thus streamlining the fact-checking process
and helping to mitigate the spread of misinformation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The task covers four languages (English, Spanish, Arabic, and Dutch), with data collected
from Twitter. We participated in the check-worthiness sub-task with a focus on Dutch, English, and
Arabic. For the submission phase, we proposed a multi-language text classification strategy that
incorporates language adversarial learning into the training process of sentence transformers
for Arabic and Dutch, and we ran BERT-base-uncased for English.</p>
      <p>
        Our approach mainly focuses on using a pre-trained DistilBERT-multilingual model, which is lighter
and faster at inference time, while also requiring a smaller computational training budget [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For
further experiments, we fine-tuned XLM-RoBERTa-base - a transformer-based multilingual masked
language model pre-trained on text in 100 languages, which obtains state-of-the-art performance on
cross-lingual classification, sequence labeling and question answering [6].
      </p>
      <p>The methodologies mentioned above are essential for automating the fact-checking process and addressing
misinformation in diverse linguistic contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Traditionally, fact-checking has been a manual process that relies heavily on human effort; some of
the leaders in the field are FactCheck.org<sup>1</sup>, Snopes<sup>2</sup>, PolitiFact<sup>3</sup>, and FullFact<sup>4</sup>. This meticulous work is
very time-consuming and labor-intensive, so the need to automate it emerged. Automated fact-checking
appeared as an approach where methods of Natural Language Processing (NLP) and Machine Learning
(ML) are used to assist experts in making these decisions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. One of the earlier efforts in this direction
is ClaimBuster [8], an end-to-end system that uses machine learning, natural language processing,
and database query techniques to aid in the process of fact-checking.
      </p>
      <p>Tarannum et al. [7] address the check-worthiness task by comparing traditional models
(Random Forest and SVM) with the transformer-based models BERT and XLM-RoBERTa. Their evaluation
shows that the transformer models (BERT-multilingual and XLM-RoBERTa-base) outperform SVM and
Random Forest for Dutch and English, whereas Random Forest yields better results for Spanish.</p>
      <p>In another study [9], Fraunhofer SIT took first place in CLEF-2023 CheckThat! Task 1A and second
place in CLEF-2023 CheckThat! Task 1B. To determine whether a claim in a tweet that contains both a
snippet of text and an image is worth fact-checking, they combine BERT with an OCR analysis. To
determine whether a text snippet from a political debate should be assessed for check-worthiness, the
team ran multiple experiments; the best approach for this task was an ensemble classification scheme
centered on model souping.</p>
      <p>Another interesting approach [10] compares GPT models with BERT models and uses zero-shot,
few-shot, and fine-tuning techniques in the context of the check-worthiness problem. The
authors report outperforming the winning model of CheckThat! Lab 2022 Task 1 by fine-tuning
DeBERTa v3 base.</p>
      <p>Additionally, other methods have been explored, such as Word2Vec embeddings [11], and many participants
have applied different machine learning methods such as k-nearest neighbors [12] and gradient boosting [13].</p>
      <p>All of the above methodologies are specifically focused on addressing the check-worthiness task for
English.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec>
        <title>3.1. Data</title>
        <p>The dataset used for the check-worthiness task is provided by the organizers of the CheckThat! Lab. The
training data is available in four languages (English, Spanish, Arabic, and Dutch), while the test datasets
are in English, Arabic, and Dutch. The number of training rows for English and Spanish is
higher than for Arabic and Dutch. The datasets contain the text, id, and link of each tweet, as well as a
class label ("Yes" / "No") indicating whether the text is worth fact-checking.</p>
        <p>Table 1 displays the distributions of the given datasets. As can be seen, the dataset suffers from class
imbalance [9]: within each split, the total number of "No" labels is
higher than the number of "Yes" labels.</p>
        <p><sup>1</sup> http://www.factcheck.org/
<sup>2</sup> http://www.snopes.com/fact-check/
<sup>3</sup> http://www.politifact.com/
<sup>4</sup> http://fullfact.org/</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Models</title>
        <p>We implemented a multilingual sentence classification system, designed to classify sentences across
multiple languages and ensure that the learned representations are independent of the language of
input sentences.</p>
        <p>We introduced two classes of models: Sentence Transformer, representing a basic sentence embedding
model, and Sentence Transformer Adversarial, extending the original model with adversarial training.
The key difference lies in incorporating an additional language classification component in the Sentence Transformer
Adversarial architecture, enabling language prediction of the input sentence in an adversarial manner.
Both models build on pre-trained transformer-based architectures, specifically DistilBERT-multilingual
and XLM-RoBERTa-base.</p>
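        <p>The exact adversarial formulation is not spelled out in this section, so the following is only a minimal sketch of one common way to set up language-adversarial training. It assumes a gradient-reversal-style objective in which the encoder minimizes the check-worthiness loss while being penalized when the language classifier succeeds. The function names and the adv_weight parameter are hypothetical and are not taken from the described system.</p>
        <preformat>
```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -math.log(softmax(logits)[target])


def encoder_objective(
    task_logits: Sequence[float],
    task_label: int,
    lang_logits: Sequence[float],
    lang_label: int,
    adv_weight: float = 0.1,  # hypothetical trade-off weight, not from the paper
) -> float:
    """Assumed encoder loss: fit the task, confuse the language classifier.

    The language classifier itself would minimise lang_loss; the encoder
    receives that loss with a negative sign, which is the effect a
    gradient-reversal layer has during backpropagation.
    """
    task_loss = cross_entropy(task_logits, task_label)
    lang_loss = cross_entropy(lang_logits, lang_label)
    return task_loss - adv_weight * lang_loss
```
</preformat>
        <p>Under this assumed objective, a language classifier that is reduced to uniform guessing gives the encoder the largest possible reduction in its loss, which pushes the sentence representations towards being language-independent.</p>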
        <p>The model configuration is set to output both attention and hidden states. The architecture (Figure 1)
includes a fully connected neural network to refine the representations for classification tasks. During
training, we employ cross-entropy loss for classification tasks and incorporate linear scheduling of
learning rates to stabilize training and improve convergence.</p>
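        <p>To illustrate what linear scheduling of the learning rate means in practice, the sketch below computes the learning rate at a given optimizer step. The optional warmup phase is an assumption (no warmup is mentioned in the text, so it defaults to zero), and the function name linear_lr is hypothetical.</p>
        <preformat>
```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5, warmup_steps: int = 0) -> float:
    """Learning rate at an optimizer step under linear warmup followed by linear decay."""
    if warmup_steps > step:
        # Warmup phase (assumed, off by default): ramp up linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: ramp down linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```
</preformat>
        <p>With the default arguments, the rate starts at the base value, reaches half of it midway through training, and reaches zero at the final step.</p>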
        <p>For evaluation, the key metric used is F1-score with respect to the positive class as proposed by
the organizers of the CheckThat! Lab. Furthermore, accuracy, precision, and recall with respect to
the positive and negative classes are also shown in this paper for a better understanding of model
performance.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Experiments</title>
        <p>All models are trained on a T4 GPU provided by Google Colab Pro. The T4 GPU offers significant computational
capabilities, 16 GB of memory, and CUDA cores, which are crucial for our experiments.</p>
        <p>DistilBERT-multilingual is trained for 5 epochs, while XLM-RoBERTa-base is trained for 3 epochs due to its higher
computational demands. The training and validation data are processed in batches of 32 and 16,
respectively. Additional hyperparameters include a learning rate of 2e-5, the Adam optimizer with an epsilon of 1e-8, and a dropout
rate of 0.1.</p>
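        <p>The hyperparameters above can be collected in a small configuration object, as in the following sketch. The class name, field names, and the Hugging Face checkpoint identifiers are assumptions made for illustration (the text does not state which exact checkpoints were used), and reading the value 1e-8 as the Adam epsilon is likewise an assumption.</p>
        <preformat>
```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported training hyperparameters."""

    model_name: str  # assumed checkpoint identifier, not stated in the paper
    epochs: int
    train_batch_size: int = 32
    eval_batch_size: int = 16
    learning_rate: float = 2e-5
    adam_epsilon: float = 1e-8  # assumed meaning of the reported "1e-8"
    dropout: float = 0.1

    def total_steps(self, num_train_examples: int) -> int:
        """Optimizer steps over all epochs, counting a final partial batch."""
        return math.ceil(num_train_examples / self.train_batch_size) * self.epochs


DISTILBERT = TrainConfig(model_name="distilbert-base-multilingual-cased", epochs=5)
XLMR = TrainConfig(model_name="xlm-roberta-base", epochs=3)
```
</preformat>
        <p>For a hypothetical training set of 1000 examples, DISTILBERT.total_steps(1000) gives 160 optimizer steps and XLMR.total_steps(1000) gives 96, which is the quantity a linear learning rate schedule needs in order to know when to reach zero.</p>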
        <p>For the official competition submission, we provided a Multilingual DistilBERT model, which
demonstrated promising results in the fact-checking task, particularly for Dutch. Initially, our approach
focused exclusively on English using BERT-base-uncased. Later, we extended our methodology with a
multilingual model. Due to time constraints, we ran only the Multilingual DistilBERT model for Arabic
and Dutch and the BERT-base-uncased model for English, and submitted the results. Following this initial
phase, we expanded our approach by incorporating the XLM-RoBERTa-base model. After the gold
labels were released at the end of the submission period, we evaluated all experiments and report the statistics.
The results on the test set for all models are presented in Table 2.</p>
        <p>Our original submission model achieved its highest F1-score over the positive
class for Dutch. For all languages, XLM-RoBERTa-base outperforms DistilBERT-multilingual, with the
greatest increase in positive-class F1-score on the Arabic dataset. Moreover, the
XLM-RoBERTa-base model generally provides a better balance between precision and recall, making it
slightly more reliable for identifying check-worthy tweets across multiple languages.</p>
        <p>The official results and ranking from the competition submission are presented in Table 3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we presented our experiments and insights gained from the check-worthiness
task at CheckThat! 2024. Our methodology employs state-of-the-art transformer models, enhanced
with adversarial training techniques, to improve fact-checking accuracy across multiple languages. By
incorporating language classification into the training process, we aim to make the models capable of
handling diverse linguistic inputs effectively. The proposed approach ranks 4th in Dutch, 11th in Arabic,
and 16th in English, with F1-scores (positive class) of 0.65, 0.48, and 0.66, respectively.
Overall, our study shows that XLM-RoBERTa-base
outperforms DistilBERT-multilingual for all languages in terms of F1-score over the positive class. The
superior performance of XLM-RoBERTa-base can be attributed to its larger size and more intricate
model architecture, which is capable of capturing complex linguistic patterns more effectively. Moreover,
XLM-RoBERTa-base benefits from extensive multilingual pretraining on a diverse corpus, enhancing its
ability to generalize across different languages. We also
observe that our approach achieves the best results in Dutch, which we attribute to the equal distribution of "Yes"
and "No" class labels in the Dutch training set. In all other languages, the negative labels outnumber the positive
labels by more than a factor of two.</p>
      <p>Further experiments can be conducted using larger versions of the discussed models or exploring
a larger hyperparameter space, which can potentially lead to better results. Due to resource constraints,
large transformer model architectures were not used in this research. Moreover, the current model
can be expanded by incorporating additional contextual features, enhancing its capability to capture
additional information from the input text and improve check-worthiness detection performance.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The work is partially financed by the European Union-NextGenerationEU, through the National
Recovery and Resilience Plan of the Republic of Bulgaria, project SUMMIT, No BG-RRP-2.004-0008.</p>
      <p>[6] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale (2019).</p>
      <p>[7] P. Tarannum, M. A. Hasan, F. Alam, S. R. H. Noori, Z-index at CheckThat! lab 2022:
Check-worthiness identification on tweet text (2022).</p>
      <p>[8] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A.
Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: the first-ever end-to-end fact-checking
system (2017).</p>
      <p>[9] R. A. Frick, I. Vogel, J.-E. Choi, Fraunhofer SIT at CheckThat! 2023: Enhancing the detection
of multimodal and multigenre check-worthiness using optical character recognition and model
souping (2023).</p>
      <p>[10] M. Sawiński, K. Węcel, E. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz,
OpenFact at CheckThat! 2023: Head-to-head GPT vs. BERT - a comparative study of transformers
language models for the detection of check-worthy claims (2023).</p>
      <p>[11] M. Z. Ullah, An ML model for predicting information check-worthiness using a variety of features
(2018).</p>
      <p>[12] B. Ghanem, M. Montes-y-Gómez, F. Rangel, P. Rosso, UPV-INAOE - check that: preliminary
approach for checking worthiness of claims, in: Working notes of CLEF 2018 - conference and
labs of the Evaluation Forum, Avignon, France (2018).</p>
      <p>[13] K. Yasser, M. Kutlu, T. Elsayed, bigIR at CLEF 2018: Detection and verification of check-worthy
political claims (2018).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Perrin</surname>
          </string-name>
          , Social media usage, Pew research center (
          <year>2015</year>
          )
          <fpage>52</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vladika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          ,
          <article-title>Scientific Fact-Checking: A survey of resources and approaches</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Weering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content</article-title>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamdan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Míguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beltrán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates</article-title>
          , in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum
          , CLEF '
          <year>2021</year>
          , Bucharest, Romania (online) (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          , Hugging Face
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>