SSN-NLP at CheckThat! 2024: Assessing the Check-Worthiness of Tweets and Debate Excerpts Using Traditional Machine Learning and Transformer Models

Sanjai Balajee Kannan Giridharan†, Sanjjit Sounderrajan†, B. Bharathi† and Nilu R. Salim†

Sri Sivasubramaniya Nadar College of Engineering, Chennai, Tamil Nadu, India

Abstract
The rapid spread of misinformation on social media necessitates efficient methods for determining whether claims in tweets or transcriptions warrant fact-checking. Traditional approaches rely on professional fact-checkers or human annotators, which are labor-intensive and time-consuming. This paper presents automated methods using machine learning and natural language processing to streamline check-worthiness estimation. We leveraged various techniques, including transformer models, to capture contextual nuances and improve prediction accuracy. Our work focused solely on the English language dataset, and our methods ranked 13th on the leaderboard. Our findings demonstrate the effectiveness of these automated methods, highlighting their potential to significantly enhance the efficiency of fact-checking systems and promote information integrity in digital communication.

Keywords: Check-Worthiness Estimation, Fact-Checking Automation, Natural Language Processing (NLP), PoS Tagging, Machine Learning Classifiers, Transformer Models

1. Introduction

In today's digital age, the rapid dissemination of information through social media platforms and online forums has led to an increase in the spread of misinformation and fake news. Addressing this challenge requires effective methods for identifying claims that warrant further investigation and fact-checking. Traditionally, the task of check-worthiness estimation has relied on professional fact-checkers or human annotators who assess the verifiability and potential harm of claims [1]. However, these manual processes are labor-intensive and time-consuming, highlighting the need for automated solutions.

Our research aims to automate the task of check-worthiness estimation by leveraging machine learning and natural language processing techniques. We utilize a multi-genre dataset comprising tweets and transcriptions to evaluate the effectiveness of different models across various linguistic and cultural contexts [2]. By employing advanced algorithms and transformer models, we aim to enhance the accuracy and efficiency of check-worthiness estimation.

In this paper, we provide a comprehensive overview of the check-worthiness estimation task, emphasizing its significance in combating misinformation and promoting information integrity. We discuss the methodologies employed in existing approaches, including traditional machine learning algorithms and transformer-based models, and propose avenues for future research and model development. Our contributions seek to advance automated fact-checking systems that can effectively identify and flag potentially misleading or false claims in textual content, fostering a more informed and trustworthy information ecosystem [3].

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
Email: sanjai2110173@ssn.edu.in (S. B. K. Giridharan); sanjjit2110378@ssn.edu.in (S. Sounderrajan); bharathib@ssn.edu.in (B. Bharathi); nilurs@ssn.edu.in (N. R. Salim)
Web: https://www.ssn.edu.in/staff-members/dr-b-bharathi/ (B. Bharathi); https://www.ssn.edu.in/staff-members/nilu-r-salim/ (N. R. Salim)
ORCID: 0000-0003-3078-5470 (S. B. K. Giridharan);
0009-0008-0247-0475 (S. Sounderrajan); 0000-0001-7279-5357 (B. Bharathi); 0000-0001-6619-7027 (N. R. Salim)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

The task of check-worthiness estimation has gained considerable attention, particularly through the CLEF CheckThat! lab series. These efforts address the challenges of identifying claims in social media and other textual sources that warrant fact-checking.

Nakov et al. [1] investigate the identification of check-worthy claims amidst the COVID-19 infodemic and the detection of fake news. Their research provides an in-depth examination of techniques for spotting misleading information on social media platforms. The CLEF-2022 CheckThat! lab introduced additional tasks and datasets aimed at enhancing the automatic identification and verification of claims [4]. This included Task 1, which focused on identifying relevant claims in tweets [2].

Advancements have been made by incorporating machine learning and deep learning models. For example, Support Vector Machines (SVM) [5] and Random Forest [6] have been utilized effectively in classification tasks. Passive-Aggressive Classifiers have shown promise in fake news detection [7].

Transformer-based models have significantly advanced check-worthiness estimation. BERT [8], RoBERTa [9], XLM [10], and DeBERTa [11] have demonstrated high effectiveness in understanding complex linguistic patterns. Additionally, ensemble learning techniques have been identified as promising for improving model performance by combining the strengths of various algorithms [12].

These works collectively highlight the importance of leveraging machine learning and natural language processing techniques to automate the detection of claims that warrant fact-checking, thus contributing to the broader effort of combating misinformation and enhancing the reliability of information disseminated through social media.

3. Experiment Setup

3.1. Dataset Description

In this study, we utilized a dataset encompassing four languages: English, Spanish, Arabic, and Dutch, as released by the CLEF CheckThat! organizers. The dataset comprises sentence IDs, text snippets extracted from tweets, debates, or speech transcriptions, and a class label indicating whether a claim is check-worthy (Yes) or not (No). Table 1 presents the distribution of the dataset across the four languages.

Table 1: Dataset distribution for English, Spanish, Arabic, and Dutch

Language   Label   Train + Dev   Dev-Test
English    YES     5,651         108
           NO      17,882        210
Spanish    YES     3,826         509
           NO      21,124        4,491
Arabic     YES     2,656         377
           NO      5,910         128
Dutch      YES     509           318
           NO      1,202         577

However, our research focuses exclusively on the English subset of the dataset.

3.2. Dataset Preprocessing

In this section, we describe the data preprocessing steps undertaken to prepare the dataset for training and evaluation. The preprocessing pipeline consists of several key stages, including text cleaning, tokenization, stopword removal, punctuation removal, URL removal, spelling correction, part-of-speech (POS) tagging, dependency parsing, and feature extraction.

3.3. Feature Extraction Methods

The feature extraction process begins with the linguistic analysis of text data using natural language processing (NLP) techniques. Initially, the input sentences are subjected to part-of-speech (POS) tagging and dependency parsing. POS tagging assigns grammatical categories to each word, distinguishing between parts of speech such as nouns, verbs, and adjectives, while dependency parsing reveals the syntactic relationships between words, delineating the structure of the sentence through dependencies like subject-verb or modifier-modified relationships.
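To make these two steps concrete, the following sketch shows how POS-tag and dependency-label distributions can be turned into fixed-length feature vectors. It is a minimal illustration rather than our exact implementation: it assumes spaCy with the en_core_web_sm model, a simplified cleaning step, and illustrative tag and label inventories.

# Sketch of the syntactic feature-extraction step (illustrative, not the exact system code).
# Assumes spaCy and its small English model:
#   pip install spacy && python -m spacy download en_core_web_sm
import re
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

# Fixed inventories so that every sentence maps to a vector of the same length.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "PROPN", "NUM", "ADP", "AUX", "DET"]
DEP_LABELS = ["nsubj", "dobj", "amod", "advmod", "prep", "pobj", "aux", "det", "ROOT", "conj"]

def clean(text):
    """Simplified cleaning: strip URLs and user mentions, collapse whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def syntactic_features(text):
    """Return normalized POS-tag and dependency-label counts for one sentence."""
    doc = nlp(clean(text))
    n = max(len(doc), 1)
    pos_counts = Counter(tok.pos_ for tok in doc)
    dep_counts = Counter(tok.dep_ for tok in doc)
    return [pos_counts[t] / n for t in POS_TAGS] + [dep_counts[d] / n for d in DEP_LABELS]

print(syntactic_features("The senator claimed that unemployment fell by 5% last year."))

In the full pipeline (Figure 2), vectors of this kind are concatenated with Sentence-BERT embeddings and then scaled and reduced with PCA before being passed to the classifiers.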
Figure 1 shows the distribution of the top 10 dependency relations observed in the data.

Figure 1: Distribution of the Top 10 Dependency Relations

Subsequently, these syntactic analyses are leveraged to extract relevant features capturing the linguistic structure of the text. Feature extraction involves converting the sequences of POS tags and dependency labels into meaningful representations. This process encompasses aggregating POS tags into vectors, capturing the distribution of different grammatical categories in the text, and encoding dependency relationships into feature vectors, emphasizing crucial syntactic dependencies.

Finally, the feature vectors are combined with sentence embeddings generated using a pre-trained transformer model such as Sentence-BERT. These embeddings capture the semantic content of the text at the sentence level, enabling the extraction of high-level semantic features. This combination allows for a richer and more nuanced understanding of the textual data. The combined feature representation undergoes data scaling to normalize the feature values and dimensionality reduction using principal component analysis (PCA) to reduce computational complexity and potentially enhance model performance. The resulting reduced-dimensional feature vectors serve as input to the machine learning models. Figure 2 illustrates the feature extraction pipeline.

Figure 2: Feature Extraction Pipeline

3.4. Basic ML Models

In our experiment, we utilized various machine learning models with hyperparameters optimized using GridSearchCV. The models and their respective hyperparameters are as follows:

• Support Vector Machine (SVM): C = 100, γ = 0.02, kernel = rbf.
• Random Forest Classifier: n_estimators = 300.
• Logistic Regression: C = 0.1, solver = liblinear.
• XGBoost Classifier: learning_rate = 0.1, max_depth = 6, n_estimators = 1000.
• CatBoost Classifier: depth = 5, learning_rate = 0.05, iterations = 1000.
• K-Nearest Neighbors (KNN): n_neighbors = 11, metric = euclidean.
• Passive Aggressive Classifier: C = 0.01.

3.5. Transformer Models

In our experiment, we utilized several transformer models to evaluate their performance on our task.

BERT-base-uncased is the original BERT model developed by Google in 2018. It is a pre-trained language model that uses a multi-layer bidirectional transformer encoder to generate contextualized representations of words in a sentence.

RoBERTa-base is another variant of BERT developed by Facebook AI in 2019. It improves upon the original BERT model by using a different pre-training objective and a larger dataset.

XLM-RoBERTa-base is a multilingual version of RoBERTa released by Facebook AI in 2020. It is trained on a large corpus of text in multiple languages and can be fine-tuned for specific NLP tasks in any of these languages.

DeBERTa-v3-base is a variant of BERT developed by researchers from Microsoft in 2021. It enhances BERT by using disentangled attention and ELECTRA-style pre-training, improving efficiency and performance on downstream tasks.

Among these models, DeBERTa-v3-base demonstrated slightly better performance compared to the others.
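As a rough illustration of how these models were applied, the sketch below fine-tunes DeBERTa-v3-base with the Hugging Face Transformers Trainer and evaluates it with the macro-averaged F1-score, the official task metric (Section 4). The CSV file names, the label encoding (0 = No, 1 = Yes), and the number of epochs are assumptions made for the example; the maximum sequence length, batch size, and learning rate follow Table 2 below.

# Illustrative fine-tuning sketch (not our exact training script).
# Assumes the Hugging Face transformers and datasets libraries, scikit-learn,
# and CSV files with "text" and "label" columns (the file names are hypothetical).
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "microsoft/deberta-v3-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

raw = load_dataset("csv", data_files={"train": "train_english.csv", "dev": "dev_english.csv"})

def tokenize(batch):
    # Truncate/pad to the maximum sequence length from Table 2.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = raw.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Official task metric: F1 per class, averaged over the two classes.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="deberta_checkworthiness",
    per_device_train_batch_size=32,   # batch size from Table 2
    learning_rate=2e-5,               # learning rate from Table 2
    num_train_epochs=3,               # assumption: the epoch count is not reported in the paper
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["dev"],
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())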
Table 2 lists the maximum sequence length, batch size, and learning rate used for each transformer model in our experiments.

Table 2: Transformer models and hyperparameters

Model                             Max. Sequence Length   Batch Size   Learning Rate
BERT (bert-base-uncased)          128                    32           2e-5
RoBERTa (roberta-base)            128                    32           1e-5
XLM-RoBERTa (xlm-roberta-base)    128                    32           3e-5
DeBERTa (deberta-v3-base)         128                    32           2e-5

4. Results

The macro-average F1-score was employed as the official evaluation metric for the challenge. This metric calculates the F1-score for each class individually and then computes the average of these scores. The organizers provided three splits for each language: training, dev-test, and test. All models were trained using only the training split; refer to the previous section for the hyperparameters used during training. Table 3 summarizes the macro F1-scores achieved by the models on the dev-test set.

Table 3: Macro F1-scores of the models on the dev-test set

Model                            F1-score
Majority Baseline                0.000
Random Baseline                  0.462
N-gram Baseline                  0.599
SVM                              0.597
Random Forest Classifier         0.530
Logistic Regression              0.610
XGBoost Classifier               0.606
CatBoost Classifier              0.614
KNN                              0.540
Passive Aggressive Classifier    0.641
MLP Classifier                   0.410
BERT-base-uncased                0.851
RoBERTa-base                     0.843
XLM-RoBERTa-base                 0.847
DeBERTa-v3-base                  0.876

Overall, the transformer models outperformed the traditional machine learning algorithms, with DeBERTa-v3-base achieving the highest F1-score.

5. Conclusion

In this work, we present our participation in CLEF 2024 CheckThat! lab Task 1: Check-Worthiness Estimation in Text. Our evaluation showed the superiority of transformer models over traditional machine learning algorithms, as measured by the macro-average F1-score, highlighting the importance of advanced transformer-based approaches for natural language processing tasks. Future research could explore fine-tuning strategies and alternative architectures to improve performance.

Our team, SSN-NLP, ranked 13th out of 27 teams on the leaderboard with a macro F1-score of 0.706, using the BERT-base-uncased model for our official submission. The leaderboard result is summarized in Table 4.

Table 4: Leaderboard score

Team Name   Model               F1-score
SSN-NLP     BERT-base-uncased   0.706

6. Perspectives for Future Work

• Ensemble Methods: Combining multiple models to leverage their individual strengths can potentially lead to higher accuracy and robustness. Future work could explore various ensemble techniques, such as stacking, boosting, or voting, to improve overall performance; a minimal voting sketch follows this list.
• Larger and More Diverse Datasets: The availability of larger and more diverse datasets can significantly impact the generalizability of the models. Future studies should aim to collect and utilize datasets that encompass a wider range of topics, languages, and cultural contexts to train more versatile and robust models.
• Cross-Lingual and Cross-Domain Transfer Learning: Exploring transfer learning techniques to adapt models trained on one language or domain to other languages or domains can broaden the applicability of check-worthiness estimation models.
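As a concrete starting point for the first of these directions, the sketch below assembles a soft-voting ensemble from three of the classifiers described in Section 3.4, reusing their hyperparameters. This combination was not evaluated in our experiments; it only illustrates how such an ensemble could be built with scikit-learn and CatBoost.

# Hypothetical soft-voting ensemble over classifiers from Section 3.4 (not evaluated in this work).
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(C=100, gamma=0.02, kernel="rbf", probability=True)),
        ("logreg", LogisticRegression(C=0.1, solver="liblinear")),
        ("catboost", CatBoostClassifier(depth=5, learning_rate=0.05, iterations=1000, verbose=0)),
    ],
    voting="soft",  # average the predicted class probabilities of the three models
)

# Tiny synthetic demo; in practice X and y would be the PCA-reduced feature vectors and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))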
References

[1] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[2] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF '2022, Bologna, Italy, 2022.
[3] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, T. Mihaylova, A context-aware approach for detecting worth-checking claims in political debates, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 2017, pp. 267–276.
[4] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: Proceedings of the 13th International Conference of the CLEF Association: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization, CLEF '2022, Bologna, Italy, 2022.
[5] F. Rossi, N. Villa, Support vector machine for functional data classification, Neurocomputing 69 (2006) 730–742.
[6] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (2002) 18–22.
[7] S. Gupta, P. Meel, Fake news detection using passive-aggressive classifier, in: Inventive Communication and Computational Technologies: Proceedings of ICICCT 2020, Springer Singapore, 2021, pp. 155–164.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[10] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1911.02116 (2019).
[11] P. He, X. Liu, J. Gao, DeBERTa: Decoding-enhanced BERT with disentangled attention, arXiv preprint arXiv:2111.09543 (2021).
[12] S. Salvador, R. Francisco, J. Carmen, H. Olga, Ensemble learning: Insights for machine learning ensemble methods, in: Proceedings of CEUR Workshop, 2023, pp. 251–264.