<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">SemanticCuetSync at CheckThat! 2024: Pre-trained Transformer-based Approach to Detect Check-Worthy Tweets Notebook for the CheckThat! Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Symom</forename><forename type="middle">Hossain</forename><surname>Shohan</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Chittagong University of Engineering and Technology</orgName>
								<address>
									<postCode>Chattogram -4349</postCode>
									<country key="BD">Bangladesh</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Md</forename><forename type="middle">Sajjad</forename><surname>Hossain</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Chittagong University of Engineering and Technology</orgName>
								<address>
									<postCode>Chattogram -4349</postCode>
									<country key="BD">Bangladesh</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ashraful</forename><forename type="middle">Islam</forename><surname>Paran</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Chittagong University of Engineering and Technology</orgName>
								<address>
									<postCode>Chattogram -4349</postCode>
									<country key="BD">Bangladesh</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jawad</forename><surname>Hossain</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Chittagong University of Engineering and Technology</orgName>
								<address>
									<postCode>Chattogram -4349</postCode>
									<country key="BD">Bangladesh</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Shawly</forename><surname>Ahsan</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Chittagong University of Engineering and Technology</orgName>
								<address>
									<postCode>Chattogram -4349</postCode>
									<country key="BD">Bangladesh</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mohammed</forename><forename type="middle">Moshiul</forename><surname>Hoque</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Chittagong University of Engineering and Technology</orgName>
								<address>
									<postCode>Chattogram -4349</postCode>
									<country key="BD">Bangladesh</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">SemanticCuetSync at CheckThat! 2024: Pre-trained Transformer-based Approach to Detect Check-Worthy Tweets Notebook for the CheckThat! Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D4C06F18C59DF4DD07B577AE51579273</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Natural Language Processing</term>
					<term>Check-Worthiness</term>
					<term>Fact-checking</term>
					<term>Tweet-Verification</term>
					<term>Transformers</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents an intelligent technique for classifying English, Arabic, and Dutch texts as checkworthy, harnessing the power of BERT-based models. The study explores ten baseline models: LR, MNB, SVM, CNN+LSTM, CNN+BiLSTM, BERT-Base-Uncased, RoBERTa, AraBERTv2, Dutch-RoBERTa, and Dutch-BERT. It also investigates a few-shot learning framework, SetFit, to identify checkworthy tweets or texts. Evaluation results demonstrate the superiority of transformer-based models, with RoBERTa achieving the highest F1 score of 75.82% for English tweets, Dehate-BERT scoring 52.55% for Arabic texts, and Dutch-BERT obtaining a maximum score of 58.42% for Dutch texts. Our team ranked 6th overall for English, 5th for Arabic, and 16th for Dutch in the shared task challenge.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Checkworthy content refers to information that must be confirmed for accuracy, as it may have the potential to shape the opinions and decisions of others. The rise of social networks has led to an exponential growth in textual data on the internet, sometimes resulting in the spread of false claims that can be detrimental to society if left unaddressed. These claims can include political, religious, and health-related misinformation, which can cause discord in society. Fact-checking is a time-consuming task that requires extensive research, identification, verification, and expert analysis. Automating this entire process is a significant challenge, and the first step towards this goal is to determine whether the information is worth checking in the first place.</p><p>With the proliferation of communication and social media platforms, such as Facebook, Twitter, and Reddit, the dissemination of false information has become increasingly prevalent. A recent study has suggested that people struggle to differentiate facts from false news <ref type="bibr" target="#b0">[1]</ref>. Intelligent technologies can be used to support human fact-checkers to identify claims worth fact-checking <ref type="bibr" target="#b1">[2]</ref>. Many studies have been devoted to developing a fully automated system for fact-checking <ref type="bibr" target="#b2">[3]</ref>, <ref type="bibr" target="#b3">[4]</ref>, <ref type="bibr" target="#b4">[5]</ref>, <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b6">[7]</ref>. As social media data continues to expand daily, it is impractical to monitor everything efficiently by human experts. Therefore, developing an automatic system has emerged as the ultimate solution to this problem. This work proposes a solution to classify English, Arabic, and Dutch texts or tweets as checkworthy, harnessing the power of BERT-based approaches. 
The critical contributions of this study are:</p><p>• Introducing a fine-tuned transformer-based model to classify checkworthy texts for three languages (English, Arabic, and Dutch). • Exploring various machine learning (LR, SVM, and MNB), deep learning (CNN, CNN + LSTM, and CNN + BiLSTM), and transformer-based models for finding a suitable method for detecting checkworthy texts in multiple languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Inaccurate news spreads quickly throughout social media, making it crucial to check the authenticity of any post that surfaces there. Intelligent fact-checking systems have emerged as a significant area of research to tackle this problem. Trustworthiness detection spans several domains, such as digital scams <ref type="bibr" target="#b7">[8]</ref>, the healthcare sector <ref type="bibr" target="#b8">[9]</ref>, politics <ref type="bibr" target="#b9">[10]</ref>, and many more. An overview of Task 1 in the fourth edition of the CheckThat! lab was provided by Shaar et al. <ref type="bibr" target="#b10">[11]</ref>; their task was to anticipate which tweets involving politics and COVID-19 needed to be verified. Williams et al. <ref type="bibr" target="#b11">[12]</ref> presented a transformer-based solution with data augmentation for this problem, which received a mAP (mean average precision) of 0.66 for Arabic. Check-worthiness in multimodal settings <ref type="bibr" target="#b12">[13]</ref> is another popular research area. Beyond unimodal approaches, Sadouk et al. <ref type="bibr" target="#b13">[14]</ref> proposed a multimodal transformer-based model (BERT+ResNet50) to identify check-worthiness in English, which recorded an F1 score of 0.71, while a transformer-based model (MarBERT) with downsampling recorded an F1 score of 0.61 for Arabic on an image dataset. Meanwhile, Ivanov et al. <ref type="bibr" target="#b14">[15]</ref> used audio datasets from past political debates and ensemble techniques for detecting check-worthiness. Their audio model (wav2vec2.0) received a mAP of 0.34 when extra noise was eliminated, and ensembles combining BERT and an audio model outperformed BERT alone, with a mAP of 0.38. This work addresses a significant gap in the existing literature by comprehensively comparing machine learning (ML), deep learning (DL), and transformer-based solutions. In addition, it investigates the use of few-shot models like SetFit for determining check-worthiness in Dutch, Arabic, and English, improving the understanding of these models' performance across the three languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset and Task Description</head><p>The dataset consists of tweets or texts in English, Arabic, and Dutch, along with their corresponding labels ('Yes' for texts worth checking, 'No' otherwise). Table <ref type="table" target="#tab_0">1</ref> shows the distribution of the train, dev, dev-test, and test sets. We trained all models on the training set and evaluated their performance on the test set. The CLEF 2024 CheckThat! Lab <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref> consists of six tasks <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23]</ref>. We participated in Task-1 <ref type="bibr" target="#b18">[19]</ref>, which focuses on assessing whether a claim in a tweet or transcription requires further investigation for fact-checking. The traditional approach for such decisions involves human experts, either professional fact-checkers or annotators, who evaluate the claim based on various criteria. Table <ref type="table" target="#tab_1">2</ref> illustrates an example of training data for the different languages.</p></div>
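The total-word (TW) and unique-word (UW) statistics reported in Table 1 can be approximated with simple whitespace tokenization; the sketch below is illustrative only, since the exact tokenization behind the reported counts is not stated.

```python
def word_stats(texts):
    """Total and unique whitespace-separated tokens across a corpus."""
    total = 0
    vocab = set()
    for text in texts:
        tokens = text.split()
        total += len(tokens)
        vocab.update(tokens)
    return total, len(vocab)

# Toy corpus standing in for the tweet texts
sample = ["the vaccine is safe", "the vaccine rollout is slow"]
print(word_stats(sample))  # (9, 6): 9 total tokens, 6 unique
```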
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">System Overview</head><p>This work exploited various ML, DL, and transformer-based approaches across all three languages. ML-based techniques include logistic regression (LR), support vector machine (SVM), and multinomial naive Bayes (MNB). DL-based techniques involve CNN, CNN+long short-term memory (LSTM), and CNN+bidirectional LSTM (BiLSTM). Lastly, various BERT-based transformers are fine-tuned for each language. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the schematic process of checkworthy text detection. Textual Feature Extraction: Textual feature extraction is one of the essential steps in natural language processing; it transforms raw textual data into numerical representations that the models can understand and process. A Count Vectorizer is used for the ML models examined in this work: it is a widely used feature-extraction technique that transforms text data into a matrix of token counts. In the DL models, tokenization and padding combine to convert raw texts into structured numerical data, which is then passed through an embedding layer that captures higher-level features such as semantic relationships. This study uses the embedding layer instead of Word2Vec <ref type="bibr" target="#b23">[24]</ref> or GloVe <ref type="bibr" target="#b24">[25]</ref> to allow the model to learn task-specific embeddings during training. Finally, BERT-based tokenizers are employed for the transformer-based models to exploit the BERT architecture. ML Models: Various ML models are examined in this work, such as LR, SVM, MNB, KNN, and RF. The hyperparameter settings for these models are illustrated in Table <ref type="table" target="#tab_2">3</ref>.</p></div>
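As a concrete illustration of the Count Vectorizer step described above, the following sketch uses scikit-learn's CountVectorizer with default settings; the paper does not report the exact vectorizer configuration, so treat this as an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "COVID-19 vaccines reduce hospitalization",
    "this coffee tastes great",
]
vectorizer = CountVectorizer()        # default settings; actual ones unreported
X = vectorizer.fit_transform(corpus)  # sparse matrix of token counts
print(X.shape)                        # (n_documents, n_vocabulary_terms)
```

The resulting count matrix is what the ML classifiers in Table 3 are trained on.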
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CNN:</head><p>This work employed a CNN model comprising an embedding layer with an output dimension of 200. The model features two Conv1D layers with 64 and 128 filters, respectively; both layers use a kernel size of 2 and ReLU activation. For downsampling, the model incorporates a GlobalMaxPooling1D layer. Subsequently, a dense layer with 128 units and ReLU activation is followed by a dropout layer with a rate of 0.5 to prevent overfitting. The output layer has a single unit with sigmoid activation. The model utilizes the 'binary_crossentropy' loss function and the 'Nadam' optimizer and was trained with a batch size of 32 for three epochs.</p><p>CNN+LSTM: The CNN+LSTM model used in this work has almost the same architecture as the CNN model, incorporating a single LSTM layer comprising 64 units and a dropout rate of 0.2 for sequence modeling. Furthermore, the dense layer in this design features 64 units and utilizes the ReLU activation function. The remaining hyperparameter configurations are consistent with those employed in the CNN model.</p><p>CNN+BiLSTM: This model has an architecture similar to the CNN+LSTM model but replaces the LSTM with a bidirectional LSTM.</p><p>Transformer models for English: This study fine-tuned three transformer-based models on the English dataset: BERT-Base-Uncased <ref type="bibr" target="#b25">[26]</ref>, SetFit <ref type="bibr" target="#b26">[27]</ref>, and RoBERTa <ref type="bibr" target="#b27">[28]</ref>. The necessary text preprocessing steps were performed before feeding the data into the transformers: lowercasing, emoji removal, stop word removal, stemming, contraction expansion, simple Unicode spelling correction, and HTML tag removal. For stop word removal, the NLTK stopwords list was used. The main goal of the text preprocessing was to reduce noise in the dataset and focus on meaningful words. BERT-Base-Uncased is a pre-trained transformer model with exceptional performance across various natural language processing (NLP) tasks, and it demonstrated satisfactory performance on the specified task. SetFit, on the other hand, leverages pre-trained transformers with limited labeled data; we explored the potential of this few-shot learning framework for the given task. Unlike LLMs, SetFit does not require manual prompts for classification. Finally, RoBERTa, an optimized version of BERT, outperformed the other models on this task.</p><p>Transformer models for Arabic: This study also fine-tuned three transformer-based models on the Arabic dataset: AraBERTv2 <ref type="bibr" target="#b28">[29]</ref>, SetFit (few-shot) <ref type="bibr" target="#b26">[27]</ref>, and Dehate-BERT <ref type="bibr" target="#b29">[30]</ref>. Similar to the English dataset, text preprocessing was performed: lowercasing, emoji removal, stop word removal, stemming, contraction expansion, simple Unicode spelling correction, HTML tag removal, punctuation removal, URL removal, whitespace removal, and number removal. Stemming was performed using ArabicLightStemmer, and normalization was applied to convert similar characters to a standard form.</p><p>AraBERTv2 is an improved version of AraBERT that leverages the BERT architecture. This model was trained on a sizeable Arabic dataset and has demonstrated effectiveness in various downstream NLP tasks, including sentiment analysis, NER, and Arabic question answering. Dehate-BERT is a pre-trained transformer model primarily designed for hate speech detection, and it outperformed all other models on the Arabic data for this task.</p><p>Transformer models for Dutch: This study investigated Dutch-RoBERTa <ref type="bibr" target="#b30">[31]</ref>, SetFit, and Dutch-BERT on the Dutch dataset. Rather than undertaking extensive text preprocessing, this work limited its processing to removing non-Dutch characters from the texts. Table <ref type="table" target="#tab_3">4</ref> illustrates the hyperparameters of the transformer-based models.</p></div>
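The CNN architecture described in this section maps directly onto Keras layers; the sketch below follows the stated configuration, with the vocabulary size and padded sequence length as placeholders, since neither is reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # placeholder: not reported in the paper
MAX_LEN = 64        # placeholder: padded sequence length

model = models.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 200),            # embedding dim 200
    layers.Conv1D(64, kernel_size=2, activation="relu"),
    layers.Conv1D(128, kernel_size=2, activation="relu"),
    layers.GlobalMaxPooling1D(),                  # downsampling
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                          # against overfitting
    layers.Dense(1, activation="sigmoid"),        # binary check-worthiness output
])
model.compile(loss="binary_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, epochs=3)
```

The CNN+LSTM and CNN+BiLSTM variants would insert an LSTM (or Bidirectional LSTM) layer before pooling and shrink the dense layer to 64 units, per the descriptions above.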
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Analysis</head><p>Table <ref type="table" target="#tab_4">5</ref> presents an in-depth analysis of the performance of the ML, DL, and transformer-based models on the English, Arabic, and Dutch test sets.</p><p>In evaluating the English data, logistic regression (LR) achieved a precision of 75.00%, recall of 37.50%, and F1-score of 50.00%. Support vector machine (SVM) emerged as the top-performing ML model, with a precision of 68.97%, recall of 45.45%, and the highest ML F1-score of 54.79%. Multinomial naive Bayes (MNB) showed a precision of 64.81%, recall of 39.77%, and F1-score of 49.29%. Among the DL models, CNN+BiLSTM demonstrated the best performance, with a precision of 66.10%, recall of 44.32%, and F1-score of 53.06%. Among the transformers, RoBERTa showcased remarkable performance, achieving a precision of 89.23%, recall of 65.91%, and the highest overall F1-score of 75.82%. The SetFit model achieved a precision of 52.34%, recall of 63.64%, and an F1-score of 57.44%, comparable to well-known models such as SVM and LR.</p><p>The Arabic evaluations revealed that LR achieves a precision of 38.52%, recall of 21.55%, and F1-score of 27.65%. SVM performed best among the ML models, with a precision of 40.57%, recall of 32.57%, and the highest ML F1-score of 36.13%. MNB showed a precision of 36.00%, recall of 24.77%, and F1-score of 29.35%. Among the DL models, CNN+BiLSTM showed the best performance, with a precision of 34.27%, recall of 27.98%, and F1-score of 30.81%. Among the transformers, Dehate-BERT emerged as the top-performing model, with a precision of 40.24%, recall of 75.69%, and the highest F1-score of 52.55%. The SetFit model attained a precision of 37.75%, recall of 69.26%, and an F1-score of 48.86%. Although SetFit does not match the performance of the leading transformer-based model, Dehate-BERT, it still demonstrates potential for tackling Arabic classification problems.</p><p>For the Dutch evaluations, LR achieves a precision of 50.98%, recall of 32.75%, and F1-score of 39.88%. SVM showed a precision of 43.86%, recall of 37.78%, and the highest ML F1-score of 40.60%. MNB attained a precision of 46.91%, recall of 9.57%, and F1-score of 15.89%. Among the DL models, CNN+BiLSTM demonstrated the best performance, with a precision of 52.06%, recall of 44.58%, and F1-score of 48.03%. In the transformer category, Dutch-BERT outperformed the others, with a precision of 48.40%, recall of 73.80%, and the highest F1-score of 58.42%. The SetFit model exhibited a precision of 45.31%, recall of 58.44%, and an F1-score of 51.05%, showing competitive results compared to transformer-based models such as Dutch-BERT and indicating its possible usefulness in Dutch classification tasks.</p><p>In general, transformer-based models surpass both the ML and DL models across all three languages, demonstrating the effectiveness of pre-trained language models in numerous natural language processing applications. Additionally, within each category, specific models show superior performance, emphasizing the necessity of choosing the appropriate model for the specific task and language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Error Analysis</head><p>A comprehensive quantitative and qualitative error analysis is conducted to provide detailed insights into the proposed model's performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Quantitative Analysis</head><p>Figure <ref type="figure" target="#fig_1">2</ref> illustrates the confusion matrices of the best-performing models for English, Arabic, and Dutch.</p><p>Out of the 341 test cases in English, RoBERTa demonstrates strong performance in identifying the positive class, with 57 true positives and only 6 false positives, indicating high precision: the model is highly accurate when it predicts "Yes". Additionally, with 247 true negatives, it correctly identifies most negative instances. However, there are 31 false negatives, meaning some positive instances are missed. Overall, RoBERTa shows a balanced approach with notable proficiency in minimizing incorrect positive predictions, resulting in the highest F1 score of 75.82%.</p></div>
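The precision, recall, and F1 values discussed here follow directly from confusion-matrix counts; below is a minimal sketch with illustrative counts (not necessarily the exact cells of Figure 2, so small rounding differences from the reported scores are expected).

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf1(tp=57, fp=6, fn=31)  # illustrative counts
print(f"P={p:.2%} R={r:.2%} F1={f1:.2%}")
```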
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Qualitative Analysis</head><p>Table <ref type="table">6</ref> presents some actual labels (AL) and predicted labels (PL) of the developed models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>A few predictions with actual and predicted labels.</p><p>It is clear that the models accurately predicted the labels for examples 2, 3, and 5 but made errors on examples 1 and 4. For the first example, the sentence's intent is ambiguous, leading to an incorrect label prediction by the model. In the case of example 4, although the sentence is checkworthy, the model mislabeled it due to inadequate training data in the Dutch language, which hindered proper learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>This work investigated various ML, DL, and transformer-based models for identifying checkworthy tweets or texts in English, Arabic, and Dutch. The results indicate that transformer-based models excel at this task, exhibiting exceptional capability in detecting checkworthy text. Specifically, RoBERTa performs best for English, Dehate-BERT for Arabic, and Dutch-BERT for Dutch, achieving the highest F1 scores of 75.82%, 52.55%, and 58.42%, respectively. The study recommends that further advancements be made by increasing the training data and incorporating advanced LLMs and GPT models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Schematic process for check-worthy text detection.</figDesc><graphic coords="3,72.00,413.08,451.28,303.54" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Confusion matrix of RoBERTa, Dehate-BERT and Dutch-BERT model.</figDesc><graphic coords="7,72.00,65.60,451.27,125.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Dataset statistics for Task-1, where TW stands for total words and UW stands for unique words.</figDesc><table><row><cell>Language</cell><cell>Train</cell><cell>Dev</cell><cell>Dev-Test</cell><cell>Test</cell><cell>Total</cell><cell>TW</cell><cell>UW</cell></row><row><cell>English</cell><cell>22501</cell><cell>1032</cell><cell>318</cell><cell>341</cell><cell>24192</cell><cell>432903</cell><cell>11605</cell></row><row><cell>Arabic</cell><cell>7333</cell><cell>1093</cell><cell>500</cell><cell>610</cell><cell>9536</cell><cell>251619</cell><cell>50001</cell></row><row><cell>Dutch</cell><cell>995</cell><cell>252</cell><cell>666</cell><cell>1000</cell><cell>2913</cell><cell>36062</cell><cell>8240</cell></row><row><cell>Total</cell><cell>30829</cell><cell>2377</cell><cell>1484</cell><cell>1951</cell><cell>36641</cell><cell>720584</cell><cell>69846</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Task-1 sample with text and label.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Parameters of the employed ML models.</figDesc><table><row><cell>Classifier</cell><cell>Parameter</cell><cell>Value</cell></row><row><cell>LR</cell><cell>solver</cell><cell>lbfgs</cell></row><row><cell>LR</cell><cell>max_iter</cell><cell>20000</cell></row><row><cell>MNB</cell><cell>alpha</cell><cell>1.0</cell></row><row><cell>MNB</cell><cell>fit_prior</cell><cell>False</cell></row><row><cell>SVM</cell><cell>kernel</cell><cell>linear</cell></row><row><cell>SVM</cell><cell>gamma</cell><cell>auto</cell></row></table></figure>
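The Table 3 settings correspond to standard scikit-learn estimators; the following is a hedged sketch pairing them with count-vector features on toy stand-in data (the actual training corpus and any additional settings are not reproduced here).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Classifiers configured with the parameters listed in Table 3
models = {
    "LR": LogisticRegression(solver="lbfgs", max_iter=20000),
    "MNB": MultinomialNB(alpha=1.0, fit_prior=False),
    "SVM": SVC(kernel="linear", gamma="auto"),
}

# Toy stand-in data: 1 = check-worthy, 0 = not check-worthy
texts = ["the city spent 40 million on this project", "nice weather today",
         "the new drug cures 90 percent of patients", "good morning everyone"]
labels = [1, 0, 1, 0]

for name, clf in models.items():
    pipe = make_pipeline(CountVectorizer(), clf)
    pipe.fit(texts, labels)
    print(name, pipe.predict(["the moon landing was faked"]))
```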
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Hyperparameters for transformer-based models, where LR, WD, WS, and EP stand for learning rate, weight decay, warmup steps, and number of epochs, respectively.</figDesc><table><row><cell>Model</cell><cell>LR</cell><cell>WD</cell><cell>WS</cell><cell>EP</cell></row><row><cell>AraBERTv2</cell><cell>3e-5</cell><cell>0.01</cell><cell>500</cell><cell>3</cell></row><row><cell>Dehate-BERT</cell><cell>4e-5</cell><cell>0.01</cell><cell>500</cell><cell>2</cell></row><row><cell>BERT-Base-Uncased</cell><cell>3e-5</cell><cell>0.01</cell><cell>500</cell><cell>2</cell></row><row><cell>RoBERTa</cell><cell>3e-5</cell><cell>0.01</cell><cell>500</cell><cell>2</cell></row><row><cell>Dutch-RoBERTa</cell><cell>3e-5</cell><cell>0.01</cell><cell>500</cell><cell>3</cell></row><row><cell>Dutch-BERT</cell><cell>5e-5</cell><cell>0.01</cell><cell>500</cell><cell>5</cell></row></table></figure>
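The Table 4 values map onto the standard Hugging Face TrainingArguments used when fine-tuning with the Trainer API; below is a configuration sketch for the RoBERTa row (the output directory name is an arbitrary choice, and the batch size is not reported in the paper).

```python
from transformers import TrainingArguments

# RoBERTa row of Table 4: LR 3e-5, WD 0.01, WS 500, EP 2
args = TrainingArguments(
    output_dir="checkworthy-roberta",  # arbitrary name
    learning_rate=3e-5,
    weight_decay=0.01,
    warmup_steps=500,
    num_train_epochs=2,
)
# args is then passed to transformers.Trainer together with the model,
# tokenized training set, and dev set.
```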
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Performance of the employed models on the test set.</figDesc><table><row><cell>Language</cell><cell>Method</cell><cell>Classifier</cell><cell>Pr(%)</cell><cell>Re(%)</cell><cell>Ac(%)</cell><cell>F1(%)</cell></row><row><cell>English</cell><cell>ML Models</cell><cell>LR</cell><cell>75.00</cell><cell>37.50</cell><cell>80.65</cell><cell>50.00</cell></row><row><cell>English</cell><cell>ML Models</cell><cell>SVM</cell><cell>68.97</cell><cell>45.45</cell><cell>80.65</cell><cell>54.79</cell></row><row><cell>English</cell><cell>ML Models</cell><cell>MNB</cell><cell>64.81</cell><cell>39.77</cell><cell>78.88</cell><cell>49.29</cell></row><row><cell>English</cell><cell>DL Models</cell><cell>CNN+LSTM</cell><cell>74.47</cell><cell>39.77</cell><cell>80.94</cell><cell>51.85</cell></row><row><cell>English</cell><cell>DL Models</cell><cell>CNN+BiLSTM</cell><cell>66.10</cell><cell>44.32</cell><cell>79.77</cell><cell>53.06</cell></row><row><cell>English</cell><cell>Transformers</cell><cell>BERT-Base-Uncased</cell><cell>84.85</cell><cell>63.64</cell><cell>87.68</cell><cell>72.73</cell></row><row><cell>English</cell><cell>Transformers</cell><cell>SetFit</cell><cell>52.34</cell><cell>63.64</cell><cell>75.66</cell><cell>57.44</cell></row><row><cell>English</cell><cell>Transformers</cell><cell>RoBERTa</cell><cell>89.23</cell><cell>65.91</cell><cell>89.15</cell><cell>75.82</cell></row><row><cell>Arabic</cell><cell>ML Models</cell><cell>LR</cell><cell>38.52</cell><cell>21.55</cell><cell>59.67</cell><cell>27.65</cell></row><row><cell>Arabic</cell><cell>ML Models</cell><cell>SVM</cell><cell>40.57</cell><cell>32.57</cell><cell>58.85</cell><cell>36.13</cell></row><row><cell>Arabic</cell><cell>ML Models</cell><cell>MNB</cell><cell>36.00</cell><cell>24.77</cell><cell>57.38</cell><cell>29.35</cell></row><row><cell>Arabic</cell><cell>DL Models</cell><cell>CNN+LSTM</cell><cell>33.16</cell><cell>28.44</cell><cell>53.93</cell><cell>30.62</cell></row><row><cell>Arabic</cell><cell>DL Models</cell><cell>CNN+BiLSTM</cell><cell>34.27</cell><cell>27.98</cell><cell>55.08</cell><cell>30.81</cell></row><row><cell>Arabic</cell><cell>Transformers</cell><cell>AraBERTV2</cell><cell>40.40</cell><cell>55.05</cell><cell>54.92</cell><cell>46.60</cell></row><row><cell>Arabic</cell><cell>Transformers</cell><cell>SetFit</cell><cell>37.75</cell><cell>69.26</cell><cell>48.20</cell><cell>48.86</cell></row><row><cell>Arabic</cell><cell>Transformers</cell><cell>Dehate-BERT</cell><cell>40.24</cell><cell>75.69</cell><cell>51.15</cell><cell>52.55</cell></row><row><cell>Dutch</cell><cell>ML Models</cell><cell>LR</cell><cell>50.98</cell><cell>32.75</cell><cell>60.80</cell><cell>39.88</cell></row><row><cell>Dutch</cell><cell>ML Models</cell><cell>SVM</cell><cell>43.86</cell><cell>37.78</cell><cell>56.10</cell><cell>40.60</cell></row><row><cell>Dutch</cell><cell>ML Models</cell><cell>MNB</cell><cell>46.91</cell><cell>9.57</cell><cell>59.80</cell><cell>15.89</cell></row><row><cell>Dutch</cell><cell>DL Models</cell><cell>CNN</cell><cell>33.33</cell><cell>0.25</cell><cell>60.20</cell><cell>0.50</cell></row><row><cell>Dutch</cell><cell>DL Models</cell><cell>CNN+BiLSTM</cell><cell>52.06</cell><cell>44.58</cell><cell>61.70</cell><cell>48.03</cell></row><row><cell>Dutch</cell><cell>Transformers</cell><cell>Dutch-RoBERTa</cell><cell>6.21</cell><cell>60.33</cell><cell>0.32</cell><cell>11.26</cell></row><row><cell>Dutch</cell><cell>Transformers</cell><cell>SetFit</cell><cell>45.31</cell><cell>58.44</cell><cell>55.50</cell><cell>51.05</cell></row><row><cell>Dutch</cell><cell>Transformers</cell><cell>Dutch-BERT</cell><cell>48.40</cell><cell>73.80</cell><cell>52.40</cell><cell>58.42</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Fake news on social media: the impact on society</title>
		<author>
			<persName><forename type="first">F</forename><surname>Olan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Jayawickrama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">O</forename><surname>Arakpogun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Suklan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Systems Frontiers</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="443" to="458" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Papotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">D S</forename><surname>Martino</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.07769</idno>
		<title level="m">Automated fact-checking for assisting human fact-checkers</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A survey on truth discovery</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGKDD Explorations Newsletter</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Fake news detection on social media: A data mining perspective</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sliva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGKDD Explorations Newsletter</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="22" to="36" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The science of fake news</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Lazer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Baum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Benkler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Berinsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Greenhill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Menczer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Metzger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Nyhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pennycook</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Rothschild</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">359</biblScope>
			<biblScope unit="page" from="1094" to="1096" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">The spread of true and false news online</title>
		<author>
			<persName><forename type="first">S</forename><surname>Vosoughi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Aral</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">359</biblScope>
			<biblScope unit="page" from="1146" to="1151" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A unified perspective for disinformation detection and truth discovery in social sensing: a survey</title>
		<author>
			<persName><forename type="first">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">S</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="33" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Scam detection in Twitter</title>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chandramouli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Subbalakshmi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data Mining for Service</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="133" to="150" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Identifying checkworthy cure claims on Twitter</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">D</forename><surname>Gollapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-K</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM Web Conference 2023</title>
				<meeting>the ACM Web Conference 2023</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="4015" to="4019" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Tathya: A multi-classifier system for detecting check-worthy statements in political debates</title>
		<author>
			<persName><forename type="first">A</forename><surname>Patwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Goldwasser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bagchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM on Conference on Information and Knowledge Management</title>
				<meeting>the 2017 ACM on Conference on Information and Knowledge Management</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2259" to="2262" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates</title>
		<author>
			<persName><forename type="first">S</forename><surname>Shaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hamdan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kutlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">S</forename><surname>Kartal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="369" to="392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rodrigues</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tran</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.05684</idno>
		<title level="m">Accenture at CheckThat! 2021: interesting claim identification and ranking with contextually sensitive lexical training data augmentation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Cheema</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hakimov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Míguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mubarak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Shahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaghouani</surname></persName>
		</author>
		<title level="m">Overview of the CLEF-2023 CheckThat! lab task 1 on check-worthiness in multimodal and multigenre content</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Working Notes of CLEF</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">T</forename><surname>Sadouk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sebbak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">E</forename><surname>Zekiri</surname></persName>
		</author>
		<title level="m">Es-vrai at CheckThat! 2023: Analyzing checkworthiness in multimodal and multigenre</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Detecting check-worthy claims in political debates, speeches, and interviews using audio data</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ivanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hardalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="12011" to="12015" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness</title>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Przybyła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Piskorski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mulhem</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Quénot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Di Nunzio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>García Seco De Herrera</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness</title>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Przybyła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lipani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Mcdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature Switzerland</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="449" to="458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m">Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Galuščáková</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>García Seco De Herrera</surname></persName>
		</editor>
		<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Weering</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaghouani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Struß</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ruggeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dimitrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Galassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pachov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Siegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wiegand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaghouani</surname></persName>
		</author>
		<title level="m">Overview of the CLEF-2024 CheckThat! lab task 2 on subjectivity in news articles</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Overview of the CLEF-2024 CheckThat! lab task 3 on persuasion techniques</title>
		<author>
			<persName><forename type="first">J</forename><surname>Piskorski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Stefanovitch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dimitrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jorge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pollak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ribin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Fijavž</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hasanain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Guimarães</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Pacheco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sartori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Silvano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V</forename><surname>Zwitter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Overview of the CLEF-2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities</title>
		<author>
			<persName><forename type="first">F</forename><surname>Haouari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<author>
			<persName><forename type="first">P</forename><surname>Przybyła</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shvets</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Sheang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Saggion</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Overview of the CLEF-2024 CheckThat! lab task 6 on robustness of credibility assessment with adversarial examples</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</title>
				<meeting>the 2014 conference on empirical methods in natural language processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno>CoRR abs/1810.04805</idno>
		<ptr target="http://arxiv.org/abs/1810.04805" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Efficient few-shot learning without prompts</title>
		<author>
			<persName><forename type="first">L</forename><surname>Tunstall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><forename type="middle">E S</forename><surname>Jo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Korat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wasserblat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pereg</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.11055</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>CoRR abs/1907.11692</idno>
		<ptr target="http://arxiv.org/abs/1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Antoun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Baly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajj</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2003.00104</idno>
		<title level="m">AraBERT: Transformer-based model for Arabic language understanding</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Aluru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mathew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mukherjee</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.06465</idno>
		<title level="m">Deep learning models for multilingual hate speech detection</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">RobBERT: a Dutch RoBERTa-based Language Model</title>
		<author>
			<persName><forename type="first">P</forename><surname>Delobelle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Winters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Berendt</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.findings-emnlp.292</idno>
		<ptr target="https://www.aclweb.org/anthology/2020.findings-emnlp.292" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020</title>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3255" to="3265" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
