
SemanticCuetSync at CheckThat! 2024: Pre-trained Transformer-based Approach to Detect Check-Worthy Tweets

Symom Hossain Shohan

Md. Sajjad Hossain

Ashraful Islam Paran

Jawad Hossain

Shawly Ahsan

Mohammed Moshiul Hoque

moshiul_240@cuet.ac.bd

Chittagong University of Engineering and Technology, Chattogram-4349, Bangladesh

2024

This paper presents an approach for classifying English, Arabic, and Dutch texts as checkworthy using BERT-based models. The study explores ten baseline models, including LR, MNB, SVM, CNN+LSTM, CNN+BiLSTM, BERT-Base-Uncased, RoBERTa, AraBERTv2, Dutch-RoBERTa, and Dutch-BERT, to address the shared task. The study also investigates SetFit, a few-shot learning framework, to identify checkworthy tweets or texts. Evaluation results show that transformer-based models outperform the other approaches, with RoBERTa achieving the highest F1 score of 75.82% for English tweets, Dehate-BERT scoring 52.55% for Arabic texts, and Dutch-BERT obtaining a maximum score of 58.42% for Dutch texts. Our team ranked 6th overall for English, 5th for Arabic, and 16th for Dutch in the shared task challenge.

Keywords: Natural Language Processing, Check-Worthiness, Fact-checking, Tweet-Verification, Transformers

1. Introduction

Checkworthy content refers to information that must be confirmed for accuracy, as it may have the potential to shape the opinions and decisions of others. The rise of social networks has led to an exponential growth in textual data on the internet, sometimes resulting in the spread of false claims that can be detrimental to society if left unaddressed. These claims can include political, religious, and health-related misinformation, which can cause discord in society. Fact-checking is a time-consuming task that requires extensive research, identification, verification, and expert analysis. Automating this entire process is a significant challenge, and the first step towards this goal is to determine whether the information is worth checking in the first place.

With the proliferation of communication and social media platforms, such as Facebook, Twitter, and Reddit, the dissemination of false information has become increasingly prevalent. A recent study has suggested that people struggle to differentiate facts from false news [1]. Intelligent technologies can be used to support human fact-checkers in identifying claims worth fact-checking [2]. Many studies have been devoted to developing a fully automated system for fact-checking [3], [4], [5], [6], [7]. As social media data continues to expand daily, it is impractical for human experts to monitor everything efficiently. Developing an automatic system is therefore a practical way to address this problem. This work proposes a solution to classify English, Arabic, and Dutch texts or tweets as checkworthy using BERT-based approaches. The critical contributions of this study are:

• Introducing a fine-tuned transformer-based model to classify checkworthy texts for three languages (English, Arabic, and Dutch).

• Exploring various machine learning (LR, SVM, and MNB), deep learning (CNN, CNN + LSTM, and CNN + BiLSTM), and transformer-based models to find a suitable method for detecting checkworthy texts in multiple languages.

2. Related Work

Inaccurate news spreads quickly throughout social media, so checking the authenticity of posts that surface there becomes crucial. Intelligent fact-checking systems have emerged as a significant area of research to tackle this problem. Trustworthiness detection has been studied in several domains, such as digital scams [8], the healthcare sector [9], politics [10], and many more fields. An overview of Task 1 in the fourth edition of the CheckThat! Lab was provided by Shaar et al. [11]. The task was to predict which tweets involving politics and COVID-19 needed to be verified. Williams et al. [12] presented a transformer-based solution with data augmentation for this problem, which achieved an mAP (mean average precision) of 0.66 in the Arabic language. Checkworthiness in multimodal content [13] is another active research area. Beyond unimodal approaches, Sadouk et al. [14] proposed a multimodal transformer-based model (BERT+ResNet50) to identify checkworthiness in English, which recorded an F1 score of 0.71, and a transformer-based model (MarBERT) with downsampling, which recorded an F1 score of 0.61 in Arabic on an image dataset. Meanwhile, Ivanov et al. [15] proposed audio datasets from past political debates and ensemble techniques for detecting checkworthiness. Their audio model (wav2vec2.0) achieved an mAP of 0.34 when extra noise was eliminated. Ensembles using BERT and an audio model outperformed BERT alone, with an mAP of 0.38.

This work addresses a gap in the existing literature by comprehensively comparing machine learning (ML), deep learning (DL), and transformer-based solutions. In addition, it investigates the use of few-shot models like SetFit for determining checkworthiness in Dutch, Arabic, and English. This study improves the understanding of the various models’ performance in these distinct languages.

3. Dataset and Task Description

The dataset consists of tweets or texts in English, Arabic, and Dutch, along with their corresponding labels (‘Yes’ for texts worth checking, ‘No’ otherwise). Table 1 shows the distribution of train, dev, dev-test, and test sets. We trained all models using the training set and evaluated their performance on the test set. The CLEF 2024 CheckThat! Lab [16, 17, 18] consists of six tasks [19, 20, 21, 22, 23]. We participated in Task 1 of this shared task. Task 1 [19] focuses on assessing whether a claim in a tweet or transcription requires further investigation for fact-checking. The traditional approach for such decisions involves human experts, either professional fact-checkers or annotators, who evaluate the claim based on various criteria. Table 2 illustrates an example of training data for the different languages.

4. System Overview

This work explored various ML, DL, and transformer-based approaches across all three languages. ML-based techniques used include logistic regression (LR), support vector machine (SVM), and multinomial naive Bayes (MNB). DL-based techniques involve CNN, CNN+long short-term memory (LSTM), and CNN+bidirectional LSTM (BiLSTM). Lastly, various BERT-based transformers are fine-tuned for each language for the given task. Figure 1 illustrates the schematic process of checkworthy text detection.

Textual Feature Extraction: Textual feature extraction is one of the essential steps in natural language processing, which involves transforming raw textual data into numerical representations. This numerical representation aids the models in understanding and processing textual data. A Count Vectorizer is used in the ML models examined in this work. It is a widely used technique for textual feature extraction that transforms text data into a matrix of token counts. In DL models, tokenization and padding combine to convert raw texts into structured numerical data. These numerical representations are then passed through an embedding layer, which captures more advanced features such as semantic relationships. This study uses the embedding layer instead of Word2Vec [24] or GloVe [25] to allow the model to learn task-specific embedding during training. Finally, BERT-based tokenizers are employed for transformer-based models to exploit the BERT architecture.
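The two feature pipelines described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: whitespace tokenization, the sample texts, and the helper names (build_vocab, count_vectorize, to_padded_ids) are all hypothetical, and library vectorizers and padding utilities typically apply different tokenization rules and padding defaults.

```python
from collections import Counter

def build_vocab(texts):
    # Map each distinct lowercase token to a column index, in sorted order.
    tokens = sorted({tok for t in texts for tok in t.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def count_vectorize(texts, vocab):
    # One row per text, one column per vocabulary token (bag-of-words counts).
    rows = []
    for t in texts:
        counts = Counter(t.lower().split())
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

def to_padded_ids(texts, vocab, max_len):
    # Id 0 is reserved for padding, so token ids start at 1.
    # Sequences are truncated to max_len and padded at the end (an assumption).
    seqs = []
    for t in texts:
        ids = [vocab[tok] + 1 for tok in t.lower().split() if tok in vocab]
        ids = ids[:max_len]
        seqs.append(ids + [0] * (max_len - len(ids)))
    return seqs

# Hypothetical sample texts, for illustration only.
texts = ["vaccines cause harm", "taxes rose taxes fell"]
vocab = build_vocab(texts)
print(count_vectorize(texts, vocab))   # token-count matrix for the ML models
print(to_padded_ids(texts, vocab, 5))  # fixed-length id sequences for the embedding layer
```

The count matrix is what a classifier such as LR or MNB would consume, while the padded id sequences are the input an embedding layer expects.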

ML Models: Various ML models are examined in this work, such as LR, SVM, MNB, KNN, and RF. All the hyperparameter settings for these models are illustrated in Table 3.

CNN: This work employed a CNN model comprising an embedding layer with an output dimension of 200. The model features two Conv1D layers with 64 and 128 filters, respectively. Both layers use a kernel size of 2 and ReLU activation. For downsampling, the model incorporates a GlobalMaxPooling1D layer. Subsequently, a dense layer with 128 units and ReLU activation is followed by a dropout layer with a rate of 0.5 to prevent overfitting. The output layer has a single unit with sigmoid activation. The model uses the ‘binary_crossentropy’ loss function and the ‘Nadam’ optimizer and was trained with a batch size of 32 for three epochs.
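To make the layer sizes above concrete, the following sketch tallies the trainable parameters implied by the description. It assumes the layers are stacked sequentially in the order listed and that every convolutional and dense layer has a bias term; the vocabulary size is not stated in the text, so the value of 10,000 used below is purely hypothetical.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # Each filter holds kernel_size * in_channels weights plus one bias.
    return (kernel_size * in_channels + 1) * filters

def dense_params(in_features, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return (in_features + 1) * units

def cnn_param_count(vocab_size, embed_dim=200):
    counts = {
        "embedding": vocab_size * embed_dim,
        "conv1": conv1d_params(embed_dim, 64, 2),
        "conv2": conv1d_params(64, 128, 2),
        # Global max pooling has no parameters and yields one value per filter (128).
        "dense": dense_params(128, 128),
        # Dropout has no parameters.
        "output": dense_params(128, 1),
    }
    counts["total"] = sum(counts.values())
    return counts

# vocab_size=10000 is an assumed value for illustration.
for layer, n in cnn_param_count(vocab_size=10000).items():
    print(f"{layer:>9}: {n:,}")
```

Under these assumptions the embedding table dominates the parameter budget, while the convolutional and dense layers together contribute fewer than 60,000 parameters.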

CNN+LSTM: The CNN+LSTM model used in this work has almost the same architecture as the CNN model, incorporating a single LSTM layer comprising 64 units and a dropout rate of 0.2 for sequence modeling. Furthermore, the dense layer included in this design features 64 units and utilizes the ReLU activation function. The remaining hyperparameter configurations are consistent with those employed in the CNN model.

CNN+BiLSTM: This model has an architecture similar to the CNN+LSTM model but replaces LSTM with a Bidirectional LSTM.

Transformer models for English: This study fine-tuned three transformer-based models for the specified task on the English dataset. The models employed were BERT-Base-Uncased [26], SetFit [27], and RoBERTa [28]. Text preprocessing was applied before feeding the data into the transformers. The preprocessing steps include lowercasing, emoji removal, stop word removal, stemming, contraction expansion, simple Unicode spelling correction, and HTML tag removal. For stop word removal, the NLTK stopwords list is used. The main aim of these steps was to reduce the noise in the dataset and focus on meaningful words. BERT-Base-Uncased is a pre-trained transformer model that performs well across various natural language processing (NLP) tasks, and it demonstrated satisfactory performance on the specified task. SetFit, on the other hand, leverages pre-trained transformers with limited labeled data. We explored the potential of this few-shot learning framework for the given task. In contrast to prompt-based LLM approaches, SetFit does not require manual prompts for classification. Finally, RoBERTa, an optimized version of BERT, is used here for the specified task and outperforms the other models.
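A subset of the English preprocessing steps can be sketched as follows. This is an illustrative approximation rather than the pipeline used in the study: the stop word set and contraction table are tiny hypothetical stand-ins (the paper uses the NLTK stop word list), the emoji pattern covers only some common Unicode ranges, and stemming and spelling correction are omitted.

```python
import re

# Hypothetical stand-ins for illustration; the study uses the NLTK stop word list.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

TAG_RE = re.compile(r"<[^>]+>")
# Covers common emoji and symbol ranges only (an assumption, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(text):
    text = TAG_RE.sub(" ", text).lower()          # HTML tag removal, lowercasing
    text = EMOJI_RE.sub(" ", text)                # emoji removal
    for short, full in CONTRACTIONS.items():      # contraction expansion
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z0-9']+", text)      # simple tokenization
    return " ".join(t for t in tokens if t not in STOP_WORDS)  # stop word removal

print(preprocess("<b>It's</b> the BEST deal of 2024, can't miss!"))
```

The order of steps matters here: contractions are expanded before stop word removal so that the expanded words are filtered consistently.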

Transformer models for Arabic: This study also fine-tuned three transformer-based models on the Arabic dataset: AraBERTv2 [29], SetFit (few-shot) [27], and Dehate-BERT [30]. As with the English dataset, several text preprocessing steps were performed: lowercasing, emoji removal, stop word removal, stemming, contraction expansion, simple spelling correction using Unicode, HTML tag removal, punctuation removal, URL removal, whitespace removal, and number removal. Stemming was performed using ArabicLightStemmer. Finally, normalization was used to convert similar characters to a standard form.

AraBERTv2 is the improved version of AraBERT, which leverages the BERT architecture. This model was trained on a sizeable Arabic dataset and has demonstrated effectiveness in various downstream NLP tasks, including sentiment analysis, NER, and Arabic question answering. Dehate-BERT is a pre-trained transformer model primarily designed for hate speech detection, and it outperformed all other models in Arabic in this specific task.

Transformer models for Dutch: This study investigated Dutch-RoBERTa [31], SetFit, and Dutch-BERT on the Dutch dataset. Rather than undertaking extensive text preprocessing, this work limited its processing to removing non-Dutch characters from the texts. Table 4 lists the hyperparameters of the transformer-based models.

5. Results and Analysis

demonstrated the best performance with a precision of 52.06%, recall of 44.58%, and F1-score of 48.03%. In the transformer category, Dutch-BERT outperformed others with a precision of 48.40%, recall of 73.80%, and the highest F1-score of 58.42%.

The SetFit model exhibited a precision of 45.31%, recall of 58.44%, and an F1-score of 51.05%. It shows competitive results compared to transformer-based models such as Dutch-BERT, indicating its possible usefulness in Dutch language classification tasks.
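The F1-scores quoted in this section follow from the corresponding precision and recall as their harmonic mean. The short sketch below recomputes the first set of figures given above (precision 52.06%, recall 44.58%); the function name f1_score is chosen here for illustration only.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall in percent, taken from the text above.
print(round(f1_score(52.06, 44.58), 2))  # 48.03
```

Applying the same formula to the Dutch-BERT and SetFit figures gives values within about 0.05 percentage points of the reported F1-scores.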

In general, transformer-based models surpass both ML and DL models across the three languages, demonstrating the effectiveness of pre-trained language models for this task. Additionally, within each category, specific models show superior performance, emphasizing the necessity of choosing the appropriate model based on the specific task and language.

5.1. Error Analysis

A comprehensive quantitative and qualitative error analysis is conducted to provide detailed insights into the proposed model’s performance.

Quantitative Analysis

Figure 2 illustrates the confusion matrix of the best-performing models across English, Arabic, and Dutch.

From a total of 341 test cases in English, RoBERTa demonstrates strong performance in identifying the positive class, with 247 True Positives and only 6 False Positives. This indicates a high precision, meaning the model is highly accurate when it predicts “Yes.” Additionally, with 57 True Negatives, it correctly identifies many negative instances. However, there are 31 False Negatives, which indicates that some positive instances are being missed. RoBERTa shows a balanced approach with notable proficiency in minimizing incorrect positive predictions, resulting in a high F1 score of 75.82% over positive samples.

From a total of 610 test cases in Arabic, Dehate-BERT shows a different pattern in its confusion matrix. With 215 True Positives and 120 True Negatives, the model accurately identifies many instances from both classes. However, the model has many False Positives (177) and False Negatives (98). This suggests that while Dehate-BERT can identify positive instances, it incorrectly classifies many negative instances as positive, leading to a lower precision. Additionally, the relatively high count of False Negatives indicates room for improvement in recall, highlighting the need for better distinction between the two classes.

On the 1,000 test cases in Dutch, Dutch-BERT performs moderately, with 446 True Positives and 214 True Negatives. The model, however, has 157 False Positives and 183 False Negatives. This indicates that while Dutch-BERT can identify positive instances reasonably well, it struggles with both precision and recall. The high number of False Positives suggests a tendency to overpredict the positive class, and the significant count of False Negatives shows it also misses many positive instances. Consequently, Dutch-BERT’s overall performance is balanced but shows substantial room for improvement in minimizing misclassifications to enhance its F1 score.

Qualitative Analysis

It is clear that the models accurately predicted the labels for examples 2, 3, and 5 but made errors with examples 1 and 4. For the first example, the sentence’s intent is ambiguous, leading to an incorrect label prediction by the model. In the case of example 4, although the sentence is checkworthy, the model mislabeled it due to inadequate training data in the Dutch language, which hindered proper learning.

6. Conclusion

This work investigated various ML, DL, and transformer-based models for identifying checkworthy tweets or texts in English, Arabic, and Dutch. The results indicate that transformer-based models perform best in this task and are effective at detecting checkworthy text. Specifically, RoBERTa excels in English, Dehate-BERT in Arabic, and Dutch-BERT in Dutch, achieving the highest F1 scores of 75.82%, 52.55%, and 58.42%, respectively. The study recommends that further advancements be made by increasing the training data and incorporating advanced LLMs and GPT models.

References

[1] Olan, Jayawickrama, E. O. Arakpogun, Suklan, S. Liu, Fake news on social media: the impact on society, Information Systems Frontiers 26 (2024) 443–458.

[2] Nakov, Corney, Hasanain, Alam, Elsayed, Barrón-Cedeño, Papotti, Shaar, G. D. S. Martino, Automated fact-checking for assisting human fact-checkers, arXiv preprint arXiv:2103.07769 (2021).

[3] Li, Gao, Meng, Li, Su, Zhao, Fan, J. Han, A survey on truth discovery, ACM SIGKDD Explorations Newsletter 17 (2016) 1–16.

[4] Shu, Sliva, Wang, Tang, H. Liu, Fake news detection on social media: A data mining perspective, ACM SIGKDD Explorations Newsletter 19 (2017) 22–36.

[5] D. M. Lazer, M. A. Baum, Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild, et al., The science of fake news, Science 359 (2018) 1094–1096.

[6] Vosoughi, Roy, S. Aral, The spread of true and false news online, Science 359 (2018) 1146–1151.

[7] Xu, V. S. Sheng, Wang, A unified perspective for disinformation detection and truth discovery in social sensing: a survey, ACM Computing Surveys (CSUR) 55 (2021) 1–33.

[8] Chen, Chandramouli, K. P. Subbalakshmi, Scam detection in twitter, in: Data Mining for Service, Springer, 2014, pp. 133–150.

[9] S. D. Gollapalli, Du, S.-K. Ng, Identifying checkworthy cure claims on twitter, in: Proceedings of the ACM Web Conference 2023, 2023, pp. 4015–4019.

[10] Patwari, Goldwasser, Bagchi, Tathya: A multi-classifier system for detecting check-worthy statements in political debates, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 2259–2262.

[11] Shaar, Hasanain, Hamdan, Z. S. Ali, Haouari, Nikolov, Kutlu, Y. S. Kartal, Alam, G. Da San Martino, et al., Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, in: CLEF (Working Notes), 2021, pp. 369–392.

[12] Williams, Rodrigues, Tran, Accenture at CheckThat! 2021: interesting claim identification and ranking with contextually sensitive lexical training data augmentation, arXiv preprint arXiv:2107.05684 (2021).

[13] Alam, Barrón-Cedeño, G. S. Cheema, Hakimov, Hasanain, Li, Míguez, Mubarak, G. K. Shahi, Zaghouani, et al., Overview of the CLEF-2023 CheckThat! lab task 1 on checkworthiness in multimodal and multigenre content, Working Notes of CLEF (2023).

[14] H. T. Sadouk, Sebbak, H. E. Zekiri, Es-vrai at CheckThat! 2023: Analyzing checkworthiness in multimodal and multigenre (2023).

[15] Ivanov, I. Koychev, Hardalov, Nakov, Detecting check-worthy claims in political debates, speeches, and interviews using audio data, in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 12011–12015.

[16] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.

[17] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.

[18] G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024.

[19] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov, F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content, in: [18], 2024.

[20] J. M. Struß, F. Ruggeri, A. Barrón-Cedeño, F. Alam, D. Dimitrov, A. Galassi, G. Pachov, I. Koychev, P. Nakov, M. Siegel, M. Wiegand, M. Hasanain, R. Suwaileh, W. Zaghouani, Overview of the CLEF-2024 CheckThat! lab task 2 on subjectivity in news articles, in: [18], 2024.

[21] J. Piskorski, N. Stefanovitch, F. Alam, R. Campos, D. Dimitrov, A. Jorge, S. Pollak, N. Ribin, Z. Fijavž, M. Hasanain, N. Guimarães, A. F. Pacheco, E. Sartori, P. Silvano, A. V. Zwitter, I. Koychev, N. Yu, P. Nakov, G. Da San Martino, Overview of the CLEF-2024 CheckThat! lab task 3 on persuasion techniques, in: [18], 2024.

[22] F. Haouari, T. Elsayed, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab Task 5 on Rumor Verification using Evidence from Authorities, in: [18], 2024.

[23] P. Przybyła, B. Wu, A. Shvets, Y. Mu, K. C. Sheang, X. Song, H. Saggion, Overview of the CLEF-2024 CheckThat! lab task 6 on robustness of credibility assessment with adversarial examples (incrediblae), in: [18], 2024.

[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).

[25] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

[26] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.

[27] L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, O. Pereg, Efficient few-shot learning without prompts, arXiv preprint arXiv:2209.11055 (2022).

[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.

[29] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language understanding, arXiv preprint arXiv:2003.00104 (2020).

[30] S. S. Aluru, B. Mathew, P. Saha, A. Mukherjee, Deep learning models for multilingual hate speech detection, arXiv preprint arXiv:2004.06465 (2020).

[31] P. Delobelle, T. Winters, B. Berendt, RobBERT: a Dutch RoBERTa-based Language Model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 3255–3265. URL: https://www.aclweb.org/anthology/2020.findings-emnlp.292. doi: 10.18653/v1/2020.findings-emnlp.292.