Evaluating TF-IDF and Transformers-based Models for Detecting COVID-19 related Conspiracies

Rohullah Akbari

Simula Research Laboratory, Norway

The proliferation of misinformation and conspiracy theories on online social media platforms has become a significant concern for public health and safety. To effectively combat this issue, a new generation of data mining and analysis algorithms is essential for early detection and tracking of these information cascades. In this paper, we employed a multifaceted approach for detecting and identifying conspiracy theories and misinformation spreaders related to the Coronavirus pandemic. Specifically, we utilized Text-Based Detection (Task 1) through a combination of TF-IDF-based and Transformers-based methods, Graph-Based Detection (Task 2) through a graph convolutional network, and alternative Transformers-based methods to improve the results of Task 1. Our best models achieved MCC scores of 0.705 for Task 1, 0.041 for Task 2, and 0.698 for Task 3.

1. Introduction

2. Text-Based Misinformation and Conspiracies Detection

2.1. The TF-IDF approach

In this section, we create nine distinct TF-IDF models, one for each of the nine categories. We are interested to see whether the TF-IDF technique can outperform the CT-BERT model, and if not, how close it can come. This approach is based on the TfidfVectorizer and the Stochastic Gradient Descent classifier (SGDClassifier) from the scikit-learn framework [8]. SGD is a simple but very efficient approach to fitting linear classifiers such as linear Support Vector Machines (SVM). SGD does not belong to any particular family of machine learning models; it is only an optimization technique. Often, an instance of SGDClassifier has an equivalent estimator in the scikit-learn API, potentially using a different optimization technique. For example, logistic regression is produced when SGDClassifier(loss='log_loss') is used. The TF-IDF approaches in previous works have been executed only with unigrams [7]. This loses information, since bigrams and trigrams can carry important signals. We can see in Table 2 that N-grams such as "bill gate" and "new world order" could be very important for the classification of the conspiracies. Based on this, we have chosen to implement the TF-IDF with various N-gram ranges, including unigrams, bigrams, trigrams, and other ranges. In addition, we have chosen to implement the SGD with different loss functions and penalties (see Table 1 for the parameters).
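The setup described above can be sketched roughly as follows. This is a minimal illustration rather than the configuration actually used: the helper name build_tfidf_sgd, the toy tweets, and the chosen parameter values are assumptions, and the real search space is the one listed in Table 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def build_tfidf_sgd(ngram_range=(1, 4), loss="hinge", penalty="l2", seed=0):
    """TF-IDF features followed by a linear classifier trained with SGD.

    Hypothetical helper: loss="hinge" gives a linear SVM; other losses and
    penalties can be swapped in through the arguments.
    """
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=ngram_range, lowercase=True)),
        ("sgd", SGDClassifier(loss=loss, penalty=penalty, random_state=seed)),
    ])


# Made-up tweets and binary labels for a single category (illustration only).
texts = [
    "the new world order is coming for us",
    "they want a new world order after the pandemic",
    "vaccines were tested in several clinical trials",
    "wash your hands and keep your distance",
]
labels = [1, 1, 0, 0]

# One model per category; here unigrams up to trigrams are used as features.
model = build_tfidf_sgd(ngram_range=(1, 3))
model.fit(texts, labels)

vocabulary = model.named_steps["tfidf"].vocabulary_
predictions = model.predict(texts)
```

With an N-gram range that includes trigrams, a phrase such as "new world order" becomes a single feature in the vocabulary, which is what allows the linear classifier to weight it separately from its individual words.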

2.2. Transformers-based approaches

The first Transformers approach (One-for-All) is based on training one CT-BERT model for classifying all of the conspiracy categories at once (see Figure 1). The CT-BERT is fine-tuned with nine different weighted Cross Entropy loss functions. The weights are computed by taking the number of samples in a specific category and dividing it by the number of samples in each of the subcategories of that category. The optimizer used in this approach is AdamW [9]. Before feeding the text data into the model, we preprocessed it by converting the emojis into their textual meaning. Furthermore, the model was trained with 5-fold cross-validation and the model with the best test MCC score was chosen.

The One-for-One approach is based on training nine separate CT-BERT models for the nine categories (the approach is shown in Figure 2). In this approach, we do not use any weighted loss function. Other than that, we apply the same loss function, optimizer, and preprocessing method. The models were trained with stratified 5-fold cross-validation and the model with the best MCC score was chosen.
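The weighting scheme of the One-for-All approach can be illustrated with a small sketch. The function name class_weights, the three-class label layout, and the handling of empty classes are assumptions made for this example; only the ratio itself (samples in the category divided by samples in each subcategory) follows the description above.

```python
from collections import Counter


def class_weights(labels, num_classes):
    """Weight of class c = (samples in the category) / (samples labelled c).

    Assumption: a class without any samples receives weight 0.0 so that the
    division is always defined.
    """
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[c] if counts[c] else 0.0 for c in range(num_classes)]


# Made-up label distribution for one category with three subcategories.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = class_weights(labels, num_classes=3)

# Rare subcategories end up with larger weights than frequent ones.
for label, weight in enumerate(weights):
    print(f"class {label}: weight {weight:.3f}")
```

In a PyTorch training loop, a list like this could be converted to a tensor and passed as the weight argument of torch.nn.CrossEntropyLoss, with one such loss per category.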

3. Graph-Based Conspiracy Source Detection

For this task, we applied a simple node classification in which each node represents a user, labelled by whether they are a misinformation spreader or not. We created a network for each of the users that had a label. The network consisted of all of the other users that had an edge directed to the main user; users with low edge weights were removed. We chose to work with a graph convolutional network (GCN) [10]. The implementation was done using the GCNConv class from the torch_geometric library with PyTorch.
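To make the procedure concrete, the following NumPy sketch builds a small ego network, drops low-weight neighbours, and applies two graph convolutions using the propagation rule from [10]. It is not the torch_geometric implementation used here: the edge list, the threshold min_weight, the random features, and the layer sizes are all made up, and edges are treated as undirected for simplicity.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the propagation matrix of a GCN layer."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # node degrees incl. self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_hat, features, weight, activation=True):
    """One graph convolution: A_hat @ X @ W, optionally followed by ReLU."""
    out = a_hat @ features @ weight
    return np.maximum(out, 0.0) if activation else out


# Hypothetical ego network around user 0: (source, target, edge weight).
weighted_edges = [(1, 0, 5.0), (2, 0, 0.5), (3, 0, 3.0), (3, 1, 2.0)]
min_weight = 1.0  # assumed threshold; the value actually used is not stated

# Keep only sufficiently strong edges, then re-index the remaining users.
kept = [(s, d) for s, d, w in weighted_edges if w >= min_weight]
nodes = sorted({0} | {n for edge in kept for n in edge})
index = {n: i for i, n in enumerate(nodes)}

adj = np.zeros((len(nodes), len(nodes)))
for s, d in kept:
    adj[index[s], index[d]] = adj[index[d], index[s]] = 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=(len(nodes), 3))   # made-up node features
w1 = rng.normal(size=(3, 8))           # untrained weights, illustration only
w2 = rng.normal(size=(8, 2))           # two classes: spreader / non-spreader

a_hat = normalized_adjacency(adj)
hidden = gcn_layer(a_hat, x, w1)
logits = gcn_layer(a_hat, hidden, w2, activation=False)
```

In this toy example user 2 disappears from the network because its only edge falls below the threshold, which shows how aggressive filtering can discard neighbourhood information before the GCN ever sees it.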

4. Graph and Text-Based Conspiracy Detection

In this section, we examine whether we can improve the results from Section 2 by combining the data from Section 2 and Section 3. The output of the classifiers is enriched by combining text with numerical features. We propose an approach that consists of training the CT-BERT with the text data and concatenating the last layer of the CT-BERT with user information such as verified_account, description_length, num_favourites, num_followers, num_statuses, num_friends and location_country. The concatenated vector is then passed through a multilayer perceptron (MLP) and into an output layer (see Figure 3). Our second approach is based on extending the text data with the tweeters' statistics and then feeding it into the One-for-All approach (Section 2.2). The numerical features inserted into the text are separated with the [SEP] token, for example:

Tweet_text [SEP] 0 [SEP] 159 [SEP] 2812 [SEP] 566 [SEP] 1426 [SEP] 1041 [SEP] 3
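A string of this form could be produced with a small helper like the one below. The function name append_user_features and the dictionary layout are hypothetical, and the feature order is assumed to follow the order in which the features are listed above; the integer used for location_country is likewise assumed to be a categorical identifier.

```python
# Assumed feature order, following the listing in the text.
FEATURE_ORDER = [
    "verified_account",
    "description_length",
    "num_favourites",
    "num_followers",
    "num_statuses",
    "num_friends",
    "location_country",
]


def append_user_features(tweet_text, user, sep="[SEP]"):
    """Append the user's numerical features to the tweet, separated by `sep`."""
    parts = [tweet_text] + [str(user[name]) for name in FEATURE_ORDER]
    return f" {sep} ".join(parts)


# Made-up user record matching the example values shown above.
user = {
    "verified_account": 0,
    "description_length": 159,
    "num_favourites": 2812,
    "num_followers": 566,
    "num_statuses": 1426,
    "num_friends": 1041,
    "location_country": 3,
}

augmented = append_user_features("Tweet_text", user)
print(augmented)
```

The resulting string is then tokenized like any other input text, so the model sees the numbers as ordinary tokens rather than as continuous values.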

5. Results

As expected, the TF-IDF approach obtained a lower MCC score than the Transformers-based approaches (see Table 3). The One-for-One approach achieved the best score of all submitted runs. The TF-IDF approach does quite well for some of the categories, especially for Population reduction and New World Order. Bigrams such as "population control" and "bill gate" are very important for Population reduction, and "world order" and "new world" clearly refer to the New World Order category (Table 2). Furthermore, we can see that N-gram ranges such as (2,3) and (2,4) did not do well and that the dominating range is (1,4) (Figure 4). Unigrams are therefore crucial for the classification of conspiracies, since the N-gram ranges without them performed poorly. We submitted only one run for Task 2, which resulted in an MCC score of 0.041 and clearly indicates that our implementation was not successful. The main reason for the poor performance could be that we removed all the neighbors of the main user node that had low edge values. The combination of CT-BERT with numerical

6. Discussion and Outlook

We successfully implemented three approaches for Task 1: one TF-IDF approach and two Transformers-based approaches. We experimented with different N-gram ranges and found that the N-gram range (1,4) was best suited for most of the categories. The best MCC score (0.705) was obtained with the One-for-One approach. We presented two approaches for improving the Task 1 results, but neither of them improved on the results from Task 1.

[1] Pogorelov, D. T. Schroeder, Brenner, Moe, Maulana, J. Langguth, Combining tweets and connections graph for fakenews detection at MediaEval 2022, in: Proceedings of MediaEval 2022 CEUR Workshop, 2022.

[2] M. S. Al-Rakhami, A. M. Al-Amri, Lies kill, facts save: Detecting covid-19 misinformation in twitter, IEEE Access 8 (2020) 155961-155970. doi:10.1109/ACCESS.2020.3019600.

[3] Wani, I. Joshi, Khandve, Wagh, Joshi, Evaluating deep learning approaches for covid19 fake news detection, in: Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer International Publishing, 2021, pp. 153-163. URL: https://doi.org/10.1007%2F978-3-030-73696-5_15. doi:10.1007/978-3-030-73696-5_15.

[4] Glazkova, Glazkov, T. Trifonov, g2tmn at constraint@AAAI2021: Exploiting CT-BERT and ensembling learning for COVID-19 fake news detection, in: Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer International Publishing, 2021, pp. 116-127. URL: https://doi.org/10.1007%2F978-3-030-73696-5_12. doi:10.1007/978-3-030-73696-5_12.

[5] Patwa, Sharma, Pykl, Guptha, G. Kumari, M. S. Akhtar, Ekbal, A. Das, T. Chakraborty, Fighting an infodemic: COVID-19 fake news dataset, in: Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer International Publishing, 2021, pp. 21-29. URL: https://doi.org/10.1007%2F978-3-030-73696-5_3. doi:10.1007/978-3-030-73696-5_3.

[6] G. K. Shahi, D. Nandini, FakeCovid - A Multilingual Cross-domain Fact Check News Dataset for COVID-19, ICWSM, 2020. URL: https://doi.org/10.36190/2020.14. doi:10.36190/2020.14.

[7] Peskine, G. Alfarano, Ismail, Papotti, Troncy, Detecting covid-19-related conspiracy theories in tweets (2021). URL: https://2021.multimediaeval.com/paper65.pdf.

[8] Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830.

[9] Loshchilov, Hutter, Decoupled weight decay regularization, 2017. URL: https://arxiv.org/abs/1711.05101. doi:10.48550/ARXIV.1711.05101.

[10] T. N. Kipf, Welling, Semi-supervised classification with graph convolutional networks, 2016. URL: https://arxiv.org/abs/1609.02907. doi:10.48550/ARXIV.1609.02907.