Detecting fake news using Twitter social information

Jesús M. Fraile-Hernández*, Álvaro Rodrigo and Roberto Centeno
NLP & IR Group at UNED (Spain)

Abstract
In this paper, the aim is to study whether social information can be useful when classifying news. For this purpose, a set of news items in Spanish has been extended with social information. Subsequently, a classifier model that mixes the extracted social information with the textual information of the news item has been proposed to carry out this task. Finally, we have studied which social features are the most relevant in this task.

Keywords
Social information, Classifying news, Classifier model, Social features, Fake news detection

1. Introduction
Due to the increase in communication channels in recent decades, users have access to an immense amount of information almost instantaneously. However, it is also relatively easy to fall for hoaxes or misinformation on social media. Traditional models of fake news detection focus on the linguistic characteristics of the news. Later, pre-trained embeddings were used along with LSTMs in [1]. Finally, with the emergence of contextual models, [2] leveraged the pre-trained BERT model to perform transfer learning and identify the veracity of news. However, since even humans find it difficult to discern between true and false news, the textual information of the news is sometimes not enough. At a theoretical level, [3] proposes a hybrid approach that combines the linguistic characteristics of the news with an analysis of the networks that form around it. In [4], the authors use different features to identify fake news in popular Twitter threads. In [5], fake news is detected using only the extracted textual information. Regarding hybrid models, the CSI model proposed in [6] performs a characterisation in three modules: capturing, scoring and integrating.
In [7], a news detection model is proposed that considers the association of user interactions, the editor's bias and the users' stance towards the news.

The aim of this work is to study whether social information can be useful for the detection of fake news. To this end, social information has been collected from Twitter to extend FakeDeS, a relevant corpus of news in Spanish, and a model has been designed that includes both textual and social information. Furthermore, we intend to study which social features are the most relevant for news classification.

The rest of this paper is structured as follows: Section 2 describes the datasets used and the task to be solved. Section 3 describes the methodology followed, including the extraction of social information from Twitter and the proposed models, organised by the data they use. Section 4 presents the evaluation metrics. Section 5 then presents the results, which are discussed in Section 6. Finally, conclusions and future work are given in Section 7.

Proceedings of the 1st Workshop on COuntering Disinformation with Artificial Intelligence (CODAI), co-located with the 27th European Conference on Artificial Intelligence (ECAI), pages 19–28, October 20, 2024, Santiago de Compostela, Spain. * Corresponding author. Email: jfraile@lsi.uned.es (J. M. Fraile-Hernández). ORCID: 0009-0001-5474-4844. © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Figure 1: Results of IberLEF 2021 on the test set.

2. Dataset and task
The dataset we work with is the Spanish Fake News Corpus (FakeDeS) [8], which contains publications in Spanish about different events, collected from November 2020 to March 2021. Each of these publications is labelled as true or false.
Newspaper websites and fact-checking websites were mainly used to collect the information. The dataset is divided into 3 files with a total of 1543 news items. Because of the methodology used, it has been decided to merge the training and development files into what we will call the training set. Each news item contains information such as the topic, the name of the source, the headline, the text and the link to the news item. The training set has a total of 971 news items, of which 480 are false and 491 are true. The test set consists of 572 news items, half of which are true and half false. We are therefore dealing with balanced data sets.

The topics covered in the training corpus are: politics, entertainment, sport, society, science, health, economy, security and education. It should be noted that the test set contains news related to Covid-19, while the training set does not present any news on this topic (the closest are the health news, but in no case do they mention Covid-19). Therefore, the proposed models will have to correctly classify this topic without having seen it during training.

In IberLEF 2021, a shared task was proposed whose objective was to classify a series of news items as true or false, using the FakeDeS corpus described above. An overview was published in [9], collecting the most important characteristics of the best-performing models. The results obtained by the different participants in this task can be seen in Figure 1. Among the approaches used, the participants of the GDUFS team, which achieved the best accuracy, used a BERT model and a sample memory with an attention mechanism. The method consisted of taking the first and last segments of the texts and feeding them into a BERT system, obtaining two embeddings (head and tail).
In addition, there is a matrix called 'sample memory', obtained by taking a random sample of the head and tail embeddings; this matrix is used in an attention mechanism with the rest of the texts. In contrast to the GDUFS_DM approach, the participants of team Haha, the second-placed team, employed feature selection with a weighted tf-idf and a multilayer perceptron. This model not only analysed the content of the news item, but also combined information such as its publisher or its topic.

Figure 2: Violin diagram of the number of tweets collected.

3. Methodology
This section describes the methodology used to extract social information from Twitter users. In addition, the models trained are presented, organised by the type of data they use.

3.1. Social information extraction
The main objective of this work is to study the information provided by social signals when detecting fake news and, as mentioned in Section 1, there is no corpus in Spanish that contains this information. This is why we decided to extract it from the social network Twitter, using the API provided by the platform. For each news item, we searched for those tweets that contained the headline of the news item or the link to it. To deal with the maximum length of the queries, special characters were removed from the news headlines.

According to [4] and [5], there is a series of tweet metadata that allows extracting information about whether a user may be prone to the propagation of fake news or a tweet may contain untruthful information. Therefore, it has been decided to extract the following metadata from each of the tweets.

• Tweet. Text of the tweet, id of the author, id of the tweet, number of retweets, number of replies to the tweet, number of likes, number of quotes of the tweet.
• User.
Username (str), user creation date (date, ISO 8601), verified user (bool), number of followers (int), number of followed (int), number of tweets (int), number of times listed (int).

We have managed to extract posts for 41.67% of the news items. The distribution of the number of tweets collected per news item shows a high concentration in the (0, 200) interval, which covers 86% of the news items. Within this interval, true news tends to receive more interaction. However, as the number of tweets about a news item increases, fake news receives a greater number of interactions. This trend can be seen in the violin diagram presented in Figure 2. It is worth noting that, although the news is written in Spanish, there are tweets in English or French that talk about it. This is especially true for news related to Covid-19.

3.2. Textual models
This section presents the textual methods used for the binary classification of the news items. The full text of each news item has been used, so it had to be preprocessed. For the non-contextual models, urls, emoticons, non-textual expressions and stopwords have been removed, the text has been converted to lowercase, and lemmatisation and stemming have been applied. For the contextual models, only the urls have been removed. Subsequently, 5 different approaches have been used:

1. Vector space model based on bags of words (BoW).
2. Vector space model using a weighted tf-idf.
3. Bigram counting.
4. Neural networks and deep learning.
5. Contextual models.

For approaches 1, 2 and 3, Naive Bayes, SVM, Logistic Regression, Decision Trees and Random Forest models have been trained.
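As an illustration of approaches 1–3, a weighted tf-idf representation combined with one of these classifiers can be sketched with scikit-learn. This is a minimal sketch with toy data; the texts, labels and hyperparameters shown here are placeholders, not the authors' actual implementation.

```python
# Sketch of approach 2 (weighted tf-idf) with a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),          # tf-idf weighted vector space
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Toy data standing in for the (preprocessed) news texts and labels.
texts = [
    "economia crece segun informe oficial",
    "vacuna milagrosa cura todo en un dia",
    "gobierno aprueba presupuesto anual",
    "famoso revela secreto contra el cancer",
]
labels = ["true", "fake", "true", "fake"]

pipeline.fit(texts, labels)
prediction = pipeline.predict(["informe sobre presupuesto"])
print(prediction[0])
```

Wrapping the vectoriser and classifier in a single `Pipeline` ensures that, during cross-validation, the tf-idf vocabulary is fitted only on the training folds.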
For approach 4, the following models have been trained: multilayer perceptrons taking the tf-idf weight vector as input; multilayer perceptrons and convolutional networks with a trainable embedding layer; and multilayer perceptrons, convolutional networks, LSTM, GRU and bidirectional networks with a pre-trained embedding layer. Finally, for approach 5, the BETO model has been selected: Spanish BERT [10] with a final classification layer of two neurons. This is a BERT model trained with the whole-word masking technique on a large corpus of more than three billion Spanish words.

3.3. Models with social information
The methods that use only the social information collected for the news use the following metadata for each published tweet: number of retweets, number of replies, number of likes, number of quotes, verified user, number of followers, number of followed, number of tweets of the author, and number of times the author has been listed. Then, in order to record the impact of the news item on social networks, the number of tweets collected for that news item is added.

To represent all the tweets that talk about a certain news item, the average of the previous characteristics over its tweets has been calculated, and the standard deviation of each characteristic has been added. In this way, a data matrix with 20 columns is obtained (where the column corresponding to the deviation of the number of tweets of the news item is always 0). Once the feature matrix has been obtained, different learning models have been trained with different hyperparameter explorations, such as Decision Trees, Random Forest, SVM, Gradient Boosting, Adaptive Boosting and MLP.

3.4. Hybrid model
A hybrid model has been developed that seeks to take advantage of both the textual information provided by the text of the news item and the social information extracted from the Twitter data (both the non-textual information of the previous section and the text of the tweets collected).
In this model, a specialised model is used to classify each news item using its social information. For this purpose, the best model from the previous subsection (Random Forest) is selected. With this model, for each news item, the probabilities of being true or false are extracted, using as input the corresponding row of the matrix of social characteristics with standard deviations described in that section. If no tweets could be extracted for a news item, the output is a vector of two zeros.

In parallel, the text of the news item is processed using the BETO: Spanish BERT model [11]. The output is a vector of dimension 768.

Also in parallel, for each news item with collected tweets, the text of each tweet is preprocessed (eliminating URLs and tokenising) and then processed using the pre-trained XLM-RoBERTa-base model [12]. This transformer model has been trained on a corpus of about 198 million tweets in 8 different languages (Spanish, Arabic, English, French, German, Hindi, Portuguese and Italian) and is specialised in sentiment classification (positive, negative or neutral). In our case, the last layer of the model is removed, obtaining as output a vector of length 768 that represents the most relevant features of the text of the tweet.

Figure 3: Workflow of the hybrid model.

The previous process has been carried out for each available tweet, obtaining a vector of length 768. Then, all the tweet vectors of the news item are averaged to obtain a single vector that represents its tweets. If the news item has no social information, a vector of zeros is used. Finally, the three vectors are joined to obtain a vector of dimensionality 1538.
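The assembly of the 1538-dimensional hybrid vector for one news item can be sketched as follows. The embeddings are replaced here by random placeholders; in the actual system they would come from BETO (news text) and XLM-RoBERTa-base (tweet texts).

```python
# Sketch: building the hybrid feature vector (2 + 768 + 768 = 1538 dimensions).
import numpy as np

rng = np.random.default_rng(0)

rf_probs = np.array([0.3, 0.7])           # P(fake), P(true) from the social Random Forest
beto_vec = rng.normal(size=768)           # news-text representation (placeholder for BETO)
tweet_vecs = rng.normal(size=(25, 768))   # one 768-d vector per collected tweet (placeholder)

tweet_avg = tweet_vecs.mean(axis=0)       # average over the news item's tweets
# News items without collected tweets would instead use zero vectors:
#   rf_probs = np.zeros(2); tweet_avg = np.zeros(768)

hybrid = np.concatenate([rf_probs, beto_vec, tweet_avg])
print(hybrid.shape)  # (1538,)
```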
This workflow can be seen in Figure 3. Once all the news items have been processed following the previous diagram, several models have been trained, such as Decision Trees, Random Forest, SVM, Gradient Boosting, Adaptive Boosting and MLP.

4. Evaluation
Two different methodologies have been used to evaluate the models: cross-validation and an evaluation on the test set.

4.1. k-fold cross-validation
Cross-validation is one of the most widely used methods to estimate the prediction error of a model with a given set of hyperparameters. This method divides the data set (in our case, the train set together with the development set) into k equal parts P_1, ..., P_k. For each P_n, the model is trained on the other k - 1 parts and the error in predicting the P_n data (data never seen by this model) is calculated. Doing this for the k parts yields a set of k errors, whose mean and variance provide a measure of the average error of that model with those hyperparameters.

It should be noted that this method has a fairly large computational cost, since a k-fold cross-validation requires training k models. As a general rule, a value of 5 or 10 is usually chosen as a good compromise between bias and variance. In our case, 5-fold cross-validation has been used.

Table 1: Cross-validation results of textual model training.
  Textual models    F1
  TF-IDF (RF)       0.849
  BoW (RF)          0.825
  Bigrams (RF)      0.822
  MLP (Embedding)   0.786
  MLP (TF-IDF)      0.751
  CNN               0.740
  BETO              0.727
  GRU               0.678

4.2. Test set evaluation
Finally, the model that performed best in the previous cross-validations is evaluated on the test set. This set is never seen by the model during training and provides an indication of the generalisability of the model.

4.3.
Evaluation metrics
To evaluate the performance of our classification models, we use the F1 metric. The F1 value is calculated for both the true and the false class. From these, the Macro-F1 (or simply F1) is calculated as the average of the two values.

5. Results
This section presents the results of the various trained models. For each approach in Section 3, the following results are shown:

• Within the training of a particular approach, the Macro-F1 value of the best algorithms used is shown, averaged over 5-fold cross-validation.
• For each approach, the model with the best Macro-F1 during training is selected. It is then retrained with all the data and evaluated on the test set, reporting F1_Fake, F1_True, Macro-F1 and the Accuracy of the model.

5.1. Textual models
The training results of the methods described in Section 3.2 are listed in Table 1. It can be seen that the non-neural models stand out over those using neural networks. This could be due to the fact that the neural models have a large number of parameters to optimise, while our data set is rather limited. It is worth noting that the use of pre-trained embeddings resulted in lower performance than training the embeddings from scratch. Also noteworthy is the poor performance of the recurrent networks, models that required a large amount of training time and are commonly used for language processing problems. The best performing approach was a weighted tf-idf together with a Random Forest model. The results of the evaluation of this model on the test set, together with those of the teams participating in IberLEF 2021, are shown in Table 4.
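The Macro-F1 metric reported throughout these results can be computed as the unweighted mean of the per-class F1 scores, as in this sketch (labels are toy values for illustration):

```python
# Macro-F1: unweighted mean of the F1 scores of the Fake and True classes.
from sklearn.metrics import f1_score

y_true = ["fake", "true", "true", "fake", "true"]
y_pred = ["fake", "true", "fake", "fake", "true"]

f1_fake = f1_score(y_true, y_pred, pos_label="fake")
f1_true = f1_score(y_true, y_pred, pos_label="true")
macro_f1 = f1_score(y_true, y_pred, average="macro")

# average="macro" is exactly the mean of the two per-class scores.
assert abs(macro_f1 - (f1_fake + f1_true) / 2) < 1e-12
print(macro_f1)  # 0.8
```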
Table 2: Cross-validation results of the social information models.
  Social information models           F1
  Random Forest                       0.845
  Gradient Boosting                   0.834
  Adaptive Boosting                   0.826
  Extremely Randomized Trees          0.817
  Decision Trees                      0.797
  K-Nearest Neighbours                0.788
  Multilayer Perceptron (MLP)         0.787
  SVM                                 0.785
  Passive-Aggressive Classifier       0.785
  Perceptron with two hidden layers   0.783
  Linear Discriminant Analysis (LDA)  0.781
  Multinomial Naive Bayes             0.781
  Perceptron with one hidden layer    0.781
  Bernoulli Naive Bayes               0.779
  Quadratic Discriminant Analysis     0.776
  Logistic Regression                 0.703

5.2. Social information models
The training results of the methods described in Section 3.3 are collected in Table 2. We can see that the F1 of the models is quite high. Tree-based models occupy the top 5 positions in the list, and ensembles of trees stand out over individual decision trees. The best performing approach was a Random Forest model. It should be remembered that this model has only been trained and evaluated on those news items for which social information could be extracted, so the training and test sets are smaller than in the rest of the cases. Given these results, the Random Forest classifier has been chosen for the social information in the hybrid model, as indicated in Section 3.4.

5.3. Hybrid model
The training results of the methods described in Section 3.4 are listed in Table 3. In view of these results, either of the first two models would be a valid choice, and the rest of the models obtain very similar scores. Logistic regression has been selected over decision trees since it is a simpler algorithm, with fewer hyperparameters and a lower computational cost. The results of the evaluation of this model on the test set, together with those of the teams participating in IberLEF 2021, are shown in Table 4.

6. Discussion
This section presents a discussion of the results obtained.
Table 3: Cross-validation results of the hybrid model.
  Hybrid model                          F1
  Decision Trees                        0.818
  Logistic Regression                   0.818
  SVM                                   0.809
  Linear Discriminant Analysis          0.809
  Random Forest                         0.809
  Gradient Boosting                     0.809
  Passive-Aggressive Classifier         0.809
  Adaptive Boosting (AdaBoost)          0.809
  Extremely Randomized Trees            0.809
  Quadratic Discriminant Analysis       0.809
  Multilayer Perceptron (MLP)           0.809
  K-Nearest Neighbours                  0.809
  Perceptron with three hidden layers   0.809
  Perceptron with two hidden layers     0.808
  Perceptron with one hidden layer      0.808
  Multinomial Naive Bayes               0.631
  Bernoulli Naive Bayes                 0.607

Table 4: Results on the test set, including the best participants of IberLEF 2021.
                     Fake    True    F_macro  Accuracy
  Textual Models     0.7140  0.7488  0.7314   0.7325
  Hybrid Model       0.7900  0.7352  0.7626   0.7657
  GDUFS_DM           0.7666  0.7649  0.7666   0.7657
  Haha               0.7548  0.7522  0.7548   0.7535
  Chats_             0.7514  0.7690  0.7514   0.7605
  SINAI              0.7385  0.7821  0.7385   0.7622
  baseline-BERT      0.7321  0.7432  0.7321   0.7378
  baseline-BOW-SVM   0.7217  0.7359  0.7217   0.7290

In view of the results shown in Tables 1 and 3, the approach that obtains the best F1 in cross-validation is a model that uses only textual information, specifically a Random Forest with a weighted tf-idf. This approach obtains a higher F1 than the models that include social information, so a priori one might think that social information does not provide relevant information. However, Table 4 shows that, on the test set, the model that uses only textual information obtains worse results than the hybrid model. This is due to the fact that, when using tf-idf weights, there may be words in the corpus on which the weights were computed (the training news corpus) that do not appear in the test set. This is why models such as transformer networks pre-trained on large corpora will have more generalisation capacity and, therefore, will be able to obtain better results.
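The tf-idf generalisation issue discussed above can be illustrated with a toy example: a term unseen at training time (here, "covid", chosen only for illustration) is absent from the fitted vocabulary, so the test-time representation carries no trace of it.

```python
# Words outside the training vocabulary contribute nothing to tf-idf vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["la economia crece", "el gobierno aprueba la ley"]
vec = TfidfVectorizer()
vec.fit(train_texts)

test_vector = vec.transform(["nueva variante de covid detectada"])
print("covid" in vec.vocabulary_)  # False: never seen during fitting
print(test_vector.nnz)             # 0: no test word overlaps the vocabulary
```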
Once social information is introduced into the model, a significant increase in results can be seen. This is because, on the one hand, the text is processed using transformer models with a very high generalisation capacity and, on the other, the non-textual social information extracted from Twitter is the same regardless of the subject matter. Comparing our models with the best-ranked participants of IberLEF 2021 (Figure 1), the hybrid model is the one that best classifies fake news, and it obtains the same Accuracy as the first-ranked team.

In addition, a study has been carried out on which social features are the most relevant for the model. For this purpose, permutation importance, as set out in [13], has been used. It can be seen that 8 of the 9 most relevant features depend only on the author's information and not on the content or metadata of the tweet. These 9 features are, in order of importance: listed_count, following_count_std, followers_count, tweet_count_std, followers_count_std, quote_count_std, verified, verified_std, tweet_count. Within these features, the information provided by the standard deviations computed over the set of tweets collected for each news item stands out.

The percentage importance of the most relevant features used in the logistic regression of the hybrid model has also been calculated. To calculate the importance of each feature f_i, its regression coefficient w_i has been extracted and the transformation f_i = e^{w_i} applied; finally, the percentage of each feature has been calculated. With this, the most relevant feature for the model, with 10 times more importance than the rest, was the variable corresponding to the probability returned by the Random Forest that a news item is true, using the social information of the news item.

7.
Conclusions and Future Work
Throughout the development of this work, it has been observed that introducing social information, combined with textual information, helps improve the performance of news classification models. This suggests that, when tackling this problem, it is useful to add social information to the dataset. However, obtaining this information is quite costly, both economically and in terms of time. Additionally, the importance of social features in classifier models has been studied, concluding that author-related features are more important than tweet-related features. A model that combines all textual and social features achieves similar or better results than models that use only textual information. However, it is crucial to acknowledge several important limitations:

• Impractical Approach: Many of the social signals being harvested are post-facto. While disinformation might actually be spreading, many features (such as the number of reposts) would not yet have stabilised. Thus, while the current approach of augmenting these signals might work post-facto, it is unlikely to work with live data. Even post-facto, it is unclear whether the approach will scale.

• Flawed Methodology: The use of balanced training data, and a small set of data at that, is not meaningful. In particular, it is unclear how learning from such a small corpus would generalise when new kinds of disinformation arise. In practice, the distribution of disinformation-carrying articles compared to genuine ones is far from balanced. Therefore, any realistic methodology needs to incorporate the ability to handle imbalance and transferability from the learning phase. Moreover, adversary behaviour might change to emulate the features of genuine articles, or at least stray from its current behaviour, rendering the specific features used for classification obsolete.
• Too Static and Small Dataset: The dataset used is too static and small, and lacks adequate diversity to consider any results conclusive. A variety of distinct datasets ought to be used to determine if the ideas actually work in a more general setting.

As a line of future work, a good approach would be not only to study the individual social metadata of each user, but also to study a social graph of followers and followed accounts to see the social relationships that exist between them. Additionally, the dataset should be expanded and diversified, and methods should be developed to handle imbalanced data and adapt to changing adversary behaviour. We acknowledge that this work, while preliminary, can trigger useful discussions and provides a foundation upon which more robust and scalable approaches can be built in the future.

Acknowledgments
This work was supported by the HAMiSoN project grant CHIST-ERA-21-OSNEM-002, AEI PCI2022-135026-2 (MCIN/AEI/10.13039/501100011033 and EU "NextGenerationEU"/PRTR).

References
[1] P. Bharadwaj, Z. Shao, Fake news detection with semantic features and text mining, International Journal on Natural Language Computing (IJNLC) 8 (2019).
[2] R. K. Kaliyar, A. Goswami, P. Narang, FakeBERT: Fake news detection in social media with a BERT-based deep learning approach, Multimedia Tools and Applications 80 (2021) 11765–11788.
[3] N. K. Conroy, V. L. Rubin, Y. Chen, Automatic deception detection: Methods for finding fake news, Proceedings of the Association for Information Science and Technology 52 (2015) 1–4.
[4] C. Buntain, J. Golbeck, Automatically identifying fake news in popular Twitter threads, in: 2017 IEEE International Conference on Smart Cloud (SmartCloud), IEEE, 2017, pp. 208–215.
[5] M. Albahar, A hybrid model for fake news detection: Leveraging news content and user comments in fake news, IET Information Security 15 (2021) 169–177.
[6] N. Ruchansky, S. Seo, Y.
Liu, CSI: A hybrid deep model for fake news detection, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 797–806.
[7] K. Shu, S. Wang, H. Liu, Exploiting tri-relationship for fake news detection, arXiv preprint arXiv:1712.07709 (2017).
[8] J.-P. Posadas-Durán, H. Gómez-Adorno, G. Sidorov, J. J. M. Escobar, Detection of fake news in a new corpus for the Spanish language, Journal of Intelligent & Fuzzy Systems 36 (2019) 4869–4876.
[9] H. Gómez-Adorno, J. P. Posadas-Durán, G. B. Enguix, C. P. Capetillo, Overview of FakeDeS at IberLEF 2021: Fake news detection in Spanish shared task, Procesamiento del Lenguaje Natural 67 (2021) 223–231.
[10] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, PML4DC at ICLR 2020 (2020) 1–10.
[11] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[12] F. Barbieri, L. Espinosa-Anke, J. Camacho-Collados, XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond, in: Proceedings of LREC, Marseille, France, 2022, pp. 20–25.
[13] A. Altmann, L. Toloşi, O. Sander, T. Lengauer, Permutation importance: a corrected feature importance measure, Bioinformatics 26 (2010) 1340–1347.