TSIA team at FakeDeS 2021: Fake News Detection in Spanish Using Multi-Model Ensemble Learning Zhengyi Guan1[0000−0002−6265−7193] School of Information Science and Engineering Yunnan University, Yunnan, P.R. China 1941028528@qq.com Abstract. Fake news has become a hotly debated topic in journalism. This paper describes our contribution of the TSIA team in the Fake News Detection in Spanish Shared Task of IberLEF 2021. We regard this task as a binary classification task. We mainly propose three model architec- tures based on the pre-trained model BETO and XLM-RoBERTa-Large. We first fine-tuned the Spanish pre-trained model BETO and then we chose the multi-language pre-trained model XLM-RoBERTa-Large to re- place BETO and fine-tune it, including the addition of CNN for feature extraction. Finally, our system achieves best F1-score of 0.6860 by hard voting, which ranks 10th out of 21 teams on the final leaderboard. Our score is only 0.0806 worse than the best score on the leaderboard. Keywords: Fake News Classification · Natural Language Processing · XLM-RoBERTa-Large· Ensemble. 1 Introduction This goal of Fake News Detection in Spanish Shared Task at IberLEF 2021 [4] [7] aims to help users detect and filter out potentially deceptive news in social net- works. As we all know, social networks offer platforms in which information and articles may be shared without fact-checking or moderation. Moderating user- generated content on social media presents a challenge due to both volume and variety of information posted. In particular, highly partisan fabricated materi- als on social media, fake news, is believed to be an influencing factor in recent elections [1]. Misinformation spread through fake news has attracted significant media attention recently and current approaches rely on manual annotation by third parties [5] to notify users that shared content may be untrue. Social me- dia information may not only represent a lot of negative emotions(terrorism, political elections, advertisement, satire, among others), but also show the par- ticularity that the people can decide to show or hide their identity. The task of IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). detecting fake news is defined as the prediction of the chances of a particular news article being deceptive [12]. The conventional solution to this task is to ask professionals such as journalists to check claims against evidence based on previously spoken or written facts. However, it is time-consuming and expensive. For example, it is hard for editors to judge whether a piece of news is real or not. As the Internet community and the speed of the spread of information are growing rapidly, automated fake news detection on Internet content has gained interest in the Artificial Intelligence research community. The goal of automatic fake news detection is to reduce the human time and effort to detect fake news and help us stop spreading it. The task of fake news detection has been studied from various perspectives with the development in subareas of Computer Sci- ence, such as Machine Learning (ML), Data Mining (DM), and NLP [8]. Besides the fact that most of the previous works done in these two tasks, namely ag- gressiveness detection and fake-news detection, are for English, little research has been done for Spanish using the most recent NLP techniques such as deep learning approaches [16]. In this paper, We use popular techniques in natural language processing to solve the problem of identifying fake news in Spanish. The remainder of the paper is structured as follows: a brief analysis on related work is performed in section 2, followed by a description of the datasets and details on the methods employed for detection of fake news in Section 3. Section 4 outlines the evaluation process and results, while conclusions and future work are drawn in section 5. 2 Related work For datasets in different languages, it brings challenges to fake news detection. In recent years, researchers have done a lot of research on fake news detection on English datasets. And due to the impact of Covid-19, many competitions have issued tasks on fake news detection. Such as SemEval 2021 Task 1 released the de- tection of toxic text span, HASOC 2020 2 issued the challenge of hate speech and offensive content identification in Indo-European languages and CONSTRAINT 2021’ task 3 , about hostility detection in Hindi. All these show that the detec- tion of fake news has always been a fiery challenge. Hence, the researches on the detection of fake news in Spanish in social media is also valuable. This is also helpful for the detection of Covid-19 information in Spanish social media. The detection of fake news is the same as other text classification problems in natural language processing. The most important thing is to find suitable features to represent sentences. The task is to assign predefined categories to a given text sequence. Many work has shown that pre-trained models on large cor- pora are beneficial for text classification and other NLP tasks, which can avoid training new models from scratch. Since 2013, people have proposed some word embedding approaches such as word2vec [6] and glove [9]. However, because their 1 https://sites.google.com/view/toxicspans 2 https://hasocfire.github.io/hasoc/2020/ 3 http://lcs2.iiitd.edu.in/CONSTRAINT-2021/ word embeddings are all in the same space, they can not express the role of poly- semy. In other words, they are non-contextual embedding, they can not capture the high-level concepts of sentences, such as semantics and context [13]. Later, someone proposed the ELMo [10] model to solve this problem. Compared with word2vec and glove, ELMo captures contextual information and not just individ- ual information of words. In word2vec, the vector representations of words are completely consistent in different contexts, but ELMo is optimized for this [17]. More recently, pre-trained language models have shown to be useful in learning common language representations by utilizing a large amount of unlabeled data: such as OpenAI GPT [2] and BERT [3]. BERT is based on a multi-layer bidirec- tional Transformer [15] and is trained on plain text for masked word prediction and next sentence prediction tasks. Since BERT is suitable for English and the dataset of this competition is Spanish, which also added Covid-19 related data for English. We finally choose BETO4 and a multi-language pre-trained model— XLM-RoBERTa-Large 5 as our pre-trained model. And we fine-tuned this two pre-trained models, submited three Runs and made a hard voting on the three Runs finally. 3 Data and Methods 3.1 Dataset The dataset used in the model are all provided by the organizer. There are 676 training set and 295 development set. The corpus consists of news compiled mainly from Mexican web sources: established newspaper websites, media com- panies websites, special websites dedicated to validating fake news, and websites designated by different journalists as sites that regularly publish fake news. The corpus contains the following information [11]: – Category: Fake / True. – Topic: Science / Sport / Economy / Education / Entertainment / Politics, Health / Security / Society. – Source: The name of the source media. – Headline: The title of the news. – Text: The complete text of the news. – Link: The URL where the news was published. Since the corpus contain different labels, in order to increase the learning ability of the model. We added ”Category” and ”Topic” column to the ”Text” column. We did not use the label—”Link”. This does improve the learning ability of the model, but it also leads to the poor generalization ability of the model. In addition, we did simple data preprocessing, such as: we strip emojis from the training set, and we deleted the link of website, etc. 4 https://github.com/dccuchile/beto 5 https://huggingface.co/xlm-roberta-large 3.2 Fine-tuned of BETO and XLM-RoBERTa-Large Pre-trained and fine-tuning architecture is already a popular method for text classification. Our system used BETO and XLM-RoBERTa-Large as the pre- trained model, and we provided three runs with ensemble. They are: – Run 1: Fine-tuned of BETO – Run 2: XLM-RoBERTa-Large – Run 3: XLM-RoBERTa-Large + CNN BETO is similar to BERT. They all have 12 hidden layers. BETO is a BERT model trained on a big Spanish corpus. BETO is of size similar to a BERT- Base and was trained with the whole word masking technique. Representing each word in the sentence as a vector, which includes word embedding and char- acter embedding. The character embedding is initialized randomly. The word embedding is usually imported from a pre-trained word embedding file. All em- beddings will be fine-tuned during training. For the Run 1, as is shown in Fig. 1, P O is the pooler output of BETO, HO is hidden-state of the first token of the sequence(CLS token) at the output of the hidden layer of the model. Then, we concatenate P O and HO of the last three hidden layers into the classifier after obtaining P O. The Facebook AI team released XLM-RoBERTa in November 2019 as an update of its original XLM-100 model. They are all transformer-based language models, all rely on the mask language model target, and they can handle texts in 100 different languages. Compared to the original version, the biggest update of XLM-RoBERTa is a significant increase in the amount of training data. The commonly used crawler datasets that have been cleaned and trained occupy up to 2.5tb of storage space. It is several orders of magnitude larger than the Wiki- 100 corpus used to train its previous version, and this expansion is especially noticeable in languages with fewer resources. XLM-RoBERTa-Large adds 12 hidden layers on the basis of XLM-RoBERTa. Therefore, the network structure of XLM-RoBERTa-Large is much more complicated, and the number of pre- trained layers is deeper. For the Run 2, wo just add a classifier after the XLM- RoBERTa-Large(Note: we did not give the architecture of Run2). For the Run 3, as is shown in Fig. 2, we add CNN before P O is sent to the classifier. Firstly, we got pooler output (P O), P O is the pooler output of XLM-Roberta-Large. It is obtained by its last layer hidden state of the first token of the sequence (CLS token) further processed by a linear layer and a tanh activation function. Then, we let P O go through a three-layers CNN (including convolution and pooling). Finally, input this two-dimensional vector into a linear classifier to do a binary classification. 3.3 Ensemble learning We use the multi-model ensemble learning approach to get a stable system that performs well in all aspects. We further use hard voting to determine the final category, whose main idea is to vote for a speech by the classification results Fig. 1. Model for Run 1 Fig. 2. Model for Run 3 of each model and the minority obeys the majority. Thus, our final predition result integrates the models of Run 1, Run 2 and Run 3 by ensemble learning. The experimental results in the next chapter verify the effectiveness of ensemble learning. 4 Experiments and Results 4.1 Hyper-parameters settings In this work, our models were implemented based on Pytorch 6 . Our experiments were run on Google Colab 7 . The GPU is Tesla P4. The batch size is 32. Our hidden layer state of BETO and XLM-RoBERTa-Large by setting the output hidden states was True. We used the adam optimizer and the learning rate of three Runs was 5e-5. The three models were trained in 30 epochs. For the Run 3, we used three convolutional layers. The number of convolution kernels is 256. The activation function is Relu. The pooling layer uses maximum pooling. 4.2 Criteria evaluation and results We mainly used F1-score to evaluate our model. The criteria evaluation of F1- score is as follows: TP TP P recision ∗ Recall ∗ 2 P recision = , Recall = , F1 = TP + FP TP + FN P recision + Recall The result is shown in Table 1. 6 https://pytorch.org/ 7 https://drive.google.com/drive/my-drive Table 1. Result of three Runs on development set and test set Development set Test set Run Accuracy F1 Recall Accuracy F1 Recall Run 1 0.9392 0.9389 0.9390 - - - Run 2 0.9593 0.9593 0.9595 - - - Run 3 0.9695 0.9695 0.9698 - 0.6252 - Ensemble - - - - 0.6860 - 4.3 Result analysis From the data in Table 1, it can be seen that the three Runs on the development set all obtain good results, which the F1-score of Run 3 is the best. This shows that CNN is helpful in this task. Therefore, we choose the XLM-RoBERTa-Large + CNN architecture to predict the final test set. The result using this model on the test set is 0.6252. Finally, we submitted the result of ensembling the three Runs by hard voting. The final best result on the test set is 0.6860, which shows that ensemble learning strengthens the learning ability of multiple classifiers. But the results of our model on the test set are not the most competitive. This may be because we did not do a better job of data augmentation(DA), which leads to the poor model generalization. We need to allow limited data to produce value equivalent to more data without substantial increase in data. Therefore, we need to put more effort in data processing and augmentation. 5 Conclusions and future work In this paper, we describe our strategy to classify fake and real text in Spanish document. In our three systems, we used transformers based pre-trained mod- els, BETO, XLM-RoBERTa-Large and XLM-RoBERTa-Large adding CNN. Our proposals show to be competitive for this specific task. However, we must also further test and improve our model, because our results are 0.0806 worse than the best F1-score. So we still have a lot of work to do in the future. In the future, We should first try to fine-tune the appropriate parameters of the model, because we have not done too many attempts to fine-tune the parameters. Then, future development directions include exploring other related datasets for fake news fields. Also, We just did ensemble learning for the pre- diction results of the three models. We need to try more integrated learning methods. And we have too few ensemble models, We need to explore more mod- els that are as competitive as others. In addition, advanced error analysis tech- niques, such as feature importance or model explainability, could also be used to improve the model’s performance [14]. References 1. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. Journal of Economic Perspectives 31(2), 211–236 (2017) 2. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Amodei, D.: Language models are few-shot learners (2020) 3. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirec- tional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805 4. Gómez-Adorno, H., Posadas-Durán, J.P., Bel-Enguix, G., Porto, C.: Overview of fakedes task at iberlef 2020: Fake news detection in spanish. Procesamiento del Lenguaje Natural 67(0) (2021) 5. Heath, A.: Facebook is going to use snopes and other fact-checkers to combat and bury ’fake news’ (2016) 6. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre- sentations in vector space. Computer Science (2013) 7. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez-Carmona, M.Á., Mellado, E.Á., de Albornoz, J.C., Chiruzzo, L., Freitas, L., Adorno, H.G., Gutiérrez, Y., Zafra, S.M.J., Lima, S., de Arco, F.M.P., Taulé, M. (eds.): Proceed- ings of the Iberian Languages Evaluation Forum (IberLEF 2021). CEUR Workshop Proceedings, 2021 8. Oshikawa, R., Qian, J., Wang, W.Y.: A survey on natural language processing for fake news detection (2018) 9. Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word repre- sentation. In: Conference on Empirical Methods in Natural Language Processing (2014) 10. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Zettlemoyer, L.: Deep contex- tualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers) (2018) 11. Posadas-Durán, J.P., Gómez-Adorno, H., Sidorov, G., Escobar, J.J.M.: Detection of fake news in a new corpus for the spanish language. Journal of Intelligent & Fuzzy Systems 36(5), 4869–4876 (2019) 12. Rubin, V.L., Conroy, N.J., Chen, Y.: Towards news verification: Deception detec- tion methods for news discourse. In: Hawaii International Conference on System Sciences (2015) 13. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune bert for text classification? (2020) 14. Tanase, M.A., Zaharia, G.E., Cercel, D.C., Dascalu, M.: Detecting aggressiveness in mexican spanish social media content by fine-tuning transformer-based models. In: MEX-A3T at IberLEF 2020: Authorship and aggressiveness analysis in Twitter: case study in Mexican Spanish (2020) 15. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017), http: //arxiv.org/abs/1706.03762 16. Villatoro-Tello, E., Ramı́rez-De-La-Rosa, G., Kumar, S., Parida, S., Motlicek, P.: Idiap and uam participation at mex-a3t evaluation campaign. In: IberLEF2020 (2021) 17. Zhang, Y., Shen, D., Wang, G., Gan, Z., Carin, L.: Deconvolutional paragraph representation learning. In: NIPS (2017) (2017)