YETI at FakeDeS 2021: Fake News Detection in Spanish with ALBERT

Hongxin Luo
School of Information Science and Engineering, Yunnan University, Yunnan, P.R. China
1104792873@qq.com

Abstract. This paper describes our participation in IberLEF 2021 shared task 7, the Fake News Detection task. The goal of the task is to analyze a corpus of Spanish news and determine the authenticity of its content. False information is designed to affect people negatively by spreading claims that do not match the facts, so that users accept biased or erroneous information. Detecting fake news has therefore become particularly important. For this task, we compared different methods of fake news detection and chose the ALBERT model, to which we made a simple modification of the upper structure. Our system obtained an F1 score of 63.16% in the task. Although our proposal did not achieve the best result, it offers a new idea for fake news detection.

Keywords: ALBERT · Fake News Classification · Natural Language Processing · Deep Learning.

1 Introduction

Many years ago, our main channels for obtaining news and information were television and newspapers. In recent years, with the rise of the mobile Internet, more and more people choose to obtain news from social media. However, the quality of news on social media is far lower than that of traditional media. Because anyone can easily publish a news article on social media, the quality of articles is uneven, and a great deal of fake news circulates. False information is designed to influence people negatively and to deliberately persuade users to accept biased or erroneous information. Detecting fake news on social media has therefore become particularly important. This is the mission of IberLEF 2021 [3], a forum that aims to encourage research on social media content analysis in Spanish [1][10].
In this work, we explore the fake news detection task of IberLEF 2021 from the perspective of deep learning [3]. The task can be regarded as binary classification of Spanish text. The corpus consists of news compiled mainly from Mexican web sources: established newspaper websites, media company websites, websites dedicated to validating fake news, and websites designated by different journalists as sites that regularly publish fake news. The news was collected from January to July 2018, and all of it is written in Mexican Spanish [10]. The corpus contains a total of 971 news items. We compared several neural network models: a convolutional neural network (TextCNN) [12], a fast text classifier (fastText) [4], and A Lite BERT for self-supervised learning of language representations (ALBERT) [7]. On the data set provided for the task, ALBERT performed best on our validation set, so we used the ALBERT model for this task.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 describes our method in detail, including the data set, data preprocessing, and architecture. Section 4 outlines the evaluation process. Finally, Section 5 summarizes our work.

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Related Work

IberLEF is an Iberian languages evaluation forum for NLP tasks. In the 2020 edition of the fake news detection task, participants proposed a variety of methods, from traditional machine learning to deep learning, including BoW, n-grams, neural networks, and Transformers [1][8][13].
According to the authors' analysis of that edition, the best results were obtained with the Supervised Autoencoder (SAE) method, a neural network that learns a representation (encoding) of the input data and then learns to reconstruct the original input [14]. It uses three types of features as input representation: word n-grams, character n-grams, and BETO encodings [14]. In that previous edition, this supervised autoencoder approach obtained good results [1].

Detecting fake news on social media is becoming more and more important. To build an effective classifier, one of the most important problems is finding suitable input features. Two types of features are widely used: surface features, such as n-grams [8], and word representations trained by a neural network, such as skip-grams. General classifiers use traditional machine learning methods, such as support vector machines, random forests, and logistic regression, and are trained for different types of tasks.

In many NLP tasks, it is effective to use pre-trained word embeddings to extract features [16][15]. A word embedding model is extracted from a shallow neural network that must be trained on a large amount of text; models such as skip-grams and GloVe [9] can learn contextual representations of words in this way. However, these embeddings are learned over all possible words, which may mask nuances of meaning. In contrast, transformer-based language models, such as the OpenAI Generative Pre-trained Transformer (GPT) [11] and BERT [2], have been extended to a depth of as many as 12 layers. ALBERT uses techniques such as parameter sharing and matrix decomposition to greatly reduce the number of model parameters [7]. ALBERT can substantially improve the quality of language models.
It can learn good feature representations for words by running an unsupervised language representation learning algorithm on a massive corpus. Here, self-supervised learning means learning that does not rely on human-labeled data. Compared with ELMo and GPT, the pre-trained ALBERT model has achieved good results in a series of NLP tasks [7].

3 Methodology

3.1 Datasets

The Spanish news corpus was collected from January to July 2018, and all of it is written in Mexican Spanish. The news items were aggregated from several online sources: established newspaper websites, media company websites, websites that specialize in verifying fake news, and websites designated by different reporters as regularly publishing fake news. The aggregated corpus contains 971 news items. The news covers 9 different topics, which keeps the corpus as balanced as possible. The numbers of fake and real news items are also roughly balanced. Of the data set, 676 items are used as the training set and 295 items as the validation set, a ratio of about 7:3 [1].

3.2 Pre-processing

Although deep learning methods can learn the main features from the data, the performance of the model also depends on the quality of the input training data [6]. Data preprocessing can remove noise from the input data and so improve the performance of the model. For the model we used, we applied the following basic preprocessing steps:

1. Convert the input text to lowercase.
2. Remove punctuation marks.
3. Delete numeric characters.
4. Delete the stop-words.

These steps remove information that is not useful for feature extraction by the model. We used the Natural Language Toolkit (NLTK) for the stop-word removal step.
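The four preprocessing steps above can be sketched as follows. This is a minimal illustration, not the pipeline used in the experiments: the stop-word set is a short, hypothetical stand-in for the NLTK Spanish list, the extra Spanish punctuation marks are an assumption, and the name `preprocess` is illustrative.

```python
import string

# Hypothetical, abbreviated Spanish stop-word set used only for illustration;
# the paper relies on NLTK's (much longer) stop-word list instead.
SPANISH_STOPWORDS = {
    "el", "la", "los", "las", "de", "del", "en", "y", "a", "que",
    "un", "una", "por", "con", "se", "su", "para", "es", "al", "lo",
}

# ASCII punctuation plus common Spanish marks (an assumption of this sketch).
PUNCTUATION = string.punctuation + "¿¡«»“”‘’"

# Translation tables that delete punctuation and ASCII digits, respectively.
_PUNCT_TABLE = str.maketrans("", "", PUNCTUATION)
_DIGIT_TABLE = str.maketrans("", "", string.digits)


def preprocess(text: str) -> str:
    """Apply the four steps: lowercase, strip punctuation, strip digits, drop stop-words."""
    text = text.lower()                    # step 1: lowercase
    text = text.translate(_PUNCT_TABLE)    # step 2: remove punctuation marks
    text = text.translate(_DIGIT_TABLE)    # step 3: delete numeric characters
    tokens = [t for t in text.split() if t not in SPANISH_STOPWORDS]  # step 4
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("¡El Gobierno anunció 3 medidas en 2018!"))
```

Because punctuation is deleted rather than replaced by a space, tokens joined only by punctuation (for example `palabra,otra`) are merged; replacing punctuation with spaces is an equally reasonable choice.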
In the experiment, we use 5-fold cross-validation to control the order of data in each batch. For each fold of the data set, the input format is [CLS] + sentence + [SEP], where [CLS] separates the samples and [SEP] separates the sentences within a sample. The pre-trained model is loaded from the ALBERT-base-V2 model [7]. In the V2 release, ALBERT applies the "no dropout", "additional training data", and "long training time" strategies to all models. ALBERT-base is trained for 10M steps and the other models for 3M steps.

3.3 ALBERT

The research trend in NLP is to use larger and larger models to obtain better performance, since increasing the depth of the network can improve results [11][2]. Research on ALBERT shows, however, that blindly stacking model parameters may degrade performance, and that memory use and training speed also suffer. ALBERT addresses this problem with a Lite BERT architecture, which has fewer parameters than the traditional BERT architecture [7].

ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation. It uses parameter sharing, matrix decomposition, and other techniques to greatly reduce the number of model parameters, and it replaces the NSP (Next Sentence Prediction) loss with the SOP (Sentence Order Prediction) loss to improve performance on downstream tasks. The reduction in parameters can make training faster [7].

The structure of ALBERT is basically the same as that of BERT, with three specific improvements: embedding-layer parameter factorization, cross-layer parameter sharing, and replacing the NSP task with the SOP task. We fine-tune ALBERT on the training data set.
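To make the effect of embedding-layer parameter factorization concrete, the sketch below counts embedding parameters with and without the factorization. The sizes used (a 30,000-token vocabulary, embedding size 128, hidden size 768) are assumed here to match the ALBERT-base configuration described in [7]; the function names are illustrative and the count ignores position and segment embeddings.

```python
def unfactorized_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """BERT-style embedding: one V x H matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> int:
    """ALBERT-style embedding: a V x E matrix followed by an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


if __name__ == "__main__":
    # Assumed ALBERT-base sizes (see the lead-in): V = 30,000, E = 128, H = 768.
    V, E, H = 30_000, 128, 768
    full = unfactorized_embedding_params(V, H)
    factored = factorized_embedding_params(V, E, H)
    print(f"V x H         = {full:,}")
    print(f"V x E + E x H = {factored:,}")
    print(f"reduction     = {full / factored:.2f}x")
```

Under these assumptions the factorization shrinks the token-embedding parameters from V·H to V·E + E·H, which pays off whenever E is much smaller than H.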
3.4 Method

In a classification task, the output of ALBERT-Base (the pooler output) is the last-layer hidden state of the first token of the sequence (the CLS token), further processed by a linear layer and a Tanh activation function. Because the pooler output cannot summarize the semantic content of the input well, and studies have shown that the top layers of BERT learn richer semantic features [11], we modify the ALBERT model to obtain richer semantic features. We pass the Spanish news directly to the ALBERT model and feed the concatenation of H0 from the last three hidden layers into the classifier, where H0 is the hidden state of the first token of the sequence (the CLS token) at the output of a hidden layer. We call this method ALBERT Classifier.

4 Results

Task 7 is to detect fake news, and submitted systems are ranked by the F1 measure on the "Fake" class.

In our work, all models are implemented in TensorFlow, and the pre-trained models are cased. Because of the limited memory of our personal GPU, the batch size and max-seq-length in the fine-tuning stage were adjusted to the memory capacity in order to achieve the best results. The optimizer used in this experiment is Adam [5]. Table 1 shows the hyperparameters of each model and the results of the models on the validation data set of the fake news detection task.

Table 1. The hyper-parameters of our model and the results on the validation set.

Hyperparameters: max-seq-length=512, train-batch-size=16, warmup-step=100, learning-rate=1e-5

Model Architecture    Stopwords    ACC       F1
ALBERT                Yes          0.7627    0.7602
ALBERT                No           0.7762    0.7758
ALBERT Classifier     Yes          0.8169    0.8154
ALBERT Classifier     No           0.8203    0.8199

The results in Table 1 show that our modified ALBERT model performs better than the plain ALBERT model on the validation set.
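The ALBERT Classifier head described in Section 3.4 can be sketched without any deep learning framework. This is an illustration under stated assumptions, not the TensorFlow implementation used in the experiments: the toy hidden states, the weights, the layer ordering inside the concatenation, and the names `concat_last_cls` and `linear` are all hypothetical.

```python
from typing import List

Vector = List[float]
Layer = List[Vector]  # one hidden layer, shaped [seq_len][hidden_size]


def concat_last_cls(hidden_states: List[Layer], num_layers: int = 3) -> Vector:
    """Concatenate H0 (the first-token/CLS vector) of the last `num_layers` layers.

    The layers are concatenated from the earliest to the latest of the selected
    layers; the paper does not state an order, so this is an assumption.
    """
    features: Vector = []
    for layer in hidden_states[-num_layers:]:
        features.extend(layer[0])  # layer[0] is H0, the CLS hidden state
    return features


def linear(features: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Plain linear classifier: one logit per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


if __name__ == "__main__":
    # Toy encoder output: 4 layers, 2 tokens, hidden size 2 (values are made up).
    hidden_states = [
        [[0.0, 0.1], [9.0, 9.0]],
        [[1.0, 1.1], [9.0, 9.0]],
        [[2.0, 2.1], [9.0, 9.0]],
        [[3.0, 3.1], [9.0, 9.0]],
    ]
    features = concat_last_cls(hidden_states)  # length 3 * hidden_size = 6
    # Hypothetical weights for a two-class (real/fake) output.
    weights = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]]
    bias = [0.0, 0.5]
    print(features)
    print(linear(features, weights, bias))
```

In a real model the classifier input width is three times the encoder hidden size, and the per-layer hidden states must be requested from the encoder rather than only the pooler output.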
This comparison suggests that our modification gives the model richer semantic content and yields a measurable improvement. In the experiment, we also found that deleting stop-words improves fake news classification to some extent.

We feed the processed data into the model, train on the training data set, and use the test data set to obtain the model's predictions. We combine the four results through hard voting to produce the final result, so the final prediction is determined by the proportion of predictions across all predictors. We use absolute majority voting: if a label receives more than half of the votes, it is the prediction; otherwise, the label is predicted at random. The final F1 score obtained by our model for this task is 63.16%.

5 Conclusions

In this paper, we presented our method for the shared task on Spanish fake news detection organized by IberLEF 2021. We proposed modifying the upper structure of the ALBERT model and used the ALBERT-Base-V2 pre-trained model for training. The experiment uses 5-fold cross-validation, and the final result is obtained through hard voting. In future work, we hope to explore more effective data preprocessing methods, use data augmentation to improve the model, and improve our results in the next IberLEF competition.

References

1. Aragón, M., Jarquín, H., Gómez, M.M.y., Escalante, H., Villaseñor-Pineda, L., Gómez-Adorno, H., Bel-Enguix, G., Posadas-Durán, J.: Overview of MEX-A3T at IberLEF 2020: Fake news and aggressiveness analysis in Mexican Spanish. In: Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain (2020)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
3.
Gómez-Adorno, H., Posadas-Durán, J.P., Bel-Enguix, G., Porto, C.: Overview of FakeDeS task at IberLEF 2020: Fake news detection in Spanish. Procesamiento del Lenguaje Natural 67(0) (2021)
4. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
5. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2014)
6. Kotsiantis, S.B., Kanellopoulos, D., Pintelas, P.E.: Data preprocessing for supervised leaning. International Journal of Computer Science 1(2), 111–117 (2006)
7. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
8. Liu, S., Demirel, M.F., Liang, Y.: N-gram graph: Simple unsupervised representation for graphs, with applications to molecules. arXiv preprint arXiv:1806.09206 (2018)
9. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
10. Posadas-Durán, J.P., Gomez-Adorno, H., Sidorov, G., Escobar, J.J.M.: Detection of fake news in a new corpus for the Spanish language. Journal of Intelligent & Fuzzy Systems 36(5), 4869–4876 (2019)
11. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
12. Rakhlin, A.: Convolutional neural networks for sentence classification. GitHub (2016)
13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
14. Villatoro-Tello, E., Ramírez-de-la Rosa, G., Kumar, S., Parida, S., Motlicek, P.: Idiap and UAM participation at MEX-A3T evaluation campaign.
In: Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain (2020)
15. Wang, J., Peng, B., Zhang, X.: Using a stacked residual LSTM model for sentiment intensity prediction. Neurocomputing 322, 93–101 (2018)
16. Wang, J., Yu, L.C., Lai, K.R., Zhang, X.: Community-based weighted graph model for valence-arousal prediction of affective words. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(11), 1957–1968 (2016)