GDUF_DM at FakeDeS 2021: Spanish Fake News
        Detection with BERT and Sample Memory

                Xixuan Huang1, Jieying Xiong1, and Shengyi Jiang1,2
 1 School of Information Science and Technology, Guangdong University of Foreign Studies,
   China
 2 Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of
   Foreign Studies, Guangzhou
                                jiangshengyi@163.com


       Abstract. Fake news spreads widely on the Internet and has a negative impact on
       society. This article reports the solution for Spanish fake news detection proposed
       by our team, GDUFS_DM, in the IberLEF 2021 shared task. We use BERT and a
       Sample Memory with an attention mechanism to detect Spanish fake news. To
       capture richer semantic information in long news texts, we use BERT to encode
       the news headline together with both the beginning and the end of the news text,
       which keeps more information than a simple truncation strategy. In addition, we
       use a parameter matrix initialized from sample representations (we call it Sample
       Memory). Combined with the attention mechanism, it lets the model capture
       relationship information between samples, which strengthens the model's
       robustness in the inference stage. Our submission achieved first place on the
       leaderboard, which reflects the advantages of our model.

       Keywords: Fake News Detection, Spanish, BERT, Sample Memory.


1      Introduction

Fake news refers to news articles that are intentionally and verifiably false [1]. The
rapid development of online news media platforms has not only made it convenient for
readers to obtain news, but has also provided fertile ground for the breeding and
dissemination of fake news. The publication of fake news is often intentional, and
individuals or organizations may publish different types of fake news for different
purposes. Fake news can be used not only to insult and slander individuals, but also to
disrupt social order, instigate political unrest, or even undermine the peace and stability
of the international community. Worse, research on the dissemination of fake news shows
that it spreads significantly faster, deeper, and more widely than true news [2].
Therefore, automatically and accurately identifying fake news by machine has important
practical significance.

IberLEF 2021, September 2021, Málaga, Spain.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
   FakeDeS@IberLEF 2021 [3, 4] provided a fake news detection corpus in Mexican
Spanish [5]. The corpus mainly collects news from websites and contains fairly balanced
data of real and fake news on 9 different topics. It is intended to encourage research on
detecting fake news articles in Mexican Spanish that spread through digital media. The
distribution of the dataset is shown in Table 1. Our team, GDUFS_DM, participated in
this evaluation and achieved first place on the leaderboard. In this report, we review our
solution for this task, namely Mexican Spanish fake news detection with BERT and
Sample Memory (see Section 3.3 for details).

            Table 1. The statistics of the Mexican Spanish fake news corpus

                                      Training Set                Validation Set
Topic                          True          Fake         True            Fake
Economy                        18            12           6               7
Education                      6             9            4               3
Entertainment                  48            55           22              23
Health                         16            16           7               7
Politics                       121           105          54              43
Science                        32            30           14              13
Security                       11            18           6               7
Society                        41            52           19              22
Sport                          45            41           21              17


2        Related work

Numerous scholars have conducted extensive research on text features and emotional
features to improve fake news detection. Ajao et al. pointed out that there is a
relationship between news veracity and the sentiment of the published text, and added a
sentiment feature (the ratio of the number of negative and positive words) to help
plain-text fake news detectors [6]. Instead of attaching a single feature, Zhang et al.
verified the difference between dual emotions in fake news and real news, and proposed
a dual emotion feature to represent dual emotions and the relationship between them for
fake news detection [7]. Przybyla concluded that the writing style of fake news has
certain characteristics, and designed two new classifiers: a neural network classifier and
a classifier based on stylometric features [8]. Wang et al. proposed a weakly supervised
fake news detection framework, WeFEND, which uses user reports as weak supervision
to expand the amount of training data, given the dynamic nature of news and the reality
that labeled samples may become outdated quickly [9]. Yi Xie et al. proposed a fake
news detection framework that characterizes users by utilizing social user graphs [10].
   However, the majority of studies on automated fake news detection have been limited
to English documents, and few have evaluated approaches in other languages. Because
the spread of deceptive news is a worldwide problem, fake news needs to be studied and
detected in other languages as well. Some scholars have studied fake news detection in
low-resource languages. Nankai Lin et al. proposed the CharCNN-RoBERTa model to
detect fake news in the Urdu language [11]. Hugo Queiroz Abonizio et al. evaluated
textual features that are not tied to a specific language for detecting fake news [12];
they explored news corpora written in American English, Brazilian Portuguese and
Spanish to investigate complexity, stylometric and psychological textual features. As
regards Mexican Spanish, MEX-A3T@IberLEF2020 [13] called for methods for
aggressiveness and fake news detection in Mexican Spanish. Samuel Arce-Cardenas et al.
evaluated combinations of basic text classification techniques, including six machine
learning algorithms, two keyword extraction methods, and two preprocessing
techniques [14]. Their best run achieved an F1-macro score of 0.815 for fake news. Esaú
Villatoro-Tello et al. [15] evaluated Supervised Autoencoder (SAE) learning algorithms
in a text classification task. They used three different sets of features as input, namely
classical word n-grams, char n-grams and Spanish BERT encodings, and obtained the
best performance (F = 85.66%) in the fake news classification task.


3      Method

3.1    Overview

The model we finally proposed for this task is shown in Fig. 1. The beginning and end
parts of the news text are fed into BERT to obtain two text embeddings (called Head
Embedding and Tail Embedding). After an element-wise addition of these two
embeddings, we calculate the dot-product attention between the result and the Sample
Memory Utils to obtain the Memory Embedding. Finally, the Head Embedding, Tail
Embedding, and Memory Embedding are concatenated to calculate the output. The
specific components are explained in detail below. In addition, we use tricks such as
gradient accumulation, early stopping, and differential learning rates.
                      Fig. 1. Fake news detection model we proposed


3.2    BERT

It has been shown that pre-trained language models (PTMs) significantly improve the
performance of text classification and reduce the amount of labeled data required in
supervised learning [16]. In this evaluation task, we used one representative pre-trained
model, BERT [17] (Bidirectional Encoder Representations from Transformers). In the
pre-training stage, the model learns general semantic information of the language from
a large-scale unlabeled corpus, according to the pre-training tasks. In the fine-tuning
stage, it can be used as a feature extractor in downstream tasks to obtain embeddings of
texts or tokens, combined with a task-specific fine-tuning framework and labeled data.
One disadvantage of pre-trained models is that pre-training often takes a lot of
computing power and time, which was difficult for us. Fortunately, Canete et al. (DCC
UChile) released a Spanish BERT model that is available through the open-source
Transformers platform [18, 19]. This model comes in cased and uncased versions, which
were also the main BERT models used in this evaluation.
   Most pre-trained models set the maximum sequence length to 512, which is not well
suited to long texts such as news. A common solution is truncation, that is, retaining
only a token sequence of limited length. Previous research has found that for text
classification, keeping both the head and tail tokens achieves better results than keeping
only the head tokens or only the tail tokens, which suggests that the head and tail of a
text may contain more information than the middle [20]. We concatenated news
headlines and news texts in the training set and validation set. After tokenization, we
calculated the length distribution of the token sequences (see Fig. 2). The average
sequence length reached 546, and nearly 41% of the sequences are longer than 512
tokens. This means that a lot of useful information may be lost if only simple truncation
is used. To address this, we adopted a simple method: we take the first 512 tokens and
the last 512 tokens, encode them separately, and concatenate the results to obtain the
embedding of the news text, retaining as much information as possible.
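
The head-and-tail selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `head_tail_windows` is hypothetical, the token ids are stand-in integers, and the sketch ignores the special tokens that a real BERT tokenizer adds to each window.

```python
from typing import List, Tuple

MAX_LEN = 512  # maximum sequence length stated in the text


def head_tail_windows(token_ids: List[int],
                      max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Return the first and the last `max_len` tokens of a sequence.

    For sequences no longer than `max_len`, both windows are the whole
    sequence, so the two encodings see the same tokens.
    """
    head = token_ids[:max_len]
    tail = token_ids[-max_len:]
    return head, tail


# Toy document with 546 tokens (the average length reported in the text).
doc = list(range(546))
head, tail = head_tail_windows(doc)
print(len(head), head[0], head[-1])
print(len(tail), tail[0], tail[-1])
```

For a 546-token document the two windows overlap heavily; only for much longer documents does the middle part drop out entirely.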

[Figure: bar chart; axis labels "Length of Sequence" and "Number of News".
 Number of news per token sequence length range:
 <=128: 17; 129-256: 107; 257-384: 246; 385-512: 195; >512: 406]

                                             Fig. 2. The length distribution of token sequence
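
As a rough check of the figures quoted in the text, the following Python sketch bins sequence lengths into the ranges used in Fig. 2 and recomputes the share of sequences longer than 512 tokens from the bar values of the figure. The helper names `bin_label` and `length_histogram` are hypothetical; the counts are the five values read from Fig. 2.

```python
from collections import Counter
from typing import Dict, Iterable

# Upper bounds of the length ranges used in Fig. 2.
UPPER_BOUNDS = [(128, "<=128"), (256, "129-256"), (384, "257-384"), (512, "385-512")]


def bin_label(length: int) -> str:
    """Map a token sequence length to its Fig. 2 range label."""
    for upper, label in UPPER_BOUNDS:
        if length <= upper:
            return label
    return ">512"


def length_histogram(lengths: Iterable[int]) -> Dict[str, int]:
    """Count how many sequences fall into each length range."""
    return dict(Counter(bin_label(n) for n in lengths))


# Bar values read from Fig. 2.
fig2_counts = {"<=128": 17, "129-256": 107, "257-384": 246, "385-512": 195, ">512": 406}
total = sum(fig2_counts.values())
share_over_512 = fig2_counts[">512"] / total
print(total, round(100 * share_over_512, 1))
```

The total of 971 equals the number of training and validation samples in Table 1, and the share above 512 tokens comes out at about 41.8%.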


3.3      Sample Memory

This evaluation task also brings two challenges: thematic variation and language
variation. The news in the test corpus may contain topics that are not part of the
training corpus, and some test data may differ from the training corpus in language
style. To improve the robustness and generalization ability of the model, we designed a
network structure called "Sample Memory". Sample Memory contains m vectors (we call
them Memory Utils) with the same dimension as the Head Embedding and Tail
Embedding. Before network training, the Head Embeddings and Tail Embeddings of m
samples are used to initialize Sample Memory. Given a news item, the model applies
dot-product attention between the sample embedding and the Sample Memory Utils to
obtain the Memory Embedding, as shown in formula 1.

      $$E_{memory} = \mathrm{softmax}\left(E_{sample} \cdot [U_1^T, U_2^T, \ldots, U_m^T]\right) \cdot [U_1, U_2, \ldots, U_m]^T \qquad (1)$$

   where $E_{memory}$ is the Memory Embedding, $E_{sample}$ is the element-wise sum of
the Head Embedding and Tail Embedding, and $U_1, U_2, \ldots, U_m$ are the m Utils in
Sample Memory. Because the Memory Embedding is an attention-weighted combination
of the Utils, it does not change much even if the news text received at inference time is
very different from the training set, which means that the experience the model has
learned is still useful. In addition, we concatenate the Memory Embedding with the
Head Embedding and Tail Embedding before the next network layer, so the model can
still learn the semantic information of the news text itself.
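
A minimal sketch of formula 1 in plain Python may make the computation concrete. It is only an illustration under stated assumptions: the three-dimensional vectors and the three-row memory are toy values, the function names are hypothetical, and a real implementation would use a tensor library with BERT-sized embeddings and trainable Utils.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def memory_embedding(e_sample: Vector, memory: List[Vector]) -> Vector:
    """softmax(E_sample . [U_1^T, ..., U_m^T]) . [U_1, ..., U_m]^T."""
    weights = softmax([dot(e_sample, u) for u in memory])
    dim = len(e_sample)
    return [sum(w * u[j] for w, u in zip(weights, memory)) for j in range(dim)]


# Toy Head and Tail Embeddings (assumed values, 3 dimensions).
head = [1.0, 0.0, 2.0]
tail = [0.0, 1.0, 1.0]
e_sample = [h + t for h, t in zip(head, tail)]  # element-wise addition

# Toy Sample Memory with m = 3 Utils.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e_memory = memory_embedding(e_sample, memory)

# Concatenation of Head, Tail and Memory Embedding for the output layer.
features = head + tail + e_memory
print(e_memory, len(features))
```

Since the softmax weights are non-negative and sum to one, the Memory Embedding is always a convex combination of the Utils, whatever the input sample looks like.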


3.4    Tricks

Gradient Accumulation. Limited by the memory size of the training GPU, the batch size
during training can only reach 4. Researchers have found that increasing the batch size
appropriately helps the loss decrease more stably [21]. Therefore, we approximate a
batch size of 32 by accumulating the gradients of every 8 training steps before each
parameter update.
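
The arithmetic behind this trick can be illustrated with a toy example. The sketch below is not the training loop used in the paper: it assumes a one-parameter model y_hat = w * x with a mean squared error loss, and the data values are made up. It shows that averaging the gradients of 8 micro-batches of size 4 gives the same gradient as one batch of 32.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)

MICRO_BATCH = 4   # batch size that fits on the GPU, as stated in the text
ACCUM_STEPS = 8   # number of steps whose gradients are accumulated


def mse_grad(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up data: 32 samples, i.e. one effective batch.
data = [(float(i % 5), float(2 * (i % 5) + 1)) for i in range(MICRO_BATCH * ACCUM_STEPS)]
w = 0.5

accumulated = 0.0
for step in range(ACCUM_STEPS):
    micro = data[step * MICRO_BATCH:(step + 1) * MICRO_BATCH]
    # Scale each micro-batch gradient so that the sum is an average.
    accumulated += mse_grad(w, micro) / ACCUM_STEPS
# A single parameter update with `accumulated` would happen here.

full_batch = mse_grad(w, data)
print(accumulated, full_batch)
```

The equality holds for mean-reduced losses with equally sized micro-batches; layers that depend on batch statistics would behave differently.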
Early stop. Early stopping is a widely used trick in deep learning to avoid overfitting.
After each training epoch, we evaluate the model on the validation set to obtain the
validation loss. If the loss does not decrease for 3 consecutive epochs, training is
stopped and the model with the best performance on the validation set is taken as the
model to be submitted.
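
One way to read this rule is as early stopping with a patience of 3 epochs. The following Python sketch simulates that reading on a made-up list of validation losses; the function name `early_stopping` and the loss values are hypothetical, and the exact stopping criterion used in the paper may differ.

```python
from typing import List, Tuple


def early_stopping(val_losses: List[float], patience: int = 3) -> Tuple[int, int]:
    """Simulate early stopping on a sequence of per-epoch validation losses.

    Returns (best_epoch, last_epoch_run), both 0-indexed. Training stops as
    soon as `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses; the last value is never reached.
losses = [0.70, 0.55, 0.48, 0.50, 0.49, 0.51, 0.47]
best, stopped = early_stopping(losses)
print(best, stopped)
```

In this example the checkpoint from epoch 2 would be kept, and training would stop after epoch 5.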
Differential learning rates. It has been shown that different layers in a neural network
may capture different features [22]. Howard and Ruder pointed out that different layers
should be given different learning rates according to the specific task, and their
experiments show that layer-specific learning rates help the model achieve better
results [23]. This inspired us to set different learning rates for the parameters of the
BERT model, the parameters of the Sample Memory, and the parameters of the output
layer during training.
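
A hedged sketch of how such per-component learning rates could be organized is given below. The learning rates are the initial values reported in Section 4; the parameter name prefixes (`bert.`, `sample_memory.`, `output.`) and the function `build_param_groups` are assumptions made for illustration only.

```python
from typing import Dict, List

# Initial learning rates from Section 4; the prefixes are hypothetical names.
LR_BY_PREFIX = {"bert.": 1e-5, "sample_memory.": 1e-4, "output.": 1e-3}


def build_param_groups(param_names: List[str]) -> List[Dict[str, object]]:
    """Group parameter names by component and attach a learning rate to each group."""
    groups = {prefix: {"lr": lr, "params": []} for prefix, lr in LR_BY_PREFIX.items()}
    for name in param_names:
        for prefix in LR_BY_PREFIX:
            if name.startswith(prefix):
                groups[prefix]["params"].append(name)
                break
        else:
            # Fail loudly rather than silently training with a default rate.
            raise ValueError(f"no learning rate configured for {name!r}")
    return list(groups.values())


names = ["bert.encoder.layer.0.weight", "sample_memory.utils",
         "output.weight", "output.bias"]
groups = build_param_groups(names)
for group in groups:
    print(group["lr"], group["params"])
```

A list of dictionaries with `params` and `lr` keys has the same shape as the parameter groups accepted by PyTorch optimizers, where the entries would be tensors rather than names.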


4      Experiment

In our experiment, we only used the string obtained by concatenating the news title and
body as input. The final result we submitted used the bert-base-spanish-wwm-cased
model [18] published by DCC UChile and the spanberta-base-cased model published by
Skim AI Technologies [24]. We used initial learning rates of 1e-5, 1e-4, and 1e-3 for the
BERT parameters, Sample Memory parameters, and output layer parameters,
respectively. Before model training, we selected all training samples, validation samples,
and test samples to initialize Sample Memory. For comparison, we also tried several
machine learning models with TF-IDF feature extraction: Support Vector Machine
(SVM), Naive Bayes (NB), Logistic Regression (LR), Decision Tree (DT), Gradient
Boosting Decision Tree (GBDT) and Random Forest (RF). Table 2 shows the results on
the validation and test sets, reported as accuracy, precision for the fake class (Fake-P),
recall for the fake class (Fake-R) and F1-score for the fake class (Fake-F1).
                         Table 2. Results in validation and test set

                                                      Validation Set                  Test Set
Model                                   Accuracy     Fake-P Fake-R      Fake-F1       Fake-F1
SVM                                     74.58        74.45      71.83   73.12         -
NB                                      54.24        54.02      33.10   41.05         -
LR                                      73.56        73.53      70.42   71.94         -
DT                                      65.42        65.38      59.86   62.50         -
GBDT                                    76.95        78.46      71.83   75.00         -
RF                                      78.31        74.38      83.80   78.81         -
Ours(spanberta-base-cased)              86.10        89.76      80.28   84.76         69.07
Ours(bert-base-spanish-wwm-cased)       86.44        90.48      80.28   85.07         76.66
Second best system (in leaderboard)     -            -          -       -             75.48
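
For readers who want to relate the columns of Table 2 to each other, the following Python sketch computes the four validation metrics from confusion counts. The counts used in the example are not reported in the paper: they are inferred here from the percentages of the bert-base-spanish-wwm-cased row and from the 142 fake and 153 true validation samples in Table 1, and should be read as an assumption.

```python
from typing import Dict


def fake_class_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy plus precision, recall and F1 for the fake class, in percent."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "Accuracy": round(100 * accuracy, 2),
        "Fake-P": round(100 * precision, 2),
        "Fake-R": round(100 * recall, 2),
        "Fake-F1": round(100 * f1, 2),
    }


# Inferred counts (assumption): 114 fake news found, 28 missed,
# 12 true news flagged as fake, 141 true news classified correctly.
metrics = fake_class_metrics(tp=114, fp=12, fn=28, tn=141)
print(metrics)
# {'Accuracy': 86.44, 'Fake-P': 90.48, 'Fake-R': 80.28, 'Fake-F1': 85.07}
```

With these counts the four values match the validation row of the bert-base-spanish-wwm-cased model in Table 2.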

   To explore why Sample Memory works in our model, we also ran two ablation
experiments on the validation set: initializing the Sample Memory randomly instead of
from the news corpus, and removing Sample Memory altogether (see Table 3). Even a
randomly initialized Sample Memory improves performance over the model without it,
while initializing Sample Memory with real news samples achieves better results still.
Interestingly, the classification precision for fake news increases after removing Sample
Memory, although the recall decreases significantly. This may indicate that Sample
Memory guides the model to seek "reference objects" among past samples when making
decisions. Although errors may sometimes occur, fake news is detected more easily
thanks to Sample Memory.

                      Table 3. Experiment results for Sample Memory

      Model                                 Fake-P           Fake-R           Fake-F1
      Ours(bert-base-spanish-wwm-cased)     90.48            80.28            85.07
      Randomly initialized sample memory    89.52 (-0.96)    78.17 (-2.11)    83.46 (-1.61)
      Without sample memory                 94.23 (+3.75)    69.01 (-11.27)   79.67 (-5.40)


5       Conclusion

In this report, we presented the solution of team GDUFS_DM in the IberLEF 2021
shared task on detecting fake news in Spanish. We proposed to use BERT to encode the
beginning and end of the news text, and then applied the Sample Memory module to let
the model learn the relationship between samples. We also applied several training
tricks: gradient accumulation, early stopping and differential learning rates. The results
show that our model achieved first place in the ranking. In future work, we will consider
using additional information attached to the news, such as URLs and topics, to enhance
the performance of fake news detection.


6      Acknowledgements

This work was supported by the Key Field Project for Universities of Guangdong Prov-
ince (No. 2019KZDZX1016), the National Natural Science Foundation of China (No.
61572145) and the National Social Science Foundation of China (No. 17CTQ045). The
authors would like to thank the anonymous reviewers for their valuable comments and
suggestions.


References
 1. Shu K, Sliva A, Wang S, et al.: Fake news detection on social media: A data mining per-
    spective. ACM SIGKDD explorations newsletter 19(1), 22-36 (2017).
 2. Vosoughi S, Roy D, Aral S.: The spread of true and false news online. Science 359(6380),
    1146-1151 (2018).
 3. Gómez-Adorno H, Posadas-Durán J P, Bel-Enguix G, Claudia P.: Overview of FakeDeS
    task at IberLEF 2021: Fake news detection in Spanish. Procesamiento del Lenguaje
    Natural 67(0) (2021).
 4. Manuel Montes, Paolo Rosso, Julio Gonzalo, Ezra Aragón, Rodrigo Agerri, Miguel Ángel
    Álvarez-Carmona, Elena Álvarez Mellado, Jorge Carrillo-de-Albornoz, Luis Chiruzzo, La-
    rissa Freitas, Helena Gómez Adorno, Yoan Gutiérrez, Salud María Jiménez Zafra, Salvador
    Lima, Flor Miriam Plaza-de-Arco and Mariona Taulé (eds.): Proceedings of the Iberian Lan-
    guages Evaluation Forum (IberLEF 2021), CEUR Workshop Proceedings, (2021).
 5. Posadas-Durán J P, Gomez-Adorno H, Sidorov G, et al.: Detection of fake news in a new
    corpus for the Spanish language. Journal of Intelligent & Fuzzy Systems 36(5), 4869-4876
    (2019).
 6. Ajao O, Bhowmik D, Zargari S.: Sentiment aware fake news detection on online social net-
    works. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and
    Signal Processing (ICASSP). IEEE, pp. 2507-2511 (2019).
 7. Zhang X, Cao J, Li X, et al.: Mining Dual Emotion for Fake News Detection. arXiv e-prints
    arXiv: 1903.01728 (2019).
 8. Przybyla P.: Capturing the Style of Fake News. In: Proceedings of the AAAI Conference on
    Artificial Intelligence. vol. 34(01), pp. 490-497 (2020).
 9. Wang Y, Yang W, Ma F, et al.: Weak supervision for fake news detection via reinforcement
    learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34(01), pp.
    516-523 (2020).
10. Xie Y, Huang X, Xie X, et al.: A Fake News Detection Framework Using Social User Graph.
    In: Proceedings of the 2020 2nd International Conference on Big Data Engineering. pp. 55-
    61 (2020).
11. Lin N, Fu S, Jiang S.: Fake News Detection in the Urdu Language using CharCNN-
    RoBERTa. In: The 12th edition of the Forum for Information Retrieval Evaluation (FIRE
    2020). pp. 447-451 (2020).
12. Abonizio H Q, de Morais J I, Tavares G M, et al.: Language-independent fake news detec-
    tion: English, Portuguese, and Spanish mutual features. Future Internet 12(5), 87 (2020).
13. Aragón M E, Jarquín H, Gómez M M, et al.: Overview of MEX-A3T at IberLEF 2020: Fake
    news and aggressiveness analysis in Mexican Spanish. In: Notebook Papers of 2nd SEPLN
    Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain (2020).
14. Arce-Cardenas S, Fajardo-Delgado D, Álvarez-Carmona M Á.: TecNM at MEX-A3T
    2020: Fake News and Aggressiveness Analysis in Mexican Spanish. In: Proceedings of the
    Iberian Languages Evaluation Forum (IberLEF 2020) co-located with 36th Conference of
    the Spanish Society for Natural Language Processing (SEPLN 2020). pp. 265-272. CEUR
    Workshop Proceedings (2020).
15. Ramírez-de-la-Rosa G, Parida S, Kumar S, et al.: Idiap and UAM Participation at MEX-
    A3T Evaluation Campaign. In: Proceedings of the Iberian Languages Evaluation Forum
    (IberLEF 2020) co-located with 36th Conference of the Spanish Society for Natural Lan-
    guage Processing (SEPLN 2020). pp. 252-257. CEUR Workshop Proceedings (2020).
16. Peters M E, Neumann M, Iyyer M, et al.: Deep contextualized word representations. arXiv
    preprint arXiv:1802.05365 (2018).
17. Devlin J, Chang M W, Lee K, et al.: Bert: Pre-training of deep bidirectional transformers for
    language understanding. arXiv preprint arXiv:1810.04805 (2018).
18. Canete J, Chaperon G, Fuentes R, et al.: Spanish pre-trained bert model and evaluation data.
    PML4DC at ICLR (2020).
19. BETO: Spanish BERT, https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased,
    last accessed 2021/06/01
20. Sun C, Qiu X, Xu Y, et al.: How to Fine-Tune BERT for Text Classification?. arXiv e-prints
    arXiv: 1905.05583 (2019).
21. You Y, Gitman I, Ginsburg B.: Scaling sgd batch size to 32k for imagenet training. arXiv
    preprint arXiv:1708.03888 (2017).
22. Yosinski J, Clune J, Bengio Y, et al.: How transferable are features in deep neural networks?.
    arXiv preprint arXiv:1411.1792 (2014).
23. Howard J, Ruder S.: Universal language model fine-tuning for text classification. arXiv pre-
    print arXiv:1801.06146 (2018).
24. Spanish BERT pretrained model released by Skim AI Technologies,
    https://huggingface.co/skimai/spanberta-base-cased, last accessed 2021/06/01