zk15120170770 at FakeDeS 2021: Fake news
       detection based on Pre-training Model

 Kun Zhao[0000−0002−4615−3657] , Siyao Zhou[0000−0002−5583−098X] , and Weihua
                                      Li∗

                         Yunnan University, Yunnan, P.R. China
                    ∗ Corresponding author: 1129515873@qq.com


        Abstract. With the rapid development of Internet technology, the
        transmission of information is no longer limited by distance. All kinds
        of information can be transmitted conveniently and quickly over the
        Internet, and people can view and send information around the world
        with only a mobile network device. Much of this information is news,
        through which people learn what is happening in the world. However, a
        great deal of fake news is mixed in with it, and fake news interferes
        with our cognition and judgment. Distinguishing true news from fake
        news therefore deserves more attention, and it is a challenging task.
        In this paper, we describe the method we used for the Fake News
        Detection in Spanish task at IberLEF 2021. We fine-tuned the
        pre-trained XLM-Roberta model on the Spanish data set provided by the
        organizers. Our model reached an F1 score of 0.7053 on the Spanish
        task, ranking seventh.

        Keywords: Fake News Detection · IberLEF 2021 · Pre-training Model
        · XLM-Roberta.


1     Introduction
The rapid development of the Internet has promoted the application of social
media, and social media has gained a lot of popularity. The miniaturization and
convenience of mobile terminals have led to rapid growth in the use of social me-
dia in the past few years. Many people around the world communicate through
social media. The amount of information flowing across the Internet at any
moment is enormous. Besides social media, language is an indispensable part of
our communication, and equality, diversity, and inclusion (EDI) are very
important to people. Social media such as Twitter, YouTube, and Facebook are
among the most important channels for information dissemination and
communication in the world today. These platforms have a large number of users
who communicate and publish all kinds of information on them. Many news media
also use these platforms, or platforms of their own, to release news. As a
result, the Internet is flooded with news of all kinds, and much of it is
fake. Fake news interferes with our lives, so we need to be able to judge the
authenticity of news. The variety and complexity of news make it a major
challenge to identify whether a given news item is authentic.

    IberLEF 2021, September 2021, Málaga, Spain.
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

     To help identify the authenticity of a news item, it is necessary to
establish an efficient and accurate system. IberLEF 2021[13] is committed to
promoting the equality, diversity, and inclusiveness of language technology,
and provides a shared task on the authenticity of news items. The task
established a Spanish data set based on news items collected from Mexican web
sources. It is a classification task that labels each news item as "True" or
"Fake".
     To solve this task[11], we use XLM-Roberta, a pre-trained language model
based on the Transformer. Compared with other methods, this model has some
advantages. With only light pre-processing, the pre-trained model can achieve
good results on downstream classification tasks, which is difficult with other
methods. In addition, the pre-trained model supports fine-tuning for specific
tasks.
     The rest of the paper is organized as follows. Section 2 describes the
data set in detail. Section 3 describes the model we used. Section 4 presents
our experiments and results. Finally, Section 5 gives the conclusion.


2   Data description

The data set[12] used for this task was provided by the organizers of IberLEF
2021, who presented a news corpus in Spanish. The news items were collected
from January to July of 2018, and all of them were written in Mexican Spanish.
The corpus has 971 news items collected from different sources, which make up
the training and development sets in Table 1.


               LANGUAGE   DATASET   TOTAL   TRUE   FAKE
               Spanish    Train       676    338    338
               Spanish    Dev         295    153    142
               Spanish    Test        572      -      -
                         Table 1. Dataset statistics


   We present the statistics for the dataset in Table 1. Each news item must be
assigned to one of the following two categories:
 – True: A news article is true if there is evidence that it has been published
   on reliable sites.
 – Fake: A news article is fake if news from reliable sites, or from websites
   specialized in detecting deceptive content, contradicts it, or if no
   evidence about the news was found other than its source.

    Table 1 shows that true and fake news are well balanced in both the
training set (338 vs. 338) and the development set (153 vs. 142).
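
    The balance can be checked directly from the counts in Table 1. The
following Python sketch is only illustrative: the dictionary layout and the
helper name fake_share are assumptions made for this example and are not part
of the task's tooling.

```python
# Counts copied from Table 1 (the table gives no label counts for the test set).
TABLE_1 = {
    "train": {"true": 338, "fake": 338},
    "dev": {"true": 153, "fake": 142},
}


def fake_share(counts):
    """Return the fraction of items labelled fake in one split."""
    total = counts["true"] + counts["fake"]
    return counts["fake"] / total


for split, counts in TABLE_1.items():
    print(f"{split}: {fake_share(counts):.3f} fake")
```

A share close to 0.5 in both splits is what "balanced" means here.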


3   Methodology

This section describes the deep learning model and architecture[1] that we use
to identify the authenticity of news text in this task.
    Text categorization[2] has been a focus of research for years as social media
has become more popular. In the past, people used SVM[3] and LR classifiers[4]
for sentiment analysis. In recent years, text classification has mainly been
implemented with bag-of-words (BOW[5]), recurrent neural networks (RNN[6]), and
word embeddings[7]. Over these years of research, RNN models have achieved good
results in sentiment analysis tasks. As our research went on, we found that
pre-trained models worked much better than the previous methods.
    We chose the XLM-Roberta[8] model, which is based on the Transformer[9], and
fine-tuned it on the Spanish corpus. XLM-Roberta is mainly composed of
bidirectional Transformer layers and, unlike Bert[10], uses a dynamic masking
mechanism. It adapts Bert's training procedure by using a larger batch size and
longer training sequences and by removing Bert's next-sentence-prediction
pre-training objective. XLM-Roberta delivers better downstream task performance
than Bert.
    The model structure is shown in Figure 1. The hidden size of the
XLM-Roberta model is 768, and the model has 12 Transformer encoder layers.
Since a bidirectional Transformer does not retain time-series information, we
add the [CLS] token to the beginning of the input text to mark it for the
classification task, and the [SEP] token is used as a separator between
sentences or as a marker at the end of a sentence. After the forward pass, the
output that XLM-Roberta produces for the [CLS] token is treated as an
aggregated representation of the entire text. It is passed as input to a fully
connected layer, and the Softmax activation function is used for
classification. In this way, XLM-Roberta predicts whether a news item is
classified as "True" or "Fake".
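
    The classification head described above can be sketched without any deep
learning framework. The snippet below is a toy illustration, not the code used
for the submission: the 4-dimensional vector stands in for the 768-dimensional
[CLS] representation, and the weights, the bias, and the function names linear
and softmax are hypothetical.

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


LABELS = ["True", "Fake"]

# Toy stand-in for the [CLS] output vector (768-dimensional in the real model).
cls_vector = [0.2, -1.0, 0.5, 0.3]
# Hypothetical head parameters; in practice they are learned during fine-tuning.
weights = [[0.1, 0.4, -0.2, 0.0],
           [-0.3, -0.5, 0.6, 0.2]]
bias = [0.0, 0.1]

probs = softmax(linear(cls_vector, weights, bias))
prediction = LABELS[probs.index(max(probs))]
print(prediction, probs)
```

The predicted label is simply the class with the largest Softmax probability.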


4   Experiments and results

In this section, we describe the methods used to preprocess the data and to train
our model on processed text datasets.

                  Fig. 1. The structure of the XLM-Roberta model


4.1   Data preprocessing
Since the data set was captured directly from Mexican web sources, the original
data contained a variety of unnecessary features that would affect the
performance of model training, so we processed the text before feeding it into
the deep learning model. Some of the information in a news item is not needed
and interferes with detection; removing it improves the performance of the
classifier on the downstream classification task with the XLM-Roberta model.
To clean up the data while retaining the important information, we performed
the following steps:

 – Translate each emoji into a textual description of the corresponding emotion.
 – Convert the text to lowercase.
 – Remove words that have no emotional meaning.
 – Remove all URLs.
 – Remove excess spaces.
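
The steps above can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the pipeline used for the
submission: the emoji table, the word list standing in for "words that have no
emotional meaning", the URL pattern, and the function name preprocess are all
hypothetical.

```python
import re

# Hypothetical emoji-to-text table; a real system would need a fuller mapping.
EMOJI_TEXT = {
    "\U0001F600": "feliz",    # grinning face
    "\U0001F621": "enojado",  # pouting face
}
# Hypothetical list of words treated as carrying no useful meaning.
STOP_WORDS = {"de", "la", "el", "que", "y"}

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess(text):
    """Apply the cleaning steps listed above to one news item."""
    for emoji, description in EMOJI_TEXT.items():
        text = text.replace(emoji, f" {description} ")
    text = URL_PATTERN.sub(" ", text)
    text = text.lower()
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)  # joining on single spaces drops excess spaces


print(preprocess("La noticia   de HOY \U0001F600 https://example.com/nota"))
```

URLs are removed before lowercasing only so that the pattern sees the text as
written; the order of the other steps is a design choice of this sketch.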

4.2   Experimental settings
For the implementation of the model, we used the Transformers library provided
by HuggingFace. The HuggingFace Transformers package is a Python library
that not only contains the pre-trained XLM-Roberta but also provides pre-trained
models for various NLP tasks. As the implementation environment, we use
the PyTorch library, which supports GPU processing. The XLM-Roberta model
runs on an NVIDIA RTX 3080 graphics card with 24GB of video memory. We use
stratified 5-fold cross-validation on the training set with a random seed of
42; stratified sampling ensures that the proportion of samples of each
category stays the same in every fold. For XLM-Roberta, we use the pre-trained
model, which contains 12 layers. We trained our classifier using the Adam
optimizer with a learning rate of 2e-5 and cross-entropy loss. The dropout is
set to 0.1, and the number of epochs and the maximum sequence length are 10 and
512, respectively. If an input is longer than 512 tokens, it is truncated: the
model stops reading the remaining text and proceeds to the next step. To save
GPU memory, the batch size was set to 8 and the number of gradient accumulation
steps to 4, so that gradients are accumulated over 4 batches before the
parameters are updated. We extract the hidden states of XLM-Roberta by setting
output_hidden_states to True. For fine-tuning and sequence classification, we
use the RobertaForSequenceClassification module provided by the HuggingFace
library.
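
    To make the cross-validation and gradient accumulation settings concrete,
the following Python sketch builds stratified folds using only the standard
library. It is an assumption-laden illustration: the paper does not say how
the folds were implemented (scikit-learn's StratifiedKFold would be a common
choice), and the constant names and the function stratified_folds are
hypothetical.

```python
import random
from collections import defaultdict

SEED = 42                # random seed from the text
N_FOLDS = 5              # stratified 5-fold cross-validation
BATCH_SIZE = 8
ACCUMULATION_STEPS = 4
# Number of examples that contribute to each parameter update.
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * ACCUMULATION_STEPS


def stratified_folds(labels, n_folds=N_FOLDS, seed=SEED):
    """Split example indices into folds that keep the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Label layout mirroring the training set in Table 1: 338 true, 338 fake.
labels = ["True"] * 338 + ["Fake"] * 338
folds = stratified_folds(labels)
for fold_id, fold in enumerate(folds):
    n_fake = sum(labels[i] == "Fake" for i in fold)
    print(f"fold {fold_id}: {len(fold)} items, {n_fake} fake")
```

Each fold serves once as the held-out set while the other four are used for
training. With a batch size of 8 and 4 accumulation steps, every parameter
update is based on 32 examples.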


4.3   Result


In this section, we present the results of our submitted run. We participated
in the Spanish task provided by IberLEF 2021, and the results were evaluated by
the task organizers using a weighted average F1-score as the evaluation
metric. The results are shown in Table 2 below. We obtained a macro F1 score
of 0.7053, ranking 7th on the leaderboard. The results reported by the
organizers showed that the competition among participating teams was very
intense.


                      User         Language   Rank   F1-score
                 zk15120170770     Spanish    7th    0.7053
             Table 2. The results of our model on the official test set.
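
    The text mentions both a weighted average F1-score and a macro F1 score.
The Python sketch below shows how each is computed from per-class F1 scores.
It is not the organizers' evaluation script; the toy labels and the function
names are made up for illustration.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score of one class, from its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1_per_class(y_true, y_pred, label) for label in labels]
    return sum(scores) / len(scores)


def weighted_f1(y_true, y_pred, labels):
    """Mean of the per-class F1 scores, weighted by class frequency."""
    total = sum(f1_per_class(y_true, y_pred, label) * y_true.count(label)
                for label in labels)
    return total / len(y_true)


# Toy example (made-up labels, not task data).
y_true = ["Fake", "Fake", "True", "True"]
y_pred = ["Fake", "True", "True", "True"]
print(macro_f1(y_true, y_pred, ["True", "Fake"]))
print(weighted_f1(y_true, y_pred, ["True", "Fake"]))
```

When the classes are equally frequent, as in this toy example, the two
averages coincide; they differ only when the class sizes differ.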


    As can be seen from the results, our method works quite well on the Spanish
news data set. This may be because the XLM-Roberta model is pre-trained on
multilingual data, which helps it in fake news detection in Spanish.


5    Conclusion

This paper introduces the general idea and specific approach of team
zk15120170770 in IberLEF 2021. As users grow in diversity and number, online
platforms must support multiple languages. In the competition, we used the
Transformer-based pre-trained model XLM-Roberta to complete Fake News Detection
in Spanish. Our system achieved an F1 score of 0.7053 and ranked 7th on the
leaderboard.
In the future, we hope to extend our system to more languages and increase the
number of tasks it can perform. We will also explore the use of other pre-training
models and make comparative analyses.


References
1. Goodfellow I, Bengio Y, Courville A, et al. Deep learning[M]. Cambridge: MIT
   press, 2016.
2. Sebastiani F. Machine learning in automated text categorization[J]. ACM comput-
   ing surveys (CSUR), 2002, 34(1): 1-47.
3. Platt J. Sequential minimal optimization: A fast algorithm for training support
   vector machines[J]. 1998.
4. Jianqiang Z, Xiaolin G. Comparison research on text pre-processing methods on
   twitter sentiment analysis[J]. IEEE Access, 2017, 5: 2870-2879.
5. Wallach H M. Topic modeling: beyond bag-of-words[C]//Proceedings of the 23rd
   international conference on Machine learning. 2006: 977-984.
6. Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization[J].
   arXiv preprint arXiv:1409.2329, 2014.
7. Levy O, Goldberg Y. Dependency-based word embeddings[C]//Proceedings of the
   52nd Annual Meeting of the Association for Computational Linguistics (Volume 2:
   Short Papers). 2014: 302-308.
8. Conneau A, Khandelwal K, Goyal N, et al. Unsupervised cross-lingual representa-
   tion learning at scale[J]. arXiv preprint arXiv:1911.02116, 2019.
9. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. arXiv preprint
   arXiv:1706.03762, 2017.
10. Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional trans-
   formers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
11. Gómez-Adorno H, Posadas-Durán J P, Bel-Enguix G, Porto C. Overview of FakeDeS
   task at IberLEF 2021: Fake news detection in Spanish[J]. Procesamiento del
   Lenguaje Natural, 2021, 67.
12. Posadas-Durán J P, Gomez-Adorno H, Sidorov G, et al. Detection of fake news in
   a new corpus for the Spanish language[J]. Journal of Intelligent & Fuzzy Systems,
   2019, 36(5): 4869-4876.
13. Manuel Montes, Paolo Rosso, Julio Gonzalo, Ezra Aragón, Rodrigo Agerri, Miguel
   Ángel Álvarez-Carmona, Elena Álvarez Mellado, Jorge Carrillo-de-Albornoz, Luis
   Chiruzzo, Larissa Freitas, Helena Gómez Adorno, Yoan Gutiérrez, Salud María
   Jiménez Zafra, Salvador Lima, Flor Miriam Plaza-de-Arco and Mariona Taulé (eds.):
   Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), CEUR
   Workshop Proceedings, 2021.