<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>zk15120170770 at FakeDeS 2021: Fake news detection based on Pre-training Model</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Yunnan University</institution>
          ,
          <addr-line>Yunnan</addr-line>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rapid development of Internet technology, the transmission of information is no longer limited by distance. All kinds of information can be transmitted conveniently and quickly over the Internet, and people can view and send information from around the world with only a mobile network device. Much of this information is news, through which people learn what is happening in the world. However, a considerable amount of fake news is mixed in with genuine news, and it can interfere with our cognition and judgment. Distinguishing the authenticity of a news item therefore deserves more attention, and it is a challenging task. In this paper, we describe the method we used for the Fake News Detection in Spanish task at IberLEF 2021. We fine-tuned the XLM-Roberta pre-trained model on the Spanish dataset provided by the organizers and obtained good results. Our model reached an F1 score of 0.7053 on the Spanish task, ranking seventh.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News Detection</kwd>
        <kwd>IberLEF 2021</kwd>
        <kwd>Pre-training Model</kwd>
        <kwd>XLM-Roberta</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The rapid development of the Internet has promoted the use of social
media, which has become very popular. The miniaturization and
convenience of mobile terminals have led to rapid growth in the use of social
media in the past few years, and many people around the world now communicate
through it. The amount of information flowing through the Internet at any moment is huge and
incalculable. Besides social media as a bridge, language is an indispensable
part of communication, and equality, diversity, and inclusion (EDI) are very important
to people. Social media such as Twitter, YouTube, and Facebook are among
the most important channels for information dissemination and communication in the
world today. These platforms have a large number of users who hold various
exchanges and publish various kinds of information. Many news media
also use these platforms, or platforms they have built themselves, to publish news.
As a result, the Internet is flooded with news of all kinds. This
news is very complex and includes a great deal of fake news, which
interferes unnecessarily with our lives, so we need to
be able to judge whether a piece of news is authentic. The variety and
complexity of news make identifying the
authenticity of a news item a major challenge.</p>
      <p>
        To help identify the authenticity of a news item,
it is necessary to build an efficient and accurate system. IberLEF 2021[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
is committed to promoting the equality, diversity, and inclusiveness of language
technology, and provides a shared task on the authenticity of news items.
The task built a Spanish dataset from news items collected from
Mexican web sources. It is a classification task that classifies each news item as
"True" or "Fake".
      </p>
      <p>
        To solve this task[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we use XLM-Roberta, a pre-trained language model
based on the Transformer. Compared with other methods, this model has some
distinct advantages. The input needs only light pre-processing before it is
passed to the pre-trained model, which can then achieve good results on downstream
classification tasks that are difficult to reach with other methods. In addition, the
pre-trained model supports fine-tuning for specific tasks.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 describes the
datasets in detail. Section 3 describes the model and approach we used. We
describe our experiments and results in Section 4. Finally, Section 5 gives the
conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>Data description</title>
      <p>
        The dataset[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] used for this task was provided by the organizers of IberLEF
2021, who presented a news corpus in Spanish. The news items were collected
from January to July of 2018, and all of them were written in Mexican Spanish.
The corpus contains 971 news items collected from different sources.
      </p>
      <p>We present the statistics for the dataset in Table 1. Each news item
must be assigned to one of the following two categories:</p>
      <list list-type="bullet">
        <list-item>
          <p>True: A news article is true if there is evidence that it has been published
on reliable sites.</p>
        </list-item>
        <list-item>
          <p>Fake: A news article is fake if there is news from reliable sites, or from websites
specialized in the detection of deceptive content, that contradicts it, or if no other
evidence about the news was found besides the source.</p>
        </list-item>
      </list>
      <p>The statistics show that the training and development
sets are fairly balanced between fake and true news.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        This section describes the deep learning model and architecture[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that we use
to identify the authenticity of news text in this task.
      </p>
      <p>
        Text categorization[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been a focus of research for years as social media
has become more popular. In the past, SVM[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and LR classifiers[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
were used for sentiment analysis. In recent years, text classification has mainly
been implemented with bag-of-words (BOW[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]), recurrent neural networks (RNN[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]), and
word embeddings[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. RNN models have achieved good
results in sentiment analysis tasks. As research progressed, we found that
pre-trained models worked much better than these earlier methods.
      </p>
      <p>
        We chose the XLM-Roberta[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] model, which is based on the Transformer[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and fine-tuned it
on the Spanish corpus. XLM-Roberta is mainly composed of
bidirectional Transformer encoders and uses a dynamic masking mechanism that differs
from the static masking of BERT[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Its training recipe is adjusted from BERT's: it uses a larger batch size and
longer training sequences, and it drops BERT's next-sentence prediction
pre-training objective. XLM-Roberta delivers better downstream task performance than
BERT.
      </p>
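      <p>The dynamic masking idea can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the pre-training code of XLM-Roberta: the mask symbol, the 15% masking rate, and the helper name dynamic_mask are chosen here for illustration, and every selected token is simply replaced by the mask symbol. The point it shows is that a new masking pattern is drawn each time a sentence is seen, instead of one pattern being fixed in advance.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random

MASK_TOKEN = "<mask>"  # assumption: placeholder mask symbol for this sketch


def dynamic_mask(tokens, rng, mask_prob=0.15):
    """Return a freshly masked copy of tokens plus the masked positions.

    Simplification: every selected token is replaced by MASK_TOKEN; real
    pre-training recipes also keep or randomize some selected tokens.
    """
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK_TOKEN
    return masked, positions


tokens = "el gobierno anuncia nuevas medidas para la economia del pais".split()
rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# Static masking would compute one pattern and reuse it in every epoch.
# Dynamic masking draws a new pattern each time the sentence is seen.
epoch_masks = [dynamic_mask(tokens, rng) for _ in range(3)]
for epoch, (masked, positions) in enumerate(epoch_masks, start=1):
    print(epoch, positions, " ".join(masked))
```
]]></preformat>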
      <p>The model structure is shown in Figure 1. The XLM-Roberta model we use
has a hidden size of 768 and 12 Transformer encoder layers.
Because a bidirectional Transformer attends to all tokens at once rather than
reading them in sequence, a [CLS] token is added to the beginning of the input text to mark the position
whose output is used for classification, and a [SEP] token is used as a separator between
sentences or as a marker at the end of a sentence. After the forward pass through
the network, the output that XLM-Roberta produces for the [CLS] token is
treated as an aggregated representation of the entire text. It is passed as input
to a fully connected layer, and the Softmax activation function is applied
to obtain the class probabilities. In this way, the XLM-Roberta model predicts
whether a news item should be classified as "True" or "Fake".</p>
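      <p>The classification step on top of the [CLS] representation can be sketched as follows. This is a toy illustration in plain Python, not the model code: the 4-dimensional vector stands in for the 768-dimensional [CLS] output, the weights and bias are made-up numbers, and the label order ("True", "Fake") is an assumption.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias, labels=("True", "Fake")):
    """Apply one fully connected layer to the [CLS] vector, then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Toy 4-dimensional stand-in for the 768-dimensional [CLS] output.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [
    [0.1, 0.2, -0.3, 0.0],  # row producing the "True" logit
    [0.0, -0.1, 0.4, 0.5],  # row producing the "Fake" logit
]
bias = [0.0, 0.1]

label, probs = classify(cls_vector, weights, bias)
print(label, [round(p, 3) for p in probs])
```
]]></preformat>
      <p>In a fine-tuned model the weights of this layer are learned together with the encoder; here they are fixed only so that the sketch runs on its own.</p>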
    </sec>
    <sec id="sec-4">
      <title>Experiments and results</title>
      <p>In this section, we describe the methods used to preprocess the data and to train
our model on the processed text.
Since the dataset was collected directly from Mexican web sources, the original
data contained a variety of unnecessary features that would affect the
performance of model training, so we processed the text before feeding it into the
deep learning model. News items contain some information that we do not
need; this redundant information interferes with
detection, and removing it improves the performance of the classifier.
To clean the data, retaining the important information and deleting the
unimportant information, we performed the following pre-processing steps:</p>
      <list list-type="bullet">
        <list-item>
          <p>Translate emoji into a textual description of the corresponding emotion.</p>
        </list-item>
        <list-item>
          <p>Convert the text to lowercase.</p>
        </list-item>
        <list-item>
          <p>Remove words that have no emotional meaning.</p>
        </list-item>
        <list-item>
          <p>Remove all URLs.</p>
        </list-item>
        <list-item>
          <p>Remove excess spaces.</p>
        </list-item>
      </list>
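      <p>The pre-processing steps above can be sketched with the Python standard library. This is an illustration under assumptions rather than the pipeline we ran: the emoji table, the list of words to drop, and the URL pattern are hypothetical placeholders, and the steps are applied in the order in which they are listed.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical lookup tables, included only so the sketch runs end to end.
EMOJI_TO_TEXT = {"\U0001F600": " feliz ", "\U0001F621": " enojado "}
NON_EMOTIONAL_WORDS = {"el", "la", "de", "que", "y", "en"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess(text):
    """Apply the cleaning steps in the order they are listed in the text."""
    # 1. Replace each known emoji with a textual description.
    for emoji, description in EMOJI_TO_TEXT.items():
        text = text.replace(emoji, description)
    # 2. Convert to lowercase.
    text = text.lower()
    # 3. Drop the words listed as carrying no emotional meaning.
    text = " ".join(
        word for word in text.split() if word not in NON_EMOTIONAL_WORDS
    )
    # 4. Remove URLs.
    text = URL_PATTERN.sub(" ", text)
    # 5. Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


sample = "El Gobierno  anuncia medidas \U0001F600 en https://example.com/nota  HOY"
print(preprocess(sample))
```
]]></preformat>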
      <sec id="sec-4-1">
        <title>Experimental settings</title>
        <p>To implement the model, we used the Transformers library provided
by HuggingFace. HuggingFace Transformers is a Python library
that contains not only the pre-trained XLM-Roberta but also pre-trained
models for various other NLP tasks. As the implementation environment, we used
the PyTorch library, which supports GPU processing. The XLM-Roberta model
was run on an NVIDIA RTX 3080 graphics card with 24GB of video memory. We
used stratified 5-fold cross-validation on the training set with the random
seed set to 42; stratified sampling ensures that the proportion of
samples of each category stays the same in every fold. For
XLM-Roberta, we used the pre-trained model with 12 layers. We trained our
classifier using the Adam optimizer with a learning rate of 2e-5 and cross-entropy
loss. The dropout was set to 0.1, the number of epochs to 10, and the maximum
sequence length to 512. Inputs longer than 512 tokens are truncated:
the text beyond the limit is not read,
and the model proceeds with the truncated input. To save GPU memory, the batch size
was set to 8 and the number of gradient accumulation steps to 4, so that
gradients are accumulated over 4 batches before each parameter update,
giving an effective batch size of 32. We extract the hidden states of
XLM-Roberta by setting output_hidden_states to true. For
fine-tuning and sequence classification, we use the
RobertaForSequenceClassification module provided by the HuggingFace library.</p>
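        <p>Two of these settings, stratified 5-fold splitting and gradient accumulation, can be illustrated with a small standard-library sketch. It is not the training code: in practice a library routine such as scikit-learn's StratifiedKFold would normally do the splitting, and the helper names stratified_folds and update_steps, as well as the toy label counts, are assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Split sample indices into n_folds folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def update_steps(n_batches, accumulation_steps=4):
    """Return the 1-based batch numbers after which the optimizer steps."""
    return [b for b in range(1, n_batches + 1) if b % accumulation_steps == 0]


# Toy labels: 30 "True" and 20 "Fake" items (made-up counts, not the corpus).
labels = ["True"] * 30 + ["Fake"] * 20
folds = stratified_folds(labels)
for fold in folds:
    n_fake = sum(labels[i] == "Fake" for i in fold)
    print(len(fold), n_fake)

batch_size, accumulation_steps = 8, 4
print("effective batch size:", batch_size * accumulation_steps)
print("optimizer steps after batches:", update_steps(12, accumulation_steps))
```
]]></preformat>
        <p>Each fold serves once as the validation split while the remaining four folds are used for training. With a batch size of 8 and 4 accumulation steps, the parameters are updated once for every 32 samples.</p>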
      </sec>
      <sec id="sec-4-2">
        <title>Result</title>
        <p>Here we present the results of our submitted run. We
participated in the Spanish task provided by
IberLEF 2021, and the results were evaluated by the task organizers
using a weighted average F1-score as the evaluation metric. The results are
shown in Table 2. The results reported
by the organizers show that the competition among participating teams was
very intense. Our best performance in the Spanish task was an F1
score of 0.7053, which placed us 7th on the leaderboard.</p>
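        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>The results of our model on the official test set.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>User</th>
                <th>Language</th>
                <th>Rank</th>
                <th>F1-score</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>zk15120170770</td>
                <td>Spanish</td>
                <td>7th</td>
                <td>0.7053</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>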
        <p>As the results show, our method works quite well on the Spanish
news dataset. This may be because the XLM-Roberta model is pre-trained
on a multilingual dataset, which helps it perform well on fake news
detection in Spanish.</p>
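        <p>Averaged F1-scores are built from per-class F1 values. The following minimal sketch shows how the macro and the support-weighted averages are computed; the predictions are made up for illustration and are not the task results, and the organizers' evaluation script is not reproduced here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def f1_per_class(y_true, y_pred, label):
    """F1 of one class from its true/false positive and false negative counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def averaged_f1(y_true, y_pred):
    """Return (macro F1, support-weighted F1) over the classes in y_true."""
    classes = sorted(set(y_true))
    scores = {c: f1_per_class(y_true, y_pred, c) for c in classes}
    macro = sum(scores.values()) / len(classes)
    weighted = sum(scores[c] * y_true.count(c) for c in classes) / len(y_true)
    return macro, weighted


# Made-up predictions for illustration; these are not the task results.
y_true = ["Fake", "Fake", "Fake", "True", "True", "True", "True", "True"]
y_pred = ["Fake", "Fake", "True", "True", "True", "True", "True", "Fake"]
macro, weighted = averaged_f1(y_true, y_pred)
print(round(macro, 4), round(weighted, 4))
```
]]></preformat>
        <p>The two averages coincide only when the classes are equally frequent or have equal per-class scores, so it matters which one a leaderboard reports.</p>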
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper presents the general idea and specific approach of team zk15120170770
at IberLEF 2021. As users grow in diversity and number, online platforms must
support multiple languages. In the competition, we used the Transformer-based
pre-trained model XLM-Roberta for Fake News Detection in Spanish. Our
system was competitive, reaching an F1 score of 0.7053 and 7th place.
In the future, we hope to extend our system to more languages and increase the
number of tasks it can perform. We will also explore other pre-trained
models and carry out comparative analyses.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Goodfellow</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            <given-names>A</given-names>
          </string-name>
          , et al.
          <source>Deep learning[M]</source>
          . Cambridge: MIT press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Sebastiani F.
          <source>Machine learning in automated text categorization[J]. ACM computing surveys (CSUR)</source>
          ,
          <year>2002</year>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Platt</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Sequential minimal optimization: A fast algorithm for training support vector machines</article-title>
          [J].
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jianqiang</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiaolin</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Comparison research on text pre-processing methods on twitter sentiment analysis[J]</article-title>
          .
          <source>IEEE Access</source>
          ,
          <year>2017</year>
          ,
          <volume>5</volume>
          :
          <fpage>2870</fpage>
          -
          <lpage>2879</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wallach</surname>
            <given-names>H M</given-names>
          </string-name>
          .
          <article-title>Topic modeling: beyond bag-of-words[C]//</article-title>
          <source>Proceedings of the 23rd international conference on Machine learning</source>
          .
          <year>2006</year>
          :
          <fpage>977</fpage>
          -
          <lpage>984</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zaremba</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            <given-names>O</given-names>
          </string-name>
          .
          <article-title>Recurrent neural network regularization[J]</article-title>
          .
          <source>arXiv preprint arXiv:1409.2329</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Levy</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            <given-names>Y</given-names>
          </string-name>
          .
          <article-title>Dependency-based word embeddings[C]//</article-title>
          <source>Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          .
          <year>2014</year>
          :
          <fpage>302</fpage>
          -
          <lpage>308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Conneau</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khandelwal</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            <given-names>N</given-names>
          </string-name>
          , et al.
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          [J].
          <source>arXiv preprint arXiv:1911.02116</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Vaswani</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            <given-names>N</given-names>
          </string-name>
          , et al.
          <article-title>Attention is all you need[J]</article-title>
          .
          <source>arXiv preprint arXiv:1706.03762</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Devlin</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>M W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            <given-names>K</given-names>
          </string-name>
          , et al.
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          [J].
          <source>arXiv preprint arXiv:1810.04805</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gomez-Adorno</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Duran</surname>
            <given-names>J P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bel-Enguix</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porto</surname>
            <given-names>C</given-names>
          </string-name>
          .
          <article-title>Overview of FakeDeS task at IberLEF 2020: Fake news detection in Spanish[J]</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          ,
          <year>2021</year>
          ,
          <volume>67</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Posadas-Duran</surname>
            <given-names>J P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            <given-names>G</given-names>
          </string-name>
          , et al.
          <article-title>Detection of fake news in a new corpus for the Spanish language[J]</article-title>
          .
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          ,
          <year>2019</year>
          ,
          <volume>36</volume>
          (
          <issue>5</issue>
          ):
          <fpage>4869</fpage>
          -
          <lpage>4876</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Manuel Montes, Paolo Rosso, Julio Gonzalo, Ezra Aragon, Rodrigo Agerri, Miguel Angel Alvarez-Carmona, Elena Alvarez Mellado, Jorge Carrillo-de-Albornoz, Luis Chiruzzo, Larissa Freitas, Helena Gomez Adorno, Yoan Gutierrez, Salud Maria Jimenez Zafra, Salvador Lima, Flor Miriam Plaza-de-Arco and Mariona Taule (eds.):
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021)</source>
          , CEUR Workshop Proceedings,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>