<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CiTIUS at FakeDeS 2021: A Hybrid Strategy for Fake News Detection</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigación en Tecnoloxías da Información (CiTIUS), Universidade de Santiago de Compostela</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>This article describes several BERT-based supervised classification strategies submitted to the Fake News Detection in Spanish Shared Task, where the data sources are news items annotated as fake or real. In our experiments, the systems were trained exclusively with the official datasets provided by the organizers of the shared task, without making use of any other source of information. The best system turned out to be a hybrid strategy that combines sentence similarity with some linguistic heuristics.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake News</kwd>
        <kwd>Transformers</kwd>
        <kwd>BERT</kwd>
        <kwd>Sentence Similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The widespread use of social media and e-communication platforms has pushed
people to rely on them as their main source of information. Unfortunately, an
abnormal amount of fake news, rumours and disinformation has overflowed social
media, with the aim of drawing the attention of users and shaping their
opinions and judgments [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Fake news and all kinds of disinformation can have
dramatic effects on countries, businesses, and people at various levels, whether
political or economic [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Many approaches have been proposed to assess the authenticity of
news published on social media and e-communication platforms. Some of these
approaches rely on the users of the platforms. For example, Facebook urges its
users to report suspicious news or comments [14], and even makes use of
professionals to manually check the reported comments and news published on its
platform. Manual fact checking has also been used by many other
fact-checkers, journals and organizations to discover questionable news; however,
this manual method is a waste of human effort because of the huge amount of
news published every second on social media [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Accordingly, automating the
detection of fake news has caught the attention of researchers in academia and
industry, particularly after the incident of the American elections in 2016.
      </p>
      <p>
        The use of Natural Language Processing (NLP) techniques with machine
learning and deep learning methods for the detection of fake news can help to
stop or at least reduce misinformation. The use of both traditional machine
learning techniques and deep learning approaches for the detection of hostile
communication (a term that embraces both disinformation and hate speech)
has received much attention recently [
        <xref ref-type="bibr" rid="ref1 ref10 ref8">10, 1, 8</xref>
        ], although the most successful
systems for such a task are those using domain-specific fine-tuning of pre-trained
masked language models (i.e., Transformer architectures). For example, in the
shared subtask focused on COVID-19 Fake News Detection in English, the best
performances were achieved by Transformer-based systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        In this paper, we describe the experiments performed to participate in the Fake
News Detection in Spanish Shared Task (FakeDeS 2021) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In order to train and
develop the systems, the organizers of this task provide the Spanish Fake News
Corpus, which consists of news compiled from different web sources and covering
several topics, e.g. Economy, Science, Politics, Sport, Health, etc. [13]. The final
test corpus provided by the organizers for evaluation contains news on
COVID-19, a specific topic which was not included in the train and development corpora.
The task consists of deciding whether a piece of news is real or fake by considering both
the title and body text of the news.
      </p>
      <p>
        To address the Fake News Detection in Spanish
Shared Task, we compare several strategies that make use of BERT-based models
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. One of the proposed strategies is hybrid, as it computes semantic similarity
by combining BERT-based models with linguistic heuristics. This turned out to
be our best model on the test dataset, and the sixth best model in the shared
task, out of 21 participants.
      </p>
      <p>The remainder of the paper is organized as follows. In section 2, we describe
the proposed methods. Experimental results and discussion are addressed in
section 3, while section 4 reports the conclusions of the current study as well as
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>The Strategies</title>
      <p>
        As stated above, our aim is to compare several BERT-based strategies by
making use of the training data provided by the organizers of the Fake News
Detection in Spanish Shared Task, which is part of the Iberian Languages Evaluation
Forum (IberLEF 2021) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Bidirectional Encoder Representations from Transformers, known as BERT
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is a bidirectional transformer-based language model that learns information
from left to right and from right to left. Like any language model, it can be
used to extract high-quality language features from input text, but it can also
be fine-tuned on specific NLP tasks such as entity recognition, classification,
question answering, sentiment analysis, or claim verification in fact checking.
In the experiments described in the next section, we use BERT with three
different strategies:
      </p>
      <p>Fine-tune model: In the fine-tuning strategy, we add a dense layer on
top of the last layer of the pre-trained BERT model and then train the whole
model by making use of the task-specific dataset. The pre-trained BERT model is
thus fine-tuned to perform the binary classification task of fake news
detection.</p>
      <p>Sentence similarity: BERT is also able to extract both contextualized word
embeddings and sentence embeddings from text in order to compute semantic
similarity between complex expressions or sentences. For the particular task at
stake, we compare the target news of the test set with the news labeled as fake in
the training dataset and with those labeled as real. The comparison consists of computing
the sentence similarity between the target news and all labeled news of the training
data. The target news is classified as fake if the average of its sentence similarities
with fake news exceeds the average of its sentence similarities with real news.
It is classified as real if the opposite holds.</p>
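      <p>The decision rule described above can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure (the measure is not named in the text), uses toy two-dimensional vectors in place of real sentence embeddings, and assigns ties to the real class; the function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(target, references):
    """Average similarity between the target and every reference vector."""
    return sum(cosine(target, ref) for ref in references) / len(references)


def classify_by_similarity(target, fake_vectors, real_vectors):
    """Return (label, margin).

    margin is the mean similarity to fake news minus the mean similarity
    to real news; a positive margin yields the label "fake".
    """
    margin = mean_similarity(target, fake_vectors) - mean_similarity(target, real_vectors)
    label = "fake" if margin > 0 else "real"  # ties go to "real" (assumption)
    return label, margin


# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only).
fake_vectors = [[1.0, 0.0], [0.9, 0.1]]
real_vectors = [[0.0, 1.0], [0.1, 0.9]]
label, margin = classify_by_similarity([0.8, 0.2], fake_vectors, real_vectors)
```
]]></preformat>
      <p>Returning the margin together with the label keeps the information needed to tell confident decisions from near ties.</p>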
      <p>Hybrid strategy: This strategy uses the output of the previous method in order to
identify the borderline cases, that is, those news items that were classified as either
fake or real with a low value (the threshold on this value was set empirically). These borderline
cases were then reclassified by using linguistic cues, such as the size of the news, the
presence of sentences written in capital letters, or specific fake statements (e.g.,
5G...) in the body text of the news.</p>
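      <p>As a rough sketch of how such a two-stage decision could be wired together, the code below keeps confident labels and re-decides borderline ones from the three cues named above. Everything specific in it is an assumption made for illustration: the threshold value, the word-count limit, the direction of each cue, and the rule that a single cue is enough are not reported in this paper, and <monospace>margin</monospace> stands for the difference between the two average similarities.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# All constants are assumptions for illustration: the threshold was set
# empirically but its value is not reported, and the way the cues are
# combined is not specified.
MARGIN_THRESHOLD = 0.05   # hypothetical cut-off below which a decision is borderline
MIN_WORDS = 100           # hypothetical: shorter news counts as a fake-leaning cue
FAKE_MARKERS = ("5G",)    # the single example statement named in the text


def has_uppercase_sentence(text):
    """True if some sentence of three or more words is written fully in capitals."""
    for sentence in re.split(r"[.!?]+", text):
        if len(sentence.split()) >= 3 and sentence.isupper():
            return True
    return False


def count_cues(text):
    """Count how many of the three cues fire for this text."""
    is_short = MIN_WORDS > len(text.split())
    has_marker = any(marker in text for marker in FAKE_MARKERS)
    return int(is_short) + int(has_uppercase_sentence(text)) + int(has_marker)


def hybrid_label(text, label, margin):
    """Keep confident labels; re-decide borderline ones from the linguistic cues."""
    if abs(margin) >= MARGIN_THRESHOLD:
        return label
    return "fake" if count_cues(text) >= 1 else "real"


# A long body containing one fully capitalized sentence and a near-tie margin.
body = "EL GOBIERNO OCULTA TODO. " + "palabra " * 120
relabeled = hybrid_label(body, "real", 0.01)
```
]]></preformat>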
      <p>
        To compare these systems with traditional machine learning methods, we
also performed tests with a Naive Bayes classifier. For this purpose, we used
the system we implemented in previous work for the task of bot detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which relies on Linguakit [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for tokenization and extraction of n-grams.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Datasets</title>
        <p>The datasets provided by the organizers consist of three partitions: train,
development, and test. The train dataset consists of 676 labeled news items containing
264K words. The development dataset consists of 295 news items with 126K words,
while the test dataset consists of 572 news items with 190K words; the former was used
for preliminary experiments and configuration of the systems, and the latter
for final evaluation. The systems submitted to the shared task for final
evaluation were trained by merging the train and development datasets: 971
news items with 391K words.</p>
        <p>It should be noted that we did not make use of any external source of
knowledge or other annotated datasets to fine-tune the models.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Systems Configuration</title>
        <p>
          Both the fine-tuning and sentence similarity strategies were implemented with BETO
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], a pre-trained model for Spanish with 12 layers, by making use of the
Huggingface Transformers library [15]. Concerning fine-tuning, the training dataset was
divided into train (80%) and validation (20%) file partitions, while the
development dataset was used as the test file. For the official results, train and
development were merged to build a larger training file. Concerning the use of BERT
for sentence similarity, we built sentence embeddings with a pooling method: it
adds a pooling operation to the output of the Transformer to derive fixed-size
sentence embeddings. The specific pooling strategy we used is the mean of all
output vectors.
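        </p>
        <p>The mean pooling step just described can be sketched in a few lines of plain Python. The sketch assumes the token output vectors are given as lists of floats; the optional padding mask and the function name are hypothetical additions that are not taken from the description above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors into one fixed-size sentence vector.

    token_vectors: list of equal-length lists, one per token.
    attention_mask: optional list of 0/1 flags; tokens flagged 0
    (padding, an assumption of this sketch) are left out of the mean.
    """
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three hypothetical 4-dimensional token outputs; the last one is padding.
tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
sentence_vector = mean_pool(tokens, attention_mask=[1, 1, 0])
```
]]></preformat>
        <p>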
In this paper, several classification strategies have been compared for the fake
news detection task on Spanish news. More precisely, we compared recent
strategies relying on the use of Transformers, which provide deep semantic models
with contextualized word embeddings. The best result was obtained by combining BERT-based
sentence similarity with linguistic heuristics. The results were submitted to the Fake
News Detection in Spanish Shared Task. Experiments were performed without
considering external sources of knowledge or other annotated datasets.
        </p>
        <p>In future work, we will try to carry out an in-depth analysis of the results
obtained to establish what factors determine the significant difference between
the different strategies on the datasets evaluated. We will also look for other
sources of information and explore more linguistic information so as to improve
the hybrid strategy.</p>
        <p>13. Posadas-Duran, J.P., Gomez-Adorno, H., Sidorov, G., Escobar, J.J.M.: Detection
of fake news in a new corpus for the spanish language. Journal of Intelligent &amp;
Fuzzy Systems 36(5), 4869–4876 (2019)</p>
        <p>14. Umer, M., Imtiaz, Z., Ullah, S., Mehmood, A., Choi, G.S., On, B.W.: Fake news
stance detection using deep learning architecture (cnn-lstm). IEEE Access 8,
156695–156706 (2020)</p>
        <p>15. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac,
P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen,
P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M.,
Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing.
In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations. pp. 38–45. Association for Computational
Linguistics, Online (Oct 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6,
https://www.aclweb.org/anthology/2020.emnlp-demos.6</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Almatarneh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gamallo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pena</surname>
            ,
            <given-names>F.J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alexeev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Supervised classifiers to identify hate speech on english and spanish tweets</article-title>
          .
          <source>In: International Conference on Asian Digital Libraries</source>
          . pp.
          <volume>23</volume>
          –
          <fpage>30</fpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Apuke</surname>
            ,
            <given-names>O.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Omar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Fake news and covid-19: modelling the predictors of fake news sharing among social media users</article-title>
          .
          <source>Telematics and Informatics</source>
          p.
          <volume>101475</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cañete</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaperon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuentes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pérez</surname>
          </string-name>
          , J.:
          <article-title>Spanish pretrained bert model and evaluation data</article-title>
          .
          <source>In: PML4DC at ICLR</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          . pp.
          <volume>4171</volume>
          –
          <fpage>4186</fpage>
          . Association for Computational Linguistics, Minneapolis, Minnesota (
          <year>2019</year>
          ). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gamallo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piñeiro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez-Castaño</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pichel</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>LinguaKit: A Big Data-Based Multilingual Tool for Linguistic Analysis and Information Extraction</article-title>
          .
          <source>In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS)</source>
          . pp.
          <volume>239</volume>
          –
          <issue>244</issue>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/SNAMS.2018.8554689
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gamallo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almatarneh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Naive-bayesian classification for bot detection in twitter</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.E.</surname>
          </string-name>
          , Muller, H. (eds.) Working Notes of CLEF 2019 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , Lugano, Switzerland, September 9-
          <issue>12</issue>
          ,
          <year>2019</year>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2380</volume>
          .
          <string-name>
            <surname>CEUR-WS.org</surname>
          </string-name>
          (
          <year>2019</year>
          ), http://ceur-ws.org/Vol-2380/paper_194.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Giachanou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>The battle against online harmful information: The cases of fake news and hate speech</article-title>
          .
          <source>In: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</source>
          . pp.
          <volume>3503</volume>
          –
          <issue>3504</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gilda</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>Evaluating machine learning algorithms for fake news detection</article-title>
          .
          <source>In: 2017 IEEE 15th Student Conference on Research and Development (SCOReD)</source>
          . pp.
          <volume>110</volume>
          –
          <fpage>115</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Duran</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bel-Enguix</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of fakedes task at iberlef 2020: Fake news detection in spanish</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <issue>0</issue>
          ) (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asthana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Upadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Upreti</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akbar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Fake news detection using deep learning models: A novel approach</article-title>
          .
          <source>Transactions on Emerging Telecommunications Technologies</source>
          <volume>31</volume>
          (
          <issue>2</issue>
          ),
          <year>e3767</year>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aragón</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agerri</surname>
          </string-name>
          , R., Ángel Álvarez Carmona, M.,
          <string-name>
            <surname>Álvarez Mellado</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>de Albornoz</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adorno</surname>
            ,
            <given-names>H.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutiérrez</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zafra</surname>
            ,
            <given-names>S.M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lima</surname>
          </string-name>
          , S.,
          <string-name>
            <surname>de Arco</surname>
            ,
            <given-names>F.M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taulé</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Proceedings of the iberian languages evaluation forum (iberlef 2021)</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Patwa</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhardwaj</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guptha</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumari</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , PYKL,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Akhtar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.S.</given-names>
            ,
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          :
          <article-title>Overview of constraint 2021 shared tasks: Detecting english covid-19 fake news and hindi hostile posts</article-title>
          . In: Chakraborty,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.R.</surname>
          </string-name>
          , Liu,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Akhtar</surname>
          </string-name>
          , M.S. (eds.)
          <article-title>Combating Online Hostile Posts in Regional Languages during Emergency Situation</article-title>
          . pp.
          <volume>42</volume>
          –
          <fpage>53</fpage>
          . Springer International Publishing,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>