<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ForceNLP at FakeDeS 2021: Analysis of Text Features Applied to Fake News Detection in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Universidad Nacional Autonoma de Mexico</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mexico luiso</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>@comunidad.unam.mx</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Autonoma de Yucatan</institution>
          ,
          <addr-line>Merida, Yucatan</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our approach to the task "Fake News Detection", which aims to decide whether a news item is fake or real by analyzing its textual representation. The corpus consists of news compiled mainly from Mexican web sources: established newspaper websites, media companies' websites, special websites dedicated to validating fake news, and websites designated by different journalists as sites that regularly publish fake news. Our approach is based mostly on different types of n-grams. For the task we use the following classifiers: Logistic Regression, Support Vector Machines, and Multinomial Naive Bayes. Our approach achieved an average F1-score with respect to the other teams in the competition.</p>
      </abstract>
      <kwd-group>
        <kwd>Fake news</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Text features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A new era of information dissemination is here, and news of all kinds now
travels at a vertiginous speed. The use of social networks has encouraged people
not only to be more informed but also to spread misinformation about the reality of the
world. Facebook accounts for 50% of the total traffic to fake news sites and 20% of the
total traffic to reputable websites (9). The impact of this kind of fake news is
difficult to measure; some of the possibly affected areas are economics, politics,
security, and health, among others. For this reason, detecting this kind of untrue
statement is essential in most automatic systems, in order to
preserve the veracity of the facts on which people base their
decisions.</p>
      <p>Besides, in most cases fake information turns out to be more striking, and
when users see this kind of news they feel a duty to share it because the
information seems very important and worth passing on, which makes it
spread fast and, in some cases, go viral. If we can contribute
in some way to stopping this kind of behavior from the beginning, everyone will
benefit. An example of the damage that false information can cause concerns
the supposed effects of vaccines in general, which could influence people,
for instance, not to take the COVID-19 vaccine. As we all know, the pandemic
has paralyzed the world and even now is causing great pain, affecting
daily life all over the world.</p>
      <p>The system developed for filtering fake news is based on annotated corpora.
The organizers provided us with a set of previously reviewed truthful and fraudulent
news (8). The testing corpus contained information associated with COVID-19,
whereas the corpus used in the 2019 edition, which covers other topics, was given
as the training set. The purpose of the 2021 edition of the task (4) is to measure
the quality of the methods when the corpora cover different topics across
the competition phases, which makes the competition more challenging.
FakeDeS is a task presented at IberLEF 2021 (6) (Iberian
Languages Evaluation Forum).</p>
      <p>The rest of the paper is organized as follows: Section 2 presents some related
work on fake news detection; Section 3 describes the corpora used throughout
the competition; Section 4 presents our methodology, including
some preliminary results obtained with the corpus available in the
development phase, which guided the improvement of each approach; Section 5
reports the final results of all systems on the evaluation corpus; and Section 6
closes the paper with some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Shu et al. (11) give a formal definition of fake news: fake news is
a news article that is intentionally and verifiably false. In their study, they
focused on two main groups of features that best characterize
fake information. The first concerns traditional news media, where they claim
that detection relies mainly on news content; the second concerns social media, where
auxiliary social context information can additionally be used to
help detect fake news.</p>
      <p>Perez-Rosas et al. (9) present two fake news datasets: the former was built
via crowdsourcing from information covering different domains, and the latter was
gathered from the Web. The authors developed classification models with a linear
SVM classifier and five-fold cross-validation. They combined a series of features,
such as lexical, syntactic, and semantic information, including some properties that
represent text readability.</p>
      <p>Additionally, Reis et al. (10) present a large study of the most important
features to consider in fake news classification, which they group into the following
categories: a) textual features (syntactic, lexical, psycholinguistic, semantic, and
subjectivity), b) news source features (bias, credibility and trustworthiness, and
domain location), and c) environment features (engagement and temporal
patterns). They found that these features, combined
with existing classifiers such as k-Nearest Neighbors, Naive Bayes, Random Forests,
Support Vector Machines with an RBF kernel, and XGBoost, have a useful degree
of discriminative power for detecting fake news.</p>
      <p>Karimi et al. (5) studied the degree of fakeness of
news. They proposed a coherent and interpretable framework that involves
automated feature extraction, multi-source fusion, and fakeness discrimination,
showing that their model can effectively distinguish different degrees of
fakeness of news.</p>
    </sec>
    <sec id="sec-3">
      <title>Corpus</title>
      <p>The training corpus consists of news compiled mainly from a variety of Mexican
web sources and covers the following 9 topics: Science, Sport, Economy,
Education, Entertainment, Politics, Health, Security, and Society. The data were
gathered from January to July 2018. The main sources used to collect the
information were established newspaper websites, media companies' websites,
special websites dedicated to validating fake news, and websites designated by
different journalists as sites that regularly publish fake news. The corpus has 971
news items, of which 480 were labeled as Fake and the remaining as True. All the
news items went through a manual labeling process:
- A news article is true if there is evidence that it has been published on
reliable sites.
- A news article is fake if there is news from reliable sites, or from websites
specialized in the detection of deceptive content, that contradicts it, or if no other
evidence about the news was found besides the source.</p>
      <p>The organizers collected the true-fake news pair for each event, so the
news items in the corpus are correlated.</p>
      <p>The corpus distributed during the development phase contained the following
information:
- Topic: Science/ Sport/ Economy/ Education/ Entertainment/ Politics/ Health/ Security/ Society
- Category: Fake/ True
- Source: The name of the source media.
- Headline: The title of the news.
- Text: The complete text of the news.
- Link: The URL where the news was published.</p>
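      <p>As an illustration of the record layout listed above, the following minimal sketch reads such records and counts the labels. It assumes a comma-separated file with a header row, which the task description does not state; the two sample rows, the source name, and the example.org links are invented placeholders rather than corpus entries.</p>
      <preformat>
```python
import csv
import io
from collections import Counter

# Column names follow the corpus description; the rows are invented
# placeholders (assumption), not real corpus entries.
SAMPLE_CSV = """Topic,Category,Source,Headline,Text,Link
Science,True,example-source,Example headline A,Example body text A,http://example.org/a
Health,Fake,example-source,Example headline B,Example body text B,http://example.org/b
"""


def load_news(handle):
    """Read news records into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def category_counts(records):
    """Count how many records carry each Category label."""
    return Counter(row["Category"] for row in records)


records = load_news(io.StringIO(SAMPLE_CSV))
counts = category_counts(records)
print(counts)
```
      </preformat>
      <p>On the real training corpus, the same count would be expected to report 480 Fake items out of 971, as stated above.</p>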
      <p>For the evaluation of the systems, the organizers provided a new testing corpus
containing 572 items, consisting of news related to COVID-19 and news from other
Ibero-American countries. Because of this variation in the testing corpus, the
system must cope with thematic and language variation. Besides, the
test data only includes the Id, Headline, and Text columns.</p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>This section presents the process we employed to prepare the texts for
classification. When we deal with textual information with the aim of discovering
knowledge, we face the problem of its lack of structure. This absence is only
apparent: the text itself has a kind of structure, but one that is very complex and
hard to work with computationally. The operations applied in the pre-processing
stage determine the kind of patterns that can be discovered in the collection.
Before the feature extraction, we performed the pre-processing steps described
in Section 4.1 to improve the n-gram representation.</p>
      <p>Additionally, there are several methods for increasing the number of features of
the system, in order to feed the classifier and give it more elements to discriminate
the data.</p>
      <p>4.1 Pre-processing
- All texts were converted to lowercase, so that the same word is not counted
as different tokens.
- Stopwords were removed.
- We deleted the numbers that appear in the text.
- We deleted punctuation, since it does not add any additional information
when processing text data.
- Sequences of several blank spaces, tabs, and line breaks were normalized
to a single blank space.</p>
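      <p>The pre-processing steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the small stopword set and the list of extra Spanish punctuation marks are hypothetical stand-ins, and the order of the operations (punctuation is stripped before stopwords are filtered, so that a token such as "la," still matches) is a choice made for this sketch, not something the description fixes.</p>
      <preformat>
```python
import re
import string

# Tiny illustrative stopword set (assumption); a full Spanish list such as
# the one shipped with NLTK would be used in practice.
STOPWORDS = {"el", "la", "los", "las", "de", "en", "y", "que", "un", "una"}

# ASCII punctuation plus Spanish opening marks and common quotes (assumption).
PUNCTUATION = string.punctuation + "¿¡«»“”"


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, drop digits, punctuation and stopwords, squeeze whitespace."""
    text = text.lower()
    # Replace digit runs with a space so neighbouring words stay separated.
    text = re.sub(r"\d+", " ", text)
    # Map every punctuation character to a space.
    text = text.translate(str.maketrans({ch: " " for ch in PUNCTUATION}))
    # split() with no arguments collapses spaces, tabs and line breaks.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = preprocess("El  gobierno anunció 3 medidas,\n\ten 2021!")
print(cleaned)
```
      </preformat>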
      <p>Due to the differences between the development and testing corpora, we decided
to apply the pre-processing only to the main text of the news.</p>
      <p>4.2 Features
We took into account several n-gram features for the representation of texts:
- Character.
- Word.
- POS tags. These are sequences of contiguous part-of-speech (POS) tags, which
capture syntactic information. For this feature we used the
spaCy tagger.
- Skipgrams. We capture groups of 2 words with skips of 1 to 3 words.
- Function words. The frequency of these words is one of the best
characteristics for detecting hate speech and aggressiveness (1), so in this case we want
to see whether it can help us discriminate fake news. We built function word
n-grams of 2 to 4 tokens using the Spanish stopword list from NLTK (3).
- Punctuation symbols. With this feature we want to capture the coherence
and cohesion of the written text. Prior to the corpus pre-processing, we built
n-grams of 2 to 4 punctuation symbols.</p>
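      <p>To make the less common features in the list above concrete, the sketch below builds skip-grams, function word n-grams, and punctuation n-grams with the standard library only. It rests on one possible reading of the description: a skip-gram is a pair of words separated by 1 to 3 skipped words, and function word or punctuation n-grams are formed over the sequence that remains after all other tokens are dropped. The function names, the small function word set, and the punctuation set are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def skip_bigrams(tokens, max_skip=3):
    """Pairs of words separated by 1 to max_skip skipped words."""
    pairs = []
    for i, left in enumerate(tokens):
        for skip in range(1, max_skip + 1):
            j = i + skip + 1
            if j >= len(tokens):
                break
            pairs.append((left, tokens[j]))
    return pairs


def filtered_ngrams(items, keep, n_values=(2, 3, 4)):
    """Count n-grams over only those items accepted by `keep`, in order."""
    kept = [item for item in items if keep(item)]
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(kept, n))
    return counts


# Illustrative subsets (assumption), not the full NLTK stopword list.
FUNCTION_WORDS = {"el", "la", "de", "en", "que", "y"}
PUNCT = set(".,;:!?¿¡\"'()")

tokens = "el presidente de la república dijo que no".split()
print(skip_bigrams(tokens)[:3])

func_counts = filtered_ngrams(tokens, FUNCTION_WORDS.__contains__)
# Punctuation n-grams work on raw characters, before any pre-processing.
punct_counts = filtered_ngrams(list("Hola, ¿qué tal? Bien."), PUNCT.__contains__)
print(func_counts)
print(punct_counts)
```
      </preformat>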
      <p>We use two variations of the features, as shown in Table 1. The columns associated
with the approaches list the n-gram lengths that were applied for each
feature when the tested classifiers were executed, meaning that approach 1
contained 17 features and approach 2 contained 15. We selected these feature
combinations based on the performance they showed during the competition phases.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Features (n-grams)</th><th>Approach 1</th><th>Approach 2</th></tr>
          </thead>
          <tbody>
            <tr><td>Characters</td><td>[3,4,5]</td><td>[3,4,5]</td></tr>
            <tr><td>Words</td><td>[2,3,4]</td><td>[2,3]</td></tr>
            <tr><td>Skipgrams</td><td>[2,3]</td><td>[1,2]</td></tr>
            <tr><td>POS tags</td><td>[2,3,4]</td><td>[2,3]</td></tr>
            <tr><td>Function words (stop words)</td><td>[2,3,4]</td><td>[2,3,4]</td></tr>
            <tr><td>Punctuation symbols</td><td>[2,3,4]</td><td>[2,3,4]</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We used three different well-known classifiers, all of them described in (2; 7).
The selected models are Multinomial Naive Bayes (MNB), Logistic Regression
(LR), and Support Vector Machines (SVM). We also selected CountVectorizer
with a threshold of 3. We tested all the models during the competition phases
in conjunction with the approaches described in Table 1.</p>
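      <p>The counting step can be illustrated with a small standard-library stand-in for a count vectorizer. The sketch assumes that the "threshold of 3" refers to a minimum document frequency, which would correspond to the min_df parameter of scikit-learn's CountVectorizer; the text does not confirm this reading. The helper names and the toy documents are hypothetical, and the default n-gram lengths are those of approach 1.</p>
      <preformat>
```python
from collections import Counter


def char_ngrams(text, n_values=(3, 4, 5)):
    """Character n-grams for each length in n_values."""
    return [text[i:i + n] for n in n_values for i in range(len(text) - n + 1)]


def word_ngrams(text, n_values=(2, 3, 4)):
    """Word n-grams for each length in n_values, joined by spaces."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in n_values
            for i in range(len(words) - n + 1)]


def build_vocabulary(docs, extract, min_df=3):
    """Keep features that occur in at least min_df documents (assumed threshold)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(extract(doc)))
    kept = sorted(feat for feat, count in doc_freq.items() if count >= min_df)
    return {feat: idx for idx, feat in enumerate(kept)}


def vectorize(doc, extract, vocabulary):
    """Count vector of a document over a fixed vocabulary."""
    counts = Counter(feat for feat in extract(doc) if feat in vocabulary)
    return [counts[feat] for feat in vocabulary]


# Toy documents (assumption), used only to show the thresholding effect.
docs = [
    "noticia falsa viral",
    "noticia falsa nueva",
    "noticia falsa hoy",
    "otra cosa distinta",
]
vocab = build_vocabulary(docs, word_ngrams, min_df=3)
vector = vectorize("noticia falsa noticia falsa", word_ngrams, vocab)
print(vocab, vector)
```
      </preformat>
      <p>Count vectors of this kind are non-negative, which is what Multinomial Naive Bayes requires; the same vectors can also be passed to Logistic Regression or an SVM.</p>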
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>During the development phase, our best result was obtained with the Multinomial
Naive Bayes classifier, which reached an F1-score of 0.7576 and placed us in
position 7 of this phase. The feature approach used in this case was number
1. The models using Logistic Regression and Support Vector Machines, applied
to both approaches, did not surpass the results obtained with the Multinomial
Naive Bayes classifier.</p>
      <p>In contrast, during the evaluation phase we obtained different results: our
best model turned out to be Logistic Regression with feature approach
1, and the worst was Multinomial Naive Bayes, as seen in Table 2. The F1-score
of our best model ranked us 8th in the competition. We were not able to
try more combinations of models and approaches because of the limit on the number
of submissions during this phase of the competition.</p>
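      <p>For reference, the F1-score used to compare the systems is the harmonic mean of precision and recall. The sketch below computes it for one class; treating Fake as the positive class is an assumption of this example, and the label lists are made up for illustration, not taken from any submission.</p>
      <preformat>
```python
def f1_score(y_true, y_pred, positive="Fake"):
    """F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels (assumption) to exercise the function.
gold = ["Fake", "Fake", "True", "True", "Fake"]
predicted = ["Fake", "True", "True", "Fake", "Fake"]
score = f1_score(gold, predicted)
print(round(score, 4))
```
      </preformat>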
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Fake news detection is still an open challenge: the wide range
of information that can be altered to produce false statements increases the
difficulty of finding a straightforward solution. Hence the importance of this type of
task, which lets us see, understand, and improve the methodologies developed all
over the world. According to the official results, our approach
stayed below the baseline of an SVM with character trigram features, with
an F1-score of 0.7062. We consider that the complexity added to our model by
combining different features was not worth it, since the gain in this case is very low.
We also believe that the additional columns available in the
development corpus, together with the fact that it contained only news in Mexican
Spanish, provided useful discriminative information during
classifier training.
Unfortunately, we did not have the same elements in the evaluation phase, and we
believe this had an impact on the overall results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Argota</given-names>
            <surname>Vega</surname>
          </string-name>
          ,
          <string-name>
            <surname>L.E.</surname>
          </string-name>
          , Reyes-Magaña,
          <string-name>
            <given-names>J.C.</given-names>
            ,
            <surname>Gomez-Adorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Bel-Enguix</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          :
          <article-title>MineriaUNAM at SemEval-2019 Task 5: Detecting hate speech in Twitter using multiple features in a combinatorial framework</article-title>
          .
          <source>In: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          . pp.
          <volume>447</volume>
          -
          <issue>452</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Géron</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems</article-title>
          .
          <source>O'Reilly</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>In: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions</source>
          . pp.
          <volume>69</volume>
          -
          <issue>72</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Posadas-Duran</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bel-Enguix</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Porto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of FakeDeS task at IberLEF 2021: Fake news detection in Spanish</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <issue>0</issue>
          ) (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Karimi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saba-Sadiya</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Multi-source multi-class fake news detection</article-title>
          .
          <source>In: Proceedings of the 27th international conference on computational linguistics</source>
          . pp.
          <volume>1546</volume>
          -
          <issue>1557</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aragon</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agerri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Alvarez</given-names>
            <surname>Carmona</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alvarez</given-names>
            <surname>Mellado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez Zafra</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lima</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plaza-de Arco</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taule</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Ceur workshop proceedings,
          <source>2021. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2021</year>
          ) (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Muller,
          <string-name>
            <given-names>A.C.</given-names>
            ,
            <surname>Guido</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          :
          <article-title>Introduction to machine learning with Python: a guide for data scientists</article-title>
          .
          <source>O'Reilly Media</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Posadas-Duran</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escobar</surname>
            ,
            <given-names>J.J.M.:</given-names>
          </string-name>
          <article-title>Detection of fake news in a new corpus for the Spanish language</article-title>
          .
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          <volume>36</volume>
          (
          <issue>5</issue>
          ),
          <volume>4869</volume>
          -
          <fpage>4876</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Perez-Rosas</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lefevre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Automatic detection of fake news (</article-title>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Reis</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Correia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murai</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veloso</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benevenuto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Supervised learning for fake news detection</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          <volume>34</volume>
          (
          <issue>2</issue>
          ),
          <volume>76</volume>
          -
          <fpage>81</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Shu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Liu, H.:
          <article-title>Fake news detection on social media: A data mining perspective</article-title>
          .
          <source>ACM SIGKDD explorations newsletter 19(1)</source>
          ,
          <volume>22</volume>
          -
          <fpage>36</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>