ForceNLP at FakeDeS 2021: Analysis of Text
    Features Applied to Fake News Detection in
                      Spanish

         Jorge Reyes-Magaña1,2[0000-0002-8296-1344] and Luis Enrique Argota
                             Vega1[0000-0003-2988-9563]
     1 Posgrado en Ciencia e Ingeniería de la Computación, Universidad Nacional
                             Autónoma de México, México
                              luiso91@comunidad.unam.mx
             2 Universidad Autónoma de Yucatán, Mérida, Yucatán, México
                             jorge.reyes@correo.uady.mx


          Abstract. This paper presents our approach to the task “Fake News
          Detection”, which aims to decide whether a news item is fake or real
          by analyzing its textual representation. The corpus consists of news
          compiled mainly from Mexican web sources: established newspaper
          websites, media company websites, websites dedicated to validating
          fake news, and websites designated by different journalists as sites
          that regularly publish fake news. Our approach is based mostly on
          different types of n-grams. For the task we use three classifiers:
          Logistic Regression, Support Vector Machines, and Multinomial Naive
          Bayes. Our approach achieved a mid-range F1-score relative to the
          other teams in the competition.

          Keywords: Fake news · Machine learning · Text features


1     Introduction
A new era of information dissemination is here, and news of all kinds now
spreads at a dizzying speed. Social networks have encouraged people not only
to be better informed but also, in many cases, to be misinformed about the
reality of the world. Facebook accounts for 50% of the total traffic to fake
news sites and 20% of the total traffic to reputable websites (9). The impact
of fake news is difficult to measure; possible affected areas include
economics, politics, security, and health, among others. For this reason,
detecting untrue statements becomes essential in most automatic systems, in
order to preserve the veracity of the information on which people base their
decisions.
    IberLEF 2021, September 2021, Málaga, Spain.
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
    Besides, in most cases fake information is more striking, and when users
see this kind of news they feel a duty to share it because the information
seems very important and worth passing on. This causes it to spread quickly
and, in some cases, to go viral. If we can contribute in some way to stopping
this behavior from the beginning, everyone will benefit. An example of the
damage that false information can cause concerns the supposed effects of
vaccines in general, which could influence people, for instance, not to take
the COVID-19 vaccine. As we all know, the pandemic has paralyzed the world
and is still causing great pain, affecting daily life everywhere.
    The system developed for filtering fake news is based on annotated
corpora. The organizers provided a set of previously reviewed truthful and
fraudulent news (8). The testing corpus contained information associated with
COVID-19, whereas the corpus used in the 2019 edition, which covers other
topics, was given as the training set. The purpose of the 2021 edition of the
task (4) is to measure the quality of the methods when the corpora of the
different competition phases cover different topics, which makes the
competition more challenging. FakeDeS is a task presented at IberLEF 2021 (6)
(Iberian Languages Evaluation Forum).
    The rest of the paper is organized as follows. Section 2 presents related
work on fake news detection. Section 3 describes the corpora used throughout
the competition. Section 4 presents our methodology. Section 5 reports the
preliminary results obtained with the corpus available in the development
phase, which guided the improvement of each approach, together with the final
results of all systems on the evaluation corpus. Section 6 ends the paper
with some conclusions.


2   Related work

Shu et al. (11) give a formal definition of fake news: fake news is a news
article that is intentionally and verifiably false. In their study, they
focus on two main groups of features that characterize fake information. For
traditional news media, they claim that detection relies mainly on news
content, while in social media, extra social context can be used as auxiliary
information to help detect fake news.
    Pérez-Rosas et al. (9) present two fake news datasets: the first was
built via crowdsourcing and covers different domains, and the second was
gathered from the Web. The authors developed classification models with a
linear SVM classifier and five-fold cross-validation. They combined several
kinds of features, including lexical, syntactic, and semantic information, as
well as properties that represent text readability.
    Additionally, Reis et al. (10) present a large study of the most
important features to consider in fake news classification. They group them
into a) textual features (syntax, lexical, psycholinguistic, semantic, and
subjectivity), b) news source features (bias, credibility and
trustworthiness, and domain location), and c) environment features
(engagement and temporal patterns). They found that these features, combined
with existing classifiers such as k-Nearest Neighbors, Naive Bayes, Random
Forests, Support Vector Machine with RBF kernel, and XGBoost, have a useful
degree of discriminative power for detecting fake news.
    Besides, Karimi et al. (5) studied the degree of fakeness of news. They
proposed a coherent and interpretable framework that involves automated
feature extraction, multi-source fusion, and fakeness discrimination, showing
that their model can effectively distinguish different degrees of fakeness.

3   Corpus
The training corpus consists of news compiled mainly from a variety of
Mexican web sources and covers the following 9 topics: Science, Sport,
Economy, Education, Entertainment, Politics, Health, Security, and Society.
The data were gathered from January to July of 2018. The main sources were
established newspaper websites, media company websites, websites dedicated to
validating fake news, and websites designated by different journalists as
sites that regularly publish fake news. The corpus has 971 news items, of
which 480 were labeled as Fake and the remaining 491 as True. All the news
items were labeled manually according to the following criteria:
 – A news article is true if there is evidence that it has been published on
   reliable sites.
 – A news article is fake if news from reliable sites, or from websites
   specialized in detecting deceptive content, contradicts it, or if no other
   evidence about the news was found besides the source.
The organizers collected the true and fake news as pairs about the same
event, so the news items in the corpus are correlated.
    The corpus distributed during the development phase contained the
following fields:
 – Topic: Science/ Sport/ Economy/ Education/ Entertainment/ Politics/ Health/
   Security/ Society
 – Category: Fake/ True
 – Source: The name of the source media.
 – Headline: The title of the news.
 – Text: The complete text of the news.
 – Link: The URL where the news was published.
    For the evaluation of the systems, the organizers provided a new testing
corpus of 572 items, consisting of news related to COVID-19 and news from
other Ibero-American countries. Because of this variation, a system must be
robust to changes in topic and in language variety. In addition, the test
data only include the Id, Headline, and Text columns.
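
The difference between the two corpora can be made explicit with a small
sketch. The field names are the ones listed above; the set-based comparison
and the helper names are only illustrative assumptions, not part of the
organizers' tooling.

```python
# Fields of each corpus as listed in this section.
DEV_COLUMNS = {"Topic", "Category", "Source", "Headline", "Text", "Link"}
TEST_COLUMNS = {"Id", "Headline", "Text"}


def shared_columns(dev, test):
    """Return the columns available in both corpora, sorted by name."""
    return sorted(dev & test)


def dev_only_columns(dev, test):
    """Return the columns that exist only in the development corpus."""
    return sorted(dev - test)


print(shared_columns(DEV_COLUMNS, TEST_COLUMNS))    # ['Headline', 'Text']
print(dev_only_columns(DEV_COLUMNS, TEST_COLUMNS))  # ['Category', 'Link', 'Source', 'Topic']
```

Only the headline and the text are available in both phases, so any feature
built from Topic, Source, or Link cannot be used at evaluation time.
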
4     Methodology

This section presents the process we used to prepare the texts for
classification. When we work with text in order to discover knowledge, we
face an apparent lack of structure. The lack is only apparent: text does have
structure, but it is complex and hard to handle computationally. The
operations applied during pre-processing determine the kinds of patterns that
can later be discovered in the collection. Before feature extraction, we
performed the pre-processing steps described in Section 4.1 to improve the
n-gram representation.
     Additionally, several kinds of features can be added to the system so
that the classifier has more elements with which to discriminate the data.


4.1   Pre-processing

 – All texts were converted to lowercase, so that case variants of the same
   word are not counted separately.
 – Stopwords were removed.
 – We deleted the numbers that appear in the text.
 – We deleted punctuation, since it adds no information to the remaining
   text features.
 – Sequences of several blank spaces, tabs, and line breaks were replaced
   by a single blank space.

    Due to the differences between the development and testing corpora, we
decided to apply the pre-processing only to the main text of the news.
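
The steps listed in this section can be sketched as follows. This is a
minimal illustration and not the code used in the competition: the stopword
set is a tiny hypothetical sample standing in for a full Spanish list, the
function name preprocess is our own, and the order of the steps is chosen so
that stopwords are matched after punctuation has been stripped.

```python
import re
import string

# Tiny illustrative stopword sample (assumption); a real run would use a
# complete Spanish stopword list.
SPANISH_STOPWORDS = {"de", "la", "el", "en", "y", "que", "los", "a", "un", "una"}

# ASCII punctuation plus marks that are common in Spanish text.
PUNCTUATION = string.punctuation + "¿¡«»“”‘’"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def preprocess(text, stopwords=SPANISH_STOPWORDS):
    """Lowercase, drop numbers, punctuation and stopwords, squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # delete numbers
    text = text.translate(_PUNCT_TABLE)   # punctuation becomes a space
    # split() with no argument also collapses spaces, tabs and line breaks.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


print(preprocess("El  presidente anunció\t3 medidas, ¡hoy!\nEn la capital."))
# presidente anunció medidas hoy capital
```

Punctuation is replaced by a space rather than removed outright so that
words joined only by a punctuation mark are not merged into one token.
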


4.2   Features

We took into account several n-gram features for the representation of texts:

 – Character.
 – Word.
 – POS tags. Sequences of contiguous part-of-speech (POS) tags, which
   capture syntactic information. For this feature we used the spaCy
   tagger.
 – Skipgrams. We capture groups of 2 words with skips of 1 to 3 words.
 – Function words. The frequency of these words is one of the best
   characteristics for detecting hate speech and aggressiveness (1), so we
   want to see whether it can also help us discriminate fake news. We built
   function-word n-grams of 2 to 4 tokens using the Spanish stopword list
   from NLTK (3).
 – Punctuation symbols. With this feature we want to capture the coherence
   and cohesion of the written text. Before pre-processing the corpus, we
   built n-grams of 2 to 4 punctuation symbols.
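
The n-gram families above can be illustrated with a few small helpers. This
is a sketch under stated assumptions rather than our exact implementation:
the function names are hypothetical, function-word n-grams are assumed to be
built over the sequence of function words that remains after all other
tokens are dropped, and punctuation n-grams over the sequence of punctuation
marks in order of appearance. POS-tag n-grams are omitted because they need
an external tagger.

```python
import re


def word_ngrams(tokens, n):
    """Contiguous n-grams over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def skipgrams(tokens, min_skip=1, max_skip=3):
    """Pairs of words with min_skip..max_skip words skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        for skip in range(min_skip, max_skip + 1):
            j = i + skip + 1
            if j < len(tokens):
                pairs.append((left, tokens[j]))
    return pairs


def function_word_ngrams(tokens, function_words, n):
    """N-grams over the function words only, kept in their original order."""
    kept = [tok for tok in tokens if tok in function_words]
    return word_ngrams(kept, n)


def punctuation_ngrams(text, n):
    """N-grams over the punctuation marks of the raw text."""
    marks = re.findall(r"[^\w\s]", text)
    return word_ngrams(marks, n)


print(char_ngrams("fake", 3))
# ['fak', 'ake']
print(skipgrams("a b c d e".split()))
# [('a', 'c'), ('a', 'd'), ('a', 'e'), ('b', 'd'), ('b', 'e'), ('c', 'e')]
print(function_word_ngrams(["el", "gato", "de", "la", "casa"], {"el", "de", "la"}, 2))
# [('el', 'de'), ('de', 'la')]
print(punctuation_ngrams("¡Hola! ¿Qué tal?", 2))
# [('¡', '!'), ('!', '¿'), ('¿', '?')]
```
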
   We use two variations of the features, as shown in Table 1. The columns
for each approach list the n-gram lengths applied to every feature type when
the classifiers were run. Counting each n-gram length of each feature type as
one feature, approach 1 contains 17 features and approach 2 contains 15. We
selected these feature combinations based on the performance they showed
during the competition phases.

                    Table 1. Features applied to the models.

                   Features (n-grams)  Approach 1 Approach 2
                   Characters          [3,4,5]    [3,4,5]
                   Words               [2,3,4]    [2,3]
                   Skipgrams           [2,3]      [1,2]
                   POS tags            [2,3,4]    [2,3]
                   Function words      [2,3,4]    [2,3,4]
                   Punctuation symbols [2,3,4]    [2,3,4]
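
The totals of 17 and 15 features follow from counting one feature per
n-gram length in each row of Table 1. The short sketch below reproduces that
arithmetic; the dictionary layout and its key names are only an illustrative
encoding of the table.

```python
# Table 1 encoded as {approach: {feature type: list of n-gram lengths}}.
APPROACHES = {
    "approach-1": {
        "characters": [3, 4, 5],
        "words": [2, 3, 4],
        "skipgrams": [2, 3],
        "pos_tags": [2, 3, 4],
        "function_words": [2, 3, 4],
        "punctuation": [2, 3, 4],
    },
    "approach-2": {
        "characters": [3, 4, 5],
        "words": [2, 3],
        "skipgrams": [1, 2],
        "pos_tags": [2, 3],
        "function_words": [2, 3, 4],
        "punctuation": [2, 3, 4],
    },
}


def count_features(config):
    """Each (feature type, n-gram length) pair counts as one feature."""
    return sum(len(lengths) for lengths in config.values())


for name, config in APPROACHES.items():
    print(name, count_features(config))
# approach-1 17
# approach-2 15
```
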


4.3   Classifier
We used three well-known classifiers, all of them described in (2; 7):
Multinomial Naive Bayes (MNB), Logistic Regression (LR), and Support Vector
Machines (SVM). We also used CountVectorizer with a threshold of 3. We tested
all the models during the competition phases in conjunction with the
approaches described in Table 1.
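
A minimal scikit-learn sketch of this setup is shown below. Several points
are assumptions made for illustration and are not stated in the text: the
threshold of 3 is interpreted as the min_df parameter of CountVectorizer,
only character n-grams of length 3 to 5 are used instead of the full feature
combination, the SVM uses a linear kernel, and the toy texts and labels are
invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_models(min_df=3):
    """One pipeline per classifier, each with its own count vectorizer."""
    def vectorizer():
        # min_df=3 keeps only n-grams that occur in at least 3 documents
        # (assumed meaning of the threshold).
        return CountVectorizer(analyzer="char", ngram_range=(3, 5), min_df=min_df)

    return {
        "MNB": make_pipeline(vectorizer(), MultinomialNB()),
        "LR": make_pipeline(vectorizer(), LogisticRegression(max_iter=1000)),
        "SVM": make_pipeline(vectorizer(), SVC(kernel="linear")),
    }


# Invented toy data, only to show that the pipelines run end to end.
texts = [
    "cura milagrosa elimina el virus en un dia",
    "el gobierno oculta la cura milagrosa del virus",
    "la cura milagrosa que los medicos ocultan",
    "el ministerio de salud reporta nuevos casos",
    "el ministerio de salud anuncia campaña de vacunacion",
    "reporta el ministerio de salud cifras oficiales",
]
labels = ["fake", "fake", "fake", "true", "true", "true"]

models = build_models()
for name, model in models.items():
    model.fit(texts, labels)
    print(name, model.predict(["anuncian una cura milagrosa"])[0])
```

Each pipeline owns a separate vectorizer so that fitting one model never
changes the vocabulary used by another.
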

5     Results
During the development phase, our best result was obtained with the
Multinomial Naive Bayes classifier and feature approach 1, with an F1-score
of 0.7576, which placed us 7th in this phase. The models using Logistic
Regression and Support Vector Machines, with either approach, did not surpass
the results obtained with the Multinomial Naive Bayes classifier.
    In the evaluation phase, by contrast, the results were different: our
best model turned out to be Logistic Regression with feature approach 1, and
the worst was Multinomial Naive Bayes, as seen in Table 2. The F1-score of
our best model ranked us 8th in the competition. We were not able to try more
combinations of models and approaches because of the limit on the number of
submissions in this phase of the competition.

                     Table 2. F1-score in evaluation phase.

                            Model Approach F1-score
                            LR       1     0.6925
                            SVM      1     0.6722
                            LR       2     0.6921
                            MNB      1     0.4928

6     Conclusions
Fake news detection is still an open challenge: the wide range of information
that can be altered to produce false statements makes a straightforward
solution difficult. Hence the importance of this type of task, which lets us
see, understand, and improve the methodologies developed all over the world.
According to the official results, our approach stayed below the baseline, an
SVM with character trigrams, which obtained an F1-score of 0.7062. We
consider that the complexity added to our model by combining different
features was not worth it: the gain, in this case, is very low. We also
believe that the additional columns available in the development corpus,
together with the fact that it contained only news in Mexican Spanish,
provided discriminative information during classifier training.
Unfortunately, the evaluation corpus did not offer the same elements, and we
believe this affected the overall results.
                              Bibliography


 [1] Argota Vega, L.E., Reyes-Magaña, J.C., Gómez-Adorno, H., Bel-Enguix,
     G.: MineriaUNAM at SemEval-2019 Task 5: Detecting hate speech in Twitter
     using multiple features in a combinatorial framework. In: Proceedings of
     the 13th International Workshop on Semantic Evaluation. pp. 447–452 (2019)
 [2] Géron, A.: Hands-on machine learning with Scikit-Learn and TensorFlow:
     concepts, tools, and techniques to build intelligent systems. O'Reilly
     Media (2019)
 [3] Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the
     COLING/ACL 2006 Interactive Presentation Sessions. pp. 69–72 (2006)
 [4] Gómez-Adorno, H., Posadas-Durán, J.P., Bel-Enguix, G., Porto, C.:
     Overview of FakeDeS task at IberLEF 2021: Fake news detection in Spanish.
     Procesamiento del Lenguaje Natural 67(0) (2021)
 [5] Karimi, H., Roy, P., Saba-Sadiya, S., Tang, J.: Multi-source multi-class
     fake news detection. In: Proceedings of the 27th International Conference
     on Computational Linguistics. pp. 1546–1557 (2018)
 [6] Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez
     Carmona, M., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L.,
     Freitas, L., Gómez-Adorno, H., Gutiérrez, Y., Jiménez Zafra, S.M., Lima,
     S., Plaza-de Arco, F.M., Taulé, M.: CEUR Workshop Proceedings, 2021. In:
     Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021)
     (2021)
 [7] Müller, A.C., Guido, S.: Introduction to machine learning with Python: a
     guide for data scientists. O'Reilly Media (2018)
 [8] Posadas-Durán, J.P., Gómez-Adorno, H., Sidorov, G., Escobar, J.J.M.:
     Detection of fake news in a new corpus for the Spanish language. Journal
     of Intelligent & Fuzzy Systems 36(5), 4869–4876 (2019)
 [9] Pérez-Rosas, V., Kleinberg, B., Lefevre, A., Mihalcea, R.: Automatic
     detection of fake news (2017)
[10] Reis, J.C., Correia, A., Murai, F., Veloso, A., Benevenuto, F.: Supervised
     learning for fake news detection. IEEE Intelligent Systems 34(2), 76–81
     (2019)
[11] Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on
     social media: A data mining perspective. ACM SIGKDD Explorations
     Newsletter 19(1), 22–36 (2017)