Semi-automatic Annotation Proposal for Increasing a Fake News Dataset in Spanish

Alba Bonet-Jover
Department of Software and Computing Systems, University of Alicante, Spain

Abstract
The digital era has become an ally of fake news, since it has increased both the spread and the amount of false information. Fake news is a global problem that causes disorder and generates fear. This phenomenon must be attacked in the same environment in which it is generated: the digital environment. This paper presents the current state of my doctoral thesis, which focuses on linguistic modelling applied to the automatic detection of fake news through Natural Language Processing (NLP). In order to study the linguistic characteristics of fake news and to create computational models that automate its detection, labelled datasets are needed, but building them is a costly task that requires time and expertise. A fake news dataset and an annotation guide were created ad hoc in a previous work to analyse all the parts and elements of a news item. However, after creating the dataset and training our system, we realised that the time spent was not proportional to the small amount of annotated data obtained. The need to create a larger corpus to train and test our hypothesis has led us to look for a way of increasing our corpus without spending so much time. For that purpose, a semi-automatic annotation process is proposed to reduce annotation time while increasing the speed of annotation and the number of annotated examples. Besides allowing us to make progress in our research, this proposal may facilitate the creation of datasets, which are essential in NLP research.

Keywords
Natural Language Processing, Human Language Technologies, Fake news detection, Semi-automatic annotation, Corpus annotation, Corpus creation

1. Justification of the research
We are living in the global disinformation era, an era in which the excess of information is causing an infodemic, "a situation in which a lot of false information is being spread in a way that is harmful", as defined in the Collins Dictionary. Fake news has always existed and has been used for different purposes throughout history. The difference now lies in the fact that the current era has far more powerful dissemination tools than paper, radio or oral speeches: social media and the Internet. Information manipulation is commonplace in the digital era, and to that fact must be added the development of new technologies and the arrival of the Internet. All these factors have accelerated the spread of fake news via social media and online digital newspapers, thereby increasing confusion and social damage. Words have considerable power in shaping people's beliefs and opinions [1], and with the current COVID-19 pandemic, the excess of information and the amount of disinformation disseminated digitally are raising unfounded fears and confusing the population with hoaxes and fake news. It is necessary to tackle this problem in the same environment in which fake news is created and spread: the digital media.
Due to this fast dissemination, it is impossible for humans to take in and analyse such a large amount of information in such a short period of time. For that reason, Human Language Technologies are needed to automate tasks and develop computational models in this field. Modelling deceptive language, as well as the manual annotation of news, are key steps for automating the detection of fake news. To that end, labelled datasets are needed for training, but annotation is one of the most time-consuming and financially costly tasks in Natural Language Processing [2], and it requires human expertise, time and consistency. The number of labelled corpora available for NLP research is low, even more so in languages other than English, such as Spanish, for which there are few resources for this task.
The main objective of the thesis is to shape a deceptive language model of fake news, not only to detect fake news automatically but also to justify the decision behind that detection. To that end, we manually created and annotated a fake news dataset, but to keep training our model we need to obtain a larger corpus while also reducing the time spent on these tasks. We propose a new Spanish dataset that is being collected and annotated by combining manual and automatic processes. This semi-automatic proposal may facilitate the creation of our dataset by reducing cost and time.
This paper is structured as follows: Section 2 presents an overview of the most relevant scientific literature; Section 3 describes the new corpus, the updated version of the annotation guide and the new proposal of semi-automatic annotation; Section 4 introduces the changes to the methodology in comparison with our first work and the new experiments we are conducting for this research; Section 5 presents some open problems for discussion; Section 6 presents conclusions and future work, and the paper ends with the references used in the writing of this article.

2. Background and related projects
In order to substantiate the approach of the thesis and the progress made, it is important to analyse the state of the art related to existing fake news datasets and the annotation proposals available so far.

2.1. English datasets
Several datasets have been published in English to develop computational systems for fake news detection.
• The first public fake news detection and fact-checking dataset was released by Vlachos and Riedel [3]. It was composed of 221 statements collected from PolitiFact (http://www.politifact.com/) and Channel 4 (http://blogs.channel4.com/factcheck/). This dataset presented a five-label classification: True, MostlyTrue, HalfTrue, MostlyFalse and False.
• The EMERGENT dataset was presented by Ferreira and Vlachos [4]. It was created from rumour sites and Twitter accounts and labelled by journalists. It contained 300 claims and 2595 associated news articles. A stance label (for, against, observing) was assigned to each news article headline with respect to the claim and, in parallel, the veracity of the claim was established following three values (true, false, unverified).
• Wang [5] introduced the LIAR dataset, a broad dataset consisting of 12,836 real-world short statements collected from PolitiFact and covering several topics. It was manually labelled following a scale of six fine-grained labels: pants-fire, false, barely-true, half-true, mostly-true and true.
• Patwa et al. [6] released a fake news dataset of 10,700 fake and real news items focused on COVID-19. This dataset was manually annotated according to two labels (real, fake).
• The first COVID-19 Twitter fake news dataset (CTF) was recently introduced by Paka et al. [7] and consists of a mixture of labelled and unlabelled tweets. The novelty lies in the semi-supervised attention neural model that works with unlabelled data to learn the writing style of tweets. This corpus was also manually annotated using a two-label scale: Fake and Genuine.

2.2. Spanish datasets
In NLP, and particularly in the field of fake news detection, corpora built in the language of Cervantes are scarce in comparison with those in the language of Shakespeare. Some datasets of interest for our research are:
• An opinion dataset in Spanish consisting of statements covering three topics. It contained 100 true and 100 false statements for each topic, labelled and manually verified. With this corpus, Almela et al. [8] sought to find deceptive cues in written language in Spanish.
• A Spanish Fake News Corpus introduced by Posadas-Durán et al. [9], composed of 491 true news items and 480 fake news items collected from online resources. For labelling the news, two veracity values were considered: true and fake. To compile the corpus, keywords were identified by answering the questions What, Who, How, When and Where.
• A corpus built in Spanish and Italian including fake news spread on Facebook and Twitter in both countries. Two official Italian and Spanish fact-checking agencies, Maldita.es (https://maldita.es/) and Bufale Un Tanto Al Chilo (https://www.butac.it/), analysed and classified the news as "fake". With this dataset, Mottola [10] proposes an analysis of structural and linguistic features of fake news.

3. Description and objectives
To the authors' knowledge and according to the literature consulted, the available datasets in this domain focus on labelling the news as a whole. Even if they usually present a scale of different degrees of truthfulness, they consider the whole text as a single unit. In addition, some of them focus on news published on social media and cover a variety of topics.
How does our proposal differ from those datasets? Our contribution focuses on building a corpus in Spanish (both from Spain and Latin America), due to the lack of labelled resources in this language; focused on the health domain and COVID-19, because of the current pandemic situation; and composed of news collected from digital newspapers, in order to study the traditional news structure and content. Our dataset is manually collected and labelled according to the Inverted Pyramid and 5W1H journalistic techniques. Our hypothesis is that fake news mixes true and false information, so we propose a fine-grained annotation that makes it possible to determine not only the veracity of the full document but also the veracity of each essential content element and structural part of a news item. Despite these design criteria, our proposal is adaptable to any domain and language.
In our previous work, a Spanish fake news dataset, called the FNDeep Dataset, focused on the health domain and composed of 200 news items (95 fake and 105 true), was built to support our hypothesis that knowing the veracity value of each part separately can help to detect the overall veracity of a news item [11]. News items were manually searched for and collected from several online newspapers and blogs, and the information was checked against official fact-checking agencies belonging to the International Fact Checking Network (https://www.poynter.org/ifcn/).
Furthermore, we introduced an ad hoc annotation guide (the FNDeep Annotation Scheme) and conducted several experiments to train our architecture, which consists of two main layers (Structure and Veracity). Our published paper shows that determining the veracity of each structural element and each 5W1H component separately influences the global veracity of a news item. However, the results, although good, show the need to train with a larger number of examples.
As stated in Section 1, annotating a corpus manually requires a large investment of time and effort, which makes the process slow and the amount of labelled data small. To solve this problem, we are working on a combined proposal that makes it possible to obtain and annotate news automatically while being checked by a human expert. To test our proposal, we need to build another dataset so that the corpus created through a semi-automatic process can be compared with the corpus created entirely manually (our previous corpus). As our first dataset and the annotation guide are described in detail in [11] and [12], this work presents the new dataset, which is being created from scratch to compare it with our first dataset, and the new proposal of semi-automatic annotation.

3.1. Creation of a new fake news dataset
The aim is to check whether the change of approach and the assistance of a semi-automatic annotation make it possible to increase the speed of annotation and the number of annotated examples. To this end, we need to build a new corpus combining manual and semi-automatic approaches. The new corpus we are building keeps the topic (health and COVID-19), the language (Spanish) and the structure (Inverted Pyramid). However, to make this new dataset more precise, length and format have been better defined. Regarding length, the dataset will contain news items with a similar number of paragraphs, so that annotation time can be calculated on the basis of texts of similar length. With respect to format, posts, guides, FAQs and social media posts are omitted, and only news items presenting the traditional journalistic format are being collected. News following a specific format to refute a claim (that is, news with a fact-check format) is also being discarded for this corpus.

3.2. FNDeep Annotation Scheme V.2: reorientation and improvement of the guide
The complexity of our annotation lies in annotating all the content of a news item related to the Inverted Pyramid and 5W1H journalistic techniques. The first consists in splitting the structure of a news item into five common parts: Headline, Subtitle, Lead, Body and Conclusion. Those parts follow the Inverted Pyramid technique by placing the most relevant information at the beginning of the news item and the least relevant at the end. One of the improvements made in this regard concerns the annotation of the Body part. The other concept used, called 5W1H, consists in obtaining the answers to six key questions (Who, What, When, Where, Why and How) that allow a story to be communicated in a complete and accurate way. The annotation of these content elements remains the same, but it has been redirected and better defined by adding some semantic relations. However, the most important change with respect to our first corpus is the classification system that indicates the veracity value of each part and element.
We have adopted a reliability rating instead of a truthfulness rating, that is, we have replaced the veracity attributes True, Fake and Unknown with the attributes Reliable and Unreliable. This new classification is more accurate, since classifying content as true or false depends on extra-textual factors (reader, context, external knowledge), whereas a classification as reliable or unreliable is based on a purely textual and linguistic analysis, which makes it possible to obtain an analysis prior to the fact-checking task. Besides the reliability rating, other attributes have been added to some tags to mark semantic relations and add further information to our annotation.
• TITLE-STANCE: this attribute, only used in the Headline, indicates whether or not the information presented in the Body is consistent with the information in the headline. This consistency is represented by one of the following values: Agree, Disagree or Unrelated.
• MAIN-EVENT: this attribute, only used with the What tag, marks the main event of the story and helps to differentiate it from secondary events. A news item may contain more than one "main event".
• RELATED: content elements (5W1H) corresponding to the same event are linked with this semantic relation to differentiate multiple events appearing in the same passage. A sentence may contain more than one event, and each event may include its own 5W1H. This attribute is used by connecting all the content tags to the What tag.
• ROLE: this attribute is only used in the Who tag to indicate the role played by the subject or entity of the event. This function is indicated with one of three values: Subject (the Who causes the event), Target (the Who receives the effects of the event), or Both (the Who performs both functions).
• ELEMENTS-OF-INTEREST: other tags that allow a more accurate report to be created and more features to be trained are those related to style, ambiguity, lack of data, exaggeration, key terminology and phraseology, and orthotypography.
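To illustrate how the structural parts, the 5W1H content elements and the attributes described above fit together, the following listing sketches one possible in-memory representation of an annotated news item in Python. It is a minimal sketch written for this paper only: the class names, field names and the toy example are hypothetical and do not correspond to the actual files produced by our annotation tool.

    # Illustrative sketch of the FNDeep Annotation Scheme V.2
    # (hypothetical names, not the real annotation format).
    from dataclasses import dataclass, field
    from typing import List, Optional

    RELIABILITY = ("Reliable", "Unreliable")           # replaces True/Fake/Unknown
    TITLE_STANCE = ("Agree", "Disagree", "Unrelated")  # Headline only
    ROLE = ("Subject", "Target", "Both")               # Who only

    @dataclass
    class StructurePart:
        """Inverted Pyramid part: Headline, Subtitle, Lead, Body or Conclusion."""
        part: str
        text: str
        reliability: str
        title_stance: Optional[str] = None   # only meaningful for the Headline

    @dataclass
    class ContentElement:
        """5W1H element: Who, What, When, Where, Why or How."""
        element: str
        text: str
        reliability: str
        main_event: bool = False              # only for What: marks the main event
        role: Optional[str] = None            # only for Who: Subject, Target or Both
        related_to: Optional[int] = None      # RELATED: index of the What of the same event

    @dataclass
    class NewsAnnotation:
        structure: List[StructurePart]
        content: List[ContentElement]
        elements_of_interest: List[str] = field(default_factory=list)  # e.g. exaggeration

    # Toy annotated fragment (invented example, for illustration only)
    item = NewsAnnotation(
        structure=[StructurePart("Headline", "Miracle cure stops COVID-19 in 24 hours",
                                 reliability="Unreliable", title_stance="Disagree")],
        content=[ContentElement("What", "stops COVID-19 in 24 hours",
                                reliability="Unreliable", main_event=True),
                 ContentElement("Who", "Miracle cure", reliability="Unreliable",
                                role="Subject", related_to=0)],
        elements_of_interest=["exaggeration"],
    )

In this sketch the RELATED relation is reduced to an index pointing to the What element of the same event; in the actual guide it is a relation drawn between tags in the annotation tool.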
3.3. Semi-automatic annotation proposal
As stated at the beginning of this paper, the annotation task is essential in NLP, since it makes it possible to train computational models and to automate many human tasks. However, "data collection is one of the challenges of conducting deception research due to the scarce availability of such datasets" [13]. There is a lack of resources for studying fake news detection in languages other than English, and that scarcity is due to the fact that annotating data is a costly task: it requires time, human expertise and consistency. Posadas-Durán et al. [9] state that "annotated corpora can help to increase the performance of automatic methods aiming at detecting this kind of news", and human intervention is necessary to ensure consistency in the annotation of texts and to check the decisions made by the machine about the features learned.
After assessing the time and level of difficulty involved in the creation of our first corpus, we realised that the effort and time spent searching for and annotating news were not proportional to the number of labelled texts obtained. To progress in our research, a larger dataset is needed in order to train on as many examples as possible, but with manual tasks alone it is difficult to grow a corpus quickly. Our aim is to increase the speed of annotation and the amount of data while reducing time and effort. A semi-automatic annotation process may automatically select the news and propose an automatic annotation based on our scheme. Human intervention remains important, as the expert can check that the selected news items meet the needs of the corpus and can confirm or reject the annotation proposal. In this way, the expert ensures the consistency and accuracy of the dataset while saving time in collecting and annotating news, since the expert no longer has to spend so much time searching for news items or annotating them from scratch. The assisted system may thus allow a larger, up-to-date dataset to be created more quickly. To carry out this proposal, active learning techniques will be used to assist the annotation process, as this technique increases the performance of the learning model while reducing the amount of annotated data required [14].
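As an illustration of how active learning could drive the selection of news items for expert review, the listing below implements a simple uncertainty-sampling loop with scikit-learn. It is a hypothetical outline under our own assumptions (a binary Reliable/Unreliable document classifier over a pool of unlabelled news); it is not the system actually used in the thesis, and the function and variable names are invented for this example.

    # Hypothetical uncertainty-sampling step: the items the current model is least
    # sure about are sent to the expert annotator first (assumed setup only).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def select_for_annotation(labelled_texts, labels, unlabelled_texts, batch_size=30):
        """Return indices of the unlabelled news items the model is least confident about."""
        vectorizer = TfidfVectorizer(max_features=20000)
        X_train = vectorizer.fit_transform(labelled_texts)
        X_pool = vectorizer.transform(unlabelled_texts)

        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, labels)  # labels: e.g. "Reliable" / "Unreliable"

        # Uncertainty = 1 - probability of the most likely class
        probabilities = model.predict_proba(X_pool)
        uncertainty = 1.0 - probabilities.max(axis=1)
        return np.argsort(uncertainty)[::-1][:batch_size]

    # Usage sketch: the expert validates the selected items, the corrected labels are
    # added to the labelled set, and the loop is repeated with the updated model.
    # selected = select_for_annotation(corpus_texts, corpus_labels, candidate_news)

The design intuition is that the examples the model is most uncertain about are the ones whose expert correction is most informative, which is why active learning can reduce the amount of annotated data required [14].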
4. Methodology and experiments
Our aim is to compare the efficacy, speed and accuracy of manual annotation and assisted annotation. To that end, some relevant changes were made to the methodology used for creating and annotating the dataset. It is important to highlight that the annotator and the general methodology applied in this work are the same as for our first corpus. However, the semi-automatic proposal has led us to change the way news is collected and classified, as well as the annotation tool.

4.1. Methodology
With regard to compilation, instead of collecting the news manually, the system trained with the annotated data proposes a selection of interesting news items to be annotated. Regarding the annotation tool, we have chosen the Brat tool for this new corpus because it is an intuitive annotation tool that allows quick, accurate and easy annotation. In addition, it presents a visual and comfortable interface that facilitates the annotation task.
Last but not least, the verification process is being modified. For our first corpus, all news items were manually chosen and verified through an official fact-checking agency. However, with the implementation of the first version of the assisted annotation recommendation system, the system randomly selects news, regardless of whether it has been verified by an agency or not. At that point we realised that a change of approach was needed. The analysis should focus on the textual and linguistic level to determine whether a piece of a news item is reliable or not, rather than whether it is true or fake, since the latter classification requires external knowledge.

4.2. Experiments
In order to test this combined proposal of manual annotation supported by an automatic system, three experiments will be carried out. The objective is to compare the assisted annotation with the manual annotation. To this end, each experiment will use the same number of news items, which will be annotated by the same expert annotator. However, each experiment will follow a different approach, moving from fully manual creation and annotation tasks to the progressive automation of both tasks. This proposal is still being refined and tested.
EXPERIMENT 1: 30 news items will be manually searched for by the expert annotator, chosen according to certain criteria (format, language, topic, length, etc.) and manually annotated without assistance, following the annotation scheme created ad hoc. With this first experiment, we want to measure the time it takes the expert to perform both tasks manually, without the help of the system.
EXPERIMENT 2: 30 news items will be automatically selected by the system after training on the news of the first experiment, then checked by the expert annotator and manually annotated. The aim here is to determine whether the automatic selection of news helps to save time, as the expert only has to check whether the selection meets the requirements of the corpus and then annotate the items manually.
EXPERIMENT 3: 30 news items will be pre-annotated by the semi-automatic system, and the annotator will correct and verify the annotation assisted by the system. This experiment may help to establish whether the semi-automation of both tasks allows the corpus to be increased faster.
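Since the three experiments are compared mainly in terms of the time the expert spends per news item, a small script along the following lines could be used to summarise the timings. It is only an illustrative sketch under assumed inputs: the per-item times (in minutes) and the condition names are invented for this example.

    # Hypothetical summary of annotation effort per experimental condition.
    from statistics import mean

    def summarise(times_in_minutes):
        """Return mean minutes per item and estimated items per hour."""
        avg = mean(times_in_minutes)
        return {"mean_minutes_per_item": round(avg, 1),
                "items_per_hour": round(60 / avg, 1)}

    # Invented per-item timings in minutes (truncated lists, for illustration only)
    experiments = {
        "exp1_manual_search_and_annotation": [42, 38, 45],
        "exp2_assisted_selection_manual_annotation": [35, 33, 36],
        "exp3_assisted_selection_and_preannotation": [22, 25, 20],
    }

    for name, times in experiments.items():
        print(name, summarise(times))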
5. Specific elements of the research for discussion
We are still working on developing our semi-automatic annotation proposal. However, there is an issue that needs to be analysed further: the difficulty of finding fake news automatically. Fact-checking agencies are constantly fighting against fake news, and that continuous pressure makes fake news hard to find. Sometimes, news with a clear fake news format (containing spelling mistakes, capital letters, excessive full stops, alarmist and offensive messages, etc.) disappears or is removed. This difficulty increases when sources publish both fake and true content, since such a source cannot be classified exactly, and the system may propose news "considered unreliable" that is actually reliable, which can lead to an unbalanced corpus. Another difficulty encountered is that in many cases disinformation is spread on social media (Facebook, Twitter, WhatsApp), in the form of subjective or alarmist posts or through chain messages. We focus on disinformation in the form of traditional digital news, but this disinformation is often published only on social media.
The change of perspective has helped us to refine the classification of news. Our project focuses on supporting the fact-checking task as a preliminary stage. Our objective is to offer a preliminary report that gives an initial picture of a news item and justifies whether it is reliable or not, so that at a later stage it can be verified with fact-checking techniques. The veracity of a news item (true/fake) cannot be determined by language alone; external knowledge is needed to verify the information. The new classification into reliable or unreliable is more accurate and novel.

6. Conclusions and future work
This paper presents the current state of my doctoral thesis, which focuses on modelling deceptive language applied to the automatic detection of fake news. Fake news has become a global problem that damages society in several ways: it causes fear, prejudice, hate and insecurity. Fake news makes us vulnerable. To fight against this problem, we seek to model the deceptive language of fake news by studying the structure and content of news separately in order to predict its global veracity value. For that purpose, labelled data is needed. We have already created a fake news dataset and trained on it for our research, but a larger dataset is needed to keep training our system. As our first corpus was entirely collected and annotated manually, we need to create a new corpus combining manual and automatic approaches to test the semi-automatic annotation. Considering that annotated corpora are scarce in this field, especially in Spanish, and that the manual annotation of a corpus is a slow and difficult process, we propose to boost our dataset by implementing a semi-automatic annotation process that may assist the expert annotator. This assisted annotation may make it possible to reduce time and effort while increasing speed, accuracy and the amount of labelled data. This proposal is currently being studied and tested in order to improve our corpus in future work and to continue combating digital disinformation.

Acknowledgments
This research work has been partially funded by Generalitat Valenciana through the project "SIIA: Tecnologías del lenguaje humano para una sociedad inclusiva, igualitaria, y accesible" with grant reference PROMETEU/2018/089, by the Spanish Government through the projects RTI2018-094653-B-C22: "Modelang: Modeling the behavior of digital entities by Human Language Technologies" and RTI2018-094653-B-C21: "LIVING-LANG: Living Digital Entities by Human Language Technologies", and has also been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER). Furthermore, I would like to thank my research team for all the work done so far: Estela Saquete, Patricio Martínez, Alejandro Piad, Suilan Estévez, Mario Nieto, Victor Belén. I also thank Miguel Ángel García Cumbreras for his participation and work in this research.

References
[1] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2931–2937.
[2] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based tool for NLP-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 102–107.
[3] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, 2014, pp. 18–22. doi:10.3115/v1/W14-2508.
[4] W. Ferreira, A. Vlachos, Emergent: a novel data-set for stance classification, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2016, pp. 1163–1168. doi:10.18653/v1/N16-1138.
[5] W. Y. Wang, "Liar, liar pants on fire": A new benchmark dataset for fake news detection, CoRR abs/1705.00648 (2017).
[6] P. Patwa, S. Sharma, S. PYKL, V. Guptha, G. Kumari, M. S. Akhtar, A. Ekbal, A. Das, T. Chakraborty, Fighting an infodemic: COVID-19 fake news dataset, arXiv preprint arXiv:2011.03327 (2020).
[7] W. S. Paka, R. Bansal, A. Kaushik, S. Sengupta, T. Chakraborty, Cross-SEAN: A cross-stitch semi-supervised neural attention model for COVID-19 fake news detection, Applied Soft Computing (2021) 107393.
[8] A. Almela, R. Valencia-García, P. Cantos, Seeing through deception: A computational approach to deceit detection in Spanish written communication, Linguistic Evidence in Security, Law and Intelligence 1 (2013) 3–12.
[9] J. Posadas-Durán, H. Gomez-Adorno, G. Sidorov, J. Escobar, Detection of fake news in a new corpus for the Spanish language, Journal of Intelligent and Fuzzy Systems 36 (2019) 4868–4876. doi:10.3233/JIFS-179034.
[10] S. Mottola, Las fake news como fenómeno social. Análisis lingüístico y poder persuasivo de bulos en italiano y español, Discurso & Sociedad (2020) 683–706.
[11] A. Bonet-Jover, A. Piad-Morffis, E. Saquete, P. Martínez-Barco, M. Á. García-Cumbreras, Exploiting discourse structure of traditional digital media to enhance automatic fake news detection, Expert Systems with Applications 169 (2021) 114340.
[12] A. Bonet-Jover, The disinformation battle: Linguistics and artificial intelligence join to beat it (2020).
[13] E. Saquete, D. Tomás, P. Moreda, P. Martínez-Barco, M. Palomar, Fighting post-truth using natural language processing: A review and open challenges, Expert Systems with Applications 141 (2020) 112943.
[14] M. Kholghi, L. Sitbon, G. Zuccon, A. Nguyen, Active learning: a step towards automating medical concept extraction, Journal of the American Medical Informatics Association 23 (2016) 289–296.