Towards Crowdsourcing Tasks for Accurate Misinformation Detection

Ronald Denaux [0000-0001-5672-9915], Flavio Merenda, and Jose Manuel Gomez-Perez [0000-0002-5491-6431]

Expert System, Madrid, Spain
{rdenaux,jmgomez}@expertsystem.com

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. For all the recent advancements in Natural Language Processing and deep learning, current systems for misinformation detection are still woefully inaccurate on real-world data. Automated misinformation detection systems that are available to the general public and produce explainable ratings are therefore still an open problem, and the involvement of domain experts, journalists or fact-checkers is necessary to correct the mistakes such systems currently make. Reliance on such expert feedback imposes a bottleneck and prevents scalability of current approaches. In this paper, we propose a method based on Credibility Reviews (CR), a recent semantics-based approach for misinformation detection, to (i) identify real-world errors of the automatic analysis; (ii) use the semantic links in the CR graphs to identify steps in the misinformation analysis which may have caused the errors; and (iii) derive crowdsourcing tasks to pinpoint the source of errors. As a bonus, our approach generates real-world training samples which can improve existing datasets and the accuracy of the overall system.

Keywords: Disinformation Detection · Crowdsourcing · Credibility Signals · Explainability

1 Introduction

One of the reasons that makes misinformation a hard problem is that verifying a claim requires skills that only a fraction of the population has: typically well-educated domain experts, fact-checkers or journalists who know where to find verifying information for a particular domain. As a consequence, fact-checking is a task that cannot easily be performed by crowdsource workers, who have different levels of education and who may lack specific domain knowledge. This bottleneck means in turn that it is difficult to train accurate, domain-independent, automated systems to help in the fact-checking process, as there is a relatively limited number of fact-checks available. Furthermore, available fact-checks are highly biased towards claims of specific domains considered more important at the time, e.g. political claims during elections or health claims during pandemics.

Several automated systems have been proposed [2,5,4,11] to help in misinformation detection tasks. However, their accuracy is still quite poor at the overall task of detecting misinforming claims, articles or social media posts in the wild. Ideally, these systems would catch misinformation before it is spread on social media, which means they should be accurate based on the content of the reviewed item. Current content-based systems only achieve about 72% accuracy [4] on datasets like FakeNewsNet, which are relatively easy as they (i) provide plenty of content (news articles), (ii) are simplified into a binary classification (fake or real), and (iii) have already been reviewed by fact-checkers.^1

^1 Social signals (replies, likes, etc.) provide further evidence which can improve accuracy [10,11], but can only be used after the content has spread.
In our previous work on Linked Credibility Reviews (LCRs) [4], we showed that our implementation, called acred, obtained state-of-the-art results based on the following steps:

– Simple content decomposition: basing the credibility of more complex documents, like articles and tweets, on their parts (e.g. sentences or linked articles) and metadata (e.g. the publisher website). In our current implementation of acred, we have introduced a checkworthiness filter to only take into account sentences which are factual statements.^2
– Linking those sentences to a database of claims already reviewed. This linking was achieved using simple, domain-independent linguistic tasks such as semantic similarity and stance detection, for which high-accuracy deep learning models can be trained (92% accuracy on stance detection and 0.83 Pearson correlation on semantic similarity, using RoBERTa).
– Normalising existing evidence for:
  • claims, from ClaimReviews provided by reputable fact-checkers, and
  • websites, from reputation scores by WebOfTrust, NewsGuard, and others.

^2 This is implemented as a RoBERTa model [6] finetuned on a combination of datasets: CBD [7], Clef'20 Task 1 (see https://github.com/sshaar/clef2020-factchecking-task1) and claims extracted from ClaimReview metadata. We obtain weighted F1 scores of 0.85 on Clef'19 Task 1 and 0.95 on the 2020 debates (see https://github.com/idirlab/claimspotter/tree/master/data/two_class).

Surprisingly, initial error analysis showed that most of the errors could be traced back to the sentence linking steps. One of the advantages of the LCR approach is that it generates a graph of sub-reviews, rather than just producing a single credibility label. In this paper we propose a method for exploiting the traceability of LCRs in order to (i) crowdsource the error analysis process and (ii) derive new training samples for credibility review subtasks like semantic similarity and stance detection.

2 Problem and Intuition

Consider the tweet shown in Figure 1a. Using acred, we can generate a credibility review for that tweet, which we can show to the users in a couple of ways. The most concise way is shown as a bar on top of the tweet in Fig. 1a; the bar displays the acred credibility label for the tweet. To the right of the label, we see a couple of buttons that allow users to provide feedback about whether they agree (happy face) or disagree (sad face) with the label assigned by the system. In this case, the numbers indicate there is a clear majority of users who disagree with the label, which tells us that something has gone wrong in acred's analysis.

The challenge is figuring out which step(s) in the acred analysis introduced errors. Fig. 2a shows the graph of all the evidence gathered and considered by acred in order to produce the "credible" label shown to the user. Each of the "meter" icons is a sub-review (e.g. a credibility review of one sentence in the tweet, or a similarity review between that sentence and some other sentence for which a credibility value is known) which contributed to the final rating; therefore any of those steps could have introduced an error, but which ones? Obviously we do not want to generate tasks for all 36 sub-reviews. Instead, we want to select the sub-reviews most likely to have produced the error. The rest of the paper discusses how to do that and what kind of crowdsourcing task could be used to find errors in the graph.
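Before doing so, it helps to make concrete where each of these sub-reviews comes from. The following minimal sketch retraces, for a single sentence, the decomposition and linking steps recalled in Section 1. The function names and the trivial model stubs are placeholders for illustration only; they are not acred's actual API or models.

    from typing import Dict, Optional

    def is_checkworthy(sentence: str) -> bool:
        """Checkworthiness filter: keep only factual, verifiable statements."""
        return not sentence.endswith("?")  # toy heuristic standing in for the RoBERTa classifier

    def semantic_similarity(s1: str, s2: str) -> float:
        """Sentence similarity in [0, 1] (toy word-overlap stand-in for the STS model)."""
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        return len(w1 & w2) / max(len(w1 | w2), 1)

    def stance(s1: str, s2: str) -> str:
        """Stance between two sentences: agree / disagree / discuss / unrelated (stand-in)."""
        return "discuss"

    def review_sentence(sentence: str, reviewed_claims: Dict[str, float]) -> Optional[dict]:
        """Produce one sentence-level sub-review by linking the sentence to the most
        similar already-reviewed claim; reviewed_claims maps claims to their known
        credibility ratings and is assumed to be non-empty."""
        if not is_checkworthy(sentence):
            return None  # discarded by the checkworthiness filter
        best_claim, sim = max(
            ((claim, semantic_similarity(sentence, claim)) for claim in reviewed_claims),
            key=lambda pair: pair[1])
        relation = stance(sentence, best_claim)
        credibility = reviewed_claims[best_claim]
        if relation == "disagree":
            credibility = -credibility  # disagreeing with a reviewed claim flips its rating
        return {"sentence": sentence, "matchedClaim": best_claim, "similarity": sim,
                "stance": relation, "credibility": credibility}

Each intermediate judgement (checkworthiness, similarity, stance, the credibility of the matched claim and of its publisher) appears as a separate sub-review in graphs like the one in Fig. 2a, which is what makes the analysis traceable.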
Intuition for our approach. LCR bots, responsible for contributing the sub-reviews, tend to apply heuristics to select certain sub-reviews (and discard others). In Figure 1b we see an interface showing a card for the final credibility review for the tweet. In essence, it summarises the graph shown in Fig. 2a. The generated explanation clearly only uses some of the evidence in the graph. In particular, we see that the explanation hinges on just one of the sentences in the tweet and on that sentence agreeing with a similar sentence found on a website deemed to be credible. This chain of evidence is shown in Fig. 2b, which is a subset of 7 (out of the initial 36) sub-reviews from Fig. 2a. In this sub-graph, all the sub-reviews directly contribute to the final label. Since the final label is erroneous, one or more of these evidence nodes must have introduced some error.^3

^3 Note that some of the discarded sub-reviews may also be erroneous, but those errors did not contribute to the final label, hence we ignore them.

Fig. 1: Example UIs for a (dis)agreement task for a tweet: (a) tweet with label and feedback buttons; (b) credibility review with explanation. The user can provide feedback about correct or incorrect labels predicted by acred.

Fig. 2: Evidence graph for the credibility review and tweet shown in Fig. 1: (a) full evidence graph; (b) kept evidence graph. The big "meter" icon represents the main credibility review, next to the icon for the tweet. All the other nodes form the evidence gathered by acred and used to determine the credibility of the tweet.

3 Crowdacred

In this section we formalise the problem and our approach, called Crowdacred.

3.1 Preliminaries

Schema.org Reviews and Credibility Reviews. Linked Credibility Reviews (LCR) [4] is a linked data model for composable and explainable misinformation detection. A Credibility Review (CR) is an extension of the generic Review data model defined in Schema.org. A Review R can be conceptualised as a tuple ⟨d, r, p⟩ where R:

– reviews a data item d, via property itemReviewed; d can be any linked-data node (e.g. an article, claim or social media post);
– assigns a numeric or textual rating r to (some, often implicit, reviewAspect of) d, via property reviewRating;
– optionally provides provenance information p, e.g. via properties author and isBasedOn.

A Credibility Review (CR) is a subtype of Review, defined as a tuple ⟨d, r, c, p⟩, where:

– the rating r must have reviewAspect credibility, is recommended to be expressed as a numeric value in the range [-1, 1], and is qualified with a rating confidence c (in the range [0, 1]);
– the provenance p is mandatory and must include information about:
  • the credibility signals (CS) used to derive the credibility rating, which can be either (i) Reviews for data items relevant to d or (ii) ground credibility signals (GCS): resources (which are not CRs) in databases curated by a trusted person or organization;
  • the author of the review, which can be a person, an organization or a bot. Bots are automated agents that produce CRs.

For this paper, the main thing to take into account is that the CR for a particular data item (e.g. a tweet) is composed of many "sub-reviews", which are available by following the provenance relation p.
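As an illustration only, the data model above could be represented as in the following sketch. The class and field names mirror the Schema.org properties just mentioned (itemReviewed, reviewRating, isBasedOn, author), but the code is ours and not acred's actual implementation.

    from dataclasses import dataclass, field
    from typing import List, Optional, Union

    @dataclass
    class Rating:
        reviewAspect: str                    # e.g. "credibility"
        ratingValue: Union[float, str]       # credibility value in [-1, 1] or a textual label
        confidence: Optional[float] = None   # c in [0, 1], required for credibility ratings

    @dataclass
    class CredibilityReview:
        itemReviewed: str                    # URI of d: a tweet, sentence, website, ...
        reviewRating: Rating                 # r (and c); reviewAspect must be "credibility"
        author: str                          # a person, an organization or a bot
        isBasedOn: List["CredibilityReview"] = field(default_factory=list)  # provenance p: sub-reviews

    # A tweet-level CR whose rating is based on a single sentence-level sub-review:
    sentence_cr = CredibilityReview(
        itemReviewed="ex:sentence1",
        reviewRating=Rating("credibility", 0.8, confidence=0.9),
        author="ex:SentenceCredibilityBot")
    tweet_cr = CredibilityReview(
        itemReviewed="ex:tweet1",
        reviewRating=Rating("credibility", 0.8, confidence=0.9),
        author="ex:TweetCredibilityBot",
        isBasedOn=[sentence_cr])

Following the isBasedOn links recursively from tweet_cr yields the evidence graph introduced next.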
For any specific CR_i, we refer to the overall set of nodes V_i (Reviews, authors, data items and GCS) and the links between them (E_i) as the evidence graph G_i = (V_i, E_i) for CR_i.

Crowdsourcing Review Tasks. A Crowdsourcing Review Task (subsequently simply referred to as a task) t is defined as a tuple ⟨d, a, o⟩, where d is a data item to be reviewed by the user, a is the aspect of d that needs to be reviewed, and o is a set of possible review values. Tasks need to be performed by human users, hence we require a function f_render which renders the task in a way that a user can inspect. The user performs the task by inspecting the rendering and selecting one of the available options, which produces a review of the form (d, r_a, p_u), where r_a is a rating for aspect a whose ratingValue is one of the options in o.

3.2 Problem Statement and Overview

Given an unlabeled data item d and an automatically derived credibility review for it, CR_d = ⟨d, r_d, c_d, p_d⟩ (and therefore its corresponding evidence graph G_d = (V_d, E_d)), create simple tasks t_1, t_2, ..., t_n which can be performed by un- (or minimally) trained workers and which (i) allow us to decide whether r_d is accurate and (ii) if r_d is not accurate, identify the sub-reviews R_i^d ∈ V_d which directly caused the error. Furthermore, aim to minimise the number of tasks n.

In this paper, we propose a two-step method to derive such tasks:

1. collect agreement with the overall rating r_d;
2. for ratings with high disagreement:
   – identify candidate reviews in the evidence graph for r_d, and
   – derive tasks from the identified candidate reviews.

3.3 Capturing Overall Agreement with Credibility Reviews

In this first step, we generate tasks for users to help us identify CR instances which have an inaccurate credibility rating. For this, we exploit the explainability of credibility ratings. We propose the following task: given a user u and a credibility review CR_d for data item d, we define t_agreement = ⟨CR_d, agreement, o_agreement⟩ as a task where the user is shown a summary of CR_d (likely including a rendering of d), and is asked to select a rating from o_agreement = {agree, disagree}. For this task we consider two specific rendering functions:

– label maps the values r_d and c_d onto a credibility label. For example, r_d > 0.5 and c_d > 0.75 could map to "credible" (a sketch of such a mapping is given at the end of this subsection).
– explain generates a more complex textual explanation by following the provenance information p_d (recursively).

The result of t_agreement is an instance of a Review: (CR_d, r_agreement, p_u). An example of such a task, using both rendering functions, is shown in Fig. 1.

Although this task is much easier than performing a full fact-check of an article or claim, it can still be cognitively demanding, and some users may not have sufficient knowledge about the domain to make an informed decision. Therefore, we expect this to be a challenging task for most crowdsource workers. As part of the Co-inform project^4, instead of relying on crowdsource workers, we are asking users of our browser plugin to provide such agreement ratings as an extension of their daily browsing and news consumption habits. As shown in Fig. 1a, given sufficient users, a consensus can emerge, enabling detection of erroneous reviews.

^4 https://coinform.eu/
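The following is a minimal sketch of such a label rendering function. The thresholds and all label names other than "credible" are assumptions made for illustration, not acred's exact mapping.

    def label(rating_value: float, confidence: float) -> str:
        """Map a credibility rating r_d in [-1, 1] and confidence c_d in [0, 1]
        to a textual label shown to the user (e.g. as the bar in Fig. 1a)."""
        if confidence <= 0.75:
            return "not verifiable"   # too little evidence to commit to a label
        if rating_value > 0.5:
            return "credible"         # the example mapping mentioned above
        if rating_value >= -0.5:
            return "uncertain"
        return "not credible"

    # e.g. label(0.8, 0.9) -> "credible"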
3.4 Finding Candidate Erroneous Sub-Reviews

Given a credibility review CR_d which users have rated as erroneous, in this step we identify the sub-reviews R_1, R_2, ..., R_n which have directly contributed to the final rating and confidence in CR_d. Recall that p_d provides provenance information that can be used for this. In acred, the relevant provenance is implemented by providing a list of sub-reviews via property isBasedOn. This list contains references to all the signals taken into account to derive the rating, but in many cases the majority of these signals are discarded via aggregation functions (e.g. selecting the sub-review with the highest confidence or with the lowest credibility rating [4]). Therefore, we propose to define two disjoint subproperties of isBasedOn: isBasedOnDiscarded and isBasedOnKept. Using these new subproperties we can define a subgraph G_d^kept of G_d, which contains only those nodes which can be linked to the final CR_d via isBasedOnKept edges. To illustrate this idea, Fig. 2a shows an example of a full evidence graph, while Fig. 2b shows only the kept subgraph for the same credibility review. As can be seen from the figures, this step greatly reduces the number of candidate sub-reviews, while also ensuring that those reviews directly contributed to the final (presumably erroneous) rating.
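As a minimal sketch of this pruning step (the graph representation below is an assumption for illustration; acred's actual data structures may differ), the kept subgraph can be obtained by following only isBasedOnKept edges from the root review:

    from typing import Dict, List, Set, Tuple

    Edge = Tuple[str, str]  # (property, target review id), e.g. ("isBasedOnKept", "review:42")

    def kept_subgraph(root: str, edges: Dict[str, List[Edge]]) -> Set[str]:
        """Return the ids of all reviews reachable from `root` via isBasedOnKept
        edges, i.e. the candidates that directly contributed to the final rating."""
        kept, frontier = {root}, [root]
        while frontier:
            node = frontier.pop()
            for prop, target in edges.get(node, []):
                if prop == "isBasedOnKept" and target not in kept:
                    kept.add(target)
                    frontier.append(target)
        return kept

    # Toy example: only the kept chain survives, as in Fig. 2b versus Fig. 2a.
    edges = {
        "cr:tweet": [("isBasedOnKept", "cr:sentence"), ("isBasedOnDiscarded", "cr:website")],
        "cr:sentence": [("isBasedOnKept", "cr:similarity")],
    }
    assert kept_subgraph("cr:tweet", edges) == {"cr:tweet", "cr:sentence", "cr:similarity"}

The reviews that survive this pruning are the candidates for which we derive crowdsourcing tasks in the next subsection.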
3.5 Defining Crowdsourcing Tasks

Now that we have identified a small number of sub-reviews which directly influence the final credibility rating, we can use crowdsourcing to identify which steps contributed erroneous evidence. Although we could define user agreement tasks for the individual steps, we can get more actionable information by asking users more specific questions. For this, we need to define custom tasks for each step in acred. Preliminary error analyses in [4] showed that most of the errors were caused by the linking steps; therefore we discuss three specific types of Reviews used in acred and how to derive crowdsourcing tasks for them.

SentenceCheckworthinessReview determines whether a sentence is checkworthy or not. This is the case when the sentence is both factual (i.e. not an opinion or question) and verifiable (someone can, in principle, find out whether the sentence is accurate or not). We derive a task t_checkworthy where o_checkworthy = {checkworthy, notFactual, notVerifiable}. Table 1 shows an example rendering (and expected answer), based on the sub-reviews in Figures 2b and 1b.

  Help us to detect if a sentence contains a factual claim
  Do you think the following sentence contains a factual claim?
  – “The vast amounts of money made and stolen by China from the United States, year after year, for decades, will and must STOP.”
  [x] Yes, and the claim can be verified
  [ ] Yes, but nobody could verify it
  [ ] No

Table 1: Example SentenceCheckworthinessReview task

SentenceSimilarityReview assigns a similarity score to a pair of sentences ⟨s_a, s_b⟩.^5 There are existing crowdsourcing tasks defined for this [1], including instructions and a rating schema, which we can reuse to define t_sentenceSimilarity = ⟨d, sentenceSimilarity, o_sentenceSimilarity⟩. The schema o_sentenceSimilarity consists of a scale of 6 values ranging from 0 (the two sentences are completely dissimilar) to 5 (the two sentences are completely equivalent, as they mean the same thing). See Table 2 for an example.

^5 This is implemented in acred via a RoBERTa model that has been fine-tuned on STS-B [3], which has in part been derived from previous semantic similarity tasks [1].

  Help us to detect how similar two sentences are
  Choose the option that best describes the degree of semantic similarity between the following pair of sentences.
  – “The vast amounts of money made and stolen by China from the United States, year after year, for decades, will and must STOP.”
  – “The US still supplies much more goods from China and the EU than vice versa.”
  The two sentences are:
  [ ] completely equivalent, as they mean the same thing
  [ ] mostly equivalent, but some unimportant details differ
  [x] roughly equivalent, but some important information differs/missing
  [ ] not equivalent, but share some details
  [ ] not equivalent, but are on the same topic
  [ ] on different topics

Table 2: Example SentenceSimilarityReview task

SentenceStanceReview assigns a stance label describing the relation between a pair of sentences.^6 Although there are many existing datasets [9] for this problem, they differ in their target labels. We find that the FNC-1 [8] labels (agree, disagree, discuss and unrelated) provide a good balance, as other datasets are often missing a label for the unrelated case. Also, the FNC-1 labels have the advantage that they describe symmetric relations (although this is arguable for discuss), while other datasets use asymmetric relations like query. Therefore we define tasks t_sentenceStance = ⟨d, sentenceStance, o_sentenceStance⟩ where o_sentenceStance = {agree, disagree, discuss, unrelated}. Table 3 shows an example of such a task.

^6 This is implemented in acred via another RoBERTa model that has been fine-tuned on FNC-1 [8].

  Help us to better understand the relation between two sentences
  Choose the option that best describes the relation between the following sentences.
  – “The vast amounts of money made and stolen by China from the United States, year after year, for decades, will and must STOP.”
  – “The US still supplies much more goods from China and the EU than vice versa.”
  The two sentences:
  [ ] agree with each other
  [ ] disagree with each other
  [x] discuss the same issue
  [ ] are unrelated

Table 3: Example SentenceStanceReview task

4 Summary and Future Work

In this paper, we presented Crowdacred, a method for extending Linked Credibility Reviews to be able to crowdsource (i) the detection of inaccurate credibility reviews, (ii) the error analysis of erroneous reviews and (iii) the generation of realistic sample data for the NLP subtasks needed for accurate misinformation detection. We are currently implementing the proposed method on top of acred [4] and plan to run initial crowdsourcing experiments to validate the approach. The validation study will be based on a core set of (a few dozen) users from Co-inform^7 and a larger pool of crowdsource workers. If successful, we aim to produce new datasets of content in the wild on specific topics like COVID-19.

^7 https://coinform.eu/

Acknowledgements. Work supported by the European Commission under grant 770302 – Co-Inform – as part of the Horizon 2020 research and innovation programme.

References

1. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: Semantic textual similarity. In: Second Joint Conference on Lexical and Computational Semantics (*SEM). pp. 32–43. Association for Computational Linguistics, Atlanta, Georgia, USA (Jun 2013)
2. Babakar, M., Moy, W.: The State of Automated Factchecking. Tech. rep. (2016)
3. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In: Proc. of the 10th International Workshop on Semantic Evaluation. pp. 1–14 (2018)
4. Denaux, R., Gomez-Perez, J.M.: Linked Credibility Reviews for Explainable Misinformation Detection. In: 19th International Semantic Web Conference (Nov 2020), https://arxiv.org/abs/2008.12742
5. Hassan, N., Zhang, G., Arslan, F., Caraballo, J., Jimenez, D., Gawsane, S., Hasan, S., Joseph, M., Kulkarni, A., Nayak, A.K., Sable, V., Li, C., Tremayne, M.: ClaimBuster: The first-ever end-to-end fact-checking system. In: Proceedings of the VLDB Endowment. vol. 10, pp. 1945–1948 (2017)
6. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. Tech. rep. (2019)
7. Meng, K., Jimenez, D., Arslan, F., Devasier, J.D., Obembe, D., Li, C.: Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims (Feb 2020), http://arxiv.org/abs/2002.07725
8. Pomerleau, D., Rao, D.: The fake news challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news (2017)
9. Schiller, B., Daxenberger, J., Gurevych, I.: Stance Detection Benchmark: How Robust Is Your Stance Detection? (Jan 2020)
10. Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H.: FakeNewsNet: A Data Repository with News Content, Social Context and Spatiotemporal Information for Studying Fake News on Social Media. Tech. rep. (2018)
11. Shu, K., Zheng, G., Li, Y., Mukherjee, S., Awadallah, A.H., Ruston, S., Liu, H.: Leveraging Multi-Source Weak Social Supervision for Early Detection of Fake News (2020), http://arxiv.org/abs/2004.01732