<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Crowdsourcing Tasks for Accurate Misinformation Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ronald Denaux</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Merenda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Jose Manuel Gomez-Perez</string-name>
          <email>jmgomezg@expertsystem.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Expert System</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>For all the recent advancements in Natural Language Processing and deep learning, current systems for misinformation detection are still woefully inaccurate on real-world data. Automated misinformation detection systems that are available to the general public and produce explainable ratings are therefore still an open problem, and the involvement of domain experts, journalists or fact-checkers is necessary to correct the mistakes such systems currently make. Reliance on such expert feedback imposes a bottleneck and prevents scalability of current approaches. In this paper, we propose a method, based on a recent semantic-based approach for misinformation detection called Credibility Reviews (CR), to (i) identify real-world errors of the automatic analysis; (ii) use the semantic links in the CR graphs to identify the steps in the misinformation analysis which may have caused the errors; and (iii) derive crowdsourcing tasks to pinpoint the source of the errors. As a bonus, our approach generates real-world training samples which can improve existing datasets and the accuracy of the overall system.</p>
      </abstract>
      <kwd-group>
        <kwd>Disinformation Detection</kwd>
        <kwd>Crowdsourcing</kwd>
<kwd>Credibility Signals</kwd>
        <kwd>Explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>One of the reasons that makes misinformation a hard problem is that verifying
a claim requires skills that only a fraction of the population has: typically
well-educated domain experts, fact-checkers or journalists who know where to find
verifying information for a particular domain. As a consequence, fact-checking is a
task that cannot easily be performed by crowdsource workers, who have different
levels of education and may lack specific domain knowledge. This
bottleneck means in turn that it is difficult to train accurate, domain-independent,
automated systems to help in the fact-checking process, as there is a relatively
limited amount of fact-checks available. Furthermore, available fact-checks are
highly biased towards claims of specific domains considered more important at
the time, e.g. political claims during elections or health claims during pandemics.
</p>
      <p>
        Several automated systems have been proposed [
        <xref ref-type="bibr" rid="ref11 ref2 ref4 ref5">2,5,4,11</xref>
        ] to help in
misinformation detection tasks. However, their accuracy is still quite poor at the overall
task of detecting misinforming claims, articles or social media posts in the wild.
Ideally, these systems would catch misinformation before it spreads on social
media, which means they should be accurate based on the content of the reviewed
item. Current content-based systems only achieve about 72% accuracy [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] on
datasets like FakeNewsNet, which are relatively easy as they (i) provide plenty
of content (news articles), (ii) are simplified into a binary classification (fake or
real), and (iii) have already been reviewed by fact-checkers. (Social signals such as replies and likes provide further
evidence which can improve accuracy [
        <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
        ], but can only be used after the content has spread.)
      </p>
      <p>
        In our previous work on Linked Credibility Reviews (LCRs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we showed
that our implementation, called acred, obtained state-of-the-art results based on
the following steps:
- simple content decomposition: basing the credibility of more complex
documents, like articles and tweets, on their parts (such as sentences or linked articles)
and metadata (such as the publisher website). In our current implementation of
acred, we have introduced a checkworthiness filter to only take into account
sentences which are factual statements; this is implemented as a RoBERTa model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] fine-tuned on a combination of datasets: CBD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Clef'20 Task 1 (see https://github.com/sshaar/clef2020-factchecking-task1) and claims
extracted from ClaimReview metadata, with which we obtain weighted F1 scores of 0.85 on
Clef'19 Task 1 and 0.95 on the 2020 debate set (see
https://github.com/idirlab/claimspotter/tree/master/data/two class).
- linking those sentences to a database of claims already reviewed. This linking
is achieved using simple, domain-independent linguistic tasks such as
semantic similarity and stance detection, for which high-accuracy deep learning
models can be trained (92% accuracy on stance detection and 83 Pearson
correlation on semantic similarity, using RoBERTa).
- normalising existing evidence for:
claims, from ClaimReviews provided by reputable fact-checkers, and
websites, from reputation scores by WebOfTrust, NewsGuard, and others.
      </p>
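      <p>To make these steps concrete, the following sketch (in Python, purely illustrative and not acred's actual code) shows how content decomposition, the checkworthiness filter and claim linking via similarity and stance could fit together; all function names, stub implementations and thresholds are assumptions for illustration only.</p>
      <preformat>
# Minimal sketch of the decomposition and linking pipeline described above.
# The stub functions stand in for the fine-tuned RoBERTa models mentioned in
# the text; names, heuristics and thresholds are illustrative assumptions.
import re

def split_into_sentences(text):
    return [s.strip() for s in re.split(r"[.!?]\s+", text) if s.strip()]

def is_checkworthy(sentence):
    # Placeholder for the checkworthiness filter (factual and verifiable).
    return not sentence.endswith("?")

def similarity(a, b):
    # Placeholder for semantic similarity (acred uses a RoBERTa model).
    shared = set(a.lower().split()).intersection(b.lower().split())
    return len(shared) / max(len(set(a.lower().split())), 1)

def stance(a, b):
    # Placeholder for stance detection (acred uses a RoBERTa model).
    return "agree" if similarity(a, b) > 0.5 else "discuss"

def review_content(text, claim_db):
    """claim_db: list of (already_reviewed_claim, credibility_rating) pairs."""
    evidence = []
    for sentence in split_into_sentences(text):
        if not is_checkworthy(sentence):
            continue
        for claim, rating in claim_db:
            sim = similarity(sentence, claim)
            if sim > 0.7:  # illustrative linking threshold
                evidence.append((sentence, claim, sim, stance(sentence, claim), rating))
    if not evidence:
        return 0.0, evidence                  # no evidence found: credibility unknown
    best = max(evidence, key=lambda e: e[2])  # keep the most similar match
    sentence, claim, sim, rel, claim_rating = best
    if rel == "agree":
        return claim_rating, evidence         # inherit the linked claim's rating
    if rel == "disagree":
        return -claim_rating, evidence        # disagreement flips the rating
    return 0.0, evidence                      # 'discuss'/'unrelated': no clear signal
      </preformat>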
      <p>Surprisingly, initial error analysis showed that most of the errors could be
traced back to the sentence linking steps. One of the advantages of the LCR
approach is that it generates a graph of sub-reviews, rather than just producing
a single credibility label. In this paper we propose a method for exploiting the
traceability of LCRs in order to (i) be able to crowdsource the error analysis
process and (ii) derive new training samples for credibility review subtasks like
semantic similarity and stance detection.
</p>
    </sec>
    <sec id="sec-2">
      <title>Problem and Intuition</title>
      <p>
        Consider the tweet shown in Figure 1a. Using acred, we can generate a credibility
review for that tweet, which we can show to the users in a couple of ways. The
most concise way is shown as a bar on top of the tweet in Fig. 1a; the bar displays
the acred credibility label for the tweet. To the right of the label, we see a couple
of buttons that allow users to provide feedback about whether they agree (happy
face) or disagree (sad face) with the label assigned by the system. In this case,
the numbers indicate there is a clear majority of users who disagree with the
label, which tells us that something has gone wrong in acred's analysis. The
challenge is figuring out which step(s) in the acred analysis introduced errors.
Fig. 2a shows the graph of all the evidence gathered and considered by acred
in order to produce the "credible" label shown to the user. Each of the "meter"
icons is a sub-review (e.g. a credibility review of one sentence in the tweet, or
a similarity review between that sentence and some other sentence for which a
credibility value is known) which contributed to the final rating. Therefore any
of those steps could have introduced an error, but which ones? Obviously we
do not want to generate tasks for all 36 sub-reviews. Instead, we want to select
the sub-reviews most likely to have produced the error. The rest of the paper
discusses how to do that and what kind of crowdsourcing task could be used to
find errors in the graph.
      </p>
      <p>Intuition for our approach. LCR bots, responsible for contributing the
sub-reviews, tend to apply heuristics to select certain sub-reviews (and discard
others). In Figure 1b we see an interface showing a card for the final credibility
review for the tweet. In essence, it summarises the graph shown in Fig. 2a.
The generated explanation clearly only uses some of the evidence in the graph.
In particular, we see that the explanation hinges on just one of the sentences in
the tweet and on its agreement with a similar sentence found on a website deemed to
be credible. This chain of evidence is shown in Fig. 2b, which is a subset of 7
(out of the initial 36) sub-reviews from Fig. 2a. In this sub-graph, all the sub-reviews
directly contribute to the final label. Since the final label is erroneous, one or
more of these evidence nodes must have introduced some error (note that some of
the discarded sub-reviews may also be erroneous, but those errors did not contribute
to the final label, hence we ignore them).</p>
    </sec>
    <sec id="sec-3">
      <title>Crowdacred</title>
      <sec id="sec-3-1">
        <title>Preliminaries</title>
        <p>
          In this section we formalise the problem and our approach, called Crowdacred.
        </p>
        <p>
          Schema.org Reviews and Credibility Reviews. Linked Credibility Reviews
(LCR) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a linked data model for composable and explainable misinformation
detection. A Credibility Review (CR) is an extension of the generic Review data
model defined in Schema.org. A Review R can be conceptualised as a tuple
(d, r, p) where R:
- reviews a data item d, via property itemReviewed; d can be any linked-data
node (e.g. an article, claim or social media post);
- assigns a numeric or textual rating r to (some, often implicit, reviewAspect
of) d, via property reviewRating;
- optionally provides provenance information p, e.g. via properties author and
isBasedOn.</p>
        <p>[Fig. 1: (a) Tweet with label and feedback buttons; (b) Credibility Review with explanation.]</p>
        <p>A Credibility Review (CR) is a subtype of Review, defined as a tuple ⟨d, r, c, p⟩,
where the CR:
- r must have reviewAspect credibility and is recommended to be
expressed as a numeric value in the range [-1, 1], qualified with a rating
confidence c (in the range [0, 1]);
- the provenance p is mandatory and must include information about:
the credibility signals (CS) used to derive the credibility rating, which can
be either (i) Reviews for data items relevant to d or (ii) ground credibility
signal (GCS) resources (which are not CRs) in databases curated by a
trusted person or organisation;
the author of the review. The author can be a person, an organisation or a
bot. Bots are automated agents that produce CRs.</p>
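        <p>As an illustration only (this is not part of the LCR specification nor acred's code base), the Review and Credibility Review tuples could be represented as follows; the class and field names simply mirror the Schema.org properties mentioned above.</p>
        <preformat>
# Illustrative Python representation of a Review (d, r, p) and a
# Credibility Review (d, r, c, p). Field names mirror Schema.org properties
# (itemReviewed, reviewRating, isBasedOn); this is not acred's actual data model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Review:
    item_reviewed: str                         # d: the reviewed data item
    rating: float = 0.0                        # r: the (numeric) rating
    review_aspect: str = ""                    # aspect of d being rated
    author: str = ""                           # p: person, organisation or bot
    is_based_on: List["Review"] = field(default_factory=list)  # p: sub-reviews

@dataclass
class CredibilityReview(Review):
    review_aspect: str = "credibility"
    confidence: float = 0.0                    # c: rating confidence in [0, 1]
    # the rating r is recommended to be in the range [-1, 1]
        </preformat>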
        <p>For this paper, the main thing to take into account is that the CR for a
particular data item (e.g. a Tweet) is composed of many "sub-reviews" which
are available by following the provenance relation p. For any specific CRi, we
refer to the overall set of nodes Vi (Reviews, authors, data items and GCS) and
the links between them (Ei) as the Evidence Graph Gi = (Vi, Ei) for CRi.</p>
        <p>Crowdsourcing Review Tasks. A Crowdsourcing Review Task (subsequently
referred to simply as a task) t is defined as a tuple ⟨d, a, o⟩, where d is a data item to
be reviewed by the user; a is the aspect of d that needs to be reviewed; and o is a
set of possible review values. Tasks need to be performed by human users, hence
we require a function f_render which renders the task in a way that a user can
inspect. The user performs the task by inspecting the rendering and selecting
one of the available options, which produces a review of the form (d, ra, pu),
where ra is a rating for aspect a whose ratingValue is one of the options in o.</p>
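        <p>A task and the review it produces could be sketched as follows (again purely illustrative: the class, its methods and the dictionary layout are assumptions, not acred's API).</p>
        <preformat>
# Illustrative sketch of a Crowdsourcing Review Task t = (d, a, o) and of the
# review produced when a user selects one of the options.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ReviewTask:
    item: str                  # d: the data item (or its identifier) to review
    aspect: str                # a: the aspect of d that needs to be reviewed
    options: Tuple[str, ...]   # o: the set of possible review values

    def render(self) -> str:
        # f_render: present the item and the available options to the user
        return f"Please review '{self.item}' for {self.aspect}: " + " / ".join(self.options)

    def perform(self, chosen_option: str, user: str) -> dict:
        # Produces a review (d, r_a, p_u) whose ratingValue is one of the options in o
        assert chosen_option in self.options
        return {"itemReviewed": self.item, "reviewAspect": self.aspect,
                "ratingValue": chosen_option, "author": user}

# Example usage
task = ReviewTask("credibility review of tweet 123", "agreement", ("agree", "disagree"))
print(task.render())
feedback = task.perform("disagree", "some-user")
        </preformat>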
      </sec>
      <sec id="sec-3-2">
        <title>Problem Statement and Overview</title>
        <p>Given an unlabeled data item d and an automatically derived credibility review
for it, CRd = (d, rd, cd, pd) (and therefore its corresponding evidence graph
Gd = (Vd, Ed)), create simple tasks t1, t2, ..., tn which can be performed by
untrained (or minimally trained) workers and which (i) allow us to decide whether rd
is accurate and (ii) if rd is not accurate, identify the sub-reviews Ri ∈ Vd which
directly caused the error. Furthermore, we aim to minimise the number of tasks n.</p>
        <p>In this paper, we propose a two-step method to derive such tasks (a minimal
sketch is given after this list):
1. collect agreement with the overall rating rd;
2. for ratings with high disagreement:
- identify candidate reviews in the evidence graph for rd, and
- derive tasks from the identified candidate reviews.</p>
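        <p>The sketch below ties the two steps together at a high level; the helper functions are placeholders for the steps detailed in the following subsections, and the disagreement threshold is an illustrative assumption.</p>
        <preformat>
# High-level sketch of the proposed two-step Crowdacred loop (not acred code).
def crowdacred(cr, collect_agreement, kept_subgraph, derive_task,
               disagreement_threshold=0.5):
    """cr: a credibility review; returns the crowdsourcing tasks to launch."""
    # Step 1: collect user agreement with the overall rating r_d
    disagreement = collect_agreement(cr)   # fraction of users who disagree
    if disagreement > disagreement_threshold:
        # Step 2: restrict the evidence graph to the kept sub-reviews and
        # derive one task per candidate sub-review
        return [derive_task(sub) for sub in kept_subgraph(cr)]
    return []  # rating is not contested: no tasks needed
        </preformat>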
      </sec>
      <sec id="sec-3-3">
        <title>Capturing Overall Agreement with Credibility Reviews</title>
        <p>In this first step, we generate tasks for users to help us identify CR instances
which have an inaccurate credibility rating. For this, we exploit the explainability
of credibility ratings. We propose the following task:</p>
        <p>Given a user u and a credibility review CRd for data item d, we define
tagreement = ⟨CRd, agreement, oagreement⟩ as a task where the user is shown a
summary of CRd (likely including a rendering of d), and is asked to produce
a rating from oagreement = {agree, disagree}. For this task we consider two specific
rendering functions:
- label maps the values rd and cd onto a credibility label. For example, rd &gt;
0.5 and cd &gt; 0.75 could map to "credible".
- explain generates a more complex textual explanation by following the
provenance information pd (recursively).</p>
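        <p>A possible implementation of the label rendering function is sketched below; only the "credible" case corresponds to the example above, while the other labels and cut-offs are illustrative assumptions.</p>
        <preformat>
# Sketch of the 'label' rendering function. Only the "credible" thresholds
# (r above 0.5 with c above 0.75) come from the example in the text; the
# remaining labels and cut-offs are assumptions for illustration.
def label(rating: float, confidence: float) -> str:
    if confidence > 0.75 and rating > 0.5:
        return "credible"           # example mapping given in the text
    if confidence > 0.75 and -rating > 0.5:
        return "not credible"       # assumption: symmetric negative threshold
    if confidence > 0.75:
        return "uncertain"          # assumption
    return "not verifiable"         # assumption: low-confidence ratings
        </preformat>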
        <p>The result of tagreement is an instance of a Review: (CRd, ragreement, pu). An
example of such a task, using both rendering functions, is shown in Figure 1.</p>
        <p>Although this task is much easier than performing a full fact-check of an
article or claim, it can still be cognitively demanding, and some users may not have
sufficient knowledge about the domain to make an informed decision. Therefore,
we expect this to be a challenging task for most crowdsource workers. As part of
the Co-inform project (https://coinform.eu/), instead of relying on crowdsource workers, we are asking
users of our browser plugin to provide such agreement ratings as an extension
of their daily browsing and news consumption habits. As shown in Fig. 1a, given
sufficient users, a consensus can emerge, enabling detection of erroneous reviews.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Finding Candidate Erroneous Sub-Reviews</title>
        <p>
          Given a credibility review CRd which users have rated as erroneous, in this step
we identify the sub-reviews R1, R2, ..., Rn which have directly contributed to the
final rating and confidence in CRd. Recall that pd provides provenance
information that can be used. In acred, the relevant provenance is implemented by
providing a list of sub-reviews via the property isBasedOn. This list contains
references to all the signals taken into account to derive the rating, but in many
cases the majority of these signals are discarded via aggregation functions (e.g.
selecting the sub-review with the highest confidence or with the lowest credibility
rating [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]). Therefore, we propose to define two disjoint subproperties of isBasedOn:
isBasedOnDiscarded and isBasedOnKept.
        </p>
        <p>Using these new subproperties we can define a subgraph Gkept of Gd, which
contains only those nodes which can be linked to the final CRd via isBasedOnKept
edges. To illustrate this idea, Figure 2a shows an example of a full evidence graph,
while Figure 2b shows only the kept subgraph for the same credibility review. As
can be seen from the figures, this step greatly reduces the number of candidate
sub-reviews, while also ensuring that those reviews directly contributed to the
final (presumably erroneous) rating.</p>
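        <p>A minimal sketch of how the kept subgraph could be collected, assuming each review exposes its kept and discarded sub-reviews as lists (the dictionary layout and traversal are illustrative, not acred's implementation):</p>
        <preformat>
# Illustrative traversal collecting the sub-reviews reachable from the
# top-level credibility review via isBasedOnKept edges only.
def kept_subgraph(review):
    """review: dict with optional 'isBasedOnKept'/'isBasedOnDiscarded' lists."""
    kept, stack = [], [review]
    while stack:
        node = stack.pop()
        for sub in node.get("isBasedOnKept", []):
            kept.append(sub)
            stack.append(sub)  # follow the kept sub-review's own evidence
        # 'isBasedOnDiscarded' sub-reviews are ignored: they did not
        # contribute to the final (presumably erroneous) label
    return kept
        </preformat>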
      </sec>
      <sec id="sec-3-5">
        <title>Defining Crowdsourcing Tasks</title>
        <p>
          Now that we have identified a small number of sub-reviews which directly
influence the final credibility rating, we can use crowdsourcing to identify which
steps contributed erroneous evidence. Although we could define user agreement
tasks for the individual steps, we can get more actionable information by asking
the users more specific questions. For this, we need to define custom tasks
for each step in acred. Preliminary error analyses in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] showed that most of the
errors were caused by the linking steps, therefore we discuss three specific types
of Reviews used in acred and how to derive crowdsourcing tasks for them.
        </p>
        <p>
          SentenceCheckworthinessReview determines whether a Sentence is
checkworthy or not. This is the case when the sentence is both factual (i.e. not an
opinion or question) and verifiable (someone can, in principle, find out whether
the sentence is accurate or not). We derive a task tcheckworthy where ocheckworthy =
{checkworthy, notFactual, notVerifiable}. Table 1 shows an example rendering
(and expected answer), based on the sub-reviews in Figures 2b and 1b.
        </p>
        <p>Help us to detect if a sentence contains a factual claim
Do you think the following sentence contains a factual claim?
- "The vast amounts of money made and stolen by China from the United States,
year after year, for decades, will and must STOP."
☐ Yes, and the claim can be verified
☐ Yes, but nobody could verify it
☐ No</p>
        <p>Table 1: Example SentenceCheckworthinessReview Task</p>
        <p>
          SentenceSimilarityReview assigns a similarity score to a pair of sentences
⟨sa, sb⟩; in acred, this is implemented via a RoBERTa model that has been fine-tuned on
STS-B [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which has in part been derived from previous semantic similarity tasks [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. There are existing crowdsourcing tasks defined for this [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], including
instructions and a rating schema, which we can reuse to define tsentenceSimilarity =
⟨d, sentenceSimilarity, osentenceSimilarity⟩. The schema osentenceSimilarity consists of
a scale of 6 values ranging from 0 (the two sentences are completely dissimilar)
to 5 (the two sentences are completely equivalent, as they mean the same thing).
See Table 2 for an example.
        </p>
        <p>Help us to detect how similar two sentences are
Choose one of the options that describes the semantic similarity grade between the
following pair of sentences.
- "The vast amounts of money made and stolen by China from the United States,
year after year, for decades, will and must STOP."
- "The US still supplies much more goods from China and the EU than vice versa."
The two sentences are:
☐ completely equivalent, as they mean the same thing
☐ mostly equivalent, but some unimportant details differ
☐ roughly equivalent, but some important information differs/is missing
☐ not equivalent, but share some details
☐ not equivalent, but are on the same topic
☐ on different topics</p>
        <p>Table 2: Example SentenceSimilarityReview Task</p>
        <p>
          SentenceStanceReview assigns a stance label describing the relation between
a pair of sentences; in acred, this is implemented via another RoBERTa model that has been
fine-tuned on FNC-1 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Although there are many existing datasets [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for this
problem, they differ in their target labels. We find the FNC-1 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] labels (agree,
disagree, discuss and unrelated) provide a good balance, as other datasets are often
missing a label for the unrelated case. Also, the FNC-1 labels have the
advantage that they describe symmetric relations (although this is arguable
for discuss), while other datasets use asymmetric relations like query.
Therefore we define tasks tsentenceStance = ⟨d, sentenceStance, osentenceStance⟩ where
osentenceStance = {agree, disagree, discuss, unrelated}. Table 3 shows an example
of such a task.
        </p>
        <p>Help us to better understand the relation between two sentences
Choose one of the options that describes the relation between the
following sentences.
- "The vast amounts of money made and stolen by China from the United States,
year after year, for decades, will and must STOP."
- "The US still supplies much more goods from China and the EU than vice versa."
The two sentences:
☐ agree with each other
☐ disagree with each other
☐ discuss the same issue
☐ are unrelated</p>
        <p>Table 3: Example SentenceStanceReview Task</p>
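        <p>The three review types and their option sets could be mapped to tasks as in the sketch below; the dictionary-based representation and the derive_task helper are assumptions for illustration, with the option labels taken from Tables 1-3.</p>
        <preformat>
# Illustrative mapping from kept sub-reviews to crowdsourcing tasks t = (d, a, o).
TASK_OPTIONS = {
    "SentenceCheckworthinessReview": ("checkworthy", "notFactual", "notVerifiable"),
    "SentenceSimilarityReview": ("0", "1", "2", "3", "4", "5"),  # 0 = dissimilar .. 5 = equivalent
    "SentenceStanceReview": ("agree", "disagree", "discuss", "unrelated"),
}

def derive_task(sub_review):
    """sub_review: dict with a '@type' and the sentence(s) it reviewed."""
    review_type = sub_review["@type"]
    return {
        "itemReviewed": sub_review["itemReviewed"],  # sentence or sentence pair
        "reviewAspect": review_type.replace("Review", ""),
        "options": TASK_OPTIONS[review_type],
    }
        </preformat>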
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Summary and Future Work</title>
      <p>
        In this paper, we presented Crowdacred, a method for extending Linked
Credibility Reviews to be able to crowdsource (i) the detection of inaccurate credibility
reviews, (ii) the error analysis of erroneous reviews and (iii) the generation of
realistic sample data for the NLP subtasks needed for accurate misinformation detection.
We are currently implementing the proposed method on top of acred [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and plan
to run initial crowdsourcing experiments to validate the approach. The
validation study will be based on a core set of (a few dozen) users from Co-inform
and a larger pool of crowdsource workers. If successful, we aim to be able to
produce new datasets of content in the wild on specific topics such as COVID-19.
      </p>
      <p>
        Acknowledgements. This work was supported by the European Commission under grant
770302 (Co-Inform) as part of the Horizon 2020 research and innovation
programme.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
          </string-name>
          , W.:
          <article-title>*SEM 2013 shared task: Semantic textual similarity</article-title>
          .
          <source>In: Second Joint Conference on Lexical and Computational Semantics (*SEM)</source>
          . pp.
          <fpage>32</fpage>
          –
          <lpage>43</lpage>
          . Association for Computational Linguistics, Atlanta, Georgia, USA (Jun
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Babakar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <source>The State of Automated Factchecking. Tech. rep. (</source>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez-Gazpio</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , L.:
          <article-title>SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation</article-title>
          .
          <source>In: Proc. of the 10th International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>14</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Denaux</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Perez</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Linked Credibility Reviews for Explainable Misinformation Detection</article-title>
          . In: 19th International Semantic Web Conference (nov
          <year>2020</year>
          ), https://arxiv.org/abs/2008.12742
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hassan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , G.,
          <string-name>
            <surname>Arslan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caraballo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gawsane</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joseph</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nayak</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sable</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tremayne</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>ClaimBuster: The first-ever end-to-end fact-checking system</article-title>
          .
          <source>In: Proceedings of the VLDB Endowment</source>
          . vol.
          <volume>10</volume>
          , pp.
          <fpage>1945</fpage>
          –
          <lpage>1948</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
          </string-name>
          , V.:
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          .
          <source>Tech. rep. (</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arslan</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devasier</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Obembe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims</article-title>
          (Feb
          <year>2020</year>
          ), http://arxiv.org/abs/2002.07725
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pomerleau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The fake news challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Schiller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daxenberger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Stance Detection Benchmark: How Robust Is Your Stance Detection?</article-title>
          (jan
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Shu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahudeswaran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Liu, H.:
          <article-title>FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media</article-title>
          .
          <source>Tech. rep. (</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Shu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukherjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Awadallah</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruston</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Liu, H.:
          <article-title>Leveraging Multi-Source Weak Social Supervision for Early Detection of Fake News (</article-title>
          <year>2020</year>
          ), http://arxiv.org/abs/2004.01732
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>