A Hybrid Human-In-The-Loop Framework for Fact Checking

David La Barbera, david.labarbera@uniud.it, University of Udine, Via Delle Scienze 206, Udine, Italy

Kevin Roitero, kevin.roitero@uniud.it, University of Udine, Via Delle Scienze 206, Udine, Italy

Stefano Mizzaro, stefano.mizzaro@uniud.it, University of Udine, Via Delle Scienze 206, Udine, Italy

Keywords: Misinformation, Human-in-the-loop, Artificial Intelligence

ORCID: 0000-0002-8215-5502 (D. La Barbera); 0000-0002-9191-3280 (K. Roitero)

CEUR Workshop Proceedings, ISSN 1613-0073

Online misinformation poses a serious threat to modern society. Assessing the veracity of online information is a complex problem that is currently addressed by relying heavily on trained fact-checking experts. This solution does not scale and, also because of the importance of the problem, the issue has gained the attention of the scientific community, which has proposed many AI-based automatic solutions. Despite these efforts, such approaches are not yet effective enough to be used without supervision. In this position paper, we propose a hybrid human-in-the-loop framework for fact-checking: we address the misinformation issue by combining automatic AI methods, crowdsourcing, and experts. We study the individual components of the framework as well as their interactions, and we propose an interleaving of the components that we believe will serve as a useful starting point for future research towards effective and scalable fact-checking.

Introduction

Recent years have highlighted how serious a threat fake news and misinformation pose to modern society. Traditionally, misinformation detection is a slow and costly process carried out solely by trained expert fact-checkers, who cannot cope with the ever-increasing amount of information shared online every day. To address this issue, researchers are developing automatic techniques to identify misinformation at scale, and significant efforts have been made to develop fast and scalable state-of-the-art Artificial Intelligence (AI) algorithms [2,3,4]. Another, less traditional approach is to take advantage of the wisdom of the crowd [5] and leverage crowdsourcing workers [6,7,8,9,10,11,12,13]. Both approaches have pros and cons: while AI is usually cheaper and more scalable, crowd-workers can produce more reliable and explainable classifications. To take the best from both worlds, researchers have proposed hybrid Human-In-The-Loop (HITL) approaches that integrate AI, crowd, and experts, even though only a few implementations exist [14,15,16,17]. Differently from previous work [17], in this paper we propose a concrete architecture for fact-checking, and we inspect the responsibilities of each component as well as their interactions. In particular, we detail a pragmatic workflow that should be implemented to effectively classify the veracity of a set of statements at scale.

Related Work

Numerous AI techniques for misinformation detection exist [2], and there is considerable academic interest in their development and evaluation [18]. Approaches vary widely: Ozbay and Alatas [3] tested 23 supervised AI algorithms on public datasets; Zhao et al. [4] integrated linguistic, topic, sentiment, and behavioral features to develop a model for health misinformation; Stammbach and Neumann [19] used evidence retrieval techniques and fine-tuned a BERT-based model for the FEVER challenge; and Konstantinovskiy et al. [20] developed a pipeline to identify misinformation using a multi-task learning approach. Relatedly, many approaches have addressed the issue of credibility in social media [21].

Focusing on misinformation detection using crowdsourcing, La Barbera et al. [7] first found an effect of judgment scales and evidence of assessor bias on political statements; Soprano et al. [11] used the dataset from Roitero et al. [8] to leverage a multidimensional scale that measures different aspects of a statement; Draws et al. [13] found that workers generally overestimate truthfulness and that different types of workers show different biases when evaluating a given statement; Pennycook and Rand [6] used the crowd to study the effects of reducing social media users' exposure to low-quality news; and Allen et al. [12] compared the accuracy ratings of fact-checkers and crowd-workers.

Finally, some work has investigated the combination of AI and humans: Demartini et al. [17] introduced a theoretical hybrid HITL framework for misinformation; Qu et al. [22] used self-reported scores from both AI and crowd to develop a hybrid system; Shabani et al. [14] used humans to provide feedback on news stories about statement contextual information and integrated those features into an AI pipeline; and Yang et al. [15] showed the potential speed-up of the fact-checking process obtained by organizing and selecting representative statements.

Limitations of Current Approaches

As highlighted by Demartini et al. [17, Figure 2], each of the three state-of-the-art approaches to misinformation detection (i.e., experts, AI tools, and crowd) has its own advantages and disadvantages in terms of accuracy, scale, cost, explainability, and bias control. We detail these aspects in this section, focusing on the limitations of each approach.

AI tools certainly outperform both crowd and experts in terms of cost¹ and evaluation speed but, despite recent work [26,27], they provide little or no explainability. More importantly, such models achieve lower accuracy than crowd or experts. For example, classical machine learning models achieved 74% accuracy on a two-level scale [28], and the best model of this year's CLEF CheckThat! Lab reached 54.7% accuracy on a four-level scale [18]. As for crowd accuracy, experimental results [12] show a high correlation with the experts in terms of agreement, whereas other work reports accuracy values that are lower and comparable to those obtained by AI methods [7,8,9,10,11,13]; although further studies are needed to draw definitive conclusions, it seems reasonable to assume that crowd accuracy can be higher than that of automatic AI solutions. The highest accuracy is attributed to the experts: for practical reasons, it is always set to 1. Nevertheless, even domain experts need confrontation and discussion phases to reach a final consensus (see for example the process used by PolitiFact²).

Bias is also a crucial limitation of current approaches. Experts and crowd-workers, being human, are subject to cognitive biases [13,7], which can be mitigated by the discussion phase in the case of experts but are difficult to remove for crowd-workers [29]. Moreover, all these biases can be propagated from humans to AI models, e.g., when training or fine-tuning a model.

Another limitation of current approaches lies in the specific truthfulness scales used: different scales exist and are used, and this heterogeneity, apart from making a fair comparison difficult, has an impact on the quality of the collected data [7].

We believe that a HITL framework for misinformation detection should address and overcome all of the limitations detailed above by fruitfully combining the capabilities of AI, crowd, and experts.

HITL Framework for Misinformation Detection

Possible Architectures

A natural solution to the task investigated in this paper is a pipeline model in which the components are sorted by increasing accuracy (i.e., first the AI, then the crowd, and finally the experts). Thus, if a statement is not adequately classified by a component, the subsequent component will perform a more accurate classification. Such a pipeline also orders the components by increasing cost and evaluation time. The result is a sequence of annotation tasks in which the majority of the statements are quickly and automatically labeled by AI, only a subset of the statements is sent to the crowd for a slower evaluation, and the few remaining statements are sent to experts for an in-depth investigation.

The key advantage of this configuration is that it takes the best from each component while minimizing the overall cost. In particular, it lets the experts (i.e., the most costly component) evaluate only a very small number of statements. Nevertheless, the pipeline model has an important limitation: it does not provide feedback among the components. A statement is simply forwarded until it is eventually classified, with little cooperation among the components.
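The pipeline model can be illustrated with a minimal sketch. The `Stage` type, the toy classifiers, and the threshold values are all hypothetical and chosen only for illustration; each stage is assumed to return a label together with a confidence in [0, 1], and the last stage (the experts) accepts every statement that reaches it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A component maps a statement to (label, confidence in [0, 1]).
Component = Callable[[str], Tuple[str, float]]


@dataclass
class Stage:
    name: str
    classify: Component
    threshold: float  # hypothetical acceptance threshold for this stage


def run_pipeline(statement: str, stages: List[Stage]) -> Tuple[str, str]:
    """Return (label, name of the stage that settled the statement)."""
    if not stages:
        raise ValueError("the pipeline needs at least one stage")
    for stage in stages:
        label, confidence = stage.classify(statement)
        if confidence >= stage.threshold:
            return label, stage.name
    # No stage was confident enough: the decision of the last stage is final.
    return label, stage.name


# Toy stand-ins for the three components (assumptions, not real models).
def toy_ai(statement: str) -> Tuple[str, float]:
    return ("false", 0.95) if "flat" in statement else ("true", 0.55)


def toy_crowd(statement: str) -> Tuple[str, float]:
    return ("false", 0.80) if "vaccine" in statement else ("true", 0.50)


def toy_experts(statement: str) -> Tuple[str, float]:
    return ("true", 1.0)


STAGES = [
    Stage("ai", toy_ai, threshold=0.9),
    Stage("crowd", toy_crowd, threshold=0.7),
    Stage("experts", toy_experts, threshold=0.0),
]
```

Note that nothing flows backwards in this sketch: a stage never sees the outcome of a later stage, which is exactly the lack of feedback discussed above.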

Another possible way to combine the components is a blackboard architecture, a common solution in distributed multi-agent settings [30]. Such an approach allows the components to select which statements to evaluate. Each component is an autonomous agent that can access a central repository containing both the statements and the partial contributions provided by each component. This approach would require both a high synergy between the components and the splitting of a classification task into atomic sub-tasks, to take advantage of each specific component of the architecture.
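A blackboard of this kind could be sketched as follows. This is only an illustration under stated assumptions: the `Blackboard`, `Contribution`, and `Agent` classes, the sub-task names, and the rule that an agent waits for a prerequisite sub-task are hypothetical, and the actual work of an agent is replaced by a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Contribution:
    agent: str
    subtask: str  # e.g. "evidence" or "label" (hypothetical sub-task names)
    value: str


@dataclass
class Blackboard:
    """Central repository holding statements and partial contributions."""
    statements: Dict[int, str] = field(default_factory=dict)
    contributions: Dict[int, List[Contribution]] = field(default_factory=dict)

    def add_statement(self, sid: int, text: str) -> None:
        self.statements[sid] = text
        self.contributions[sid] = []

    def post(self, sid: int, contribution: Contribution) -> None:
        self.contributions[sid].append(contribution)

    def missing(self, subtask: str) -> List[int]:
        """Ids of statements for which nobody has contributed `subtask` yet."""
        return [sid for sid, entries in self.contributions.items()
                if all(c.subtask != subtask for c in entries)]


class Agent:
    """Autonomous agent that picks the statements it is able to work on."""

    def __init__(self, name: str, subtask: str, requires: Optional[str] = None):
        self.name, self.subtask, self.requires = name, subtask, requires

    def work(self, text: str, entries: List[Contribution]) -> str:
        # Placeholder for the real work (model inference, crowd task, ...).
        return f"{self.subtask} by {self.name}"

    def step(self, board: Blackboard) -> int:
        """Contribute wherever possible; return the number of contributions."""
        done = 0
        for sid in board.missing(self.subtask):
            entries = board.contributions[sid]
            if self.requires and all(c.subtask != self.requires for c in entries):
                continue  # prerequisite not on the blackboard yet
            value = self.work(board.statements[sid], entries)
            board.post(sid, Contribution(self.name, self.subtask, value))
            done += 1
        return done
```

For instance, an agent created as `Agent("crowd", "label", requires="evidence")` would skip a statement until some other agent has posted evidence for it, which is one way to realize the split into atomic sub-tasks.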

General Framework

An ideal framework should maximize accuracy while minimizing the cost of each component and strengthening the cooperation between and within its modules. Therefore, we propose first a basic framework, where each component provides feedback to, and cooperates with, the others. We then discuss possible variants and extensions.

Our proposal is summarized in Figure 1. Given a statement, each of the three components (AI, crowd, and experts) generates a classification on a chosen scale and a confidence score for that classification. Whenever the AI or crowd component generates a prediction with a high confidence score, the statement is considered correctly classified. Otherwise, if the confidence is low, the statement is forwarded to the subsequent component. In this case, the output of the component (such as the confidence score and the classification) can optionally be forwarded along with the statement, which could allow the subsequent component to perform an informed assessment, if necessary. In addition, samples of the statements considered correctly classified by a component (i.e., with high confidence) should also be propagated, so that humans can double-check their classification and deal with the problem of unknown unknowns (i.e., statements for which the AI is highly confident about its predictions but is wrong) [31,32,33]. This allows each component to provide feedback to the previous ones, thereby improving their classifications.
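The routing rule described here can be written down as a short sketch. The names `Assessment`, `route`, and `is_unknown_unknown`, as well as the `threshold` and `audit_rate` parameters, are assumptions introduced for illustration; how to set the thresholds is left open by the framework.

```python
import random
from dataclasses import dataclass


@dataclass
class Assessment:
    component: str      # "ai", "crowd", or "experts"
    label: str          # value on the chosen truthfulness scale
    confidence: float   # confidence in [0, 1]


def route(assessment: Assessment, threshold: float,
          audit_rate: float, rng: random.Random) -> str:
    """Decide what happens to a statement after a component has assessed it.

    "forward": low confidence, send the statement to the next component.
    "audit":   high confidence, but sampled for a double check downstream.
    "accept":  high confidence, considered correctly classified.
    """
    if assessment.confidence < threshold:
        return "forward"
    if rng.random() < audit_rate:
        return "audit"
    return "accept"


def is_unknown_unknown(earlier: Assessment, later: Assessment,
                       threshold: float) -> bool:
    """True when a confident earlier assessment is contradicted downstream."""
    return earlier.confidence >= threshold and earlier.label != later.label
```

Statements flagged by `is_unknown_unknown` are natural candidates for the feedback channel, for example as additional training data for the earlier component.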

In the following sections we will detail for each component: its possible internal structure, its specific interactions with other components, and additional outputs that can be added to the general framework.

First Component: AI

Assuming the use of a state-of-the-art model for misinformation detection [28,2,19,3,4,34,18], the output provided by the AI component should include at least a classification score on a chosen truthfulness scale and a confidence score. While the classification score is straightforward, the confidence can be reliably calibrated following the methodology by Guo et al. [35]. To provide an adequate classification, the AI tool can rely on a Knowledge Base (KB) to perform evidence retrieval.
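As an illustration of confidence calibration, the sketch below implements temperature scaling, one of the methods studied by Guo et al. [35]: the logits are divided by a single scalar T fitted on held-out data by minimizing the negative log-likelihood. For simplicity, T is found here with a grid search rather than a gradient-based optimizer, and the function names, the grid, and the toy validation data are assumptions.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
        temperature: float) -> float:
    """Average negative log-likelihood of the true labels at a temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows: Sequence[Sequence[float]],
                    labels: Sequence[int],
                    grid: Optional[Sequence[float]] = None) -> float:
    """Pick the temperature on a grid that minimizes validation NLL."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5, 0.55, ..., 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: the model is about 98% confident but only 75% accurate.
VAL_LOGITS = [[4.0, 0.0]] * 4
VAL_LABELS = [0, 0, 0, 1]
T = fit_temperature(VAL_LOGITS, VAL_LABELS)
calibrated_confidence = max(softmax([4.0, 0.0], T))
```

On this toy data the fitted temperature is larger than 1, so the calibrated confidence drops from roughly 0.98 towards the observed accuracy of 0.75, while the predicted label is unchanged.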

Examples of such a system are the ones proposed by La Barbera et al. [36] and Stammbach and Neumann [19], both of which use a transformer architecture that relies on retrieved evidence. The choice of the KB used to produce a classification and an explanation is not straightforward, since there is no evidence of a "universally best" KB [37]. Thus, the specific KB should be chosen ad hoc by leveraging statement- and domain-specific features, for example the topic, speaker, and year of the set of statements being processed.

To evaluate the classification score given by the component, we can use optional outputs. For example, many AI models are able to provide reasons for their predictions [26,27]. Some implementations are provided by Kazemi et al. [38] and by Brand et al. [39], who develop models able to generate an explanation for their misinformation assessment. Generating an explanation could improve the framework by providing additional, human-readable information that is useful for both the subsequent human-based components and the final classification.

Finally, the AI component could provide self-feedback by using counterfactual explanations [40]: generating instances that the model finds hard to classify or that deceive it could improve the model's performance, robustness, and generalization abilities.

The output of the AI component thus consists of a classification, a confidence, and optional information such as an explanation and the retrieved evidence. The decision whether the statement has been adequately classified can then be made by relying on the confidence of the model [22], as detailed in Section 4.2. The optional explanation could also support this decision, for example by considering its readability or semantic scores. For some statements the decision might be more critical and less straightforward: a very recent statement made by an important public figure on a highly relevant topic, with little evidence available, might be worth further investigation. Hence, it might be worth studying the effectiveness of an importance score based on the statement's metadata.

Finally, if the assessment for the statement has a low confidence, the explanation is not satisfactory, or the assessment needs to be refined for any other reason, the statement is sent to the subsequent component: the crowd.

Second Component: Crowd

As for the AI, the crowd component should perform two tasks: classifying misinformation and providing feedback to itself and to the AI component. There are many examples of misinformation classification directly performed by the crowd [6,7,8,11,12,13]. It could also be reasonable to perform an informed assessment relying on the output of the AI component [41]. Nevertheless, the use of this additional information could introduce biases into the assessment performed by the crowd, hence further studies in this direction are required. Moreover, to reduce the workers' cognitive effort, it is possible to design a two-step task using disjoint sets of workers: the first set searches for evidence for a given statement, and the second classifies the statement using the provided evidence (and additional data). While all of the tasks mentioned are reasonable, ad-hoc studies are necessary to find the best possible setting. Along this line, we can leverage work done in related fields [42] to identify the subset of best workers and exploit their features, so as to minimize the workforce needed while maximizing its effectiveness.

Also, the crowd can be asked to provide additional rationales to motivate their classification [43,44]. The classifications can be used to improve the AI component by fine-tuning the models with additional data, and both worker and AI rationales can even be used to adjust the confidence of the final assessment; nevertheless, this should be implemented with caution, as workers' rationales might contain bias that can be involuntarily injected into AI models. Finally, a subset of crowd-workers should look for counterfactual examples that could highlight errors made by the AI with high confidence. While these methodologies still need to be tested in the field of misinformation detection, some work [45] shows promising results for this approach in different domains.

As for the AI component, the output of the crowd component consists of the default classification and confidence, along with optional additional data such as evidence, explanation, and rationales. Therefore, to decide whether a statement is correctly classified, it is possible to rely not only on the data generated by the crowd, but also on checks for agreement and inconsistencies between crowd and AI [22].
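One simple way to turn individual judgments into a crowd classification and confidence, and to combine them with the AI output, is sketched below. Majority voting, the use of the majority share as a confidence proxy, the threshold value, and the rule that a crowd/AI inconsistency triggers forwarding are all assumptions made for illustration, not choices prescribed by the framework.

```python
from collections import Counter
from typing import List, Optional, Tuple


def aggregate_crowd(judgments: List[str]) -> Tuple[str, float]:
    """Majority label and the share of workers who chose it.

    The share is used as a simple, agreement-based confidence proxy.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    label, votes = Counter(judgments).most_common(1)[0]
    return label, votes / len(judgments)


def crowd_decision(judgments: List[str], ai_label: Optional[str],
                   threshold: float = 0.7) -> Tuple[str, str]:
    """Accept the crowd label only if workers agree and the AI is consistent.

    `ai_label` is the optional label forwarded by the AI component; pass
    None when the crowd works without the AI output.
    """
    label, agreement = aggregate_crowd(judgments)
    consistent_with_ai = ai_label is None or ai_label == label
    if agreement >= threshold and consistent_with_ai:
        return label, "accept"
    return label, "forward_to_experts"
```

A more refined aggregation could weight workers by their estimated quality or use their self-reported confidence [22]; the unweighted majority is used here only to keep the sketch short.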

At this point of the evaluation, the majority of the statements have been classified by the framework, and only a very small subset will reach the final step of the workflow: the experts.

Third Component: Experts

The last step of the framework is performed by the experts. They can evaluate a statement using a pre-defined fact-checking methodology and, ideally, receive all the outputs of the previous components to perform an informed assessment. The effects of such a decision need to be studied since, as discussed for the crowd, the use of additional information could introduce bias into the final evaluation. We believe that critical, important, and difficult statements should always be evaluated, or at least checked, by the experts. Note that identifying those statements would require a metric that automatically evaluates the importance of a statement in a given context. Also, to increase the robustness of the framework, the experts should be able to directly inspect the statements classified by the previous components and decide whether some of them need to be re-assessed. Finally, each classification performed by the experts should be used to re-train the AI models and as an example to train the crowd before they perform the task. This final aspect could also be performed interactively, following an active-learning scenario.
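The active-learning scenario mentioned above could, for instance, rely on least-confidence sampling, a standard strategy in which the statements the model is least sure about are sent to the experts first. The function name and the `budget` parameter below are hypothetical, and the framework does not commit to this particular selection strategy.

```python
from typing import List, Tuple


def select_for_experts(pool: List[Tuple[str, float]], budget: int) -> List[str]:
    """Pick the `budget` statements with the lowest model confidence.

    `pool` holds (statement, confidence) pairs produced by the AI component;
    the expert labels collected for the selected statements can then be
    added to the training data before the next re-training round.
    """
    ranked = sorted(pool, key=lambda item: item[1])
    return [statement for statement, _ in ranked[:budget]]
```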

Conclusions

In this work we study the limitations of the current approaches for misinformation detection and propose a hybrid HITL framework that combines AI, crowd, and experts. Our main contributions are the following: we frame the problem and review the related work detailing frameworks for fact-checking; we study possible framework architectures detailing their respective advantages and disadvantages; we propose a solid architecture for performing fact-checking at scale, and we describe each component focusing on its role and outputs, as well as its interactions with other components. The main advantages of our framework are given by an efficient combination of the components in terms of increasing accuracy and evaluation time, decreasing costs, and by the feedback between and within each component.

Future work aims at providing a full implementation of the framework. In more detail, we will further study the synergies between crowd and AI, to investigate the effects of an informed assessment made by the crowd leveraging AI outputs and to set the thresholds that decide when a statement is forwarded from one component to the next.

Figure 1: Overview of the proposed framework.

¹ While training language models from scratch can cost up to millions of dollars [23], once trained they can be used multiple times leveraging few- or zero-shot learning [24, ].

² https://www.politifact.com/article/2018/feb/12/principles-truth-o-meter-politifacts-methodology-i/#Truth-O-Meter%20ratings

References

[1] D. Nozza, L. Passaro, M. Polignano, Preface to the sixth workshop on natural language for artificial intelligence (NL4AI), in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), November 30, 2022, CEUR-WS.org, 2022.

[2] B. Guo, Y. Ding, L. Yao, Y. Liang, Z. Yu, The Future of Misinformation Detection: New Perspectives and Trends, 2019. doi:10.48550/ARXIV.1909.03654.

[3] F. A. Ozbay, B. Alatas, Fake news detection within online social media using supervised artificial intelligence algorithms, Physica A: Statistical Mechanics and its Applications 540 (2020) 123174. doi:10.1016/j.physa.2019.123174.

[4] Y. Zhao, J. Da, J. Yan, Detecting health misinformation in online health communities: Incorporating behavioral features into machine learning based approaches, Information Processing & Management 58 (2021) 102390. doi:10.1016/j.ipm.2020.102390.

[5] J. Surowiecki, The Wisdom of Crowds, Anchor, 2005.

[6] G. Pennycook, D. G. Rand, Fighting misinformation on social media using crowdsourced judgments of news source quality, Proceedings of the National Academy of Sciences 116 (2019). doi:10.1073/pnas.1806781116.

[7] D. La Barbera, K. Roitero, D. Spina, S. Mizzaro, G. Demartini, Crowdsourcing Truthfulness: The Impact of Judgment Scale and Assessor Bias, in: Proceedings of the 42nd European Conference on Information Retrieval, Springer, 2020.

[8] K. Roitero, M. Soprano, S. Fan, D. Spina, S. Mizzaro, G. Demartini, Can The Crowd Identify Misinformation Objectively? The Effects of Judgment Scale and Assessor's Background, in: Proceedings of the 43rd Conference on Research and Development in Information Retrieval, SIGIR, ACM, 2020.

[9] K. Roitero, M. Soprano, B. Portelli, D. Spina, V. Della Mea, G. Serra, S. Mizzaro, G. Demartini, The COVID-19 Infodemic: Can the Crowd Judge Recent Misinformation Objectively?, in: Proceedings of the 29th Conference on Information & Knowledge Management, ACM, 2020. doi:10.1145/3340531.3412048.

[10] K. Roitero, M. Soprano, B. Portelli, M. Luise, D. Spina, V. Della Mea, G. Serra, S. Mizzaro, G. Demartini, Can the Crowd Judge Truthfulness? A Longitudinal Study on Recent Misinformation about COVID-19, Personal and Ubiquitous Computing (2021). doi:10.1007/s00779-021-01604-6.

[11] M. Soprano, K. Roitero, D. La Barbera, D. Ceolin, D. Spina, S. Mizzaro, G. Demartini, The many dimensions of truthfulness: Crowdsourcing misinformation assessments on a multidimensional scale, Information Processing & Management 58 (2021) 102710. doi:10.1016/j.ipm.2021.102710.

[12] J. Allen, A. A. Arechar, G. Pennycook, D. G. Rand, Scaling up fact-checking using the wisdom of crowds, Science Advances 7 (2021). doi:10.1126/sciadv.abf4393.

[13] T. Draws, D. La Barbera, M. Soprano, K. Roitero, D. Ceolin, A. Checco, S. Mizzaro, The Effects of Crowd Worker Biases in Fact-Checking Tasks, in: Conference on Fairness, Accountability, and Transparency, FAccT, ACM, 2022. doi:10.1145/3531146.3534629.

[14] S. Shabani, Z. Charlesworth, M. Sokhn, H. Schuldt, SAMS: Human-in-the-loop Approach to Combat the Sharing of Digital Misinformation, CEUR Workshop Proc. 2846 (2021).

[15] J. Yang, D. Vega-Oliveros, T. Seibt, A. Rocha, Scalable Fact-checking with Human-in-the-Loop, in: IEEE Workshop on Information Forensics and Security, WIFS, 2021. doi:10.1109/WIFS53200.2021.9648388.

[16] G. Karagiannis, M. Saeed, P. Papotti, I. Trummer, Scrutinizer: A Mixed-Initiative Approach to Large-Scale, Data-Driven Claim Verification, CoRR abs/2003.06708 (2020).

[17] G. Demartini, S. Mizzaro, D. Spina, Human-in-the-loop Artificial Intelligence for Fighting Online Misinformation: Challenges and Opportunities, Bulletin of IEEE Computer Society 43 (2020).

[18] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF 2022 CheckThat! Lab on Fighting the COVID-19 Infodemic and Fake News Detection, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer, 2022.

[19] D. Stammbach, G. Neumann, Team DOMLIN: Exploiting evidence enhancement for the FEVER shared task, in: Proceedings of the 2nd Workshop on Fact Extraction and VERification, FEVER, ACL, 2019. doi:10.18653/v1/D19-6616.

[20] L. Konstantinovskiy, O. Price, M. Babakar, A. Zubiaga, Toward Automated Factchecking: Developing an Annotation Schema and Benchmark for Consistent Automated Claim Detection, Digital Threats 2 (2021). doi:10.1145/3412869.

[21] M. Viviani, G. Pasi, Credibility in social media: opinions, news, and health information-a survey, WIREs Data Mining and Knowledge Discovery 7 (2017) e1209. doi:10.1002/widm.1209.

[22] Y. Qu, D. La Barbera, K. Roitero, S. Mizzaro, D. Spina, G. Demartini, Combining Human and Machine Confidence in Truthfulness Assessment, Data and Information Quality (2022). doi:10.1145/3546916.

[23] O. Sharir, B. Peleg, Y. Shoham, The Cost of Training NLP Models: A Concise Overview, arXiv, 2020.

[24] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020).

[25] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large Language Models are Zero-Shot Reasoners, in: Workshop on Knowledge Retrieval and Language Models, ICML, 2022.

[26] P. Atanasova, J. G. Simonsen, C. Lioma, I. Augenstein, Generating Fact Checking Explanations, CoRR abs/2004.05773 (2020).

[27] N. Kotonya, F. Toni, Explainable Automated Fact-Checking: A Survey, CoRR abs/2011.03870 (2020).

[28] M. Granik, V. Mesyura, Fake news detection using naive Bayes classifier, in: IEEE First Ukraine Conference on Electrical and Computer Engineering, UKRCON, 2017. doi:10.1109/UKRCON.2017.8100379.

[29] T. Draws, A. Rieger, O. Inel, U. Gadiraju, N. Tintarev, A Checklist to Combat Cognitive Biases in Crowdsourcing, in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 9, 2021.

[30] J. Dong, S. Chen, J.-J. Jeng, Event-based blackboard architecture for multi-agent systems, in: Proceedings of the Conference on Information Technology: Coding and Computing, volume 2 of ITCC, 2005. doi:10.1109/ITCC.2005.149.

[31] J. Attenberg, P. Ipeirotis, F. Provost, Beat the Machine: Challenging Humans to Find a Predictive Model's "Unknown Unknowns", J. Data and Information Quality 6 (2015). doi:10.1145/2700832.

[32] H. Lakkaraju, E. Kamar, R. Caruana, E. Horvitz, Identifying Unknown Unknowns in the Open World: Representations and Policies for Guided Exploration, in: Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI Press, 2017.

[33] A. Liu, S. Guerra, I. Fung, G. Matute, E. Kamar, W. Lasecki, Towards Hybrid Human-AI Workflows for Unknown Unknown Detection, in: Proceedings of The Web Conference, WWW, ACM, 2020. doi:10.1145/3366423.3380306.

[34] B. Taboubi, M. A. B. Nessir, H. Haddad, iCompass at CheckThat! 2022: combining deep language models for fake news detection, in: Working Notes of CLEF, 2022.

[35] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On Calibration of Modern Neural Networks, in: Proceedings of the 34th Conference on Machine Learning, volume 70, JMLR, 2017.

[36] D. La Barbera, K. Roitero, J. Mackenzie, D. Spina, G. Demartini, S. Mizzaro, BUM at Check-That! 2022: A Composite Deep Learning Approach to Fake News Detection using Evidence Retrieval, in: Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF, 2022.

[37] D. Stammbach, B. Zhang, E. Ash, The Choice of Knowledge Base in Automated Claim Checking, arXiv:2111.07795, 2021.

[38] A. Kazemi, Z. Li, V. Pérez-Rosas, R. Mihalcea, Extractive and Abstractive Explanations for Fact-Checking and Evaluation of News, CoRR abs/2104.12918 (2021).

[39] E. Brand, K. Roitero, M. Soprano, A. Rahimi, G. Demartini, A Neural Model to Jointly Predict and Explain Truthfulness of Statements, Data and Information Quality (2022). doi:10.1145/3546917.

[40] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. F. Christiano, G. Irving, Fine-Tuning Language Models from Human Preferences, CoRR abs/1909.08593 (2019).

[41] C. Snijders, R. Conijn, E. Fouw, K. Berlo, Humans and Algorithms Detecting Fake News: Effects of Individual and Contextual Confidence on Trust in Algorithmic Advice, Journal of Human-Computer Interaction (2022). doi:10.1080/10447318.2022.2097601.

[42] H. Li, Q. Liu, Cheaper and Better: Selecting Good Workers for Crowdsourcing, in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 3, 2015.

[43] T. McDonnell, M. Lease, M. Kutlu, T. Elsayed, Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments, in: Proceedings of the 4th Conference on Human Computation and Crowdsourcing, volume 4, HCOMP, 2016.

[44] M. Kutlu, T. McDonnell, M. Lease, T. Elsayed, Annotator Rationales for Labeling Tasks in Crowdsourcing, Journal of Artificial Intelligence Research 69 (2020). doi:10.1613/jair.1.12012.

[45] J. X. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, Y. Qi, TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP, 2020. doi:10.48550/ARXIV.2005.05909.