A Hybrid Human-In-The-Loop Framework for Fact Checking

David La Barbera, Kevin Roitero and Stefano Mizzaro
University of Udine, Via Delle Scienze 206, Udine, Italy

Abstract
Online misinformation poses a serious threat to modern society. Assessing the veracity of online information is a complex problem which is nowadays addressed by relying heavily on trained fact-checking experts. This solution is not scalable and, due also to the importance of the problem, the issue has gained the attention of the scientific community, which has proposed many AI-based automatic solutions. Despite these efforts, the effectiveness of such approaches is not yet sufficient to allow them to be used without supervision. In this position paper, we propose a hybrid human-in-the-loop framework for fact-checking: we address the misinformation issue by relying on a combination of automatic AI methods, crowdsourcing, and experts. We study the single components of the framework as well as their interactions, and we propose an interleaving of the different components which we believe will serve as a useful starting point for future research towards effective and scalable fact-checking.

Keywords
Misinformation, Human-in-the-loop, Artificial Intelligence

1. Introduction

Modern times have highlighted the centrality of the threat posed to modern society by fake news and misinformation. Traditionally, misinformation detection is a slow and costly process carried out solely by expert trained fact-checkers, who cannot cope with the ever-increasing amount of information shared online every day. To address this issue, researchers are developing automatic techniques to identify misinformation at scale, and significant efforts have been made to develop fast and scalable state-of-the-art Artificial Intelligence (AI) algorithms [2, 3, 4].
Another, less traditional approach to tackle this issue is to take advantage of the wisdom of the crowd [5] and leverage crowdsourcing workers [6, 7, 8, 9, 10, 11, 12, 13]. Both approaches have pros and cons: while AI is usually cheaper and more scalable, crowd workers can perform more reliable and explainable classifications. To take the best from both worlds, researchers have proposed hybrid Human-In-The-Loop (HITL) approaches that integrate AI, crowd, and experts, even though only few implementations exist [14, 15, 16, 17]. Differently from previous work [17], in this paper we propose a concrete architecture for fact-checking, and we inspect the responsibilities of each component as well as their interactions. In particular, we detail a pragmatic workflow which should be implemented to effectively classify the veracity of a set of statements at scale.

NL4AI 2022: Sixth Workshop on Natural Language for Artificial Intelligence, November 30, 2022, Udine, Italy [1]
david.labarbera@uniud.it (D. La Barbera); kevin.roitero@uniud.it (K. Roitero); stefano.mizzaro@uniud.it (S. Mizzaro)
https://kevinroitero.com/ (K. Roitero); http://users.dimi.uniud.it/~stefano.mizzaro/ (S. Mizzaro)
ORCID: 0000-0002-8215-5502 (D. La Barbera); 0000-0002-9191-3280 (K. Roitero)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Related Work

There are numerous examples both of AI techniques for misinformation detection [2] and of academic interest in their development and evaluation [18]. Many different AI approaches exist: Ozbay and Alatas [3] tested 23 supervised AI algorithms on public datasets; Zhao et al.
[4] integrated linguistic, topic, sentiment, and behavioral features to develop a model for health misinformation; Stammbach and Neumann [19] used evidence retrieval techniques and fine-tuned a BERT-based model for the FEVER challenge; and Konstantinovskiy et al. [20] developed a pipeline to identify misinformation using a multi-task learning approach. Relatedly, many approaches have addressed the issue of credibility in social media [21].

Focusing on misinformation detection using crowdsourcing, La Barbera et al. [7] first found an effect of judgment scales and evidence of worker assessors' bias on political statements; Soprano et al. [11] used the dataset from Roitero et al. [8] to leverage a multidimensional scale measuring different aspects of a statement; Draws et al. [13] found that workers generally overestimate truthfulness and that different types of workers show different biases when evaluating a given statement; Pennycook and Rand [6] used the crowd to study the effects of reducing social media users' exposure to low-quality news; and Allen et al. [12] compared accuracy ratings between fact-checkers and crowd workers.

Finally, some work has investigated the combination of AI and humans: Demartini et al. [17] introduced a theoretical hybrid HITL framework for misinformation; Qu et al. [22] used self-reported scores from both AI and crowd to develop a hybrid system; Shabani et al. [14] used humans to provide feedback on news stories about statement contextual information and integrated those features into an AI pipeline; and Yang et al. [15] showed how the fact-checking process can potentially be sped up by organizing and selecting representative statements.

3. Limitations of Current Approaches

As highlighted by Demartini et al. [17, Figure 2], each of the three state-of-the-art approaches for misinformation detection (i.e., experts, AI tools, and crowd) has its own advantages and disadvantages in terms of accuracy, scale, cost, explainability, and bias control.
We detail these aspects in this section, focusing on the limitations of each approach. Certainly AI tools outperform both crowd and experts when considering costs (while training language models from scratch can cost up to millions of dollars [23], once trained they can be used multiple times leveraging few- or zero-shot learning [24, 25]) and evaluation speed but, despite recent work [26, 27], they provide little or no explainability. More importantly, such models achieve lower accuracy than crowd or experts. To provide some examples, classical machine learning models achieved 74% accuracy on a two-level scale [28], and the best model of this year's CLEF CheckThat! Lab reached 54.7% accuracy on a four-level scale [18]. Considering the accuracy of the crowd, experimental results [12] show a high correlation with the experts in terms of agreement, whereas other work reports accuracy values that are lower and comparable to those obtained by AI methods [7, 8, 9, 10, 11, 13]; although further studies are needed to draw definitive conclusions, it seems reasonable to assume that crowd accuracy can be higher than that of automatic AI solutions. The highest accuracy is achieved by the experts, and it is conventionally set to 1 for practical reasons. Nevertheless, even domain experts need comparison and discussion phases to reach a final consensus (see for example the process used by PolitiFact: https://www.politifact.com/article/2018/feb/12/principles-truth-o-meter-politifacts-methodology-i/).

Bias is also a crucial limitation of current approaches. Experts and crowd workers, being humans, are subject to cognitive biases [13, 7], which can be mitigated by the discussion phase in the case of experts, but are difficult to remove for crowd workers [29]. Moreover, all the aforementioned biases can be propagated from humans to AI models, e.g., when training or fine-tuning a model.
Another limitation of current approaches is given by the specific truthfulness scales used: different scales exist and are used, and such heterogeneity, apart from making a fair comparison difficult, has an impact on the quality of the collected data [7]. We believe that a HITL framework for misinformation detection should address and overcome all of the limitations detailed above by fruitfully combining the capabilities of AI, crowd, and experts.

4. HITL Framework for Misinformation Detection

4.1. Possible Architectures

A natural solution to the task investigated in this paper is to employ a pipeline model where the components are sorted by increasing accuracy (i.e., first the AI, then the crowd, and finally the experts). Thus, if a statement is not adequately classified by a component, the subsequent pipeline component performs a more accurate classification. Such a pipeline also concatenates the components according to their increasing cost and evaluation time. This makes it possible to run a pipeline of annotation tasks where the majority of the statements are quickly and automatically labeled by AI, only a subset of the statements is sent for a slower evaluation to the crowd, and the few remaining statements are sent to experts for an in-depth investigation. The key advantage of this configuration is that it takes the best from each component and minimizes the overall cost. In particular, this configuration lets the experts (i.e., the most costly component) evaluate a very small number of statements. Nevertheless, the pipeline model has important limitations, as it does not provide feedback among the components: a statement is simply forwarded until it is eventually classified, with little cooperation among the components.

Another possible combination of the components is by means of a blackboard architecture, a common solution in distributed multi-agent settings [30].
Such an approach allows the components to select which statements to evaluate. Each component is an autonomous agent that can access a central repository containing both the statements and the partial contributions provided by each component. This approach would require both a high synergy between the components and the ability to split a classification task into atomic sub-tasks to take advantage of each specific component of the architecture.

Figure 1: Overview of the proposed framework. (A statement flows through the AI tools, crowd workers, and experts components; each component produces a classification and a confidence score, forwards the statement to the next component when needed, and provides feedback, until the final classification is produced.)

4.2. General Framework

An ideal framework should maximize accuracy while minimizing the cost of each component and strengthening the cooperation between and within its modules. Therefore, we first propose a basic framework where each component provides feedback to, and cooperates with, the others. We then discuss possible variants and extensions. Our proposal is summarized in Figure 1. Given a statement, each of the three components (AI, crowd, and experts) generates a classification on a chosen scale and a confidence score for the performed classification. Whenever the AI or crowd component generates a prediction with a high confidence score, the statement is considered correctly classified. Otherwise, if the confidence is low, the statement is forwarded to the subsequent component. In this case, the output of the component (such as the confidence score and the classification) can optionally be forwarded along with the statement. This could allow the subsequent component to perform an informed assessment, if necessary.
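To make the workflow concrete, the confidence-based forwarding just described can be sketched as a short program. The component names, the confidence thresholds, and the rate at which high-confidence classifications are sampled for downstream double-checking are hypothetical placeholders chosen for illustration, not values prescribed by the framework.

```python
import random

# Hypothetical thresholds: a statement whose confidence is below the
# threshold of a component is forwarded to the next, more accurate one.
THRESHOLDS = {"ai": 0.90, "crowd": 0.75}
AUDIT_RATE = 0.05  # fraction of confident classifications double-checked downstream

def classify_statement(statement, components):
    """Route a statement through AI -> crowd -> experts.

    `components` maps a component name to a callable that takes the
    statement and the partial outputs collected so far, and returns a
    (label, confidence) pair; experts are assumed to always be final.
    """
    context = []  # partial outputs forwarded along with the statement
    for name in ("ai", "crowd", "experts"):
        label, confidence = components[name](statement, context)
        context.append({"component": name, "label": label, "confidence": confidence})
        threshold = THRESHOLDS.get(name, 0.0)  # experts always accept
        if confidence >= threshold:
            # Occasionally forward even a confident classification, to let a
            # later component catch "unknown unknowns" (confident but wrong).
            if name != "experts" and random.random() < AUDIT_RATE:
                continue
            return {"statement": statement, "final": label, "trace": context}
    return {"statement": statement, "final": context[-1]["label"], "trace": context}
```

In this sketch the `trace` carries each component's classification and confidence forward, so a later component can perform an informed assessment and, symmetrically, its corrections can be fed back to re-train earlier components.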
Also, samples of statements considered correctly classified by a component (i.e., with high confidence) should be propagated, to double-check their classification score and to deal, using humans, with the problem of unknown unknowns (i.e., statements for which the AI is highly confident about its predictions but is wrong) [31, 32, 33]. This allows each component to provide feedback to the previous ones, thereby improving their classifications. In the following sections we detail, for each component, its possible internal structure, its specific interactions with the other components, and the additional outputs that can be added to the general framework.

4.3. First Component: AI

Assuming the use of a state-of-the-art model for misinformation detection [28, 2, 19, 3, 4, 34, 18], the output provided by the AI component should include at least a classification score on a chosen truthfulness scale and a confidence score. While the classification score is straightforward, the confidence can be reliably calibrated following the methodology by Guo et al. [35]. To provide an adequate classification, the AI tool can rely on a Knowledge Base (KB) to perform evidence retrieval. Examples of such systems are those proposed by La Barbera et al. [36] and Stammbach and Neumann [19], which both use a transformer architecture relying on retrieved evidence. The choice of the KB to use to produce a classification and an explanation is not straightforward, since there is no evidence of a "universally best" KB [37]. Thus, the choice of the specific KB should be performed ad hoc by leveraging statement- and domain-specific features, such as the topic, speaker, or year of the set of statements being processed. To evaluate the classification score given by the component, optional outputs can be used. For example, many AI models are able to provide reasons for their predictions [26, 27]. Some implementations are delivered by Kazemi et al. [38] and by Brand et al.
[39], who developed models able to generate an explanation for their misinformation assessment. The generation of an explanation could improve the framework by providing additional, human-readable information useful both for the subsequent human-based components and for the final classification. Finally, the AI component could provide self-feedback by using counterfactual explanations [40]: generating instances that the model finds hard to classify, or deceiving, could improve the model's performance, robustness, and generalization abilities.

The output of the AI component is thus made up of the classification, the confidence, and optional information such as the explanation and the retrieved evidence. The decision as to whether the statement has been adequately classified can then be made by relying on the confidence of the model [22], as detailed in Section 4.2. To support this decision, the optional explanation could also be used, for example by considering its readability or semantic scores. For some statements the decision might be more critical and less straightforward: a very recent statement made by an important public figure on a highly relevant topic, with not much evidence available, might be worth further investigation. Hence, it might be worth studying the effectiveness of an importance score computed from the statement's metadata. Finally, if the assessment for the statement has a low confidence, the explanation is not satisfactory, or the assessment needs to be refined for any other reason, the statement is sent to the subsequent component: the crowd.

4.4. Second Component: Crowd

As with the AI, the crowd component should perform two tasks: misinformation classification, and providing feedback to itself and to the AI component. There are many examples of misinformation classification performed directly by the crowd [6, 7, 8, 11, 12, 13]. It could also be reasonable to perform an informed assessment relying on the output of the AI component [41].
Nevertheless, the use of this additional information could introduce biases into the assessment performed by the crowd; hence, further studies in this direction are required. Moreover, to reduce workers' cognitive effort, it is possible to design a two-step task using disjoint sets of workers: the first set searches for evidence for a given statement, and the second classifies the statement using the provided evidence (and additional data). While all of the tasks mentioned are indeed reasonable, ad-hoc studies are necessary to find the best possible setting. Along this line, we can leverage work done in related fields [42] to identify the subset of best workers and exploit their features, so as to minimize the workforce needed while maximizing its effectiveness. Also, the crowd can be asked to provide additional rationales to motivate their classifications [43, 44]. The classifications can be used to improve the AI component by fine-tuning the models with additional data, or both worker and AI rationales can even be used to adjust the confidence of the final assessment; nevertheless, this should be implemented with caution, as worker rationales might contain biases that can be involuntarily injected into AI models. Finally, a subset of crowd workers should look for counterfactual examples that could highlight AI classification errors made with high confidence. While these methodologies still need to be tested in the field of misinformation detection, some work [45] shows the promising results of this approach applied to different domains. As with the AI component, the output of the crowd component is composed of the default classification and confidence, along with optional additional data such as evidence, explanations, and rationales.
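As a minimal illustration of how such a crowd output could be produced, the individual worker judgments might be aggregated into a single label with an agreement-based confidence, which is then lowered when the crowd disagrees with the AI component; both the majority-vote rule and the disagreement penalty below are hypothetical choices used only for illustration.

```python
from collections import Counter

def aggregate_crowd(judgments, ai_label=None, disagreement_penalty=0.8):
    """Aggregate worker judgments into a (label, confidence) pair.

    `judgments` is a list of labels collected from distinct workers; the
    confidence is the fraction of workers agreeing with the majority label.
    If the AI component's label is available and disagrees with the crowd,
    the confidence is scaled down by a hypothetical penalty factor, in the
    spirit of checking for crowd/AI inconsistencies before accepting a
    classification.
    """
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(judgments)
    if ai_label is not None and ai_label != label:
        confidence *= disagreement_penalty
    return label, confidence
```

For example, five workers judging a statement as ["false", "false", "true", "false", "false"] with a concordant AI label would yield the label "false" with confidence 0.8, whereas a discordant AI label would lower that confidence and make forwarding to the experts more likely.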
Therefore, to decide whether a statement is correctly classified, it is possible to rely not only on the data generated by the crowd, but also to check for agreement and inconsistencies between crowd and AI [22]. At this point of the evaluation, the majority of the statements have been classified by the framework, and only a very small subset reaches the final step of the workflow: the experts.

4.5. Third Component: Experts

The last step of the framework is carried out by the experts. It is possible to let them evaluate a statement using a pre-defined fact-checking methodology and, ideally, to provide them with all the outputs from the previous components so that they can perform an informed assessment. The effects of such a decision need to be studied since, as discussed for the crowd, the use of additional information could introduce bias into the final evaluation. We remark that we believe that critical, important, and difficult statements should always be evaluated, or at least checked, by the experts. Note that to identify those statements it would be necessary to find a metric able to automatically evaluate the importance of a statement in a given context. Also, to increase the robustness of the framework, the experts should be able to directly look at the statements classified by the previous components and to decide whether some of them need to be re-assessed. Finally, each classification performed by the experts should be used to re-train the AI models, and as an example to train the crowd before performing the task. This final aspect could also be performed interactively, following an active-learning scenario.

5. Conclusions

In this work we study the limitations of the current approaches for misinformation detection and propose a hybrid HITL framework that combines AI, crowd, and experts.
Our main contributions are the following: we frame the problem and review the related work, detailing frameworks for fact-checking; we study possible framework architectures, detailing their respective advantages and disadvantages; and we propose a solid architecture for performing fact-checking at scale, describing each component with a focus on its role and outputs, as well as its interactions with the other components. The main advantages of our framework are given by an efficient combination of the components in terms of increasing accuracy and evaluation time, decreasing costs, and by the feedback between and within each component. Future work aims at providing a full framework implementation. In more detail, further study will be devoted to the synergies between crowd and AI, to investigate the effects of an informed assessment made by the crowd leveraging AI outputs, and to set the thresholds used to decide about statement forwarding among components.

References

[1] D. Nozza, L. Passaro, M. Polignano, Preface to the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI), in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), November 30, 2022, CEUR-WS.org, 2022.
[2] B. Guo, Y. Ding, L. Yao, Y. Liang, Z. Yu, The Future of Misinformation Detection: New Perspectives and Trends, 2019. doi:10.48550/ARXIV.1909.03654.
[3] F. A. Ozbay, B. Alatas, Fake news detection within online social media using supervised artificial intelligence algorithms, Physica A: Statistical Mechanics and its Applications 540 (2020) 123174. doi:10.1016/j.physa.2019.123174.
[4] Y. Zhao, J. Da, J. Yan, Detecting health misinformation in online health communities: Incorporating behavioral features into machine learning based approaches, Information Processing & Management 58 (2021) 102390.
    doi:10.1016/j.ipm.2020.102390.
[5] J. Surowiecki, The Wisdom of Crowds, Anchor, 2005.
[6] G. Pennycook, D. G. Rand, Fighting misinformation on social media using crowdsourced judgments of news source quality, Proceedings of the National Academy of Sciences 116 (2019) 2521–2526. doi:10.1073/pnas.1806781116.
[7] D. La Barbera, K. Roitero, D. Spina, S. Mizzaro, G. Demartini, Crowdsourcing Truthfulness: The Impact of Judgment Scale and Assessor Bias, in: Proceedings of the 42nd European Conference on Information Retrieval, ECIR, Springer, 2020, pp. 207–214.
[8] K. Roitero, M. Soprano, S. Fan, D. Spina, S. Mizzaro, G. Demartini, Can The Crowd Identify Misinformation Objectively? The Effects of Judgment Scale and Assessor’s Background, in: Proceedings of the 43rd Conference on Research and Development in Information Retrieval, SIGIR, ACM, 2020, pp. 439–448.
[9] K. Roitero, M. Soprano, B. Portelli, D. Spina, V. Della Mea, G. Serra, S. Mizzaro, G. Demartini, The COVID-19 Infodemic: Can the Crowd Judge Recent Misinformation Objectively?, in: Proceedings of the 29th Conference on Information & Knowledge Management, CIKM, ACM, 2020, pp. 1305–1314. doi:10.1145/3340531.3412048.
[10] K. Roitero, M. Soprano, B. Portelli, M. Luise, D. Spina, V. Della Mea, G. Serra, S. Mizzaro, G. Demartini, Can the Crowd Judge Truthfulness? A Longitudinal Study on Recent Misinformation about COVID-19, Personal and Ubiquitous Computing (2021). doi:10.1007/s00779-021-01604-6.
[11] M. Soprano, K. Roitero, D. La Barbera, D. Ceolin, D. Spina, S. Mizzaro, G. Demartini, The many dimensions of truthfulness: Crowdsourcing misinformation assessments on a multidimensional scale, Information Processing & Management 58 (2021) 102710. doi:10.1016/j.ipm.2021.102710.
[12] J. Allen, A. A. Arechar, G. Pennycook, D. G. Rand, Scaling up fact-checking using the wisdom of crowds, Science Advances 7 (2021) eabf4393. doi:10.1126/sciadv.abf4393.
[13] T. Draws, D. La Barbera, M. Soprano, K. Roitero, D. Ceolin, A.
Checco, S. Mizzaro, The Effects of Crowd Worker Biases in Fact-Checking Tasks, in: Conference on Fairness, Accountability, and Transparency, FAccT, ACM, 2022, pp. 2114–2124. doi:10.1145/3531146.3534629.
[14] S. Shabani, Z. Charlesworth, M. Sokhn, H. Schuldt, SAMS: Human-in-the-loop Approach to Combat the Sharing of Digital Misinformation, CEUR Workshop Proceedings 2846 (2021).
[15] J. Yang, D. Vega-Oliveros, T. Seibt, A. Rocha, Scalable Fact-checking with Human-in-the-Loop, in: IEEE Workshop on Information Forensics and Security, WIFS, 2021, pp. 1–6. doi:10.1109/WIFS53200.2021.9648388.
[16] G. Karagiannis, M. Saeed, P. Papotti, I. Trummer, Scrutinizer: A Mixed-Initiative Approach to Large-Scale, Data-Driven Claim Verification, CoRR abs/2003.06708 (2020).
[17] G. Demartini, S. Mizzaro, D. Spina, Human-in-the-loop Artificial Intelligence for Fighting Online Misinformation: Challenges and Opportunities, Bulletin of IEEE Computer Society 43 (2020) 65–74.
[18] P. Nakov, A. Barrón-Cedeño, G. da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, M. Wiegand, M. Siegel, J. Köhler, Overview of the CLEF–2022 CheckThat! Lab on Fighting the COVID-19 Infodemic and Fake News Detection, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer, 2022, pp. 495–520.
[19] D. Stammbach, G. Neumann, Team DOMLIN: Exploiting evidence enhancement for the FEVER shared task, in: Proceedings of the 2nd Workshop on Fact Extraction and VERification, FEVER, ACL, 2019, pp. 105–109. doi:10.18653/v1/D19-6616.
[20] L. Konstantinovskiy, O. Price, M. Babakar, A. Zubiaga, Toward Automated Factchecking: Developing an Annotation Schema and Benchmark for Consistent Automated Claim Detection, Digital Threats 2 (2021).
    doi:10.1145/3412869.
[21] M. Viviani, G. Pasi, Credibility in social media: opinions, news, and health information—a survey, WIREs Data Mining and Knowledge Discovery 7 (2017) e1209. doi:10.1002/widm.1209.
[22] Y. Qu, D. La Barbera, K. Roitero, S. Mizzaro, D. Spina, G. Demartini, Combining Human and Machine Confidence in Truthfulness Assessment, Journal of Data and Information Quality (2022). doi:10.1145/3546916.
[23] O. Sharir, B. Peleg, Y. Shoham, The Cost of Training NLP Models: A Concise Overview, arXiv (2020).
[24] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[25] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large Language Models are Zero-Shot Reasoners, in: Workshop on Knowledge Retrieval and Language Models, ICML, 2022.
[26] P. Atanasova, J. G. Simonsen, C. Lioma, I. Augenstein, Generating Fact Checking Explanations, CoRR abs/2004.05773 (2020).
[27] N. Kotonya, F. Toni, Explainable Automated Fact-Checking: A Survey, CoRR abs/2011.03870 (2020).
[28] M. Granik, V. Mesyura, Fake news detection using naive Bayes classifier, in: IEEE First Ukraine Conference on Electrical and Computer Engineering, UKRCON, 2017, pp. 900–903. doi:10.1109/UKRCON.2017.8100379.
[29] T. Draws, A. Rieger, O. Inel, U. Gadiraju, N. Tintarev, A Checklist to Combat Cognitive Biases in Crowdsourcing, Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 9 (2021) 48–59.
[30] J. Dong, S. Chen, J.-J. Jeng, Event-based blackboard architecture for multi-agent systems, in: Proceedings of the Conference on Information Technology: Coding and Computing, volume 2 of ITCC, 2005, pp. 379–384. doi:10.1109/ITCC.2005.149.
[31] J. Attenberg, P. Ipeirotis, F. Provost, Beat the Machine: Challenging Humans to Find a Predictive Model’s “Unknown Unknowns”, Journal of Data and Information Quality 6 (2015).
    doi:10.1145/2700832.
[32] H. Lakkaraju, E. Kamar, R. Caruana, E. Horvitz, Identifying Unknown Unknowns in the Open World: Representations and Policies for Guided Exploration, in: Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI, AAAI Press, 2017, pp. 2124–2132.
[33] A. Liu, S. Guerra, I. Fung, G. Matute, E. Kamar, W. Lasecki, Towards Hybrid Human-AI Workflows for Unknown Unknown Detection, in: Proceedings of The Web Conference, WWW, ACM, 2020, pp. 2432–2442. doi:10.1145/3366423.3380306.
[34] B. Taboubi, M. A. B. Nessir, H. Haddad, iCompass at CheckThat! 2022: Combining deep language models for fake news detection, Working Notes of CLEF (2022).
[35] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On Calibration of Modern Neural Networks, in: Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML, JMLR.org, 2017, pp. 1321–1330.
[36] D. La Barbera, K. Roitero, J. Mackenzie, D. Spina, G. Demartini, S. Mizzaro, BUM at CheckThat! 2022: A Composite Deep Learning Approach to Fake News Detection using Evidence Retrieval, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, CLEF, 2022, pp. 564–572.
[37] D. Stammbach, B. Zhang, E. Ash, The Choice of Knowledge Base in Automated Claim Checking, CoRR abs/2111.07795 (2021). arXiv:2111.07795.
[38] A. Kazemi, Z. Li, V. Pérez-Rosas, R. Mihalcea, Extractive and Abstractive Explanations for Fact-Checking and Evaluation of News, CoRR abs/2104.12918 (2021).
[39] E. Brand, K. Roitero, M. Soprano, A. Rahimi, G. Demartini, A Neural Model to Jointly Predict and Explain Truthfulness of Statements, Journal of Data and Information Quality (2022). doi:10.1145/3546917.
[40] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. F. Christiano, G. Irving, Fine-Tuning Language Models from Human Preferences, CoRR abs/1909.08593 (2019).
[41] C. Snijders, R. Conijn, E. Fouw, K.
Berlo, Humans and Algorithms Detecting Fake News: Effects of Individual and Contextual Confidence on Trust in Algorithmic Advice, International Journal of Human-Computer Interaction (2022) 1–12. doi:10.1080/10447318.2022.2097601.
[42] H. Li, Q. Liu, Cheaper and Better: Selecting Good Workers for Crowdsourcing, Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 3 (2015) 20–21.
[43] T. McDonnell, M. Lease, M. Kutlu, T. Elsayed, Why Is That Relevant? Collecting Annotator Rationales for Relevance Judgments, in: Proceedings of the 4th Conference on Human Computation and Crowdsourcing, volume 4 of HCOMP, 2016, pp. 139–148.
[44] M. Kutlu, T. McDonnell, M. Lease, T. Elsayed, Annotator Rationales for Labeling Tasks in Crowdsourcing, Journal of Artificial Intelligence Research 69 (2020) 143–189. doi:10.1613/jair.1.12012.
[45] J. X. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, Y. Qi, TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP, 2020. doi:10.48550/ARXIV.2005.05909.