VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge

Jacopo Gili¹, Viviana Patti¹,†, Lucia Passaro²,† and Tommaso Caselli³,†

¹ Department of Computer Science, University of Turin, Italy
² Department of Computer Science, University of Pisa, Italy
³ CLCG, University of Groningen, The Netherlands

† These authors contributed equally.
Contact: jacopo.gili584@edu.unito.it (J. Gili); viviana.patti@unito.it (V. Patti); lucia.passaro@unipi.it (L. Passaro); t.caselli@rug.nl (T. Caselli)
ORCID: 0009-0007-1343-3760 (J. Gili); 0000-0001-5991-370X (V. Patti); 0000-0003-4934-5344 (L. Passaro); 0000-0003-2936-0256 (T. Caselli)

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Achieving factual accuracy is a known open issue for language models. Their design, centered on the interactive component of user interaction and the extensive use of “spontaneous” training data, has made them highly adept at conversational tasks but not fully reliable in terms of factual correctness. VeryfIT addresses this issue by evaluating the in-memory factual knowledge of language models on data written by professional fact-checkers, posing it as a true or false question. Topics of the statements vary, but most are in specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining appropriate contextual relevance both temporally and factually, and ultimately verifying the accuracy of the statements.

Keywords
fact checking, benchmark, factual knowledge, Italian, fake news, CALAMITA, CheckIT!

1. Challenge: Introduction and Motivation

The pollution of the information ecosystem by means of misleading or false information has reached unprecedented levels at a global scale. This has been possible thanks to a combination of multiple factors, among which are the collapse of (local and national) journalism; an increasing sense of distrust in science and evidence-based facts; and the presence of computational amplification tools such as bots [1, 2]. In this sense, the rise of Large Language Models (LLMs), with the constant increase of their performance, has introduced both opportunities and challenges in the fight against misinformation: while LLMs possess the capability to generate coherent and contextually relevant text, they also pose risks by potentially producing deceptive misinformation at scale [3, 4].

Testing factual and common sense knowledge in LLMs has been a common, although not easy, task. It mostly involves multiple-choice question answering, a method that is easy to automate and not prone to ambiguity, and it spans a wide range of academic and professional domains such as mathematics, medicine, history, law, general knowledge and many others [5, 6, 7, 8, 9, 10, 11, 12].

Developing benchmarks to test the ability of LLMs to accurately evaluate factual knowledge is more relevant than ever, considering the ease of access of these tools to non-experts for any purpose (entertainment, education, professional settings) and the increasing integration of these technologies in everyday activities.

Notably, most of these tasks and corresponding benchmarks are in English, with other languages being represented through machine-translated data or no data at all. This is true for Italian too. For instance, SQuAD-IT [13] is a machine-translated version of the SQuAD dataset [14], and it is the reference for evaluating models on QA tasks.

While machine translation has been constantly improving, it can still easily introduce artefacts in the output text that impair naturalness and correctness. Moreover, translated data can suffer from a loss of nuance and context, as translations may not capture cultural nuances or contextual meanings, leading to misunderstandings or misinterpretations in the target language: certain phrases or idioms may not have direct equivalents in other languages, and linguistic constructions typical of the source language may be over-represented [15].

By using data from a professional fact-checking agency¹ we can test knowledge memorization of LMs and to what extent intra-memory conflicts, resulting in “hallucinations”, arise. Furthermore, doing so using Italian data centered around the Italian and European contexts ensures that the LMs' functionalities are tested directly in Italian.

¹ Data have been obtained from Pagella Politica.

This task is based on CheckIT! [16], a resource of expert fact-checked claims designed to fill a gap for the development of AI-assisted fact-checking pipelines for Italian.

2. Challenge: Description

The challenge is a binary classification task in a zero-shot setting: for each atomic statement, any LM is asked to determine its factuality with respect to the time it was uttered by answering only with one of the two labels, “Vero” (true) or “Falso” (false). A third label for half-true statements could easily have been kept, as it was already part of the dataset from which the data is sourced, but in this first stage we opted for the binary setting so as to limit task complexity.

Some cases in the dataset exhibit complexities due to the combination of multiple pieces of information within a single claim, which can affect the final determination of veracity. For instance, consider the following scenario:

| Original claim | Translation |
|---|---|
| «Se è vero che oltre l’82% dei morti da Covid hanno più di 70 anni, non si capisce perché meno della metà degli over 80 sia stato vaccinato finora» | «If it is true that over 82% of Covid deaths are over 70 years old, it is not clear why less than half of those over 80 have been vaccinated so far» |

Table 1: Example of a claim.

This statement combines two pieces of information:

1. Out of all the deceased due to the Covid19 pandemic, 82% are people over 70 years old.
2. Less than half of the citizens over 80 years old had been administered at least one dose of vaccine against Covid19.

This example also highlights the importance of incorporating the appropriate temporal context in the verification process. Factual information, especially involving statistics or reports about the state of the world, evolves over time, and failing to account for this can invalidate the conclusions drawn by experts. Although more complex statements require a broader knowledge base, language models have by now shown comprehension abilities well above this level and should not be hindered by it.

3. Data description

The VeryfIT dataset consists of 2,021 claims taken from CheckIT! [16]. Not all claims were included due to the binary format of the task, as VeryfIT classifies claims as either “Vero” [True] or “Falso” [False], whereas CheckIT! recognizes an intermediate “Ni” [Half true] label. As a result, all claims with the “half-true” verdict were discarded.

Furthermore, we considered it pertinent to the task to also provide a smaller subset of claims, “VeryfIT_small”, balanced on the political orientation of the politician speaking: misinformation can occur on all topics, but in political misinformation each side of the political spectrum has its own more widespread topics and recurrent formulations.

Additionally, an annotation task was carried out on the VeryfIT_small subset to clarify statements whose level of ambiguity would have proven detrimental to the task: around 12% of the statements have an alternative version “enriched” with information vital to the task. We will refer to them as “enriched statements” (subsection 3.2).

In conclusion, two versions of the dataset are available: VeryfIT (2,021 claims) and VeryfIT_small (352 claims, of which 43 have an enriched version).

3.1. Creation of VeryfIT_small

The first step in building the balanced subset was to exclude around 400 of the 2,021 claims of VeryfIT for which information about the political orientation of the speaker was not available.

We then mapped, using Wikipedia as a source, the political orientation of the parties (and thus of the authors of the claims at the moment of the remark) into seven fine-grained, commonly recognized political categories: far-left, left, center-left, center, center-right, right, far-right. The list of all the parties and their corresponding political orientation is reported in Table 2.

An additional eighth label, ‘transverse’, was added to indicate a non-precise placement in the political spectrum. This label includes one party (“Movimento 5 Stelle”), members of the Italian institutions above political parties (e.g. the President of the Republic), and experts not affiliated to any political party or political coalition, such as members of a technical government².

² https://en.wikipedia.org/wiki/Technocratic_government_(Italy)

| Political party | Orientation label |
|---|---|
| Alleanza Verdi e Sinistra | left |
| Alternativa Popolare | center-right |
| Articolo Uno | center-left |
| Azione | center |
| Coraggio Italia | center-right |
| Europa Verde | left |
| Forza Italia | right |
| Fratelli d’Italia | far-right |
| Impegno Civico | center |
| Indipendente | transverse |
| Italexit | far-right |
| Italia Viva | center |
| Lega Nord | far-right |
| Liberi e uguali | left |
| Movimento 5 Stelle | transverse |
| Nuovo Centro Destra | center-right |
| Partito Democratico | center-left |
| Più Europa | center |
| Popolo della Libertà | right |
| Possibile | left |
| Radicali Italiani | center |
| Scelta Civica | center |
| Sinistra Ecologia Libertà | left |
| Sinistra italiana | left |
| Tecnico | transverse |

Table 2: VeryfIT data: Italian political parties and their orientation.

At first glance, the Italian political spectrum may appear only slightly unbalanced. Despite the absence of a far-left representation, the distribution of parties across the spectrum is relatively symmetrical. Out of the 23 political parties in the data, six are from the left, two from the center-left, six from the center, three from the center-right, two from the right, and three from the far-right. However, the distribution of claims is not as well balanced, with a larger number of claims from the right and far-right parties than from the rest, as reported in Table 3.

| Political side | True | False | Total |
|---|---|---|---|
| Left | 44 | 28 | 72 |
| Center-left | 323 | 110 | 433 |
| Center | 105 | 82 | 187 |
| Center-right | 8 | 2 | 10 |
| Right | 79 | 84 | 163 |
| Far-right | 156 | 241 | 397 |
| Transverse | 209 | 146 | 355 |
| total | 924 | 693 | 1,617 |

Table 3: VeryfIT data after exclusion of claims where information about political orientation of the speaker was not available: Distribution of verdict labels (number of claims) in the political spectrum.

To ensure the balance of our benchmark we decided to reduce the label granularity from eight to four, by collapsing the labels far-left, left and center-left into ‘left’ [SX], and far-right, right and center-right into ‘right’ [DX]. The labels center [C] and transverse [T] remained untouched. The re-aggregated coarse-grained labels are reported in Table 4.

| Political side | True | False | Total |
|---|---|---|---|
| Left [SX] | 367 | 138 | 505 |
| Center [C] | 105 | 82 | 187 |
| Right [DX] | 243 | 327 | 570 |
| Transverse [T] | 209 | 146 | 355 |

Table 4: VeryfIT data after exclusion of claims where information about political orientation of the speaker was not available: Distribution of verdict labels (number of claims) in the political spectrum after label collapse.

Although the distribution of verdicts is still unbalanced between the two endpoints (SX and DX), this setting, with the lowest cardinality being 187 (for C), allows us to generate a perfectly balanced dataset along the political orientations. For the first version of VeryfIT_small, each block contributes 88 claims, resulting in a total of 352 entries, with future work planned to expand it.

| Political side | True | False | Total |
|---|---|---|---|
| Left [SX] | 64 [13] | 24 [2] | 88 [15] |
| Center [C] | 46 [4] | 42 [7] | 88 [11] |
| Right [DX] | 40 [4] | 48 [5] | 88 [9] |
| Transverse [T] | 50 [2] | 38 [6] | 88 [8] |
| total | 200 [23] | 152 [20] | 352 [43] |

Table 5: VeryfIT_small: Final distribution of verdict labels (number of claims) in the political spectrum. The numbers in square brackets (highlighted in green in the original layout) count the enriched statements (explained in subsection 3.2).

3.2. Enriched statements

Given the specificity of the statements, many of which require detailed knowledge of topics related to Italian institutions and policies, and the occasional ambiguity arising from their oral nature, the task has been further divided into two sub-tasks with slight data modifications, aimed at adding vital context to statements that were excessively reliant on information external to the statements themselves. The altered statements are limited to around 12% of the VeryfIT_small dataset, as excessive human intervention would undermine the core principle of testing on natural data, aligned with what language models might be asked to handle in real-life scenarios. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the politician speaking or clarifying specific references.
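As a minimal sketch of how the two sub-tasks could be materialized, the snippet below picks the text to submit for one claim. It assumes each record is a Python dict exposing the `annotato`, `statement` and `statement_revised` fields of the data format (Section 3.4), with `annotato` stored as a boolean; the function name `select_statement` and the `use_enriched` flag are hypothetical and not part of the released resource. The example record reuses one of the statement pairs shown in Table 6.

```python
from typing import Any, Mapping


def select_statement(record: Mapping[str, Any], use_enriched: bool) -> str:
    """Return the text to verify for one claim.

    Assumption: `annotato` is a boolean that is True when an enriched version
    exists, and `statement_revised` holds that version (empty otherwise).
    """
    revised = str(record.get("statement_revised") or "").strip()
    # Fall back to the original statement when no enriched version is available.
    if use_enriched and record.get("annotato") and revised:
        return revised
    return str(record["statement"]).strip()


# Illustrative record; only the fields needed by the function are included.
example = {
    "annotato": True,
    "statement": "[Alle europee] io ho battuto Salvini in molti capoluoghi di provincia",
    "statement_revised": (
        "[Alle europee] io [Carlo Calenda] ho battuto Salvini "
        "in molti capoluoghi di provincia."
    ),
}

print(select_statement(example, use_enriched=False))  # original sub-task
print(select_statement(example, use_enriched=True))   # enriched sub-task
```

With this layout, the original sub-task and the enriched sub-task differ only in the value of `use_enriched`, so claims without an enriched version are evaluated identically in both.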
The goal of partially or entirely removing the initial layer of complexity, by simplifying the extraction of the relevant information from the statement for verification, is to obtain a stronger correlation between the benchmark results and the language model’s actual factual knowledge: when working with natural data, the model’s incorrect responses may stem from its difficulty in comprehending the specific information it is being asked to verify, whereas with altered data its responses are more directly influenced by gaps in its knowledge.

Examples of enriched statements are reported in Table 6:

| Original statement | Enriched statement |
|---|---|
| Abbiamo 490 grandi elettori | Gli elettori dell’area di centrosinistra che voteranno per l’elezione del Presidente della Repubblica saranno 490. |
| Oggi in Italia sono 796 quelli che pagano più di 1 milione di euro | Oggi in Italia sono 796 quelli che dichiarano un reddito superiore ad 1 milione di euro. |
| [Alle europee] io ho battuto Salvini in molti capoluoghi di provincia | [Alle europee] io [Carlo Calenda] ho battuto Salvini in molti capoluoghi di provincia. |
| In parlamento stiamo facendo un lavoro che risponde a una prerogativa costituzionale. Certamente si sarebbero tutti auspicati, me compresa, tempi più brevi ma non stiamo perdendo tempo. Stiamo svolgendo un ruolo che ci compete e che la Costituzione da’ al parlamento. | L’elezione dei membri della Corte Costituzionale e del Consiglio Superiore della Magistratura (Csm) è un dovere che la costituzione italiana dà al parlamento. |

Table 6: Comparison of original and enriched statements.

The reasons for enriching the statements in Table 6 all revolve around the lack of pivotal information to determine factuality. The first statement is completely missing its context and contains an unclear term, “grandi elettori” [big voters], which is relatively well known in the political context but could be mistaken for a physical feature or for a consideration regarding the age of voters. The second statement has an unclear formulation, as “pagare” [to pay] does not refer unambiguously to taxes. The third statement is missing the subject. The fourth and last statement is missing part of its context, as “stiamo facendo un lavoro” [we are doing a job] and “stiamo svolgendo un ruolo” [we are playing a role] both refer to a very specific duty of the parliament that is not mentioned directly.

Preliminary results obtained through the chat function of Claude 3.5 Sonnet³ and GPT-4o⁴ show that, respectively, two out of the four statements (Claude) and one out of the four statements (GPT) reported in Table 6 are wrongly classified when presented in the original version, while providing the models with the enriched versions brings the number of correct classifications up to four out of four for both models. These results, however, can only partially prove the effectiveness of enriched statements, as different models presented with a partial context could provide different verdicts, even guessing the right one.

³ https://claude.ai/chat
⁴ https://chatgpt.com/

3.3. Annotation details

During the making of the VeryfIT datasets, it was noticed that not all the statements were actual claims: in articles with multiple claims to check, the ‘statement’ field was filled with a short title summarizing them all, often in the format “[name of the politician] on [topic]”. Regular expressions were used to highlight statements not starting with ‘“’ or ‘«’, the two symbols used to denote a dialogue or part of a speech, and a manual check led to the exclusion of around 170 statements. Moreover, around 30 statements with formats resembling “[name of the politician] is [right/wrong] on [topic]: [statement]” were reformulated as claims by removing hints about the factuality verdict and the author of the statement. A few examples are shown in Table 7.

| Original statement | Reworded statement |
|---|---|
| Giulia Grillo sbaglia: i medici e gli infermieri italiani non sono i meno pagati | i medici e gli infermieri italiani sono i meno pagati |
| Secondo Di Maio il governo investe nelle centrali a carbone, ma è il contrario | Il governo investe nelle centrali a carbone |
| No, per la Corte dei Conti non ci saranno 17 miliardi di nuove tasse | Per la Corte dei Conti ci saranno 17 miliardi di nuove tasse |

Table 7: Examples of reworded statements.

Another important annotation step was producing the enriched statements. A human annotator⁵ reviewed the VeryfIT_small dataset, identifying statements that could benefit from additional context, and produced enriched variations of those statements. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the politician speaking or clarifying anaphoric references.

⁵ All the annotations noted in the report were done by the first author of the paper, a master's student in Computer Science with a background in Natural Language Processing.

The decision to apply this annotation step to the VeryfIT_small subset, instead of the full dataset, is related to the amount of manual work it would have required.

Additionally, another annotation step involved completing the “macro_area” [topic] field for all the 352 entries of VeryfIT_small. Although this field was included in the original dataset, it was missing a value in approximately 15% of the entries. This was done manually, classifying statements into the pre-existing topic labels, which are: ‘questioni sociali’ [social matters], ‘economia’ [economy], ‘esteri’ [foreign affairs], ‘giustizia’ [justice], ‘istituzioni’ [institutions], ‘ambiente’ [environment], ‘altro’ [others]. The new labels were chosen by comparing unlabelled statements with statements that already had a label and by inspecting the contents of the articles from which they were extracted, sometimes only needing to look at the ‘tags’ field to find all the information needed. To avoid even the smallest imprecision that would have impaired the original label system made by journalists, uncertain cases were put in the ‘altro’ category. Statistics about the distribution of these labels can be found in section 3.6.

3.4. Data format

Brief explanation of the data fields:

• annotato: If True, the statement has a revised version.
• id: ID of the corresponding article in CheckIT!.
• statement_date: Date of the statement's diffusion.
• statement: The statement.
• verdict: Factuality verdict.
• orientamento: Orientation of the political party of the politician who authored the statement.
• macro_area: Topic of the statement.
• tags: List of tags.
• statement_revised: Revised version of the statement, if present.

Fields such as ‘macro_area’ and ‘tags’ serve as indicators of the topic, the former providing a general categorization and the latter offering more specific details. This information was included with future tasks in mind that could reveal differences in factual knowledge across different subjects.

    {
        "annotato": False,
        "id": 991,
        "statement_date": 2019-07-12,
        "statement": "[Il salario minimo n.d.r.] Manca solo a noi e ai Paesi dell’Est Europa",
        "verdict": "Falso",
        "orientamento": 'C',
        "macro_area": "questioni sociali",
        "tags": "['questioni sociali', 'panzana pazzesca', 'italia', 'eu', 'salario minimo']",
        "statement_revised": ""
    },
    {
        "annotato": True,
        "id": 123,
        "statement_date": 2023-02-14,
        "statement": "Il canone in bolletta fu una mia scelta. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro",
        "verdict": "Vero",
        "orientamento": 'C',
        "macro_area": "altro",
        "tags": "['canone', 'rai', 'bolletta', 'costo']",
        "statement_revised": "Il canone in bolletta fu una mia scelta [di Matteo Renzi]. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro"
    }

Figure 1: Data format.

3.5. Example of prompt used for zero-shot evaluation

The models are expected to be evaluated on this task in a zero-shot setting, which also better resembles the conditions of a real use-case scenario. The prompt we suggest for the evaluation is basic, and urges the model to limit its answer to just the label corresponding to the answer. The original prompt in Italian, together with its English translation, is reported in Box 1.

    Prompt (Italian): Il seguente statement, nella data indicata, è vero o falso? Rispondi solo con "Vero" o "Falso".
    English translation: Is the following statement, on the date indicated, true or false? Answer only with "True" or "False".

Box 1: Zero-shot prompt.

The prompt does not contain any information about the subject of the question or any other informative cues apart from the time reference needed to anchor the claim in a temporal context. In this way, our benchmark not only tests the model in question answering, but also indirectly tests the instruction-following abilities of the model in a language other than English.

3.6. Detailed data statistics

The full VeryfIT dataset is composed of 2,021 entries in the Italian language. Out of these claims, 352 form the VeryfIT_small dataset, in which the entries are equally split across the three main sides of a simplification of the classical political spectrum (left, right, center) and a fourth label, ‘transverse’, used for a non-precise placement in the political spectrum or the complete absence of affiliation to any political party or political coalition.

Of the 352 claims in the VeryfIT_small dataset, 43 have an enriched variation of the statement, providing additional context alongside the original statement.

The distribution of claims and factuality labels across topics is presented in Table 8, Table 9, Table 10 and Table 11.

| Macro_area | True | False | Total |
|---|---|---|---|
| questioni sociali | 256 | 170 | 426 |
| economia | 264 | 155 | 419 |
| istituzioni | 243 | 77 | 320 |
| esteri | 105 | 53 | 158 |
| giustizia | 60 | 26 | 86 |
| altro | 46 | 32 | 78 |
| ambiente | 42 | 18 | 60 |
| un-noted | 180 | 294 | 474 |
| total | 1,196 | 825 | 2,021 |

Table 8: VeryfIT: Distribution of claims and factuality labels per topic, ordered by total value.

| Macro_area | SX | CSX | C | CDX | DX | E-DX | T |
|---|---|---|---|---|---|---|---|
| questioni sociali | 19 | 105 | 27 | 5 | 27 | 101 | 80 |
| economia | 11 | 119 | 43 | 1 | 52 | 54 | 52 |
| istituzioni | 10 | 81 | 10 | 3 | 38 | 33 | 71 |
| esteri | 4 | 32 | 17 | 0 | 11 | 41 | 33 |
| giustizia | 3 | 11 | 1 | 0 | 13 | 11 | 24 |
| altro | 1 | 17 | 8 | 1 | 6 | 8 | 14 |
| ambiente | 1 | 10 | 2 | 0 | 2 | 6 | 14 |
| un-noted | 23 | 59 | 79 | 0 | 14 | 145 | 68 |

Table 9: VeryfIT data after exclusion of claims where information about political orientation of the speaker was not available: Distribution of claims per topic and positioning in the political spectrum (orientation labels: SX left, CSX center-left, C center, CDX center-right, DX right, E-DX far-right, T transverse).

| Macro_area | True | False | Total |
|---|---|---|---|
| questioni sociali | 50 [2] | 37 [4] | 87 [6] |
| economia | 53 [4] | 37 [4] | 90 [8] |
| istituzioni | 46 [11] | 17 [5] | 63 [16] |
| esteri | 26 [4] | 19 [3] | 45 [7] |
| ambiente | 8 [1] | 10 | 18 [1] |
| giustizia | 7 | 8 | 15 |
| altro | 10 [1] | 24 [4] | 34 [5] |
| total | 200 [23] | 152 [20] | 352 [43] |

Table 10: VeryfIT_small: Distribution of claims and factuality labels per topic, ordered by total value. The numbers in square brackets (highlighted in green in the original layout) count the enriched statements.

| Macro_area | SX | C | DX | T |
|---|---|---|---|---|
| questioni sociali | 22 [2] | 15 [1] | 25 [2] | 25 [1] |
| economia | 28 [2] | 30 [5] | 19 [1] | 13 |
| istituzioni | 19 [8] | 8 [1] | 15 [3] | 21 [4] |
| esteri | 9 [2] | 13 [1] | 12 [2] | 11 [2] |
| ambiente | 2 | 7 | 3 [1] | 6 |
| giustizia | 2 | 4 | 3 | 6 |
| altro | 6 [1] | 11 [3] | 11 | 6 [1] |
| total | 88 [15] | 88 [11] | 88 [9] | 88 [8] |

Table 11: VeryfIT_small: Distribution of claims per topic and positioning in the simplified political spectrum. The numbers in square brackets (highlighted in green in the original layout) count the enriched statements.

Further statistics on the original CheckIT! dataset are available in Figure A and Table A in Appendix A.

4. Metrics

Accuracy serves as the evaluation metric of the task due to its intuitive interpretation and broad applicability. Accuracy provides a clear measure of a classifier’s overall performance by calculating the proportion of correct predictions among the total cases examined.

No other metrics were chosen for the task.

5. Limitations

All the data come from an expert, reliable source, so the quality of the verdicts can be expected to be high. One possible limitation is due to the time-relatedness of said verdicts: a claim can be true or false depending on the temporal context in which it is evaluated. LMs could have a hard time discerning information pertaining to specific time intervals, given that they may not have been trained on data related to them.

Another limitation could be the depth of the factual knowledge required to understand and consequently answer the questions of the dataset. As previously stated, VeryfIT data is about the Italian/European context and touches on details of various fields that most citizens would probably not know about.

Notably, the risk of the data being present in training corpora for LMs should be mitigated, as the CheckIT! dataset is not publicly released.

Finally, fact-checking is a very complex task and statements could carry different degrees of truthfulness, more than a binary setting can express. We chose to limit the task for now to a binary classification challenge so as not to make it too complicated, but we do not exclude further development towards a multi-label setting to better capture the nuances of the fact-checking process.

6. Ethical issues

No ethical issue has arisen from the making of this task; all the data have been sourced through agreements with the original authors.

7. Data license and copyright issues

The data cannot be publicly released due to a Data Sharing Agreement between the University of Groningen and Pagella Politica. At the time of writing, VeryfIT can be obtained by contacting Dr. Tommaso Caselli.

References

[1] The Economist, Disinformation is on the rise. How does it work?, 2024. URL: https://www.economist.com/science-and-technology/2024/05/01/disinformation-is-on-the-rise-how-does-it-work.
[2] C. Wardle, H. Derakhshan, Information disorder: Toward an interdisciplinary framework for research and policymaking, volume 27, Council of Europe Strasbourg, 2017.
[3] OpenAI, Disrupting deceptive uses of AI by covert influence operations, 2024.
[4] C. Chen, K. Shu, Combating misinformation in the age of LLMs: Opportunities and challenges, AI Magazine (2024). URL: https://doi.org/10.1002/aaai.12188. doi:10.1002/aaai.12188.
[5] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, 2021. URL: https://arxiv.org/abs/2009.03300. arXiv:2009.03300.
[6] A. Srivastava, D. Kleyjo, Z. Wu, Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, Transactions on Machine Learning Research (2023).
[7] J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu, et al., Benchmarking large language models on CMExam - a comprehensive Chinese medical exam dataset, Advances in Neural Information Processing Systems 36 (2024).
[8] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring how models mimic human falsehoods, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3214–3252. URL: https://aclanthology.org/2022.acl-long.229. doi:10.18653/v1/2022.acl-long.229.
[9] P. Wang, A. Chan, F. Ilievski, M. Chen, X. Ren, PINTO: Faithful language reasoning using prompt-generated rationales, in: Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022, 2022.
[10] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, P. Szolovits, What disease does this patient have? A large-scale open domain question answering dataset from medical exams, 2020. URL: https://arxiv.org/abs/2009.13081. arXiv:2009.13081.
[11] A. Talmor, J. Herzig, N. Lourie, J. Berant, CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019. URL: https://arxiv.org/abs/1811.00937. arXiv:1811.00937.
[12] L. C. Passaro, A. Bondielli, P. Dell’Oglio, A. Lenci, F. Marcelloni, In-context annotation of topic-oriented datasets of fake news: A case study on the Notre-Dame fire event, Information Sciences 615 (2022) 657–677. URL: https://www.sciencedirect.com/science/article/pii/S0020025522008167. doi:https://doi.org/10.1016/j.ins.2022.07.128.
[13] D. Croce, A. Zelenanska, R. Basili, Neural learning for question answering in Italian, in: C. Ghidini, B. Magnini, A. Passerini, P. Traverso (Eds.), AI*IA 2018 – Advances in Artificial Intelligence, Springer International Publishing, Cham, 2018, pp. 389–402.
[14] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, 2016. URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.
[15] I. Plaza, N. Melero, C. del Pozo, J. Conde, P. Reviriego, M. Mayor-Rocher, M. Grandury, Spanish and LLM benchmarks: is MMLU lost in translation?, arXiv preprint arXiv:2406.17789 (2024).
[16] J. Gili, L. Passaro, T. Caselli, CheckIT!: A corpus of expert fact-checked claims for Italian, in: F. Boschetti, G. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 30 November to 2 December 2023, CEUR Workshop Proceedings (CEUR-WS.org), 2023.

Appendix A

Figure A: Original data from subset d1 of CheckIT!: Claims distribution in the political spectrum in reference to factual veracity. (Figure not reproduced in this text version.)

| Macro_area | SX | CSX | C | CDX | DX | E-DX | T |
|---|---|---|---|---|---|---|---|
| economia | 21 | 243 | 74 | 6 | 119 | 145 | 142 |
| questioni sociali | 30 | 215 | 62 | 12 | 50 | 203 | 174 |
| istituzioni | 11 | 150 | 24 | 6 | 81 | 54 | 144 |
| esteri | 7 | 75 | 25 | 0 | 19 | 99 | 80 |
| ambiente | 5 | 30 | 8 | 0 | 2 | 9 | 29 |
| giustizia | 3 | 23 | 4 | 0 | 23 | 23 | 35 |
| altro | 33 | 96 | 97 | 2 | 23 | 171 | 107 |
| total | 110 | 832 | 294 | 26 | 317 | 704 | 711 |

Table A: Original data from subset d1 of CheckIT!: Distribution of claims per topic (Macro_area) and positioning (Orientamento) in the full political spectrum. The far-left label is omitted as it is not present in the dataset.
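To make the evaluation protocol of Sections 3.5 and 4 concrete, the following sketch builds the zero-shot prompt for a claim and computes accuracy over a set of model answers. It is an illustration under stated assumptions, not the organizers' evaluation script: the way the date and the statement are appended to the instruction, the answer normalization in `parse_label`, and the choice to count unparseable answers as errors are all hypothetical. Only the Italian instruction text is taken from Box 1.

```python
from typing import Optional, Sequence

# Instruction text from Box 1.
PROMPT = (
    "Il seguente statement, nella data indicata, è vero o falso? "
    'Rispondi solo con "Vero" o "Falso".'
)


def build_prompt(statement: str, date: str) -> str:
    """Attach date and statement to the instruction (layout is an assumption)."""
    return f"{PROMPT}\nData: {date}\nStatement: {statement}"


def parse_label(answer: str) -> Optional[str]:
    """Map a raw model answer to 'Vero' or 'Falso'; None if it is neither."""
    cleaned = answer.strip().strip(".\"'«»“”").strip().lower()
    if cleaned == "vero":
        return "Vero"
    if cleaned == "falso":
        return "Falso"
    return None


def accuracy(gold: Sequence[str], answers: Sequence[str]) -> float:
    """Proportion of answers whose parsed label equals the gold verdict.

    Assumption: an answer that cannot be parsed counts as incorrect.
    """
    if not gold or len(gold) != len(answers):
        raise ValueError("gold and answers must be non-empty and equally long")
    correct = sum(parse_label(a) == g for g, a in zip(gold, answers))
    return correct / len(gold)


# Hypothetical model outputs, used only to exercise the functions.
gold_verdicts = ["Vero", "Falso", "Vero", "Falso"]
model_answers = ["Vero", "Vero", "vero.", "boh"]
print(accuracy(gold_verdicts, model_answers))
```

Treating unparseable answers as errors keeps the score aligned with the instruction-following aspect mentioned in Section 3.5; a more lenient evaluation would need a different `parse_label`.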