<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jacopo</forename><surname>Gili</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Turin</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Viviana</forename><surname>Patti</surname></persName>
							<email>viviana.patti@unito.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Turin</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucia</forename><surname>Passaro</surname></persName>
							<email>lucia.passaro@unipi.it</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Pisa</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tommaso</forename><surname>Caselli</surname></persName>
							<email>t.caselli@rug.nl</email>
							<affiliation key="aff2">
								<orgName type="department">CLCG</orgName>
								<orgName type="institution">University of Groningen</orgName>
								<address>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">47793B8231AFD3746C1B6DA2E538AEE6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>fact checking</term>
					<term>benchmark</term>
					<term>factual knowledge</term>
					<term>Italian</term>
					<term>fake news</term>
					<term>CALAMITA</term>
					<term>CheckIT!</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Achieving factual accuracy is a known open issue for language models. Their design, centered on the interactive component of user interaction and the extensive use of "spontaneous" training data, has made them highly adept at conversational tasks but not fully reliable in terms of factual correctness. VeryfIT addresses this issue by evaluating the in-memory factual knowledge of language models on data written by professional fact-checkers, posing it as a true-or-false question. The topics of the statements vary, but most fall within specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining the appropriate contextual relevance both temporally and factually, and ultimately verifying the accuracy of the statements.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Challenge: Introduction and Motivation</head><p>The pollution of the information ecosystem with misleading or false information has reached unprecedented levels at a global scale. This has been made possible by a combination of multiple factors, including the collapse of (local and national) journalism, an increasing sense of distrust in science and evidence-based facts, and the presence of computational amplification tools such as bots <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. In this context, the rise of Large Language Models (LLMs), with their steadily improving performance, has introduced both opportunities and challenges in the fight against misinformation: while LLMs possess the capability to generate coherent and contextually relevant text, they also pose risks by potentially producing deceptive misinformation at scale <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>Testing factual and common-sense knowledge in LLMs has been a common, although not easy, task. It mostly involves multiple-choice question answering, a method that is easy to automate and not prone to ambiguity, and it spans a wide range of academic and professional domains such as mathematics, medicine, history, law, general knowledge, and many others <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12]</ref>.</p><p>Developing benchmarks to test the ability of LLMs to accurately handle factual knowledge is more relevant than ever, considering how easily non-experts can access these tools for any purpose (entertainment, education, professional settings) and the increasing integration of these technologies into everyday activities. 
Notably, most of these tasks and corresponding benchmarks are in English, with other languages being represented through machine-translated data or not at all. This is true for Italian too: for instance, SQUAD-IT <ref type="bibr" target="#b13">[13]</ref> is a machine-translated version of the SQUAD dataset <ref type="bibr" target="#b14">[14]</ref> and is the reference for evaluating models on QA tasks.</p><p>While machine translation has been constantly improving, it can easily introduce artefacts in the output text, impairing naturalness and correctness. Moreover, translated data is subject to the loss of nuance and context, as translations may not capture cultural nuances or contextual meanings, leading to misunderstandings or misinterpretations in the target language: certain phrases or idioms may not have direct equivalents in other languages, and linguistic constructions typical of the source language may be encouraged excessively <ref type="bibr" target="#b15">[15]</ref>.</p><p>By using data from a professional fact-checking agency 1 we can test the knowledge memorization of LMs and to what extent intra-memory conflicts, resulting in "hallucinations", arise. Furthermore, doing so with Italian data centered on the Italian and European contexts ensures that the LMs' capabilities are tested directly in Italian.</p><p>This task is based on CheckIT! <ref type="bibr" target="#b16">[16]</ref>, a resource of expert fact-checked claims designed to fill a gap in the development of AI-assisted fact-checking pipelines for Italian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenge: Description</head><p>The challenge is a binary classification task in a zero-shot setting: for each atomic statement, the LM is asked to determine its factuality with respect to the time it was uttered, answering only with one of two labels, "Vero" (true) or "Falso" (false). A third label for half-true statements could easily have been kept, as it was already part of the dataset from which the data is sourced, but in this first stage we opted for the binary setting so as to limit task complexity.</p><p>Some cases in the dataset exhibit complexities due to the combination of multiple pieces of information within a single claim, which can affect the final determination of veracity. For instance, consider the following scenario:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Original claim Translation</head><p>«Se è vero che oltre l'82% dei morti da Covid hanno più di 70 anni, non si capisce perché meno della metà degli over 80 sia stato vaccinato finora» «If it is true that over 82% of Covid deaths are over 70 years old, it is not clear why less than half of those over 80 have been vaccinated so far»</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1 Example of a claim</head><p>The pieces of information conveyed by this statement are:</p><p>1. Of all those deceased due to the Covid-19 pandemic, 82% are people over 70 years old.</p><p>2. Less than half of the citizens over 80 years old had received at least one dose of a vaccine against Covid-19.</p><p>This example also highlights the importance of incorporating the appropriate temporal context in the verification process. Factual information, especially statistics or reports about the state of the world, evolves over time, and failing to account for this can invalidate the conclusions drawn by experts. Although more complex statements require a broader knowledge base, language models have by now shown understanding abilities well above this level and should not be hindered by it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head><p>The VeryfIT dataset consists of 2,021 claims taken from CheckIT! <ref type="bibr" target="#b16">[16]</ref>. Not all claims were included, due to the binary format of the task: VeryfIT classifies claims as either "Vero" <ref type="bibr">[True]</ref> or "Falso" [False], whereas CheckIT! recognizes an intermediate "Ni" [Half true] label. As a result, all claims with the "half-true" verdict were discarded.</p><p>Furthermore, we considered it pertinent to the task to also provide a smaller subset of claims, "VeryfIT_small", balanced on the political orientation of the speaking politician: misinformation can occur on all topics, but in political misinformation each side of the political spectrum has its more widespread topics and recurrent formulations.</p><p>Additionally, an annotation task was carried out on the VeryfIT_small subset, aimed at clarifying statements with a level of ambiguity that would have proven detrimental to the task: around 12% of the statements have an alternative version available, "enriched" with information vital to the task. We will refer to them as "enriched statements" (subsection 3.2).</p><p>In total, two versions of the dataset are available: VeryfIT (2,021 claims) and VeryfIT_small (352 claims, of which 43 have an enriched version).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Creation of VeryfIT_small</head><p>The first step to achieve this goal was to exclude around 400 of the 2,021 claims of VeryfIT for which information about the political orientation of the speaker was not available.</p><p>We then mapped, using Wikipedia as a source, the political orientation of the parties (and thus of the authors of the claims at the moment of remark) into seven fine-grained, commonly recognized political categories: far-left, left, center-left, center, center-right, right, and far-right. The full list of parties and their corresponding political orientation is reported in Table <ref type="table" target="#tab_0">2</ref>. An additional label, 'transverse', was added to indicate an imprecise placement in the political spectrum. This label includes one party ("Movimento 5 Stelle"), members of the Italian institutions above political parties (e.g. the President of the Republic), and experts not affiliated with any political party or coalition, such as members of a technical government <ref type="foot" target="#foot_0">2</ref>.</p><p>At first glance, the Italian political spectrum may appear only slightly unbalanced. Despite the absence of a far-left representation, the distribution of parties across the spectrum is relatively symmetrical. Of the 23 political parties in the data, six are from the left, two from the center-left, six from the center, three from the center-right, two from the right, and three from the far-right. However, the distribution of claims is not as well balanced, with a larger number of claims from the right and far-right parties than from the rest, as reported in Table 3.</p><p>To ensure the balance of our benchmark we decided to reduce the label granularity from eight to four, by col-</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Political party</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Orientation label</head><p>VeryfIT_small: Final distribution of verdict labels in the political spectrum. Highlighted in green the number of labels of enriched statements (explained in subsection 3.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Enriched statements</head><p>Given the specificity of the statements, many of which require detailed knowledge of topics related to Italian institutions and policies, and the occasional ambiguity arising from their oral nature, the task has been further divided into two sub-tasks with slight data modifications, aimed at adding vital context to statements that relied excessively on information external to the statements themselves. The altered statements account for around 12% of the VeryfIT_small dataset, as excessive human intervention would undermine the core principle of testing on natural data, aligned with what language models might be asked to handle in real-life scenarios. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the speaking politician or clarifying specific references.</p><p>The goal of partially or entirely removing the initial layer of complexity, by simplifying the extraction of the relevant information from the statement under verification, is to strengthen the correlation between the benchmark results and the language model's actual factual knowledge: when working with natural data, the model's errors may stem from its difficulty in comprehending the specific information it is being asked to verify, whereas with altered data its responses are more directly influenced by gaps in its knowledge.</p><p>Examples of enriched statements are reported in Table 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>L'elezione dei membri della</head><p>Corte Costituzionale e del Consiglio Superiore della Magistratura (Csm) è un dovere che la costituzione italiana dà al parlamento.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6 Comparison of Original and Enriched Statements</head><p>The reasons for enriching the statements in Table 6 all revolve around the lack of pivotal information needed to determine factuality. The first statement is completely missing its context and presents an unclear term, "grandi elettori" [big voters], relatively well known in the political context but that could be mistaken for a physical feature or for a consideration regarding the age of voters; the second statement has an unclear formulation, as "pagare" [to pay] does not refer univocally to taxes; the third statement is missing its subject; the fourth and last statement is missing part of its context, as "stiamo facendo un lavoro" [we are doing a job] and "stiamo svolgendo un ruolo" [we are playing a role] both refer to a very specific duty of the parliament that is never mentioned directly.</p><p>Preliminary results obtained through the chat function of Claude 3.5 Sonnet<ref type="foot" target="#foot_1">3</ref> and GPT-4o <ref type="foot" target="#foot_2">4</ref> show that, respectively, two of the four statements (Claude) and one of the four statements (GPT) reported in Table <ref type="table">6</ref> are wrongly classified when presented in the original version, while providing the models with the enriched versions brings the correct classifications up to four out of four for both models. These results, however, can only partially prove the effectiveness of enriched statements, as different models presented with a partial context could provide different verdicts, even guessing the right one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Annotation details</head><p>During the making of the VeryfIT datasets, it was noticed that not all the statements were actual claims: in articles with multiple claims to check, the 'statement' field was filled with a short title summarizing them all, often in the format "[name of the politician] on [topic]". Regular expressions were used to highlight statements not starting with '"' or '«', the two symbols used to denote a dialogue or part of a speech, and a manual check led to the exclusion of around 170 statements. Another important annotation step was producing the enriched statements. A human annotator<ref type="foot" target="#foot_3">5</ref> reviewed the VeryfIT_small dataset, identifying statements that could benefit from additional context, and produced enriched variations of those statements. In most cases, minimal adjustments were made, such as retaining the original claim but adding the name of the speaking politician or clarifying anaphoric references.</p><p>The decision to apply this annotation step to the VeryfIT_small subset, instead of the full dataset, is related to the amount of manual work it would have required.</p><p>Finally, another annotation step involved completing the "macro_area" [topic] field for all 352 entries of VeryfIT_small. Although this field was included in the original dataset, it was missing a value in approximately 15% of the entries. This was done manually, classifying statements into the pre-existing topic labels, which are: 'questioni sociali' [social matters], 'economia' [economy], 'esteri' [foreign affairs], 'giustizia' [justice], 'istituzioni' [institutions], 'ambiente' [environment], 'altro' [others]. 
The new labels were chosen by comparing unlabelled statements with statements that already had a label and by inspecting the contents of the articles from which they were extracted; sometimes it was sufficient to look at the 'tags' field to find all the information needed. To avoid even the smallest imprecision that would have impaired the original label system made by the journalists, uncertain labels were put in the 'altro' category.</p><p>Statistics about the distribution of these labels can be found in section 3.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Data format</head><p>Brief explanation of the data fields:</p><p>• annotato: If True, the statement has a revised version.</p><p>• id: ID of the corresponding article in CheckIT!.</p><p>• statement_date: Date of the statement's diffusion.</p><p>• statement: The statement.</p><p>• verdict: Factuality verdict.</p><p>• orientamento: Orientation of the political party of the politician who authored the statement.</p><p>• macro_area: Topic of the statement.</p><p>• tags: List of tags.</p><p>• statement_revised: Revised version of the statement, if present.</p><p>Fields such as 'macro_area' and 'tags' serve as indicators of the topic, the former providing a general categorization and the latter offering more specific details. This information was included with future tasks in mind that could reveal differences in factual knowledge across different subjects.</p><p>{ "annotato": false, "id": 991, "statement_date": "2019-07-12", "statement": "[Il salario minimo n.d.r.] Manca solo a noi e ai Paesi dell'Est Europa", "verdict": "Falso", "orientamento": "C", "macro_area": "questioni sociali", "tags": ["questioni sociali", "panzana pazzesca", "italia", "eu", "salario minimo"], "statement_revised": "" }, { "annotato": true, "id": 123, "statement_date": "2023-02-14", "statement": "Il canone in bolletta fu una mia scelta. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro", "verdict": "Vero", "orientamento": "C", "macro_area": "altro", "tags": ["canone", "rai", "bolletta", "costo"], "statement_revised": "Il canone in bolletta fu una mia scelta [di Matteo Renzi]. Costava 113 euro. Averlo fatto pagare a tutti ha portato a un abbassamento del costo da 113 a 90 euro" }</p></div>
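The fields above can be consumed programmatically. A minimal sketch, assuming records are distributed as JSON objects with exactly the field names listed (the helper `effective_statement` is our own illustration, not part of the released data):

```python
import json

# One record in the format described above (values taken from the example).
record = json.loads("""
{
  "annotato": false,
  "id": 991,
  "statement_date": "2019-07-12",
  "statement": "[Il salario minimo n.d.r.] Manca solo a noi e ai Paesi dell'Est Europa",
  "verdict": "Falso",
  "orientamento": "C",
  "macro_area": "questioni sociali",
  "tags": ["questioni sociali", "panzana pazzesca", "italia", "eu", "salario minimo"],
  "statement_revised": ""
}
""")

def effective_statement(rec: dict) -> str:
    """Use the enriched (revised) statement when one is available."""
    if rec.get("annotato") and rec.get("statement_revised"):
        return rec["statement_revised"]
    return rec["statement"]
```

This makes the 'annotato'/'statement_revised' convention explicit: the enriched sub-task simply swaps in the revised text when present.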
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Example of prompt used for zero-shot evaluation</head><p>The models are expected to be evaluated on this task in a zero-shot setting, thereby better resembling the conditions of a real use-case scenario. The prompt we suggest using for the evaluation is basic, and it urges the model to limit its answer to just the label corresponding to the answer. The original prompt in Italian, together with its English translation, is reported in Box 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompt</head><p>Il seguente statement, nella data indicata, è vero o falso? Rispondi solo con "Vero" o "Falso".</p><p>Is the following statement, on the date indicated, true or false? Answer only with "True" or "False".</p><p>Box 1: Zero-shot prompt</p><p>The prompt does not contain any information about the subject of the question or any other informative cues, apart from the time reference needed to anchor the claim in a temporal context. In this way, our benchmark not only tests the model on question answering, but also indirectly tests its instruction-following abilities in a language different from English.</p></div>
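Assembling the full input for one claim can be sketched as follows. The instruction text is the one from Box 1; the wrapping of the date and statement ("Data: …", "Statement: …") is our assumption, since the paper only specifies that the date must anchor the claim temporally:

```python
# Suggested zero-shot instruction (Box 1, Italian version).
PROMPT = ('Il seguente statement, nella data indicata, è vero o falso? '
          'Rispondi solo con "Vero" o "Falso".')

def build_prompt(statement: str, statement_date: str) -> str:
    """Combine the instruction, the date, and the claim into one model input.
    The field layout below is illustrative, not prescribed by the benchmark."""
    return f"{PROMPT}\n\nData: {statement_date}\nStatement: {statement}"
```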
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">Detailed data statistics</head><p>The full VeryfIT dataset is composed of 2,021 entries in Italian. Of these claims, 352 form the VeryfIT_small dataset, in which the entries are equally split across the three main sides of a simplification of the classical political spectrum (left, right, center) and a fourth label, 'transverse', used to indicate an imprecise placement in the political spectrum or the complete absence of affiliation with any political party or coalition.</p><p>Of the 352 claims in the VeryfIT_small dataset, 43 have an enriched variation of the statement available, providing additional context alongside the original statement.</p><p>The distribution of claims and factuality labels across topics is presented in Table <ref type="table" target="#tab_6">8</ref>, Table <ref type="table" target="#tab_8">9</ref>, Table <ref type="table" target="#tab_9">10</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>Accuracy serves as the evaluation metric of the task due to its intuitive interpretation and broad applicability: it provides a clear measure of a classifier's overall performance by calculating the proportion of correct predictions among the total cases examined.</p><p>No other metrics were chosen for the task.</p><p>Totals row: 88 [15], 88 [11], 88 [9], 88 [8]. Table <ref type="table">11</ref> VeryfIT_small: Distribution of claims per topic and positioning in the simplified political spectrum. Highlighted in green the number of labels of enriched statements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>All of the data comes from an expert, reliable source; for this reason, the quality of the verdicts is assured to be high. One possible limitation is the time-relatedness of said verdicts: claims can be true at some times and false at others, depending on the temporal context in which they are evaluated. LMs could have a hard time discerning information pertaining to specific time intervals, given that they may not have been trained on data related to them. Another limitation could be the depth of the factual knowledge required to understand and consequently answer the questions of the dataset. As previously stated, the VeryfIT data concerns the Italian/European context and touches on details of various fields that most likely not even the citizens themselves would know about. Remarkably, the risk of the data being present in training corpora for LMs should be mitigated, as the CheckIT! dataset is not publicly released.</p><p>Finally, fact-checking is a very complex task, and statements could carry more degrees of truthfulness than a binary setting can express. We chose, for now, to limit the task to a binary classification challenge to keep its complexity manageable, but we do not exclude further development towards a multi-label setting to better capture the nuances of the fact-checking process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>No ethical issue has arisen from the making of this task; all the data has been sourced through agreements with the original authors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Data license and copyright issues</head><p>The data cannot be publicly released due to a Data Sharing Agreement between the University of Groningen and Pagella Politica. At the time of writing, to obtain VeryfIT, contact dr. Tommaso Caselli.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Data format</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="9,129.94,117.78,333.35,249.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>VeryfIT data: Italian political parties and their orientation.</figDesc><table><row><cell>Alleanza Verdi e Sinistra</cell><cell>left</cell></row><row><cell>Alternativa Popolare</cell><cell>center-right</cell></row><row><cell>Articolo Uno</cell><cell>center-left</cell></row><row><cell>Azione</cell><cell>center</cell></row><row><cell>Coraggio Italia</cell><cell>center-right</cell></row><row><cell>Europa Verde</cell><cell>left</cell></row><row><cell>Forza Italia</cell><cell>right</cell></row><row><cell>Fratelli d'Italia</cell><cell>far-right</cell></row><row><cell>Impegno Civico</cell><cell>center</cell></row><row><cell>Indipendente</cell><cell>transverse</cell></row><row><cell>Italexit</cell><cell>far-right</cell></row><row><cell>Italia Viva</cell><cell>center</cell></row><row><cell>Lega Nord</cell><cell>far-right</cell></row><row><cell>Liberi e uguali</cell><cell>left</cell></row><row><cell>Movimento 5 Stelle</cell><cell>transverse</cell></row><row><cell>Nuovo Centro Destra</cell><cell>center-right</cell></row><row><cell>Partito Democratico</cell><cell>center-left</cell></row><row><cell>Più Europa</cell><cell>center</cell></row><row><cell>Popolo della Libertà</cell><cell>right</cell></row><row><cell>Possibile</cell><cell>left</cell></row><row><cell>Radicali Italiani</cell><cell>center</cell></row><row><cell>Scelta Civica</cell><cell>center</cell></row><row><cell>Sinistra Ecologia Libertà</cell><cell>left</cell></row><row><cell>Sinistra italiana</cell><cell>left</cell></row><row><cell>Tecnico</cell><cell>transverse</cell></row><row><cell></cell><cell cols="3">Claims</cell></row><row><cell>Political side</cell><cell>True</cell><cell>False</cell><cell>Total</cell></row><row><cell>Left</cell><cell>44</cell><cell>28</cell><cell>72</cell></row><row><cell>Center-left</cell><cell>323</cell><cell>110</cell><cell>433</cell></row><row><cell>Center</cell><cell>105</cell><cell>82</cell><cell>187</cell></row><row><cell>Center-right</cell><cell>8</cell><cell>2</cell><cell>10</cell></row><row><cell>Right</cell><cell>79</cell><cell>84</cell><cell>163</cell></row><row><cell>Far-right</cell><cell>156</cell><cell>241</cell><cell>397</cell></row><row><cell>Transverse</cell><cell>209</cell><cell>146</cell><cell>355</cell></row><row><cell>total</cell><cell>924</cell><cell>693</cell><cell>1,617</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>VeryfIT data after exclusion of claims where information about the political orientation of the speaker was not available: Distribution of verdict labels in the political spectrum.</figDesc><table /><note>lapsing labels far-left, left and center-left into 'left' [SX], and far-right, right and center-right into 'right' [DX]. Labels center [C] and trasversal [T] remained untouched. The re-aggregated coarse-grained labels are reported in Table 4. Although the distribution is still unbalanced between</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 5</head><label>5</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc>Examples of enriched statements.</figDesc><table><row><cell>Original statement</cell><cell>Enriched statement</cell></row><row><cell>Abbiamo 490 grandi elettori</cell><cell>Gli elettori dell'area di centrosinistra che voteranno per l'elezione del Presidente della Repubblica saranno 490.</cell></row><row><cell>Oggi in Italia sono 796 quelli che pagano più di 1 milione di euro</cell><cell>Oggi in Italia sono 796 quelli che dichiarano un reddito superiore ad 1 milione di euro.</cell></row><row><cell>[Alle europee] io ho battuto Salvini in molti capoluoghi di provincia</cell><cell>[Alle europee] io [Carlo Calenda] ho battuto Salvini in molti capoluoghi di provincia.</cell></row><row><cell>In parlamento stiamo facendo un lavoro che risponde a una prerogativa costituzionale. Certamente si sarebbero tutti auspicati, me compresa, tempi più brevi ma non stiamo perdendo tempo. Stiamo svolgendo un ruolo che ci compete e che la Costituzione da' al parlamento.</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 7</head><label>7</label><figDesc>Examples of reworded statements. Moreover, around 30 statements with formats resembling "[name of the politician] is [right/wrong] on [topic]: [statement]" were reformulated as claims by removing hints about the factuality verdict and the author of the statement. A couple of examples are reported in Table 7.</figDesc><table><row><cell>Original statement</cell><cell>Reworded statement</cell></row><row><cell>Giulia Grillo sbaglia: i medici e gli infermieri italiani non sono i meno pagati</cell><cell>i medici e gli infermieri italiani sono i meno pagati</cell></row><row><cell>Secondo Di Maio il governo investe nelle centrali a carbone, ma è il contrario</cell><cell>Il governo investe nelle centrali a carbone</cell></row><row><cell>No, per la Corte dei Conti non ci saranno 17 miliardi di nuove tasse</cell><cell>Per la Corte dei Conti ci saranno 17 miliardi di nuove tasse</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head></head><label></label><figDesc>Table 11.</figDesc><table><row><cell></cell><cell></cell><cell>Claims</cell><cell></cell></row><row><cell>Macro_area</cell><cell cols="3">True False Total</cell></row><row><cell>questioni sociali</cell><cell>256</cell><cell>170</cell><cell>426</cell></row><row><cell>economia</cell><cell>264</cell><cell>155</cell><cell>419</cell></row><row><cell>istituzioni</cell><cell>243</cell><cell>77</cell><cell>320</cell></row><row><cell>esteri</cell><cell>105</cell><cell>53</cell><cell>158</cell></row><row><cell>giustizia</cell><cell>60</cell><cell>26</cell><cell>86</cell></row><row><cell>altro</cell><cell>46</cell><cell>32</cell><cell>78</cell></row><row><cell>ambiente</cell><cell>42</cell><cell>18</cell><cell>60</cell></row><row><cell>un-noted</cell><cell>180</cell><cell>294</cell><cell>474</cell></row><row><cell>total</cell><cell>1,196</cell><cell>825</cell><cell>2,021</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 8</head><label>8</label><figDesc>VeryfIT: Distribution of claims and factuality labels per topic, ordered by total value. Further statistics on the original CheckIT! dataset are available in Figure A and Table A in Appendix A.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 9</head><label>9</label><figDesc>VeryfIT data after excluding claims for which information about the political orientation of the speaker was not available: distribution of claims per topic and position in the political spectrum.</figDesc><table><row><cell></cell><cell></cell><cell>Claims</cell><cell></cell></row><row><cell>Macro_area</cell><cell>True</cell><cell>False</cell><cell>Total</cell></row><row><cell>questioni sociali</cell><cell>50 [2]</cell><cell>37 [4]</cell><cell>87 [6]</cell></row><row><cell>economia</cell><cell>53 [4]</cell><cell>37 [4]</cell><cell>90 [8]</cell></row><row><cell>istituzioni</cell><cell>46 [11]</cell><cell>17 [5]</cell><cell>63 [16]</cell></row><row><cell>esteri</cell><cell>26 [4]</cell><cell>19 [3]</cell><cell>45 [7]</cell></row><row><cell>ambiente</cell><cell>8 [1]</cell><cell>10</cell><cell>18 [1]</cell></row><row><cell>giustizia</cell><cell>7</cell><cell>8</cell><cell>15</cell></row><row><cell>altro</cell><cell>10 [1]</cell><cell>24 [4]</cell><cell>34 [5]</cell></row><row><cell>total</cell><cell cols="3">200 [23] 152 [20] 352 [43]</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 10</head><label>10</label><figDesc>VeryfIT_small: Distribution of claims and factuality labels per topic, ordered by total value. The number of labels of enriched statements is highlighted in green.</figDesc><table><row><cell></cell><cell></cell><cell cols="2">Orientation label</cell><cell></cell></row><row><cell>Macro_area</cell><cell>SX</cell><cell>C</cell><cell>DX</cell><cell>T</cell></row><row><cell>questioni sociali</cell><cell>22 [2]</cell><cell cols="3">15 [1] 25 [2] 25 [1]</cell></row><row><cell>economia</cell><cell>28 [2]</cell><cell cols="2">30 [5] 19 [1]</cell><cell>13</cell></row><row><cell>istituzioni</cell><cell>19 [8]</cell><cell cols="3">8 [1] 15 [3] 21 [4]</cell></row><row><cell>esteri</cell><cell>9 [2]</cell><cell cols="3">13 [1] 12 [2] 11 [2]</cell></row><row><cell>ambiente</cell><cell>2</cell><cell>7</cell><cell>3 [1]</cell><cell>6</cell></row><row><cell>giustizia</cell><cell>2</cell><cell>4</cell><cell>3</cell><cell>6</cell></row><row><cell>altro</cell><cell>6 [1]</cell><cell>11 [3]</cell><cell>11</cell><cell>6 [1]</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://en.wikipedia.org/wiki/Technocratic_government_(Italy)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://claude.ai/chat</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://chatgpt.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">All the annotations noted in the report were done by the first author of the paper, a master's student in Computer Science with a background in Natural Language Processing.</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Disinformation is on the rise. How does it work?</title>
		<author>
			<persName><surname>The Economist</surname></persName>
		</author>
		<ptr target="https://www.economist.com/science-and-technology/2024/05/01/disinformation-is-on-the-rise-how-does-it-work" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Wardle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Derakhshan</surname></persName>
		</author>
		<title level="m">Information disorder: Toward an interdisciplinary framework for research and policymaking</title>
				<meeting><address><addrLine>Strasbourg</addrLine></address></meeting>
		<imprint>
			<publisher>Council of Europe</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">27</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<title level="m">Disrupting deceptive uses of AI by covert influence operations</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shu</surname></persName>
		</author>
		<idno type="DOI">10.1002/aaai.12188</idno>
		<ptr target="https://doi.org/10.1002/aaai.12188" />
	</analytic>
	<monogr>
		<title level="m">Combating misinformation in the age of LLMs: Opportunities and challenges</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Measuring massive multitask language understanding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2009.03300" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kleyjo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions on Machine Learning Research</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Benchmarking large language models on cmexam-a comprehensive chinese medical exam dataset</title>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>You</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">TruthfulQA: Measuring how models mimic human falsehoods</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Evans</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.229</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.229" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3214" to="3252" />
		</imprint>
	</monogr>
	<note>Volume 1: Long Papers. Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">PINTO: Faithful language reasoning using prompt-generated rationales</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ilievski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Trustworthy and Socially Responsible Machine Learning</title>
				<meeting><address><addrLine>NeurIPS</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">What disease does this patient have? a large-scale open domain question answering dataset from medical exams</title>
		<author>
			<persName><forename type="first">D</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Oufattole</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-H</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Szolovits</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2009.13081" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">CommonsenseQA: A question answering challenge targeting commonsense knowledge</title>
		<author>
			<persName><forename type="first">A</forename><surname>Talmor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lourie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Berant</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1811.00937" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">In-context annotation of topic-oriented datasets of fake news: A case study on the notre-dame fire event</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bondielli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dell'oglio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Marcelloni</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.ins.2022.07.128</idno>
		<ptr target="https://doi.org/10.1016/j.ins.2022.07.128" />
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">615</biblScope>
			<biblScope unit="page" from="657" to="677" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Neural learning for question answering in Italian</title>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zelenanska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Basili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AI*IA 2018 - Advances in Artificial Intelligence</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Ghidini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Passerini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Traverso</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="389" to="402" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/1606.05250" />
		<title level="m">SQuAD: 100,000+ questions for machine comprehension of text</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Plaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Melero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Del Pozo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Conde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Reviriego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mayor-Rocher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Grandury</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.17789</idno>
		<title level="m">Spanish and LLM benchmarks: is MMLU lost in translation?</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">CheckIT!: A corpus of expert fact-checked claims for Italian</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Passaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th Italian Conference on Computational Linguistics, CEUR Workshop Proceedings</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Boschetti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Lebani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Novielli</surname></persName>
		</editor>
		<meeting>the 9th Italian Conference on Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2" to="12" />
		</imprint>
	</monogr>
	<note>9th Italian Conference on Computational Linguistics, CLiC-it 2023</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
