Leveraging Large Language Models for Fact Verification in Italian

Antonio Scaiella¹,², Stefano Costanzo¹, Elisa Passone¹, Danilo Croce¹,* and Giorgio Gambosi¹
¹ Department of Enterprise Engineering, University of Rome Tor Vergata, Italy
² Reveal s.r.l.

Abstract
In recent years, Automatic Fact Checking has become a crucial tool for combating fake news by leveraging AI to verify the accuracy of information. Despite significant advancements, most datasets and models are predominantly available in English, posing challenges for other languages. This paper presents an Italian resource based on the dataset made available in the FEVER evaluation campaign, created to train and evaluate fact-checking models in Italian. The dataset comprises approximately 240k examples, with over 2k test examples manually validated. Additionally, we fine-tuned a state-of-the-art LLM, namely LLaMA3, on both the original English and the translated Italian datasets, demonstrating that fine-tuning significantly improves model performance. Our results suggest that the fine-tuned models achieve comparable accuracy in both languages, highlighting the value of the proposed resource.

Keywords
Automatic Fact Checking, Fact Checking in Italian, Resource in Italian, Large Language Model for Fact Verification

1. Introduction

In recent years, Automatic Fact Checking (AFC) has assumed a significant role as an instrument to identify fake news. AFC is a process that verifies the truthfulness and accuracy of information, claims, and data contained in a text or speech. The focus is on debunking disinformation and misinformation, intercepting errors, and verifying sources and facts.

Automated fact-checking uses AI tools to identify, verify, and respond to misleading claims, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the truthfulness of claims [1]. This is a complex process that involves searching, interpreting, and assessing information. As discussed in [1], an NLP framework for automated fact-checking consists of three stages: claim detection, to identify claims that require verification; evidence retrieval, to find sources supporting or refuting the claim; and claim verification, to assess the truthfulness of the claim based on the retrieved evidence.

At first, automating the fact-checking process was discussed in the context of computational journalism in works like [2], and it has since received significant attention in the computational linguistics and, more generally, the artificial intelligence communities, surveyed in [1] and more recently in [3] and [4]. In particular, the authors of [1] survey the topic, describing the early developments previously reviewed in [5], an exhaustive overview of the subject.

As with most machine learning paradigms [1], state-of-the-art methods require datasets and benchmarks. One of the most impactful campaigns for collecting a large-scale benchmark is FEVER (Fact Extraction and VERification) [6]. In this context, fact-checking involves verifying whether a claim is supported by one or more pieces of evidence. FEVER is a publicly available dataset designed for claim verification against textual sources. It comprises about 180K claims generated by altering sentences extracted from Wikipedia. The claims are classified into three categories: Supported (a piece of evidence exists and it supports the claim), Refutes (a piece of evidence exists and it contradicts the claim), or NotEnoughInfo (there is insufficient evidence to verify the claim). The challenge, therefore, is to retrieve the relevant evidence and verify the accuracy of the claims, categorizing them with the correct label.

Many works like FEVER have recently focused on building data and datasets for the task of Fact Verification, achieving very good results [7, 8, 9, 10, 11, 12].
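Concretely, a FEVER-style instance pairs a claim with retrieved evidence and one of the three labels. The following is a minimal, hypothetical sketch of such a data structure (the field names are illustrative, not the official FEVER schema):

```python
# A minimal, hypothetical FEVER-style example: a claim paired with
# evidence passages and one of the three verification labels.
# Field names are illustrative, not the official FEVER schema.

LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}

example = {
    "claim": "Dan Brown is illiterate.",
    "evidence": [
        {
            "page": "Angels & Demons",
            "text": ("Angels and Demons is a 2000 bestselling "
                     "mystery-thriller novel written by American "
                     "author Dan Brown."),
        }
    ],
    "label": "REFUTES",  # the evidence contradicts the claim
}

def is_valid(ex: dict) -> bool:
    """Check that an example carries a claim, a recognised label,
    and at least one evidence passage whenever the label is not
    NOT ENOUGH INFO."""
    if ex["label"] not in LABELS:
        return False
    if ex["label"] != "NOT ENOUGH INFO" and not ex["evidence"]:
        return False
    return bool(ex["claim"])

print(is_valid(example))  # True
```

A verifier is then trained to predict the label from the claim/evidence pair alone.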
However, all of these datasets are designed for the English language. Although multilingual models exist (e.g., [13, 14]), fine-tuning a model on a specific language, or pre-training it for a specific task and use case, could lead to a significant decline in quality when it is applied to another language. Few studies have worked on training models for languages other than English. An example is the work presented in [15], which focuses on developing automated claim detection for Dutch-language fact-checkers.

In this work, we propose FEVER-IT, a dataset in which FEVER has been translated into Italian to train models for the Italian language. Inspired by SQuAD-IT [16] and MSCOCO-IT [17], we worked to obtain quality data. Although the training set may be affected by translation errors, the test set is not, as it is composed of manually validated data. Furthermore, while the original FEVER dataset contained evidence only for Supports and Refutes, in this work we have also added and translated examples for the NotEnoughInfo category using the heuristics proposed in [18]. This work extends the experience described in [19], where translations were done using the Google API, by using publicly available models ([20]) and adding data for the NotEnoughInfo category.

The contribution of this work is twofold. Firstly, we release FEVER-IT, a corpus with 228K claims, each associated with at least one (possibly useful) piece of evidence, including a test set of 2,063 manually validated claims. In addition, we fine-tuned and validated a state-of-the-art model, LLaMA3 [14], on both the original English dataset and the Italian dataset. While this provides a high-performance model ready for the task in both languages, the primary goal is to assess whether the quality of the Italian data is comparable to the English one. By training the model separately on each dataset, we can evaluate its stability: if the model performs similarly on the manually validated Italian test set and on the English test set, we can conclude that the quality of the Italian data is on par with the English data.

Additionally, we want to assess whether using an Italian training set, despite the noise from automatic translation, is truly beneficial. LLMs like LLaMA3 can already perform tasks in other languages through zero-shot or few-shot learning, without requiring fine-tuning on a specific dataset, especially if that dataset is noisy. Therefore, we aim to compare the performance on the test set of a LLaMA3 model that has not been fine-tuned on the noisy Italian data with one that has, to determine whether fine-tuning actually improves results or whether the model performs on par or better without it.

The experimental results show that the model without fine-tuning achieves an average accuracy of only about 45%. Fine-tuning on the English dataset yields about 90% mean accuracy, while fine-tuning on the Italian dataset results in a percentage quite similar to the fine-tuned English model and much greater than testing without fine-tuning¹.

The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents FEVER-IT, Section 4 details the experimental measures, and Section 5 provides the conclusions.

¹ The resource, fine-tuned models, and code will be released on a dedicated repository: https://github.com/crux82/FEVER-it

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
Email: scaiella@revealsrl.it (A. Scaiella); stefano.costanzo@students.uniroma2.eu (S. Costanzo); passone@ing.uniroma2.it (E. Passone); croce@info.uniroma2.it (D. Croce); giorgio.gambosi@uniroma2.it (G. Gambosi)
ORCID: 0000-0001-9111-1950 (D. Croce); 0000-0001-9979-6931 (G. Gambosi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org, ISSN 1613-0073)

2. Related Work

One of the pioneering works in autonomous fact-checking was conducted by [21], which proposed creating publicly available datasets and developing automated systems using natural language processing technologies. Recent challenges, such as CheckThat! at CLEF [10, 11, 12] and FEVER [7, 8, 9] from 2018, have advanced fact-checking tasks by leveraging advanced approaches and integrating Large Language Models (LLMs) like BERT and GPT. These models represent the current state of the art in many Natural Language Processing tasks, including fact-checking. Notable examples of such technology include FacTeR-Check [22], a multilingual architecture for semi-automated fact-checking and hoax propagation analysis using the XLM-RoBERTa Transformer [13], and FACT-GPT [23], a framework that automates the claim-matching phase of fact-checking using LLMs to identify social media content that supports or contradicts claims previously debunked by fact-checkers.

The success of these systems is largely due to the capabilities of LLMs, as summarized in [3], which are neural models based on the Transformer architecture. Specifically, decoder-based architectures, such as GPT [24], GPT-3 [25], and LLaMA [14], generate output sequences in an auto-regressive manner. These models have demonstrated impressive capabilities following pre-training on large collections of documents. One notable outcome is few-shot learning, where models can adapt to new tasks with only a few examples [25], greatly enhancing their flexibility and applicability.

When new annotated data is available, fine-tuning further enhances a model's capabilities. This process involves taking the pre-trained base model and training it on a smaller, specialized dataset relevant to the desired task. Parameter-Efficient Fine-Tuning (PEFT) is an optimized technique that involves training only a small portion of the weights, typically by adding a new layer to the model. One widely used technique is LoRA [26], which adds an adapter consisting of two matrices of weights that are relatively small compared to the original model. ExtremITA [27] is an example of a decoder-based model fine-tuned with LoRA in Italian for multi-task executions.

Several benchmark datasets have been developed to fine-tune and evaluate fact-checking systems, typically collected by organizations like Snopes, FullFact, and PolitiFact. The FEVER challenge has produced four major datasets: FEVER (2018) [6], FEVER 2.0 (2019) [8], FEVEROUS (2021) [9], and AVeriTeC (2024) [28]. These datasets range from labeled claim-evidence associations to verified claims with structured and unstructured evidence. Despite the wealth of resources available, there is a lack of large benchmark datasets in Italian. This work addresses this gap by providing a large-scale Italian resource.

3. Fact Verification in Italian

As in [6], the original FEVER dataset is composed of claims that can potentially be verified against an encyclopedic resource, in this case, Wikipedia. The claims are classified into three categories: Supported, Refutes and NotEnoughInfo. For the first two categories, each claim is associated with one or more passages from Wikipedia, each specifying the page from which it was extracted. For the NotEnoughInfo category, no passages are provided, because no information was found on Wikipedia to support or refute the claim. For instance, the sentence "Dan Brown is illiterate." is a claim associated with pieces of evidence such as: "Angels and Demons is a 2000 best-selling mystery-thriller novel written by American author Dan Brown and published by Pocket Books and then by Corgi Books.". These pieces of evidence prove that the claim is incorrect, so it can be classified with the label Refutes. In FEVER, a claim is thus a sentence that expresses information (true or mutated) about a target entity.

To generate the Italian dataset, we started from the dataset version² proposed in [29], which consists of 260k claims. This version extends the original FEVER by adding evidence associated with claims labeled as NotEnoughInfo in FEVER, using the heuristics in [18]. The approach involved using a search engine to retrieve potential evidence and a textual entailment system based on GPT [24]. Claims not judged as Supports or Refutes were classified as NotEnoughInfo. This gives us examples of sentences that are closely related to the claim (according to the search engine) but neither support nor refute it. This makes it more straightforward and efficient to train and/or evaluate a classifier, even though some of the derived examples might be somewhat noisy, as they were generated through heuristics.

For the automatic translation process, we utilized MADLAD400 [20], a machine translation system based on the Transformer architecture³, trained on MADLAD, a manually audited, general-domain 3T-token multilingual dataset based on CommonCrawl, spanning 419 languages. Since the Italian data are obtained through machine translation, and thus potentially incorrect as suggested in [16, 17], we needed validated test data to obtain a realistic benchmark. Our hypothesis is that an LLM is robust enough to generalize from the 228k examples and recognize the relationships involved in FEVER without inheriting translation errors. However, to prevent these errors from being inherited by the model, we manually corrected the translations of the test set.

Out of the approximately 16k available test examples, three annotators were involved in verifying and correcting 2,063 translations from the test set. The annotators focused on correcting mistakes related to the proper sentence structure in Italian, the accurate meaning of specific English words that MADLAD had translated literally, any misunderstandings of the intended meaning in Italian, and a few grammatical errors.

In some cases, translation errors do not completely undermine the examples with respect to the task's purpose. For instance, the English sentence from an evidence, "he was booked to win a third world championship at a WWE event on the night of his death", was translated into Italian as "era stato prenotato per vincere un terzo titolo mondiale in un evento della WWE la notte della sua morte". A more accurate translation would be "si pensava avrebbe vinto un terzo titolo mondiale in un evento della WWE la notte della sua morte", better capturing the verb's meaning. In other, more problematic cases, translation errors, loss of information, or introduction of hallucinations could even change the classification in the fact verification task. For example, for the claim "The Thin Red Line (1998 film) has an all-British cast.", the automatic translation was "La sottile linea rossa (The Thin Red Line) è un film del 1998.", which is incorrect because it omits the information about the cast. This detail is crucial, as its absence could lead to incorrect labeling.

            BLEU-1   BLEU-2   BLEU-3   BLEU-4
  Claim     0.9776   0.9695   0.9623   0.9544
  Evidence  0.9529   0.9411   0.9309   0.9207

Table 1: BLEU scores of the manually validated claims and evidence (gold) with respect to their automatic translations (silver).

             Train (S)   Dev (S)   Test (G)     Total
  Supports     114,801     4,638        654   120,095
  Refutes       47,096     4,887        643    52,626
  NEI           66,380     6,410        766    73,556
  Total        228,277    15,935      2,063   246,275

Table 2: Number of claims and evidence in the Italian dataset. (S) indicates silver data (automatically translated), and (G) indicates gold data (manually validated).

A quantitative analysis of the translation quality suggests that MADLAD performs well in translating simple assertive sentences such as claims. In fact, 91% of the claims were not altered by the validators, who considered them completely correct. This percentage is lower for the Wikipedia passages, dropping to 76%. This discrepancy may be due to the greater complexity of the evidence compared to the simpler sentence structures in the claims. Additionally, we report the results in terms of BLEU score [30] for the corrected translations compared to the originals, as shown in Table 1. It should be noted that measuring the translation quality after correcting the sentences introduces a strong bias in the measurements; however, it provides a more specific idea of the translation quality, especially in understanding the potential noisiness of the training and development sentences. In this case, results of over 95% for BLEU-1 and over 92% for BLEU-4 suggest that very few terms were altered during validation, and even the grammatical patterns remained largely unchanged. At most, a few mistranslated terms needed updating, as indicated by the qualitative analysis.

Table 2 summarizes the number of examples created for the Italian dataset. In line with the original English material, the dataset is divided into training, development, and test sets, with claims categorized into Supports, Refutes, and NotEnoughInfo (NEI). The table also distinguishes between silver data (automatically translated) and gold data (manually validated). The training set consists of 228,277 claims, the development set contains 15,935 claims, and the test set has 2,063 claims. Each Italian claim or evidence is aligned with the English counterpart, facilitating future research in cross-lingual fact verification.

Language Models for Fact Verification. To address the capabilities of Large Language Models in Fact Verification, they can be utilized through In-Context Learning techniques [31] or by directly fine-tuning the model for specific downstream tasks. In-context learning relies on the model's pre-existing knowledge acquired during pre-training and on instructions provided in natural language at inference time. This method does not involve additional training and can be categorized based on the number of examples provided: i) 0-shot Learning, where no examples are given, and the model generates responses based solely on its pre-existing knowledge and the provided instructions; ii) 1-shot Learning, where one example per class is added to provide a more precise context, helping the model better understand the task by offering a concrete reference point; iii) Few-shot Learning, where more than one example per class is provided to give the model additional contextual information during decision-making. When the model's pre-existing knowledge is insufficient, we can fine-tune it on the downstream task. Fine-tuning involves training the model in a traditional manner using input-output pairs (training data) to adjust its parameters. This process improves the model's performance on specific tasks, allowing it to learn from a more extensive set of examples. As a result, the model becomes more adept at handling similar queries in the future, with a focus on the specific task at hand. We thus evaluated the application of a state-of-the-art LLM, namely LLaMA3 [32], by providing just the definition of the task (zero-shot), by adding an example (one-shot), or by performing fine-tuning, to demonstrate the necessity of a training dataset like the one constructed in this work, as discussed in the following section.

² https://huggingface.co/datasets/copenlu/fever_gold_evidence
³ https://github.com/google-research/google-research/tree/master/madlad_400

4. Experimental Evaluation

The goal of our experimentation is to assess the performance of a state-of-the-art LLM applied to Fact Verification. Specifically, we aim to determine whether a multilingual model maintains consistent quality when applied to both the English FEVER dataset and our Italian dataset.

We utilize LLaMA3-Instruct⁴, an instruction-tuned generative text model from META with 8 billion parameters, released in April 2024. This model is trained to execute specific instructions or prompts across various tasks. To ensure alignment, we evaluate the systems on the manually validated Italian test set and on the same subset of 2,063 claims in the English counterpart. The model is evaluated in 0-shot and 1-shot settings to assess its capability without fine-tuning. The prompts used in English and Italian are provided in Appendix A. Additionally, we fine-tuned LLaMA3 on the English datasets from [29] and separately on the Italian datasets obtained via machine translation. Fine-tuning was conducted on an NVIDIA A100 using the LoRA technique⁵.

In FEVER, the title of the document associated with each claim often provides crucial context. For example, the claim "The University of Leicester discovered and identified the remains of a king." relies on the document titled "University of Leicester" to correctly classify the claim as Supports. To assess the model's generalization, we evaluate the impact of including document titles in prompts. The metrics used to analyze the results are recall, precision, accuracy, and F1 score, calculated globally and for each label (Supports, Refutes, NotEnoughInfo).

The results are reported in Tables 3 and 4 for the English and Italian datasets, respectively. Each table shows whether the model underwent fine-tuning (column FT), whether a prompt without examples (0-shot) or with one example per class (1-shot) was used (column Prompt), and whether the document title was included (column Doc). Notably, if no fine-tuning was performed, the original LLaMA3-Instruct model was used. Given that the system's response can consist of multiple words, we search the output for the mention of one of the classes and associate the example with that class. If no class is identified, the result is classified as NotEnoughInfo. In general, the fine-tuned model is extremely stable, consistently outputting one of the three categories for every request. The non-fine-tuned model, on rare occasions (just a few dozen times out of 2,000), produces responses that do not correspond to any of the required classes. This highlights the inherent stability of LLaMA3 while also supporting the soundness of the results achieved.

⁴ https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
⁵ The following hyperparameters were used: a learning rate of 0.0001, two epochs, LoRA_R set to 8, LoRA_alpha set to 16, and LoRA_dropout at 0.05. The micro-batch size was 2, and gradient accumulation steps were set to 8.
  FT   Prompt  Doc |  Acc  | Support (P / R / F1)  | Refutes (P / R / F1)  | NEI (P / R / F1)      | Macro Avg (P / R / F1)
  No   0-shot  No  | 0.449 | 0.784 / 0.161 / 0.267 | 0.647 / 0.236 / 0.346 | 0.395 / 0.873 / 0.544 | 0.609 / 0.423 / 0.386
  No   0-shot  Yes | 0.374 | 0.343 / 0.976 / 0.507 | 0.763 / 0.160 / 0.265 | 0.477 / 0.041 / 0.075 | 0.528 / 0.392 / 0.282
  No   1-shot  No  | 0.591 | 0.555 / 0.864 / 0.675 | 0.699 / 0.415 / 0.521 | 0.586 / 0.507 / 0.543 | 0.613 / 0.595 / 0.580
  No   1-shot  Yes | 0.383 | 0.929 / 0.020 / 0.039 | 0.867 / 0.020 / 0.040 | 0.376 / 0.999 / 0.546 | 0.724 / 0.346 / 0.208
  Yes  0-shot  No  | 0.917 | 0.932 / 0.947 / 0.939 | 0.924 / 0.888 / 0.906 | 0.899 / 0.916 / 0.908 | 0.918 / 0.917 / 0.918
  Yes  0-shot  Yes | 0.922 | 0.938 / 0.953 / 0.945 | 0.929 / 0.896 / 0.912 | 0.902 / 0.918 / 0.910 | 0.923 / 0.922 / 0.923
  Yes  1-shot  No  | 0.914 | 0.928 / 0.948 / 0.938 | 0.927 / 0.883 / 0.905 | 0.893 / 0.911 / 0.902 | 0.916 / 0.914 / 0.915
  Yes  1-shot  Yes | 0.921 | 0.931 / 0.956 / 0.943 | 0.927 / 0.891 / 0.909 | 0.907 / 0.916 / 0.912 | 0.922 / 0.921 / 0.921

Table 3: Performance in terms of Accuracy, Precision, Recall and F1-measure of our systems on the FEVER-EN dataset.

  FT   Prompt  Doc |  Acc  | Support (P / R / F1)  | Refutes (P / R / F1)  | NEI (P / R / F1)      | Macro Avg (P / R / F1)
  No   0-shot  No  | 0.462 | 0.411 / 0.951 / 0.574 | 0.607 / 0.457 / 0.522 | 0.585 / 0.050 / 0.092 | 0.534 / 0.486 / 0.396
  No   0-shot  Yes | 0.507 | 0.463 / 0.942 / 0.620 | 0.587 / 0.663 / 0.622 | 0.800 / 0.005 / 0.010 | 0.617 / 0.537 / 0.418
  No   1-shot  No  | 0.425 | 0.376 / 0.963 / 0.541 | 0.671 / 0.333 / 0.445 | 0.478 / 0.043 / 0.079 | 0.508 / 0.446 / 0.355
  No   1-shot  Yes | 0.462 | 0.403 / 0.968 / 0.569 | 0.632 / 0.361 / 0.459 | 0.698 / 0.115 / 0.197 | 0.578 / 0.481 / 0.409
  Yes  0-shot  No  | 0.897 | 0.897 / 0.940 / 0.918 | 0.924 / 0.845 / 0.882 | 0.877 / 0.903 / 0.890 | 0.899 / 0.896 / 0.897
  Yes  0-shot  Yes | 0.901 | 0.899 / 0.936 / 0.917 | 0.923 / 0.855 / 0.888 | 0.887 / 0.910 / 0.898 | 0.903 / 0.900 / 0.901
  Yes  1-shot  No  | 0.895 | 0.891 / 0.947 / 0.918 | 0.919 / 0.843 / 0.879 | 0.881 / 0.894 / 0.887 | 0.897 / 0.895 / 0.895
  Yes  1-shot  Yes | 0.905 | 0.913 / 0.942 / 0.927 | 0.924 / 0.854 / 0.888 | 0.883 / 0.915 / 0.899 | 0.907 / 0.904 / 0.905

Table 4: Performance in terms of Accuracy, Precision, Recall and F1-measure of our systems on the FEVER-IT dataset.
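The per-class and macro-averaged figures reported in Tables 3 and 4 follow the standard definitions of precision, recall and F1. A self-contained sketch of how such scores are computed (the toy gold/predicted lists are illustrative, not the paper's outputs):

```python
# Per-class precision/recall/F1 and their macro average, as reported
# in Tables 3 and 4. The toy gold/pred lists below are illustrative,
# not the paper's actual predictions.

CLASSES = ["Supports", "Refutes", "NotEnoughInfo"]

def scores(gold, pred):
    per_class = {}
    for c in CLASSES:
        tp = sum(g == p == c for g, p in zip(gold, pred))
        fp = sum(p == c and g != c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = (prec, rec, f1)
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    macro = tuple(sum(v[i] for v in per_class.values()) / len(CLASSES)
                  for i in range(3))
    return acc, per_class, macro

gold = ["Supports", "Refutes", "NotEnoughInfo", "Supports"]
pred = ["Supports", "Refutes", "Supports", "NotEnoughInfo"]
acc, per_class, macro = scores(gold, pred)
print(acc)  # 0.5
```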
A key finding is that the multilingual model generally achieves similar, though modest, results on the English and Italian datasets without fine-tuning, with accuracy values around 0.40-0.50 and average F1 scores in the range of 0.35-0.55. This performance is relatively unstable, and the addition of an example in the prompt does not lead to significant improvements. In English, there are some improvements, but in Italian there are fewer. We believe this is because, although LLaMA is multilingual, the percentage of Italian examples observed during training is less than 1%, making it less performant and less stable in this language.

However, when fine-tuning is applied, the results improve dramatically, with accuracy exceeding 90% in both languages. This demonstrates the utility of the translated dataset, even if it contains some noise. In this scenario, adding an example in the prompt leads to negligible but consistent improvements. Additionally, the inclusion of the document title, while sometimes causing inconsistencies in zero-shot learning, is better utilized by the fine-tuned model, leading to slight but not significant improvements. This is interesting because it suggests that a model not relying on document titles is more broadly applicable. Overall, the fine-tuned models perform significantly better, highlighting the importance of the translated dataset for achieving high accuracy in fact verification tasks in both English and Italian.

The error analysis suggests that the model sometimes inherits the mathematical reasoning limitations of the LLM. For example, the claim "Il Castello di Praga attira oltre 18 milioni di visitatori ogni anno."⁶ was given the evidence "Il castello è tra le attrazioni turistiche più visitate di Praga che attira oltre 1,8 milioni di visitatori all'anno."⁷ The model's predicted label was Refutes, while the true label was Supports. Here, the true label should be Supports, since 18 million is indeed greater than 1.8 million, but the model found the numbers inconsistent. In another case, the claim "Ned Stark è stato introdotto nel 1996 in Tempesta di spade."⁸ was paired with the evidence "Introdotto nel 1996 in Il Trono di Spade, Ned è l'onorevole signore di Winterfell, un'antica fortezza nel nord del continente immaginario di Westeros."⁹ The model predicted Refutes, although the true label was Supports. The confusion here is due to the difference in the book titles, which are from the same series but are distinct works.

The error analysis revealed that the model occasionally struggled with mathematical reasoning and contextual understanding, highlighting areas for future enhancement. Larger models and further fine-tuning could potentially address these issues, which remain open questions for future research.

⁶ In English: "The Prague Castle attracts over 18 million visitors every year."
⁷ In English: "The castle is among the most visited tourist attractions in Prague, attracting over 1.8 million visitors every year."
⁸ In English: "Ned Stark was introduced in 1996 in A Storm of Swords."
⁹ In English: "Introduced in 1996 in A Game of Thrones, Ned is the honorable lord of Winterfell, an ancient fortress in the north of the imaginary continent of Westeros."

5. Conclusion

In this work, we have introduced FEVER-IT, an Italian version of the FEVER dataset, designed to improve the training and evaluation of models for fact verification in the Italian language. Using a machine translation system, we translated a large-scale dataset of 228,000 claim/evidence pairs and manually validated 2,063 test instances to ensure meaningful evaluations. This enabled us to fine-tune a state-of-the-art LLM, specifically LLaMA3, and assess its performance in both English and Italian.

Our experiments demonstrated that the multilingual model, without fine-tuning, performed similarly on both the English and Italian datasets, though accuracy and stability were limited. Fine-tuning significantly improved the model's performance, achieving over 90% accuracy in both languages. This underscores the importance and effectiveness of the translated dataset, even if it contains some noise.

Future work will explore the performance of larger models and further refinement of the dataset to enhance accuracy and generalization capabilities, or explore more complex settings such as those described in [9].

Acknowledgments

The team would like to thank Monika Kakol for her invaluable support in the validation of the translations. This work was supported by Project ECS 0000024 Rome Technopole, CUP B83C22002820006, NRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.

References

[1] Z. Guo, M. S. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Trans. Assoc. Comput. Linguistics 10 (2022) 178–206.
[2] T. Flew, C. Spurgeon, A. Daniel, A. Swift, The promise of computational journalism, Journalism Practice 6 (2012) 157–171.
[3] C. Chen, K. Shu, Combating misinformation in the age of LLMs: Opportunities and challenges, 2023. URL: https://arxiv.org/abs/2311.05656. arXiv:2311.05656.
[4] M. Akhtar, M. Schlichtkrull, Z. Guo, O. Cocarascu, E. Simperl, A. Vlachos, Multimodal automated fact-checking: A survey, 2023. URL: https://arxiv.org/abs/2305.13507. arXiv:2305.13507.
[5] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3346–3359. URL: https://aclanthology.org/C18-1283.
[6] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and VERification, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 809–819. URL: https://aclanthology.org/N18-1074. doi:10.18653/v1/N18-1074.
[7] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The fact extraction and VERification (FEVER) shared task, in: Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1–9. URL: https://aclanthology.org/W18-5501. doi:10.18653/v1/W18-5501.
[8] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The FEVER2.0 shared task, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1–6. URL: https://aclanthology.org/D19-6601. doi:10.18653/v1/D19-6601.
[9] R. Aly, Z. Guo, M. S. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task, in: Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Dominican Republic, 2021, pp. 1–13. URL: https://aclanthology.org/2021.fever-1.1. doi:10.18653/v1/2021.fever-1.1.
[10] P. Nakov, G. D. S. Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Proceedings of the 43rd European Conference on Information Retrieval, ECIR '21, Lucca, Italy, 2021, pp. 639–649. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_75.
[11] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[12] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[14] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[15] B. Berendt, P. Burger, R. Hautekiet, J. Jagers, A. Pleijter, P. Van Aelst, FactRank: Developing automated claim detection for Dutch-language fact-checkers, Online Social Networks and Media 22 (2021) 100113. doi:10.1016/j.osnem.2020.100113.
[16] D. Croce, A. Zelenanska, R. Basili, Enabling deep learning for large scale question answering in Italian, Intelligenza Artificiale 13 (2019) 49–61. URL: https://doi.org/10.3233/IA-190018. doi:10.3233/IA-190018.
[17] A. Scaiella, D. Croce, R. Basili, Large scale datasets for image and video captioning in Italian, Italian Journal of Computational Linguistics 2 (2019) 49–60. URL: http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf.
[18] C. Malon, Team Papelo: Transformer networks at FEVER, in: J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal (Eds.), Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 109–113. URL: https://aclanthology.org/W18-5517. doi:10.18653/v1/W18-5517.
[19] L. Canale, A. Messina, Experimenting AI technologies for disinformation combat: the IDMO project, 2023. URL: https://arxiv.org/abs/2310.11097. arXiv:2310.11097.
[20] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, D. Xin, A. Kusupati, R. Stella, A. Bapna, O. Firat, MADLAD-400: A multilingual and document-level large audited dataset, in: Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 67284–67296.
[21] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: C. Danescu-Niculescu-Mizil, J. Eisenstein, K. McKeown, N. A. Smith (Eds.), Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, 2014, pp. 18–22. URL: https://aclanthology.org/W14-2508. doi:10.3115/v1/W14-2508.
[22] A. Martín, J. Huertas-Tato, Á. Huertas-García, G. Villar-Rodríguez, D. Camacho, FacTeR-Check: Semi-automated fact-checking through semantic similarity and natural language inference, Knowledge-Based Systems 251 (2022) 109265. doi:10.1016/j.knosys.2022.109265.
[23] E. C. Choi, E. Ferrara, Automated claim matching with large language models: Empowering fact-checkers in the fight against misinformation, in: Companion Proceedings of the ACM on Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1441–1449. URL: https://doi.org/10.1145/3589335.3651910. doi:10.1145/3589335.3651910.
[24] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, CoRR abs/1801.06146 (2018). URL: http://arxiv.org/abs/1801.06146. arXiv:1801.06146.
[25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
[27] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper13.pdf.
[28] M. Schlichtkrull, Z. Guo, A. Vlachos, AVeriTeC: A dataset for real-world claim verification with evidence from the web, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023.

Appendix A

The English prompt used for the task begins as follows:

    Evaluate if the claim is supported by the evidence provided.
    Definitions for key terms used in this task are:
    - Claim: A statement or assertion under examination.
    - Evidence: Information that either supports or opposes the claim.
Advances in Neural Information Processing Sys- tems, volume 36, Curran Associates, Inc., 2023, pp. Answer w i t h one o f t h e f o l l o w i n g j u d g m e n t s b a s e d on t h e e v i d e n c e p r o v i d e d : 65128–65167. − SUPPORTS : i f t h e e v i d e n c e s u b s t a n t i a t e s t h e [29] P. Atanasova, D. Wright, I. Augenstein, Gener- claim . ating label cohesive and well-formed adversarial − REFUTES : i f t h e e v i d e n c e d i r e c t l y claims, in: Proceedings of the 2020 Conference c o n t r a d i c t s the claim . on Empirical Methods in Natural Language Pro- − NOT ENOUGH INFO : i f t h e r e i s i n s u f f i c i e n t evidence to determine the claim ’ s cessing (EMNLP), Association for Computational validity Linguistics, Online, 2020, pp. 3168–3177. URL: https: ### I n p u t //aclanthology.org/2020.emnlp-main.256. doi:10. − Claim : [ CLAIM HERE ] 18653/v1/2020.emnlp-main.256. − E v i d e n c e : [ EVIDENCE HERE ] [30] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a # # # Answer : [ANSWER HERE ] method for automatic evaluation of machine trans- lation, in: Proceedings of the 40th Annual Meet- ing on Association for Computational Linguistics, A.1.2. 1-shot Setting ACL ’02, Association for Computational Linguis- The following prompt is used for 1-shot learning, where tics, USA, 2002, p. 311–318. URL: https://doi.org/ the task and classes are explained, and one example per 10.3115/1073083.1073135. doi:10.3115/1073083. class is provided. Notice that only the evidence is re- 1073135. ported without the title of the original document. [31] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, X. Sun, L. Li, Z. Sui, A ### I n s t r u c t i o n E v a l u a t e i f t h e c l a i m i s s u p p o r t e d by t h e survey on in-context learning, 2024. URL: https: e v i d e n c e p r o v i d e d . D e f i n i t i o n s f o r key //arxiv.org/abs/2301.00234. arXiv:2301.00234. 
terms used in t h i s t a s k are : [32] AI@Meta, Llama 3 model card, 2024. URL: − Claim : A s t a t e m e n t o r a s s e r t i o n un der https://github.com/meta-llama/llama3/blob/main/ examination . MODEL_CARD.md. − Evidence : Information that e i t h e r supports or opposes the claim . Answer w i t h one o f t h e f o l l o w i n g j u d g m e n t s b a s e d on t h e e v i d e n c e p r o v i d e d : A. Prompting Engineering − SUPPORTS : i f t h e e v i d e n c e s u b s t a n t i a t e s t h e claim . This appendix contains the prompts used in the exper- − REFUTES : i f t h e e v i d e n c e d i r e c t l y iments. The prompts are provided in both Italian and c o n t r a d i c t s the claim . English, reflecting the task-specific nature of the experi- − NOT ENOUGH INFO : i f t h e r e i s i n s u f f i c i e n t ments. Each prompt begins with an explanation of the evidence to determine the claim ’ s validity task and the meaning of the classes. In the different vari- ants, the 0-shot setting does not include any examples, # # # Examples unlike the 1-shot setting. Where necessary, the name of These e x a m p l e s d e m o n s t r a t e how t o a p p l y t h e the document from which the evidence is taken is also evaluation c r i t e r i a : specified. − Claim : The Germanic p e o p l e s a r e a l s o c a l l e d Gothic . − E v i d e n c e : The Germanic p e o p l e s ( a l s o A.1. Prompts in English r e f e r r e d to as Teutonic , Suebian , or G o t h i c i n o l d e r l i t e r a t u r e ) a r e an Indo − A.1.1. 0-shot Setting European ethno − l i n g u i s t i c group o f N o r t h e r n European o r i g i n . The following prompt is used for 0-shot learning, where − Answer : SUPPORTS the task and classes are presented without additional information. − Claim : T e n n i s i s n o t a s p o r t . 
− E v i d e n c e : T e n n i s i s p l a y e d by m i l l i o n s o f ### I n s t r u c t i o n r e c r e a t i o n a l p l a y e r s and i s a l s o a popular worldwide s p e c t a t o r s p o r t . − Answer : REFUTES − Document : d e n o t e s t h e s o u r c e document f o r the evidence . − Claim : Kick − Ass i s a h o r r o r f i l m . − E v i d e n c e : Kick − Ass i s a 2 0 1 0 B r i t i s h − Answer w i t h one o f t h e f o l l o w i n g j u d g m e n t s American f i l m b a s e d on t h e comic book o f b a s e d on t h e e v i d e n c e p r o v i d e d : t h e same name by Mark M i l l a r and John − SUPPORTS : i f t h e e v i d e n c e s u b s t a n t i a t e s t h e Romita , J r . claim . − Answer : NOT ENOUGH INFO − REFUTES : i f t h e e v i d e n c e d i r e c t l y ### I n p u t c o n t r a d i c t s the claim . − Claim : [ CLAIM HERE ] − NOT ENOUGH INFO : i f t h e r e i s i n s u f f i c i e n t − E v i d e n c e : [ EVIDENCE HERE ] evidence to determine the claim ’ s # # # Answer : [ANSWER HERE ] validity # # # Examples These e x a m p l e s d e m o n s t r a t e how t o a p p l y t h e A.1.3. 0-shot Setting with Document Title evaluation c r i t e r i a : The following prompt is used for 0-shot learning, where − Claim : The Germanic p e o p l e s a r e a l s o c a l l e d Gothic . the task and classes are explained without additional − E v i d e n c e : The Germanic p e o p l e s ( a l s o information. Each input evidence is provided with the r e f e r r e d to as Teutonic , Suebian , or title of its original document. G o t h i c i n o l d e r l i t e r a t u r e ) a r e an Indo − European ethno − l i n g u i s t i c group o f ### I n s t r u c t i o n N o r t h e r n European o r i g i n . E v a l u a t e i f t h e c l a i m i s s u p p o r t e d by t h e − Document : Germanic p e o p l e s e v i d e n c e p r o v i d e d . 
D e f i n i t i o n s f o r key − Answer : SUPPORTS terms used in t h i s t a s k are : − Claim : A s t a t e m e n t o r a s s e r t i o n u nd e r − Claim : T e n n i s i s n o t a s p o r t . examination . − E v i d e n c e : T e n n i s i s p l a y e d by m i l l i o n s o f − Evidence : Information that e i t h e r supports r e c r e a t i o n a l p l a y e r s and i s a l s o a or opposes the claim . popular worldwide s p e c t a t o r s p o r t . − Document : d e n o t e s t h e s o u r c e document f o r − Document : T e n n i s the evidence . − Answer : REFUTES Answer w i t h one o f t h e f o l l o w i n g j u d g m e n t s − Claim : Kick − Ass i s a h o r r o r f i l m . b a s e d on t h e e v i d e n c e p r o v i d e d : − E v i d e n c e : Kick − Ass i s a 2 0 1 0 B r i t i s h − − SUPPORTS : i f t h e e v i d e n c e s u b s t a n t i a t e s t h e American f i l m b a s e d on t h e comic book o f claim . t h e same name by Mark M i l l a r and John − REFUTES : i f t h e e v i d e n c e d i r e c t l y Romita , J r . c o n t r a d i c t s the claim . − Document : Kick − Ass ( f i l m ) − NOT ENOUGH INFO : i f t h e r e i s i n s u f f i c i e n t − Answer : NOT ENOUGH INFO evidence to determine the claim ’ s ### I n p u t validity − Claim : [ CLAIM HERE ] ### I n p u t − E v i d e n c e : [ EVIDENCE HERE ] − Claim : [ CLAIM HERE ] − Document : [DOCUMENT HERE ] − E v i d e n c e : [ EVIDENCE HERE ] # # # Answer : [ANSWER HERE ] − Document : [DOCUMENT HERE ] # # # Answer : [ANSWER HERE ] A.2. Prompts in Italian A.1.4. 1-shot Setting with Document Title A.2.1. 0-shot Setting The following prompt is used for 1-shot learning, where The following prompt is used for 0-shot learning, where the task and classes are explained, and one example per the task and classes are presented without additional class is provided. Each input evidence is provided with information. the title of its original document. 
### I s t r u z i o n i ### I n s t r u c t i o n Valuta se l ’ affermazione è supportata d a l l e E v a l u a t e i f t h e c l a i m i s s u p p o r t e d by t h e p r o v e f o r n i t e . Le d e f i n i z i o n i d e i e v i d e n c e p r o v i d e d . D e f i n i t i o n s f o r key termini chiave u t i l i z z a t i in questo terms used in t h i s t a s k are : c o m p i t o sono : − Claim : A s t a t e m e n t o r a s s e r t i o n u nd e r − A f f e r m a z i o n e : Una d i c h i a r a z i o n e o examination . a s s e r z i o n e s o t t o esame . − Evidence : Information that e i t h e r supports − P r o v a : I n f o r m a z i o n i che s u p p o r t a n o o or opposes the claim . contraddicono l ’ affermazione . R i s p o n d i con uno d e i s e g u e n t i g i u d i z i b a s a t i ### I n p u t s u l l e prove f o r n i t e : − A f f e r m a z i o n e : [ CLAIM HERE ] − SUPPORTS : s e l e p r o v e co nf erm ano l ’ − P r o v a : [ EVIDENCE HERE ] affermazione . # # # R i s p o s t a : [ANSWER HERE ] − REFUTES : s e l e p r o v e c o n t r a d d i c o n o direttamente l ’ affermazione . − NOT ENOUGH INFO : s e l e p r o v e non sono s u f f i c i e n t i per determinare l a v a l i d i t à A.2.3. 0-shot Setting with Document Title dell ’ affermazione . The following prompt is used for 0-shot learning, where ### I n p u t − A f f e r m a z i o n e : [ CLAIM HERE ] the task and classes are explained without additional − P r o v a : [ EVIDENCE HERE ] information. Each input evidence is provided with the # # # R i s p o s t a : [ANSWER HERE ] title of its original document. ### I s t r u z i o n i Valuta se l ’ affermazione è supportata d a l l e A.2.2. 1-shot Setting p r o v e f o r n i t e . 
Le d e f i n i z i o n i d e i termini chiave u t i l i z z a t i in questo The following prompt is used for 1-shot learning, where c o m p i t o sono : the task and classes are explained, and one example per − A f f e r m a z i o n e : Una d i c h i a r a z i o n e o class is provided. Notice that only the evidence is re- a s s e r z i o n e s o t t o esame . ported without the title of the original document. − P r o v a : I n f o r m a z i o n i che s u p p o r t a n o o contraddicono l ’ affermazione . ### I s t r u z i o n i − Documento : i n d i c a l a f o n t e da c u i è s t a t a Valuta se l ’ affermazione è supportata d a l l e e s t r a t t a l a prova . p r o v e f o r n i t e . Le d e f i n i z i o n i d e i termini chiave u t i l i z z a t i in questo R i s p o n d i con uno d e i s e g u e n t i g i u d i z i b a s a t i c o m p i t o sono : s u l l e prove f o r n i t e : − A f f e r m a z i o n e : Una d i c h i a r a z i o n e o − SUPPORTS : s e l e p r o v e co nf erm ano l ’ a s s e r z i o n e s o t t o esame . affermazione . − P r o v a : I n f o r m a z i o n i che s u p p o r t a n o o − REFUTES : s e l e p r o v e c o n t r a d d i c o n o contraddicono l ’ affermazione . direttamente l ’ affermazione . − NOT ENOUGH INFO : s e l e p r o v e non sono R i s p o n d i con uno d e i s e g u e n t i g i u d i z i b a s a t i s u f f i c i e n t i per determinare l a v a l i d i t à s u l l e prove f o r n i t e : dell ’ affermazione . − SUPPORTS : s e l e p r o v e co nf erm ano l ’ ### I n p u t affermazione . − A f f e r m a z i o n e : [ CLAIM HERE ] − REFUTES : s e l e p r o v e c o n t r a d d i c o n o − P r o v a : [ EVIDENCE HERE ] direttamente l ’ affermazione . − Documento : [DOCUMENT HERE ] − NOT ENOUGH INFO : s e l e p r o v e non sono # # # R i s p o s t a : [ANSWER HERE ] s u f f i c i e n t i per determinare l a v a l i d i t à dell ’ affermazione . # # # Esempi A.2.4. 
1-shot Setting with Document Title Q u e s t i e s e m p i d i m o s t r a n o come a p p l i c a r e i The following prompt is used for 1-shot learning, where c r i t e r i di valutazione : − A f f e r m a z i o n e : I p o p o l i g e r m a n i c i sono the task and classes are explained, and one example per c h i a m a t i anche g o t i c i . class is provided. Each input evidence is provided with − P r o v a : I p o p o l i g e r m a n i c i ( anche c h i a m a t i the title of its original document. Teutoni , Suebi o Goti n e l l a l e t t e r a t u r a p i ù a n t i c a ) sono un gruppo etno − ### I s t r u z i o n i l i n g u i s t i c o i n d o e u r o p e o d i o r i g i n e nord Valuta se l ’ affermazione è supportata d a l l e europea . p r o v e f o r n i t e . Le d e f i n i z i o n i d e i − R i s p o s t a : SUPPORTS termini chiave u t i l i z z a t i in questo c o m p i t o sono : − A f f e r m a z i o n e : I l t e n n i s non è uno s p o r t . − A f f e r m a z i o n e : Una d i c h i a r a z i o n e o − P r o v a : I l t e n n i s è p r a t i c a t o da m i l i o n i d i a s s e r z i o n e s o t t o esame . g i o c a t o r i a m a t o r i a l i ed è anche uno − P r o v a : I n f o r m a z i o n i che s u p p o r t a n o o s p o r t popolare a l i v e l l o mondiale . contraddicono l ’ affermazione . − R i s p o s t a : REFUTES − Documento : i n d i c a l a f o n t e da c u i è s t a t a e s t r a t t a l a prova . − A f f e r m a z i o n e : Kick − Ass è un f i l m h o r r o r . − P r o v a : Kick − Ass è un f i l m b r i t a n n i c o − R i s p o n d i con uno d e i s e g u e n t i g i u d i z i b a s a t i americano d e l 2010 b as ato s u l fumetto s u l l e prove f o r n i t e : omonimo d i Mark M i l l a r e John Romita J r . − SUPPORTS : s e l e p r o v e co nf erm ano l ’ − R i s p o s t a : NOT ENOUGH INFO affermazione . − REFUTES : s e l e p r o v e c o n t r a d d i c o n o direttamente l ’ affermazione . 
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.

### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
- Affermazione: I popoli germanici sono chiamati anche gotici.
- Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
- Documento: Popoli germanici
- Risposta: SUPPORTS

- Affermazione: Il tennis non è uno sport.
- Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
- Documento: Tennis
- Risposta: REFUTES

- Affermazione: Kick-Ass è un film horror.
- Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
- Documento: Kick-Ass (film)
- Risposta: NOT ENOUGH INFO

### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
- Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]
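The appendix reports the prompt templates but not the code that instantiates them. As a minimal sketch (the function names `build_prompt` and `parse_answer` are ours, not from the paper), the following shows how the 0-shot English template could be filled from a claim, its evidence, and an optional document title, and how a model's free-form continuation could be mapped back onto the three labels.

```python
from typing import Optional

# Instruction block of the 0-shot English prompt (Appendix A.1.1).
INSTRUCTION_EN = (
    "### Instruction\n"
    "Evaluate if the claim is supported by the evidence provided. "
    "Definitions for key terms used in this task are:\n"
    "- Claim: A statement or assertion under examination.\n"
    "- Evidence: Information that either supports or opposes the claim.\n\n"
    "Answer with one of the following judgments based on the evidence provided:\n"
    "- SUPPORTS: if the evidence substantiates the claim.\n"
    "- REFUTES: if the evidence directly contradicts the claim.\n"
    "- NOT ENOUGH INFO: if there is insufficient evidence to determine "
    "the claim's validity.\n"
)

def build_prompt(claim: str, evidence: str,
                 document: Optional[str] = None) -> str:
    """Fill the [CLAIM HERE]/[EVIDENCE HERE] slots of the template.

    The document title is appended only for the 'with Document Title'
    variants (A.1.3/A.1.4), mirroring the appendix layout.
    """
    lines = [INSTRUCTION_EN, "### Input",
             f"- Claim: {claim}", f"- Evidence: {evidence}"]
    if document is not None:
        lines.append(f"- Document: {document}")
    lines.append("### Answer:")
    return "\n".join(lines)

LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")

def parse_answer(generated: str) -> str:
    """Map the model's free-form answer onto one of the three labels.

    A simple substring match; outputs naming no label fall back to
    NOT ENOUGH INFO (a conservative choice, not prescribed by the paper).
    """
    text = generated.upper()
    for label in LABELS:
        if label in text:
            return label
    return "NOT ENOUGH INFO"
```

In practice the generated continuation may repeat parts of the instruction, so a real parser would only inspect the text after the final "### Answer:" marker; the sketch keeps the matching deliberately simple.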