<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Large Language Models for Fact Verification in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Scaiella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Costanzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Passone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Gambosi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Reveal s.r.l</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, Automatic Fact Checking has become a crucial tool for combating fake news by leveraging AI to verify the accuracy of information. Despite significant advancements, most datasets and models are predominantly available in English, posing challenges for other languages. This paper presents an Italian resource based on the dataset made available in the FEVER evaluation campaign, created to train and evaluate fact-checking models in Italian. The dataset comprises approximately 240k examples, with over 2k test examples manually validated. Additionally, we fine-tuned a state-of-the-art LLM, namely LLaMA3, on both the original English and translated Italian datasets, demonstrating that fine-tuning significantly improves model performance. Our results suggest that the fine-tuned models achieve comparable accuracy in both languages, highlighting the value of the proposed resource.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Fact Checking</kwd>
        <kwd>Fact Checking in Italian</kwd>
        <kwd>Resource in Italian</kwd>
        <kwd>Large Language Model for Fact Verification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, Automatic Fact Checking (AFC) has assumed a significant role as an instrument to identify fake news. AFC is a process that verifies the truthfulness and accuracy of information, claims, and data contained in a text or speech. The focus is on debunking disinformation and misinformation, intercepting errors, and verifying sources and facts.</p>
      <p>Automated fact-checking uses AI tools to identify, verify, and respond to misleading claims, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the truthfulness of claims [1]. This is a complex process that involves searching, interpreting, and assessing information. As discussed in [1], an NLP framework for automated fact-checking consists of three stages: claim detection to identify claims that require verification; evidence retrieval to find sources supporting or refuting the claim; and claim verification to assess the truthfulness of the claim based on the retrieved evidence.</p>
      <p>At first, automating the fact-checking process was discussed in the context of computational journalism in works like [2], and it has received significant attention in the computational linguistics and, more generally, the artificial intelligence communities, surveyed in [1] and more recently in [3] and [4]. In particular, in [1] the authors expose a survey on the topic, describing the early developments that were surveyed in [5], which is an exhaustive overview of the subject.</p>
      <p>As with most machine learning paradigms [1], state-of-the-art methods require datasets and benchmarks. One of the most impactful campaigns for collecting a large-scale benchmark is FEVER (Fact Extraction and VERification) [6]. In this context, fact-checking involves verifying whether a claim is supported by one or more pieces of evidence. FEVER is a publicly available dataset designed for claim verification against textual sources. It comprises about 180K claims generated by altering sentences extracted from Wikipedia. The claims are classified into three categories: Supported (a piece of evidence exists and it supports the claim), Refutes (a piece of evidence exists and it contradicts the claim), or NotEnoughInfo (there is insufficient evidence to verify the claim). The challenge, therefore, is to retrieve the relevant evidence and verify the accuracy of the claims, categorizing them with the correct label.</p>
      <p>Many works like FEVER have recently focused on building data and datasets for the task of Fact Verification, achieving very good results [7, 8, 9, 10, 11, 12]. However, all of these datasets are designed for the English language.</p>
      <p>Although multilingual models exist (e.g., in [13, 14]), fine-tuning a model on a specific language, pre-training it for a specific task and use case, could lead to a significant decline in quality if applied to another language. Few studies have worked on training models for languages other than English. An example is the work presented in [15], which focuses on developing automated claim detection for Dutch-language fact-checkers.</p>
      <p>In this work, we propose the FEVER-IT dataset, in which the FEVER dataset has been translated into Italian to train models for the Italian language. Inspired by SQuAD-IT [16] and MSCOCO-IT [17], we worked to obtain quality data. Although the training set may be affected by translation errors, the test set is not, as it is composed of manually validated data. Furthermore, while the original FEVER dataset contained evidence only for Supports and Refutes, in this work we have also added and translated examples for the NotEnoughInfo category using the heuristics proposed in [18]. This work extends the experience described in [19], where translations were done using the Google API, by using publicly available models ([20]) and adding data for the NotEnoughInfo category.</p>
      <p>The contribution of this work is twofold. Firstly, we release FEVER-IT, a corpus with 228K claims, each associated with at least one (possibly useful) piece of evidence, including a test set of 2,000 manually validated claims. In addition, we fine-tuned and validated a state-of-the-art model, LLaMA3 [14], on both the original English dataset and the Italian dataset. While this provides a high-performance model ready for the task in both languages, the primary goal is to assess whether the quality of the Italian data is comparable to the English one. By training the model separately on each dataset, we can evaluate its stability: if the model performs similarly on the manually validated Italian dataset and the English test set, we can conclude that the quality of the Italian data is on par with the English data.</p>
      <p>Additionally, we want to assess whether using an Italian training dataset, despite the noise from automatic translation, is truly beneficial. LLMs like LLaMA3 can already perform tasks in other languages through zero-shot or few-shot learning, without requiring fine-tuning on a specific dataset, especially if that dataset is noisy. Therefore, we aim to compare the performance on the test set between a LLaMA3 model that has not been fine-tuned on the noisy Italian data and one that has been fine-tuned, to determine whether fine-tuning actually improves results or if the model performs on par or better without it.</p>
      <p>The experimental results show that the model without fine-tuning achieves an average accuracy of only about 45%. Fine-tuning on the English dataset yields about 90% mean accuracy, while fine-tuning on the Italian dataset results in a percentage quite similar to the fine-tuned English model and much greater than testing without fine-tuning¹.</p>
      <p>The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents FEVER-IT, Section 4 details the experimental evaluation, and Section 5 provides the conclusions.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04 – 06, 2024, Pisa, Italy. *Corresponding author. scaiella@revealsrl.it (A. Scaiella); stefano.costanzo@students.uniroma2.eu (S. Costanzo); passone@ing.uniroma2.it (E. Passone); croce@info.uniroma2.it (D. Croce); giorgio.gambosi@uniroma2.it (G. Gambosi). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>¹The resource, fine-tuned models, and code will be released on a dedicated repository: https://github.com/crux82/FEVER-it</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>One of the pioneering works in autonomous fact-checking was conducted by [21], which proposed creating publicly available datasets and developing automated systems using natural language processing technologies. Recent challenges such as CheckThat! at CLEF [10, 11, 12] and FEVER [7, 8, 9] from 2018 have advanced fact-checking tasks by leveraging advanced approaches and integrating Large Language Models (LLMs) like BERT and GPT. These models represent the current state of the art in many Natural Language Processing tasks, including fact-checking. Notable examples of such technology include FacTeR-Check [22], a multilingual architecture for semi-automated fact-checking and hoax propagation analysis using the XLM-RoBERTa Transformer [13], and FACT-GPT [23], a framework that automates the claim-matching phase of fact-checking using LLMs to identify social media content that supports or contradicts claims previously debunked by fact-checkers.</p>
      <p>The success of these systems is largely due to the capabilities of LLMs, as summarized in [3], which are neural models based on the Transformer architecture. Specifically, decoder-based architectures, such as GPT [24], GPT-3 [25], and LLaMA [14], generate output sequences in an auto-regressive manner. These models have demonstrated impressive capabilities following pre-training on large collections of documents. One notable outcome is few-shot learning, where models can adapt to new tasks with only a few examples [25], greatly enhancing their flexibility and applicability.</p>
      <p>When new annotated data is available, fine-tuning further enhances a model's capabilities. This process involves taking the pre-trained base model and training it on a smaller, specialized dataset relevant to the desired task. Parameter-Efficient Fine-Tuning (PEFT) is an optimized technique that involves training only a small portion of the weights, typically by adding a new layer to the model. One widely used technique is LoRA [26], which adds an adapter consisting of two matrices of weights that are relatively small compared to the original model. ExtremITA [27] is an example of a decoder-based model fine-tuned with LoRA in Italian for multi-task execution.</p>
      <p>Several benchmark datasets have been developed to fine-tune and evaluate fact-checking systems, typically collected by organizations like Snopes, FullFact, and PolitiFact. The FEVER challenge has produced four major datasets: FEVER (2018) [6], FEVER 2.0 (2019) [8], FEVEROUS (2021) [9], and AVeriTeC (2024) [28]. These datasets range from labeled claim-evidence associations to verified claims with structured and unstructured evidence. Despite the wealth of resources available, there is a lack of large benchmark datasets in Italian. This work addresses this gap by providing a large-scale Italian resource.</p>
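      <p>The LoRA idea described above, two small low-rank matrices trained beside a frozen weight matrix, can be made concrete with a short sketch. The code and dimensions below are illustrative assumptions, not taken from the paper:</p>

```python
# Illustrative sketch of LoRA: the frozen d_out x d_in weight matrix W is
# left untouched, and only an update Delta-W = (alpha / r) * B @ A is
# learned, where B is d_out x r and A is r x d_in. The adapter therefore
# trains r * (d_out + d_in) parameters instead of d_out * d_in.

def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA adapter params) for one matrix."""
    return d_out * d_in, r * (d_out + d_in)

def lora_delta(B, A, alpha: float, r: int):
    """Compute (alpha / r) * B @ A for plain list-of-lists matrices (len(A) == r)."""
    scale = alpha / r
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(A[0]))] for i in range(len(B))]

# Example: a hypothetical 4096 x 4096 projection with rank r = 8.
full, adapter = lora_param_counts(4096, 4096, 8)
```

      <p>In this hypothetical setting the adapter holds 65,536 trainable weights against 16,777,216 for the full matrix, i.e. roughly 0.4%, which is why LoRA makes fine-tuning an 8B-parameter model tractable on a single GPU.</p>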
    </sec>
    <sec id="sec-3">
      <title>3. Fact Verification in Italian</title>
      <p>As in [6], the original FEVER dataset is composed of English claims that can potentially be verified against an encyclopedic resource, in this case, Wikipedia. The claims are classified into three categories: Supported, Refutes and NotEnoughInfo. For the first two categories, each claim is associated with one or more passages from Wikipedia, each specifying the page from which it was extracted. For the NotEnoughInfo category, no passages are provided because no information was found on Wikipedia to support or refute the claim. For instance, the sentence "Dan Brown is illiterate." is a claim associated with pieces of evidence such as: "Angels and Demons is a 2000 best-selling mystery-thriller novel written by American author Dan Brown and published by Pocket Books and then by Corgi Books.". These pieces of evidence prove that the claim is incorrect, so it can be classified with the label Refutes. In FEVER, a claim is thus a sentence that expresses information (true or mutated) about a target entity.</p>
      <p>To generate the Italian dataset, we started from the dataset version² proposed in [29], which consists of 260k claims. This version extends the original FEVER by adding evidence associated with claims justified as NotEnoughInfo in FEVER, using the heuristics in [18]. The approach involved using a search engine to retrieve potential evidence and a textual entailment system based on GPT [24]. Claims not judged as Supports or Refutes were classified as NotEnoughInfo.</p>
      <p>This gives us examples of sentences that are closely related to the claim (according to the search engine) but neither support nor refute it. This makes it more straightforward and efficient to train and/or evaluate a classifier, even though some of the derived examples might be somewhat noisy, as they were generated through heuristics.</p>
      <p>For the automatic translation process, we utilized MADLAD400 [20], a machine translation system based on the Transformer architecture³, trained on MADLAD, a manually audited, general-domain 3T-token multilingual dataset based on CommonCrawl, spanning 419 languages. Since the Italian data are obtained through machine translation, and thus potentially incorrect as suggested in [16, 17], we needed validated test data to obtain a realistic benchmark. Our hypothesis is that an LLM is robust enough to generalize from the 228k examples and recognize the relationships involved in FEVER without inheriting translation errors. However, to prevent these errors from being inherited by the model, we manually corrected the translations of the test set.</p>
      <p>Out of the approximately 16k available test examples, three annotators were involved in verifying and correcting 2,063 translations from the test set. The annotators focused on correcting mistakes related to the proper sentence structure in Italian, the accurate meaning of specific English words that MADLAD had translated literally, any misunderstandings of the intended meaning in Italian, and a few grammatical errors.</p>
      <p>In some cases, translation errors do not completely undermine the examples with respect to the task's purpose. For instance, the English sentence from a piece of evidence, "he was booked to win a third world championship at a WWE event on the night of his death", was translated into Italian as "era stato prenotato per vincere un terzo titolo mondiale in un evento della WWE la notte della sua morte". A more accurate translation would be "si pensava avrebbe vinto un terzo titolo mondiale in un evento della WWE la notte della sua morte", better capturing the verb's meaning. In other, more problematic cases, translation errors, loss of information, or the introduction of hallucinations could even change the classification in the fact verification task. For example, in the claim "The Thin Red Line (1998 film) has an all-British cast.", the automatic translation was "La sottile linea rossa (The Thin Red Line) è un film del 1998.", which is incorrect because it omits the information about the cast. This detail is crucial, as its absence could lead to incorrect labeling.</p>
      <p>A quantitative analysis of the translation quality suggests that MADLAD performs well in translating simple assertive sentences such as claims. In fact, 91% of the claims were not altered by the validators, who considered them completely correct. This percentage is lower for the Wikipedia passages, dropping to 76%. This discrepancy may be due to the greater complexity of the evidence compared to the simpler sentence structures in the claims. Additionally, we report the results in terms of BLEU score [30] for the corrected translations compared to the originals, as shown in Table 1. It should be noted that measuring the translation quality after correcting the sentences introduces a strong bias in the measurements; however, it provides a more specific idea of the translation quality, especially in understanding the potential noisiness of the training and development sentences. In this case, results of over 95% for BLEU-1 and over 92% for BLEU-4 suggest that very few terms were altered during validation, and even the grammatical patterns remained largely unchanged. At most, a few mistranslated terms needed updating, as indicated by the qualitative analysis.</p>
      <p>Table 1: BLEU scores of the manually validated (gold) Claims and Evidence with respect to the automatic translation (silver).
Metric    BLEU-1  BLEU-2  BLEU-3  BLEU-4
Claim     0.9776  0.9695  0.9623  0.9544
Evidence  0.9529  0.9411  0.9309  0.9207</p>
      <p>Table 2 summarizes the number of examples created for the Italian dataset. In line with the original English material, the dataset is divided into training, development, and test sets, with claims categorized into Supports, Refutes, and NotEnoughInfo (NEI). The table also distinguishes between silver data (automatically translated) and gold data (manually validated). The training set consists of 228,277 claims, the development set contains 15,935 claims, and the test set has 2,063 claims. Each Italian claim or piece of evidence is aligned with its English counterpart, facilitating future research in cross-lingual fact verification.</p>
      <p>Table 2: Number of claims and pieces of evidence in the Italian dataset. (S) indicates silver data (automatically translated), and (G) indicates gold data (manually validated).
          Train (S)  Dev (S)  Test (G)    Total
Supports    114,801    4,638       654  120,095
Refutes      47,096    4,887       643   52,626
NEI          66,380    6,410       766   73,556
Total       228,277   15,935     2,063  246,275</p>
      <p>Language Models for Fact Verification. For addressing the capabilities of Large Language Models in Fact Verification, they can be utilized through In-Context Learning techniques [31] or by directly fine-tuning the model for specific downstream tasks. In-context learning relies on the model's pre-existing knowledge acquired during pre-training and on instructions provided in natural language at inference time. This method does not involve additional training and can be categorized based on the number of examples provided: i) 0-shot learning, where no examples are given, and the model generates responses based solely on its pre-existing knowledge and the provided instructions; ii) 1-shot learning, where one example per class is added to provide a more precise context, helping the model better understand the task by offering a concrete reference point; iii) few-shot learning, where more than one example per class is provided to give the model additional contextual information during decision-making. When the model's pre-existing knowledge is insufficient, we can fine-tune it on the downstream task. Fine-tuning involves training the model in a traditional manner using input-output pairs (training data) to adjust its parameters. This process improves the model's performance on specific tasks, allowing it to learn from a more extensive set of examples. As a result, the model becomes more adept at handling similar queries in the future, with a focus on the specific task at hand. We thus evaluated the application of a state-of-the-art LLM, namely LLaMA3 [32], by providing just the definition of the task (zero-shot), adding an example (one-shot), or performing fine-tuning, to demonstrate the necessity of a training dataset like the one constructed in this work, as discussed in the following section.</p>
      <p>²https://huggingface.co/datasets/copenlu/fever_gold_evidence</p>
      <p>³https://github.com/google-research/google-research/tree/master/madlad_400</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>The goal of our experimentation is to assess the performance of a state-of-the-art LLM applied to Fact Verification. Specifically, we aim to determine whether a multilingual model maintains consistent quality when applied to both the English FEVER dataset and our Italian dataset. We utilize LLaMA3-Instruct⁴, an instruction-tuned generative text model from Meta with 8 billion parameters, released in April 2024. This model is trained to execute specific instructions or prompts across various tasks. To ensure alignment, we evaluate the systems on the manually validated Italian test set and on the same subset of 2,063 claims in the English counterpart. The model is evaluated in 0-shot and 1-shot settings to assess its capability without fine-tuning. The prompts used in English and Italian are provided in Appendix A. Additionally, we fine-tuned LLaMA3 on the English dataset from [29] and, separately, on the Italian dataset obtained via machine translation. Fine-tuning was conducted on an NVIDIA A100 using the LoRA technique⁵.</p>
      <p>In FEVER, the title of the document associated with each claim often provides crucial context. For example, the claim "The University of Leicester discovered and identified the remains of a king." relies on the document titled "University of Leicester" to correctly classify the claim as Supports. To assess the model's generalization, we evaluate the impact of including document titles in the prompts. The metrics used to analyze the results are recall, precision, accuracy, and F1 score, calculated globally and for each label (Supports, Refutes, NotEnoughInfo).</p>
      <p>The results are reported in Tables 3 and 4 for the English and Italian datasets, respectively. Each table shows whether the model underwent fine-tuning (column FT), whether a prompt without examples (0-shot) or with one example per class (1-shot) was used (column Prompt), and whether the document title was included (column Doc). Notably, if no fine-tuning was performed, the original LLaMA3-Instruct model was used. Given that the system's response can consist of multiple words, we search the output for the mention of one of the classes and associate the example with that class. If no class is identified, the result is classified as NotEnoughInfo. In general, the fine-tuned model is extremely stable, consistently outputting one of the three categories for every request. The non-fine-tuned model, on rare occasions (just a few dozen times out of 2,000), produces responses that do not correspond to any of the required classes. This highlights the inherent stability of LLaMA3 while also supporting the soundness of the results achieved.</p>
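      <p>The output-mapping rule described in this section, scanning the generated text for a mention of one of the three classes and falling back to NotEnoughInfo, can be sketched as follows. This is an illustrative reconstruction, not the authors' released code:</p>

```python
# Sketch of the normalization step: the model's free-text response is
# searched for one of the three class names; any response that mentions
# none of them is mapped to NotEnoughInfo, mirroring the fallback
# described for the evaluation.

LABELS = ("Supports", "Refutes", "NotEnoughInfo")

def extract_label(generated: str) -> str:
    """Map a multi-word model response to one of the three FEVER classes."""
    compact = generated.lower().replace(" ", "")
    for label in LABELS:
        if label.lower() in compact:
            return label
    return "NotEnoughInfo"  # no class mentioned: treat as NotEnoughInfo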
      <p>⁴https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct</p>
      <p>⁵The following hyperparameters were used: a learning rate of 0.0001, two epochs, LoRA_R set to 8, LoRA_alpha set to 16, and LoRA_dropout at 0.05. The micro-batch size was 2, and gradient accumulation steps were set to 8.</p>
      <p>Table 3: Results on the English test set for each combination of FT (No/Yes), Prompt (0-shot/1-shot), and Doc (No/Yes); Table 4 reports the same metrics on the Italian test set. Only the NotEnoughInfo precision/recall/F1 columns are legible in this copy: 0.585/0.050/0.092; 0.800/0.005/0.010; 0.478/0.043/0.079; 0.698/0.115/0.197; 0.877/0.903/0.890; 0.887/0.910/0.898; 0.881/0.894/0.887; 0.883/0.915/0.899; together with a further precision column: 0.534, 0.617, 0.508, 0.578, 0.899, 0.903, 0.897, 0.907.</p>
      <p>A key finding is that the multilingual model generally achieves similar, though modest, results on the English and Italian datasets without fine-tuning, with accuracy values around 0.40-0.50 and average F1 scores in the range of 0.35-0.55. This performance is relatively unstable, and the addition of an example in the prompt does not lead to significant improvements. In English, there are some improvements, but in Italian, there are fewer. We believe this is because, although LLaMA is multilingual, the percentage of Italian examples observed during training is less than 1%, making it less performant and less stable in this language.</p>
      <p>However, when fine-tuning is applied, the results improve dramatically, with accuracy exceeding 90% in both languages. This demonstrates the utility of the translated dataset, even if it contains some noise. In this scenario, adding an example in the prompt leads to negligible but consistent improvements. Additionally, the inclusion of the document title, while sometimes causing inconsistencies in zero-shot learning, is better utilized by the fine-tuned model, leading to slight but not significant improvements. This is interesting because it suggests that a model not relying on document titles is more broadly applicable. Overall, the fine-tuned models perform significantly better, highlighting the importance of the translated dataset for achieving high accuracy in fact verification tasks in both English and Italian.</p>
      <p>The error analysis suggests that the model sometimes inherits the mathematical reasoning limitations of the LLM. For example, the claim "Il Castello di Praga attira oltre 18 milioni di visitatori ogni anno."⁶ was given the evidence "Il castello è tra le attrazioni turistiche più visitate di Praga che attira oltre 1,8 milioni di visitatori all'anno."⁷ The model's predicted label was Refutes, while the true label was Supports. Here, the true label should be Supports since 18 million is indeed greater than 1.8 million, but the model found the numbers inconsistent. In another case, the claim "Ned Stark è stato introdotto nel 1996 in Tempesta di spade."⁸ was paired with the evidence "Introdotto nel 1996 in Il Trono di Spade, Ned è l'onorevole signore di Winterfell, un'antica fortezza nel nord del continente immaginario di Westeros."⁹ The model predicted Refutes, although the true label was Supports. The confusion here is due to the difference in the book titles, which are from the same series but are distinct works. The error analysis revealed that the model occasionally struggled with mathematical reasoning and contextual understanding, highlighting areas for future enhancement. Larger models and further fine-tuning could potentially address these issues, which remain open questions for future research.</p>
      <p>⁶In English: "The Prague Castle attracts over 18 million visitors every year." ⁷In English: "The castle is among the most visited tourist attractions in Prague, attracting over 1.8 million visitors every year." ⁸In English: "Ned Stark was introduced in 1996 in A Storm of Swords." ⁹In English: "Introduced in 1996 in A Game of Thrones, Ned is the honorable lord of Winterfell, an ancient fortress in the north of the imaginary continent of Westeros."</p>
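      <p>The per-label metrics reported in Tables 3 and 4 (precision, recall, and F1 per class, plus global accuracy) can be computed as in the following sketch; it is illustrative and not the paper's evaluation code:</p>

```python
# Illustrative computation of accuracy plus per-label precision, recall,
# and F1 for the three-way FEVER classification, from aligned lists of
# gold and predicted labels.

LABELS = ("Supports", "Refutes", "NotEnoughInfo")

def evaluate(gold: list[str], pred: list[str]) -> dict:
    """Return global accuracy and per-label P/R/F1 for the three classes."""
    assert len(gold) == len(pred)
    report = {"accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold)}
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"P": precision, "R": recall, "F1": f1}
    return report
```

      <p>Averaging the three per-label F1 scores gives the macro-F1 figures discussed above, which weight the smaller NotEnoughInfo test partition equally with the other two classes.</p>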
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The team would like to thank Monika Kakol for her invaluable support in the validation of the translations. This work was supported by Project ECS 0000024 Rome Technopole - CUP B83C22002820006, NRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>In this work, we have introduced FEVER-IT, an Italian
version of the FEVER dataset, designed to improve the
training and evaluation of models for fact verification in
the Italian language. Using a machine translation system,
we translated a large-scale dataset of 228,000
claim/evidence pairs and manually validated 2,000
test instances to ensure meaningful evaluations. This
enabled us to fine-tune a state-of-the-art LLM, specifically
LLaMA3, and assess its performance in both English and
Italian.</p>
      <p>Our experiments demonstrated that the multilingual
model, without fine-tuning, performed similarly on both
English and Italian datasets, though the accuracy and
stability were limited. Fine-tuning significantly improved
the model’s performance, achieving over 90% accuracy
in both languages. This underscores the importance and
effectiveness of the translated dataset, even if it contains
some noise.</p>
      <p>Future work will explore the performance of larger
models and further refinement of the dataset to enhance
accuracy and generalization capabilities or explore more
complex settings such as those described in [9].</p>
      <p>References</p>
      <p>… the covid-19 infodemic and fake news detection, in: Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[12] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[14] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[15] B. Berendt, P. Burger, R. Hautekiet, J. Jagers, A. Pleijter, P. Van Aelst, FactRank: Developing automated claim detection for Dutch-language fact-checkers, Online Social Networks and Media 22 (2021) 100113. doi:10.1016/j.osnem.2020.100113.
[16] D. Croce, A. Zelenanska, R. Basili, Enabling deep learning for large scale question answering in Italian, Intelligenza Artificiale 13 (2019) 49–61. URL: https://doi.org/10.3233/IA-190018. doi:10.3233/IA-190018.
[17] A. Scaiella, D. Croce, R. Basili, Large scale datasets for image and video captioning in Italian, Italian Journal of Computational Linguistics 2 (2019) 49–60. URL: http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf.
[18] C. Malon, Team Papelo: Transformer networks at FEVER, in: J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal (Eds.), Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 109–113. URL: https://aclanthology.org/W18-5517. doi:10.18653/v1/W18-5517.
[19] L. Canale, A. Messina, Experimenting ai tech-
… Associates, Inc., 2023, pp. 67284–67296.
[21] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: C. Danescu-Niculescu-Mizil, J. Eisenstein, K. McKeown, N. A. Smith (Eds.), Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, 2014, pp. 18–22. URL: https://aclanthology.org/W14-2508. doi:10.3115/v1/W14-2508.
[22] A. Martín, J. Huertas-Tato, Á. Huertas-García, G. Villar-Rodríguez, D. Camacho, FacTeR-Check: Semi-automated fact-checking through semantic similarity and natural language inference, Knowledge-Based Systems 251 (2022) 109265. doi:10.1016/j.knosys.2022.109265.
[23] E. C. Choi, E. Ferrara, Automated claim matching with large language models: Empowering fact-checkers in the fight against misinformation, in: Companion Proceedings of the ACM on Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1441–1449. URL: https://doi.org/10.1145/3589335.3651910. doi:10.1145/3589335.3651910.
[24] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, CoRR abs/1801.06146 (2018). URL: http://arxiv.org/abs/1801.06146. arXiv:1801.06146.
[25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/
nologies for disinformation combat: the idmo 2106.09685. arXiv:2106.09685.
project, 2023. URL: https://arxiv.org/abs/2310.11097. [27] C. D. Hromei, D. Croce, V. Basile, R. Basili,
ExtremarXiv:2310.11097. ita at EVALITA 2023: Multi-task sustainable scaling
[20] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, to large language models at its extreme, in:
ProD. Xin, A. Kusupati, R. Stella, A. Bapna, O. Firat, ceedings of the Eighth Evaluation Campaign of
NatMadlad-400: A multilingual and document-level ural Language Processing and Speech Tools for
Itallarge audited dataset, in: Advances in Neural In- ian. Final Workshop (EVALITA 2023), Parma, Italy,
formation Processing Systems, volume 36, Curran September 7th-8th, 2023, volume 3473 of CEUR
</p>
      <p>A. Prompting Engineering</p>
      <p>This appendix contains the prompts used in the experiments. The prompts are provided in both Italian and English, reflecting the task-specific nature of the experiments. Each prompt begins with an explanation of the task and the meaning of the classes. In the different variants, the 0-shot setting does not include any examples, unlike the 1-shot setting. Where necessary, the name of the document from which the evidence is taken is also specified.</p>
      <p>A.1. Prompts in English</p>
      <p>A.1.1. 0-shot Setting</p>
      <p>The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.</p>
      <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]</p>
      <p>A.1.2. 1-shot Setting</p>
      <p>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Notice that only the evidence is reported, without the title of the original document.</p>
      <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Examples
These examples demonstrate how to apply the evaluation criteria:
− Claim: The Germanic peoples are also called Gothic.
− Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
− Answer: SUPPORTS
− Claim: Tennis is not a sport.
− Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
− Answer: REFUTES
− Claim: Kick-Ass is a horror film.
− Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
− Answer: NOT ENOUGH INFO
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]</p>
      <p>
A.1.3. 0-shot Setting with Document Title</p>
      <sec id="sec-6-1">
        <title>The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.</title>
        <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
− Document: denotes the source document for the evidence.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
− Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]</p>
        <p>
A.1.4. 1-shot Setting with Document Title</p>
        <p>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Each input evidence is provided with the title of its original document.</p>
        <p>### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
− Claim: A statement or assertion under examination.
− Evidence: Information that either supports or opposes the claim.
− Document: denotes the source document for the evidence.
Answer with one of the following judgments based on the evidence provided:
− SUPPORTS: if the evidence substantiates the claim.
− REFUTES: if the evidence directly contradicts the claim.
− NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Examples
These examples demonstrate how to apply the evaluation criteria:
− Claim: The Germanic peoples are also called Gothic.
− Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
− Document: Germanic peoples
− Answer: SUPPORTS
− Claim: Tennis is not a sport.
− Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
− Document: Tennis
− Answer: REFUTES
− Claim: Kick-Ass is a horror film.
− Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
− Document: Kick-Ass (film)
− Answer: NOT ENOUGH INFO
### Input
− Claim: [CLAIM HERE]
− Evidence: [EVIDENCE HERE]
− Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]</p>
        <p>A.2. Prompts in Italian</p>
        <p>A.2.1. 0-shot Setting</p>
        <p>The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.</p>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]
A.2.2. 1-shot Setting</p>
      </sec>
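<p>As a concrete illustration (not code from the paper), the English templates above can be filled programmatically before being passed to the model; the template excerpt mirrors the 0-shot prompt, while <monospace>build_prompt</monospace> is a hypothetical helper name introduced only for demonstration.</p>
<p>
```python
# Illustrative sketch: filling the English 0-shot template with a
# claim/evidence pair. The template text is abridged from the prompt
# above; build_prompt is a hypothetical helper, not from the paper.
TEMPLATE = """### Instruction
Evaluate if the claim is supported by the evidence provided.
### Input
- Claim: {claim}
- Evidence: {evidence}
### Answer:"""

def build_prompt(claim: str, evidence: str) -> str:
    # Substitute the two input fields into the fixed template.
    return TEMPLATE.format(claim=claim, evidence=evidence)

prompt = build_prompt("Tennis is not a sport.",
                      "Tennis is played by millions of recreational players.")
```
</p>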
      <sec id="sec-6-2">
        <title>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Notice that only the evidence is reported, without the title of the original document.</title>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
− Affermazione: I popoli germanici sono chiamati anche gotici.
− Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
− Risposta: SUPPORTS
− Affermazione: Il tennis non è uno sport.
− Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
− Risposta: REFUTES
− Affermazione: Kick-Ass è un film horror.
− Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
− Risposta: NOT ENOUGH INFO
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]
A.2.3. 0-shot Setting with Document Title</p>
      </sec>
      <sec id="sec-6-3">
        <title>The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.</title>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
− Documento: indica la fonte da cui è stata estratta la prova.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
− Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]
A.2.4. 1-shot Setting with Document Title</p>
      </sec>
      <sec id="sec-6-4">
        <title>The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Each input evidence is provided with the title of its original document.</title>
        <p>### Istruzioni
Valuta se l’affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
− Affermazione: Una dichiarazione o asserzione sotto esame.
− Prova: Informazioni che supportano o contraddicono l’affermazione.
− Documento: indica la fonte da cui è stata estratta la prova.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
− SUPPORTS: se le prove confermano l’affermazione.
− REFUTES: se le prove contraddicono direttamente l’affermazione.
− NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell’affermazione.
### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
− Affermazione: I popoli germanici sono chiamati anche gotici.
− Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
− Documento: Popoli germanici
− Risposta: SUPPORTS
− Affermazione: Il tennis non è uno sport.
− Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
− Documento: Tennis
− Risposta: REFUTES
− Affermazione: Kick-Ass è un film horror.
− Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
− Documento: Kick-Ass (film)
− Risposta: NOT ENOUGH INFO
### Input
− Affermazione: [CLAIM HERE]
− Prova: [EVIDENCE HERE]
− Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]</p>
      </sec>
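<p>Since every prompt asks the model to answer with one of SUPPORTS, REFUTES, or NOT ENOUGH INFO, the free-text generation must be mapped back onto a label before evaluation. The snippet below is a minimal sketch of such post-processing, under the assumption that the label string appears verbatim somewhere in the generation; <monospace>parse_answer</monospace> is a hypothetical helper, not from the paper.</p>
<p>
```python
# Map a generated continuation onto one of the three FEVER-style labels.
# "NOT ENOUGH INFO" is checked first so that a generation containing it is
# never mistaken for another label; matching is case-insensitive since
# model outputs may vary in casing.
LABELS = ("NOT ENOUGH INFO", "SUPPORTS", "REFUTES")

def parse_answer(generation: str) -> str:
    text = generation.upper()
    for label in LABELS:
        if label in text:
            return label
    # Fall back to the "unverifiable" class when no label is produced.
    return "NOT ENOUGH INFO"
```
</p>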
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>