=Paper=
{{Paper
|id=Vol-3878/97_main_long
|storemode=property
|title=Leveraging Large Language Models for Fact Verification in Italian
|pdfUrl=https://ceur-ws.org/Vol-3878/97_main_long.pdf
|volume=Vol-3878
|authors=Antonio Scaiella,Stefano Costanzo,Elisa Passone,Danilo Croce,Giorgio Gambosi
|dblpUrl=https://dblp.org/rec/conf/clic-it/ScaiellaCPCG24
}}
==Leveraging Large Language Models for Fact Verification in Italian==
Antonio Scaiella¹·², Stefano Costanzo¹, Elisa Passone¹, Danilo Croce¹·* and Giorgio Gambosi¹

¹ Department of Enterprise Engineering, University of Rome Tor Vergata, Italy
² Reveal s.r.l.

Abstract

In recent years, Automatic Fact Checking has become a crucial tool for combating fake news by leveraging AI to verify the accuracy of information. Despite significant advancements, most datasets and models are predominantly available in English, posing challenges for other languages. This paper presents an Italian resource based on the dataset made available in the FEVER evaluation campaign, created to train and evaluate fact-checking models in Italian. The dataset comprises approximately 240k examples, with over 2k test examples manually validated. Additionally, we fine-tuned a state-of-the-art LLM, namely LLaMA3, on both the original English and translated Italian datasets, demonstrating that fine-tuning significantly improves model performance. Our results suggest that the fine-tuned models achieve comparable accuracy in both languages, highlighting the value of the proposed resource.

Keywords: Automatic Fact Checking, Fact Checking in Italian, Resource in Italian, Large Language Model for Fact Verification
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
scaiella@revealsrl.it (A. Scaiella); stefano.costanzo@students.uniroma2.eu (S. Costanzo); passone@ing.uniroma2.it (E. Passone); croce@info.uniroma2.it (D. Croce); giorgio.gambosi@uniroma2.it (G. Gambosi)
ORCID: 0000-0001-9111-1950 (D. Croce); 0000-0001-9979-6931 (G. Gambosi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

In recent years, Automatic Fact Checking (AFC) has assumed a significant role as an instrument to identify fake news. AFC is a process that verifies the truthfulness and accuracy of information, claims, and data contained in a text or speech. The focus is on debunking disinformation and misinformation, intercepting errors, and verifying sources and facts.

Automated fact-checking uses AI tools to identify, verify, and respond to misleading claims, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the truthfulness of claims [1]. This is a complex process that involves searching, interpreting, and assessing information. As discussed in [1], an NLP framework for automated fact-checking consists of three stages: claim detection, to identify claims that require verification; evidence retrieval, to find sources supporting or refuting the claim; and claim verification, to assess the truthfulness of the claim based on the retrieved evidence.

Automating the fact-checking process was first discussed in the context of computational journalism, in works like [2], and has since received significant attention in the computational linguistics and, more broadly, artificial intelligence communities; it is surveyed in [1] and, more recently, in [3] and [4]. In particular, the survey in [1] describes the early developments that were covered in [5], an exhaustive overview of the subject.

As with most machine learning paradigms [1], state-of-the-art methods require datasets and benchmarks. One of the most impactful campaigns for collecting a large-scale benchmark is FEVER (Fact Extraction and VERification) [6]. In this context, fact-checking involves verifying whether a claim is supported by one or more pieces of evidence. FEVER is a publicly available dataset designed for claim verification against textual sources. It comprises about 180K claims generated by altering sentences extracted from Wikipedia. The claims are classified into three categories: Supported (a piece of evidence exists and it supports the claim), Refutes (a piece of evidence exists and it contradicts the claim), or NotEnoughInfo (there is insufficient evidence to verify the claim). The challenge, therefore, is to retrieve the relevant evidence and verify the accuracy of the claims, categorizing them with the correct label.

Many works like FEVER have recently focused on building datasets for the task of Fact Verification, achieving very good results [7, 8, 9, 10, 11, 12]. However, all of these datasets are designed for the English language. Although multilingual models exist (e.g., [13, 14]), fine-tuning a model on a specific language, pre-training it for a specific task and use case, could lead to a significant decline in quality if applied to another language. Few studies have worked on training models for languages other than English. An example is the work presented in [15], which focuses on developing automated claim detection for Dutch-language fact-checkers.

In this work, we propose the FEVER-IT dataset, in which the FEVER dataset has been translated into Italian to train models for the Italian language. Inspired by SQuAD-IT [16] and MSCOCO-IT [17], we worked to obtain quality data. Although the training set may be affected by translation errors, the test set is not, as it is composed of manually validated data. Furthermore, while the original FEVER dataset contained evidence only for Supports and Refutes, in this work we have also added and translated examples for the NotEnoughInfo category using the heuristics proposed in [18]. This work extends the experience described in [19], where translations were done using the Google API, by using publicly available models ([20]) and adding data for the NotEnoughInfo category.

The contribution of this work is twofold. Firstly, we release FEVER-IT, a corpus with 228K claims, each associated with at least one (possibly useful) piece of evidence, including a test set of 2,000 manually validated claims. In addition, we fine-tuned and validated a state-of-the-art model, LLaMA3 [14], on both the original English dataset and the Italian dataset. While this provides a high-performance model ready for the task in both languages, the primary goal is to assess whether the quality of the Italian data is comparable to the English one. By training the model separately on each dataset, we can evaluate its stability: if the model performs similarly on the manually validated Italian test set and the English test set, we can conclude that the quality of the Italian data is on par with the English data.

Additionally, we want to assess whether using an Italian training dataset, despite the noise from automatic translation, is truly beneficial. LLMs like LLaMA3 can already perform tasks in other languages through zero-shot or few-shot learning, without requiring fine-tuning on a specific dataset, especially if that dataset is noisy. Therefore, we compare the performance on the test set of a LLaMA3 model that has not been fine-tuned on the noisy Italian data against one that has, to determine whether fine-tuning actually improves results or whether the model performs on par or better without it.

The experimental results show that the model without fine-tuning achieves an average accuracy of only about 45%. Fine-tuning on the English dataset yields about 90% mean accuracy, while fine-tuning on the Italian dataset yields accuracy quite similar to the fine-tuned English model and much greater than testing without fine-tuning¹.

¹ The resource, fine-tuned models, and code will be released on a dedicated repository: https://github.com/crux82/FEVER-it

The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents FEVER-IT, Section 4 details the experimental evaluation, and Section 5 provides the conclusions.

2. Related Work

One of the pioneering works in automated fact-checking was conducted by [21], which proposed creating publicly available datasets and developing automated systems using natural language processing technologies. Recent challenges such as CheckThat! at CLEF [10, 11, 12] and FEVER [7, 8, 9], running since 2018, have advanced fact-checking tasks by leveraging advanced approaches and integrating Large Language Models (LLMs) like BERT and GPT. These models represent the current state of the art in many Natural Language Processing tasks, including fact-checking. Notable examples of such technology include FacTeR-Check [22], a multilingual architecture for semi-automated fact-checking and hoax propagation analysis using the XLM-RoBERTa Transformer [13], and FACT-GPT [23], a framework that automates the claim-matching phase of fact-checking using LLMs to identify social media content that supports or contradicts claims previously debunked by fact-checkers.

The success of these systems is largely due to the capabilities of LLMs, summarized in [3], which are neural models based on the Transformer architecture. Specifically, decoder-based architectures, such as GPT [24], GPT-3 [25], and LLaMA [14], generate output sequences in an auto-regressive manner. These models have demonstrated impressive capabilities following pre-training on large collections of documents. One notable outcome is few-shot learning, where models can adapt to new tasks with only a few examples [25], greatly enhancing their flexibility and applicability.

When new annotated data is available, fine-tuning further enhances a model's capabilities. This process involves taking the pre-trained base model and training it on a smaller, specialized dataset relevant to the desired task. Parameter-Efficient Fine-Tuning (PEFT) is an optimized approach that trains only a small portion of the weights, typically by adding a new layer to the model. One widely used technique is LoRA [26], which adds an adapter consisting of two weight matrices that are relatively small compared to the original model. ExtremITA [27] is an example of a decoder-based model fine-tuned with LoRA in Italian for multi-task execution.

Several benchmark datasets have been developed to fine-tune and evaluate fact-checking systems, typically collected by organizations like Snopes, FullFact, and PolitiFact. The FEVER challenge has produced four major datasets: FEVER (2018) [6], FEVER 2.0 (2019) [8], FEVEROUS (2021) [9], and AVeriTeC (2024) [28]. These datasets range from labeled claim-evidence associations to verified claims with structured and unstructured evidence. Despite the wealth of resources available, there is a lack of large benchmark datasets in Italian. This work addresses this gap by providing a large-scale Italian resource.
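The three-way labeling scheme used throughout the paper pairs each claim with retrieved evidence and one of three labels. A minimal sketch of how such an example might be represented (the claim, evidence text, and field names here are illustrative, not taken from the official FEVER schema):

```python
# Minimal, illustrative sketch of a FEVER-style claim-verification
# example. Field names and content are hypothetical; the official
# FEVER release defines its own JSON schema.
LABELS = {"Supports", "Refutes", "NotEnoughInfo"}

example = {
    "claim": "Rome is the capital of Italy.",
    "evidence": [
        # (source page title, evidence passage) -- hypothetical passage
        ("Rome", "Rome is the capital city of Italy."),
    ],
    "label": "Supports",  # the evidence confirms the claim
}

assert example["label"] in LABELS
```

The verification task then reduces to predicting `label` from the (claim, evidence) pair; when no useful passage is found, the example falls into NotEnoughInfo.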
3. Fact Verification in Italian

As in [6], the original FEVER dataset is composed of claims that can potentially be verified against an encyclopedic resource, in this case Wikipedia. The claims are classified into three categories: Supported, Refutes and NotEnoughInfo. For the first two categories, each claim is associated with one or more passages from Wikipedia, each specifying the page from which it was extracted. For the NotEnoughInfo category, no passages are provided because no information was found on Wikipedia to support or refute the claim. For instance, the sentence "Dan Brown is illiterate." is a claim associated with pieces of evidence such as: "Angels and Demons is a 2000 bestselling mystery-thriller novel written by American author Dan Brown and published by Pocket Books and then by Corgi Books.". This piece of evidence proves that the claim is incorrect, so it can be classified with the label Refutes. In FEVER, a claim is thus a sentence that expresses information (true or mutated) about a target entity.

To generate the Italian dataset, we started from the dataset version² proposed in [29], which consists of 260k claims. This version extends the original FEVER by adding evidence associated with claims labeled as NotEnoughInfo in FEVER, using the heuristics in [18]. The approach involved using a search engine to retrieve potential evidence and a textual entailment system based on GPT [24]. Claims not judged as Supports or Refutes were classified as NotEnoughInfo. This gives us examples of sentences that are closely related to the claim (according to the search engine) but neither support nor refute it. This makes it more straightforward and efficient to train and/or evaluate a classifier, even though some of the derived examples might be somewhat noisy, as they were generated through heuristics.

² https://huggingface.co/datasets/copenlu/fever_gold_evidence

For the automatic translation process, we utilized MADLAD400 [20], a machine translation system based on the Transformer architecture³, trained on MADLAD, a manually audited, general-domain, 3T-token multilingual dataset based on CommonCrawl, spanning 419 languages. Since the Italian data are obtained through machine translation, and thus potentially incorrect as suggested in [16, 17], we needed validated test data to obtain a realistic benchmark. Our hypothesis is that an LLM is robust enough to generalize from the 228k examples and recognize the relationships involved in FEVER without inheriting translation errors. However, to prevent these errors from being inherited by the model, we manually corrected the translations of the test set.

³ https://github.com/google-research/google-research/tree/master/madlad_400

Out of the approximately 16k available test examples, three annotators were involved in verifying and correcting 2,063 translations from the test set. The annotators focused on correcting mistakes related to proper sentence structure in Italian, the accurate meaning of specific English words that MADLAD had translated literally, any misunderstandings of the intended meaning in Italian, and a few grammatical errors.

In some cases, translation errors do not completely undermine the examples with respect to the task's purpose. For instance, the English sentence from a piece of evidence, "he was booked to win a third world championship at a WWE event on the night of his death", was translated into Italian as "era stato prenotato per vincere un terzo titolo mondiale in un evento della WWE la notte della sua morte". A more accurate translation would be "si pensava avrebbe vinto un terzo titolo mondiale in un evento della WWE la notte della sua morte", better capturing the verb's meaning. In other, more problematic cases, translation errors, loss of information, or the introduction of hallucinations could even change the classification in the fact verification task. For example, for the claim "The Thin Red Line (1998 film) has an all-British cast.", the automatic translation was "La sottile linea rossa (The Thin Red Line) è un film del 1998.", which is incorrect because it omits the information about the cast. This detail is crucial, as its absence could lead to incorrect labeling.

A quantitative analysis of the translation quality suggests that MADLAD performs well in translating simple assertive sentences such as claims. In fact, 91% of the claims were not altered by the validators, who considered them completely correct. This percentage is lower for the Wikipedia passages, dropping to 76%. This discrepancy may be due to the greater complexity of the evidence compared to the simpler sentence structures in the claims. Additionally, we report the results in terms of BLEU score [30] for the corrected translations compared to the originals, as shown in Table 1. It should be noted that measuring the translation quality after correcting the sentences introduces a strong bias in the measurements; however, it provides a more specific idea of the translation quality, especially in understanding the potential noisiness of the training and development sentences. In this case, results of over 95% for BLEU-1 and over 92% for BLEU-4 suggest that very few terms were altered during validation, and even the grammatical patterns remained largely unchanged. At most, a few mistranslated terms needed updating, as indicated by the qualitative analysis.

Metric      BLEU-1   BLEU-2   BLEU-3   BLEU-4
Claim       0.9776   0.9695   0.9623   0.9544
Evidence    0.9529   0.9411   0.9309   0.9207

Table 1: BLEU scores of the manually validated (gold) claims and evidence with respect to the automatically translated (silver) versions.

Table 2 summarizes the number of examples created for the Italian dataset. In line with the original English material, the dataset is divided into training, development, and test sets, with claims categorized into Supports, Refutes, and NotEnoughInfo (NEI). The table also distinguishes between silver data (automatically translated) and gold data (manually validated). The training set consists of 228,277 claims, the development set contains 15,935 claims, and the test set has 2,063 claims. Each Italian claim or piece of evidence is aligned with its English counterpart, facilitating future research in cross-lingual fact verification.

            Train (S)   Dev (S)   Test (G)   Total
Supports    114,801     4,638     654        120,095
Refutes     47,096      4,887     643        52,626
NEI         66,380      6,410     766        73,556
Total       228,277     15,935    2,063      246,275

Table 2: Number of claims and evidence in the Italian dataset. (S) indicates silver data (automatically translated), and (G) indicates gold data (manually validated).

Language Models for Fact Verification. Large Language Models can be applied to Fact Verification either through In-Context Learning techniques [31] or by directly fine-tuning the model for the specific downstream task. In-context learning relies on the model's pre-existing knowledge acquired during pre-training and on instructions provided in natural language at inference time. This method does not involve additional training and can be categorized based on the number of examples provided: i) 0-shot learning, where no examples are given, and the model generates responses based solely on its pre-existing knowledge and the provided instructions; ii) 1-shot learning, where one example per class is added to provide a more precise context, helping the model better understand the task by offering a concrete reference point; iii) few-shot learning, where more than one example per class is provided to give the model additional contextual information during decision-making. When the model's pre-existing knowledge is insufficient, we can fine-tune it on the downstream task. Fine-tuning involves training the model in a traditional manner using input-output pairs (training data) to adjust its parameters. This process improves the model's performance on specific tasks, allowing it to learn from a more extensive set of examples. As a result, the model becomes more adept at handling similar queries, with a focus on the specific task at hand. We thus evaluated the application of a state-of-the-art LLM, namely LLaMA3 [32], by providing just the definition of the task (zero-shot), adding an example per class (one-shot), or performing fine-tuning, to demonstrate the necessity of a training dataset like the one constructed in this work, as discussed in the following section.

4. Experimental Evaluation

The goal of our experimentation is to assess the performance of a state-of-the-art LLM applied to Fact Verification. Specifically, we aim to determine whether a multilingual model maintains consistent quality when applied to both the English FEVER dataset and our Italian dataset. We utilize LLaMA3-Instruct⁴, an instruction-tuned generative text model from Meta with 8 billion parameters, released in April 2024. This model is trained to execute specific instructions or prompts across various tasks. To ensure alignment, we evaluate the systems on the manually validated Italian test set and the same subset of 2,063 claims in the English counterpart. The model is evaluated in 0-shot and 1-shot settings to assess its capability without fine-tuning. The prompts used in English and Italian are provided in Appendix A. Additionally, we fine-tuned LLaMA3 on the English datasets from [29] and separately on the Italian datasets obtained via machine translation. Fine-tuning was conducted on an NVIDIA A100 using the LoRA technique⁵.

⁴ https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
⁵ The following hyperparameters were used: a learning rate of 0.0001, two epochs, LoRA_R set to 8, LoRA_alpha set to 16, and LoRA_dropout at 0.05. The micro-batch size was 2, and gradient accumulation steps were set to 8.

In FEVER, the title of the document associated with each claim often provides crucial context. For example, the claim "The University of Leicester discovered and identified the remains of a king." relies on the document titled "University of Leicester" to correctly classify the claim as Supports. To assess the model's ability to generalize, we evaluate the impact of including document titles in prompts. The metrics used to analyze the results are recall, precision, accuracy, and F1 score, calculated globally and for each label (Supports, Refutes, NotEnoughInfo).

The results are reported in Tables 3 and 4 for the English and Italian datasets, respectively. Each table shows whether the model underwent fine-tuning (column FT), whether a prompt without examples (0-shot) or with one example per class (1-shot) was used (column Prompt), and whether the document title was included (column Doc). Notably, if no fine-tuning was performed, the original LLaMA3-Instruct model was used. Given that the system's response can consist of multiple words, we search the output for the mention of one of the classes and associate the example with that class. If no class is identified, the result is classified as NotEnoughInfo. In general, the fine-tuned model is extremely stable, consistently outputting one of the three categories for every request. The non-fine-tuned model, on rare occasions (just a few dozen times out of 2,000), produces responses that do not correspond to any of the required classes. This highlights the inherent stability of LLaMA3 while also supporting the soundness of the results achieved.
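The LoRA technique referenced in footnote 5 replaces the full weight update with a low-rank one: a frozen matrix W is augmented as y = Wx + (alpha/r)·B(Ax), and only the small matrices A and B are trained. A toy, pure-Python illustration of this idea (dimensions and values are arbitrary and much smaller than LLaMA3's; this is not the actual training code, which used the footnote's r = 8, alpha = 16 configuration):

```python
# Toy illustration of a LoRA adapter: y = W x + (alpha / r) * B (A x).
# W is frozen; only the low-rank factors A (r x d) and B (d x r) are
# trained. B is initialised to zero so the adapter starts as a no-op.
def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

d, r = 4, 2                                               # toy model dim, LoRA rank
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen weights
A = [[0.1] * d for _ in range(r)]                         # trainable, r x d
B = [[0.0] * r for _ in range(d)]                         # trainable, d x r, zero-init
alpha = 4.0

x = [1.0, 2.0, 3.0, 4.0]
y = [w + (alpha / r) * b
     for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# With B at its zero initialisation, the adapted model matches the base:
assert y == matvec(W, x)
```

Because only A and B receive gradients, the number of trainable parameters scales with r rather than with the full weight matrix, which is what makes fine-tuning an 8B-parameter model feasible on a single A100.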
FT   Prompt   Doc   Acc     Supports (P / R / F1)    Refutes (P / R / F1)     NotEnoughInfo (P / R / F1)   Macro Avg (P / R / F1)
No   0-shot   No    0.449   0.784 / 0.161 / 0.267    0.647 / 0.236 / 0.346    0.395 / 0.873 / 0.544        0.609 / 0.423 / 0.386
No   0-shot   Yes   0.374   0.343 / 0.976 / 0.507    0.763 / 0.160 / 0.265    0.477 / 0.041 / 0.075        0.528 / 0.392 / 0.282
No   1-shot   No    0.591   0.555 / 0.864 / 0.675    0.699 / 0.415 / 0.521    0.586 / 0.507 / 0.543        0.613 / 0.595 / 0.580
No   1-shot   Yes   0.383   0.929 / 0.020 / 0.039    0.867 / 0.020 / 0.040    0.376 / 0.999 / 0.546        0.724 / 0.346 / 0.208
Yes  0-shot   No    0.917   0.932 / 0.947 / 0.939    0.924 / 0.888 / 0.906    0.899 / 0.916 / 0.908        0.918 / 0.917 / 0.918
Yes  0-shot   Yes   0.922   0.938 / 0.953 / 0.945    0.929 / 0.896 / 0.912    0.902 / 0.918 / 0.910        0.923 / 0.922 / 0.923
Yes  1-shot   No    0.914   0.928 / 0.948 / 0.938    0.927 / 0.883 / 0.905    0.893 / 0.911 / 0.902        0.916 / 0.914 / 0.915
Yes  1-shot   Yes   0.921   0.931 / 0.956 / 0.943    0.927 / 0.891 / 0.909    0.907 / 0.916 / 0.912        0.922 / 0.921 / 0.921

Table 3: Performance in terms of Accuracy, Precision, Recall and F1-measure of our systems on the FEVER-EN dataset.
FT   Prompt   Doc   Acc     Supports (P / R / F1)    Refutes (P / R / F1)     NotEnoughInfo (P / R / F1)   Macro Avg (P / R / F1)
No   0-shot   No    0.462   0.411 / 0.951 / 0.574    0.607 / 0.457 / 0.522    0.585 / 0.050 / 0.092        0.534 / 0.486 / 0.396
No   0-shot   Yes   0.507   0.463 / 0.942 / 0.620    0.587 / 0.663 / 0.622    0.800 / 0.005 / 0.010        0.617 / 0.537 / 0.418
No   1-shot   No    0.425   0.376 / 0.963 / 0.541    0.671 / 0.333 / 0.445    0.478 / 0.043 / 0.079        0.508 / 0.446 / 0.355
No   1-shot   Yes   0.462   0.403 / 0.968 / 0.569    0.632 / 0.361 / 0.459    0.698 / 0.115 / 0.197        0.578 / 0.481 / 0.409
Yes  0-shot   No    0.897   0.897 / 0.940 / 0.918    0.924 / 0.845 / 0.882    0.877 / 0.903 / 0.890        0.899 / 0.896 / 0.897
Yes  0-shot   Yes   0.901   0.899 / 0.936 / 0.917    0.923 / 0.855 / 0.888    0.887 / 0.910 / 0.898        0.903 / 0.900 / 0.901
Yes  1-shot   No    0.895   0.891 / 0.947 / 0.918    0.919 / 0.843 / 0.879    0.881 / 0.894 / 0.887        0.897 / 0.895 / 0.895
Yes  1-shot   Yes   0.905   0.913 / 0.942 / 0.927    0.924 / 0.854 / 0.888    0.883 / 0.915 / 0.899        0.907 / 0.904 / 0.905

Table 4: Performance in terms of Accuracy, Precision, Recall and F1-measure of our systems on the FEVER-IT dataset.
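The evaluation-time rule described in Section 4 (scan the generated answer for a class mention, defaulting to NotEnoughInfo) can be sketched as follows. The prompt template here is a hypothetical stand-in, since the actual English and Italian prompts are reported in Appendix A:

```python
# Sketch of the label-extraction rule from Section 4: the generated
# answer may span several words, so we return the first class name it
# mentions and fall back to NotEnoughInfo when none is found.
# build_prompt is a hypothetical stand-in for the Appendix A prompts.
CLASSES = ["Supports", "Refutes", "NotEnoughInfo"]

def build_prompt(claim, evidence, title=None):
    header = f"Document: {title}\n" if title else ""   # optional Doc column
    return (f"{header}Evidence: {evidence}\nClaim: {claim}\n"
            f"Label the claim as one of: {', '.join(CLASSES)}.")

def extract_label(generated_answer):
    text = generated_answer.lower()
    for label in CLASSES:
        if label.lower() in text:
            return label
    return "NotEnoughInfo"   # default when no class is mentioned

assert extract_label("I think the evidence REFUTES the claim.") == "Refutes"
assert extract_label("Hard to say.") == "NotEnoughInfo"
```

This fallback only matters for the non-fine-tuned model, which, as noted above, occasionally produces answers mentioning none of the three classes.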
A key finding is that the multilingual model generally achieves similar, though modest, results on the English and Italian datasets without fine-tuning, with accuracy values around 0.40–0.50 and average F1 scores in the range of 0.35–0.55. This performance is relatively unstable, and the addition of an example in the prompt does not lead to significant improvements. In English there are some improvements, but in Italian there are fewer. We believe this is because, although LLaMA is multilingual, the percentage of Italian examples observed during training is less than 1%, making it less performant and less stable in this language.

However, when fine-tuning is applied, the results improve dramatically, with accuracy exceeding 90% in both languages. This demonstrates the utility of the translated dataset, even if it contains some noise. In this scenario, adding an example in the prompt leads to negligible but consistent improvements. Additionally, the inclusion of the document title, while sometimes causing inconsistencies in zero-shot learning, is better exploited by the fine-tuned model, leading to slight but not significant improvements. This is interesting because it suggests that a model not relying on document titles is more broadly applicable. Overall, the fine-tuned models perform significantly better, highlighting the importance of the translated dataset for achieving high accuracy in fact verification tasks in both English and Italian.

The error analysis suggests that the model sometimes inherits the mathematical reasoning limitations of the underlying LLM. For example, the claim "Il Castello di Praga attira oltre 18 milioni di visitatori ogni anno."⁶ was given the evidence "Il castello è tra le attrazioni turistiche più visitate di Praga che attira oltre 1,8 milioni di visitatori all'anno."⁷ The model's predicted label was Refutes, while the true label was Supports. Here, the true label should be Supports since 18 million is indeed greater than 1.8 million, but the model found the numbers inconsistent. In another case, the claim "Ned Stark è stato introdotto nel 1996 in Tempesta di spade."⁸ was paired with the evidence "Introdotto nel 1996 in Il Trono di Spade, Ned è l'onorevole signore di Winterfell, un'antica fortezza nel nord del continente immaginario di Westeros."⁹ The model predicted Refutes, although the true label was Supports. The confusion here is due to the difference in the book titles, which are from the same series but are distinct works.

The error analysis thus revealed that the model occasionally struggled with mathematical reasoning and contextual understanding, highlighting areas for future enhancement. Larger models and further fine-tuning could potentially address these issues, which remain open questions for future research.

⁶ In English: "The Prague Castle attracts over 18 million visitors every year."
⁷ In English: "The castle is among the most visited tourist attractions in Prague, attracting over 1.8 million visitors every year."
⁸ In English: "Ned Stark was introduced in 1996 in A Storm of Swords."
⁹ In English: "Introduced in 1996 in A Game of Thrones, Ned is the honorable lord of Winterfell, an ancient fortress in the north of the imaginary continent of Westeros."
5. Conclusion Association for Computational Linguistics, Santa
Fe, New Mexico, USA, 2018, pp. 3346–3359. URL:
In this work, we have introduced FEVER-IT, an Italian https://aclanthology.org/C18-1283.
version of the FEVER dataset, designed to improve the [6] J. Thorne, A. Vlachos, C. Christodoulopoulos,
training and evaluation of models for fact verification in A. Mittal, FEVER: a large-scale dataset for fact ex-
the Italian language. Using a machine translation system, traction and VERification, in: M. Walker, H. Ji,
we translated a large-scale dataset of 228,000 claims/- A. Stent (Eds.), Proceedings of the 2018 Confer-
pieces of evidence pairs and manually validated 2, 000 ence of the North American Chapter of the As-
test instances to ensure meaningful evaluations. This en- sociation for Computational Linguistics: Human
abled us to fine-tune a state-of-the-art LLM, specifically Language Technologies, Volume 1 (Long Papers),
LLaMA3, and assess its performance in both English and Association for Computational Linguistics, New
Italian. Orleans, Louisiana, 2018, pp. 809–819. URL: https:
Our experiments demonstrated that the multilingual //aclanthology.org/N18-1074. doi:10.18653/v1/
model, without fine-tuning, performed similarly on both N18-1074.
English and Italian datasets, though the accuracy and [7] J. Thorne, A. Vlachos, O. Cocarascu,
stability were limited. Fine-tuning significantly improved C. Christodoulopoulos, A. Mittal, The fact
the model’s performance, achieving over 90% accuracy extraction and VERification (FEVER) shared task,
in both languages. This underscores the importance and in: Proceedings of the First Workshop on Fact Ex-
effectiveness of the translated dataset, even if it contains traction and VERification (FEVER), Association for
some noise. Computational Linguistics, Brussels, Belgium, 2018,
Future work will explore the performance of larger pp. 1–9. URL: https://aclanthology.org/W18-5501.
models and further refinement of the dataset to enhance doi:10.18653/v1/W18-5501.
accuracy and generalization capabilities or explore more [8] J. Thorne, A. Vlachos, O. Cocarascu,
complex settings such as those described in [9]. C. Christodoulopoulos, A. Mittal, The FEVER2.0
shared task, in: Proceedings of the Second
Workshop on Fact Extraction and VERifica-
Acknowledgments tion (FEVER), Association for Computational
The team would like to thank Monika Kakol for her invaluable support in the validation of the translations. This work was supported by Project ECS 0000024 Rome Technopole - CUP B83C22002820006, NRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.

References

[1] Z. Guo, M. S. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Trans. Assoc. Comput. Linguistics 10 (2022) 178–206.
[2] T. Flew, C. Spurgeon, A. Daniel, A. Swift, The promise of computational journalism, Journalism Practice 6 (2012) 157–171.
[3] C. Chen, K. Shu, Combating misinformation in the age of LLMs: Opportunities and challenges, 2023. URL: https://arxiv.org/abs/2311.05656. arXiv:2311.05656.
[4] M. Akhtar, M. Schlichtkrull, Z. Guo, O. Cocarascu, E. Simperl, A. Vlachos, Multimodal automated fact-checking: A survey, 2023. URL: https://arxiv.org/abs/2305.13507. arXiv:2305.13507.
[5] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, 2018.
[8] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The FEVER2.0 shared task, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1–6. URL: https://aclanthology.org/D19-6601. doi:10.18653/v1/D19-6601.
[9] R. Aly, Z. Guo, M. S. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task, in: Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Dominican Republic, 2021, pp. 1–13. URL: https://aclanthology.org/2021.fever-1.1. doi:10.18653/v1/2021.fever-1.1.
[10] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: Proceedings of the 43rd European Conference on Information Retrieval, ECIR '21, Lucca, Italy, 2021, pp. 639–649. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_75.
[11] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting
the COVID-19 infodemic and fake news detection, in: Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[12] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[14] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[15] B. Berendt, P. Burger, R. Hautekiet, J. Jagers, A. Pleijter, P. Van Aelst, FactRank: Developing automated claim detection for Dutch-language fact-checkers, Online Social Networks and Media 22 (2021) 100113. doi:10.1016/j.osnem.2020.100113.
[16] D. Croce, A. Zelenanska, R. Basili, Enabling deep learning for large scale question answering in Italian, Intelligenza Artificiale 13 (2019) 49–61. URL: https://doi.org/10.3233/IA-190018. doi:10.3233/IA-190018.
[17] A. Scaiella, D. Croce, R. Basili, Large scale datasets for image and video captioning in Italian, Italian Journal of Computational Linguistics 5 (2019) 49–60. URL: http://www.ai-lc.it/IJCoL/v5n2/IJCOL_5_2_3___scaiella_et_al.pdf.
[18] C. Malon, Team Papelo: Transformer networks at FEVER, in: J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal (Eds.), Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 109–113. URL: https://aclanthology.org/W18-5517. doi:10.18653/v1/W18-5517.
[19] L. Canale, A. Messina, Experimenting AI technologies for disinformation combat: the IDMO project, 2023. URL: https://arxiv.org/abs/2310.11097. arXiv:2310.11097.
[20] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, D. Xin, A. Kusupati, R. Stella, A. Bapna, O. Firat, MADLAD-400: A multilingual and document-level large audited dataset, in: Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 67284–67296.
[21] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: C. Danescu-Niculescu-Mizil, J. Eisenstein, K. McKeown, N. A. Smith (Eds.), Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Association for Computational Linguistics, Baltimore, MD, USA, 2014, pp. 18–22. URL: https://aclanthology.org/W14-2508. doi:10.3115/v1/W14-2508.
[22] A. Martín, J. Huertas-Tato, Á. Huertas-García, G. Villar-Rodríguez, D. Camacho, FacTeR-Check: Semi-automated fact-checking through semantic similarity and natural language inference, Knowledge-Based Systems 251 (2022) 109265. doi:10.1016/j.knosys.2022.109265.
[23] E. C. Choi, E. Ferrara, Automated claim matching with large language models: Empowering fact-checkers in the fight against misinformation, in: Companion Proceedings of the ACM on Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1441–1449. URL: https://doi.org/10.1145/3589335.3651910. doi:10.1145/3589335.3651910.
[24] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, OpenAI Technical Report (2018).
[25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020.
[26] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
[27] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR
Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper13.pdf.
[28] M. Schlichtkrull, Z. Guo, A. Vlachos, AVeriTeC: A dataset for real-world claim verification with evidence from the web, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 65128–65167.
[29] P. Atanasova, D. Wright, I. Augenstein, Generating label cohesive and well-formed adversarial claims, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 3168–3177. URL: https://aclanthology.org/2020.emnlp-main.256. doi:10.18653/v1/2020.emnlp-main.256.
[30] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguistics, USA, 2002, pp. 311–318. URL: https://doi.org/10.3115/1073083.1073135. doi:10.3115/1073083.1073135.
[31] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, X. Sun, L. Li, Z. Sui, A survey on in-context learning, 2024. URL: https://arxiv.org/abs/2301.00234. arXiv:2301.00234.
[32] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

A. Prompt Engineering

This appendix contains the prompts used in the experiments. The prompts are provided in both Italian and English, reflecting the task-specific nature of the experiments. Each prompt begins with an explanation of the task and of the meaning of the classes. Among the different variants, the 0-shot setting does not include any examples, unlike the 1-shot setting. Where necessary, the name of the document from which the evidence is taken is also specified.

A.1. Prompts in English

A.1.1. 0-shot Setting

The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]

A.1.2. 1-shot Setting

The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Note that only the evidence is reported, without the title of the original document.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Examples
These examples demonstrate how to apply the evaluation criteria:
- Claim: The Germanic peoples are also called Gothic.
- Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
- Answer: SUPPORTS

- Claim: Tennis is not a sport.
- Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
- Answer: REFUTES

- Claim: Kick-Ass is a horror film.
- Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
- Answer: NOT ENOUGH INFO
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
### Answer: [ANSWER HERE]

A.1.3. 0-shot Setting with Document Title

The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.
- Document: denotes the source document for the evidence.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
- Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]

A.1.4. 1-shot Setting with Document Title

The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Each input evidence is provided with the title of its original document.

### Instruction
Evaluate if the claim is supported by the evidence provided. Definitions for key terms used in this task are:
- Claim: A statement or assertion under examination.
- Evidence: Information that either supports or opposes the claim.
- Document: denotes the source document for the evidence.

Answer with one of the following judgments based on the evidence provided:
- SUPPORTS: if the evidence substantiates the claim.
- REFUTES: if the evidence directly contradicts the claim.
- NOT ENOUGH INFO: if there is insufficient evidence to determine the claim's validity.
### Examples
These examples demonstrate how to apply the evaluation criteria:
- Claim: The Germanic peoples are also called Gothic.
- Evidence: The Germanic peoples (also referred to as Teutonic, Suebian, or Gothic in older literature) are an Indo-European ethno-linguistic group of Northern European origin.
- Document: Germanic peoples
- Answer: SUPPORTS

- Claim: Tennis is not a sport.
- Evidence: Tennis is played by millions of recreational players and is also a popular worldwide spectator sport.
- Document: Tennis
- Answer: REFUTES

- Claim: Kick-Ass is a horror film.
- Evidence: Kick-Ass is a 2010 British-American film based on the comic book of the same name by Mark Millar and John Romita, Jr.
- Document: Kick-Ass (film)
- Answer: NOT ENOUGH INFO
### Input
- Claim: [CLAIM HERE]
- Evidence: [EVIDENCE HERE]
- Document: [DOCUMENT HERE]
### Answer: [ANSWER HERE]

A.2. Prompts in Italian

A.2.1. 0-shot Setting

The following prompt is used for 0-shot learning, where the task and classes are presented without additional information.

### Istruzioni
Valuta se l'affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
- Affermazione: Una dichiarazione o asserzione sotto esame.
- Prova: Informazioni che supportano o contraddicono l'affermazione.
Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
- SUPPORTS: se le prove confermano l'affermazione.
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]

A.2.2. 1-shot Setting

The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Note that only the evidence is reported, without the title of the original document.

### Istruzioni
Valuta se l'affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
- Affermazione: Una dichiarazione o asserzione sotto esame.
- Prova: Informazioni che supportano o contraddicono l'affermazione.

Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
- SUPPORTS: se le prove confermano l'affermazione.
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.
### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
- Affermazione: I popoli germanici sono chiamati anche gotici.
- Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
- Risposta: SUPPORTS

- Affermazione: Il tennis non è uno sport.
- Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
- Risposta: REFUTES

- Affermazione: Kick-Ass è un film horror.
- Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
- Risposta: NOT ENOUGH INFO
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
### Risposta: [ANSWER HERE]

A.2.3. 0-shot Setting with Document Title

The following prompt is used for 0-shot learning, where the task and classes are explained without additional information. Each input evidence is provided with the title of its original document.

### Istruzioni
Valuta se l'affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
- Affermazione: Una dichiarazione o asserzione sotto esame.
- Prova: Informazioni che supportano o contraddicono l'affermazione.
- Documento: indica la fonte da cui è stata estratta la prova.

Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
- SUPPORTS: se le prove confermano l'affermazione.
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
- Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]

A.2.4. 1-shot Setting with Document Title

The following prompt is used for 1-shot learning, where the task and classes are explained, and one example per class is provided. Each input evidence is provided with the title of its original document.

### Istruzioni
Valuta se l'affermazione è supportata dalle prove fornite. Le definizioni dei termini chiave utilizzati in questo compito sono:
- Affermazione: Una dichiarazione o asserzione sotto esame.
- Prova: Informazioni che supportano o contraddicono l'affermazione.
- Documento: indica la fonte da cui è stata estratta la prova.

Rispondi con uno dei seguenti giudizi basati sulle prove fornite:
- SUPPORTS: se le prove confermano l'affermazione.
- REFUTES: se le prove contraddicono direttamente l'affermazione.
- NOT ENOUGH INFO: se le prove non sono sufficienti per determinare la validità dell'affermazione.
### Esempi
Questi esempi dimostrano come applicare i criteri di valutazione:
- Affermazione: I popoli germanici sono chiamati anche gotici.
- Prova: I popoli germanici (anche chiamati Teutoni, Suebi o Goti nella letteratura più antica) sono un gruppo etno-linguistico indoeuropeo di origine nord europea.
- Documento: Popoli germanici
- Risposta: SUPPORTS

- Affermazione: Il tennis non è uno sport.
- Prova: Il tennis è praticato da milioni di giocatori amatoriali ed è anche uno sport popolare a livello mondiale.
- Documento: Tennis
- Risposta: REFUTES

- Affermazione: Kick-Ass è un film horror.
- Prova: Kick-Ass è un film britannico-americano del 2010 basato sul fumetto omonimo di Mark Millar e John Romita Jr.
- Documento: Kick-Ass (film)
- Risposta: NOT ENOUGH INFO
### Input
- Affermazione: [CLAIM HERE]
- Prova: [EVIDENCE HERE]
- Documento: [DOCUMENT HERE]
### Risposta: [ANSWER HERE]
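In practice, the templates above are used by substituting the [CLAIM HERE], [EVIDENCE HERE], and, where present, [DOCUMENT HERE] placeholders, and by mapping the text the model generates after "### Answer:" onto one of the three FEVER labels. The following minimal Python sketch illustrates this; the helper functions and the abbreviated template are ours for illustration only, not part of the released resource.

```python
# Illustrative sketch: instantiating an abbreviated version of the 0-shot
# template of Appendix A.1.1 and mapping a raw model continuation onto a
# FEVER label. build_prompt/parse_answer are hypothetical helper names.

TEMPLATE = (
    "### Instruction\n"
    "Evaluate if the claim is supported by the evidence provided.\n"
    "Answer with one of the following judgments: SUPPORTS, REFUTES,\n"
    "NOT ENOUGH INFO.\n"
    "### Input\n"
    "- Claim: {claim}\n"
    "- Evidence: {evidence}\n"
    "### Answer: "
)

LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")


def build_prompt(claim: str, evidence: str) -> str:
    """Fill the claim/evidence placeholders of the (abbreviated) template."""
    return TEMPLATE.format(claim=claim, evidence=evidence)


def parse_answer(generation: str) -> str:
    """Map the model's free-text continuation onto one of the three labels."""
    text = generation.strip().upper()
    for label in LABELS:
        if text.startswith(label):
            return label
    # Conservative fallback when the continuation matches no label exactly.
    return "NOT ENOUGH INFO"
```

The with-document variants only add a "- Document: ..." line to the input block, and the 1-shot variants prepend the worked examples; the substitution and answer-parsing logic stays the same.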