
WorthIt: Check-worthiness Estimation of Italian Social Media Posts

Agnese Dafara

0 2

Alan Ramponi

Sara Tonelli

0 Department of Humanities, University of Pavia, Pavia, Italy
1 Digital Humanities group, Fondazione Bruno Kessler, Trento, Italy
2 Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany

2025

Check-worthiness estimation is the first and a paramount task in the automated fact-checking pipeline. It allows professional fact-checkers to cope with the increasing amount of mis/disinformative textual content being published online by prioritizing claims that are factual/verifiable and worthy of verification. Despite the long tradition of check-worthiness estimation in NLP, there is currently a lack of annotated resources and associated methods for Italian. Moreover, current datasets typically cover a single topic and focus on a limited time frame, affecting models' generalizability on out-of-distribution data. To fill these gaps, in this paper we introduce WorthIt, the first annotated dataset for factuality/verifiability and check-worthiness estimation of Italian social media posts that covers public discourse on migration, climate change, and public health issues across a large time period of six years. We describe the dataset creation in detail and conduct thorough experimentation with the WorthIt dataset using a wide array of encoder- and decoder-based models. Our results show that fine-tuning monolingual encoder-based models in a multi-task setting provides the best overall performance and that decoder-based models in a few-shot setup still struggle in capturing the relation between factuality/verifiability and check-worthiness. We release our dataset, code, and associated materials to the research community.§

Keywords: automated fact-checking, check-worthiness estimation, factual/verifiable claim detection, resources and evaluation

1. Introduction

Given the unprecedented amount of mis/disinformation spreading online, assisting fact-checkers in their everyday work by automating some of their tasks is becoming of paramount importance. The identification of content that is worthy of verification – i.e., check-worthiness estimation, also referred to as check-worthy claim detection – represents the first stage in the fact-checking pipeline [1] insofar as it allows professional fact-checkers to reduce the effort spent screening content that is not worthy of attention, therefore focusing on the verification of potentially false or misleading information.

According to Nakov et al. [2], a claim is deemed check-worthy and calls for the attention of a fact-checker if it “is likely to be false, is of public interest, and/or appears to be harmful”, also being not “easy to fact-check by a layperson” (e.g., “The capital of Italy is Rome”). A check-worthy claim is both factual and verifiable [2, 3], i.e., it presents an “assertion about the world that is checkable” [4], namely it “state[s] a definition, mention[s] a quantity in the present or in the past, make[s] a verifiable prediction of the future, reference[s] laws, procedures, and rules of operation, discuss[es] images or videos, [or] state[s] correlation or causation” [2]. In other words, if a claim is factual and verifiable, it is possible to determine its check-worthiness based on whether it is relevant and may potentially have a broader impact on the general public [5] (see examples in Figure 1).

§ The repository is publicly available on GitHub at: https://github.com/dhfbk/worthit.

Check-worthy claim detection1 has become a well-established task in NLP since the introduction of the first CheckThat! evaluation campaign [6]. However, despite the progress and the coverage of multiple languages in the past CheckThat! editions, no dataset or task for check-worthiness estimation specifically for the Italian language has been considered so far. Moreover, current datasets for check-worthiness estimation in other languages mostly focus on COVID-19 issues and consist of posts that were drawn from a relatively short time period (e.g., one year and three months [2]), affecting the out-of-distribution generalization of models [7].

1 In this paper, we refer to the task as “check-worthy claim detection” or “check-worthiness estimation” interchangeably.

Contributions In this paper, we address the aforementioned gaps by developing WorthIt, the first annotated dataset for factuality/verifiability and check-worthiness estimation for Italian, and by conducting extensive experiments with encoder- and decoder-based models. WorthIt covers public discourse from Twitter on migration, climate change, and public health issues over a large time frame of six years. The full dataset was annotated by two expert annotators who discussed the cases of disagreement to resolve annotation errors (e.g., due to attention drops) while keeping genuine annotation divergences (e.g., due to different interpretations), in line with recent work advocating the importance of considering human label variation in subjective tasks [8, 9, 10, 11, 12, inter alia]. We fine-tune a wide array of monolingual and multilingual encoder-based models in single- and multi-task learning settings, and experiment with four decoder-based models that include Italian in pretraining data in a few-shot setup after a careful selection of representative examples. Results show that multi-task fine-tuning of encoder-based models provides the best performance, and that decoder-based models – with or without annotation guidelines in the prompt, either in Italian or English – still struggle in tackling the task effectively, even when provided with information about the factuality/verifiability of the post.

2. Related Work

Check-worthy claim detection is a popular task within the NLP community mostly thanks to the series of CheckThat! shared tasks organized by the CLEF initiative.2 Indeed, check-worthy claim detection is the only task that has been proposed at all seven CheckThat! editions [6, 13, 14, 15, 2, 16, 17]. Several datasets for training check-worthiness estimation models in different languages have been created and released, starting from English and Arabic at CheckThat! 2018 [6] to Arabic, Bulgarian, Dutch, English, Spanish, and Turkish in later editions [2, 16, 17]. Besides CheckThat! datasets, additional resources have been developed over the years, mainly focused on specific events like COVID-19 [18] or political news [19]. English is the most represented language for check-worthiness estimation, but the scientific community has recently started to focus on the development of resources for other languages too, since check-worthy claims can refer to events that are relevant only for areas in which a given language is spoken. In this respect, Italian represents an exception because, to our knowledge, no dataset for check-worthiness estimation in this language has been developed so far. Recently the Check-IT! dataset [20] has been created, which however contains only fact-checked (i.e., check-worthy) claims. Likewise, the FEVER-IT dataset [21] is a translation into Italian of the widely-used FEVER dataset [22], and contains only claims to be verified against textual sources. In this work, we address this gap by presenting the novel WorthIt dataset, which covers a previously overlooked language for the task of check-worthiness estimation. The dataset has been carefully sampled across topics and time for better models’ generalizability, since past works have shown that the performance of automated fact-checking drops under domain shift [23]. The WorthIt dataset has also been fully annotated by two raters to value human label variation [8].

2 https://www.clef-initiative.eu/.

Concerning the methods for check-worthiness estimation, state-of-the-art results in the CheckThat! evaluation campaigns are mostly based on fine-tuned encoder-based models such as BERT, RoBERTa, and DistilBERT [24, 25, 26] and language-specific variants [16], often combined with data augmentation [27] and ensembling strategies. Recently, large language models (LLMs) have started to be used for the task, showing promising performance. For instance, the best performing system on English at CheckThat! 2024 [17] fine-tuned Llama-2-7B on the provided training data and then leveraged prompts generated by ChatGPT for check-worthy claim detection [28]. However, previous works do not leverage the synergies between factuality/verifiability and check-worthiness, although the two tasks are strictly related. Our work makes a step towards this goal by fine-tuning encoder-based models in a multi-task learning setting and experimenting with sequential prompting using decoder-based models.

3. WorthIt Dataset

In this section, we describe the dataset creation process, from data collection (Section 3.1) to data annotation (Section 3.2). We then present data statistics (Section 3.3).

3.1. Data Collection

We collect social media posts pertaining to migration, climate change, and public health issues using the Twitter APIs.3 To mitigate temporal bias in the dataset, we focus on a large time frame of six full years (from 2017-01-01 to 2022-12-31) and retain messages in Italian about the aforementioned topics by using a manually curated list of over 400 keywords derived from reliable glossaries and scientific manuals (see Appendix A). Following Nakov et al. [2], we further filter out posts containing ≤ 5 tokens4 and sort the remaining messages by their sum of likes and retweets. We then select the top-k (k = 10) posts exhibiting the highest number of likes and retweets for each month and topic subset, therefore focusing on the messages with the highest impact on society while simultaneously mitigating topic and temporal biases. We further account for the potential presence of authors’ writing style biases that can occur when many posts authored by the same users are included in the dataset: we therefore retain only the most impactful post authored by the same user in each data subset. Overall, we collect 2,160 posts evenly distributed across topics (i.e., 720 for each topic) and time periods (i.e., 360 for each year) for factuality/verifiability and check-worthiness annotation. All posts have then been anonymized by replacing user mentions, URLs, email addresses, and phone numbers with placeholders (i.e., [USER], [URL], [EMAIL], and [PHONE], respectively) and newline characters (i.e., \n and \r) have been replaced with single spaces.

3 Tweets were retrieved in 02/2023 when the APIs for research purposes were still available for free.

4 Computed using the it_core_news_sm spaCy model (v3.5).

3.2. Data Annotation

Each post has been annotated with two labels, namely i) one denoting whether the content of the post is factual/verifiable – either yes or no – and ii) one indicating its check-worthiness – with labels in a 5-point Likert scale: definitely yes, probably yes, neither yes nor no, probably no, or definitely no. It is worth noting that, as opposed to determining factuality/verifiability, estimating check-worthiness is a partly subjective task. This motivates us to create WorthIt with parallel labels by the annotators on all posts so that future studies on human label variation can be conducted. The annotation guidelines closely follow the ones used in CheckThat! shared tasks and are provided in Appendix B.

Annotators Annotation was conducted by two expert annotators. Both annotators are native speakers of Italian and have naturally been exposed to public discourse on migration, climate change, and public health in the Italian context. They identify themselves as a woman and a man, with age ranges 20–30 and 30–40. They have a background in linguistics and natural language processing and conducted annotation as part of their work.

Annotation process Annotators were provided with annotation guidelines for determining the factuality/verifiability and check-worthiness of social media posts (Appendix B). After conducting a pilot annotation phase on a small subset of the messages, annotators discussed the cases in which their annotations diverged, and consolidated the guidelines by specifying how to deal with special cases (e.g., in the presence of reported speech; see Appendix B). Then, they both labeled the full set of posts in four rounds of annotation. Each round involved a discussion phase aimed at resolving annotation errors (e.g., due to attention slips) while keeping instances exhibiting genuine disagreement (e.g., different interpretations). This makes WorthIt the first check-worthiness estimation dataset that goes beyond the “single ground truth” assumption in subjective annotation.

Inter-annotator agreement We computed the inter-annotator agreement (IAA) on the full dataset for both factuality/verifiability and check-worthiness using Krippendorff’s alpha (α) [29]. We obtain 0.8322 for factuality/verifiability and 0.6909 for check-worthiness. As expected, albeit substantial, the IAA for check-worthiness is lower than that for factuality/verifiability due to the genuine disagreement that we retain on purpose.

3.3. Data Analysis and Statistics

WorthIt comprises 2,160 posts distributed across topics and time periods as shown in Figure 2, in which we also highlight the overlap in posts with Faina [30], a previously released dataset for fine-grained fallacy detection. Specifically, WorthIt includes the same posts from 2019 to 2022 that are in Faina and further includes messages from the 2017 and 2018 time periods. This opens opportunities for studying the interplay between check-worthiness and fallacious argumentation in future work as well as investigations on human label variation, especially because annotators are the same for both datasets.

Overall, social media posts in WorthIt have an average token length of 38.6 and the full dataset comprises 83,315 tokens, of which 28,562, 26,667, and 28,086 are part of migration, climate change, and public health posts, respectively. In Table 1 we summarize the annotation statistics for factuality/verifiability and check-worthiness for both annotators (A1 and A2). While 1,413–1,432 posts (65.4%–66.3%) are considered as factual/verifiable

by annotators, we stress that the check-worthiness of a post can be estimated only if the post itself is deemed as factual/verifiable. Indeed, only 1,011 (46.8%) and 784 (36.3%) posts over the total are classified as check-worthy (i.e., either with the label probably yes or definitely yes) by A1 and A2, respectively. We observe that the overall statistics for factuality/verifiability are similar among annotators, while those for check-worthiness, as expected, vary more. Specifically, A1 appears to have been more inclined to assign clear-cut check-worthiness scores, whereas A2 distributed their ratings more across the scale. While in our experiments (Section 4) we do not directly leverage this information, our dataset is released to the community with disaggregated labels using the full 5-point Likert scale to encourage work on fine-grained check-worthiness estimation and human label variation.

4. Experiments

We conduct experiments on check-worthiness estimation with both encoder- and decoder-based models using the newly-introduced WorthIt dataset. In this section, we thoroughly detail our experimental setup (Section 4.1) and the model variant and prompt selection process (Section 4.2). Then, we present test set results (Section 4.3).

4.1. Experimental Setup

Task setup We cast the check-worthiness task as a binary classification problem and consider factuality/verifiability as auxiliary information that can be leveraged by models to improve performance on the task. Given that each post has annotations provided by all annotators (i.e., A1 and A2) for both factuality/verifiability and check-worthiness, for the purpose of the experiments [...]

Data splits We divide WorthIt into training and test sets using k-fold cross-validation (k = 5) preserving the label distribution across splits. For development, we rely on the training portions only and further divide them into training and development sets for the purpose of model variant and prompt selection (Section 4.2). Specifically, for encoder-based models we split them into five 80%/20% training/development sets, while for decoder-based models we divide them into two equal parts: the first half is used for selecting examples for few-shot prompting, while the second half serves as development set. All texts were lowercased for the purpose of the experiments.

Models For the experiments with encoder-based models, we use four monolingual models specifically trained on Italian data, namely AlBERTo [31],5 UmBERTo [32],6 and dbmdz’s Italian BERT models [33] in their base7 and xxl8 versions (henceforth referred to as BERT-it base and BERT-it xxl). Moreover, we employ widespread multilingual models that include Italian in pretraining data, namely mBERT [34]9 and XLM-RoBERTa [35].10 For fine-tuning, we use the MaChAmp toolkit (v0.4.2) [36] and select the best hyperparameter configuration based on average Pos F1 score on the development sets (Appendix C). As regards decoder-based models, we choose two Italian and two multilingual models, all instruction-tuned. Specifically, we select LlaMAntino-3-ANITA-8B [37]11 and Minerva-7B [38]12 as monolingual models, while we use Qwen2.5-7B [39]13 and Llama3.1-8B [40]14 as multilingual models. We choose these models because they are widely used, freely available, and do not require very large computational resources that could be impractical in real-world scenarios. Predicted labels are extracted from models’ outputs using regular expressions. If no matching label is found in the output,15 the response is recorded as “unknown”. Hyperparameter details are in Appendix C. Overall, we employ six encoder-based and four decoder-based models, for a total of ten models.

5 Version: m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0
6 Version: Musixmatch/umberto-commoncrawl-cased-v1
7 Version: dbmdz/bert-base-italian-uncased
8 Version: dbmdz/bert-base-italian-xxl-uncased
9 Version: google-bert/bert-base-multilingual-cased
10 Version: FacebookAI/xlm-roberta-base
11 Version: swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
12 Version: sapienzanlp/Minerva-7B-instruct-v1.0
13 Version: Qwen/Qwen2.5-7B-Instruct
14 Version: meta-llama/Meta-Llama-3.1-8B-Instruct
15 Allowed labels for factuality/verifiability: {factual, fattuale, not[-_ ]factual, non[-_ ]fattuale}; allowed labels for check-worthiness: {check[-_ ]worthy, not[-_ ]check-worthy, non[-_ ]check-worthy}.

Prompts and example sets For decoder-based models, we design prompts in two languages (Italian and English) with or without annotation guidelines, leading to four different prompt configurations: Italian with guidelines (it_g), Italian without guidelines (it_ng), English with guidelines (en_g), and English without guidelines (en_ng). All models are prompted in a few-shot setup with five carefully-selected examples of posts and associated labels (Section 4.2).16 All prompts are in Appendix D.

16 Testing a smaller/larger number of examples is left for future work.

Multi-task fine-tuning and sequential prompting We hypothesize that factuality/verifiability information can help to predict the check-worthiness of a post. We thus design different fine-tuning and prompting settings for encoder- and decoder-based models, respectively, to test this hypothesis. Specifically, for encoder-based models we compare a standard single task approach (i.e., fine-tuning a model with check-worthiness labels only) with an approach that leverages both factuality/verifiability and check-worthiness information in a multi-task learning framework (i.e., using check-worthiness as a main task and factuality/verifiability as an auxiliary task with different task loss weights $\lambda_{fv}$ and $\lambda_{cw}$; see Appendix C). We compute the multi-task learning loss as $\mathcal{L} = \sum_{t} \lambda_t \mathcal{L}_t$, where $\mathcal{L}_t$ is the loss for the task $t$, i.e., either factuality/verifiability (fv) or check-worthiness (cw), and $\lambda_t$ is the weight given to the task. For decoder-based models, we instead test a standard setting in which the models are prompted directly for check-worthiness (not seq) and a two-step sequential prompting approach (seq) (prompts are in Appendix D). In the latter case, the model is first instructed to classify the post based on its factuality/verifiability, then the output label is incorporated into a prompt which instructs the model to assess the check-worthiness of the same post.

Evaluation metrics We use the F1 score for the positive check-worthy class (Pos F1) as our main metric, in line with previous work on check-worthy claim detection [2, 16, 17, inter alia]. For completeness, we also report positive precision and recall scores (Pos Prec and Pos Rec, respectively), as well as accuracy (Acc) for test set results. Since encoder-based models provide confidence scores for the output labels, we also compute mean average precision (mAP) scores for them to get additional insights on performance when ranking posts by check-worthiness. Moreover, for decoder-based models we include the number of “unknown” outputs (i.e., those not matching a label in the label set) to assess their ability to follow the instructions.

4.2. Model Variant and Prompt Selection

We select the most promising setting (i.e., model variant, set of few-shot examples, and prompt configuration) based on average Pos F1 score on the development sets. While for encoder-based approaches the model selection was mainly a matter of tuning hyperparameter values (see Section 4.1 and additional details in Appendix C), for decoder-based models this involved the selection of the most promising set of examples as well as the prompt configuration (i.e., language and guidelines).

Few-shot example set selection We create five different sets of few-shot examples (i.e., post texts and associated labels) by diversifying them across topics and annotation combinations for factuality/verifiability and check-worthiness, focusing on examples that are similar to those that are discussed in the annotation guidelines. Each set is drawn from one of the five training splits used during development and contains five examples. Table 2 reports the composition of each set with respect to topics and annotations. To select the most promising example set to be used in the test phase, we prompt all decoder-based models with these example sets. In Table 2 we also report the Pos F1 obtained by using each example set, averaged on all models, development sets, and prompt configurations across seq and not seq settings (calculated over a total of 138,400 data points).17 Example set #1 leads to the highest average Pos F1 score and also exhibits the smallest standard deviation (Table 2); therefore, we select this set for the test phase (refer to Appendix D for post texts and labels included in the example set). It is worth noting that this is the only set that does not include any post annotated as factual/verifiable but not check-worthy (+-), suggesting that models may learn more effectively from examples that are either both factual/verifiable and check-worthy or neither. In Table 3, we report the percentages of factuality/verifiability and check-worthiness label combinations outputted by models when prompted using each example set over all the possible configurations in the seq setting (69,200 data points). We observe that even if the sets have different distributions of label combinations, this does not significantly influence the distribution of the labels generated by models: in all cases, models frequently produce an invalid pair -fv +cw, while they tend to avoid the opposite one (i.e., +fv -cw).

17 Each development split for decoder-based models consists of 865 examples (i.e., 50% of the training portion; see Section 4.1). Therefore, we have 865 outputs per development set (5×) → 4,325 outputs per model’s configuration (4×) → 17,300 outputs per model (4×) → 69,200 outputs per setting (2×) → 138,400 outputs in total.

18 865 outputs per development set (5×) → 4,325 outputs per example set (5×) → 21,625 outputs in total.
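The output counts in footnotes 17 and 18 follow from multiplying the development-split size by the successive factors listed there. The short check below is only a sketch that mirrors those factors; the function and variable names are our own and are not part of the released code.

```python
# Factors as stated in footnotes 17 and 18; names are illustrative only.
DEV_SPLIT_SIZE = 865  # examples in each development split for decoder-based models


def running_totals(start, factors):
    """Multiply `start` by each factor in turn and return every intermediate total."""
    totals = []
    current = start
    for factor in factors:
        current *= factor
        totals.append(current)
    return totals


# Footnote 17: 5 development sets, 4 prompt configurations, 4 models, 2 settings.
footnote_17 = running_totals(DEV_SPLIT_SIZE, [5, 4, 4, 2])
# Footnote 18: 5 development sets, then a further factor of 5.
footnote_18 = running_totals(DEV_SPLIT_SIZE, [5, 5])

print(footnote_17)
print(footnote_18)
```

The last element of each list corresponds to the totals quoted in the text (138,400 and 21,625 data points, respectively).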

Best prompt selection To select the prompts for the test phase, we compare average Pos F1 scores on the development splits obtained by all decoder-based models when prompted with it_g, it_ng, en_g, and en_ng configurations (21,625 data points for each configuration)18 in both seq and not seq settings. Results are in Table 4. All the best performing models do not use guidelines; therefore, we decide not to include guidelines in the prompts in further experiments. We keep both English and Italian prompt versions for the test phase, as some models perform better with Italian (particularly Minerva). We also observe that the best results in the seq setting are overall higher than in the direct check-worthiness task (i.e., not seq). We keep both settings for testing to better highlight performance differences.

4.3. Results

We compute the results for the selected configurations of encoder- and decoder-based models across the k = 5 test splits, presenting average scores and standard deviations across the applicable metrics as detailed in Section 4.1.

Encoder-based models Results for encoder-based models are shown in Table 5. We observe that using factuality/verifiability as an auxiliary task in a multi-task learning framework helps to improve the Pos F1 performance across all models. The best scores are obtained by BERT-it xxl, followed by UmBERTo and BERT-it base, all fine-tuned in a multi-task setting. Specifically, BERT-it xxl fine-tuned using both factuality/verifiability and check-worthiness information achieves a Pos F1 score of 0.7473 (+1.41 points compared to the single task version) and a mAP score of 0.8095 on the check-worthiness estimation task. Notably, XLM-RoBERTa in a multi-task setting scores only 3.35 points lower than the best BERT-it xxl configuration in terms of Pos F1 score, despite being pretrained on a mixture of languages. It also outperforms AlBERTo in the multi-task setup and obtains comparable results in the single task setting. This suggests that XLM-RoBERTa can be a viable approach for multilingual check-worthiness estimation.

Decoder-based models Results are presented in Table 6. Decoder-based models in a few-shot setup perform slightly worse on average than fine-tuned encoder-based models, but still achieve competitive results. Moreover, three models perform better when prompted in Italian. Notably, LlaMAntino-3-ANITA-8B – despite being pretrained on Italian data – performs better with English prompts and achieves the highest score in the seq setting (i.e., 0.6771 Pos F1 score). The two Italian models, LlaMAntino-3-ANITA-8B and Minerva-7B, reach the best results in the seq setup, while the multilingual models Qwen2.5-7B and Llama3.1-8B perform better when directly prompted for check-worthiness (i.e., in the not seq setup). Overall, factuality/verifiability information does not seem to significantly aid decoder-based models in predicting check-worthiness, as they are unable to leverage this information effectively (see Section 5 for an in-depth analysis). The lowest performance is observed with Minerva-7B, which is also the only model to produce “unknown” outputs – up to an average of 127 “unknown” labels when prompted in English in the seq setting.
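The positive-class scores discussed in this section (Pos Prec, Pos Rec, and Pos F1) are the usual precision, recall, and F1 restricted to the check-worthy class. The sketch below shows one way to compute them; the function name and the toy labels are illustrative assumptions and are not drawn from WorthIt or from the released code.

```python
def positive_class_scores(gold, predicted, positive=1):
    """Return (precision, recall, F1) for the positive class only."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Fall back to 0.0 when a denominator is empty (a common convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy example: 1 = check-worthy, 0 = not check-worthy (invented labels).
toy_gold = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
pos_prec, pos_rec, pos_f1 = positive_class_scores(toy_gold, toy_pred)
print(pos_prec, pos_rec, pos_f1)
```

In a cross-validation setting such as the one used here, these scores would be computed per test split and then averaged.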

5. Analysis and Discussion

Ranking of posts by check-worthiness Aggregate check-worthiness estimation scores (e.g., Pos F1) give a useful picture of models’ performance; however, knowing how the models rank the posts by check-worthiness is paramount for fact-checkers since they can only screen a limited number of posts in their daily work (say, k). In Figure 3, we report the ratio of posts correctly classified as check-worthy within the top-k recommended check-worthy posts (P@k) by all encoder-based models,19 with k ∈ {5, 10, 25, 50, 100}. We observe that P@k is in the range of 0.90–0.95 and 0.80–0.85 points on average when the posts’ screening budget is set to k = 25 and k = 100, respectively. This indicates that these models can help fact-checkers in their daily routine.
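P@k can be obtained by sorting posts by the model's confidence for the check-worthy class and measuring the share of truly check-worthy posts among the first k. The following is a minimal sketch under that reading; the function name, scores, and labels are invented for illustration.

```python
def precision_at_k(scores, gold, k):
    """Share of gold check-worthy posts among the k highest-scored posts."""
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    if k <= 0 or not scores:
        raise ValueError("k must be positive and the input must be non-empty")
    # Rank posts from the highest to the lowest confidence score.
    ranked = sorted(zip(scores, gold), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / len(top_k)


# Toy confidence scores for the check-worthy class and gold labels (1 = check-worthy).
toy_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.10]
toy_gold = [1, 1, 0, 1, 0, 0]

for budget in (2, 3, 5):
    print(budget, precision_at_k(toy_scores, toy_gold, budget))
```

If k exceeds the number of available posts, this sketch divides by the number of posts actually ranked, which is one of several reasonable conventions.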

Relationship between fv and cw To assess whether decoder-based models capture the relationship between factuality/verifiability and check-worthiness, we analyzed their outputs in the seq setup. Figure 4 shows the frequencies of the four possible combinations of labels both in the models’ outputs (i.e., +fv +cw, +fv -cw, -fv +cw, and -fv -cw; calculated over 8,650 data points) and in the manual annotations (2,160 data points). The most frequent label combination in the models’ outputs is +fv +cw, accounting for more than half of the predictions for Minerva-7B and Llama3.1-8B, reaching 66.2% for the latter. Interestingly, the second most frequent combination is -fv +cw: we consider this as problematic, because non-factual or non-verifiable posts should not be classified as check-worthy. This suggests that decoder-based models do not grasp this correlation and instead classify check-worthiness independently. This is a particularly important limitation, as it can potentially lead to fact verification efforts being wasted on content that is not factual. In contrast, all models except LlaMAntino-3-ANITA-8B rarely assign the opposite combination, +fv -cw, which is instead valid within our framework and represents a consistent portion of annotated posts (27.9%). LlaMAntino-3-ANITA-8B favors either two negative labels (-fv -cw) or two positive labels (+fv +cw), while assigning mixed label combinations significantly less often. A side effect of this is that it produces the -fv +cw combination less frequently than the other models. Overall, our analysis shows that models i) tend to avoid the combination +fv -cw, preferring to align the two labels rather than diversifying them, especially when they rely on positive factuality/verifiability, and ii) tend to produce the invalid label combination -fv +cw. We stress that this tendency is not due to the examples given in the prompts (cf. Table 3), but is rather a general preference of those models, which seem to ignore the relation between factuality/verifiability and check-worthiness.

19 In this analysis, we report P@k scores for encoder-based models only since with decoder-based models it is not possible to get confidence scores for labels generated as part of raw outputs.

Correlation between models’ outputs To assess if there is a pairwise correlation between encoder- and decoder-based models’ outputs, we calculate the Pearson correlation coefficient (r) between all models’ predictions. The heatmap in Figure 5 summarizes the results across the k = 5 test splits. We consider the best-performing setup for each model, namely the multi-task setting for encoder-based models (see Table 5) and the setup that led to the best performance for each decoder-based model (i.e., language and setting; see Table 6). Encoder-based models exhibit strong positive mutual correlation (r ≥ 0.65; top-left section in Figure 5), indicating high consistency in the predictions. In contrast, decoder-based models display low inter-model correlation, indicating greater output variability. Among them, LlaMAntino-3-ANITA-8B shows the highest alignment with encoder-based models, reaching r = 0.54 with UmBERTo and BERT-it xxl. Conversely, Minerva-7B consistently shows no or very weak correlation with other models – with r ranging from 0.00 to 0.06 – revealing that its outputs are largely unrelated to those of all other models.
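For binary predictions, the pairwise Pearson correlation can be computed directly from the two 0/1 prediction vectors. The sketch below uses only the standard library; the prediction vectors are invented and the function name is our own.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined when one model predicts a single label.
        raise ValueError("correlation is undefined for constant predictions")
    return cov / math.sqrt(var_x * var_y)


# Invented binary predictions of two models on six posts (1 = check-worthy).
model_a = [1, 1, 0, 0, 1, 0]
model_b = [1, 1, 0, 0, 0, 0]
r_ab = pearson(model_a, model_b)
print(round(r_ab, 4))
```

Applying the function to every pair of models yields the symmetric matrix that a heatmap such as Figure 5 visualizes.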

6. Conclusion

We introduce WorthIt, the first dataset of Italian social media posts annotated for factuality/verifiability and check-worthiness that spans multiple years and topics and includes human label variation. We conduct thorough check-worthiness estimation experiments with encoder- and decoder-based models. Results show that the former models in a multi-task setting reach the best results, while the latter models systematically classify non-factual/verifiable posts as check-worthy, failing to capture the relation between the two concepts.

WorthIt’s partial overlap with a dataset for fallacy detection, faina [30], opens new research avenues for combining the two tasks. Further opportunities include modeling human label variation for the check-worthiness task using the released parallel annotations and experimenting with additional models, training setups, and prompting strategies. Finally, the wide temporal coverage and the diverse set of topics represented in WorthIt open the field to studies on out-of-distribution generalization of check-worthiness estimation models.

Acknowledgments

This work has been funded by the European Union’s Horizon Europe research and innovation program under grant agreement No. 101070190 (AI4Trust). We also gratefully acknowledge funding from the German Federal Ministry of Research, Technology and Space (BMFTR) under the grant 01IS23072 for the Software Campus project MULTIVIEW.

Appendix

A. Search Keywords

We report the full list of search keywords, divided by topic, in Table 7. Within square brackets are the grammatical gender and number variants (if any) that we included for each keyword.

B. Annotation Guidelines

For factuality/verifiability annotation, a post can be either factual/verifiable (i.e., yes label) or non-factual/verifiable (i.e., no). For posts that are factual/verifiable, a check-worthiness label on a 5-point Likert scale must also be assigned. Possible labels are: definitely yes, probably yes, neither yes nor no, probably no, and definitely no. For both annotation tasks, we strictly follow the guidelines by Nakov et al. [2] and translate them to Italian. The annotation guidelines are presented below.

Factuality/verifiability
Italian: “Il post contiene un’affermazione fattuale che può essere verificata? A titolo di esempio, sono fattuali/verificabili i post che riportano una definizione, menzionano una quantità nel presente o nel passato, fanno una previsione verificabile del futuro, fanno riferimento a leggi, procedure e norme operative, discutono di immagini o video, e indicano correlazioni o causalità.”
English: “Does the post contain a factual claim that can be verified? For example, posts are factual/verifiable when they report a definition, mention a quantity in the present or in the past, make a verifiable prediction about the future, refer to laws, procedures, and operating rules, discuss images or videos, or state correlations or causation.”

Check-worthiness
Italian: “Credi che l’affermazione contenuta nel post dovrebbe essere verificata da un fact-checker professionista? Questa domanda richiede un giudizio soggettivo basato sulle seguenti domande: 1. L’affermazione espressa nel post potrebbe essere falsa? 2. L’affermazione espressa nel post potrebbe essere di interesse pubblico e/o avere impatto sulla collettività? 3. L’affermazione espressa nel post potrebbe danneggiare la società, un gruppo, un singolo o un’entità? L’annotazione è necessaria solo se il post è stato classificato come fattuale/verificabile. Nota: affermazioni facilmente verificabili dagli utenti (es. “Gli abitanti della Cina sono la metà di quelli dell’Italia”) non sono da ritenere check-worthy.”
English: “Do you think the claim contained in the post should be verified by a professional fact-checker? This question requires a subjective judgment based on the following questions: 1. Could the claim expressed in the post be false? 2. Could the claim expressed in the post be of public interest and/or have an impact on the community? 3. Could the claim expressed in the post harm society, a group, an individual, or an entity? The annotation is required only if the post has been classified as factual/verifiable. Note: claims that users can easily verify (e.g., “The inhabitants of China are half those of Italy”) are not to be considered check-worthy.”

In the guidelines, we further include information on how to deal with special cases to minimize ambiguity. All the cases provided to annotators are outlined below.

☞ Reported speech, including quotations, references to newspaper and TV, is always factual/verifiable. E.g.: “‘What is done to migrants is criminal’ #PopeFrancis on #CTCF #Rai3” is factual/verifiable
☞ Generic sentences are not factual/verifiable because they contain imprecise information (e.g., frequent use of indefinite quantifiers such as various, some, many). E.g.: “Three months after the collapse of the #MorandiBridge. From the government only many promises, zero facts and a totally insufficient decree.” is not factual/verifiable
☞ If the claim is in a subordinate clause, the post is not factual/verifiable. However, it is factual/verifiable if the claim is salient and conveys the main information. E.g.: “Dear #novax who appeals to art.32 of the Constitution, you should know that the Constitutional Court with ruling no. 307/1990 has decided that a treatment can become mandatory if it serves to protect oneself and the health of others. So, if needed, you vaccinate or leave.” is factual/verifiable
☞ Personal opinions are not factual/verifiable, as there is no clear evidence to support them. E.g.: “Put Salvini back at the Interior Ministry, he is the only one who can handle migrants arrivals.” is not factual/verifiable
☞ When the implicit subject can be reconstructed, the sentence can be factual/verifiable. E.g.: “When he was minister and closed the ports he said go ahead and prosecute me. Then he was investigated and hid behind parliamentary immunity. When he was minister he insulted Carola Rackete. Then they propose him a TV debate with her and he declines the invitation. And they call him Captain.” is factual/verifiable
☞ Descriptions of images/videos with URLs are factual/verifiable when they contain an externally verifiable fact. E.g.: “I receive directly from a Sudanese boy these images. The migrants are leaving the UNHCR center 15 km from #Agadez and marching towards the city.” is factual/verifiable
☞ Posts about weather conditions or temperatures are considered factual/verifiable when the information is precise, i.e., when they specify the type of event described and the exact location and time. Posts about temperature are not check-worthy. E.g.: “The situation now in #Catania. I think there is a small problem with climate change. [URL]” is not factual/verifiable
☞ Posts describing events (demonstrations, marches, strikes, rallies, initiatives, assemblies, meetings, presentations) are always factual/verifiable. They can include the expressions everyone for, see you on, together with. They are generally not check-worthy. E.g.: “#StopFalsePromises! In the streets of Rome with [USER] for global climate strike! #ClimateStrike” is factual/verifiable

Table 7: Search keywords used for collecting posts in WorthIt, with grammatical gender and number variants (if any) indicated using square brackets. Note that these exactly match the keywords that have been used to collect the faina dataset [30].

Migration: apolid[e,i]; apolidia; centr[o,i] di accoglienza; centr[o,i] di identificazione ed espulsione; centr[o,i] di permanenza per il rimpatrio; centri di permanenza per i rimpatri; centr[o,i] di permanenza temporanea; centr[o,i] per il rimpatrio; centri per i rimpatri; corridio[io,i] umanitar[io,i]; domand[a,e] d’asilo; domand[a,e] di asilo; emigrant[e,i]; emigrat[o,i,a,e]; emigrazion[e,i]; espatr[io,i]; fattor[e,i] di spinta; immigrant[e,i]; immigrat[o,i,a,e]; immigrazion[e,i]; ius sanguinis; migrant[e,i]; migrator[io,i,ia,ie]; migrazion[e,i]; minor[e,i] stranier[o,i] non accompagnat[o,i]; minor[e,i] stranier[a,e] non accompagnat[a,e]; non-refoulemen[t,ts]; permess[o,i] di soggiorno; procedur[a,e] d’asilo; procedur[a,e] di asilo; protezion[e,i] sussidiari[a,e]; protezion[e,i] umanitari[a,e]; push facto[r,rs]; refoulemen[t,ts]; reinsediament[o,i]; respingiment[o,i]; richiedent[e,i] asilo; rifugiat[o,i,a,e]; rimpatr[io,i]; rimpatriat[o,i,a,e]; sfollat[o,i,a,e]; vittim[a,e] della tratta; vittim[a,e] di tratta

Climate change: acidificazione dell’oceano; acidificazione degli oceani; aerosol atmosferic[o,i]; allagament[o,i]; alluvion[e,i]; alluvional[e,i]; ambientalismo di facciata; anidride carbonica; antropocene; aridità; bilanc[io,i] climatic[o,i]; bilanc[io,i] energetic[o,i]; bilanc[io,i] idrologic[o,i]; biocombustibil[e,i]; biodegradabil[e,i]; biodegradabilità; biodiversità; biossido di carbonio; cambiament[o,i] climatic[o,i]; cambiament[o,i] del clima; carbon cost; carbon footprint; carbon pricing; carbon tax; cost[o,i] del carbonio; climate; climate change;
climate cris[is,es]; climatic[o,a,i,he]; climatologia; co2; combustibil[e,i] fossil[e,i]; confin[e,i] planetar[io,i]; consum[o,i] di suolo; crisi climatic[a,he]; deforestazion[e,i]; desalinizzazion[e,i]; desertificazion[e,i]; diossido di carbonio; disboscament[o,i]; dissalazion[e,i]; ecological footprint; ecologismo di facciata; economi[a,e] circolar[e,i]; effetto serra; emission[e,i]; energi[a,e] rinnovabil[e,i]; esondazion[e,i]; event[o,i] meteorologic[o,i] estrem[o,i]; fenomen[o,i] meteorologic[o,i] estrem[o,i]; finanza sostenibile; fonte di energia rinnovabile; fonti di energia rinnovabil[e,i]; forzant[e,i] radiativ[o,i]; gas serra; gas silvestre; glacialism[o,i]; glaciazion[e,i]; greenwashing; impronta carbonica; impronta di carbonio; impronta ecologica; innalzamento de[l,i] mar[e,i]; innalzamento del livello de[l,i] mar[e,i]; innalzamento dei livelli de[l,i] mar[e,i]; inondazion[e,i]; inquinamento atmosferico; inquinamento dell’atmosfera; isol[a,e] di calore; isol[a,e] urban[a,e] di calore; limit[e,i] planetar[io,i]; meteorologia; microclima; mobilità sostenibile; mutament[o,i] climatic[o,i]; olocene; ondat[a,e] di caldo; ondat[a,e] di calore; paleoclima; particellato; particolato; pedoclima; permafrost; permagelo; prezz[o,i] del carbonio; proiezion[e,i] climatic[a,he]; report di sostenibilità; riscaldamento climatico; riscaldamento globale; risch[io,i] climatic[o,i]; scenar[io,i] climatic[o,i]; sciogliment[o,i] dei ghiacciai; siccità; sistem[a,i] climatic[o,i]; sostenibilità ambientale; surriscaldamento climatico; surriscaldamento globale; svilupp[o,i] sostenibil[e,i]; tass[a,e] sul carbonio; transizion[e,i] ecologic[a,he]; transizion[e,i] energetic[a,he]; uso d[el,i] suolo; utilizzazion[e,i] del suolo; utilizzo d[el,i] suolo; variabilità climatic[a,he]

Public health: agend[a,e] di prenotazione; alfabetizzazione alla salute; alfabetizzazione sanitaria; assistenz[a,e] domiciliar[e,i]; assistenz[a,e] ospedalier[a,e]; assistenz[a,e] sanitari[a,e]; assistenza
universale; aziend[a,e] ospedalier[a,e]; aziend[a,e] sanitari[a,e]; bisogn[o,i] sanitar[io,i]; calendar[io,i] di prenotazione; caric[o,hi] di malattia; centro unificato di prenotazione; città san[a,e]; class[e,i] di priorità; comportament[o,i] a rischio; comportament[o,i] di salute; copertur[a,e] sanitari[a,e]; copertur[a,e] universal[e,i]; cur[a,e] medic[a,he]; cur[a,e] sanitari[a,e]; degent[e,i]; degenz[a,e]; determinant[e,i] della salute; determinant[e,i] di salute; dimission[e,i] ospedalier[a,e]; dispositiv[o,i] medic[o,i]; disuguaglianz[a,e] di salute; disuguaglianz[a,e] nella salute; disuguaglianz[a,e] sanitari[a,e]; educazione alla salute; educazione sanitaria; epidemi[a,e]; epidemic[o,a,i,he]; epidemiologia; epidemiologic[o,a,i,he]; equità di salute; equità nella salute; equità sanitari[a,e]; esenzion[e,i] dal ticket; esenzion[e,i] ticket; fattor[e,i] di rischio; indicator[e,i] di salute; investiment[o,i] nella sanità; investiment[o,i] per la salute; investiment[o,i] per la sanità; isol[a,e] san[a,e]; istitut[o,i] di cura; istituto di sanità pubblica; istituto superiore di sanità; list[a,e] di attesa; malatti[a,e] infettiv[a,e]; ministero della salute; ministero della sanità; misur[a,e] sanitari[a,e]; ospedali; ospedalier[o,i,a,e]; ospedalizzazion[e,i]; ospitalizzazion[e,i]; pandemi[a,e]; politic[a,he] sanitari[a,e]; post[o,i] letto; prestazion[e,i] ambulatorial[e,i]; prestazion[e,i] sanitari[a,e]; prestazion[e,i] specialistic[a,he] ambulatorial[e,i]; prevenzione delle malattie; prevenzione di malattie; prevenzione primaria; prevenzione sanitaria; prevenzione secondaria; prevenzione terziaria; programmazion[e,i] sanitari[a,e]; promozione della salute; promozione di salute; pronto soccorso; ricover[o,i]; salute globale; salute per tutti; salute pubblica; sanità; sanità pubblica; sanitar[io,i,ia,ie]; serviz[io,i] infermieristic[o,i]; serviz[io,i] medic[o,i]; serviz[io,i] sanitar[io,i]; settor[e,i] sanitar[io,i]; sicurezza dell[a,e] cur[a,e]; struttur[a,e] di 
ricovero; struttur[a,e] ospedalier[a,e]; struttur[a,e] sanitari[a,e]; terapi[a,e] intensiv[a,e]; trattament[o,i] di salute; trattament[o,i] medic[o,i]; trattament[o,i] sanitar[io,i]; uguaglianz[a,e] di salute; uguaglianz[a,e] nella salute; uguaglianz[a,e] sanitari[a,e]; vaccin[o,i]; vaccinazion[e,i]

C. Hyperparameters

For encoder-based models, we use default MaChAmp (v0.4.2) [36] hyperparameter values and tune the most crucial ones during development. The search space for them is indicated within brackets in Table 8, with best values underlined. The best loss weight value for the auxiliary factuality/verifiability task is set to 0.50 for UmBERTo, BERT-it base, and mBERT, to 0.75 for XLM-RoBERTa, and to 1.00 for AlBERTo and BERT-it xxl.
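To make the role of the auxiliary loss weight concrete, the sketch below combines the two task losses as a weighted sum. This is an illustration under assumptions rather than MaChAmp's actual code: the main check-worthiness task is assumed to have weight 1.0, and the function and dictionary names are hypothetical. Only the weight values come from the text above.

```python
# Best auxiliary loss weights per model, as reported in the text.
BEST_FV_WEIGHT = {
    "UmBERTo": 0.50,
    "BERT-it base": 0.50,
    "mBERT": 0.50,
    "XLM-RoBERTa": 0.75,
    "AlBERTo": 1.00,
    "BERT-it xxl": 1.00,
}


def multitask_loss(cw_loss: float, fv_loss: float, fv_weight: float) -> float:
    """Weighted sum of the main (cw) and auxiliary (fv) task losses.

    Assumption: the main task is weighted 1.0 and only the auxiliary
    factuality/verifiability loss is scaled by `fv_weight`.
    """
    return cw_loss + fv_weight * fv_loss


# Example with made-up loss values for a single batch.
total = multitask_loss(cw_loss=0.8, fv_loss=0.4, fv_weight=BEST_FV_WEIGHT["XLM-RoBERTa"])
```

With a weight of 1.00 both tasks contribute equally to the update, whereas 0.50 halves the influence of the auxiliary task.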

For decoder-based models, we use the Hugging Face Transformers library with default hyperparameter values, setting the max_new_tokens parameter to 30. Since all models are instruction-tuned, we structure our inputs as conversational prompts using the following format: {"role": "user", "content": "prompt"}.
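A minimal sketch of how such a conversational input could be assembled is given below. It assumes `$`-style placeholders such as `$POST_TEXT`, as used in the prompt templates of Appendix D; the template layout, the helper name, and the example post are illustrative assumptions, not the authors' code.

```python
from string import Template

# Generation setting reported in the text; all other values stay at defaults.
GENERATION_KWARGS = {"max_new_tokens": 30}

# Hypothetical zero-shot template: the instruction wording follows Appendix D,
# but the exact arrangement of lines is an assumption.
FV_TEMPLATE = Template(
    'Classify the post as "factual" or "not factual".\n'
    'Answer only with "factual" or "not factual".\n'
    "$POST_TEXT = "
)


def build_messages(template: Template, **fields: str) -> list:
    """Fill the placeholders and wrap the prompt as a single user turn."""
    prompt = template.substitute(**fields)
    return [{"role": "user", "content": prompt}]


messages = build_messages(FV_TEMPLATE, POST_TEXT="lei pensa ai fratelli migranti in serbia [URL]")
```

A list in this shape is what chat-oriented generation interfaces typically accept; it would be passed to the model together with `GENERATION_KWARGS`, and the short token budget keeps the output close to a bare label.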

D. Prompts and Examples

We present the prompt templates used for factuality/verifiability and check-worthiness tasks. For prompts using guidelines, $[FV|CW]_GUIDELINES placeholders are replaced with text in the desired language from Table 9. $[FV|CW]_EXAMPLES placeholders are replaced with [...] Ora”) is included only in the seq setting, with $FV_LABEL representing the factuality/verifiability label obtained for the same post using the factuality/verifiability prompt.

Prompt for factuality/verifiability (en)
Classify the post as “factual” or “not factual”.
Answer only with “factual” or “not factual”.
$FV_GUIDELINES
$FV_EXAMPLES
$POST_TEXT =

Prompt for factuality/verifiability (it)
Classifica il post come “fattuale” o “non fattuale”.
Rispondi solo con “fattuale” o “non fattuale”.
$FV_GUIDELINES
$FV_EXAMPLES
$POST_TEXT =

Prompt for check-worthiness [...]
$CW_GUIDELINES
$CW_EXAMPLES
$POST_TEXT =

Table 9: Guidelines for both tasks in Italian and English used for prompting decoder-based models in configurations with guidelines.

$FV_GUIDELINES
Italian: “Linee guida:\\Un post è fattuale quando contiene informazioni salienti che possono essere verificate esternamente. Tali informazioni possono essere trovate ovunque, comprese subordinate, sostantivi e hashtag. I discorsi riportati e le citazioni sono sempre fattuali. Anche i post che descrivono eventi e attività sono sempre fattuali. I post sul meteo o sulla temperatura e le descrizioni di foto e video sono fattuali solo quando le informazioni sono precise e la località è nota. Al contrario, le affermazioni generiche o vaghe e le opinioni personali non sono fattuali perché non esistono prove chiare a sostegno.”
English: “Guidelines:\\A post is factual when it contains salient information that can be externally verified. Such information can be found everywhere, including subordinates clauses, nouns and hashtags. Reported discourses and references are always factual. Similarly, posts describing events and activities are always factual. Posts about weather or temperature, as well as photo and video descriptions, are factual only when the information is precise and the location is known. On the other hand, generic or vague statements and personal opinions are not factual because there is no clear evidence to support them.”

$CW_GUIDELINES
Italian: “Linee guida:\\Un post può essere check-worthy solo se è fattuale. Un post è considerato check-worthy se è rilevante per la società e può causare danno o modificare le opinioni delle persone. Le affermazioni generiche e le opinioni non sono check-worthy. I post che descrivono eventi climatici e meteorologici di solito non sono check-worthy perché non contengono informazioni sensibili. Allo stesso modo, i post che menzionano che una specifica attività è in corso di svolgimento di solito non sono check-worthy.”
English: “Guidelines:\\A post can be check-worthy only if it is factual. A post is check-worthy if it is relevant to society and can cause harm or modify people’s opinions. Generic statements and opinions are not check-worthy. Posts describing climate and weather events are usually not check-worthy because they do not contain sensitive information. Similarly, posts mentioning that a specific activity is taking place are usually not check-worthy.”

Examples used for few-shot decoder-based models’ prompting on the test set. Examples refer to set #1 (see Table 2).

Italian: è solo maggio. e questo #caldo mi terrorizza. ecco. l’ho detto. #crisiclimatica cosa diamine stiamo aspettando???
English: it’s only May. and this #heat terrifies me. there. I said it. #climatecrisis what the hell are we waiting for???

Italian: ma è tipo la seconda volta che i rifugiati recuperati in mare sono 49. mi è preso il sospetto che la libia stia trollando salvini.
English: but it’s like the second time that the refugees rescued at sea are 49. I got the suspicion that Libya is trolling Salvini.

Italian: ho scritto e riscritto che #inceneritore è proposta anti-europea: ue avrebbe eliminato esenzione dell’incenerimento dal pagamento co2 non più tardi del 2028 perché dannoso e rendendolo ancora meno conveniente. sono stato smentito: oggi hanno votato. dal 2026! [URL]
English: I’ve written and rewritten that the #incinerator is an anti-European proposal: the EU would have removed the exemption of incineration from CO2 payments no later than 2028 because it’s harmful, making it even less cost-effective. I was contradicted: they voted today. from 2026! [URL]

Italian: il fatto che zaia rivoglia il personale “novax” sospeso è la certificazione del danno procurato alla salute pubblica per scelte politiche scellerate e criminali. semplice.
English: the fact that Zaia wants the suspended “novax” staff back is proof of the damage caused to public health by reckless and criminal political decisions. simple.

Italian: lei pensa ai fratelli migranti in serbia [URL]
English: she thinks of the migrant brothers in Serbia [URL]

During the preparation of this work, the author(s) did not use any generative AI tools or services.

References

[1] Z. Guo, M. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Transactions of the Association for Computational Linguistics 10 (2022) 178–206. doi:10.1162/tacl_a_00454.
[2] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, Bologna, Italy, 2022. URL: https://ceur-ws.org/Vol-3180/paper-28.pdf.
[3] R. Panchendrarajan, A. Zubiaga, Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research, Natural Language Processing Journal 7 (2024) 100066. doi:10.1016/j.nlp.2024.100066.
[4] L. Konstantinovskiy, O. Price, M. Babakar, A. Zubiaga, Toward automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection, Digital Threats 2 (2021). doi:10.1145/3412869.
[5] A. Das, H. Liu, V. Kovatchev, M. Lease, The state of human-centered NLP technology for fact-checking, Information Processing & Management 60 (2023) 103219. doi:10.1016/j.ipm.2022.103219.
[6] P. Atanasova, L. Màrquez, A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness, in: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, Avignon, France, 2018. URL: https://ceur-ws.org/Vol-2125/invited_paper_13.pdf.
[7] A. Ramponi, B. Plank, Neural unsupervised domain adaptation in NLP-A survey, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 6838–6855. URL: https://aclanthology.org/2020.coling-main.603/. doi:10.18653/v1/2020.coling-main.603.
[8] B. Plank, The “problem” of human label variation: On ground truth in data, modeling and evaluation, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 10671–10682. URL: https://aclanthology.org/2022.emnlp-main.731/. doi:10.18653/v1/2022.emnlp-main.731.
[9] M. Poesio, R. Artstein, The reliability of anaphoric annotation, reconsidered: Taking ambiguity into account, in: Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 76–83. URL: https://aclanthology.org/W05-0311/.
[10] L. Aroyo, C. Welty, Truth is a lie: Crowd truth and the seven myths of human annotation, AI Magazine 36 (2015) 15–24. doi:10.1609/aimag.v36i1.2564.
[11] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, Proceedings of the AAAI Conference on Artificial Intelligence 37 (2023) 6860–6868. doi:10.1609/aaai.v37i6.25840.
[12] Y. Nie, X. Zhou, M. Bansal, What can we learn from collective human opinions on natural language inference data?, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 9131–9143. URL: https://aclanthology.org/2020.emnlp-main.734/. doi:10.18653/v1/2020.emnlp-main.734.
[13] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. Da San Martino, Overview of the CLEF-2019 CheckThat! lab: Automatic identification and verification of claims. Task 1: Check-worthiness, in: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, Lugano, Switzerland, 2019. URL: https://ceur-ws.org/Vol-2380/paper_269.pdf.
[14] S. Shaar, A. Nikolov, N. Babulkov, F. Alam, [...] R. Suwaileh, F. Haouari, G. Da San Martino, P. Nakov, Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media, in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, Thessaloniki, Greece, 2020. URL: https://ceur-ws.org/Vol-2696/paper_265.pdf.
[15] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, J. Beltrán, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 1 [...], in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, Bucharest, Romania, 2021. URL: https://ceur-ws.org/Vol-2936/paper-28.pdf.
[16] F. Alam, A. Barrón-Cedeño, G. S. Cheema, G. K. Shahi, S. Hakimov, M. Hasanain, C. Li, R. Míguez, [...] Overview of the CLEF-2023 CheckThat! lab task 1 on check-worthiness [...], in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), CEUR-WS.org, Thessaloniki, Greece, 2023. URL: https://ceur-ws.org/Vol-3497/paper-019.pdf.
[17] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov, F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation [...], in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR-WS.org, Grenoble, France, 2024. URL: https://ceur-ws.org/Vol-3740/paper-24.pdf.
[18] N. Salek Faramarzi, F. Hashemi Chaleshtori, H. Shirazi, I. Ray, R. Banerjee, Claim extraction and dynamic stance detection in COVID-19 tweets, in: Companion Proceedings of the ACM Web Conference 2023, Association for Computing Machinery, New York, NY, USA, 2023, pp. 1059–1068. doi:10.1145/3543873.3587643.
[19] R. Dhar, D. Das, Leveraging expectation maximization for identifying claims in low resource Indian languages, in: S. Bandyopadhyay, S. L. Devi, P. Bhattacharyya (Eds.), Proceedings of the 18th International Conference on Natural Language Processing (ICON), NLP Association of India (NLPAI), Silchar, India, 2021, pp. 307–312. URL: https://aclanthology.org/2021.icon-main.37.
[20] L. Gili, L. Passaro, T. Caselli, Check-IT!: A corpus [...], in: F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop Proceedings, Venice, Italy, 2023, pp. 227–235. URL: https://aclanthology.org/2023.clicit-1.29/.
[21] Scaiella, Costanzo, Passone, D. Croce, [...] for fact verification in Italian, in: F. Dell'Orletta, [...] (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 898–908. URL: https://aclanthology.org/2024.clicit-1.97/.
[22] P. Atanasova, D. Wright, I. Augenstein, Generating label cohesive and well-formed adversarial claims, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 3168–3177. URL: https://aclanthology.org/2020.emnlp-main.256/. doi:10.18653/v1/2020.emnlp-main.256.
[23] G. Valer, A. Ramponi, S. Tonelli, When you doubt, [...] Italian under domain shift, in: F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop Proceedings, Venice, Italy, 2023, pp. 433–440. URL: https://aclanthology.org/2023.clicit-1.52/.
[24] E. M. Williams, P. Rodrigues, S. Tran, Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation, in: G. Faggioli, [...] (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, Bucharest, Romania, 2021, pp. 659–669. URL: https://ceur-ws.org/Vol-2936/paper-55.pdf.
[25] R. A. Frick, I. Vogel, J.-E. Choi, Fraunhofer SIT at CheckThat! 2023: Enhancing the detection of [...] optical character recognition and model souping, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), CEUR-WS.org, Thessaloniki, Greece, 2023, pp. 337–350. URL: https://ceur-ws.org/Vol-3497/paper-029.pdf.
[26] M. Sawinski, K. Wecel, E. Ksiezniak, M. Strózyna, W. Lewoniewski, P. Stolarski, W. Abramowicz, OpenFact at CheckThat! 2023: Head-to-head GPT vs. BERT - A comparative study of transformers language models for the detection of check-worthy claims, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), CEUR-WS.org, Thessaloniki, Greece, 2023, pp. 453–472. URL: https://ceur-ws.org/Vol-3497/paper-040.pdf.
[27] A. Savchev, AI Rational at CheckThat! 2022: Using transformer models for tweet classification, in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, Bologna, Italy, 2022, pp. 656–659. URL: https://ceur-ws.org/Vol-3180/paper-52.pdf.
[28] Y. Li, R. Panchendrarajan, A. Zubiaga, FactFinders at CheckThat! 2024: Refining check-worthy statement detection with LLMs through data pruning, in: [...] (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), CEUR-WS.org, Grenoble, France, 2024, pp. 520–537. URL: https://ceur-ws.org/Vol-3740/paper-47.pdf.
[29] A. F. Hayes, K. Krippendorff, Answering the call for a standard reliability measure for coding data, Communication Methods and Measures 1 (2007) 77–89. doi:10.1080/19312450709336664.
[30] A. Ramponi, A. Daffara, S. Tonelli, Fine-grained fallacy detection with human label variation, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 762–784. URL: https://aclanthology.org/2025.naacl-long.34/. doi:10.18653/v1/2025.naacl-long.34.
[31] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, [...] standing model for NLP challenging tasks based [...] (Eds.), Proceedings of the Sixth Italian Conference on Computational Linguistics, CEUR-WS.org, Bari, Italy, 2019. URL: https://ceur-ws.org/Vol-2481/paper57.pdf.
[32] L. Parisi, S. Francia, P. Magnani, UmBERTo: [...] word masking, 2020. URL: https://github.com/musixmatchresearch/umberto, accessed: 2025-05-01.
[33] S. Schweter, Italian BERT and ELECTRA models, 2020. doi:10.5281/zenodo.4263142, accessed: 2025-05-01.
[34] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423/. doi:10.18653/v1/N19-1423.
[35] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, [...] Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747/. doi:10.18653/v1/2020.acl-main.747.
[36] R. van der Goot, A. Üstün, A. Ramponi, I. Sharaf, [...] in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 176–197. URL: https://aclanthology.org/2021.eacl-demos.22/. doi:10.18653/v1/2021.eacl-demos.22.
[37] M. Polignano, P. Basile, G. Semeraro, LLaMAntino-3-ANITA-8B-Inst-DPO-ITA model, 2024. URL: [...]LLaMAntino-3-ANITA-8B-Inst-DPO-ITA, accessed: 2025-05-01.
[38] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian [...], in: [...] R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 707–719. URL: https://aclanthology.org/2024.clicit-1.77/.
[39] Qwen, A. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, [...] J. Lin, K. Dang, et al., Qwen2.5 technical report, arXiv preprint arXiv:2412.15115 (2025). URL: https://arxiv.org/abs/2412.15115.
[40] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, [...] The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024). URL: https://arxiv.org/abs/2407.21783.