=Paper=
{{Paper
|id=Vol-3878/133_calamita_long
|storemode=property
|title=GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/133_calamita_long.pdf
|volume=Vol-3878
|authors=Andrea Zaninello,Bernardo Magnini
|dblpUrl=https://dblp.org/rec/conf/clic-it/ZaninelloM24
}}
==GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge==
GEESE - Generating and Evaluating Explanations for
Semantic Entailment: A CALAMITA Challenge
Andrea Zaninello1,2,∗ , Bernardo Magnini1
1
Fondazione Bruno Kessler, Trento (Italy)
2
Free University of Bozen-Bolzano (Italy)
Abstract
In the GEESE challenge, we present a pipeline to evaluate generated explanations for the task of Recognizing Textual
Entailment (RTE) in Italian. The challenge focuses on evaluating the impact of generated explanations on the predictive
performance of language models. Using a dataset enriched with human-written explanations, we employ two large language
models (LLMs) to generate and utilize explanations for semantic relationships between sentence pairs. Our methodology
assesses the quality of generated explanations by measuring changes in prediction accuracy when explanations are provided.
Through reproducible experimentation, we establish benchmarks against various baseline approaches, demonstrating the
potential of explanation injection to enhance model interpretability and performance.
Keywords
CALAMITA, CLiC-it, Explanation generation, Explainability, RTE, Recognizing Textual Entailment, Inference, Italian
1. Introduction and Motivation As a consequence, the need to develop methods to un-
derstand their reasoning is becoming central. Many re-
The ability of a machine to justify its predictions and cent efforts have been devoted to explaining such models
provide human-understandable explanations has been [13], and the importance of interpretability and explain-
a key research objective of Machine Learning (ML) and ability in AI has become ever more urgent [14, 15, 16].
Artificial Intelligence (AI) since their early stages [1, 2, 3]. The role of explanations in NLP has been explored by
In the past few years, the field of AI has experienced a consistent body of research. Cambria et al. [17], for
an unprecedented acceleration in most areas, such as instance, provides a comprehensive survey of approaches
computer vision [4], audio [5], video [6], and program- for generating natural language explanations; Hartmann
ming languages [7], and especially in Natural Language and Sonntag [18] examines the benefits of explanations
Processing (NLP), with the popularization of generative for NLP models; Paranjape et al. [19] focuses on template-
Large Language Models (LLMs) such as OpenAI’s Chat- based explanations, Lampinen et al. [20] and Ye and Dur-
GPT [8], Google’s Gemini [9], or Meta’s Llama [10]. rett [21] demonstrate the benefits of in-context explana-
These models are currently able to produce natural- tions for large models in challenging reasoning tasks.
sounding and coherent language, often indistinguishable Explanation generation quality has traditionally been
from natural language [11, 12]. While these results open evaluated through automated ovelap metrics like BLEU
up new avenues for future applications and research, [22], ROUGE [23], or BERT-Score [24] against a gold
they also raise ethical issues considering the ubiquitous reference explanation written by humans. This usually
role of machines in our lives, and in sensitive fields like implies costly human-explanation collection campaigns;
education, health, justice, and private life. In fact, the additionally, these measures may neither fully capture
scarce transparency of neural architectures makes it hard the informativity or the effectiveness of an explanation,
to interpret their functioning (the so-called ”black-box” nor faithfully reflect human judgments.
problem). In addition, many of the currently available Recently, human simulatability scores have been pro-
LLMs are not fully open-source, so the data they were posed as an alternative method to understand the quality
trained on is not known to either researchers or the gen- of explanations from the perspective of the “utility to
eral public. Finally, these models have achieved such an end-user” [25]. Rather than focusing on the over-
sizes that their results are difficult to replicate, making lap between explanations and ground-truth data, this
them a kind of ”black box in a black box”. approach assesses how explanations enhance predictive
performance on a downstream task compared to the input
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics,
Dec 04 — 06, 2024, Pisa, Italy alone. While humans have traditionally been the predic-
∗
Corresponding author. tors [26], recent research has demonstrated that trained
Envelope-Open azaninello@fbk.eu (A. Zaninello); magnini@fbk.eu (B. Magnini) models can automate this process, showing moderate to
GLOBE https://github.com/andreazaninello (A. Zaninello) strong correlations with human judgments [27]. Pruthi
Orcid 0000-0001-9998-1942 (A. Zaninello) et al. [28], for instance, measures explanation quality
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
based on downstream performance: their methodology • assess the quality of the generated explanations
involves training a student model on explanations gener- 𝐸𝑔𝑒𝑛 by taking the delta between prediction accu-
ated by a teacher, using automatic explanation generation racy with and without explanation as a proxy of
techniques and training the student for the end task. explanations’ quality.
However, current LLMs may also benefit from expla-
nation injection even if they are not explicitly trained toStep 1: Generate Explanation: A first LLM (𝑀1 ) is
do so, and some works suggest using the explanation to prompted to produce explanations 𝐸𝑔𝑒𝑛 = {𝑒1 , 𝑒2 , … 𝑒𝑛 } for
augment the input to condition predictions of future data a specific semantic relation 𝑟𝑐 holding between a given
points on both the input and the explanation [29, 27]. In sentence pair, denoted as < 𝑠1 , 𝑠2 >. In the task, we focus
fact, LLMs are capable of understanding supplementary on the entailment relationship, which can take three val-
input content and including explanations in the input dur- ues: ”YES” (sentence 1 is entailed by sentence 2), ”NO”
ing inference without requiring additional supervision, (sentence 1 is contradicted by sentence 2), ”UNKNOWN”
which can indirectly demonstrate the role of explanations (sentence 1 is neither entailed nor contradicted by sen-
in the inference process. tence 2). In our baselines, we focus on one explanation
These observations underline two crucial aspects: type (why-explanation), but other kinds of explanations
or reasoning strategies (like counterfactual or example-
• providing LLMs with quality explanations that
based ones) are possible. In our baselines, we use llama-
allow them to infer relevant latent information,
3-3B-instruct [31] as 𝑀1 .
i.e. to provide additional background knowledge,
improves performance compared to only using
the input or to using spurious explanations; Step 2: Use Explanation on Relation Prediction: A
• the quality of a (human or machine-generated) second LLM (𝑀2 ) is then provided with the generated ex-
explanation can be measured based on its helpful- planations 𝐸𝑔𝑒𝑛 to evaluate if the generated explanations
ness (or impairment) to the (model’s or human’s) improve the task of predicting the correct relations. In
performance on a downstream task. practice, this is achieved by appending the explanation
as a “hint” to the prompt, and asking the model to make
To contribute to this line of research, we propose GEESE: a prediction thereof. This process aims to discover how
Generating and Evaluating Explanations for Seman- effectively 𝑀2 leverages the explanations from 𝑀1 to per-
tic Entailment at CALAMITA [30], a pipeline to indi- form the target task. We use llama-3-8B as 𝑀2 , but other
rectly assess the effectiveness of explanations through combinations of 𝑀1 and 𝑀2 are possible.
the evaluation of their impact on the task of Recognizing
Textual Entailment (RTE) in Italian1 . Step 3: Evaluate Explanation Effectiveness Expla-
nation effectiveness is evaluated by analyzing how pro-
viding different explanations generated in Step 1 affects
2. Task Description and GEESE the model 𝑀2 prediction in Step 2. In practice, this is
Explanatory Pipeline done by calculating the accuracy of the predictions of
𝑀2 given the explanations and comparing them to the
Consider a pair of sentences < 𝑠1 , 𝑠2 >, like the ones in selected baselines (see Section 4).
the following example:
(1) Il cielo è grigio oggi. 3. Data description
(2) Faresti bene a prendere l’ombrello.2
3.1. Origin of data
Consider a semantic relation 𝑟 holding between 𝑠1 and 𝑠2
(e.g., 𝑠1 entails 𝑠2 , 𝑠1 does not entail 𝑠2 , 𝑠1 contradicts 𝑠2 ). The Recognizing Textual Entailment (RTE) task emerged
Let 𝐸 be the set of possible explanations for 𝑟. GEESE’s in 2005 [32] as the problem of determining if two sen-
explanatory task consists in: tences stand in an entailment or not-entailment relation-
ship. A common definition of “semantic entailment” (also
• generating an explanation 𝑒𝑟 ∈ 𝐸 for the semantic referred to as presupposition in some studies) is that “A
relationship 𝑟 for each < 𝑠1 , 𝑠2 > in the dataset; sentence S presupposes a proposition p if p must be true
• predict the relation with and without the gener- in order for S to have a truth-value (to be true or false)”
ated explanation 𝑒𝑟 ; [33]. A text t is said to entail another text (hypothesis,
1
h) if h is true in every circumstance (possible world) in
Code and data are made available at github.com/andreazaninello/
calamita-geese
which t is true. RTE, however, suggests a more empir-
2
(1) The sky is grey today. (2) You better take your umbrella with ical definition, allowing for cases in which the truth of
you.
the hypothesis is highly plausible, for most practical pur- tions. In our implementation, this is done through reg-
poses, rather than certain. According to [34], this “shal- ular expressions by substituting (“anonimize”) the label
low” definition better accounts for the types of uncertain strings ("YES", "NO", "UNKNOWN" ) and all words start-
inferences that are typically expected from text-based ing with entail.*, contradict.*, neutr.*, impl*,
applications. contradd.* (verbs and nouns directly stating the kind
Recognizing Textual Entailment was formalized of relationship) with ”XXX ”.
through a series of successful challenges and workshops We therefore also provide the following “anonymized”
that began in 2005 [32] and lasted until 2012. Starting additional explanations for each example, which we use
from the RTE-3 edition, the task was extended from two in our prompts:
labels to a three-label classification, splitting the not-
entailment label into two classes, contradiction and neu- • anon_whyexp : the anonimized explanation gen-
trality. Given the interest in the task, an Italian version erated by llama3 as 𝑀1 ;
of the RTE-3 dataset was developed to explore language • anon_human : the anonimized human-written ex-
comprehension and textual entailment [35]. planation (from e-RTE-3-it).
The dataset used in the challenge is the e-RTE-3-it
dataset [36], which is an emended version enriched with 3.4. Data format
human-written explanations of the RTE-3-it dataset [35].
The dataset is freely distributed in HuggingFace’s Dataset
format4 . A snippet of the data is displayed in Table 1.
3.2. Detailed data statistics
The dataset contains 1600 text-hypothesis sentence pairs
in Italian (text_t and text_h in the dataset) divided
4. Metrics and baselines
into an 800-example validation and an 800-example test We conduct baseline experiments using Llama-3.1-8B-
split. Each example is annotated with an entailment Instruct as 𝑀1 with a custom implementation in Hugging-
label (label ): "YES" (entailed), "NO" (contradicted), or Face, and Llama-3-8B as 𝑀2 , using the LLM-Evaluation-
"UNKNOWN" (neutral). Harness library [38] in a zero-shot setting5 .
We provide baselines for the following settings:
3.3. Annotation details 1. no-exp: No explanations provided (baseline);
The e-RTE-3-it dataset presents human explanations writ- 2. dummy: The hypothesis itself (text_t ) provided
ten in Italian by native speakers. For each text-hypothesis as a ”non-informative” explanation, controlling
pair, annotators provided a natural language explanation for input length and providing a second baseline.
justifying the given label (explanation ) for the entail- 3. human: Human-written explanations (from e-
ment relation (“why does 𝑆1 stand in an 𝑟 relation with RTE-3-it) anonimized (anon_human ) provided as
𝑆2 ?”)3 . additional input;
All annotations underwent quality control, involving 4. llama-3: The explanation generated using
two expert linguists who manually checked the expla- LLama-3-8B-Instruct as 𝑀1 (anon_whyexp ).
nations for grammaticality, fluency, and logical validity.
This process ensured high quality of the final e-RTE-3-it
explanations, informativeness, as well as minimal label
4.1. Example of prompts for zero shots
leakage (see infra). All experiments have been carried out in a zero-shot
Label leakage [37] refers to the fact that the explana- setting using the following prompts6 .
tion may be directly suggesting the label without gen-
uinely being informative. While the manual check of (M1 - Generation): Your task
all original human explanations ensured minimal label is to clarify the entailment
leakage, to prevent this we automatically replace di- relationship between a pair
rect references to the label and to the task with place- of sentences by explaining
holders in the human-written and generated explana- why a classifier predicted
3 4
Additionally, the annotator provided a confidence score (1-5) reflect- https://huggingface.co/datasets/azaninello/explained-full-llama-3
ing their certainty about the provided explanation (which we don’t 5
Generation parameters are: stop sequences: ””, ”<|eot_id|>”,
use in the task), an optional alternative label, if they felt the ini- max. gen. tokens: 128; do sample: false; temperature: 0
6
tial label was inaccurate, along with explanations and confidence All prompts are provided in English to ensure comparability across
scores. We don’t consider these annotations in the task, and only use tasks and language models. In fact, using English as a “metalan-
the original label as our gold relationship and the human-written guage” allows us to establish a comparable set-up to study LLMs’
explanation for the original label as a strong baseline. capability regardless of their specific training language.
FIELD EXAMPLE
id 393
text_t Gli Eschimesi, i cacciatori di foche sparsi nell’Artico, prevedono di sollecitare un giudizio della
Commissione Interamericana dei Diritti dell’Uomo poiché gli Stati Uniti, contribuendo in modo
sostanziale al riscaldamento globale, stanno minacciando la loro esistenza.
text_h La caccia alle foche mette a rischio d’estinzione le specie.
label UNKNOWN
text_label neutrality
explanation Nella frase non vengono citate le conseguenze della caccia sulle foche perciò non si può evincere il
rischio per loro della caccia.
anon_whyexp The XXX relationship between the two sentences is XXX because the relationship is not clearly
established. Sentence 1 mentions the Inuit people, the Commission on Human Rights, and the
United States, which are not directly related to the topic of the second sentence, which is about
the threat of extinction of species due to hunting. The connection between the two sentences is
not explicitly stated, and it is not clear how the information presented in Sentence 1 supports or
XXX the information in Sentence 2. Therefore, the XXX label is XXX because the relationship
between the two sentences is not clear. For example, if the XXX label was XXX, it would mean that
Sentence 1 logically implies Sentence 2, but the information in Sentence 1 does not provide enough
evidence to support this claim. Similarly, if the XXX label was XXX, it would mean that Sentence 1
logically XXX Sentence 2, but again, the information in Sentence 1 does not provide enough
evidence to support this claim. Therefore, the XXX label is XXX because the relationship between
the two sentences is not clear.
anon_human Nella frase non vengono citate le conseguenze della caccia sulle foche perciò non si può evincere il
rischio per loro della caccia.
Table 1
Example from the processed e-RTE-3-it dataset.
a specific entailment label. accuracy at 78.12%. In comparison, using human-written
Sentence 1: text_t Sentence 2: explanations shows slightly lower accuracy compared
text_h Entailment label: label. to machine-generated, but higher scores compared to
exp_type 7 baselines, suggesting that explanations do enhance the
(M2 - Prediction): Your task is models’ understanding of semantic relationships.
to predict the entailment label Generated explanations, proving more effective than
between two sentences, selecting human-crafted ones, suggest that the quality and type
one label among YES (entailment), of explanations provided can influence predictive perfor-
NO (contradiction), or UNKNOWN mance, but also highlight the need for further research
(neutrality). Sentence 1: into optimizing explanation generation methods for im-
text_t Sentence 2: text_h Hint: proved outcomes in NLP tasks. In fact, note that gener-
anon_explanation. Entailment ated explanations may be positively influenced by factors
label: 8 other than informativeness alone, such as the lengths of
the explanations themselves, or may still be indirectly
suggesting the right relationship despite the anonymiza-
5. Baseline Results and Discussion tion process described in 3.3.
For example, as reported by one of the anonymous re-
Baseline results, reported in Table 2, demonstrate the im- viewers, see “anon_whyexp” explanation in Table 1: “In
pact of incorporating explanations on the performance of other words, Sentence 2 provides enough information
language models in the Recognizing Textual Entailment to infer the truth of Sentence 1”. The generated expla-
tasks. The accuracy scores indicate that models utilizing nation clearly (but not directly) hints at an ”entail” label,
explanations generated by Llama-3 achieve the highest potentially compromising the intended anonymity. The
fairness of the comparison between human- and machine-
7
Variables are indicated in color. In our experiments exp_type = generated explanation is an aspect that deserves further
“Explain how the two sentences are connected.” and the variables
investigation.
are read from each example.
8
Variables are indicated in color. In our experiments, anon_explana-
tion can take the following values: “Not given.” (no-exp), text_h
(dummy), anon_human (human), anon_whyexp (llama-3).
Tasks n-shot Metric Value Stderr models must adhere to strict privacy standards to ensure
geese_dummy 0 acc 0.5850 0.0174
that individuals’ rights are respected. Addressing these
geese_noexp 0 acc 0.5437 0.0176
ethical challenges is essential to foster trust and ensure
geese_llama3 0 acc 0.7812 0.0146
geese_human 0 acc 0.7575 0.0152 that AI technologies are developed and used responsibly.
Table 2
Results for the 0-shot baseline experiments on the full test set. 9. Data license and copyright
issues
6. Conclusion We release our original content under the MIT License.
Please refer to the original dataset’s copyright and license
The findings from the GEESE challenge underscore the regulations for information on the derived data.
significance of effective explanation generation in en-
hancing the capabilities of language models in RTE tasks.
Preliminary results show that models provided with Acknowledgments
explanations, whether human-written or generated by
LLMs, exhibit improved predictive accuracy compared to This work has been partially funded by PNRR project
those lacking such inputs. This supports the hypothesis FAIR - Future AI Research (PE00000013), under the NRRP
that explanations can facilitate a deeper understanding MUR program funded by the NextGenerationEU and the
of semantic relationships, thus aiding model inference. ANTIDOTE project (CHIST-ERA grant of the Call XAI
The GEESE challenge establishes a framework for gen- 2019 of the ANR with the grant number Project-ANR-21-
erating and evaluating explanations in the domain of CHR4-0002)
semantic entailment. By demonstrating the utility of
explanation injection, we contribute to the ongoing dis- References
course on interpretability in AI, advocating for a balanced
approach that enhances model transparency while main- [1] S. Lowry, G. Macpherson, A blot on the profession,
taining robustness. Our findings encourage further explo- 296 brit, MED. J 657 (1988) 657.
ration into the interplay between explanations and model [2] L. M. Fagan, E. H. Shortliffe, B. G. Buchanan,
performance, paving the way for more interpretable and Computer-based medical decision making: from
user-friendly AI systems. As language models continue mycin to vm, Automedica 3 (1980) 97–108.
to evolve, integrating effective explanation mechanisms [3] R. Bareiss, Exemplar-based knowledge acquisition:
will be crucial for ensuring their responsible deployment A unified approach to concept representation, clas-
in sensitive applications. sification, and learning, volume 2, Academic Press,
2014.
[4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser,
7. Limitations B. Ommer, High-resolution image synthesis with
The study also highlights limitations, including potential latent diffusion models, 2021. arXiv:2112.10752 .
biases in the generated explanations and the challenge of [5] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen,
ensuring that explanations remain informative without R. J. Skerry-Ryan, Y. Jia, A. Rosenberg, B. Ram-
directly revealing the answer. Future research could ex- abhadran, Learning to speak fluently in a
plore diverse explanation types and their varying impacts foreign language: Multilingual speech synthe-
across different contexts and languages. sis and cross-language voice cloning, CoRR
abs/1907.04448 (2019). URL: http://arxiv.org/abs/
1907.04448. arXiv:1907.04448 .
8. Ethical issues [6] Y. Mirsky, W. Lee, The creation and detection of
deepfakes: A survey, ACM Comput. Surv. 54 (2021).
We would like to draw the readers’ attention on the fol- URL: https://doi.org/10.1145/3425780. doi:10.1145/
lowing. Firstly, the potential for bias in both the train- 3425780 .
ing data and the generated explanations can perpetu- [7] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P.
ate stereotypes or misinformation, leading to harmful de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
consequences, particularly in sensitive domains such as N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger,
healthcare or legal applications. There is also the risk M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan,
that users may place undue trust in machine-generated S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser,
explanations, mistakenly believing them to be infallible. M. Bavarian, C. Winter, P. Tillet, F. P. Such,
Finally, the collection and use of data for training these
D. Cummings, M. Plappert, F. Chantzis, E. Barnes, arXiv:https://academic.oup.com/idpl/article-
A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, pdf/7/4/233/22923065/ipx022.pdf .
N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, [16] L. Edwards, M. Veale, Slave to the algorithm: Why
W. Saunders, C. Hesse, A. N. Carr, J. Leike, a right to an explanation is probably not the remedy
J. Achiam, V. Misra, E. Morikawa, A. Radford, you are looking for, Duke L. & Tech. Rev. 16 (2017)
M. Knight, M. Brundage, M. Murati, K. Mayer, 18.
P. Welinder, B. McGrew, D. Amodei, S. McCan- [17] E. Cambria, L. Malandri, F. Mercorio, M. Mez-
dlish, I. Sutskever, W. Zaremba, Evaluating zanzanica, N. Nobani, A survey on xai and
large language models trained on code, CoRR natural language explanations, Information
abs/2107.03374 (2021). URL: https://arxiv.org/abs/ Processing Management 60 (2023) 103111.
2107.03374. arXiv:2107.03374 . URL: https://www.sciencedirect.com/science/
[8] OpenAI, Gpt-4 technical report, 2023. article/pii/S0306457322002126. doi:https:
arXiv:2303.08774 . //doi.org/10.1016/j.ipm.2022.103111 .
[9] G. Team, Gemini: A family of highly capable mul- [18] M. Hartmann, D. Sonntag, A survey on improving
timodal models, 2024. URL: https://arxiv.org/abs/ NLP models with human explanations, in: Proceed-
2312.11805. arXiv:2312.11805 . ings of the First Workshop on Learning with Nat-
[10] H. Touvron, L. Martin, K. Stone, P. Albert, A. Alma- ural Language Supervision, Association for Com-
hairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhar- putational Linguistics, Dublin, Ireland, 2022, pp.
gava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, 40–47. URL: https://aclanthology.org/2022.lnls-1.5.
M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, doi:10.18653/v1/2022.lnls- 1.5 .
W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, [19] B. Paranjape, J. Michael, M. Ghazvininejad, H. Ha-
A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kar- jishirzi, L. Zettlemoyer, Prompting contrastive
das, V. Kerkez, M. Khabsa, I. Kloumann, A. Ko- explanations for commonsense reasoning tasks,
renev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, in: Findings of the Association for Compu-
D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, tational Linguistics: ACL-IJCNLP 2021, Asso-
P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizen- ciation for Computational Linguistics, Online,
stein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. 2021, pp. 4179–4192. URL: https://aclanthology.
Smith, R. Subramanian, X. E. Tan, B. Tang, R. Tay- org/2021.findings-acl.366. doi:10.18653/v1/2021.
lor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, findings- acl.366 .
Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Ro- [20] A. Lampinen, I. Dasgupta, S. Chan, K. Math-
driguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: ewson, M. Tessler, A. Creswell, J. McClelland,
Open foundation and fine-tuned chat models, 2023. J. Wang, F. Hill, Can language models learn
arXiv:2307.09288 . from explanations in context?, in: Y. Goldberg,
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Ka- Z. Kozareva, Y. Zhang (Eds.), Findings of the Associ-
plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- ation for Computational Linguistics: EMNLP 2022,
try, A. Askell, et al., Language models are few-shot Association for Computational Linguistics, Abu
learners, Advances in neural information process- Dhabi, United Arab Emirates, 2022, pp. 537–563.
ing systems 33 (2020) 1877–1901. URL: https://aclanthology.org/2022.findings-emnlp.
[12] T. Labruna, S. Brenna, A. Zaninello, B. Magnini, Un- 38. doi:10.18653/v1/2022.findings- emnlp.38 .
raveling chatgpt: A critical analysis of ai-generated [21] X. Ye, G. Durrett, The unreliability of explanations
goal-oriented dialogues and annotations, 2023. in few-shot prompting for textual reasoning, in:
arXiv:2305.14556 . S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
[13] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Gi- K. Cho, A. Oh (Eds.), Advances in Neural Informa-
annotti, D. Pedreschi, A survey of methods for tion Processing Systems, volume 35, Curran Asso-
explaining black box models, ACM computing sur- ciates, Inc., 2022, pp. 30378–30392. URL: https://
veys (CSUR) 51 (2018) 1–42. proceedings.neurips.cc/paper_files/paper/2022/file/
[14] A. Vassiliades, N. Bassiliades, T. Patkos, Argumenta- c402501846f9fe03e2cac015b3f0e6b1-Paper-Conference.
tion and explainable artificial intelligence: a survey, pdf.
The Knowledge Engineering Review 36 (2021) e5. [22] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a
doi:10.1017/S0269888921000011 . method for automatic evaluation of machine trans-
[15] A. D. Selbst, J. Powles, Meaningful information lation, in: Proceedings of the 40th annual meeting
and the right to explanation, International Data of the Association for Computational Linguistics,
Privacy Law 7 (2017) 233–242. URL: https://doi.org/ 2002, pp. 311–318.
10.1093/idpl/ipx022. doi:10.1093/idpl/ipx022 . [23] L. C. ROUGE, A package for automatic evaluation
of summaries, in: Proceedings of Workshop on Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz,
Text Summarization of ACL, Spain, 2004. D. Livshits, D. Esiobu, D. Choudhary, D. Mahajan,
[24] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin,
Y. Artzi, Bertscore: Evaluating text generation with E. AlBadawy, E. Lobanova, E. Dinan, E. M. Smith,
bert, arXiv preprint arXiv:1904.09675 (2019). F. Radenovic, F. Zhang, G. Synnaeve, G. Lee, G. L.
[25] B. Kim, R. Khanna, O. O. Koyejo, Examples are Anderson, G. Nail, G. Mialon, G. Pang, G. Cu-
not enough, learn to criticize! criticism for inter- curell, H. Nguyen, H. Korevaar, H. Xu, H. Tou-
pretability, in: Advances in Neural Information vron, I. Zarov, I. A. Ibarra, I. Kloumann, I. Misra,
Processing Systems, volume 29, 2016. I. Evtimov, J. Copet, J. Lee, J. Geffert, J. Vranes,
[26] S. Wiegreffe, A. Marasović, N. A. Smith, Measur- J. Park, J. Mahadeokar, J. Shah, J. van der Linde,
ing association between labels and free-text ratio- J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang,
nales, in: Proceedings of the 2021 Conference J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park,
on Empirical Methods in Natural Language Pro- J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala,
cessing, Association for Computational Linguis- K. Upasani, K. Plawiak, K. Li, K. Heafield, K. Stone,
tics, Online and Punta Cana, Dominican Republic, K. El-Arini, K. Iyer, K. Malik, K. Chiu, K. Bhalla,
2021, pp. 10266–10284. URL: https://aclanthology. L. Rantala-Yeary, L. van der Maaten, L. Chen, L. Tan,
org/2021.emnlp-main.804. doi:10.18653/v1/2021. L. Jenkins, L. Martin, L. Madaan, L. Malo, L. Blecher,
emnlp- main.804 . L. Landzaat, L. de Oliveira, M. Muzzi, M. Pasupuleti,
[27] P. Hase, S. Zhang, H. Xie, M. Bansal, Leakage- M. Singh, M. Paluri, M. Kardas, M. Oldham, M. Rita,
adjusted simulatability: Can models generate non- M. Pavlova, M. Kambadur, M. Lewis, M. Si, M. K.
trivial explanations of their behavior in natu- Singh, M. Hassan, N. Goyal, N. Torabi, N. Bashlykov,
ral language?, in: Findings of the Association N. Bogoychev, N. Chatterji, O. Duchenne, O. Çelebi,
for Computational Linguistics: EMNLP 2020, As- P. Alrassy, P. Zhang, P. Li, P. Vasic, P. Weng, P. Bhar-
sociation for Computational Linguistics, Online, gava, P. Dubal, P. Krishnan, P. S. Koura, P. Xu, Q. He,
2020, pp. 4351–4367. URL: https://aclanthology.org/ Q. Dong, R. Srinivasan, R. Ganapathy, R. Calderer,
2020.findings-emnlp.390. doi:10.18653/v1/2020. R. S. Cabral, R. Stojnic, R. Raileanu, R. Girdhar, R. Pa-
findings- emnlp.390 . tel, R. Sauvestre, R. Polidoro, R. Sumbaly, R. Taylor,
[28] D. Pruthi, R. Bansal, B. Dhingra, L. B. Soares, R. Silva, R. Hou, R. Wang, S. Hosseini, S. Chennabas-
M. Collins, Z. C. Lipton, G. Neubig, W. W. Cohen, appa, S. Singh, S. Bell, S. S. Kim, S. Edunov, S. Nie,
Evaluating explanations: How much do explana- S. Narang, S. Raparthy, S. Shen, S. Wan, S. Bhos-
tions from the teacher aid students?, Transactions ale, S. Zhang, S. Vandenhende, S. Batra, S. Whit-
of the Association for Computational Linguistics man, S. Sootla, S. Collot, S. Gururangan, S. Borodin-
10 (2022) 359–375. URL: https://aclanthology.org/ sky, T. Herman, T. Fowler, T. Sheasha, T. Georgiou,
2022.tacl-1.21. doi:10.1162/tacl_a_00465 . T. Scialom, T. Speckbacher, T. Mihaylov, T. Xiao,
[29] P. Hase, M. Bansal, When can models learn from U. Karn, V. Goswami, V. Gupta, V. Ramanathan,
explanations? a formal framework for understand- V. Kerkez, V. Gonguet, V. Do, V. Vogeti, V. Petro-
ing the roles of explanation data, arXiv preprint vic, W. Chu, W. Xiong, W. Fu, W. Meers, X. Mar-
arXiv:2102.02201 (2021). tinet, X. Wang, X. E. Tan, X. Xie, X. Jia, X. Wang,
[30] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Fran- Y. Goldschlag, Y. Gaur, Y. Babaei, Y. Wen, Y. Song,
cis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Ri- Y. Zhang, Y. Li, Y. Mao, Z. D. Coudert, Z. Yan,
naldi, D. Scalena, CALAMITA: Challenge the Abili- Z. Chen, Z. Papakipos, A. Singh, A. Grattafiori,
ties of LAnguage Models in ITAlian, in: Proceed- A. Jain, A. Kelsey, A. Shajnfeld, A. Gangidi, A. Vic-
ings of the 10th Italian Conference on Computa- toria, A. Goldstand, A. Menon, A. Sharma, A. Boe-
tional Linguistics (CLiC-it 2024), Pisa, Italy, Decem- senberg, A. Vaughan, A. Baevski, A. Feinstein,
ber 4 - December 6, 2024, CEUR Workshop Proceed- A. Kallet, A. Sangani, A. Yunus, A. Lupu, A. Al-
ings, CEUR-WS.org, 2024. varado, A. Caples, A. Gu, A. Ho, A. Poulton, A. Ryan,
[31] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al- A. Ramchandani, A. Franco, A. Saraf, A. Chowd-
Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, hury, A. Gabriel, A. Bharambe, A. Eisenman, A. Yaz-
A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mi- dan, B. James, B. Maurer, B. Leonhardi, B. Huang,
tra, A. Sravankumar, A. Korenev, A. Hinsvark, B. Loyd, B. D. Paola, B. Paranjape, B. Liu, B. Wu,
A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, B. Ni, B. Hancock, B. Wasti, B. Spence, B. Sto-
A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, jkovic, B. Gamido, B. Montalvo, C. Parker, C. Bur-
C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. Mc- ton, C. Mejia, C. Wang, C. Kim, C. Zhou, C. Hu,
Connell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. C.-H. Chu, C. Cai, C. Tindal, C. Feichtenhofer,
D. Civin, D. Beaty, D. Kreymer, D. Li, D. Wyatt, Z. Wen, Z. Yang, Z. Zhao, The llama 3 herd of
D. Adkins, D. Xu, D. Testuggine, D. David, D. Parikh, models, 2024. URL: https://arxiv.org/abs/2407.21783.
D. Liskovich, D. Foss, D. Wang, D. Le, D. Holland, arXiv:2407.21783 .
E. Dowling, E. Jamil, E. Montgomery, E. Presani, [32] I. Dagan, O. Glickman, B. Magnini, The pascal
E. Hahn, E. Wood, E. Brinkman, E. Arcaute, E. Dun- recognising textual entailment challenge, in: Ma-
bar, E. Smothers, F. Sun, F. Kreuk, F. Tian, F. Ozgenel, chine learning challenges workshop, Springer, 2005,
F. Caggioni, F. Guzmán, F. Kanayet, F. Seide, G. M. pp. 177–190.
Florez, G. Schwarz, G. Badeer, G. Swee, G. Halpern, [33] G. Chierchia, S. Mcconnell-Ginet, Mean-
G. Thattai, G. Herman, G. Sizov, Guangyi, Zhang, ing and grammar: An introduction to seman-
G. Lakshminarayanan, H. Shojanazeri, H. Zou, tics, 1990. URL: https://api.semanticscholar.org/
H. Wang, H. Zha, H. Habeeb, H. Rudolph, H. Suk, CorpusID:62731986.
H. Aspegren, H. Goldman, I. Damlaj, I. Molybog, [34] I. Dagan, B. Dolan, B. Magnini, D. Roth, Recog-
I. Tufanov, I.-E. Veliche, I. Gat, J. Weissman, J. Ge- nizing textual entailment: Rational, evaluation and
boski, J. Kohli, J. Asher, J.-B. Gaya, J. Marcus, J. Tang, approaches–erratum, Natural Language Engineer-
J. Chan, J. Zhen, J. Reizenstein, J. Teboul, J. Zhong, ing 16 (2010) 105–105.
J. Jin, J. Yang, J. Cummings, J. Carvill, J. Shep- [35] B. Magnini, A. Lavelli, S. Magnolini, Comparing
ard, J. McPhie, J. Torres, J. Ginsburg, J. Wang, machine learning and deep learning approaches on
K. Wu, K. H. U, K. Saxena, K. Prasad, K. Khan- NLP tasks for the Italian language, in: Proceedings
delwal, K. Zand, K. Matosich, K. Veeraraghavan, of the Twelfth Language Resources and Evaluation
K. Michelena, K. Li, K. Huang, K. Chawla, K. Lakho- Conference, European Language Resources Associ-
tia, K. Huang, L. Chen, L. Garg, L. A, L. Silva, ation, Marseille, France, 2020, pp. 2110–2119. URL:
L. Bell, L. Zhang, L. Guo, L. Yu, L. Moshkovich, https://aclanthology.org/2020.lrec-1.259.
L. Wehrstedt, M. Khabsa, M. Avalani, M. Bhatt, [36] A. Zaninello, S. Brenna, B. Magnini, Textual en-
M. Tsimpoukelli, M. Mankus, M. Hasson, M. Lennie, tailment with natural language explanations: The
M. Reso, M. Groshev, M. Naumov, M. Lathi, M. Ke- italian e-rte-3 dataset, in: CLiC-it, 2023. URL:
neally, M. L. Seltzer, M. Valko, M. Restrepo, M. Patel, https://ceur-ws.org/Vol-3596/short21.pdf.
M. Vyatskov, M. Samvelyan, M. Clark, M. Macey, [37] P. Hase, S. Zhang, H. Xie, M. Bansal, Leakage-
M. Wang, M. J. Hermoso, M. Metanat, M. Raste- adjusted simulatability: Can models generate non-
gari, M. Bansal, N. Santhanam, N. Parks, N. White, trivial explanations of their behavior in natural lan-
N. Bawa, N. Singhal, N. Egebo, N. Usunier, N. P. guage?, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings
Laptev, N. Dong, N. Zhang, N. Cheng, O. Chernoguz, of the Association for Computational Linguistics:
O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, EMNLP 2020, Association for Computational Lin-
P. Saab, P. Balaji, P. Rittner, P. Bontrager, P. Roux, guistics, Online, 2020, pp. 4351–4367. URL: https://
P. Dollar, P. Zvyagina, P. Ratanchandani, P. Yuvraj, aclanthology.org/2020.findings-emnlp.390. doi:10.
Q. Liang, R. Alao, R. Rodriguez, R. Ayub, R. Murthy, 18653/v1/2020.findings- emnlp.390 .
R. Nayani, R. Mitra, R. Li, R. Hogan, R. Battey, [38] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black,
R. Wang, R. Maheswari, R. Howes, R. Rinott, S. J. A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h,
Bondu, S. Datta, S. Chugh, S. Hunt, S. Dhillon, H. Li, K. McDonell, N. Muennighoff, C. Ociepa,
S. Sidorov, S. Pan, S. Verma, S. Yamamoto, S. Ra- J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron,
maswamy, S. Lindsay, S. Lindsay, S. Feng, S. Lin, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang,
S. C. Zha, S. Shankar, S. Zhang, S. Zhang, S. Wang, A. Zou, A framework for few-shot language model
S. Agarwal, S. Sajuyigbe, S. Chintala, S. Max, evaluation, 2024. URL: https://zenodo.org/records/
S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, 12608602. doi:10.5281/zenodo.12608602 .
S. Gupta, S. Cho, S. Virk, S. Subramanian, S. Choud-
hury, S. Goldman, T. Remez, T. Glaser, T. Best,
T. Kohler, T. Robinson, T. Li, T. Zhang, T. Matthews,
T. Chou, T. Shaked, V. Vontimitta, V. Ajayi, V. Mon-
tanez, V. Mohan, V. S. Kumar, V. Mangla, V. Albiero,
V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov,
W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable,
X. Tang, X. Wang, X. Wu, X. Wang, X. Xia, X. Wu,
X. Gao, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang,
Y. Zhang, Y. Adi, Y. Nam, Yu, Wang, Y. Hao,
Y. Qian, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick,