=Paper=
{{Paper
|id=Vol-3878/133_calamita_long
|storemode=property
|title=GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/133_calamita_long.pdf
|volume=Vol-3878
|authors=Andrea Zaninello,Bernardo Magnini
|dblpUrl=https://dblp.org/rec/conf/clic-it/ZaninelloM24
}}
==GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge==
<pdf width="1500px">https://ceur-ws.org/Vol-3878/133_calamita_long.pdf</pdf>
<pre>
GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge
Andrea Zaninello1,2,∗, Bernardo Magnini1
1 Fondazione Bruno Kessler, Trento (Italy)
2 Free University of Bozen-Bolzano (Italy)


Abstract
In the GEESE challenge, we present a pipeline to evaluate generated explanations for the task of Recognizing Textual
Entailment (RTE) in Italian. The challenge focuses on evaluating the impact of generated explanations on the predictive
performance of language models. Using a dataset enriched with human-written explanations, we employ two large language
models (LLMs) to generate and utilize explanations for semantic relationships between sentence pairs. Our methodology
assesses the quality of generated explanations by measuring changes in prediction accuracy when explanations are provided.
Through reproducible experimentation, we establish benchmarks against various baseline approaches, demonstrating the
potential of explanation injection to enhance model interpretability and performance.

Keywords
CALAMITA, CLiC-it, Explanation generation, Explainability, RTE, Recognizing Textual Entailment, Inference, Italian


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
Email: azaninello@fbk.eu (A. Zaninello); magnini@fbk.eu (B. Magnini)
Web: https://github.com/andreazaninello (A. Zaninello)
ORCID: 0000-0001-9998-1942 (A. Zaninello)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction and Motivation

The ability of a machine to justify its predictions and provide human-understandable explanations
has been a key research objective of Machine Learning (ML) and Artificial Intelligence (AI) since
their early stages [1, 2, 3]. In the past few years, the field of AI has experienced an
unprecedented acceleration in most areas, such as computer vision [4], audio [5], video [6], and
programming languages [7], and especially in Natural Language Processing (NLP), with the
popularization of generative Large Language Models (LLMs) such as OpenAI’s ChatGPT [8], Google’s
Gemini [9], or Meta’s Llama [10].

These models are currently able to produce natural-sounding and coherent language, often
indistinguishable from human-written text [11, 12]. While these results open up new avenues for
future applications and research, they also raise ethical issues considering the ubiquitous role of
machines in our lives, and in sensitive fields like education, health, justice, and private life. In
fact, the limited transparency of neural architectures makes it hard to interpret their functioning
(the so-called “black-box” problem). In addition, many of the currently available LLMs are not fully
open-source, so the data they were trained on is not known to either researchers or the general
public. Finally, these models have reached such sizes that their results are difficult to replicate,
making them a kind of “black box in a black box”.

As a consequence, the need to develop methods to understand their reasoning is becoming central.
Many recent efforts have been devoted to explaining such models [13], and the importance of
interpretability and explainability in AI has become ever more urgent [14, 15, 16].

The role of explanations in NLP has been explored by a substantial body of research. Cambria et al.
[17], for instance, provide a comprehensive survey of approaches for generating natural language
explanations; Hartmann and Sonntag [18] examine the benefits of explanations for NLP models;
Paranjape et al. [19] focus on template-based explanations; Lampinen et al. [20] and Ye and Durrett
[21] demonstrate the benefits of in-context explanations for large models in challenging reasoning
tasks.

Explanation generation quality has traditionally been evaluated through automated overlap metrics
like BLEU [22], ROUGE [23], or BERT-Score [24] against a gold reference explanation written by
humans. This usually implies costly human-explanation collection campaigns; additionally, these
measures may neither fully capture the informativeness or the effectiveness of an explanation, nor
faithfully reflect human judgments.

Recently, human simulatability scores have been proposed as an alternative method to understand the
quality of explanations from the perspective of the “utility to an end-user” [25]. Rather than
focusing on the overlap between explanations and ground-truth data, this approach assesses how
explanations enhance predictive performance on a downstream task compared to the input alone. While
humans have traditionally been the predictors [26], recent research has demonstrated that trained
models can automate this process, showing moderate to strong correlations with human judgments
[27]. Pruthi et al. [28], for instance, measure explanation quality
based on downstream performance: their methodology trains a student model for the end task on
explanations produced by a teacher through automatic explanation generation techniques.

However, current LLMs may also benefit from explanation injection even if they are not explicitly
trained to do so, and some works suggest using the explanation to augment the input, so that
predictions on future data points are conditioned on both the input and the explanation [29, 27]. In
fact, LLMs are capable of understanding supplementary input content and of including explanations in
the input during inference without requiring additional supervision, which can indirectly
demonstrate the role of explanations in the inference process.

These observations underline two crucial aspects:

     • providing LLMs with quality explanations that allow them to infer relevant latent
       information, i.e. to provide additional background knowledge, improves performance compared
       to only using the input or to using spurious explanations;
     • the quality of a (human or machine-generated) explanation can be measured based on its
       helpfulness (or impairment) to the (model’s or human’s) performance on a downstream task.

To contribute to this line of research, we propose GEESE: Generating and Evaluating Explanations for
Semantic Entailment at CALAMITA [30], a pipeline to indirectly assess the effectiveness of
explanations through the evaluation of their impact on the task of Recognizing Textual Entailment
(RTE) in Italian (footnote 1).

2. Task Description and GEESE Explanatory Pipeline

Consider a pair of sentences <s1, s2>, like the ones in the following example (footnote 2):

        (1) Il cielo è grigio oggi.
        (2) Faresti bene a prendere l’ombrello.

Consider a semantic relation r holding between s1 and s2 (e.g., s1 entails s2, s1 does not entail
s2, s1 contradicts s2). Let E be the set of possible explanations for r. GEESE’s explanatory task
consists in:

     • generating an explanation e_r ∈ E for the semantic relationship r for each <s1, s2> in the
       dataset;
     • predicting the relation with and without the generated explanation e_r;
     • assessing the quality of the generated explanations E_gen by taking the delta between
       prediction accuracy with and without explanation as a proxy of the explanations’ quality.

Step 1: Generate Explanation: A first LLM (M1) is prompted to produce explanations
E_gen = {e1, e2, …, en} for a specific semantic relation r_c holding between a given sentence pair,
denoted as <s1, s2>. In the task, we focus on the entailment relationship, which can take three
values: “YES” (sentence 1 is entailed by sentence 2), “NO” (sentence 1 is contradicted by sentence
2), “UNKNOWN” (sentence 1 is neither entailed nor contradicted by sentence 2). In our baselines, we
focus on one explanation type (why-explanation), but other kinds of explanations or reasoning
strategies (like counterfactual or example-based ones) are possible. In our baselines, we use
llama-3-3B-instruct [31] as M1.

Step 2: Use Explanation on Relation Prediction: A second LLM (M2) is then provided with the
generated explanations E_gen to evaluate whether they improve the prediction of the correct
relations. In practice, this is achieved by appending the explanation as a “hint” to the prompt, and
asking the model to make a prediction based on it. This process aims to discover how effectively M2
leverages the explanations from M1 to perform the target task. We use llama-3-8B as M2, but other
combinations of M1 and M2 are possible.

Step 3: Evaluate Explanation Effectiveness: Explanation effectiveness is evaluated by analyzing how
providing different explanations generated in Step 1 affects the prediction of model M2 in Step 2.
In practice, this is done by calculating the accuracy of the predictions of M2 given the
explanations and comparing it to the selected baselines (see Section 4).

Footnote 1: Code and data are made available at github.com/andreazaninello/calamita-geese
Footnote 2: (1) The sky is grey today. (2) You had better take your umbrella with you.

3. Data description

3.1. Origin of data

The Recognizing Textual Entailment (RTE) task emerged in 2005 [32] as the problem of determining if
two sentences stand in an entailment or not-entailment relationship. A common definition of
“semantic entailment” (also referred to as presupposition in some studies) is that “A sentence S
presupposes a proposition p if p must be true in order for S to have a truth-value (to be true or
false)” [33]. A text t is said to entail another text (hypothesis, h) if h is true in every
circumstance (possible world) in which t is true. RTE, however, suggests a more empirical
definition, allowing for cases in which the truth of
the hypothesis is highly plausible, for most practical purposes, rather than certain. According to
[34], this “shallow” definition better accounts for the types of uncertain inferences that are
typically expected from text-based applications.

Recognizing Textual Entailment was formalized through a series of successful challenges and
workshops that began in 2005 [32] and lasted until 2012. Starting from the RTE-3 edition, the task
was extended from two labels to a three-label classification, splitting the not-entailment label
into two classes, contradiction and neutrality. Given the interest in the task, an Italian version
of the RTE-3 dataset was developed to explore language comprehension and textual entailment [35].

The dataset used in the challenge is the e-RTE-3-it dataset [36], an emended version of the RTE-3-it
dataset [35] enriched with human-written explanations.

3.2. Detailed data statistics

The dataset contains 1600 text-hypothesis sentence pairs in Italian (text_t and text_h in the
dataset) divided into an 800-example validation split and an 800-example test split. Each example
is annotated with an entailment label (label): "YES" (entailed), "NO" (contradicted), or "UNKNOWN"
(neutral).

3.3. Annotation details

The e-RTE-3-it dataset presents human explanations written in Italian by native speakers. For each
text-hypothesis pair, annotators provided a natural language explanation justifying the given label
(explanation) for the entailment relation (“why does S1 stand in an r relation with S2?”)
(footnote 3).

All annotations underwent quality control, involving two expert linguists who manually checked the
explanations for grammaticality, fluency, and logical validity. This process ensured that the final
e-RTE-3-it explanations are of high quality and informative, with minimal label leakage (see below).

Label leakage [37] refers to the fact that the explanation may directly suggest the label without
genuinely being informative. While the manual check of all original human explanations ensured
minimal label leakage, as a further safeguard we automatically replace direct references to the
label and to the task with placeholders in the human-written and generated explanations. In our
implementation, this is done through regular expressions by substituting (“anonymizing”) the label
strings ("YES", "NO", "UNKNOWN") and all words starting with entail.*, contradict.*, neutr.*, impl*,
contradd.* (verbs and nouns directly stating the kind of relationship) with “XXX”.

We therefore also provide the following “anonymized” additional explanations for each example, which
we use in our prompts:

     • anon_whyexp: the anonymized explanation generated by llama3 as M1;
     • anon_human: the anonymized human-written explanation (from e-RTE-3-it).

Footnote 3: Additionally, annotators provided a confidence score (1-5) reflecting their certainty
about the provided explanation and, if they felt the initial label was inaccurate, an optional
alternative label along with its explanation and confidence score. We do not consider these
annotations in the task, and only use the original label as our gold relationship and the
human-written explanation for the original label as a strong baseline.

3.4. Data format

The dataset is freely distributed in HuggingFace’s Dataset format (footnote 4). A snippet of the
data is displayed in Table 1.

Footnote 4: https://huggingface.co/datasets/azaninello/explained-full-llama-3

4. Metrics and baselines

We conduct baseline experiments using Llama-3.1-8B-Instruct as M1 with a custom implementation in
HuggingFace, and Llama-3-8B as M2, using the LLM-Evaluation-Harness library [38] in a zero-shot
setting (footnote 5).

We provide baselines for the following settings:

     1. no-exp: No explanations provided (baseline);
     2. dummy: The hypothesis itself (text_h) provided as a “non-informative” explanation,
        controlling for input length and providing a second baseline;
     3. human: Anonymized human-written explanations (anon_human, from e-RTE-3-it) provided as
        additional input;
     4. llama-3: The explanation generated using Llama-3-8B-Instruct as M1 (anon_whyexp).

Footnote 5: Generation parameters are: stop sequences: "</s>", "<|eot_id|>"; max. gen. tokens: 128;
do sample: false; temperature: 0

4.1. Example of prompts for zero shots

All experiments have been carried out in a zero-shot setting using the following prompts
(footnote 6).

Footnote 6: All prompts are provided in English to ensure comparability across tasks and language
models. In fact, using English as a “metalanguage” allows us to establish a comparable set-up to
study LLMs’ capability regardless of their specific training language.

          (M1 - Generation): Your task is to clarify the entailment relationship between a pair
          of sentences by explaining why a classifier predicted
      FIELD                  EXAMPLE
      id                     393
      text_t                 Gli Eschimesi, i cacciatori di foche sparsi nell’Artico, prevedono di sollecitare un giudizio della
                             Commissione Interamericana dei Diritti dell’Uomo poiché gli Stati Uniti, contribuendo in modo
                             sostanziale al riscaldamento globale, stanno minacciando la loro esistenza.
      text_h                 La caccia alle foche mette a rischio d’estinzione le specie.
      label                  UNKNOWN
      text_label             neutrality
      explanation            Nella frase non vengono citate le conseguenze della caccia sulle foche perciò non si può evincere il
                             rischio per loro della caccia.
      anon_whyexp            The XXX relationship between the two sentences is XXX because the relationship is not clearly
                             established. Sentence 1 mentions the Inuit people, the Commission on Human Rights, and the
                             United States, which are not directly related to the topic of the second sentence, which is about
                             the threat of extinction of species due to hunting. The connection between the two sentences is
                             not explicitly stated, and it is not clear how the information presented in Sentence 1 supports or
                             XXX the information in Sentence 2. Therefore, the XXX label is XXX because the relationship
                             between the two sentences is not clear. For example, if the XXX label was XXX, it would mean that
                             Sentence 1 logically implies Sentence 2, but the information in Sentence 1 does not provide enough
                             evidence to support this claim. Similarly, if the XXX label was XXX, it would mean that Sentence 1
                             logically XXX Sentence 2, but again, the information in Sentence 1 does not provide enough
                             evidence to support this claim. Therefore, the XXX label is XXX because the relationship between
                             the two sentences is not clear.
      anon_human             Nella frase non vengono citate le conseguenze della caccia sulle foche perciò non si può evincere il
                             rischio per loro della caccia.
Table 1
Example from the processed e-RTE-3-it dataset.
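
The XXX placeholders in the anon_whyexp field of Table 1 come from the anonymization step described
in Section 3.3. A minimal Python sketch of such a substitution follows. The paper does not give the
actual expressions, so the patterns below, the case handling, and the function name anonymize are
assumptions made for illustration: the listed stems are read as word prefixes, and the label strings
are matched only as whole upper-case words. Table 1 still contains the word "implies", so the
authors' own expressions evidently behave differently for at least that stem.

```python
import re

# Label strings named in Section 3.3.
LABELS = ("YES", "NO", "UNKNOWN")
# Word stems named in Section 3.3, treated here as prefixes (an assumption).
STEMS = ("entail", "contradict", "neutr", "impl", "contradd")

# Labels are matched as whole upper-case words, so that ordinary words such
# as "no" in running text are left untouched (an assumption).
LABEL_RE = re.compile(r"\b(?:" + "|".join(LABELS) + r")\b")
# Any word that starts with one of the stems, regardless of case.
STEM_RE = re.compile(r"\b(?:" + "|".join(STEMS) + r")\w*", re.IGNORECASE)

PLACEHOLDER = "XXX"


def anonymize(explanation: str) -> str:
    """Replace label strings and relation words with a placeholder."""
    masked = LABEL_RE.sub(PLACEHOLDER, explanation)
    return STEM_RE.sub(PLACEHOLDER, masked)


example = (
    "The entailment label is UNKNOWN because Sentence 1 neither "
    "implies nor contradicts Sentence 2."
)
print(anonymize(example))
# prints: The XXX label is XXX because Sentence 1 neither XXX nor XXX Sentence 2.
```

Matching on prefixes is deliberately broad: it also masks inflected and Italian forms such as
"contraddizione", at the cost of occasionally masking unrelated words like "implement".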


          a specific entailment label. Sentence 1: text_t Sentence 2: text_h Entailment label:
          label. exp_type (footnote 7)

          (M2 - Prediction): Your task is to predict the entailment label between two sentences,
          selecting one label among YES (entailment), NO (contradiction), or UNKNOWN (neutrality).
          Sentence 1: text_t Sentence 2: text_h Hint: anon_explanation. Entailment label:
          (footnote 8)

Footnote 7: Variables are indicated in color. In our experiments exp_type = “Explain how the two
sentences are connected.” and the variables are read from each example.
Footnote 8: Variables are indicated in color. In our experiments, anon_explanation can take the
following values: “Not given.” (no-exp), text_h (dummy), anon_human (human), anon_whyexp (llama-3).

5. Baseline Results and Discussion

Baseline results, reported in Table 2, demonstrate the impact of incorporating explanations on the
performance of language models in the Recognizing Textual Entailment task. The accuracy scores
indicate that models utilizing explanations generated by Llama-3 achieve the highest accuracy, at
78.12%. Human-written explanations yield slightly lower accuracy (75.75%) than machine-generated
ones, but still score well above both baselines (54.37% without explanations, 58.50% with the dummy
explanation), suggesting that explanations do enhance the models’ understanding of semantic
relationships.

The fact that generated explanations prove more effective than human-crafted ones suggests that the
quality and type of explanations provided can influence predictive performance, but it also
highlights the need for further research into optimizing explanation generation methods for improved
outcomes in NLP tasks. In fact, note that generated explanations may be positively influenced by
factors other than informativeness alone, such as the length of the explanations themselves, or may
still indirectly suggest the right relationship despite the anonymization process described in
Section 3.3.

For example, as reported by one of the anonymous reviewers, see the “anon_whyexp” explanation in
Table 1: “In other words, Sentence 2 provides enough information to infer the truth of Sentence 1”.
The generated explanation clearly (but not directly) hints at an “entail” label, potentially
compromising the intended anonymity. The fairness of the comparison between human- and
machine-generated explanations is an aspect that deserves further investigation.
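
The comparison underlying Step 3 and the discussion in Section 5 amounts to computing accuracy per
setting and taking the difference from a baseline. A minimal Python sketch follows, using the
zero-shot accuracies reported in Table 2. The function names accuracy and explanation_deltas are
hypothetical and are not taken from the GEESE code base.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted labels that match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def explanation_deltas(
    acc_by_setting: Dict[str, float], baseline: str = "no-exp"
) -> Dict[str, float]:
    """Accuracy gain of every other setting over the chosen baseline."""
    base = acc_by_setting[baseline]
    return {
        name: round(acc - base, 4)
        for name, acc in acc_by_setting.items()
        if name != baseline
    }


# Zero-shot accuracies as reported in Table 2.
reported = {"no-exp": 0.5437, "dummy": 0.5850, "human": 0.7575, "llama-3": 0.7812}

# Gains over the plain input, then over the length-controlled dummy hint.
print(explanation_deltas(reported))
print(explanation_deltas(reported, baseline="dummy"))
```

With these figures, the llama-3 explanations gain 23.75 accuracy points over no-exp and the human
explanations gain 21.38. Measuring against the dummy setting instead removes the part of the gain
that could be attributed to a longer input alone.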
  Tasks           n-shot   Metric   Value    Stderr
  geese_dummy     0        acc      0.5850   0.0174
  geese_noexp     0        acc      0.5437   0.0176
  geese_llama3    0        acc      0.7812   0.0146
  geese_human     0        acc      0.7575   0.0152

Table 2
Results for the 0-shot baseline experiments on the full test set.

6. Conclusion

The findings from the GEESE challenge underscore the significance of effective explanation
generation in enhancing the capabilities of language models in RTE tasks. Preliminary results show
that models provided with explanations, whether human-written or generated by LLMs, exhibit improved
predictive accuracy compared to those lacking such inputs. This supports the hypothesis that
explanations can facilitate a deeper understanding of semantic relationships, thus aiding model
inference.

The GEESE challenge establishes a framework for generating and evaluating explanations in the domain
of semantic entailment. By demonstrating the utility of explanation injection, we contribute to the
ongoing discourse on interpretability in AI, advocating for a balanced approach that enhances model
transparency while maintaining robustness. Our findings encourage further exploration into the
interplay between explanations and model performance, paving the way for more interpretable and
user-friendly AI systems. As language models continue to evolve, integrating effective explanation
mechanisms will be crucial for ensuring their responsible deployment in sensitive applications.

7. Limitations

The study also highlights limitations, including potential biases in the generated explanations and
the challenge of ensuring that explanations remain informative without directly revealing the
answer. Future research could explore diverse explanation types and their varying impacts across
different contexts and languages.

8. Ethical issues

We would like to draw the readers’ attention to the following. Firstly, the potential for bias in
both the training data and the generated explanations can perpetuate stereotypes or misinformation,
leading to harmful consequences, particularly in sensitive domains such as healthcare or legal
applications. There is also the risk that users may place undue trust in machine-generated
explanations, mistakenly believing them to be infallible. Finally, the collection and use of data
for training these models must adhere to strict privacy standards to ensure that individuals’ rights
are respected. Addressing these ethical challenges is essential to foster trust and ensure that AI
technologies are developed and used responsibly.

9. Data license and copyright issues

We release our original content under the MIT License. Please refer to the original dataset’s
copyright and license regulations for information on the derived data.

Acknowledgments

This work has been partially funded by PNRR project FAIR - Future AI Research (PE00000013), under
the NRRP MUR program funded by the NextGenerationEU and the ANTIDOTE project (CHIST-ERA grant of the
Call XAI 2019 of the ANR with the grant number Project-ANR-21-CHR4-0002).

References

 [1] S. Lowry, G. Macpherson, A blot on the profession, Brit. Med. J. 296 (1988) 657.
 [2] L. M. Fagan, E. H. Shortliffe, B. G. Buchanan, Computer-based medical decision making: from
     mycin to vm, Automedica 3 (1980) 97–108.
 [3] R. Bareiss, Exemplar-based knowledge acquisition: A unified approach to concept representation,
     classification, and learning, volume 2, Academic Press, 2014.
 [4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with
     latent diffusion models, 2021. arXiv:2112.10752.
 [5] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. J. Skerry-Ryan, Y. Jia, A. Rosenberg,
     B. Ramabhadran, Learning to speak fluently in a foreign language: Multilingual speech synthesis
     and cross-language voice cloning, CoRR abs/1907.04448 (2019). URL:
     http://arxiv.org/abs/1907.04448. arXiv:1907.04448.
 [6] Y. Mirsky, W. Lee, The creation and detection of deepfakes: A survey, ACM Comput. Surv. 54
     (2021). URL: https://doi.org/10.1145/3425780. doi:10.1145/3425780.
 [7] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
     N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry,
     P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter,
     P. Tillet, F. P. Such,

</pre>