<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Direct and Indirect Interpretations of Speech Acts: Evidence from Human Judgments and Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Massimiliano Orsini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “A. Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper introduces INDIR-IT (Indirectness for the Italian language), a linguistically informed, manually curated benchmark for evaluating large language models' (LLMs) understanding of indirect speech acts (ISAs) in Italian. By systematically contrasting conventionalized and non-conventionalized ISAs with literal interpretations, the corpus enables fine-grained assessment of pragmatic competence, an area still relatively underexplored compared to lexical and syntactic understanding. Preliminary results show that LLMs handle conventionalized ISAs relatively well, while performance on non-conventionalized ISAs remains more sensitive to model size and capacity. INDIR-IT offers a foundation for advancing research on pragmatic inference in both humans and LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Indirectness</kwd>
        <kwd>Speech acts</kwd>
        <kwd>Italian benchmark</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Human evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>These abilities are particularly relevant for designing more natural and humanlike dialogue systems.</p>
        <p>Since Vaswani et al.'s seminal work [<xref ref-type="bibr" rid="ref1">1</xref>], pre-trained large language models (LLMs) based on the transformer architecture have shown outstanding capabilities in understanding and generating natural language. However, these advances have also raised important concerns regarding interpretability. From a linguistic perspective, questions remain about the true nature and depth of the linguistic competence exhibited by these models [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>], and whether they can serve as computational evidence for usage-based theories of language [<xref ref-type="bibr" rid="ref5">4</xref>]. In response, a growing body of research has focused on improving interpretability and systematically evaluating LLMs across diverse linguistic domains. This is often achieved through the development of standardized benchmarks, i.e. datasets paired with metrics designed to evaluate various models on specific tasks.</p>
        <p>While substantial progress has been made in evaluating LLMs' syntactic, semantic, and general natural language understanding (NLU) abilities, pragmatic competence remains relatively underexplored despite its central role in human communication, where meaning depends on intentional language use, interactional context, and communicative effects [5]. This is due in part to the difficulty of operationalizing pragmatic phenomena, which encompass a wide range of abilities, such as resolving deixis, interpreting implicatures, understanding figurative language, adhering to conversational maxims, and deriving speaker intentions from indirect speech.</p>
        <p>In addition to the conceptual challenge, there is also a resource gap: most of the available resources are developed in English and often merely translated to fit another language. This practice risks neglecting language-specific pragmatic nuances and may compromise the validity and fidelity of evaluations conducted in non-English contexts.</p>
        <p>This article intends to address both of these challenges by focusing on a central yet underrepresented pragmatic phenomenon: indirectness. We outline a methodology for the construction of a dataset of indirect speech acts (ISAs) and a corresponding evaluation task in Italian. The dataset is designed with two complementary purposes: on the one hand, to measure the degree of competence of LLMs with regard to ISAs; and on the other, to provide insights into the interpretability of LLMs in processing indirectness in comparison with humans.</p>
        <p>Contributions. The contributions of this article can be briefly summarized in the following points:</p>
        <p>• A methodology for developing a benchmark of ISAs that accounts for both their variety and degree of conventionality;
• INDIR-IT, a manually curated Italian-language dataset and evaluation task constructed in accordance with this methodology1;
• Preliminary results comparing human and LLM performance, providing initial insights into how current models handle ISA-related pragmatic competence.</p>
        <p>1The dataset is freely available at: https://huggingface.co/datasets/MaxiOr/ISA</p>
        <sec id="sec-1-1-1">
          <title>2.1. Indirect Speech Acts</title>
          <p>In what follows, we first introduce key concepts from the linguistic literature on indirect speech acts and review existing NLP resources for evaluating model interpretability. We then present our novel dataset and describe the design of the associated evaluation task. Finally, we report and discuss the results of the human annotation study alongside preliminary evaluation outcomes across several LLMs.</p>
          <p>2. Related Works</p>
          <p>Within the domain of pragmatics, the concept of speech acts is central, as they are defined as the minimal unit of communication [<xref ref-type="bibr" rid="ref16">6</xref>]. In How to Do Things with Words [7], Austin distinguishes between what is said (locution), what is intended (illocution), and the effect produced on the hearer (perlocution). This distinction is crucial for the pragmatic phenomenon known as indirectness, where the locution and the illocution of an utterance do not correspond to each other.</p>
          <p>In Searle's framework [8], an indirect speech act is defined as the simultaneous performance of two speech acts: a primary act, which functions as the final intended meaning, and a secondary act that lends its locution to the primary act. This view, known as the standard pragmatic view or literal force hypothesis (LFH) [9], establishes that the illocution of the secondary act, the literal force, is always functional for the retrieval of the primary illocutionary force.</p>
          <p>However, this literal-first processing assumption is far from universally accepted. An alternative proposal, the Direct Access View advanced by Gibbs [<xref ref-type="bibr" rid="ref19">10</xref>], holds that listeners can often directly infer the intended meaning without fully processing the literal content, particularly when the context strongly supports a nonliteral reading. Several experimental studies support this view [11, 12, 13], especially in the case of conventionalized indirect speech acts, whose interpretation is often facilitated by lexicalized or syntactic triggers. Examples include indirect requests like “Can you V?” or indirect offers such as “Would you like to V?”, which are often processed rapidly and effortlessly.</p>
          <p>While conventionalized ISAs may often be identified via such surface cues, a large class of non-conventionalized ISAs remains highly context-dependent, as no fixed mapping exists between form and function. These acts require more complex inferential reasoning, often drawing on Theory of Mind (ToM) capacities [<xref ref-type="bibr" rid="ref20">14, 15</xref>] and sensitivity to subtle discourse-level cues.</p>
          <p>Importantly, despite decades of research, there is still no unified account of how indirect speech acts are processed. Competing models continue to propose differing mechanisms and processing orders, and much depends on contextual, cognitive, and conventional factors [16, 17]. This lack of consensus reflects not only the complexity of the phenomenon but also the variability observed even among human comprehenders.</p>
          <p>Since both conventionalized and non-conventionalized ISAs play a central role in human interaction, mastering indirectness remains a major challenge for language models, which must grapple with these multiple layers of pragmatic reasoning to approach human-like communicative competence.</p>
          <p>2.2. Pragmatics Understanding Benchmarks</p>
          <p>Despite some criticism [18, 19, 20], benchmarks remain a central tool for evaluating the performance of (large) language models across a wide range of tasks. They offer a standardized framework to compare models' capabilities and have become an essential part of LLM development and assessment. While benchmarks for syntax, semantics, and general NLU are well developed (including recent efforts tailored to Italian [21, 22]), resources targeting pragmatic competence remain scarce, especially in languages other than English. This is particularly true for ISAs, a complex and context-dependent pragmatic phenomenon. One broad multilingual initiative that includes pragmatics-related tasks is BIG-Bench [<xref ref-type="bibr" rid="ref30">23</xref>]. Although primarily aimed at probing the general capabilities of LLMs, it contains several tasks touching on pragmatics, including Implicature Recovery, which tests interpretation of indirect responses to polar questions (limited to binary yes/no inferences), and Intent Recognition, which evaluates models' ability to detect indirect requests.</p>
          <p>Another recent contribution is the Pragmatic Understanding Benchmark (PUB) [24], which aggregates multiple tasks focused on different aspects of pragmatic competence, such as figurative language, presupposition, deixis, and indirectness. In PUB, three tasks specifically target indirectness, based on the CIRCA [25] and GRICE [26] datasets. CIRCA offers indirect responses to polar questions and includes both a classification task distinguishing between direct and indirect answers and an interpretation task for identifying the implied meaning. The GRICE dataset similarly focuses on indirect replies but extends the scope by including scalar implicatures.</p>
          <p>Despite their usefulness, these datasets share several limitations. The context is minimal, often limited to a single question, which reduces the realism and ecological validity of the tasks. Additionally, the evaluation paradigm is typically binary or multiple choice, which may oversimplify the inherent ambiguity of non-conventionalized ISAs. The tasks also tend to focus on a narrow range of ISA types, particularly indirect responses to yes/no questions, as these are generally easier to generate and annotate.</p>
          <p>To address some of these limitations, Hu et al. [27] designed an indirectness understanding task embedded in short scenarios. Each item requires selecting the correct interpretation of an ISA from four options: the indirect meaning, the literal meaning, and two distractors. The task offers more variability in speech act combinations, though the dataset remains small (20 items total).</p>
          <p>A more ambitious approach is proposed by Roque et al. [28], who suggest using ISA schemas, modeled after Winograd schemas [29]. These consist of paired contexts designed to favor either a literal or an indirect reading of the same utterance. While this method introduces richer contexts and greater variability, it remains easily scalable with minimal expert intervention only if it is applied to a limited set of ISA types.</p>
          <p>To investigate whether LLMs (and humans) process conventionalized and non-conventionalized ISAs differently, the dataset is split into two parts: 40 scenarios featuring non-conventionalized ISAs (NC-ISAs) and 30 pairs of conventionalized ISAs. Each pair includes the same utterance embedded in two distinct contexts: one favoring the indirect reading (C-ISAs) and one favoring the literal reading (Lit). This design, inspired in part by Roque et al. [28], allows us to probe models for context sensitivity and bias in ISA interpretation. In summary, the indirect interpretation is considered the target reading for both non-conventionalized and conventionalized scenarios, while the literal interpretation is expected to be preferred in literal scenarios. Table 1 illustrates a representative example for each scenario type included in the dataset2.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Overview of INDIR-IT</title>
      <sec id="sec-2-1">
        <title>3.1. Internal Partitioning</title>
        <p>Inspired by Hu et al.'s work [27], the dataset presented in this paper consists of 100 scenarios. Each scenario includes a brief contextual description involving two characters, followed by an indirect speech act produced by one of the speakers. For each scenario, four candidate interpretations are provided: the indirect meaning, the literal meaning, and two lexical distractors, ranging from non sequiturs to another, albeit less plausible, literal interpretation.</p>
        <p>3.1.1. Scenario Design and Coverage</p>
        <p>In order to create a challenging and heterogeneous ISA dataset, the combinations of primary and secondary acts were designed to be as diverse as possible. However, some constraints limited this goal. First, not all primary acts can plausibly be expressed indirectly, as indirectness may conflict with their felicity conditions (e.g., declarations or promises). Second, not all secondary acts are equally suitable for every primary act, since the inferential paths required to recover the intended meaning of an ISA often follow conventionalized patterns.</p>
        <p>2Appendix D provides the English translation for all the examples reported in the paper.</p>
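        <p>To make the item structure concrete, the sketch below shows how a single INDIR-IT-style scenario could be represented in code. The field names and the example utterance are illustrative assumptions for exposition, not the published schema; the actual format is documented in the dataset card on the Hugging Face repository.</p>
        <preformat>
```python
# Illustrative sketch of one INDIR-IT-style item. Field names and the
# example utterance are assumptions; the published schema may differ.
item = {
    "scenario_type": "C-ISA",      # one of: "C-ISA", "NC-ISA", "Lit"
    "primary_act": "request",
    "secondary_act": "question",
    "context": "Fausto e Margherita stanno cucinando insieme...",
    "utterance": "Puoi passarmi il sale?",  # the 'Puoi V?' trigger
    "interpretations": {
        "indirect": "Fausto is asking Margherita to pass him the salt.",
        "literal": "Fausto is asking whether Margherita is able to pass the salt.",
        "distractor1": "Fausto is asking where the salt is kept.",
        "distractor2": "Fausto is complimenting the dish.",
    },
}

def is_valid_rating(score):
    # Plausibility judgments range from 1 (not plausible) to 5 (very plausible).
    return isinstance(score, int) and score in range(1, 6)
```
        </preformat>
        <p>In such a representation, literal scenarios would carry identical primary and secondary act labels, mirroring the labeling scheme described above.</p>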
        <p>To address these challenges and expand coverage, scenarios were crafted to include longer contextual windows, allowing us to probe models on less frequently explored primary/secondary act pairings.</p>
        <p>As a result, 26 distinct combinations were created for NC-ISAs, while 7 combinations were designed for C-ISAs, with indirect requests making up the majority. The difficulty of crafting different combinations for conventionalized ISAs might be due to the fact that indirectness is often adopted as a politeness strategy in order to decrease the imposing potential of such directive acts [8]; as a consequence, indirect requests might be the ISAs that most often undergo conventionalization.</p>
        <p>With regard to lexical triggers, the most represented is 'Puoi V?', functioning similarly to its English counterpart 'Can you V?'. However, the indirect meaning of conventionalized ISAs seems to be conveyed not only by a lexical entry but also by other factors such as modality, negation, and grammatical person. This becomes clear when comparing 'Puoi V?' with 'Posso V?', which conveys a different primary act, or 'Perché non V?' with 'Perché V?', the latter of which does not trigger any conventionalized ISA at all. Since conventionality is only assumed beforehand, we cannot rule out this possibility for other forms of the same triggers, which are consequently treated as triggers in their own right. Each utterance in the dataset is labeled with both its primary and secondary act types: in literal scenarios, these labels are identical, as they are not supposed to convey any indirect meaning.</p>
        <p>To clarify how these labels apply, we refer back to the examples in Table 1: in the non-conventionalized scenario, the primary act is labeled as a positive response, while the secondary act is a question, which reflects the indirect intention. In the conventionalized example, the utterance is a request (primary act) expressed through a question (secondary act). In the literal version of that scenario, both acts correspond to a question, with no indirectness involved.</p>
        <p>The whole dataset, along with a complete list of all primary/secondary act combinations and triggers, is provided in the dataset card of the Hugging Face repository.</p>
        <p>3.2. Task Design</p>
        <p>Based on the newly collected dataset, the task involves assigning a plausibility score ranging from 1 (not plausible) to 5 (very plausible) to each candidate interpretation of a given scenario. Rather than framing the task as a categorical classification, we opted for graded judgments in order to capture the intrinsic ambiguity of indirect speech acts, particularly in the case of NC-ISAs. In these cases, both the indirect and literal meanings may be conveyed simultaneously by the speaker, making it inappropriate to label any interpretation as definitively correct or incorrect. It is worth noting that similar caution may also apply to C-ISAs, at least until further empirical evidence confirms whether the Direct Access View systematically governs their interpretation in these contexts.</p>
        <p>To ensure comparability between human and model evaluations, annotation instructions and model prompts were aligned as closely as possible. For models, the prompts include structural tags: COMPITO precedes the task instructions, STORIA introduces the scenario, and the question "Cosa intende dire Fausto?" ("What does Fausto mean?") follows immediately after the scenario. These tags help delineate task components while maintaining the consistency of the input. In both the prompts and the human annotation interface, technical jargon is deliberately avoided. Interpretations are presented in random order and labeled with the tags a, b, c, and d to prevent any biases related to order effects.</p>
        <p>4. Human Annotation Procedure</p>
        <p>The human annotation task was conducted with a total of 21 native Italian speakers recruited via the Prolific crowdsourcing platform3. To ensure annotation quality, only participants who reported Italian as their first language and who had no known language-related disorders were included. The final sample was balanced for gender (10 females and 11 males), with participants ranging in age from 21 to 63 years (mean age: 31).</p>
        <p>To minimize the risk of participants inferring the purpose of the experiment and potentially biasing their responses, the raters were divided into three independent groups of seven annotators, with each group evaluating a different subset of the dataset. In order to avoid exposing participants to both members of the conventionalized/literal pairs, these pairs were distributed across the sets so that each participant only saw one member of any given pair.</p>
        <p>To limit the overall length of the task, each group was presented with a questionnaire containing 27 items. This distribution preserved the internal balance of the dataset while reducing the number of non-conventionalized scenarios included per set. Specifically, each questionnaire comprised 10 conventionalized scenarios, 10 literal scenarios, and 7 non-conventionalized scenarios, resulting in a total of 81 annotated items across the entire dataset.</p>
        <p>4.1. Results</p>
        <p>Results of the human annotation task are reported in Table 2 in terms of mean and standard deviation values for each interpretation. Recall that in both non-conventionalized and conventionalized scenarios, the indirect interpretation was considered the target reading, while in literal scenarios the literal interpretation was expected to be preferred. Overall, human participants aligned with these expectations and exhibited clear, context-sensitive interpretive preferences across the three scenario types.</p>
        <p>In conventionalized scenarios, the indirect interpretations received the highest ratings, consistent with expectations for conventionalized indirect speech acts. Literal interpretations in these scenarios were rated notably lower, indicating that participants were attuned to the pragmatics of the context.</p>
        <p>In non-conventionalized scenarios, indirect readings remained the most favored, though literal interpretations showed a moderate increase in ratings, suggesting greater interpretive ambiguity when conventional cues are weaker.</p>
        <p>In literal scenarios, participants rated both indirect and literal interpretations similarly, reflecting a balanced consideration of both meanings in contexts designed to support literal readings.</p>
        <p>Across all scenarios, distractor interpretations consistently received low ratings, demonstrating participants' ability to identify and reject implausible alternatives.</p>
        <p>Importantly, despite the different experimental paradigm, our findings offer additional support for the assumptions underlying Gibbs' Direct Access View of pragmatic comprehension [<xref ref-type="bibr" rid="ref19">10</xref>]. Specifically, the consistently high ratings for indirect interpretations, even in contexts explicitly constructed to favour literal readings, suggest that comprehenders often bypass literal meanings when indirect interpretations are pragmatically accessible. This reinforces the notion that pragmatic inference does not obligatorily follow from a literal-first processing strategy, but rather may arise directly from contextual and discourse-level cues.</p>
        <p>Additional support for this view emerges from the analysis of inter-annotator agreement, assessed using Krippendorff's α. For the entire annotated test set, we obtained a moderate agreement of α = 0.642.4 Values are consistently higher in the conventionalized items (α = 0.717) than in both the literal and the non-conventionalized ones (α = 0.59 and α = 0.6, respectively), consistent with lower agreement indicating a higher degree of interpretive ambiguity.</p>
        <p>4.2. Qualitative Analysis</p>
        <p>To gain an in-depth understanding of the human annotation performance, we carried out a closer examination of specific scenarios that feature contrasting results. In particular, we analyzed two conventionalized/literal pairs (presented in Table 3) and two non-conventionalized scenarios (Table 4). For brevity, we report only their mean ratings. The full scenarios and associated interpretations are provided in Appendix D.3.</p>
        <p>As mentioned in Section 3, different triggers may yield different outcomes, depending on their degree of conventionality. In the first conventionalized/literal pair in Table 3, featuring the trigger "Perché non...?" ("Why not...?"), the indirect interpretation was rated significantly higher in both scenarios. Conversely, in the second pair, involving the trigger "Si può sapere...?" ("Is it possible to know...?"), the indirect interpretation was rated higher only in the conventionalized scenario, as expected. This asymmetry suggests that while both "Perché non...?" and "Si può sapere...?" may be considered conventionalized ISAs due to their frequent use in indirect communication, they likely differ in how strongly they activate the indirect reading across contexts.</p>
        <p>Variation in conventionality is also evident in the non-conventionalized ISAs, depending on the inferential chain required to infer the indirect meaning, which results in different combinations of primary and secondary acts. As Searle [8] points out, the secondary act (i.e. the literal utterance of the sentence) often contains a reference to a preparatory condition of the primary act, which is considered one of the conditions that allow a speech act to be uttered felicitously. This holds for the first scenario in Table 4, where asking Margherita whether she has to work amounts to asking for her availability to go out, which can be loosely considered a preparatory condition for a subsequent proposal. Notably, this utterance may still be felicitous even if the speaker already knows the interlocutor's availability, highlighting its indirect character. In contrast, the second non-conventionalized scenario in Table 4 features a positive reply expressed through a promise that does not contain any reference to a preparatory condition. We hypothesize that this is the reason why the literal interpretation received the highest mean score in this scenario.</p>
        <p>4To further validate the reliability of the human annotations, Krippendorff's α was also computed separately for each of the three independent rater groups corresponding to the three questionnaires. The obtained values ranged from α = 0.485 to α = 0.754, indicating a consistent level of inter-annotator agreement across groups.</p>
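        <p>The agreement statistic used above can be illustrated with a minimal self-contained computation. The sketch below implements Krippendorff's α for complete data (no missing ratings) with an interval metric; it is a didactic simplification, and for ordinal plausibility ratings the dedicated krippendorff Python package, with an ordinal level of measurement, would be the more appropriate tool.</p>
        <preformat>
```python
# Minimal sketch of Krippendorff's alpha: interval metric, complete data,
# equal numbers of raters per item, at least two distinct values overall.
from itertools import combinations

def krippendorff_alpha(ratings):
    # ratings: one list per rater, each with one score per item.
    n_items = len(ratings[0])
    # Observed disagreement: squared differences between raters within items.
    observed = []
    for u in range(n_items):
        values = [rater[u] for rater in ratings]
        observed.extend((a - b) ** 2 for a, b in combinations(values, 2))
    # Expected disagreement: squared differences over all pooled values.
    pooled = [score for rater in ratings for score in rater]
    expected = [(a - b) ** 2 for a, b in combinations(pooled, 2)]
    return 1.0 - (sum(observed) / len(observed)) / (sum(expected) / len(expected))
```
        </preformat>
        <p>With perfect agreement the observed disagreement is zero and α equals 1; values near zero or below indicate agreement at or below chance level.</p>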
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Model Performance on INDIR-IT</title>
      <p>In conventionalized scenarios, models gave higher ratings to literal interpretations (GPT-4: M = 2.87; LLaMA: M = 3.80) than humans did (M = 2.57), suggesting less sensitivity in suppressing literal readings when indirect meanings are expected.</p>
      <p>In non-conventionalized scenarios, GPT-4 continued
to strongly favor indirect interpretations (M = 4.76),
more than humans (M = 4.22), while Gemini and LLaMA
showed weaker alignment (Ms = 3.43 and 3.48,
respectively). Literal ratings in NC scenarios were more
comparable between humans and GPT-4 (3.33 vs. 3.24), but
notably higher in LLaMA (M = 4.48), suggesting possible
overgeneration of literal readings.</p>
      <p>In literal scenarios, all models struggled to mirror the
human balance between literal and indirect
interpretations. LLaMA especially overvalued literal meanings, and
GPT-4 gave similar scores to both interpretations.
Distractor ratings remained low across models and humans,
though LLaMA occasionally overvalued distractors.</p>
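      <p>The per-condition means compared above can be reproduced from raw judgments with a simple aggregation, grouping ratings by scenario type and interpretation. The rating triples in this sketch are invented for illustration; they are not the paper's data.</p>
      <preformat>
```python
# Sketch: aggregating (scenario_type, interpretation, rating) triples
# into per-condition mean plausibility scores. Values are illustrative.
from collections import defaultdict
from statistics import mean

ratings = [
    ("C-ISA", "indirect", 5), ("C-ISA", "indirect", 4),
    ("C-ISA", "literal", 3), ("C-ISA", "literal", 2),
    ("NC-ISA", "indirect", 4), ("NC-ISA", "literal", 3),
    ("Lit", "literal", 4), ("Lit", "indirect", 4),
]

by_condition = defaultdict(list)
for scenario_type, interpretation, score in ratings:
    by_condition[(scenario_type, interpretation)].append(score)

means = {key: mean(scores) for key, scores in by_condition.items()}
print(means[("C-ISA", "indirect")])   # 4.5
print(means[("C-ISA", "literal")])    # 2.5
```
      </preformat>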
      <p>Overall, the findings suggest that while LLMs can approximate human pragmatic reasoning, especially in highly conventional contexts, they still lack the fine-grained contextual sensitivity and interpretive flexibility exhibited by human participants.</p>
      <sec id="sec-4-1">
        <title>5.1. Correlations between Human and Model Ratings</title>
        <p>This section presents a preliminary analysis of model performance on the INDIR-IT dataset. To this end, we evaluated three highly representative large language models, i.e. GPT-4o, Gemini 1.5 Flash, and Llama 3-8B Instruct, which differ in architecture, parameter size, and deployment setting. The primary goal here is not to exhaustively assess model performance on indirect speech acts, but rather to provide an initial demonstration of how the proposed dataset and methodology can be applied.</p>
        <p>The models were tested in a zero-shot setting, using the same uncoupled literal/conventionalized pairs as in the human annotation task. In line with [27], zero-shot prompting was meant to assess models' implicit knowledge of indirectness as acquired during pretraining, rather than to optimize performance through fine-tuning or task-specific prompting strategies.</p>
        <p>Figure 1 displays a general overview of the models' performance, along with the human reference. The detailed scores for all models are reported in Appendix B. Across scenarios, GPT-4 consistently showed the closest alignment with human preferences, particularly in identifying the most contextually appropriate interpretation.</p>
        <p>To assess alignment between LLMs and human interpretations on INDIR-IT, we computed Pearson correlations between their ratings across the three scenario types and, for each, across interpretation types. Table 5 presents a summary of these correlations, with an average score (AVG) reflecting overall agreement per scenario.</p>
        <p>Among the evaluated LLMs, GPT-4 demonstrates the most robust and scenario-generalizable alignment with human interpretive preferences, particularly in contexts requiring nuanced reasoning (NC, L). Gemini exhibits moderate alignment, reliably scoring literal and distractor interpretations but falling short in indirect meaning resolution. In contrast, LLaMA demonstrates the weakest and most inconsistent agreement, especially in non-conventionalized scenarios.</p>
        <p>In Table 6 we report the results of the models on the same scenarios discussed in Section 4.2. As can be seen, in the most challenging items, LLaMA often inverts the scores of the literal and indirect interpretations, assigning a higher score to the non-target option. Misalignment also frequently arises from disproportionately high scores assigned to distractors.</p>
        <p>More specifically, in conventionalized scenarios, all models approximated human preferences by assigning high ratings to indirect interpretations (GPT-4: M = 4.90; Gemini: M = 4.23; LLaMA: M = 4.90), with GPT-4 and LLaMA showing even stronger preferences than humans (M = 4.64).</p>
        <sec id="sec-4-2-1">
          <title>6. Discussion and Conclusion</title>
          <p>This study introduced INDIR-IT, a novel dataset for the Italian language specifically designed to enable nuanced investigations into the processing of indirect speech acts (ISAs) by both humans and large language models (LLMs). Unlike previous benchmarks, this dataset systematically contrasts conventionalized and non-conventionalized scenarios, alongside literal interpretations, thereby providing a fine-grained tool for assessing pragmatic competence. This design makes it possible not only to evaluate overall model performance, but also to explore differences in how various forms of indirectness are handled, both by human annotators and by computational systems.</p>
          <p>While the dataset and experimental task presented here constitute a preliminary implementation of this methodology, the results nonetheless offer several general insights into LLMs' pragmatic abilities, as well as into human performance. In terms of LLM performance, the findings consistently point to the role of model size in pragmatic competence. Larger models such as GPT-4o and Gemini 1.5 Flash display a markedly higher alignment with human judgments across all scenario types, while the smaller LLaMA 3-8B model struggles, particularly with non-conventionalized ISAs. The human annotation data also reveal meaningful patterns. As expected, indirect interpretations received higher and more consistent ratings in conventionalized scenarios, while literal and non-conventionalized scenarios elicited lower agreement levels, reflecting greater interpretive variability and ambiguity. Interestingly, this suggests that literal interpretations in literal scenarios are not necessarily fully transparent and may involve pragmatic inferencing comparable to that required for non-conventionalized ISAs, a finding that supports theoretical perspectives such as Gibbs' Direct Access View.</p>
          <table-wrap id="tab6">
            <label>Table 6</label>
            <caption><p>Model ratings (I = indirect, L = literal, D1/D2 = distractors) for the scenarios discussed in Section 4.2; scenario type is L (literal) or NC (non-conventionalized). The first scenario's label is not recoverable from the extraction.</p></caption>
            <table>
              <thead>
                <tr><th>Scenario</th><th>Type</th><th>Model</th><th>I</th><th>L</th><th>D1</th><th>D2</th></tr>
              </thead>
              <tbody>
                <tr><td/><td>L</td><td>GPT</td><td>1</td><td>5</td><td>1</td><td>2</td></tr>
                <tr><td/><td>L</td><td>Gemini</td><td>1</td><td>5</td><td>1</td><td>2</td></tr>
                <tr><td/><td>L</td><td>LLaMA</td><td>1</td><td>5</td><td>2</td><td>3</td></tr>
                <tr><td>Proposal as question</td><td>NC</td><td>GPT</td><td>5</td><td>4</td><td>1</td><td>2</td></tr>
                <tr><td>Proposal as question</td><td>NC</td><td>Gemini</td><td>4</td><td>2</td><td>1</td><td>1</td></tr>
                <tr><td>Proposal as question</td><td>NC</td><td>LLaMA</td><td>3</td><td>5</td><td>2</td><td>1</td></tr>
                <tr><td>Positive reply as promise</td><td>NC</td><td>GPT</td><td>4</td><td>5</td><td>1</td><td>1</td></tr>
                <tr><td>Positive reply as promise</td><td>NC</td><td>Gemini</td><td>2</td><td>5</td><td>1</td><td>1</td></tr>
                <tr><td>Positive reply as promise</td><td>NC</td><td>LLaMA</td><td>2</td><td>5</td><td>2</td><td>3</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Future work will aim to refine these preliminary results
by expanding both the empirical scope and the range of
model evaluations. In particular, INDIR-IT provides a
foundation for more systematic investigations into how
LLMs handle the interface between linguistic form,
context, and pragmatic inference. Moreover, this
methodology can be adopted to construct comparable datasets
in other languages. A partial translation of INDIR-IT
may also be feasible, but only for a subset of items, as
certain lexical triggers are language-specific, and some
non-conventionalized ISAs require culture-specific
background knowledge in order for their intended meaning
to be inferred.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Prompt</title>
      <sec id="sec-5-1">
        <title>Below is the prompt fed to the models. In bold, the portions that are removed for the human annotation instructions.</title>
<p>COMPITO: Leggerai delle storie brevi che descrivono
una situazione ordinaria tra due personaggi: Fausto
e Margherita. Ogni storia si conclude con una frase
che Fausto rivolge a Margherita. Per ogni storia vengono
fornite quattro possibili interpretazioni per
spiegare l'intenzione comunicativa della frase di Fausto, in
relazione alla situazione presentata. Ad ogni interpretazione,
dovrai assegnare un punteggio da 1 a 5, in base
alla sua plausibilità: (1 = non plausibile, 2 = poco plausibile,
3 = plausibile, 4 = più che plausibile, 5 = molto plausibile).
STORIA: Margherita non trova più il suo cellulare, così
chiede a Fausto se sa dove si trova e lui le dice: "Hai
sentito lo squillo provenire dalla cucina prima?"
Cosa intende dire Fausto?
a) Fausto vuole far sapere a Margherita che il suo cellulare è in cucina.
b) Fausto vuole sapere se Margherita ha fatto caso a un rumore proveniente dalla cucina.
c) Fausto intende dire che non ha la minima idea di dove si trovi il cellulare di Margherita.
d) Fausto vuole dire che a lui non importa se la loro conoscente sia sposata.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>B. Models’ Results</title>
      <sec id="sec-6-1">
        <title>This section reports the models’ results in terms of mean</title>
        <p>and standard deviation across each scenario and
interpretation types. Row Non-conventional 21 refers to the
results obtained from the same 21 items administered
to the annotators. Row Non-conventional 40 refers to all
non-conventionalized items of the dataset.</p>
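<p>As a minimal illustration (not part of the paper's materials), the per-cell means and standard deviations reported in these tables can be computed from raw 1-5 plausibility ratings as sketched below; the ratings shown are invented placeholders, not actual annotation data:</p>

```python
from statistics import mean, stdev

# Invented placeholder ratings (NOT the paper's data), keyed by
# (scenario type, interpretation type) as in the Appendix B tables.
ratings = {
    ("Conventional", "I"): [5, 5, 4, 5],
    ("Conventional", "L"): [2, 1, 2, 3],
    ("Non-conventional 21", "I"): [4, 3, 5, 4],
}

def summarize(table):
    # Mean and sample standard deviation per scenario/interpretation cell.
    return {
        key: (round(mean(scores), 2), round(stdev(scores), 2))
        for key, scores in table.items()
    }

for (scenario, interp), (m, s) in summarize(ratings).items():
    print(f"{scenario} / {interp}: mean={m}, sd={s}")
```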
      </sec>
    </sec>
    <sec id="sec-7">
      <title>C. Scenarios discussed in Section 4.2</title>
      <p>C.1. "Perché non...?"</p>
      <sec id="sec-7-1">
        <title>Conventionalized/Literal Pair</title>
        <sec id="sec-7-1-1">
          <title>CS: Margherita e Fausto stanno discutendo su cosa</title>
          <p>cucinare per cena. Fausto dice a Margherita:
LS: Margherita e Fausto stanno discutendo su cosa
cucinare per cena. Fausto però era convinto che
Margherita volesse fare la pizza, allora le dice:
ISA: "Perché non facciamo la pizza stasera?"
I: Fausto sta proponendo a Margherita di fare la pizza.
L: Fausto vuole capire perché non hanno più possibilità
di fare la pizza.</p>
          <p>D1: Fausto sta manifestando la sua frustrazione perché
non hanno ancora preso una decisione.</p>
          <p>D2: Fausto vuole far sapere a Margherita che lui non ha
proprio voglia di pizza.</p>
          <p>C.2. "Si può sapere...?"</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>Conventionalized/Literal Pair</title>
        <sec id="sec-7-2-1">
          <title>CS: Margherita sta cucinando, quando Fausto nota che</title>
          <p>sta per mettere lo zucchero al posto del sale nell’ acqua
della pasta. Fausto allora le dice:</p>
        </sec>
        <sec id="sec-7-2-2">
          <title>C: Fausto and Margherita have planned to go out to eat,</title>
<p>but Fausto feels a bit tired, so he says to Margherita: "Can you drive?"
L: Fausto and Margherita have planned to go out to eat,
but Margherita has a bit of a headache, so Fausto says to her: "Can you drive?"
a) Fausto wants Margherita to drive to the restaurant.
b) Fausto wants to make sure that Margherita is able to drive.
c) Fausto wants to know if Margherita has a driver's license.
d) Fausto means that he doesn't feel like going out for dinner.</p>
          <p>C.4. Positive Reply as Promise
NCS: Margherita chiede a Fausto se ci sia bisogno di
ritirare dei contanti dal bancomat, visto che hanno
programmato di fare un viaggio a breve. Fausto le risponde:
ISA: "Ci passo io domani".
I: Fausto intende dire che pensa che ci sia bisogno di contanti.
L: Fausto promette di passare domani a ritirare dei contanti.
D1: Fausto vuole che Margherita passi a ritirare i contanti.
D2: Fausto intende dire che pensa che non ci sia bisogno di contanti.</p>
          <p>D.3. Scenarios discussed in Section 4.2
"PERCHÉ NON?" PAIR - PROPOSAL AS QUESTION
CS: Margherita and Fausto are discussing what to cook
for dinner. Fausto says to Margherita: "Why don't we make pizza tonight?"
L: Margherita and Fausto are discussing what to cook
for dinner. However, Fausto was sure that Margherita
wanted to make pizza, so he says to her: "Why don't we make pizza tonight?"</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>D. English Translation of all the Examples Discussed in the Paper</title>
      <p>I: Fausto is suggesting making pizza to Margherita.
L: Fausto wants to understand why they no longer have
the possibility of making pizza.
D1: Fausto is expressing his frustration because they
haven't made a decision yet.
D2: Fausto wants to let Margherita know that he really
doesn't feel like eating pizza.</p>
      <p>D.1. Prompt
TASK: You will read short stories that describe an
ordinary situation between two characters: Fausto and
Margherita. Each story ends with a sentence that Fausto
addresses to Margherita. For each story, four possible
interpretations are provided to explain the communicative
intention of Fausto's sentence, in relation to the situation
presented. For each interpretation, you will have to assign
a score from 1 to 5, based on its plausibility: (1 = not
plausible, 2 = slightly plausible, 3 = plausible, 4 = more
than plausible, 5 = very plausible).</p>
      <p>"IS IT POSSIBLE TO KNOW" PAIR - REPROACH AS QUESTION
C: Margherita is cooking, when Fausto notices that she
is about to put sugar instead of salt in the pasta water.
Fausto then says to her: "Is it possible to know what you
are doing?"</p>
      <sec id="sec-8-1">
        <title>L: Margherita is cooking. Fausto smells a good smell</title>
        <p>coming from the kitchen, so he asks Margherita: "Is it
possible to know what you are doing?"
I: Fausto blames Margherita for her carelessness.</p>
        <p>L: Fausto wants to know what Margherita is cooking.</p>
        <p>D1: Fausto complains because Margherita keeps too
many things hidden from him.</p>
        <p>D2: Fausto offers to help Margherita cook.</p>
        <p>NON CONVENTIONAL - PROPOSAL AS QUESTION
Fausto wants to buy himself a new suit, but he doesn’t
trust his own taste in clothing, so he says to Margherita:
"Are you at work tomorrow morning?"
I: Fausto would like Margherita to go with him to help
him buy a new suit.</p>
        <p>L: Fausto wants to know if Margherita is working
tomorrow.</p>
        <p>D1: Fausto wants Margherita to stay home tomorrow.</p>
        <p>D2: Fausto wants to ask Margherita to buy him a new
suit.</p>
        <p>NON CONVENTIONAL - POSITIVE REPLY AS
PROMISE
Margherita asks Fausto if they need to withdraw some
cash from the ATM, given that they have planned to take
a trip soon. Fausto replies to her: "I’ll stop by tomorrow."
I: Fausto means that he thinks there is a need for cash.</p>
        <p>L: Fausto promises to come by tomorrow to pick up some
cash.</p>
        <p>D1: Fausto wants Margherita to come and collect the
cash.</p>
        <p>D2: Fausto means that he thinks there is no need for
cash.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: text
translation, paraphrasing and rewording. After using these tool(s)/service(s), the author(s) reviewed
and edited the content as needed and take(s) full responsibility for the publication's content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , arXiv (Cornell University) (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>What artificial neural networks can tell us about human language acquisition, in: Algebraic structures in natural language</article-title>
          , CRC Press,
          <year>2022</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Baroni, On the proper role of linguistically-oriented deep net analysis in linguistic theorizing, arXiv abs/2106.08694 (2021). URL: https://api.semanticscholar.org/CorpusID:235446467.</mixed-citation>
      </ref>
      <sec id="sec-9">
        <title>7. Limitations</title>
        <p>The limitations of this work concern both dataset construction and the experimental setup.
First, the selection of primary/secondary act combinations was not guided by their real distribution in Italian,
as such labeled data are currently unavailable. While INDIR-IT includes a variety of combinations, it may not
fully reflect natural frequencies. Future work could address this by expanding the dataset, possibly adopting
hybrid methods that combine expert annotation with corpus extraction, as fully automatic approaches are not
feasible given the contextual specificity required.
Second, inter-speaker variability poses challenges, especially in pragmatics. Since the task itself invites
interpretive variation, a larger pool of annotators would help mitigate individual differences in pragmatic competence.
Third, model outputs are also sensitive to sampling variability. In this study, hyperparameters such as
temperature, top-k, and top-p were not controlled. While allowing some randomness is appropriate given the
inherent ambiguity of the task, future studies should standardize these parameters across models to ensure
replicability and comparability.</p>
      </sec>
      <sec id="sec-10">
        <title>Acknowledgments</title>
        <p>This work has been supported by the project "XAI-CARE" funded by the European Union - Next Generation EU -
NRRP M6C2 "Investment 2.1 Enhancement and strengthening of biomedical research in the NHS"
(PNRR-MAD-2022-12376692_VADALA' - CUP F83C22002470001) and the project "Language Of Dreams: the relationship
between sleep mentation, neurophysiology, and neurological disorders" - PRIN 2022 2022BNE97C_SH4_PRIN2022.</p>
      </sec>
      <ref id="ref5">
        <mixed-citation>[4] R. Futrell, K. Mahowald, How linguistics learned to stop worrying and love the language models, arXiv preprint arXiv:2501.17047 (2025).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[5] D. Crystal, The Cambridge Encyclopedia of Language, Cambridge University Press, 2010. URL: https://books.google.it/books?id=J976wAEACAAJ.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[6] J. R. Searle, What is a speech act, 1996. URL: https://api.semanticscholar.org/CorpusID:142781882.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[7] J. L. Austin, How to Do Things with Words, Clarendon Press, Oxford [Eng.], 1962.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[8] J. R. Searle, Expression and Meaning: Studies in the Theory of Speech Acts, Cambridge University Press, Cambridge, 1979.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[9] S. C. Levinson, Pragmatics, Cambridge Textbooks in Linguistics, Cambridge University Press, Cambridge, 1983.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[10] R. W. Gibbs Jr, A new look at literal meaning in understanding what is said and implicated, Journal of Pragmatics 34 (2002) 457-486.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[11] R. W. Gibbs, Do people always process the literal meanings of indirect requests?, Journal of Experimental Psychology: Learning, Memory, and Cognition 9 (1983) 524-533.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[12] E. Marocchini, F. Domaneschi, "Can you read my mind?" Conventionalized indirect requests and theory of mind abilities, Journal of Pragmatics 193 (2022) 201-221. URL: https://www.sciencedirect.com/science/article/pii/S0378216622000819. doi:10.1016/j.pragma.2022.03.011.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[13] H. H. Clark, Responding to indirect speech acts, Cognitive Psychology 11 (1979) 430-477.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[14] S. Trott, B. Bergen, Individual differences in mentalizing capacity predict indirect request comprehension, Discourse Processes 56 (2019) 675-707. URL: https://doi.org/10.1080/0163853X.2018.1548219. doi:10.1080/0163853X.2018.1548219.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[15] J. Bašnáková, K. Weber, K. M. Petersson, J. van Berkum, P. Hagoort, Beyond the language given: The neural correlates of inferring speaker meaning, Cerebral Cortex 24 (2013) 2572-2578. URL: https://doi.org/10.1093/cercor/bht112. doi:10.1093/cercor/bht112.</mixed-citation>
      </ref>
      <ref id="ref20a">
        <mixed-citation>[16] P. Brown, S. C. Levinson, Politeness: Some Universals in Language Usage, Studies in Interactional Sociolinguistics, Cambridge University Press, Cambridge, 1987.</mixed-citation>
      </ref>
      <ref id="ref20b">
        <mixed-citation>A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, ..., Z. Wu, Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. URL: https://arxiv.org/abs/2206.04615. arXiv:2206.04615.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[17] R. W. Janney, H. Arndt, Intracultural tact versus intercultural tact, De Gruyter Mouton, Berlin, Boston, 1992, pp. 21-42. URL: https://doi.org/10.1515/9783110886542-004. doi:10.1515/9783110886542-004.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[18] S. R. Bowman, G. Dahl, What will it take to fix benchmarking in natural language understanding?, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4843-4855. URL: https://aclanthology.org/2021.naacl-main.385/. doi:10.18653/v1/2021.naacl-main.385.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[19] R. Aiyappa, J. An, H. Kwak, Y.-y. Ahn, Can we trust the evaluation on ChatGPT?, in: A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galstyan, J. Dhamala, A. Verma, T. Cao, A. Kumar, R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 47-54. URL: https://aclanthology.org/2023.trustnlp-1.5/. doi:10.18653/v1/2023.trustnlp-1.5.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] S. L. Sravanthi, M. Doshi, T. P. Kalyan, R. Murthy, P. Bhattacharyya, R. Dabre, PUB: A pragmatics understanding benchmark for assessing LLMs' pragmatics capabilities (2024).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] A. Louis, D. Roth, F. Radlinski, "I'd rather just go to bed": Understanding indirect answers, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7411-7425. URL: https://aclanthology.org/2020.emnlp-main.601. doi:10.18653/v1/2020.emnlp-main.601.</mixed-citation>
      </ref>
      <ref id="ref25a">
        <mixed-citation>[26] Z. Zheng, S. Qiu, L. Fan, Y. Zhu, S.-C. Zhu, GRICE: A grammar-based dataset for recovering implicature and conversational rEasoning, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 2074-2085. URL: https://aclanthology.org/2021.findings-acl.182. doi:10.18653/v1/2021.findings-acl.182.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Don't make your LLM an evaluation benchmark cheater</article-title>
          ,
          <source>arXiv preprint arXiv:2311.01964</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsuetaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scheutz</surname>
          </string-name>
          ,
          <article-title>Developing a corpus of indirect speech act schemas</article-title>
          , in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>228</lpage>
          . URL: https://aclanthology.org/2020.lrec-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Musacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>Calamita: Challenge the abilities of language models in Italian</article-title>
          ,
          <source>in: Italian Conference on Computational Linguistics</source>
          ,
          <year>2024</year>
          . URL: https://api.semanticscholar.org/CorpusID:275357573.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potertì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Federici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          , et al.,
          <article-title>ITALIC: An Italian culture-aware natural language benchmark</article-title>
          ,
          <source>in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</source>
          , April 29-May 4,
          <year>2025</year>
          , pp.
          <fpage>1469</fpage>
          -
          <lpage>1478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Levesque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Morgenstern</surname>
          </string-name>
          ,
          <article-title>The Winograd schema challenge</article-title>
          ,
          <source>in: Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'12</source>
          , AAAI Press,
          <year>2012</year>
          , pp.
          <fpage>552</fpage>
          -
          <lpage>561</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. M.</given-names>
            <surname>Shoeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garriga-Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kluska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lewkowycz</surname>
          </string-name>
          , et al.,
          <article-title>Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>