<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Direct and Indirect Interpretations of Speech Acts: Evidence from Human Judgments and Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Massimiliano Orsini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “A. Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper introduces INDIR-IT (Indirectness for the Italian language), a linguistically informed, manually curated benchmark for evaluating large language models' (LLMs) understanding of indirect speech acts (ISAs) in Italian. By systematically contrasting conventionalized and non-conventionalized ISAs with literal interpretations, the corpus enables fine-grained assessment of pragmatic competence, an area still relatively underexplored compared to lexical and syntactic understanding. Preliminary results show that LLMs handle conventionalized ISAs relatively well, while performance on non-conventionalized ISAs remains more sensitive to model size and capacity. INDIR-IT offers a foundation for advancing research on pragmatic inference in both humans and LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Indirectness</kwd>
        <kwd>Speech acts</kwd>
        <kwd>Italian benchmark</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Human evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>These abilities are particularly relevant for designing more natural and humanlike dialogue systems.</p>
        <p>Since Vaswani et al.'s seminal work [<xref ref-type="bibr" rid="ref1">1</xref>], pre-trained large language models (LLMs) based on the transformer architecture have shown outstanding capabilities in understanding and generating natural language. However, these advances have also raised important concerns regarding interpretability. From a linguistic perspective, questions remain about the true nature and depth of the linguistic competence exhibited by these models [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>], and whether they can serve as computational evidence for usage-based theories of language [<xref ref-type="bibr" rid="ref5">4</xref>]. In response, a growing body of research has focused on improving interpretability and systematically evaluating LLMs across diverse linguistic domains. This is often achieved through the development of standardized benchmarks, i.e. datasets paired with metrics designed to evaluate various models on specific tasks.</p>
        <p>While substantial progress has been made in evaluating LLMs' syntactic, semantic, and general natural language understanding (NLU) abilities, pragmatic competence remains relatively underexplored despite its central role in human communication, where meaning depends on intentional language use, interactional context, and communicative effects [5]. This is due in part to the difficulty of operationalizing pragmatic phenomena, which encompass a wide range of abilities, such as resolving deixis, interpreting implicatures, understanding figurative language, adhering to conversational maxims, and deriving speaker intentions from indirect speech.</p>
        <p>In addition to the conceptual challenge, there is also a resource gap: most of the available resources are developed in English and often merely translated to fit another language. This practice risks neglecting language-specific pragmatic nuances and may compromise the validity and fidelity of evaluations conducted in non-English contexts.</p>
        <p>This article intends to address both of these challenges by focusing on a central yet underrepresented pragmatic phenomenon: indirectness. We outline a methodology for the construction of a dataset of indirect speech acts (ISAs) and a corresponding evaluation task in Italian. The dataset is designed with two complementary purposes: on the one hand, to measure the degree of competence of LLMs with regard to ISAs; and on the other, to provide insights into the interpretability of LLMs in processing indirectness in comparison with humans.</p>
        <p>Contributions. The contributions of this article can be briefly summarized in the following points:</p>
        <p>• A methodology for developing a benchmark of ISAs that accounts for both their variety and degree of conventionality;
• INDIR-IT, a manually curated Italian-language dataset and evaluation task constructed in accordance with this methodology1;
• Preliminary results comparing human and LLM performance, providing initial insights into how current models handle ISA-related pragmatic competence.</p>
        <p>1The dataset is freely available at: https://huggingface.co/datasets/MaxiOr/ISA</p>
        <sec id="sec-1-1-1">
          <title>2.1. Indirect Speech Acts</title>
          <p>In what follows, we first introduce key concepts from the linguistic literature on indirect speech acts and review existing NLP resources for evaluating model interpretability. We then present our novel dataset and describe the design of the associated evaluation task. Finally, we report and discuss the results of the human annotation study alongside preliminary evaluation outcomes across several LLMs.</p>
          <p>2. Related Works</p>
          <p>Within the domain of pragmatics, the concept of speech acts is central, as they are defined as the minimal unit of communication [<xref ref-type="bibr" rid="ref16">6</xref>]. In How to Do Things with Words [7], Austin distinguishes between what is said (locution), what is intended (illocution), and the effect produced on the hearer (perlocution). This distinction is crucial for the pragmatic phenomenon known as indirectness, where the locution and the illocution of an utterance do not correspond to each other.</p>
          <p>In Searle's framework [8], an indirect speech act is defined as the simultaneous performance of two speech acts: a primary act, which functions as the final intended meaning, and a secondary act that lends its locution to the primary act. This view, known as the standard pragmatic view or literal force hypothesis (LFH) [9], establishes that the illocution of the secondary act, the literal force, is always functional for the retrieval of the primary illocutionary force.</p>
          <p>However, this literal-first processing assumption is far from universally accepted. An alternative proposal, the Direct Access View advanced by Gibbs [<xref ref-type="bibr" rid="ref19">10</xref>], holds that listeners can often directly infer the intended meaning without fully processing the literal content, particularly when the context strongly supports a nonliteral reading. Several experimental studies support this view [11, 12, 13], especially in the case of conventionalized indirect speech acts, whose interpretation is often facilitated by lexicalized or syntactic triggers. Examples include indirect requests like “Can you V?” or indirect offers such as “Would you like to V?”, which are often processed rapidly and effortlessly.</p>
          <p>While conventionalized ISAs may often be identified via such surface cues, a large class of non-conventionalized ISAs remains highly context-dependent, as no fixed mapping exists between form and function. These acts require more complex inferential reasoning, often drawing on Theory of Mind (ToM) capacities [<xref ref-type="bibr" rid="ref20">14, 15</xref>] and sensitivity to subtle discourse-level cues.</p>
          <p>Importantly, despite decades of research, there is still no unified account of how indirect speech acts are processed. Competing models continue to propose differing mechanisms and processing orders, and much depends on contextual, cognitive, and conventional factors [16, 17]. This lack of consensus reflects not only the complexity of the phenomenon but also the variability observed even among human comprehenders.</p>
          <p>Since both conventionalized and non-conventionalized ISAs play a central role in human interaction, mastering indirectness remains a major challenge for language models, which must grapple with these multiple layers of pragmatic reasoning to approach human-like communicative competence.</p>
          <p>2.2. Pragmatics Understanding Benchmarks</p>
          <p>Despite some criticism [18, 19, 20], benchmarks remain a central tool for evaluating the performance of (large) language models across a wide range of tasks. They offer a standardized framework to compare models' capabilities and have become an essential part of LLM development and assessment. While benchmarks for syntax, semantics, and general NLU are well developed (including recent efforts tailored to Italian [21, 22]), resources targeting pragmatic competence remain scarce, especially in languages other than English. This is particularly true for ISAs, a complex and context-dependent pragmatic phenomenon. One broad multilingual initiative that includes pragmatics-related tasks is BIG-Bench [<xref ref-type="bibr" rid="ref30">23</xref>]. Although primarily aimed at probing the general capabilities of LLMs, it contains several tasks touching on pragmatics, including Implicature Recovery, which tests interpretation of indirect responses to polar questions (limited to binary yes/no inferences), and Intent Recognition, which evaluates models' ability to detect indirect requests.</p>
          <p>Another recent contribution is the Pragmatic Understanding Benchmark (PUB) [24], which aggregates multiple tasks focused on different aspects of pragmatic competence, such as figurative language, presupposition, deixis, and indirectness. In PUB, three tasks specifically target indirectness, based on the CIRCA [25] and GRICE [26] datasets. CIRCA offers indirect responses to polar questions and includes both a classification task distinguishing between direct and indirect answers and an interpretation task for identifying the implied meaning. The GRICE dataset similarly focuses on indirect replies but extends the scope by including scalar implicatures.</p>
          <p>Despite their usefulness, these datasets share several limitations. The context is minimal, often limited to a single question, which reduces the realism and ecological validity of the tasks. Additionally, the evaluation paradigm is typically binary or multiple choice, which may oversimplify the inherent ambiguity of non-conventionalized ISAs. The tasks also tend to focus on a narrow range of ISA types, particularly indirect responses to yes/no questions, as these are generally easier to generate and annotate.</p>
          <p>To address some of these limitations, Hu et al. [27] designed an indirectness understanding task embedded in short scenarios. Each item requires selecting the correct interpretation of an ISA from four options: the indirect meaning, the literal meaning, and two distractors. The task offers more variability in speech act combinations, though the dataset remains small (20 items total).</p>
          <p>A more ambitious approach is proposed by Roque et al. [28], who suggest using ISA schemas, modeled after Winograd schemas [29]. These consist of paired contexts designed to favor either a literal or an indirect reading of the same utterance. While this method introduces richer contexts and greater variability, it remains easily scalable with minimal expert intervention only if it is applied to a limited set of ISA types.</p>
          <p>To investigate whether LLMs (and humans) process conventionalized and non-conventionalized ISAs differently, the dataset is split into two parts: 40 scenarios featuring non-conventionalized ISAs (NC-ISAs) and 30 pairs of conventionalized ISAs. Each pair includes the same utterance embedded in two distinct contexts: one favoring the indirect reading (C-ISAs) and one favoring the literal reading (Lit). This design, inspired in part by Roque et al. [28], allows us to probe models for context sensitivity and bias in ISA interpretation. In summary, the indirect interpretation is considered the target reading for both non-conventionalized and conventionalized scenarios, while the literal interpretation is expected to be preferred in literal scenarios. Table 1 illustrates a representative example for each scenario type included in the dataset2.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Overview of INDIR-IT</title>
      <sec id="sec-2-1">
        <title>3.1. Internal Partitioning</title>
        <p>Inspired by Hu et al.'s work [27], the dataset presented in this paper consists of 100 scenarios. Each scenario includes a brief contextual description involving two characters, followed by an indirect speech act produced by one of the speakers. For each scenario, four candidate interpretations are provided: the indirect meaning, the literal meaning, and two lexical distractors, ranging from non sequiturs to another, albeit less plausible, literal interpretation.</p>
        <p>3.1.1. Scenario Design and Coverage</p>
        <p>In order to create a challenging and heterogeneous ISA dataset, the combinations of primary and secondary acts were designed to be as diverse as possible. However, some constraints limited this goal. First, not all primary acts can plausibly be expressed indirectly, as indirectness may conflict with their felicity conditions (e.g., declarations or promises). Second, not all secondary acts are equally suitable for every primary act, since the inferential paths required to recover the intended meaning of an ISA often follow conventionalized patterns.</p>
        <p>2Appendix D provides the English translation for all the examples reported in the paper.</p>
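        <p>To make the item structure concrete, the sketch below shows how a single INDIR-IT-style scenario could be represented in code. The field names and the example utterance are illustrative assumptions for exposition, not the published schema; the actual format is documented in the dataset card on the Hugging Face repository.</p>
        <preformat>
```python
# Illustrative sketch of one INDIR-IT-style item. Field names and the
# example utterance are assumptions; the published schema may differ.
item = {
    "scenario_type": "C-ISA",      # one of: "C-ISA", "NC-ISA", "Lit"
    "primary_act": "request",
    "secondary_act": "question",
    "context": "Fausto e Margherita stanno cucinando insieme...",
    "utterance": "Puoi passarmi il sale?",  # the 'Puoi V?' trigger
    "interpretations": {
        "indirect": "Fausto is asking Margherita to pass him the salt.",
        "literal": "Fausto is asking whether Margherita is able to pass the salt.",
        "distractor1": "Fausto is asking where the salt is kept.",
        "distractor2": "Fausto is complimenting the dish.",
    },
}

def is_valid_rating(score):
    # Plausibility judgments range from 1 (not plausible) to 5 (very plausible).
    return isinstance(score, int) and score in range(1, 6)
```
        </preformat>
        <p>In such a representation, literal scenarios would carry identical primary and secondary act labels, mirroring the labeling scheme described above.</p>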
        <p>To address these challenges and expand coverage, scenarios were crafted to include longer contextual windows, allowing us to probe models on less frequently explored primary/secondary act pairings.</p>
        <p>As a result, 26 distinct combinations were created for NC-ISAs, while 7 combinations were designed for C-ISAs, with indirect requests making up the majority. The difficulty of crafting different combinations for conventionalized ISAs might be due to the fact that indirectness is often adopted as a politeness strategy in order to decrease the imposing potential of such directive acts [8]; as a consequence, indirect requests might be the ISAs that most often undergo conventionalization.</p>
        <p>With regard to lexical triggers, the most represented is 'Puoi V?', functioning similarly to its English counterpart 'Can you V?'. However, the indirect meaning of conventionalized ISAs seems to be conveyed not only by a lexical entry but also by other factors such as modality, negation, and grammatical person. This becomes clear when comparing 'Puoi V?' with 'Posso V?', which conveys a different primary act, or 'Perché non V?' with 'Perché V?', the latter of which does not trigger any conventionalized ISA at all. Since conventionality is only assumed beforehand, we cannot rule out this possibility for other forms of the same triggers, which are consequently treated as triggers in their own right. Each utterance in the dataset is labeled with both its primary and secondary act types: in literal scenarios, these labels are identical, as they are not supposed to convey any indirect meaning.</p>
        <p>To clarify how these labels apply, we refer back to the examples in Table 1: in the non-conventionalized scenario, the primary act is labeled as a positive response, while the secondary act is a question, which reflects the indirect intention. In the conventionalized example, the utterance is a request (primary act) expressed through a question (secondary act). In the literal version of that scenario, both acts correspond to a question, with no indirectness involved.</p>
        <p>The whole dataset, along with a complete list of all primary/secondary act combinations and triggers, is provided in the dataset card of the Hugging Face repository.</p>
        <p>3.2. Task Design</p>
        <p>Based on the newly collected dataset, the task involves assigning a plausibility score ranging from 1 (not plausible) to 5 (very plausible) to each candidate interpretation of a given scenario. Rather than framing the task as a categorical classification, we opted for graded judgments in order to capture the intrinsic ambiguity of indirect speech acts, particularly in the case of NC-ISAs. In these cases, both the indirect and literal meanings may be conveyed simultaneously by the speaker, making it inappropriate to label any interpretation as definitively correct or incorrect. It is worth noting that similar caution may also apply to C-ISAs, at least until further empirical evidence confirms whether the Direct Access View systematically governs their interpretation in these contexts.</p>
        <p>To ensure comparability between human and model evaluations, annotation instructions and model prompts were aligned as closely as possible. For models, the prompts include structural tags: COMPITO precedes the task instructions, STORIA introduces the scenario, and the question "Cosa intende dire Fausto?" ("What does Fausto mean?") follows immediately after the scenario. These tags help delineate task components while maintaining the consistency of the input. In both the prompts and the human annotation interface, technical jargon is deliberately avoided. Interpretations are presented in random order and labeled with the tags a, b, c, and d to prevent any biases related to order effects.</p>
        <p>4. Human Annotation Procedure</p>
        <p>The human annotation task was conducted with a total of 21 native Italian speakers recruited via the Prolific crowdsourcing platform3. To ensure annotation quality, only participants who reported Italian as their first language and who had no known language-related disorders were included. The final sample was balanced for gender (10 females and 11 males), with participants ranging in age from 21 to 63 years (mean age: 31).</p>
        <p>To minimize the risk of participants inferring the purpose of the experiment and potentially biasing their responses, the raters were divided into three independent groups of seven annotators, with each group evaluating a different subset of the dataset. In order to avoid exposing participants to both members of the conventionalized/literal pairs, these pairs were distributed across the sets so that each participant only saw one member of any given pair.</p>
        <p>To limit the overall length of the task, each group was presented with a questionnaire containing 27 items. This distribution preserved the internal balance of the dataset while reducing the number of non-conventionalized scenarios included per set. Specifically, each questionnaire comprised 10 conventionalized scenarios, 10 literal scenarios, and 7 non-conventionalized scenarios, resulting in a total of 81 annotated items across the entire dataset.</p>
        <p>4.1. Results</p>
        <p>Results of the human annotation task are reported in Table 2 in terms of mean and standard deviation values for each interpretation. Recall that in both non-conventionalized and conventionalized scenarios, the indirect interpretation was considered the target reading, while in literal scenarios the literal interpretation was expected to be preferred. Overall, human participants aligned with these expectations and exhibited clear, context-sensitive interpretive preferences across the three scenario types.</p>
        <p>In conventionalized scenarios, the indirect interpretations received the highest ratings, consistent with expectations for conventionalized indirect speech acts. Literal interpretations in these scenarios were rated notably lower, indicating that participants were attuned to the pragmatics of the context.</p>
        <p>In non-conventionalized scenarios, indirect readings remained the most favored, though literal interpretations showed a moderate increase in ratings, suggesting greater interpretive ambiguity when conventional cues are weaker.</p>
        <p>In literal scenarios, participants rated both indirect and literal interpretations similarly, reflecting a balanced consideration of both meanings in contexts designed to support literal readings.</p>
        <p>Across all scenarios, distractor interpretations consistently received low ratings, demonstrating participants' ability to identify and reject implausible alternatives.</p>
        <p>Importantly, despite the different experimental paradigm, our findings offer additional support for the assumptions underlying Gibbs' Direct Access View of pragmatic comprehension [<xref ref-type="bibr" rid="ref19">10</xref>]. Specifically, the consistently high ratings for indirect interpretations, even in contexts explicitly constructed to favour literal readings, suggest that comprehenders often bypass literal meanings when indirect interpretations are pragmatically accessible. This reinforces the notion that pragmatic inference does not obligatorily follow from a literal-first processing strategy, but rather may arise directly from contextual and discourse-level cues.</p>
        <p>Additional support for this view emerges from the analysis of inter-annotator agreement, assessed using Krippendorff's α. For the entire annotated test set, we obtained a moderate agreement of α = 0.642.4 Values are consistently higher in the conventionalized items (α = 0.717) than in both the literal and the non-conventionalized ones (α = 0.59 and α = 0.6, respectively), consistent with lower agreement indicating a higher degree of interpretive ambiguity.</p>
        <p>4.2. Qualitative Analysis</p>
        <p>To gain an in-depth understanding of the human annotation performance, we carried out a closer examination of specific scenarios that feature contrasting results. In particular, we analyzed two conventionalized/literal pairs (presented in Table 3) and two non-conventionalized scenarios (Table 4). For brevity, we report only their mean ratings. The full scenarios and associated interpretations are provided in Appendix D.3.</p>
        <p>As mentioned in Section 3, different triggers may yield different outcomes, depending on their degree of conventionality. In the first conventionalized/literal pair in Table 3, featuring the trigger "Perché non...?" ("Why not...?"), the indirect interpretation was rated significantly higher in both scenarios. Conversely, in the second pair, involving the trigger "Si può sapere...?" ("Is it possible to know...?"), the indirect interpretation was rated higher only in the conventionalized scenario, as expected. This asymmetry suggests that while both "Perché non...?" and "Si può sapere...?" may be considered conventionalized ISAs due to their frequent use in indirect communication, they likely differ in how strongly they activate the indirect reading across contexts.</p>
        <p>Variation in conventionality is also evident in the non-conventionalized ISAs, depending on the inferential chain required to infer the indirect meaning, which results in different combinations of primary and secondary acts. As Searle [8] points out, the secondary act (i.e. the literal utterance of the sentence) often contains a reference to a preparatory condition of the primary act, which is considered one of the conditions that allow a speech act to be uttered felicitously. This holds for the first scenario in Table 4, where asking Margherita whether she has to work amounts to asking for her availability to go out, which can be loosely considered a preparatory condition for a subsequent proposal. Notably, this utterance may still be felicitous even if the speaker already knows the interlocutor's availability, highlighting its indirect character. In contrast, the second non-conventionalized scenario in Table 4 features a positive reply expressed through a promise that does not contain any reference to a preparatory condition. We hypothesize that this is the reason why the literal interpretation received the highest mean score in this scenario.</p>
        <p>4To further validate the reliability of the human annotations, Krippendorff's α was also computed separately for each of the three independent rater groups corresponding to the three questionnaires. The obtained values ranged from α = 0.485 to α = 0.754, indicating a consistent level of inter-annotator agreement across groups.</p>
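        <p>The agreement statistic used above can be illustrated with a minimal self-contained computation. The sketch below implements Krippendorff's α for complete data (no missing ratings) with an interval metric; it is a didactic simplification, and for ordinal plausibility ratings the dedicated krippendorff Python package, with an ordinal level of measurement, would be the more appropriate tool.</p>
        <preformat>
```python
# Minimal sketch of Krippendorff's alpha: interval metric, complete data,
# equal numbers of raters per item, at least two distinct values overall.
from itertools import combinations

def krippendorff_alpha(ratings):
    # ratings: one list per rater, each with one score per item.
    n_items = len(ratings[0])
    # Observed disagreement: squared differences between raters within items.
    observed = []
    for u in range(n_items):
        values = [rater[u] for rater in ratings]
        observed.extend((a - b) ** 2 for a, b in combinations(values, 2))
    # Expected disagreement: squared differences over all pooled values.
    pooled = [score for rater in ratings for score in rater]
    expected = [(a - b) ** 2 for a, b in combinations(pooled, 2)]
    return 1.0 - (sum(observed) / len(observed)) / (sum(expected) / len(expected))
```
        </preformat>
        <p>With perfect agreement the observed disagreement is zero and α equals 1; values near zero or below indicate agreement at or below chance level.</p>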
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Model Performance on INDIR-IT</title>
      <p>In conventionalized scenarios, models gave higher ratings to literal interpretations (GPT-4: M = 2.87; LLaMA: M = 3.80) than humans did (M = 2.57), suggesting less sensitivity in suppressing literal readings when indirect meanings are expected.</p>
      <p>In non-conventionalized scenarios, GPT-4 continued
to strongly favor indirect interpretations (M = 4.76),
more than humans (M = 4.22), while Gemini and LLaMA
showed weaker alignment (Ms = 3.43 and 3.48,
respectively). Literal ratings in NC scenarios were more
comparable between humans and GPT-4 (3.33 vs. 3.24), but
notably higher in LLaMA (M = 4.48), suggesting possible
overgeneration of literal readings.</p>
      <p>In literal scenarios, all models struggled to mirror the
human balance between literal and indirect
interpretations. LLaMA especially overvalued literal meanings, and
GPT-4 gave similar scores to both interpretations.
Distractor ratings remained low across models and humans,
though LLaMA occasionally overvalued distractors.</p>
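      <p>The per-condition means compared above can be reproduced from raw judgments with a simple aggregation, grouping ratings by scenario type and interpretation. The rating triples in this sketch are invented for illustration; they are not the paper's data.</p>
      <preformat>
```python
# Sketch: aggregating (scenario_type, interpretation, rating) triples
# into per-condition mean plausibility scores. Values are illustrative.
from collections import defaultdict
from statistics import mean

ratings = [
    ("C-ISA", "indirect", 5), ("C-ISA", "indirect", 4),
    ("C-ISA", "literal", 3), ("C-ISA", "literal", 2),
    ("NC-ISA", "indirect", 4), ("NC-ISA", "literal", 3),
    ("Lit", "literal", 4), ("Lit", "indirect", 4),
]

by_condition = defaultdict(list)
for scenario_type, interpretation, score in ratings:
    by_condition[(scenario_type, interpretation)].append(score)

means = {key: mean(scores) for key, scores in by_condition.items()}
print(means[("C-ISA", "indirect")])   # 4.5
print(means[("C-ISA", "literal")])    # 2.5
```
      </preformat>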
      <p>Overall, the findings suggest that while LLMs can approximate human pragmatic reasoning, especially in highly conventional contexts, they still lack the fine-grained contextual sensitivity and interpretive flexibility exhibited by human participants.</p>
      <sec id="sec-4-1">
        <title>5.1. Correlations between Human and Model Ratings</title>
        <p>This section presents a preliminary analysis of model performance on the INDIR-IT dataset. To this end, we evaluated three highly representative large language models, i.e. GPT-4o, Gemini 1.5 Flash, and Llama 3-8B Instruct, which differ in architecture, parameter size, and deployment setting. The primary goal here is not to exhaustively assess model performance on indirect speech acts, but rather to provide an initial demonstration of how the proposed dataset and methodology can be applied.</p>
        <p>The models were tested in a zero-shot setting, using the same uncoupled literal/conventionalized pairs as in the human annotation task. In line with [27], zero-shot prompting was meant to assess models' implicit knowledge of indirectness as acquired during pretraining, rather than to optimize performance through fine-tuning or task-specific prompting strategies.</p>
        <p>Figure 1 displays a general overview of the models' performance, along with the human reference. The detailed scores for all models are reported in Appendix B. Across scenarios, GPT-4 consistently showed the closest alignment with human preferences, particularly in identifying the most contextually appropriate interpretation.</p>
        <p>To assess alignment between LLMs and human interpretations on INDIR-IT, we computed Pearson correlations between their ratings across the three scenario types and, for each, across interpretation types. Table 5 presents a summary of these correlations, with an average score (AVG) reflecting overall agreement per scenario.</p>
        <p>Among the evaluated LLMs, GPT-4 demonstrates the most robust and scenario-generalizable alignment with human interpretive preferences, particularly in contexts requiring nuanced reasoning (NC, L). Gemini exhibits moderate alignment, reliably scoring literal and distractor interpretations but falling short in indirect meaning resolution. In contrast, LLaMA demonstrates the weakest and most inconsistent agreement, especially in non-conventionalized scenarios.</p>
        <p>In Table 6 we report the results of the models on the same scenarios discussed in Section 4.2. As can be seen, in the most challenging items, LLaMA often inverts the scores of the literal and indirect interpretations, assigning a higher score to the non-target option. Misalignment also frequently arises from disproportionately high scores assigned to distractors.</p>
        <p>More specifically, in conventionalized scenarios, all models approximated human preferences by assigning high ratings to indirect interpretations (GPT-4: M = 4.90; Gemini: M = 4.23; LLaMA: M = 4.90), with GPT-4 and LLaMA showing even stronger preferences than humans (M = 4.64).</p>
        <sec id="sec-4-2-1">
          <title>6. Discussion and Conclusion</title>
          <p>This study introduced INDIR-IT, a novel dataset for the Italian language specifically designed to enable nuanced investigations into the processing of indirect speech acts (ISAs) by both humans and large language models (LLMs). Unlike previous benchmarks, this dataset systematically contrasts conventionalized and non-conventionalized scenarios, alongside literal interpretations, thereby providing a fine-grained tool for assessing pragmatic competence. This design makes it possible not only to evaluate overall model performance, but also to explore differences in how various forms of indirectness are handled, both by human annotators and by computational systems.</p>
          <p>While the dataset and experimental task presented here constitute a preliminary implementation of this methodology, the results nonetheless offer several general insights into LLMs' pragmatic abilities, as well as into human performance. In terms of LLM performance, the findings consistently point to the role of model size in pragmatic competence. Larger models such as GPT-4o and Gemini 1.5 Flash display a markedly higher alignment with human judgments across all scenario types, while the smaller LLaMA 3-8B model struggles, particularly with non-conventionalized ISAs. The human annotation data also reveal meaningful patterns. As expected, indirect interpretations received higher and more consistent ratings in conventionalized scenarios, while literal and non-conventionalized scenarios elicited lower agreement levels, reflecting greater interpretive variability and ambiguity. Interestingly, this suggests that literal interpretations in literal scenarios are not necessarily fully transparent and may involve pragmatic inferencing comparable to that required for non-conventionalized ISAs, a finding that supports theoretical perspectives such as Gibbs' Direct Access View.</p>
          <table-wrap id="tab6">
            <label>Table 6</label>
            <caption><p>Model ratings (I = indirect, L = literal, D1/D2 = distractors) for the scenarios discussed in Section 4.2; scenario type is L (literal) or NC (non-conventionalized). The first scenario's label is not recoverable from the extraction.</p></caption>
            <table>
              <thead>
                <tr><th>Scenario</th><th>Type</th><th>Model</th><th>I</th><th>L</th><th>D1</th><th>D2</th></tr>
              </thead>
              <tbody>
                <tr><td/><td>L</td><td>GPT</td><td>1</td><td>5</td><td>1</td><td>2</td></tr>
                <tr><td/><td>L</td><td>Gemini</td><td>1</td><td>5</td><td>1</td><td>2</td></tr>
                <tr><td/><td>L</td><td>LLaMA</td><td>1</td><td>5</td><td>2</td><td>3</td></tr>
                <tr><td>Proposal as question</td><td>NC</td><td>GPT</td><td>5</td><td>4</td><td>1</td><td>2</td></tr>
                <tr><td>Proposal as question</td><td>NC</td><td>Gemini</td><td>4</td><td>2</td><td>1</td><td>1</td></tr>
                <tr><td>Proposal as question</td><td>NC</td><td>LLaMA</td><td>3</td><td>5</td><td>2</td><td>1</td></tr>
                <tr><td>Positive reply as promise</td><td>NC</td><td>GPT</td><td>4</td><td>5</td><td>1</td><td>1</td></tr>
                <tr><td>Positive reply as promise</td><td>NC</td><td>Gemini</td><td>2</td><td>5</td><td>1</td><td>1</td></tr>
                <tr><td>Positive reply as promise</td><td>NC</td><td>LLaMA</td><td>2</td><td>5</td><td>2</td><td>3</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Future work will aim to refine these preliminary results
by expanding both the empirical scope and the range of
model evaluations. In particular, INDIR-IT provides a
foundation for more systematic investigations into how
LLMs handle the interface between linguistic form,
context, and pragmatic inference. Moreover, this
methodology can be adopted to construct comparable datasets
in other languages. A partial translation of INDIR-IT
may also be feasible, but only for a subset of items, as
certain lexical triggers are language-specific, and some
non-conventionalized ISAs require culture-specific
background knowledge in order for their intended meaning
to be inferred.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Prompt</title>
      <sec id="sec-5-1">
        <title>Below is the prompt fed to the models. In bold, the portions that are removed for the human annotation instructions.</title>
<p>COMPITO: Leggerai delle storie brevi che descrivono
una situazione ordinaria tra due personaggi: Fausto
e Margherita. Ogni storia si conclude con una frase
che Fausto rivolge a Margherita. Per ogni storia vengono
fornite quattro possibili interpretazioni per
spiegare l'intenzione comunicativa della frase di Fausto, in
relazione alla situazione presentata. Ad ogni interpretazione,
dovrai assegnare un punteggio da 1 a 5, in base
alla sua plausibilità: (1 = non plausibile, 2 = poco plausibile,
3 = plausibile, 4 = più che plausibile, 5 = molto plausibile).
STORIA: Margherita non trova più il suo cellulare, così
chiede a Fausto se sa dove si trova e lui le dice: "Hai
sentito lo squillo provenire dalla cucina prima?"
Cosa intende dire Fausto?
a) Fausto vuole far sapere a Margherita che il suo cellulare è in cucina.
b) Fausto vuole sapere se Margherita ha fatto caso a un rumore proveniente dalla cucina.
c) Fausto intende dire che non ha la minima idea di dove si trovi il cellulare di Margherita.
d) Fausto vuole dire che a lui non importa se la loro conoscente sia sposata.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>B. Models’ Results</title>
      <sec id="sec-6-1">
        <title>This section reports the models’ results in terms of mean</title>
        <p>and standard deviation across each scenario and
interpretation types. Row Non-conventional 21 refers to the
results obtained from the same 21 items administered
to the annotators. Row Non-conventional 40 refers to all
non-conventionalized items of the dataset.</p>
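<p>As a minimal illustration (not part of the paper's materials), the per-cell means and standard deviations reported in these tables can be computed from raw 1-5 plausibility ratings as sketched below; the ratings shown are invented placeholders, not actual annotation data:</p>

```python
from statistics import mean, stdev

# Invented placeholder ratings (NOT the paper's data), keyed by
# (scenario type, interpretation type) as in the Appendix B tables.
ratings = {
    ("Conventional", "I"): [5, 5, 4, 5],
    ("Conventional", "L"): [2, 1, 2, 3],
    ("Non-conventional 21", "I"): [4, 3, 5, 4],
}

def summarize(table):
    # Mean and sample standard deviation per scenario/interpretation cell.
    return {
        key: (round(mean(scores), 2), round(stdev(scores), 2))
        for key, scores in table.items()
    }

for (scenario, interp), (m, s) in summarize(ratings).items():
    print(f"{scenario} / {interp}: mean={m}, sd={s}")
```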
      </sec>
    </sec>
    <sec id="sec-7">
      <title>C. Scenarios discussed in Section 4.2</title>
      <p>C.1. "Perché non...?"</p>
      <sec id="sec-7-1">
        <title>Conventionalized/Literal Pair</title>
        <sec id="sec-7-1-1">
          <title>CS: Margherita e Fausto stanno discutendo su cosa</title>
          <p>cucinare per cena. Fausto dice a Margherita:
LS: Margherita e Fausto stanno discutendo su cosa
cucinare per cena. Fausto però era convinto che
Margherita volesse fare la pizza, allora le dice:
ISA: "Perché non facciamo la pizza stasera?"
I: Fausto sta proponendo a Margherita di fare la pizza.
L: Fausto vuole capire perché non hanno più possibilità
di fare la pizza.</p>
          <p>D1: Fausto sta manifestando la sua frustrazione perché
non hanno ancora preso una decisione.</p>
          <p>D2: Fausto vuole far sapere a Margherita che lui non ha
proprio voglia di pizza.</p>
          <p>C.2. "Si può sapere...?"</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>Conventionalized/Literal Pair</title>
        <sec id="sec-7-2-1">
          <title>CS: Margherita sta cucinando, quando Fausto nota che</title>
          <p>sta per mettere lo zucchero al posto del sale nell’ acqua
della pasta. Fausto allora le dice:</p>
        </sec>
        <sec id="sec-7-2-2">
          <title>C: Fausto and Margherita have planned to go out to eat,</title>
<p>but Fausto feels a bit tired, so he says to Margherita: "Can you drive?"
L: Fausto and Margherita have planned to go out to eat,
but Margherita has a bit of a headache, so Fausto says to her: "Can you drive?"
a) Fausto wants Margherita to drive to the restaurant.
b) Fausto wants to make sure that Margherita is able to drive.
c) Fausto wants to know if Margherita has a driver's license.
d) Fausto means that he doesn't feel like going out for dinner.</p>
          <p>C.4. Positive Reply as Promise
NCS: Margherita chiede a Fausto se ci sia bisogno di
ritirare dei contanti dal bancomat, visto che hanno
programmato di fare un viaggio a breve. Fausto le risponde:
ISA: "Ci passo io domani".
I: Fausto intende dire che pensa che ci sia bisogno di contanti.
L: Fausto promette di passare domani a ritirare dei contanti.
D1: Fausto vuole che Margherita passi a ritirare i contanti.
D2: Fausto intende dire che pensa che non ci sia bisogno di contanti.</p>
          <p>D.3. Scenarios discussed in Section 4.2
"PERCHÉ NON?" PAIR - PROPOSAL AS QUESTION
CS: Margherita and Fausto are discussing what to cook
for dinner. Fausto says to Margherita: "Why don't we make pizza tonight?"
L: Margherita and Fausto are discussing what to cook
for dinner. However, Fausto was sure that Margherita
wanted to make pizza, so he says to her: "Why don't we make pizza tonight?"</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>D. English Translation of all the Examples Discussed in the Paper</title>
      <p>I: Fausto is suggesting making pizza to Margherita.
L: Fausto wants to understand why they no longer have
the possibility of making pizza.
D1: Fausto is expressing his frustration because they
haven't made a decision yet.
D2: Fausto wants to let Margherita know that he really
doesn't feel like eating pizza.</p>
      <p>D.1. Prompt
TASK: You will read short stories that describe an
ordinary situation between two characters: Fausto and
Margherita. Each story ends with a sentence that Fausto
addresses to Margherita. For each story, four possible
interpretations are provided to explain the communicative
intention of Fausto's sentence, in relation to the situation
presented. For each interpretation, you will have to assign
a score from 1 to 5, based on its plausibility: (1 = not
plausible, 2 = slightly plausible, 3 = plausible, 4 = more
than plausible, 5 = very plausible).</p>
      <p>"IS IT POSSIBLE TO KNOW" PAIR - REPROACH AS QUESTION
C: Margherita is cooking, when Fausto notices that she
is about to put sugar instead of salt in the pasta water.
Fausto then says to her: "Is it possible to know what you
are doing?"</p>
      <sec id="sec-8-1">
        <title>L: Margherita is cooking. Fausto smells a good smell</title>
        <p>coming from the kitchen, so he asks Margherita: "Is it
possible to know what you are doing?"
I: Fausto blames Margherita for her carelessness.</p>
        <p>L: Fausto wants to know what Margherita is cooking.</p>
        <p>D1: Fausto complains because Margherita keeps too
many things hidden from him.</p>
        <p>D2: Fausto offers to help Margherita cook.</p>
        <p>NON CONVENTIONAL - PROPOSAL AS QUESTION
Fausto wants to buy himself a new suit, but he doesn’t
trust his own taste in clothing, so he says to Margherita:
"Are you at work tomorrow morning?"
I: Fausto would like Margherita to go with him to help
him buy a new suit.</p>
        <p>L: Fausto wants to know if Margherita is working
tomorrow.</p>
        <p>D1: Fausto wants Margherita to stay home tomorrow.</p>
        <p>D2: Fausto wants to ask Margherita to buy him a new
suit.</p>
        <p>NON CONVENTIONAL - POSITIVE REPLY AS
PROMISE
Margherita asks Fausto if they need to withdraw some
cash from the ATM, given that they have planned to take
a trip soon. Fausto replies to her: "I’ll stop by tomorrow."
I: Fausto means that he thinks there is a need for cash.</p>
        <p>L: Fausto promises to come by tomorrow to pick up some
cash.</p>
        <p>D1: Fausto wants Margherita to come and collect the
cash.</p>
        <p>D2: Fausto means that he thinks there is no need for
cash.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: text
translation, paraphrasing and rewording. After using these tool(s)/service(s), the author(s) reviewed
and edited the content as needed and take(s) full responsibility for the publication's content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , arXiv (Cornell University) (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>What artificial neural networks can tell us about human language acquisition, in: Algebraic structures in natural language</article-title>
          , CRC Press,
          <year>2022</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Baroni, On the proper role of linguistically-oriented deep net analysis in linguistic theorizing, arXiv abs/2106.08694 (2021). URL: https://api.semanticscholar.org/CorpusID:235446467.</mixed-citation>
      </ref>
      <sec id="sec-9">
        <title>7. Limitations</title>
        <p>The limitations of this work concern both dataset construction and the experimental setup.
First, the selection of primary/secondary act combinations was not guided by their real distribution in Italian,
as such labeled data are currently unavailable. While INDIR-IT includes a variety of combinations, it may not
fully reflect natural frequencies. Future work could address this by expanding the dataset, possibly adopting
hybrid methods that combine expert annotation with corpus extraction, as fully automatic approaches are not
feasible given the contextual specificity required.
Second, inter-speaker variability poses challenges, especially in pragmatics. Since the task itself invites
interpretive variation, a larger pool of annotators would help mitigate individual differences in pragmatic competence.
Third, model outputs are also sensitive to sampling variability. In this study, hyperparameters such as
temperature, top-k, and top-p were not controlled. While allowing some randomness is appropriate given the
inherent ambiguity of the task, future studies should standardize these parameters across models to ensure
replicability and comparability.</p>
      </sec>
      <sec id="sec-10">
        <title>Acknowledgments</title>
        <p>This work has been supported by the project "XAI-CARE" funded by the European Union - Next Generation EU -
NRRP M6C2 "Investment 2.1 Enhancement and strengthening of biomedical research in the NHS"
(PNRR-MAD-2022-12376692_VADALA' - CUP F83C22002470001) and the project "Language Of Dreams: the relationship
between sleep mentation, neurophysiology, and neurological disorders" - PRIN 2022 2022BNE97C_SH4_PRIN2022.</p>
      </sec>
      <ref id="ref5">
        <mixed-citation>[4] R. Futrell, K. Mahowald, How linguistics learned to stop worrying and love the language models, arXiv preprint arXiv:2501.17047 (2025).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[5] D. Crystal, The Cambridge Encyclopedia of Language, Cambridge University Press, 2010. URL: https://books.google.it/books?id=J976wAEACAAJ.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[6] J. R. Searle, What is a speech act, 1996. URL: https://api.semanticscholar.org/CorpusID:142781882.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[7] J. L. Austin, How to Do Things with Words, Clarendon Press, Oxford [Eng.], 1962.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[8] J. R. Searle, Expression and Meaning: Studies in the Theory of Speech Acts, Cambridge University Press, Cambridge, 1979.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[9] S. C. Levinson, Pragmatics, Cambridge Textbooks in Linguistics, Cambridge University Press, Cambridge, 1983.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[10] R. W. Gibbs Jr, A new look at literal meaning in understanding what is said and implicated, Journal of Pragmatics 34 (2002) 457-486.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[11] R. W. Gibbs, Do people always process the literal meanings of indirect requests?, Journal of Experimental Psychology: Learning, Memory, and Cognition 9 (1983) 524-533.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[12] E. Marocchini, F. Domaneschi, "Can you read my mind?" Conventionalized indirect requests and theory of mind abilities, Journal of Pragmatics 193 (2022) 201-221. URL: https://www.sciencedirect.com/science/article/pii/S0378216622000819. doi:10.1016/j.pragma.2022.03.011.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[13] H. H. Clark, Responding to indirect speech acts, Cognitive Psychology 11 (1979) 430-477.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[14] S. Trott, B. Bergen, Individual differences in mentalizing capacity predict indirect request comprehension, Discourse Processes 56 (2019) 675-707. URL: https://doi.org/10.1080/0163853X.2018.1548219. doi:10.1080/0163853X.2018.1548219.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[15] J. Bašnáková, K. Weber, K. M. Petersson, J. van Berkum, P. Hagoort, Beyond the language given: The neural correlates of inferring speaker meaning, Cerebral Cortex 24 (2013) 2572-2578. URL: https://doi.org/10.1093/cercor/bht112. doi:10.1093/cercor/bht112.</mixed-citation>
      </ref>
      <ref id="ref20a">
        <mixed-citation>[16] P. Brown, S. C. Levinson, Politeness: Some Universals in Language Usage, Studies in Interactional Sociolinguistics, Cambridge University Press, Cambridge, 1987.</mixed-citation>
      </ref>
      <ref id="ref20b">
        <mixed-citation>A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, ..., Z. Wu, Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023. URL: https://arxiv.org/abs/2206.04615. arXiv:2206.04615.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[17] R. W. Janney, H. Arndt, Intracultural tact versus intercultural tact, De Gruyter Mouton, Berlin, Boston, 1992, pp. 21-42. URL: https://doi.org/10.1515/9783110886542-004. doi:10.1515/9783110886542-004.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[18] S. R. Bowman, G. Dahl, What will it take to fix benchmarking in natural language understanding?, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4843-4855. URL: https://aclanthology.org/2021.naacl-main.385/. doi:10.18653/v1/2021.naacl-main.385.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[19] R. Aiyappa, J. An, H. Kwak, Y.-y. Ahn, Can we trust the evaluation on ChatGPT?, in: A. Ovalle, K.-W. Chang, N. Mehrabi, Y. Pruksachatkun, A. Galstyan, J. Dhamala, A. Verma, T. Cao, A. Kumar, R. Gupta (Eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 47-54. URL: https://aclanthology.org/2023.trustnlp-1.5/. doi:10.18653/v1/2023.trustnlp-1.5.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] S. L. Sravanthi, M. Doshi, T. P. Kalyan, R. Murthy, P. Bhattacharyya, R. Dabre, PUB: A pragmatics understanding benchmark for assessing LLMs' pragmatics capabilities (2024).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] A. Louis, D. Roth, F. Radlinski, "I'd rather just go to bed": Understanding indirect answers, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7411-7425. URL: https://aclanthology.org/2020.emnlp-main.601. doi:10.18653/v1/2020.emnlp-main.601.</mixed-citation>
      </ref>
      <ref id="ref25a">
        <mixed-citation>[26] Z. Zheng, S. Qiu, L. Fan, Y. Zhu, S.-C. Zhu, GRICE: A grammar-based dataset for recovering implicature and conversational rEasoning, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 2074-2085. URL: https://aclanthology.org/2021.findings-acl.182. doi:10.18653/v1/2021.findings-acl.182.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Don't make your LLM an evaluation benchmark cheater</article-title>
          ,
          <source>arXiv preprint arXiv:2311.01964</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsuetaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scheutz</surname>
          </string-name>
          ,
          <article-title>Developing a corpus of indirect speech act schemas</article-title>
          , in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>228</lpage>
          . URL: https://aclanthology.org/2020.lrec-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Musacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>Calamita: Challenge the abilities of language models in Italian</article-title>
          ,
          <source>in: Italian Conference on Computational Linguistics</source>
          ,
          <year>2024</year>
          . URL: https://api.semanticscholar.org/CorpusID:275357573.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potertì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Federici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          , et al.,
          <article-title>ITALIC: An Italian culture-aware natural language benchmark</article-title>
          ,
          <source>in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</source>
          , April 29-May 4,
          <year>2025</year>
          , pp.
          <fpage>1469</fpage>
          -
          <lpage>1478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Levesque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Morgenstern</surname>
          </string-name>
          ,
          <article-title>The Winograd schema challenge</article-title>
          ,
          <source>in: Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR'12</source>
          , AAAI Press,
          <year>2012</year>
          , pp.
          <fpage>552</fpage>
          -
          <lpage>561</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. M.</given-names>
            <surname>Shoeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garriga-Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kluska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lewkowycz</surname>
          </string-name>
          , et al.,
          <article-title>Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>