<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>TRACE-it: Testing Relative clAuses Comprehension through Entailment in ITalian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <email>dominique.brunato@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Istituto di Linguistica Computazionale “A. Zampolli”, CNR-ILC, ItaliaNLP Lab</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>TRACE-it (Testing Relative clAuses Comprehension through Entailment in ITalian) is a benchmark designed to evaluate the ability of Large Language Models (LLMs) to comprehend a specific type of complex syntactic construction in Italian: object relative clauses. In this report, we outline the theoretical framework that informed the creation of the dataset and provide a comprehensive overview of the linguistic materials used.</p>
      </abstract>
      <kwd-group>
        <kwd>Object Relative Clauses</kwd>
        <kwd>Italian language</kwd>
        <kwd>benchmark</kwd>
        <kwd>syntactic assessment</kwd>
        <kwd>entailment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction and Motivation</title>
      <p>TRACE-it (Testing Relative clAuses Comprehension through Entailment in Italian) is a benchmark designed to assess the ability of Large Language Models (LLMs) to comprehend complex sentences in Italian. Complex sentences, in this context, are defined as those containing a type of unbounded dependency, whose correct understanding requires the computation of a grammatical relationship between phrases that are pronounced in a position different from the one where they are interpreted. These structures, also known as “filler-gap” constructions in psycholinguistics, pose significant challenges for human sentence processing, particularly pronounced when the “filler” (the pronounced element) is distant from the “gap” (the position where it is interpreted) [<xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>]. Examples of this include object-gap relationships, which occur in constructions such as relative clauses (1), cleft sentences (2), or wh-questions (3), like the following1:</p>
      <p>1. Il giornalista che il senatore contestò ammise l’errore. [The reporter who the senator attacked admitted the error.]</p>
      <p>2. E’ il giornalista che il senatore contestò. [It is the reporter that the senator attacked.]</p>
      <p>3. Quale giornalista il senatore contestò? [Which reporter did the senator attack?]</p>
      <p>1Examples are taken from [6].</p>
      <p>ORCID: 0000-0003-3256-4794 (D. Brunato)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>The higher complexity of these constructions compared to their subject counterparts – typically measured in terms of reading times and often accompanied by error rates in comprehension questions after reading – has been extensively studied and explained by formal linguistic theories and processing models [<xref ref-type="bibr" rid="ref4 ref6">7, 4, 8, 6</xref>], including child language acquisition data [9, 10, 11]. This benchmark aims to determine whether LLMs encounter similar difficulties and to explore various factors that were shown to modulate this complexity for humans, such as altering the nature of the elements involved in the dependency in terms of grammatical and/or semantic features, as well as varying the distance between the filler and the gap.</p>
      <p>In this respect, the proposed benchmark is part of a growing set of resources specifically designed for the syntactic evaluation of neural language models, which are typically composed of minimal pairs of grammatical and non-grammatical sentences addressing a specific linguistic phenomenon that differs between the sentences (see [12, 13, 14, 15, 16], i.a.). To succeed, a model must score the grammatical sentence higher than its ungrammatical counterpart, either by assigning a binary value or in terms of model perplexity. Two main resources in this respect are the Corpus of Linguistic Acceptability (CoLA) [17] and BLiMP (Benchmark of Linguistic Minimal Pairs) [18], which include minimal pairs for various grammatical phenomena in English. Adaptations of these resources have recently been released in other languages as well, Italian included. Notable examples include ITaCoLA [19], which is directly inspired by CoLA, and the dataset developed for the AcCompl-it task (Acceptability &amp; Complexity Evaluation for Italian) held in the context of the Evalita 2020 campaign [20].</p>
      <p>While similar in purpose, the novelty of TRACE-it lies in its approach. Unlike previous benchmarks that have focused on testing LLMs’ ability to distinguish between grammatical and ungrammatical sentences through minimal pairs, or on assigning a complexity score to such sentences, this benchmark introduces a more advanced task based on entailment. Instead of simply assessing grammaticality, the model is tasked with determining whether a given complex sentence logically entails a simpler yes/no implication. This approach would thus provide a more nuanced evaluation of the model’s ability to understand deep syntactic structures, going beyond surface-level grammaticality to probe its comprehension of meaning.</p>
      <p>The ability to grasp complex syntactic relationships, such as those present in filler-gap constructions, is fundamental to higher-order language tasks. For instance, summarization, information extraction, and question answering all depend on the model’s capacity to correctly interpret sentence structure and meaning. By requiring the model to process complex syntactic dependencies, this benchmark aims to provide a further step towards more rigorous and meaningful evaluation of syntactic comprehension, with a specific focus on Italian. Moreover, TRACE-it contributes to the growing field of linguistically informed resources that enhance interpretability in NLP [21]. These benchmarks are essential for unraveling the linguistic competence implicitly encoded in neural network representations, and they can shed light on the similarities and differences between how humans and LLMs acquire, represent, and process linguistic knowledge [22, 23].</p>
      <sec id="sec-2-1">
        <title>2. Challenge: Description</title>
        <p>The proposed challenge focuses on evaluating LLMs’ understanding of a precise linguistic structure in the Italian language: restrictive object-extracted relative clauses (ORCs). We specifically examine centre-embedded ORCs where both the relative head and the embedded subject are expressed as lexical noun phrases.</p>
        <p>The assessment involves a yes/no entailment task in which the model is given two paired sentences. The first contains the target structure, and the second is a simple declarative sentence whose meaning may or may not be logically inferred from the first based on the syntactic relationship between the elements in the ORC. Specifically, the second sentence focuses either on the relative head (NP1) or the embedded subject (NP2) and has been designed according to the following criteria. When the focus is on NP1, the entailment is true if the second sentence presents NP1 as the active subject of the matrix verb of the main clause or as the passive subject of the embedded verb (see examples 1 and 2 in Table 1, respectively). The entailment is false if NP1 is shown as the active subject of the embedded verb or if the verb of the main clause is negated (see examples 3 and 4, respectively).</p>
        <p>When the focus is on NP2, the entailment is true if the second sentence presents NP2 as the subject of the embedded verb (example 5). It is false if NP2 is the passive subject of the embedded verb or is presented as the subject of the main clause’s verb (examples 6 and 7, respectively). In the majority of cases, the second sentence closely mirrors the lexical structure of the first, as the dataset is firstly designed to investigate syntactic entailment. However, in some instances, a paraphrase is used (e.g., 8).</p>
        <p>These criteria were almost equally balanced across the distinct portions of the whole dataset, which are detailed in the following section.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Data description</title>
        <p>The benchmark consists of 566 sentence pairs, all structured to evaluate the comprehension of Object Relative Clauses (ORCs). While the task’s main objective and the criteria for determining entailment between the two sentences in each pair remain constant, the dataset is divided into four main sections. Each section corresponds to a distinct type of ORC in the first sentence, differentiated by specific conditions that characterize the two lexical noun phrases (NPs) involved in the relative clause. These conditions are inspired by findings from psycholinguistic literature, which reveal that the processing difficulty humans encounter with ORCs – particularly in online comprehension – can be reduced when there is a mismatch between the two NPs in certain grammatical and semantic features [24, 10, 25, 26, 27]. Specifically, we focus on three key features that were shown to have this effect: gender, number, and animacy. To ensure a balanced dataset, we consulted existing resources and literature that have carefully controlled for these conditions.</p>
        <p>For gender and number, we utilized the Italian experimental stimuli set described by [24], focusing exclusively on the center-embedded ORCs portion. This dataset, referred to as Biondo-et-al-2023, contains 306 ORCs equally divided into three subsets:</p>
        <p>• The first subset (gen-num-match condition) contains ORCs where both NPs match in gender and number (i.e., both singular and masculine);
• The second subset (gen-mismatch condition) introduces a gender mismatch, where NP2 remains singular but is feminine;
• The third subset (num-mismatch condition) introduces a number mismatch, where NP2 is masculine but plural.</p>
      </sec>
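      <p>The entailment criteria above amount to a small decision table over the targeted NP and the construction used in the second sentence. A minimal sketch in Python, assuming hypothetical shorthand names for the Sentence 2 constructions (the released data only carries the resulting gold label):</p>

```python
# Hedged sketch: encode the entailment criteria described above as a lookup
# table. The Sentence-2 construction names (e.g. "active_subj_matrix") are
# hypothetical labels for illustration, not part of the released benchmark.
GOLD_RULES = {
    # Focus on NP1 (the relative head)
    ("NP1", "active_subj_matrix"): "sì",    # NP1 active subject of the matrix verb
    ("NP1", "passive_subj_embedded"): "sì", # NP1 passive subject of the embedded verb
    ("NP1", "active_subj_embedded"): "no",  # NP1 shown as active subject of the embedded verb
    ("NP1", "negated_matrix"): "no",        # matrix verb negated
    # Focus on NP2 (the embedded subject)
    ("NP2", "subj_embedded"): "sì",         # NP2 subject of the embedded verb
    ("NP2", "passive_subj_embedded"): "no", # NP2 passive subject of the embedded verb
    ("NP2", "subj_matrix"): "no",           # NP2 presented as subject of the matrix verb
}

def gold_label(np_target: str, sentence2_type: str) -> str:
    """Return the gold entailment label for a (NP target, construction) pair."""
    return GOLD_RULES[(np_target, sentence2_type)]
```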
      <sec id="sec-2-11">
        <p>For animacy, we incorporated 56 examples drawn from a larger set of experimental stimuli described in the paper by Gennari and MacDonald, 2008 [25]. These sentences were originally in English and were translated into Italian, ensuring that the object relative clause construction remained syntactically correct and semantically natural in the target language. All of these sentences exhibit an animacy mismatch: in half of the examples, NP1 is animate and NP2 is inanimate, while in the other half, the reverse configuration is applied.</p>
        <p>Additionally, we introduced a fourth condition, also inspired by psycholinguistic research, which focuses on manipulating the distance between the two NPs. This manipulation aims to increase sentence complexity due to a longer subject-verb agreement dependency in the main clause [<xref ref-type="bibr" rid="ref4">4, 28</xref>], which might result in agreement attraction effects [29, 30]. This condition was obtained by adding one or more prepositional phrases (PP) to either NP1 or NP2, thereby extending the distance between the noun phrases and increasing the subject-verb agreement dependency in the main clause. This fourth condition was applied to 156 sentences, which were sourced from the two aforementioned datasets. Specifically, 100 sentences were selected from the Biondo-et-al-2023 dataset, distributed evenly across the three subsets (match, gender mismatch, and number mismatch), and the entire set from [25] was used.</p>
        <p>Finally, we included a small set of ‘mix-category’ ORCs, with sentences sourced from ‘sister challenge’ benchmarks such as CoLA [17], ITaCoLA [19], and AcCompl-it [20], specifically selecting only those marked as grammatical in the original datasets. While these sentences all contain ORC constructions, the two NPs were not controlled for specific features. Furthermore, except for the CoLA sentences2, these examples feature right-branching rather than center-embedded structures. Given the novel formulation of our task (to our knowledge), it will be interesting to determine whether these models have acquired the ability to reason about complex constructions they might have already encountered and been tested on, beyond simply recognizing their grammaticality. Table 2 summarizes the types of ORCs included in the dataset, along with an example for each condition.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Example sentence pairs, the NP targeted by Sentence 2, and the gold entailment label.</p></caption>
          <table>
            <thead>
              <tr><th>PAIR</th><th>SENTENCE1</th><th>SENTENCE2</th><th>NP target</th><th>GOLD</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>Il professore che lo studente chiama apre la porta dell’aula.</td><td>Il professore sta aprendo una porta.</td><td>NP1</td><td>YES</td></tr>
              <tr><td>2</td><td>Il pittore che il fotografo coinvolge inaugura una mostra d’avanguardia.</td><td>Il pittore è stato coinvolto dal fotografo.</td><td>NP1</td><td>YES</td></tr>
              <tr><td>3</td><td>L’attore che il ballerino ringrazia rompe il microfono nuovo.</td><td>L’attore sta ringraziando il ballerino.</td><td>NP1</td><td>NO</td></tr>
              <tr><td>4</td><td>L’infermiere che il dottore critica aggiorna i turni della settimana.</td><td>L’infermiere non ha aggiornato i turni settimanali.</td><td>NP1</td><td>NO</td></tr>
              <tr><td>5</td><td>L’allenatore che il nuotatore accusa commette un’infrazione del regolamento.</td><td>Il nuotatore sta accusando l’allenatore.</td><td>NP2</td><td>YES</td></tr>
              <tr><td>6</td><td>Il cuoco che il cameriere consulta introduce un menù per vegetariani.</td><td>Il cameriere è stato consultato dal cuoco.</td><td>NP2</td><td>NO</td></tr>
              <tr><td>7</td><td>Il nonno che il bambino insegue calpesta un sasso appuntito.</td><td>Il bambino ha calpestato un sasso.</td><td>NP2</td><td>NO</td></tr>
              <tr><td>8</td><td>Il pagliaccio che la ragazza deride attira l’attenzione di tutti.</td><td>La ragazza sta prendendo in giro il pagliaccio.</td><td>NP2</td><td>YES</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-11-1">
          <title>3.1. Human Evaluation</title>
          <p>Since the assignment of gold labels to sentence pairs in the benchmark was manually derived, though primarily informed by linguistic literature, we conducted a human evaluation with untrained native speakers to validate the examples and ensure they conveyed clear implications. For this validation, we selected 240 sentence pairs, representing approximately 42% of the entire benchmark, with an equal distribution across all conditions. These pairs were annotated by Italian native speakers, recruited via the Prolific platform3. The annotation process was organized into eight questionnaires, each containing 30 sentence pairs. Each pair was labeled by five different workers, resulting in a total of 1,050 human judgments.</p>
          <p>To maintain accuracy and reliability, each questionnaire included five control items where the first sentence was a simple declarative. Annotators were given very simple instructions, similar to the prompt used for the LLM, and were asked to carefully evaluate each pair and determine whether the first sentence implied the second. The final label for each pair was determined through majority voting. This process yielded an accuracy rate of 94.2% (226 correct; 14 incorrect). Of the 226 correctly annotated pairs, 207 achieved agreement from at least</p>
        </sec>
      </sec>
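      <p>The majority-voting step over the five judgments per pair can be sketched as follows; a generic illustration, assuming judgments are collected as simple “sì”/“no” strings (the annotation platform’s export format is not specified here):</p>

```python
from collections import Counter

def majority_label(judgments: list[str]) -> str:
    """Return the most frequent label among annotator judgments.

    With five annotators and two labels ("sì"/"no"), a strict majority
    always exists, so ties cannot occur in this setting.
    """
    return Counter(judgments).most_common(1)[0][0]
```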
      <sec id="sec-2-12">
        <p>2Sentences included in TRACE-it were translated into Italian.</p>
        <p>3https://www.prolific.com/</p>
      </sec>
      <sec id="sec-2-13">
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Types of ORCs included in the dataset, with an example for each condition.</p></caption>
          <table>
            <thead>
              <tr><th></th><th>FEAT</th><th>EXAMPLE</th></tr>
            </thead>
            <tbody>
              <tr><td>gen-num</td><td>all-match</td><td>Il professore che lo studente chiama apre la porta dell’aula.</td></tr>
              <tr><td></td><td>gen-mism</td><td>Il professore che la studentessa chiama apre la porta dell’aula.</td></tr>
              <tr><td></td><td>num-mism</td><td>Il professore che gli studenti chiamano apre la porta dell’aula.</td></tr>
              <tr><td>animacy</td><td>mism [an-in]</td><td>Lo scienziato che il libro ha infastidito era rinomato per i suoi saggi sull’ecologia.</td></tr>
              <tr><td></td><td>mism [in-an]</td><td>Il libro che lo scienziato ha studiato era rinomato per i suoi argomenti sull’ecologia.</td></tr>
              <tr><td>distance</td><td>all-match_NP1+PP</td><td>Il professore di storia e filosofia di Marco che lo studente chiama apre la porta dell’aula.</td></tr>
              <tr><td></td><td>gen-mism_NP2+PP</td><td>Il primario che la specializzanda di oculistica rassicura lascia il reparto incustodito.</td></tr>
              <tr><td></td><td>anim-mism_NP1+PP</td><td>Lo scienziato dell’agenzia pubblica europea che il libro ha infastidito era rinomato per i suoi saggi sull’ecologia.</td></tr>
              <tr><td></td><td>anim-mism_NP2+PP</td><td>Il libro che lo scienziato dell’agenzia pubblica europea ha studiato era rinomato per i suoi argomenti sull’ecologia.</td></tr>
              <tr><td>sister-ch</td><td>mixed</td><td>Il cane che la macchina ferì aveva un collare giallo. / Ho bevuto il vino che Tommaso mi ha portato. / Carlo conosceva bene il compagno di classe che Anna voleva sempre incontrare.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-13-1">
          <title>3.2. Data format</title>
          <p>The benchmark is provided as a tab-separated text file with the following information for each entry:
• UniqueID: a numerical identifier for the entry;
• Source: the original reference from which the sentence has been taken;
• ID-mapping: an identifier mapping for cross-referencing according to the condition;
• Condition: the type of ORC, based on the features (i.e. gender, number, animacy, distance, mixed) and specific configurations (match, mismatch) of the two NPs involved;
• Sentence1: the first sentence containing the ORC;
• Sentence2: the second sentence that may or may not be implied by Sentence 1;
• NP target: indicates whether Sentence 2 targets the head of the relative clause (NP1) or the subject of the embedded clause (NP2) in Sentence 1;
• Gold: the gold label assigned to the pair (“sì” if Sentence 1 implies Sentence 2, “no” otherwise).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <sec id="sec-3-1">
        <title>4.1. Zero-shot Prompting</title>
        <p>To evaluate knowledge that emerges from the model’s training rather than through in-context learning, we chose to adopt a zero-shot evaluation paradigm. We formulate a very simple prompt, which is nearly identical to the instruction presented to humans in the annotation task: “Data questa coppia di frasi, valuta se la prima frase implica la seconda. Rispondi sì o no.” [“Given this pair of sentences, assess whether the first sentence implies the second. Answer yes or no.”]</p>
        <p>Although we experimented with various prompt formulations, we ultimately decided to avoid any prompts that encouraged the model to explicitly analyze the linguistic structure of the sentence. Our aim was to evaluate the model’s raw ability to infer entailment without any task-specific guidance.</p>
      </sec>
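      <p>The query sent to the model can be assembled by appending the sentence pair to the fixed instruction. A minimal sketch, assuming the pair is presented as two labelled lines (the paper fixes only the instruction text, not the pair formatting):</p>

```python
# Zero-shot prompt assembly. The instruction is taken verbatim from the
# paper; presenting the pair as "Frase 1:"/"Frase 2:" lines is our own
# hypothetical formatting choice.
INSTRUCTION = ("Data questa coppia di frasi, valuta se la prima frase "
               "implica la seconda. Rispondi sì o no.")

def build_prompt(sentence1: str, sentence2: str) -> str:
    """Compose the zero-shot query for one benchmark entry."""
    return f"{INSTRUCTION}\nFrase 1: {sentence1}\nFrase 2: {sentence2}"
```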
      <sec id="sec-3-4">
        <title>Metrics</title>
        <p>Given the perfectly balanced data distribution across the two classes, the evaluation metrics will be based on Accuracy and F1 score.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2. Preliminary Results</title>
        <p>We conducted an initial evaluation of the TRACE-it challenge on Llama-3-8B Instruct [31], achieving an accuracy of 0.71.</p>
      </sec>
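      <p>Both metrics can be computed directly over the gold and predicted “sì”/“no” labels. A minimal sketch without external dependencies, assuming F1 is computed for the “sì” class (the paper does not specify the averaging scheme):</p>

```python
def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of predictions matching the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_score(gold: list[str], pred: list[str], positive: str = "sì") -> float:
    """F1 for the positive class; treating "sì" as positive is an assumption."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```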
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this report, we have described TRACE-it, a novel benchmark, with a corresponding task, presented for the CALAMITA challenge and designed to evaluate the ability of large language models (LLMs) to comprehend object relative clauses (ORCs) in Italian. By focusing on this specific type of complex syntactic construction, TRACE-it allows for a detailed examination of how models handle key grammatical and semantic features, such as gender, number, and animacy, which are known to influence human comprehension.</p>
      <p>The results from our preliminary evaluation showed that while models are able to grasp ORC comprehension, challenges remain, and these are consistent with patterns observed in human language processing studies.</p>
      <p>Although the benchmark is small in scale and limited to a single syntactic structure, it serves as a crucial first step towards a deeper understanding of LLMs’ syntactic capabilities in Italian. Future work should aim to expand both the dataset and the range of syntactic phenomena to create a more comprehensive evaluation framework.</p>
      <p>[7] A. Staub, Eye movements and processing difficulty in object relative clauses, Cognition 116 (2010) 71–86.</p>
      <p>[8] M. De Vincenzi, Syntactic parsing strategies in Italian: The minimal chain principle, volume 12, Springer Science &amp; Business Media, 1991.</p>
      <p>[9] L. M. S. Corrêa, An alternative assessment of children’s comprehension of relative clauses, Journal of Psycholinguistic Research 24 (1995) 183–203.</p>
      <p>[10] N. Friedmann, A. Belletti, L. Rizzi, Relativized relatives: Types of intervention in the acquisition of A-bar dependencies, Lingua 119 (2009) 67–88.</p>
      <p>[11] H. Diessel, M. Tomasello, A new look at the acquisition of relative clauses, Language (2005) 882–906.</p>
      <p>[12] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, M. Baroni, Colorless green recurrent networks dream hierarchically, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1195–1205. URL: https://aclanthology.org/N18-1108. doi:10.18653/v1/N18-1108.</p>
      <p>[13] R. Marvin, T. Linzen, Targeted syntactic evaluation of language models, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1192–1202. URL: https://aclanthology.org/D18-1151. doi:10.18653/v1/D18-1151.</p>
      <p>[14] S. A. Chowdhury, R. Zamparelli, RNN simulations of grammaticality judgments on long-distance dependencies, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 133–144.</p>
      <p>[15] E. G. Wilcox, R. Levy, T. Morita, R. Futrell, What do RNN language models learn about filler–gap dependencies?, in: BlackboxNLP@EMNLP, 2018. URL: https://api.semanticscholar.org/CorpusID:52156878.</p>
      <p>[16] J. Gauthier, J. Hu, E. Wilcox, P. Qian, R. Levy, SyntaxGym: An online platform for targeted evaluation of language models, in: A. Celikyilmaz, T.-H. Wen (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 70–76. URL: https://aclanthology.org/2020.acl-demos.10. doi:10.18653/v1/2020.acl-demos.10.</p>
      <p>[17] A. Warstadt, A. Singh, S. R. Bowman, Neural network acceptability judgments, Transactions of the Association for Computational Linguistics 7 (2019) 625–641. URL: https://aclanthology.org/Q19-1040. doi:10.1162/tacl_a_00290.</p>
      <p>[18] A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S.-F. Wang, S. R. Bowman, BLiMP: The benchmark of linguistic minimal pairs for English, Transactions of the Association for Computational Linguistics 8 (2020) 377–392.</p>
      <p>[19] D. Trotta, R. Guarasci, E. Leonardelli, S. Tonelli, Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2929–2940. URL: https://aclanthology.org/2021.findings-emnlp.250. doi:10.18653/v1/2021.findings-emnlp.250.</p>
      <p>[20] D. Brunato, C. Chesi, F. Dell’Orletta, S. Montemagni, G. Venturi, R. Zamparelli, AcCompl-it @ EVALITA2020: Overview of the acceptability &amp; complexity evaluation task for Italian, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020). URL: https://api.semanticscholar.org/CorpusID:229292651.</p>
      <p>[21] J. Opitz, S. Wein, N. Schneider, Natural language processing relies on linguistics, arXiv preprint arXiv:2405.05966 (2024).</p>
      <p>[22] A. Warstadt, S. R. Bowman, What artificial neural networks can tell us about human language acquisition, in: Algebraic Structures in Natural Language, CRC Press, 2022, pp. 17–60.</p>
      <p>[23] Y. Belinkov, J. Glass, Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (2019) 49–72. URL: https://doi.org/10.1162/tacl_a_00254. doi:10.1162/tacl_a_00254.</p>
      <p>[24] N. Biondo, E. Pagliarini, V. Moscati, L. Rizzi, A. Belletti, Features matter: the role of number and gender features during the online processing of subject- and object-relative clauses in Italian, Language, Cognition and Neuroscience 38 (2023) 802–820. URL: https://doi.org/10.1080/23273798.2022.2159989. doi:10.1080/23273798.2022.2159989.</p>
      <p>[25] S. P. Gennari, M. C. MacDonald, Semantic indeterminacy in object relative clauses, Journal of Memory and Language 58 (2008) 161–187.</p>
      <p>[26] M. W. Lowder, P. C. Gordon, Effects of animacy and noun-phrase relatedness on the processing of complex sentences, Memory &amp; Cognition 42 (2014) 794–805.</p>
      <p>[27] W. M. Mak, W. Vonk, H. Schriefers, The influence of animacy on relative clause processing, Journal of Memory and Language 47 (2002) 50–68. URL: https://www.sciencedirect.com/science/article/pii/S0749596X01928372. doi:10.1006/jmla.2001.2837.</p>
      <p>[28] H. Liu, C. Xu, J. Liang, Dependency distance: A new perspective on syntactic patterns in natural languages, Physics of Life Reviews 21 (2017) 171–193.</p>
      <p>[29] J. Franck, G. Lassi, U. H. Frauenfelder, L. Rizzi, Agreement and movement: A syntactic analysis of attraction, Cognition 101 (2006) 173–216.</p>
      <p>[30] D. Parker, A. An, Not all phrases are equally attractive: Experimental evidence for selective agreement attraction effects, Frontiers in Psychology 9 (2018) 1566.</p>
      <p>[31] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hawkins</surname>
          </string-name>
          ,
          <article-title>Processing complexity and filler-gap dependencies across grammars</article-title>
          ,
          <source>Language</source>
          <volume>75</volume>
          (
          <year>1999</year>
          )
          <fpage>244</fpage>
          -
          <lpage>285</lpage>
          . URL: https://api.semanticscholar. org/CorpusID:89607408.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Frazier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clifton</surname>
          </string-name>
          ,
          <article-title>Successive cyclicity in the grammar and the parser</article-title>
          ,
          <source>Language and Cognitive Processes</source>
          <volume>4</volume>
          (
          <year>1989</year>
          )
          <fpage>93</fpage>
          -
          <lpage>126</lpage>
          . URL: https://api. semanticscholar.org/CorpusID:62152168.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <article-title>Linguistic complexity: Locality of syntactic dependencies</article-title>
          ,
          <source>Cognition</source>
          <volume>68</volume>
          (
          <year>1998</year>
          )
          <fpage>1</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Stowe</surname>
          </string-name>
          ,
          <article-title>Parsing wh-constructions: Evidence for on-line gap location</article-title>
          ,
          <source>Language and Cognitive Processes</source>
          <volume>1</volume>
          (
          <year>1986</year>
          )
          <fpage>227</fpage>
          -
          <lpage>245</lpage>
          . URL: https://api. semanticscholar.org/CorpusID:62596346.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <article-title>Individual differences in syntactic processing: The role of working memory</article-title>
          ,
          <source>Journal of Memory and Language</source>
          <volume>30</volume>
          (
          <year>1991</year>
          )
          <fpage>580</fpage>
          -
          <lpage>602</lpage>
          . URL: https://api.semanticscholar. org/CorpusID:144231849.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>