=Paper= {{Paper |id=Vol-3878/119_calamita_long |storemode=property |title=TRACE-it: Testing Relative clAuses Comprehension through Entailment in ITalian: A CALAMITA Challenge |pdfUrl=https://ceur-ws.org/Vol-3878/119_calamita_long.pdf |volume=Vol-3878 |authors=Dominique Brunato |dblpUrl=https://dblp.org/rec/conf/clic-it/Brunato24 }} ==TRACE-it: Testing Relative clAuses Comprehension through Entailment in ITalian: A CALAMITA Challenge== https://ceur-ws.org/Vol-3878/119_calamita_long.pdf
                                TRACE-it: Testing Relative clAuses Comprehension through
                                Entailment in ITalian:
                                A CALAMITA Challenge
                                Dominique Brunato1
                                1
                                    Istituto di Linguistica Computazionale ”A. Zampolli”, CNR-ILC, ItaliaNLP Lab


                                                 Abstract
                                                 Introduced in the context of CALAMITA 2024 [1], TRACE-it (Testing Relative clAuses Comprehension through Entailment in
                                                 ITalian) is a benchmark designed to evaluate the ability of Large Language Models (LLMs) to comprehend a specific type of
                                                 complex syntactic construction in Italian: object relative clauses. In this report, we outline the theoretical framework that
                                                 informed the creation of the dataset and provide a comprehensive overview of the linguistic materials used.

                                                 Keywords
                                                 Object Relative Clauses, Italian language, benchmark, syntactic assessment, entailment



1. Introduction and Motivation

TRACE-it (Testing Relative clAuses Comprehension through Entailment in ITalian) is a benchmark designed to assess the ability of Large Language Models (LLMs) to comprehend complex sentences in Italian. Complex sentences, in this context, are defined as those containing a type of unbounded dependency, whose correct understanding requires the computation of a grammatical relationship between phrases that are pronounced in a position different from the one where they are interpreted.

These structures, also known as "filler-gap" constructions in psycholinguistics, pose significant challenges for human sentence processing, challenges that are particularly pronounced when the "filler" (the pronounced element) is distant from the "gap" (the position where it is interpreted) [2, 3, 4, 5]. Examples include object-gap relationships, which occur in constructions such as relative clauses (1), cleft sentences (2), and wh-questions (3)¹:

    1. Il giornalista che il senatore contestò ammise l'errore. [The reporter who the senator attacked admitted the error.]
    2. È il giornalista che il senatore contestò. [It is the reporter that the senator attacked.]
    3. Quale giornalista il senatore contestò? [Which reporter did the senator attack?]

The higher complexity of these constructions compared to their subject counterparts (typically measured in terms of reading times, and often accompanied by error rates in comprehension questions after reading) has been extensively studied and explained by formal linguistic theories and processing models [7, 4, 8, 6], including with child language acquisition data [9, 10, 11]. This benchmark aims to determine whether LLMs encounter similar difficulties, and to explore various factors that have been shown to modulate this complexity for humans, such as altering the nature of the elements involved in the dependency in terms of grammatical and/or semantic features, as well as varying the distance between the filler and the gap.

In this respect, the proposed benchmark is part of a growing set of resources specifically designed for the syntactic evaluation of neural language models. Such resources are typically composed of minimal pairs of grammatical and ungrammatical sentences that differ only with respect to a specific linguistic phenomenon (see [12, 13, 14, 15, 16], i.a.). To succeed, a model must score the grammatical sentence higher than its ungrammatical counterpart, either by assigning a binary value or in terms of model perplexity. Two main resources in this respect are the Corpus of Linguistic Acceptability (CoLA) [17] and BLiMP (Benchmark of Linguistic Minimal Pairs) [18], which include minimal pairs for various grammatical phenomena in English. Adaptations of these resources have recently been released in other languages as well, Italian included. Notable examples are ItaCoLA [19], which is directly inspired by CoLA, and the dataset developed for the AcCompl-it task (Acceptability & Complexity Evaluation for Italian), held in the context of the Evalita 2020 campaign [20].

While similar in purpose, the novelty of TRACE-it lies in its approach. Unlike previous benchmarks, which have focused on testing LLMs' ability to distinguish between grammatical and ungrammatical sentences through minimal pairs, or on assigning a complexity score to such sentences, this benchmark introduces

---
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
dominique.brunato@ilc.cnr.it (D. Brunato)
ORCID: 0000-0003-3256-4794 (D. Brunato)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ Examples are taken from [6].




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
a more advanced task based on entailment. Instead of simply assessing grammaticality, the model is tasked with determining whether a given complex sentence logically entails a simpler sentence, answering yes or no. This approach thus provides a more nuanced evaluation of the model's ability to understand deep syntactic structures, going beyond surface-level grammaticality to probe its comprehension of meaning.

The ability to grasp complex syntactic relationships, such as those present in filler-gap constructions, is fundamental to higher-order language tasks: summarization, information extraction, and question answering all depend on the model's capacity to correctly interpret sentence structure and meaning. By requiring the model to process complex syntactic dependencies, this benchmark aims to provide a further step towards a more rigorous and meaningful evaluation of syntactic comprehension, with a specific focus on Italian. Moreover, TRACE-it contributes to the growing field of linguistically informed resources that enhance interpretability in NLP [21]. These benchmarks are essential for unraveling the linguistic competence implicitly encoded in neural network representations, and they can shed light on the similarities and differences in how humans and LLMs acquire, represent, and process linguistic knowledge [22, 23].

2. Challenge: Description

The proposed challenge focuses on evaluating LLMs' understanding of a precise linguistic structure in the Italian language: restrictive object-extracted relative clauses (ORCs). We specifically examine center-embedded ORCs in which both the relative head and the embedded subject are expressed as lexical noun phrases.

The assessment involves a yes/no entailment task in which the model is given two paired sentences. The first contains the target structure; the second is a simple declarative sentence whose meaning may or may not be logically inferred from the first, based on the syntactic relationship between the elements in the ORC. Specifically, the second sentence focuses either on the relative head (NP1) or on the embedded subject (NP2), and has been designed according to the following criteria. When the focus is on NP1, the entailment is true if the second sentence presents NP1 as the active subject of the matrix verb of the main clause or as the passive subject of the embedded verb (see examples 1 and 2 in Table 1, respectively). The entailment is false if NP1 is presented as the active subject of the embedded verb or if the verb of the main clause is negated (see examples 3 and 4, respectively).

When the focus is on NP2, the entailment is true if the second sentence presents NP2 as the subject of the embedded verb (example 5). It is false if NP2 is the passive subject of the embedded verb or is presented as the subject of the main clause's verb (examples 6 and 7, respectively). In the majority of cases, the second sentence closely mirrors the lexical structure of the first, as the dataset is primarily designed to investigate syntactic entailment. However, in some instances a paraphrase is used (e.g., example 8).

These criteria were almost equally balanced across the distinct portions of the whole dataset, which are detailed in the following section.

PAIR | SENTENCE1                                                                      | SENTENCE2                                           | NP target | GOLD
1    | Il professore che lo studente chiama apre la porta dell'aula.                  | Il professore sta aprendo una porta.                | NP1       | YES
2    | Il pittore che il fotografo coinvolge inaugura una mostra d'avanguardia.       | Il pittore è stato coinvolto dal fotografo.         | NP1       | YES
3    | L'attore che il ballerino ringrazia rompe il microfono nuovo.                  | L'attore sta ringraziando il ballerino.             | NP1       | NO
4    | L'infermiere che il dottore critica aggiorna i turni della settimana.          | L'infermiere non ha aggiornato i turni settimanali. | NP1       | NO
5    | L'allenatore che il nuotatore accusa commette un'infrazione del regolamento.   | Il nuotatore sta accusando l'allenatore.            | NP2       | YES
6    | Il cuoco che il cameriere consulta introduce un menù per vegetariani.          | Il cameriere è stato consultato dal cuoco.          | NP2       | NO
7    | Il nonno che il bambino insegue calpesta un sasso appuntito.                   | Il bambino ha calpestato un sasso.                  | NP2       | NO
8    | Il pagliaccio che la ragazza deride attira l'attenzione di tutti.              | La ragazza sta prendendo in giro il pagliaccio.     | NP2       | YES

Table 1
Extract of the dataset with the main criteria for yes/no entailment exemplified.

3. Data description

The benchmark consists of 566 sentence pairs, all structured to evaluate the comprehension of Object Relative Clauses (ORCs). While the task's main objective and the criteria for determining entailment between the two sentences in each pair remain constant, the dataset is divided into four main sections. Each section corresponds to a distinct type of ORC in the first sentence, differentiated by specific conditions that characterize the two lexical noun phrases (NPs) involved in the relative clause (summarized in Table 2).

These conditions are inspired by findings from the psycholinguistic literature, which reveal that the processing difficulty humans encounter with ORCs (particularly in online comprehension) can be reduced when there is a mismatch between the two NPs in certain grammatical and semantic features [24, 10, 25, 26, 27]. Specifically, we focus on three key features that have been shown to have this effect: gender, number, and animacy. To ensure a balanced dataset, we consulted existing resources and literature that have carefully controlled for these conditions.

For gender and number, we utilized the Italian experimental stimuli set described by [24], focusing exclusively on the center-embedded ORCs portion. This dataset, referred to as Biondo-et-al-2023, contains 306 ORCs equally divided into three subsets:

    • The first subset (gen-num-match condition) contains ORCs where both NPs match in gender and number (i.e., both are singular and masculine);
    • The second subset (gen-mismatch condition) introduces a gender mismatch, where NP2 remains singular but is feminine;
    • The third subset (num-mismatch condition) introduces a number mismatch, where NP2 is masculine but plural.

For animacy, we incorporated 56 examples drawn from a larger set of experimental stimuli described by Gennari and MacDonald, 2008 [25]. These sentences were originally in English and were translated into Italian, ensuring that the object relative clause construction
remained syntactically correct and semantically natural in the target language. All of these sentences exhibit an animacy mismatch: in half of the examples, NP1 is animate and NP2 is inanimate, while in the other half the reverse configuration applies.

Additionally, we introduced a fourth condition, also inspired by psycholinguistic research, which manipulates the distance between the two NPs. This manipulation aims to increase sentence complexity by lengthening the subject-verb agreement dependency in the main clause [4, 28], which might result in agreement attraction effects [29, 30]. The condition was obtained by adding one or more prepositional phrases (PPs) to either NP1 or NP2, thereby extending the distance between the noun phrases. It was applied to 156 sentences sourced from the two aforementioned datasets: specifically, 100 sentences were selected from the Biondo-et-al-2023 dataset, distributed evenly across the three subsets (match, gender mismatch, and number mismatch), and the entire set from [25] was used.

Finally, we included a small set of 'mix-category' ORCs, with sentences sourced from 'sister-challenge' benchmarks such as CoLA [17], ItaCoLA [19], and AcCompl-it [20], specifically selecting only those marked as grammatical in the original datasets. While these sentences all contain ORC constructions, their two NPs were not controlled for specific features. Furthermore, except for the CoLA sentences², these examples feature right-branching rather than center-embedded structures. Given the novel formulation of our task (to our knowledge), it will be interesting to determine whether these models have acquired the ability to reason about complex constructions they might already have encountered and been tested on, beyond simply recognizing their grammaticality.

Table 2 summarizes the types of ORCs included in the dataset, along with an example for each condition.

3.1. Human Evaluation

Since the assignment of gold labels to sentence pairs in the benchmark was manually derived, though primarily informed by the linguistic literature, we conducted a human evaluation with untrained native speakers to validate the examples and ensure they conveyed clear implications.

For this validation, we selected 240 sentence pairs, representing approximately 42% of the entire benchmark, with an equal distribution across all conditions. These pairs were annotated by Italian native speakers recruited via the Prolific platform³. The annotation process was organized into eight questionnaires, each containing 30 sentence pairs. Each pair was labeled by five different workers, resulting in a total of 1,050 human judgments.

To maintain accuracy and reliability, each questionnaire included five control items in which the first sentence was a simple declarative. Annotators were given very simple instructions, similar to the prompt used for the LLM, and were asked to carefully evaluate each pair and determine whether the first sentence implied the second.

The final label for each pair was determined through majority voting. This process yielded an accuracy rate of 94.2% (226 correct; 14 incorrect). Of the 226 correctly annotated pairs, 207 achieved agreement from at least

² Sentences included in TRACE-it were translated into Italian.
³ https://www.prolific.com/
COND      | FEAT              | EXAMPLE                                                                                                              | #   | SOURCE
gen-num   | all-match         | Il professore che lo studente chiama apre la porta dell'aula.                                                        | 102 | [24]
gen-num   | gen-mism          | Il professore che la studentessa chiama apre la porta dell'aula.                                                     | 102 | [24]
gen-num   | num-mism          | Il professore che gli studenti chiamano apre la porta dell'aula.                                                     | 102 | [24]
animacy   | mism [an-in]      | Lo scienziato che il libro ha infastidito era rinomato per i suoi saggi sull'ecologia.                               | 28  | [25]
animacy   | mism [in-an]      | Il libro che lo scienziato ha studiato era rinomato per i suoi argomenti sull'ecologia.                              | 28  | [25]
distance  | all-match_NP1+PP  | Il professore di storia e filosofia di Marco che lo studente chiama apre la porta dell'aula.                         | 50  | [24]_m
distance  | gen-mism_NP2+PP   | Il primario che la specializzanda di oculistica rassicura lascia il reparto incustodito.                             | 50  | [24]_m
distance  | anim-mism_NP1+PP  | Lo scienziato dell'agenzia pubblica europea che il libro ha infastidito era rinomato per i suoi saggi sull'ecologia. | 28  | [25]_m
distance  | anim-mism_NP2+PP  | Il libro che lo scienziato dell'agenzia pubblica europea ha studiato era rinomato per i suoi argomenti sull'ecologia.| 28  | [25]_m
sister-ch | mixed             | Il cane che la macchina ferì aveva un collare giallo.                                                                | 17  | [17]
sister-ch | mixed             | Ho bevuto il vino che Tommaso mi ha portato.                                                                         | 10  | [19]
sister-ch | mixed             | Carlo conosceva bene il compagno di classe che Anna voleva sempre incontrare.                                        | 21  | [20]

Table 2
Types of ORCs included in the dataset, categorized into the four main conditions based on the type of manipulation applied, with the number of examples for each. The suffix "_m" in the SOURCE column indicates that modifications have been made to the original stimuli described in the reference source.



four annotators, while the remaining 19 were decided by a majority vote of three out of five annotators.

3.2. Data format

The benchmark is provided as a tab-separated text file with the following information for each entry:

    • UniqueID: a numerical identifier for the entry;
    • Source: the original reference from which the sentence has been taken;
    • ID-mapping: an identifier for cross-referencing entries according to the condition;
    • Condition: the type of ORC, based on the features involved (i.e., gender, number, animacy, distance, mixed) and the specific configurations (match, mismatch) of the two NPs;
    • Sentence1: the first sentence, containing the ORC;
    • Sentence2: the second sentence, which may or may not be implied by Sentence1;
    • NP target: indicates whether Sentence2 targets the head of the relative clause (NP1) or the subject of the embedded clause (NP2) in Sentence1;
    • Gold: the gold label assigned to the pair ("sì" if Sentence1 implies Sentence2, "no" otherwise).

4. Evaluation

4.1. Zero-shot Prompting

To evaluate knowledge that emerges from the model's training rather than through in-context learning, we chose to adopt a zero-shot evaluation paradigm. We formulate a very simple prompt, which is nearly identical to the instruction presented to humans in the annotation task:

        "Data questa coppia di frasi, valuta se la prima frase implica la seconda. Rispondi sì o no."
        [Given this pair of sentences, evaluate whether the first sentence implies the second. Answer yes or no.]

Although we experimented with various prompt formulations, we ultimately decided to avoid any prompt that encouraged the model to explicitly analyze the linguistic structure of the sentence. Our aim was to evaluate the model's raw ability to infer entailment without any task-specific guidance.

Metrics. Given the perfectly balanced distribution of the data across the two classes, the evaluation metrics are Accuracy and F1 score.

4.2. Preliminary Results

We conducted an initial evaluation of the TRACE-it challenge on Llama-3-8B-Instruct [31], achieving an accuracy of 0.71.
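Concretely, the evaluation pipeline described in Sections 3.2 and 4.1 (read the tab-separated file, wrap each pair in the zero-shot prompt, collect the model's sì/no answer, and score it against the Gold column) can be sketched as follows. This is an illustrative sketch rather than the released evaluation code: the `query_model` stub stands in for whatever LLM interface is used, and only the column names and the prompt text come from the paper.

```python
import csv

# Zero-shot prompt from Section 4.1: "Given this pair of sentences,
# evaluate whether the first sentence implies the second. Answer yes or no."
PROMPT = ("Data questa coppia di frasi, valuta se la prima frase "
          "implica la seconda. Rispondi sì o no.")


def query_model(text: str) -> str:
    """Placeholder for the actual LLM call; must return 'sì' or 'no'."""
    raise NotImplementedError


def score(gold: list[str], pred: list[str], positive: str = "sì"):
    """Accuracy and F1 on the positive ('sì') class."""
    correct = sum(g == p for g, p in zip(gold, pred))
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    accuracy = correct / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


def evaluate(tsv_path: str):
    """Run the zero-shot task over the benchmark file.

    Column names follow Section 3.2; the file path is whatever the
    released benchmark distribution uses.
    """
    gold, pred = [], []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            answer = query_model(
                f"{PROMPT}\n1. {row['Sentence1']}\n2. {row['Sentence2']}")
            gold.append(row["Gold"].strip().lower())
            pred.append(answer.strip().lower())
    return score(gold, pred)
```

For instance, on the toy label lists `score(["sì", "no", "sì", "no"], ["sì", "no", "no", "no"])` the sketch yields an accuracy of 0.75 (one "sì" was missed) and an F1 of 2/3 on the "sì" class.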
                                                              to create a more comprehensive evaluation framework.


                                                              6. Limitations
                                                              There are several limitations in the current benchmark.
                                                              First, the dataset is small in scale and focuses exclu-
                                                              sively on a single syntactic construction — object relative
                                                              clauses. While this targeted approach enables a focused
                                                              investigation into how language models process specific
                                                              grammatical features, it restricts the generalizability of
Figure 1: Percentage accuracy for the whole dataset (ALL) the results to other complex syntactic phenomena. Ex-
and across subsections.
                                                              panding the dataset to include a broader range of syn-
                                                              tactic structures and increasing its size would provide
                                                              a more comprehensive evaluation of language models’
   Figure 1 reports accuracy results across the distinct sub- syntactic comprehension abilities.
sections of the dataset. This preliminary analysis reveals       Additionally, the binary-choice format required by the
that ORCs sourced from existing acceptability datasets entailment task presents another limitation. By forcing
were the easiest for the model to handle. In terms of models (and humans) to make a yes/no decision, this
ORCs with specific conditions applied to the two NPs, approach simplifies the evaluation and may not fully cap-
the model performed best on sentences where there was ture the complexity of syntactic understanding. Future
a mismatch in animacy, indicating that this condition is work could explore alternative evaluation formats that
easier for the model to process. Conversely, when both allow for a more graded or probabilistic assessment of
NPs matched in animacy, the influence of grammatical model performance.
features such as gender and number became more ap-
parent. Specifically, a mismatch in number appeared to
facilitate comprehension more effectively than either a References
full match or a gender mismatch, a finding that aligns
with human data [24].                                          [1] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Fran-
   However, these observations are based on preliminary             cis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Ri-
analysis and require further validation. Generalization capabilities should be verified across different models to obtain more robust conclusions.


5. Conclusion

In this report, we have described TRACE-it, a novel benchmark with a corresponding task, presented for the CALAMITA challenge and designed to evaluate the ability of large language models (LLMs) to comprehend object relative clauses (ORCs) in Italian. By focusing on this specific type of complex syntactic construction, TRACE-it allows for a detailed examination of how models handle key grammatical and semantic features, such as gender, number, and animacy, which are known to influence human comprehension.

The results from our preliminary evaluation showed that while models are able to grasp ORC comprehension, challenges remain, and these are consistent with patterns observed in human language processing studies. Although the benchmark is small in scale and limited to a single syntactic structure, it serves as a crucial first step towards a deeper understanding of LLMs' syntactic capabilities in Italian. Future work should aim to expand both the dataset and the range of syntactic phenomena.


References

 [1] naldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
 [2] J. A. Hawkins, Processing complexity and filler-gap dependencies across grammars, Language 75 (1999) 244–285. URL: https://api.semanticscholar.org/CorpusID:89607408.
 [3] L. Frazier, C. Clifton, Successive cyclicity in the grammar and the parser, Language and Cognitive Processes 4 (1989) 93–126. URL: https://api.semanticscholar.org/CorpusID:62152168.
 [4] E. Gibson, Linguistic complexity: Locality of syntactic dependencies, Cognition 68 (1998) 1–76.
 [5] L. A. Stowe, Parsing wh-constructions: Evidence for on-line gap location, Language and Cognitive Processes 1 (1986) 227–245. URL: https://api.semanticscholar.org/CorpusID:62596346.
 [6] J. P. King, M. A. Just, Individual differences in syntactic processing: The role of working memory, Journal of Memory and Language 30 (1991) 580–602. URL: https://api.semanticscholar.org/CorpusID:144231849.
 [7] A. Staub, Eye movements and processing difficulty in object relative clauses, Cognition 116 (2010) 71–86.
 [8] M. De Vincenzi, Syntactic parsing strategies in Italian: The minimal chain principle, volume 12, Springer Science & Business Media, 1991.
 [9] L. M. S. Corrêa, An alternative assessment of children's comprehension of relative clauses, Journal of Psycholinguistic Research 24 (1995) 183–203.
[10] N. Friedmann, A. Belletti, L. Rizzi, Relativized relatives: Types of intervention in the acquisition of A-bar dependencies, Lingua 119 (2009) 67–88.
[11] H. Diessel, M. Tomasello, A new look at the acquisition of relative clauses, Language (2005) 882–906.
[12] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, M. Baroni, Colorless green recurrent networks dream hierarchically, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1195–1205. URL: https://aclanthology.org/N18-1108. doi:10.18653/v1/N18-1108.
[13] R. Marvin, T. Linzen, Targeted syntactic evaluation of language models, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1192–1202. URL: https://aclanthology.org/D18-1151. doi:10.18653/v1/D18-1151.
[14] S. A. Chowdhury, R. Zamparelli, RNN simulations of grammaticality judgments on long-distance dependencies, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 133–144.
[15] E. G. Wilcox, R. Levy, T. Morita, R. Futrell, What do RNN language models learn about filler–gap dependencies?, in: BlackboxNLP@EMNLP, 2018. URL: https://api.semanticscholar.org/CorpusID:52156878.
[16] J. Gauthier, J. Hu, E. Wilcox, P. Qian, R. Levy, SyntaxGym: An online platform for targeted evaluation of language models, in: A. Celikyilmaz, T.-H. Wen (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 70–76. URL: https://aclanthology.org/2020.acl-demos.10. doi:10.18653/v1/2020.acl-demos.10.
[17] A. Warstadt, A. Singh, S. R. Bowman, Neural network acceptability judgments, Transactions of the Association for Computational Linguistics 7 (2019) 625–641. URL: https://aclanthology.org/Q19-1040. doi:10.1162/tacl_a_00290.
[18] A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S.-F. Wang, S. R. Bowman, BLiMP: The benchmark of linguistic minimal pairs for English, Transactions of the Association for Computational Linguistics 8 (2020) 377–392.
[19] D. Trotta, R. Guarasci, E. Leonardelli, S. Tonelli, Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2929–2940. URL: https://aclanthology.org/2021.findings-emnlp.250. doi:10.18653/v1/2021.findings-emnlp.250.
[20] D. Brunato, C. Chesi, F. Dell'Orletta, S. Montemagni, G. Venturi, R. Zamparelli, AcCompl-it @ EVALITA2020: Overview of the acceptability & complexity evaluation task for Italian, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020). URL: https://api.semanticscholar.org/CorpusID:229292651.
[21] J. Opitz, S. Wein, N. Schneider, Natural language processing relies on linguistics, arXiv preprint arXiv:2405.05966 (2024).
[22] A. Warstadt, S. R. Bowman, What artificial neural networks can tell us about human language acquisition, in: Algebraic Structures in Natural Language, CRC Press, 2022, pp. 17–60.
[23] Y. Belinkov, J. Glass, Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (2019) 49–72. URL: https://doi.org/10.1162/tacl_a_00254. doi:10.1162/tacl_a_00254.
[24] N. Biondo, E. Pagliarini, V. Moscati, L. Rizzi, A. Belletti, Features matter: The role of number and gender features during the online processing of subject- and object-relative clauses in Italian, Language, Cognition and Neuroscience 38 (2023) 802–820. URL: https://doi.org/10.1080/23273798.2022.2159989. doi:10.1080/23273798.2022.2159989.
[25] S. P. Gennari, M. C. MacDonald, Semantic indeterminacy in object relative clauses, Journal of Memory and Language 58 (2008) 161–187.
[26] M. W. Lowder, P. C. Gordon, Effects of animacy and noun-phrase relatedness on the processing of complex sentences, Memory & Cognition 42 (2014) 794–805.
[27] W. M. Mak, W. Vonk, H. Schriefers, The influence of animacy on relative clause processing, Journal of Memory and Language 47 (2002) 50–68. URL: https://www.sciencedirect.com/science/article/pii/S0749596X01928372. doi:10.1006/jmla.2001.2837.
[28] H. Liu, C. Xu, J. Liang, Dependency distance: A new perspective on syntactic patterns in natural languages, Physics of Life Reviews 21 (2017) 171–193.
[29] J. Franck, G. Lassi, U. H. Frauenfelder, L. Rizzi, Agreement and movement: A syntactic analysis of attraction, Cognition 101 (2006) 173–216.
[30] D. Parker, A. An, Not all phrases are equally attractive: Experimental evidence for selective agreement attraction effects, Frontiers in Psychology 9 (2018) 1566.
[31] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).