<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>TRACE-it: Testing Relative clAuses Comprehension through Entailment in ITalian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <email>dominique.brunato@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Istituto di Linguistica Computazionale “A. Zampolli”, CNR-ILC, ItaliaNLP Lab</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>TRACE-it (Testing Relative clAuses Comprehension through Entailment in ITalian) is a benchmark designed to evaluate the ability of Large Language Models (LLMs) to comprehend a specific type of complex syntactic construction in Italian: object relative clauses. In this report, we outline the theoretical framework that informed the creation of the dataset and provide a comprehensive overview of the linguistic materials used.</p>
      </abstract>
      <kwd-group>
        <kwd>Object Relative Clauses</kwd>
        <kwd>Italian language</kwd>
        <kwd>benchmark</kwd>
        <kwd>syntactic assessment</kwd>
        <kwd>entailment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction and Motivation</title>
      <p>TRACE-it (Testing Relative clAuses Comprehension through Entailment in Italian) is a benchmark designed to assess the ability of Large Language Models (LLMs) to comprehend complex sentences in Italian. Complex sentences, in this context, are defined as those containing a type of unbounded dependency, whose correct understanding requires the computation of a grammatical relationship between phrases that are pronounced in a position different from the one where they are interpreted. These structures, also known as “filler-gap” constructions in psycholinguistics, pose significant challenges for human sentence processing, particularly pronounced when the “filler” (the pronounced element) is distant from the “gap” (the position where it is interpreted) [<xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>]. Examples of this include object-gap relationships, which occur in constructions such as relative clauses (1), cleft sentences (2), or wh-questions (3), like the following1:</p>
      <p>1. Il giornalista che il senatore contestò ammise l’errore. [The reporter who the senator attacked admitted the error.]</p>
      <p>2. E’ il giornalista che il senatore contestò. [It is the reporter that the senator attacked.]</p>
      <p>3. Quale giornalista il senatore contestò? [Which reporter did the senator attack?]</p>
      <p>1Examples are taken from [6].</p>
      <p>ORCID: 0000-0003-3256-4794 (D. Brunato)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>The higher complexity of these constructions compared to their subject counterparts – typically measured in terms of reading times and often accompanied by error rates in comprehension questions after reading – has been extensively studied and explained by formal linguistic theories and processing models [<xref ref-type="bibr" rid="ref4 ref6">7, 4, 8, 6</xref>], including child language acquisition data [9, 10, 11]. This benchmark aims to determine whether LLMs encounter similar difficulties and to explore various factors that were shown to modulate this complexity for humans, such as altering the nature of the elements involved in the dependency in terms of grammatical and/or semantic features, as well as varying the distance between the filler and the gap.</p>
      <p>In this respect, the proposed benchmark is part of a growing set of resources specifically designed for the syntactic evaluation of neural language models, which are typically composed of minimal pairs of grammatical and non-grammatical sentences addressing a specific linguistic phenomenon that differs between the sentences (see [12, 13, 14, 15, 16], i.a.). To succeed, a model must score the grammatical sentence higher than its ungrammatical counterpart, either by assigning a binary value or in terms of model perplexity. Two main resources in this respect are the Corpus of Linguistic Acceptability (CoLA) [17] and BLiMP (Benchmark of Linguistic Minimal Pairs) [18], which include minimal pairs for various grammatical phenomena in English. Adaptations of these resources have recently been released in other languages as well, Italian included. Notable examples include ITaCoLA [19], which is directly inspired by CoLA, and the dataset developed for the AcCompl-it task (Acceptability &amp; Complexity Evaluation for Italian) held in the context of the Evalita 2020 campaign [20].</p>
      <p>While similar in purpose, the novelty of TRACE-it lies in its approach. Unlike previous benchmarks that have focused on testing LLMs’ ability to distinguish between grammatical and ungrammatical sentences through minimal pairs, or on assigning a complexity score to such sentences, this benchmark introduces a more advanced task based on entailment. Instead of simply assessing grammaticality, the model is tasked with determining whether a given complex sentence logically entails a simpler yes/no implication. This approach would thus provide a more nuanced evaluation of the model’s ability to understand deep syntactic structures, going beyond surface-level grammaticality to probe its comprehension of meaning.</p>
      <p>The ability to grasp complex syntactic relationships, such as those present in filler-gap constructions, is fundamental to higher-order language tasks. For instance, summarization, information extraction, and question answering all depend on the model’s capacity to correctly interpret sentence structure and meaning. By requiring the model to process complex syntactic dependencies, this benchmark aims to provide a further step towards more rigorous and meaningful evaluation of syntactic comprehension, with a specific focus on Italian. Moreover, TRACE-it contributes to the growing field of linguistically informed resources that enhance interpretability in NLP [21]. These benchmarks are essential for unraveling the linguistic competence implicitly encoded in neural network representations, and they can shed light on the similarities and differences between how humans and LLMs acquire, represent, and process linguistic knowledge [22, 23].</p>
      <sec id="sec-2-1">
        <title>2. Challenge: Description</title>
        <p>The proposed challenge focuses on evaluating LLMs’ understanding of a precise linguistic structure in the Italian language: restrictive object-extracted relative clauses (ORCs). We specifically examine centre-embedded ORCs where both the relative head and the embedded subject are expressed as lexical noun phrases.</p>
        <p>The assessment involves a yes/no entailment task in which the model is given two paired sentences. The first contains the target structure, and the second is a simple declarative sentence whose meaning may or may not be logically inferred from the first based on the syntactic relationship between the elements in the ORC. Specifically, the second sentence focuses either on the relative head (NP1) or the embedded subject (NP2) and has been designed according to the following criteria. When the focus is on NP1, the entailment is true if the second sentence presents NP1 as the active subject of the matrix verb of the main clause or as the passive subject of the embedded verb (see examples 1 and 2 in Table 1, respectively). The entailment is false if NP1 is shown as the active subject of the embedded verb or if the verb of the main clause is negated (see examples 3 and 4, respectively).</p>
        <p>When the focus is on NP2, the entailment is true if the second sentence presents NP2 as the subject of the embedded verb (example 5). It is false if NP2 is the passive subject of the embedded verb or is presented as the subject of the main clause’s verb (examples 6 and 7, respectively). In the majority of cases, the second sentence closely mirrors the lexical structure of the first, as the dataset is firstly designed to investigate syntactic entailment. However, in some instances, a paraphrase is used (e.g., 8).</p>
        <p>These criteria were almost equally balanced across the distinct portions of the whole dataset, which are detailed in the following section.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Data description</title>
        <p>The benchmark consists of 566 sentence pairs, all structured to evaluate the comprehension of Object Relative Clauses (ORCs). While the task’s main objective and the criteria for determining entailment between the two sentences in each pair remain constant, the dataset is divided into four main sections. Each section corresponds to a distinct type of ORC in the first sentence, differentiated by specific conditions that characterize the two lexical noun phrases (NPs) involved in the relative clause. These conditions are inspired by findings from psycholinguistic literature, which reveal that the processing difficulty humans encounter with ORCs – particularly in online comprehension – can be reduced when there is a mismatch between the two NPs in certain grammatical and semantic features [24, 10, 25, 26, 27]. Specifically, we focus on three key features that were shown to have this effect: gender, number, and animacy. To ensure a balanced dataset, we consulted existing resources and literature that have carefully controlled for these conditions.</p>
        <p>For gender and number, we utilized the Italian experimental stimuli set described by [24], focusing exclusively on the center-embedded ORCs portion. This dataset, referred to as Biondo-et-al-2023, contains 306 ORCs equally divided into three subsets:</p>
        <p>• The first subset (gen-num-match condition) contains ORCs where both NPs match in gender and number (i.e., both singular and masculine);
• The second subset (gen-mismatch condition) introduces a gender mismatch, where NP2 remains singular but is feminine;
• The third subset (num-mismatch condition) introduces a number mismatch, where NP2 is masculine but plural.</p>
      </sec>
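      <p>The entailment criteria above amount to a small decision table over the targeted NP and the construction used in the second sentence. A minimal sketch in Python, assuming hypothetical shorthand names for the Sentence 2 constructions (the released data only carries the resulting gold label):</p>

```python
# Hedged sketch: encode the entailment criteria described above as a lookup
# table. The Sentence-2 construction names (e.g. "active_subj_matrix") are
# hypothetical labels for illustration, not part of the released benchmark.
GOLD_RULES = {
    # Focus on NP1 (the relative head)
    ("NP1", "active_subj_matrix"): "sì",    # NP1 active subject of the matrix verb
    ("NP1", "passive_subj_embedded"): "sì", # NP1 passive subject of the embedded verb
    ("NP1", "active_subj_embedded"): "no",  # NP1 shown as active subject of the embedded verb
    ("NP1", "negated_matrix"): "no",        # matrix verb negated
    # Focus on NP2 (the embedded subject)
    ("NP2", "subj_embedded"): "sì",         # NP2 subject of the embedded verb
    ("NP2", "passive_subj_embedded"): "no", # NP2 passive subject of the embedded verb
    ("NP2", "subj_matrix"): "no",           # NP2 presented as subject of the matrix verb
}

def gold_label(np_target: str, sentence2_type: str) -> str:
    """Return the gold entailment label for a (NP target, construction) pair."""
    return GOLD_RULES[(np_target, sentence2_type)]
```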
      <sec id="sec-2-11">
        <p>For animacy, we incorporated 56 examples drawn from a larger set of experimental stimuli described in the paper by Gennari and MacDonald, 2008 [25]. These sentences were originally in English and were translated into Italian, ensuring that the object relative clause construction remained syntactically correct and semantically natural in the target language. All of these sentences exhibit an animacy mismatch: in half of the examples, NP1 is animate and NP2 is inanimate, while in the other half, the reverse configuration is applied.</p>
        <p>Additionally, we introduced a fourth condition, also inspired by psycholinguistic research, which focuses on manipulating the distance between the two NPs. This manipulation aims to increase sentence complexity due to a longer subject-verb agreement dependency in the main clause [<xref ref-type="bibr" rid="ref4">4, 28</xref>], which might result in agreement attraction effects [29, 30]. This condition was obtained by adding one or more prepositional phrases (PP) to either NP1 or NP2, thereby extending the distance between the noun phrases and increasing the subject-verb agreement dependency in the main clause. This fourth condition was applied to 156 sentences, which were sourced from the two aforementioned datasets. Specifically, 100 sentences were selected from the Biondo-et-al-2023 dataset, distributed evenly across the three subsets (match, gender mismatch, and number mismatch), and the entire set from [25] was used.</p>
        <p>Finally, we included a small set of ‘mix-category’ ORCs, with sentences sourced from ‘sister challenge’ benchmarks such as CoLA [17], ITaCoLA [19], and AcCompl-it [20], specifically selecting only those marked as grammatical in the original datasets. While these sentences all contain ORC constructions, the two NPs were not controlled for specific features. Furthermore, except for the CoLA sentences2, these examples feature right-branching rather than center-embedded structures. Given the novel formulation of our task (to our knowledge), it will be interesting to determine whether these models have acquired the ability to reason about complex constructions they might have already encountered and been tested on, beyond simply recognizing their grammaticality. Table 2 summarizes the types of ORCs included in the dataset, along with an example for each condition.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Example sentence pairs, the NP targeted by Sentence 2, and the gold entailment label.</p></caption>
          <table>
            <thead>
              <tr><th>PAIR</th><th>SENTENCE1</th><th>SENTENCE2</th><th>NP target</th><th>GOLD</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>Il professore che lo studente chiama apre la porta dell’aula.</td><td>Il professore sta aprendo una porta.</td><td>NP1</td><td>YES</td></tr>
              <tr><td>2</td><td>Il pittore che il fotografo coinvolge inaugura una mostra d’avanguardia.</td><td>Il pittore è stato coinvolto dal fotografo.</td><td>NP1</td><td>YES</td></tr>
              <tr><td>3</td><td>L’attore che il ballerino ringrazia rompe il microfono nuovo.</td><td>L’attore sta ringraziando il ballerino.</td><td>NP1</td><td>NO</td></tr>
              <tr><td>4</td><td>L’infermiere che il dottore critica aggiorna i turni della settimana.</td><td>L’infermiere non ha aggiornato i turni settimanali.</td><td>NP1</td><td>NO</td></tr>
              <tr><td>5</td><td>L’allenatore che il nuotatore accusa commette un’infrazione del regolamento.</td><td>Il nuotatore sta accusando l’allenatore.</td><td>NP2</td><td>YES</td></tr>
              <tr><td>6</td><td>Il cuoco che il cameriere consulta introduce un menù per vegetariani.</td><td>Il cameriere è stato consultato dal cuoco.</td><td>NP2</td><td>NO</td></tr>
              <tr><td>7</td><td>Il nonno che il bambino insegue calpesta un sasso appuntito.</td><td>Il bambino ha calpestato un sasso.</td><td>NP2</td><td>NO</td></tr>
              <tr><td>8</td><td>Il pagliaccio che la ragazza deride attira l’attenzione di tutti.</td><td>La ragazza sta prendendo in giro il pagliaccio.</td><td>NP2</td><td>YES</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-11-1">
          <title>3.1. Human Evaluation</title>
          <p>Since the assignment of gold labels to sentence pairs in the benchmark was manually derived, though primarily informed by linguistic literature, we conducted a human evaluation with untrained native speakers to validate the examples and ensure they conveyed clear implications. For this validation, we selected 240 sentence pairs, representing approximately 42% of the entire benchmark, with an equal distribution across all conditions. These pairs were annotated by Italian native speakers, recruited via the Prolific platform3. The annotation process was organized into eight questionnaires, each containing 30 sentence pairs. Each pair was labeled by five different workers, resulting in a total of 1,050 human judgments.</p>
          <p>To maintain accuracy and reliability, each questionnaire included five control items where the first sentence was a simple declarative. Annotators were given very simple instructions, similar to the prompt used for the LLM, and were asked to carefully evaluate each pair and determine whether the first sentence implied the second. The final label for each pair was determined through majority voting. This process yielded an accuracy rate of 94.2% (226 correct; 14 incorrect). Of the 226 correctly annotated pairs, 207 achieved agreement from at least</p>
        </sec>
      </sec>
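      <p>The majority-voting step over the five judgments per pair can be sketched as follows; a generic illustration, assuming judgments are collected as simple “sì”/“no” strings (the annotation platform’s export format is not specified here):</p>

```python
from collections import Counter

def majority_label(judgments: list[str]) -> str:
    """Return the most frequent label among annotator judgments.

    With five annotators and two labels ("sì"/"no"), a strict majority
    always exists, so ties cannot occur in this setting.
    """
    return Counter(judgments).most_common(1)[0][0]
```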
      <sec id="sec-2-12">
        <p>2Sentences included in TRACE-it were translated into Italian.</p>
        <p>3https://www.prolific.com/</p>
      </sec>
      <sec id="sec-2-13">
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Types of ORCs included in the dataset, with an example for each condition.</p></caption>
          <table>
            <thead>
              <tr><th></th><th>FEAT</th><th>EXAMPLE</th></tr>
            </thead>
            <tbody>
              <tr><td>gen-num</td><td>all-match</td><td>Il professore che lo studente chiama apre la porta dell’aula.</td></tr>
              <tr><td></td><td>gen-mism</td><td>Il professore che la studentessa chiama apre la porta dell’aula.</td></tr>
              <tr><td></td><td>num-mism</td><td>Il professore che gli studenti chiamano apre la porta dell’aula.</td></tr>
              <tr><td>animacy</td><td>mism [an-in]</td><td>Lo scienziato che il libro ha infastidito era rinomato per i suoi saggi sull’ecologia.</td></tr>
              <tr><td></td><td>mism [in-an]</td><td>Il libro che lo scienziato ha studiato era rinomato per i suoi argomenti sull’ecologia.</td></tr>
              <tr><td>distance</td><td>all-match_NP1+PP</td><td>Il professore di storia e filosofia di Marco che lo studente chiama apre la porta dell’aula.</td></tr>
              <tr><td></td><td>gen-mism_NP2+PP</td><td>Il primario che la specializzanda di oculistica rassicura lascia il reparto incustodito.</td></tr>
              <tr><td></td><td>anim-mism_NP1+PP</td><td>Lo scienziato dell’agenzia pubblica europea che il libro ha infastidito era rinomato per i suoi saggi sull’ecologia.</td></tr>
              <tr><td></td><td>anim-mism_NP2+PP</td><td>Il libro che lo scienziato dell’agenzia pubblica europea ha studiato era rinomato per i suoi argomenti sull’ecologia.</td></tr>
              <tr><td>sister-ch</td><td>mixed</td><td>Il cane che la macchina ferì aveva un collare giallo. / Ho bevuto il vino che Tommaso mi ha portato. / Carlo conosceva bene il compagno di classe che Anna voleva sempre incontrare.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-13-1">
          <title>3.2. Data format</title>
          <p>The benchmark is provided as a tab-separated text file with the following information for each entry:
• UniqueID: a numerical identifier for the entry;
• Source: the original reference from which the sentence has been taken;
• ID-mapping: an identifier mapping for cross-referencing according to the condition;
• Condition: the type of ORC, based on the features (i.e. gender, number, animacy, distance, mixed) and specific configurations (match, mismatch) of the two NPs involved;
• Sentence1: the first sentence containing the ORC;
• Sentence2: the second sentence that may or may not be implied by Sentence 1;
• NP target: indicates whether Sentence 2 targets the head of the relative clause (NP1) or the subject of the embedded clause (NP2) in Sentence 1;
• Gold: the gold label assigned to the pair (“sì” if Sentence 1 implies Sentence 2, “no” otherwise).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <sec id="sec-3-1">
        <title>4.1. Zero-shot Prompting</title>
        <p>To evaluate knowledge that emerges from the model’s training rather than through in-context learning, we chose to adopt a zero-shot evaluation paradigm. We formulate a very simple prompt, which is nearly identical to the instruction presented to humans in the annotation task: “Data questa coppia di frasi, valuta se la prima frase implica la seconda. Rispondi sì o no.” [“Given this pair of sentences, assess whether the first sentence implies the second. Answer yes or no.”]</p>
        <p>Although we experimented with various prompt formulations, we ultimately decided to avoid any prompts that encouraged the model to explicitly analyze the linguistic structure of the sentence. Our aim was to evaluate the model’s raw ability to infer entailment without any task-specific guidance.</p>
      </sec>
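      <p>The query sent to the model can be assembled by appending the sentence pair to the fixed instruction. A minimal sketch, assuming the pair is presented as two labelled lines (the paper fixes only the instruction text, not the pair formatting):</p>

```python
# Zero-shot prompt assembly. The instruction is taken verbatim from the
# paper; presenting the pair as "Frase 1:"/"Frase 2:" lines is our own
# hypothetical formatting choice.
INSTRUCTION = ("Data questa coppia di frasi, valuta se la prima frase "
               "implica la seconda. Rispondi sì o no.")

def build_prompt(sentence1: str, sentence2: str) -> str:
    """Compose the zero-shot query for one benchmark entry."""
    return f"{INSTRUCTION}\nFrase 1: {sentence1}\nFrase 2: {sentence2}"
```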
      <sec id="sec-3-4">
        <title>Metrics</title>
        <p>Given the perfectly balanced data distribution across the two classes, the evaluation metrics will be based on Accuracy and F1 score.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.2. Preliminary Results</title>
        <p>We conducted an initial evaluation of the TRACE-it challenge on Llama-3-8B Instruct [31], achieving an accuracy of 0.71.</p>
      </sec>
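      <p>Both metrics can be computed directly over the gold and predicted “sì”/“no” labels. A minimal sketch without external dependencies, assuming F1 is computed for the “sì” class (the paper does not specify the averaging scheme):</p>

```python
def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of predictions matching the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_score(gold: list[str], pred: list[str], positive: str = "sì") -> float:
    """F1 for the positive class; treating "sì" as positive is an assumption."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```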
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this report, we have described TRACE-it, a novel benchmark, with a corresponding task, presented for the CALAMITA challenge and designed to evaluate the ability of large language models (LLMs) to comprehend object relative clauses (ORCs) in Italian. By focusing on this specific type of complex syntactic construction, TRACE-it allows for a detailed examination of how models handle key grammatical and semantic features, such as gender, number, and animacy, which are known to influence human comprehension.</p>
      <p>The results from our preliminary evaluation showed that while models are able to grasp ORC comprehension, challenges remain, and these are consistent with patterns observed in human language processing studies.</p>
      <p>Although the benchmark is small in scale and limited to a single syntactic structure, it serves as a crucial first step towards a deeper understanding of LLMs’ syntactic capabilities in Italian. Future work should aim to expand both the dataset and the range of syntactic phenomena to create a more comprehensive evaluation framework.</p>
      <p>[7] A. Staub, Eye movements and processing difficulty in object relative clauses, Cognition 116 (2010) 71–86.</p>
      <p>[8] M. De Vincenzi, Syntactic parsing strategies in Italian: The minimal chain principle, volume 12, Springer Science &amp; Business Media, 1991.</p>
      <p>[9] L. M. S. Corrêa, An alternative assessment of children’s comprehension of relative clauses, Journal of Psycholinguistic Research 24 (1995) 183–203.</p>
      <p>[10] N. Friedmann, A. Belletti, L. Rizzi, Relativized relatives: Types of intervention in the acquisition of A-bar dependencies, Lingua 119 (2009) 67–88.</p>
      <p>[11] H. Diessel, M. Tomasello, A new look at the acquisition of relative clauses, Language (2005) 882–906.</p>
      <p>[12] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, M. Baroni, Colorless green recurrent networks dream hierarchically, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1195–1205. URL: https://aclanthology.org/N18-1108. doi:10.18653/v1/N18-1108.</p>
      <p>[13] R. Marvin, T. Linzen, Targeted syntactic evaluation of language models, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1192–1202. URL: https://aclanthology.org/D18-1151. doi:10.18653/v1/D18-1151.</p>
      <p>[14] S. A. Chowdhury, R. Zamparelli, RNN simulations of grammaticality judgments on long-distance dependencies, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 133–144.</p>
      <p>[15] E. G. Wilcox, R. Levy, T. Morita, R. Futrell, What do RNN language models learn about filler–gap dependencies?, in: BlackboxNLP@EMNLP, 2018. URL: https://api.semanticscholar.org/CorpusID:52156878.</p>
      <p>[16] J. Gauthier, J. Hu, E. Wilcox, P. Qian, R. Levy, SyntaxGym: An online platform for targeted evaluation of language models, in: A. Celikyilmaz, T.-H. Wen (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 70–76. URL: https://aclanthology.org/2020.acl-demos.10. doi:10.18653/v1/2020.acl-demos.10.</p>
      <p>[17] A. Warstadt, A. Singh, S. R. Bowman, Neural network acceptability judgments, Transactions of the Association for Computational Linguistics 7 (2019) 625–641. URL: https://aclanthology.org/Q19-1040. doi:10.1162/tacl_a_00290.</p>
      <p>[18] A. Warstadt, A. Parrish, H. Liu, A. Mohananey, W. Peng, S.-F. Wang, S. R. Bowman, BLiMP: The benchmark of linguistic minimal pairs for English, Transactions of the Association for Computational Linguistics 8 (2020) 377–392.</p>
      <p>[19] D. Trotta, R. Guarasci, E. Leonardelli, S. Tonelli, Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2929–2940. URL: https://aclanthology.org/2021.findings-emnlp.250. doi:10.18653/v1/2021.findings-emnlp.250.</p>
      <p>[20] D. Brunato, C. Chesi, F. Dell’Orletta, S. Montemagni, G. Venturi, R. Zamparelli, AcCompl-it @ EVALITA2020: Overview of the acceptability &amp; complexity evaluation task for Italian, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020). URL: https://api.semanticscholar.org/CorpusID:229292651.</p>
      <p>[21] J. Opitz, S. Wein, N. Schneider, Natural language processing relies on linguistics, arXiv preprint arXiv:2405.05966 (2024).</p>
      <p>[22] A. Warstadt, S. R. Bowman, What artificial neural networks can tell us about human language acquisition, in: Algebraic Structures in Natural Language, CRC Press, 2022, pp. 17–60.</p>
      <p>[23] Y. Belinkov, J. Glass, Analysis methods in neural language processing: A survey, Transactions of the Association for Computational Linguistics 7 (2019) 49–72. URL: https://doi.org/10.1162/tacl_a_00254. doi:10.1162/tacl_a_00254.</p>
      <p>[24] N. Biondo, E. Pagliarini, V. Moscati, L. Rizzi, A. Belletti, Features matter: the role of number and gender features during the online processing of subject- and object-relative clauses in Italian, Language, Cognition and Neuroscience 38 (2023) 802–820. URL: https://doi.org/10.1080/23273798.2022.2159989. doi:10.1080/23273798.2022.2159989.</p>
      <p>[25] S. P. Gennari, M. C. MacDonald, Semantic indeterminacy in object relative clauses, Journal of Memory and Language 58 (2008) 161–187.</p>
      <p>[26] M. W. Lowder, P. C. Gordon, Effects of animacy and noun-phrase relatedness on the processing of complex sentences, Memory &amp; Cognition 42 (2014) 794–805.</p>
      <p>[27] W. M. Mak, W. Vonk, H. Schriefers, The influence of animacy on relative clause processing, Journal of Memory and Language 47 (2002) 50–68. URL: https://www.sciencedirect.com/science/article/pii/S0749596X01928372. doi:10.1006/jmla.2001.2837.</p>
      <p>[28] H. Liu, C. Xu, J. Liang, Dependency distance: A new perspective on syntactic patterns in natural languages, Physics of Life Reviews 21 (2017) 171–193.</p>
      <p>[29] J. Franck, G. Lassi, U. H. Frauenfelder, L. Rizzi, Agreement and movement: A syntactic analysis of attraction, Cognition 101 (2006) 173–216.</p>
      <p>[30] D. Parker, A. An, Not all phrases are equally attractive: Experimental evidence for selective agreement attraction effects, Frontiers in Psychology 9 (2018) 1566.</p>
      <p>[31] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hawkins</surname>
          </string-name>
          ,
          <article-title>Processing complexity and filler-gap dependencies across grammars</article-title>
          ,
          <source>Language</source>
          <volume>75</volume>
          (
          <year>1999</year>
          )
          <fpage>244</fpage>
          -
          <lpage>285</lpage>
          . URL: https://api.semanticscholar. org/CorpusID:89607408.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Frazier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clifton</surname>
          </string-name>
          ,
          <article-title>Successive cyclicity in the grammar and the parser</article-title>
          ,
          <source>Language and Cognitive Processes</source>
          <volume>4</volume>
          (
          <year>1989</year>
          )
          <fpage>93</fpage>
          -
          <lpage>126</lpage>
          . URL: https://api. semanticscholar.org/CorpusID:62152168.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <article-title>Linguistic complexity: Locality of syntactic dependencies</article-title>
          ,
          <source>Cognition</source>
          <volume>68</volume>
          (
          <year>1998</year>
          )
          <fpage>1</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Stowe</surname>
          </string-name>
          ,
          <article-title>Parsing wh-constructions: Evidence for on-line gap location</article-title>
          ,
          <source>Language and Cognitive Processes</source>
          <volume>1</volume>
          (
          <year>1986</year>
          )
          <fpage>227</fpage>
          -
          <lpage>245</lpage>
          . URL: https://api. semanticscholar.org/CorpusID:62596346.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <article-title>Individual differences in syntactic processing: The role of working memory</article-title>
          ,
          <source>Journal of Memory and Language</source>
          <volume>30</volume>
          (
          <year>1991</year>
          )
          <fpage>580</fpage>
          -
          <lpage>602</lpage>
          . URL: https://api.semanticscholar. org/CorpusID:144231849.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>