<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Structural sensitivity does not entail grammaticality: assessing LLMs against the Universal Functional Hierarchy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Sgrizzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asya Zanollo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory for Neurocognition</institution>
          ,
          <addr-line>Epistemology, and Theoretical Syntax - NeTS-IUSS Pavia</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University School for Advanced Studies IUSS Pavia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper investigates whether large language models (LLMs) generalize core syntactic properties associated with restructuring verbs in Italian, a domain tied to the universal hierarchy of functional heads proposed by Cinque [1, 2]. Specifically, we examine whether LLMs distinguish between restructuring and control verbs based on canonical syntactic diagnostics: verb ordering, clitic climbing, and auxiliary selection. We also probe how models interpret novel infinitive-selecting pseudoverbs, testing whether they default to restructuring- or control-like behavior. Using controlled minimal pairs, we evaluate five models of different sizes: Minerva-7B-base-v1.0 [3], GPT2-medium-italian-embeddings [4], Bert-base-italian-xxl-uncased [5], GPT2-small-italian [4], and GePpeTto [6]. Our findings reveal that none of the models internalize the functional hierarchy: they do not systematically block clitic climbing for control verbs, nor are they sensitive to the auxiliary selection variability of the restructuring and control classes. These results highlight fundamental limitations in the syntactic generalization abilities of current LLMs, particularly in domains where structural contrasts are not overtly marked in the input.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language models (LLMs)</kwd>
        <kwd>Cognitive plausibility</kwd>
        <kwd>Syntactic evaluation</kwd>
        <kwd>Universal hierarchy of functional heads</kwd>
        <kwd>Restructuring verbs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) have achieved remarkable success across a wide range of natural language understanding tasks, reigniting interest in their syntactic abilities and sparking a vigorous debate regarding the cognitive plausibility of the linguistic generalizations they acquire from data ([<xref ref-type="bibr" rid="ref7">7</xref>], a.o.). Recent research has begun to probe the extent to which LLMs implicitly encode hierarchical syntactic structure [8, 9, 10], examining their sensitivity to phenomena such as long-distance dependencies and subject-verb agreement. This paper contributes to this growing body of work by investigating whether LLMs are sensitive to a crosslinguistically robust constraint governing the hierarchical distribution of functional verbs in Italian ([<xref ref-type="bibr" rid="ref1">1, 11</xref>]). Given the broad cross-linguistic relevance of this phenomenon ([12, 13]), our investigation directly addresses the question of the coherence of linguistic structural representations in LLMs: can these models learn and represent aspects of Cinque’s hierarchy from the data they are trained on? We considered two aspects: model size and training language, in order to observe whether, keeping size constant, a model trained on Italian would perform better in a task specific to Italian. In terms of size, we compared larger, medium and smaller models — Minerva-7B-base-v1.0 [<xref ref-type="bibr" rid="ref3">3</xref>], GPT2-medium-italian-embeddings [<xref ref-type="bibr" rid="ref4">4</xref>], Bert-base-italian-xxl-uncased [<xref ref-type="bibr" rid="ref5">5</xref>], GPT2-small-italian [<xref ref-type="bibr" rid="ref4">4</xref>] and GePpeTto [<xref ref-type="bibr" rid="ref6">6</xref>] — to see if a greater number of parameters and more training data lead to better generalization in terms of abstracting linguistic rules. The research questions (RQs) that guide this study can be framed as:</p>
      <p>• RQ1: To what extent do LLMs generalize the verb ordering hierarchy proposed by Cinque (2006) for restructuring verbs?</p>
      <p>• RQ2: Can LLMs differentiate the underlying structural ambiguity inherent in restructuring versus control verb constructions?</p>
      <p>• RQ3: What is the syntactic structure assigned by LLMs to novel verbs which introduce non-finite complements?</p>
      <sec id="sec-1-1">
        <title>For instance, as far as RQ1 is concerned, the follow</title>
        <p>ing contrast shows that the incorrect hierarchical order
— which directly reflects into linear order — of provare
‘try’ (AspConative) and volere ‘want’ (ModVolition) leads to
ungrammaticality.
(1) a.</p>
        <sec id="sec-1-1-1">
          <title>Gianni lo vuole provare a riparare.</title>
          <p>Gianni it.cl wants to try to fix
‘Gianni wants to try to fix it.’
b. * Gianni lo prova a voler riparare.</p>
          <p>Gianni it.cl tries to wants to fix</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Intended: ‘Gianni tries to want to fix it.’</title>
      </sec>
      <sec id="sec-1-3">
        <title>Regarding RQ2, consider the fact that only restructuring verbs allow clitic climbing (2) and auxiliary switch (3), as shown in the examples below. (2) a.</title>
        <sec id="sec-1-3-1">
          <title>Gianni lo comincia a riparare.</title>
          <p>
            Gianni it.cl begins to fix
diverse languages, adverbs and verbal morphology
appear in a constrained order that reflects an underlying
sequence of functional heads encoding modality, aspect,
tense, and voice [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. A well-known example involves
the relative positions of epistemic and aspectual adverbs.
Consider the following contrast.
(4) a. John probably has again read the book.
          </p>
          <p>b. * John again has probably read the book.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>This contrast reflects a deeper generalization: epis</title>
        <p>
          ‘Gianni begins to fix it.’ temic adverbs like probably structurally precede
aspecb. * Gianni lo corre a riparare. tual adverbs like again in the functional hierarchy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Gianni it.cl runs to fix This ordering is also mirrored in other languages, such as</p>
        <sec id="sec-1-4-1">
          <title>Italian (Giovanni probabilmente ha di nuovo letto il libro</title>
          <p>‘Gianni runs to fix it.’ vs. ?Giovanni di nuovo ha probabilmente letto il libro),
and even when surface word orders vary, (constrained)
movement analyses do preserve the underlying
hierar(3) a. Gianni ha/è voluto partire. chy. In fact, attested orders tend to be derivable from the
Gianni has/is wanted to base sequence via movement operations constrained by
‘Gianni wanted to leave’ Universal Grammar, while unattested orders — such as
stacking adverbs in reverse (again &gt; probably) are rarely,
b. * Gianni ha/*è preferito partire. if ever, observed without resulting in degraded
acceptabilGianni has/*is preferred to ity (see also [15, 16, 17] for a diferent view on ordering
‘Gianni preferred to leave’ constraints yet still rooted in cognitive principles).</p>
          <p>Similarly, in the nominal domain, elements such as</p>
          <p>Finally, RQ3 can be investigated through the syntactic demonstrations, numerals, adjectives, and nouns tend to
ingredients laid out above, using both clitic climbing and conform to the base order Demonstrative &gt; Numeral &gt;
auxiliary switch as diagnostics for a restructuring-like, Adjective &gt; Noun [18]. Using English again for
illustraor a control-like representation of infinitive-taking verbs. tion, the sequence those three books is allowed, but not
Consider a pseudo-verb like grabbare, if models have red three those books. These generalizations suggest that
clear the diference between restructuring and control, natural languages are not arbitrarily diverse but
instanthey would either block or allow clitic climbing across it, tiate a shared blueprint with tightly delimited variation,
and either block or allow auxiliary switch. a claim supported by decades of comparative research</p>
          <p>In the next section, we will introduce the empirical [19, 20, 21, 22].
domain of restructuring and the relevance of the carto- Crucially, these cartographic universals are not merely
graphic enterprise as valid heuristics to test the cognitive typological observations; they reflect deep structural
plausibility of syntactic generalizations. constraints on human language, likely rooted in
cognitive and interface-driven pressures such as learnability,
interpretability, and communicative eficiency (see a.o.
2. Universal Functional Hierarchy [23, 24, 25]). As such, they ofer a highly structured
benchmark for evaluating whether LLMs reflect the underlying
In formal linguistics, the cartographic approach refers principles of natural language cognition or simply
reproto the efort to systematically map out the functional duce surface-level statistical patterns. Assessing
cartostructure of the clause. Much like a geographical map re- graphic generalizations in LLMs thus becomes another
veals detailed topography, syntactic cartography seeks to valuable diagnostic tool for determining whether their
inuncover the fine-grained architecture of language, iden- ternal representations exhibit the kind of compositional
tifying a universal and richly articulated hierarchy of and hierarchical structure found in human language.
functional projections that determine the order of con- Importantly, the utility of cartographic diagnostics
stituents in natural language [14]. This enterprise, de- does not presuppose that LLMs use the same
mechaveloped over the past three decades, has shown striking nisms as human language acquisition. Instead, it
posicross-linguistic consistency: while surface word orders tions cartographic constraints as a structural target: a
vary dramatically across languages, the underlying struc- gold standard against which to assess the depth of
lintural relations often conform to highly constrained and guistic generalization in artificial systems. If LLMs are to
universal hierarchies. For instance, across typologically be considered cognitively plausible models of language
([26], a.o.), they should, at a minimum, capture the
universal constraints that human learners internalize from
fragmented, language-specific input. Testing for
cartographic efects in LLMs therefore ofers a window into
the extent to which their representations are not only
successful at surface prediction but aligned with the
hidden universals that define natural language competence.</p>
          <p>In this sense, cartography closes the gap between
linguistically informed evaluation and cognitively grounded
modeling. By operationalizing syntactic universals as
testable hypotheses in LLMs, we move closer to
understanding not just whether these models can generate
human-like language, but whether they have abstracted
the kinds of structure that make human language what
it is.
2.1. The empirical domain: the case of
restructuring verbs in Italian
these restructuring configurations constrain deeper
syntactic dependencies. Besides CC, restructuring verbs like
potere ’can’, volere ’want’, and dovere ’must’, can in fact
optionally allow the infinitival verb to pick the auxiliary
(essere ’be’, or avere ’have’), as in the case of unaccusative
verbs.
(5)</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>Marco ha/è dovuto partire.</title>
          <p>
            Marco has/is must.pstprt leave.inf
A particularly revealing case study for testing structural Marco had to leave.
representations from a cartographic perspective in LLMs
comes from the domain of restructuring verbs in Ital- Restucturing verbs then present an ideal testing
ian, as discussed in [
            <xref ref-type="bibr" rid="ref1">1, 11</xref>
            ]. Restructuring verbs — such ground for evaluating whether LLMs encode abstract
as potere ‘can’, dovere ‘must’, volere ‘want’, continuare syntactic structures from cartographic generalizations,
‘continue’, cominciare ‘begin’, are verbs that, despite se- or merely track co-occurrence frequencies. While Marco
lecting an infinitival complement, do not behave as if lo finisce di mangiare in fretta (‘Marco finishes eating it
they embed a full clause (cf. [13, 12, 27], a.o.). Instead, quickly’) is structurally monoclausal and allows clitic
they participate in a monoclausal structure, lacking the climbing, its control verb counterpart *Marco lo decide di
full complement of functional projections found in fully mangiare in fretta is ungrammatical precisely because the
embedded (i.e., biclausal) contexts. This has observable clitic cannot climb out of a true embedded clause. These
syntactic consequences: only restructuring verbs permit subtle distinctions, masked by similar surface forms,
removement of the object clitic from the complement po- flect two diferent structural representations,
underscorsition of the infinitive up to the matrix verb (e.g., Marco ing the need to go beyond linearity when assessing
synlo vuole mangiare ‘Marco wants to eat it’), while con- tactic competence in artificial models. Furthermore,
evitrol verbs, which are superficially similar, do not (e.g., dence from language development [28] shows that the
*Marco lo decide di mangiare ‘Marco decides to eat it’). distinction between restructuring and control syntax,
Clitic placement (Clitic Climbing; CC) thus ofers a fruit- and the fixed ordering constrain of restructuring verbs,
ful diagnostic for the underlying syntactic structure of a are acquired very early on. This suggests that children
restructuring configuration. have a clear representation of the diference between
          </p>
          <p>
            More specifically, the working hypothesis that we are control and restructuring verbs, and when encountering
adopting here ([
            <xref ref-type="bibr" rid="ref1">1, 11</xref>
            ]) views restructuring verbs as func- a novel infinitive-taking verb, some preliminary corpus
tional heads occupying a fixed hierarchy (e.g., from lower data suggest they tend to prefer a restructuring
interpreto higher, Aspectual &gt; Modal &gt; Temporal), with each tation over a control one [29]. A natural question, then, is
verb spelling out a specific functional projection (Fig. 1) whether LLMs also encode such a clear distinction when
rooted in the cartographic representation of the inflec- processing previously unseen infinitive-taking verbs. In
tional domain. summary, we can use at least three solid tests to probe
          </p>
          <p>Restructuring verbs obey in fact strict ordering con- linguistic competence when comparing restructuring and
straints within sequences: for example, Marco lo suole control verbs: (i) the first (restructuring), but not the
secvoler mangiare spesso ‘Marco usually wants to eat it often’ ond (control), allows Clitic Climbing (CC); (ii) the order
is grammatical, while reversing the restructuring verbs of predicates lexicalizing positions in the functional
hierblocks clitic climbing (*Marco lo vuole soler mangiare archy is rigid; and (iii) restructuring predicates can take
spesso) as it is a violation of the hierarchical sequence both be and have as auxiliaries.
of functional heads (*ModVolition &gt; AspFrequentative). Unlike
linear word orders of adjectives or adverbs, which LLMs
might learn through surface-level statistical regularities,</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Generalization in LLMs</title>
      <p>Despite the impressive performance of state-of-the-art LLMs, it remains an open question whether their enhanced predictive capabilities reflect genuine syntactic knowledge. LLMs are said to exhibit syntactic generalization insofar as they can abstract structural rules from data and apply them to novel grammatical contexts beyond their training input. Wilson et al. (2023) [30] theorize three forms of generalization, differentiating the ability to learn word distributions and the distributions in contexts from the ability to abstract generalizations independently of training data. Their findings highlight that, while excelling in transferring distributions across syntactically similar contexts, LLMs struggle in extracting structural hierarchical rules, relying primarily on linear order instead. Accordingly, their linguistic knowledge appears to be of a semantic and probabilistic nature, and the emergence of human-like abstraction correlates with the increase of training data, radically differentiating LLMs from human linguistic competence. The issue of LLMs’ grammatical knowledge is tackled by the linguistic community through different approaches relying on controlled experimental settings, probing LLMs’ performances on minimal pair sentences, and evaluating the internalization of deep hierarchical dependencies of the underlying linguistic structures. BLiMP [31] evaluates LLMs with minimal pairs, finding that — while learning basic dependencies and surface-level patterns — models still cannot encode universal constraints like argument structure, even in a high-resource language like English. Training models on larger corpora leads to better performances, suggesting that data play a major role compared to the architecture.</p>
      <p>The very same result is obtained in another benchmark, BIG-bench [32], comprising 204 tasks designed to assess linguistic, reasoning, and knowledge-based abilities. Even if larger models show an improvement in syntactic generalization, this can be explained in terms of memorization rather than grammatical abstraction. Deep-structure constraints still represent a challenge.</p>
      <p>In a recent study, [33] confirms the relevance of training data size in improving generalization, taking the case of a syntactic universal such as the Final-over-Final Constraint (FOFC) — the rule governing word order variation crosslinguistically. They tested models with low-resource languages and found that models fail to learn this constraint when dealing with languages like Basque. A super-human amount of training examples improves syntactic generalization, but models do not acquire abstract rules of grammar.</p>
      <p>Taken together, these studies point to the necessity of incorporating more structured training methodologies and inductive biases, especially in light of the fact that human language acquisition occurs with far less data. Current models remain fundamentally data-dependent rather than rule-based, and simply increasing the scale of training does not really improve the possibility of true syntactic generalization.</p>
      <p>In this context, the empirical domain of restructuring verbs provides an ideal testing ground for disentangling linear generalizations from structural rules. On the one hand, restructuring verbs follow specific linear orderings that could, in principle, be learned from surface patterns in the training data. On the other hand, their ordering can either permit or block syntactic phenomena such as clitic climbing (CC), making linear order a surface reflex of deeper structural constraints. Capturing the relevant syntactic generalizations in this domain therefore requires more than sensitivity to word order — it demands an understanding of the underlying hierarchical structure.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>We designed 13 minimal pair experiments targeting various grammatical contrasts involving clitic placement, auxiliary selection, and verb-verb complementation. In these experiments, we manipulated the presence or absence of restructuring environments, the type of matrix verb (restructuring verbs, control verbs, and pseudo-verbs), and the structural distance between multiple occurrences of restructuring verbs, allowing us to probe the models’ syntactic representations under different conditions. First, we coded 14 restructuring verbs and 14 infinitive-taking verbs (which we name, following the syntactic literature, control verbs, cf. [34]). While the coding of control verbs is arbitrary, the numbering of restructuring verbs reflects their position in the functional hierarchy of [<xref ref-type="bibr" rid="ref1">1</xref>], with andare ‘to go’ assigned code 1 as the lowest verb, and solere ‘to be used to’ assigned code 14 as the highest (see Table 1). Verbs higher in the hierarchy occur linearly to the left of lower verbs.</p>
      <p>In addition to the verbs above, we also created three pseudo-verbs (i.e., non-existent words in Italian) to test whether LLMs assign them a restructuring-like or control-like syntactic representation when they take a non-finite complement. One, grabbare, is a bare verb resembling modals (verbs 6, 7, and 12 in Table 1) as well as solere ‘to be used to’ and other control verbs. The other two pseudo-verbs, drommare a and trellare di, take the prepositions a and di, respectively: a feature shared with the remaining restructuring and control verbs.</p>
      <p>To address RQ1 (introduced in Section §1), we constructed minimal pairs of verb sequences that either respect or violate Cinque’s (2006) functional hierarchy. Each item in Exp. 1 presents a grammatical (hierarchy-respecting) sentence alongside a minimally different ungrammatical counterpart, with the two verbs separated by varying degrees of hierarchical distance. This experiment tests whether LLMs prefer the option adhering to the hierarchy, and whether their preferences correlate with the hierarchical distance between verbs.</p>
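      <p>Under this coding, the label of a two-verb sequence reduces to a comparison of codes: a sequence is hierarchy-respecting only if the left (matrix) verb carries the higher code. The following fragment is a minimal sketch of that check; only the two code assignments explicitly mentioned above are filled in, the rest of Table 1 being omitted rather than assumed.</p>
      <preformat># Minimal sketch of the hierarchy check implied by the verb coding.
# Only the codes explicitly stated in the text are included here.
HIERARCHY_CODE = {
    "andare": 1,   # lowest verb in the hierarchy
    "solere": 14,  # highest verb in the hierarchy
}

def respects_hierarchy(left_verb, right_verb):
    """True if the left (matrix) verb sits higher in the functional hierarchy."""
    return HIERARCHY_CODE[left_verb] &gt; HIERARCHY_CODE[right_verb]

print(respects_hierarchy("solere", "andare"))  # True: solere may precede andare
print(respects_hierarchy("andare", "solere"))  # False: a hierarchy violation</preformat>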
      <p>A second experiment (Exp. 2) uses the same verb pairs as in Exp. 1, but includes a proclitic in each sentence. This introduces an explicit syntactic cue for restructuring, allowing us to evaluate whether clitic placement influences the model’s preference for the grammatical, hierarchy-respecting variant.</p>
      <p>To address RQ2, Exp. 3 and Exp. 4 pair control verbs with restructuring verbs, testing them in both possible orders: restructuring+control (Exp. 3) and control+restructuring (Exp. 4). Each minimal pair includes clitics, with the grammatical variant displaying enclisis on the infinitival verb and the ungrammatical one displaying proclisis onto the matrix verb. The latter is ruled out because in both cases the control verb introduces a clausal boundary that blocks clitic climbing.</p>
      <p>To investigate RQ3, we conducted a series of experiments pairing restructuring and control verbs with the three pseudo-verbs introduced earlier. Exp. 5 combines each of the three pseudo-verbs (grabbare, drommare a, trellare di) with all 14 restructuring verbs, presenting two variants per item: one with proclisis onto the matrix verb (suggesting restructuring), and one with enclisis on the infinitival verb. Exp. 6 reverses the order (restructuring + pseudo-verb) but otherwise follows the same design. Since proclisis requires a monoclausal analysis, these experiments test whether the model treats novel verbs as compatible with restructuring. A systematic preference for the proclitic variant would suggest that the model generalizes restructuring behavior to unseen verbs.</p>
      <p>Exp. 7 and Exp. 8 approach the same question from the opposite angle, pairing pseudo-verbs with control verbs. In Exp. 7, the order is control + pseudo-verb, while in Exp. 8, it is pseudo-verb + control. In both cases, only the enclitic variant is grammatical because control verbs block clitic climbing, even if the model assumes the pseudo-verb to be restructuring-compatible. This design offers a strong test of whether the model robustly distinguishes restructuring from control verbs. If the model is sensitive to this contrast, it should reject the proclitic variant in favor of enclisis, indicating a fine-grained syntactic representation of clitic domain boundaries.</p>
      <p>Exp. 9 further probes the syntactic status of pseudo-verbs by pairing them with each other and testing proclitic vs. enclitic placement. This experiment asks whether the model classifies pseudo-verbs as restructuring-like or control-like when they co-occur, shedding light on whether it generalizes clitic behavior within novel verb classes.</p>
      <p>In Exp. 10, we tested pseudo-verbs in isolation, assessing model preferences for auxiliary selection (have vs. be) — another syntactic hallmark of restructuring (see §2.1). For comparison, Exp. 12 and Exp. 13 extend this test to restructuring (modal) and control verbs, respectively.</p>
      <p>Exp. 11 tests pseudo-verbs selecting infinitival complements, presenting both proclitic and enclitic variants. This experiment investigates whether the model prefers proclisis (indicating a restructuring representation, along the lines of Exp. 5) or enclisis, and whether this preference is modulated by the presence or absence of the prepositions di and a.</p>
      <p>Finally, in Exp. 12 and Exp. 13 we tested modal (restructuring) verbs and control verbs with auxiliary selection, respectively (only modals allow both essere ‘to be’ and avere ‘to have’ with unaccusative verbs, while control verbs require avere). This allows us to see whether the fine-grained syntactic distinctions between restructuring and control have been successfully generalized by these models.</p>
      <sec id="sec-4-1">
        <title>4.1. Materials: Minimal Pairs</title>
        <p>The minimal contrasts exemplified in Table 2 have been considered. For each condition internal to each experiment, we generated 100 structurally irrelevant variants displaying different lexical items as subjects, infinitival verbs, and objects (when present). Although some of the items across the experiments were semantically odd, the generalizations are nonetheless still strong, and the contrast within the pairs remains sharp, as in the example below.</p>
        <p>4. il calciatore lo sta riuscendo a finire di ideare
the soccer player it.cl is about to be able to finish to design
5. *il calciatore lo riesce a star finendo di ideare
the soccer player it.cl is able to be about to finish to design</p>
        <p>The script responsible for the generation of the minimal pairs is available on GitHub.</p>
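        <p>To make the design concrete, the following Python fragment sketches how lexically varied copies of a single contrast can be produced from a template. It is a minimal illustration, not the released script: the lexicon, function name, and the Exp. 3-style contrast used here are our own assumptions.</p>
        <preformat># Toy sketch of the variant-generation step (illustrative only;
# the actual generation script is the one released on GitHub).
import itertools
import random

SUBJECTS = ["Marco", "il calciatore", "la ragazza"]   # hypothetical lexicon
INFINITIVES = ["riparare", "mangiare", "ideare"]

def make_pairs(matrix_good, matrix_bad, n=100):
    """Yield up to n lexically varied copies of one minimal contrast."""
    combos = list(itertools.product(SUBJECTS, INFINITIVES))
    random.shuffle(combos)
    for subj, inf in combos[:n]:
        good = f"{subj} lo {matrix_good} {inf}."  # grammatical: clitic climbing allowed
        bad = f"{subj} lo {matrix_bad} {inf}."    # ungrammatical: clitic climbs over a control verb
        yield good, bad

# Exp. 3-style contrast: restructuring (volere) vs. control (decidere di)
for good, bad in make_pairs("vuole", "decide di", n=3):
    print(good, "|", bad)</preformat>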
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments</title>
        <p>Five LLMs have been employed for the evaluation of syntactic generalization with minimal pair sentences. The selection was driven by two key factors: model size and language of training.</p>
        <p>Correspondingly, we included large, medium and small models — Minerva-7B-base-v1.0, GPT2-medium-italian-embeddings, Bert-base-italian-xxl-uncased, GPT2-small-italian and GePpeTto. All models are trained on Italian corpora, hence they allow us to assess whether exposure to Italian during training enhances syntactic generalization in a typologically relevant domain. This setup enables a direct comparison between different model sizes in the ability to internalize the structural dependencies necessary to abstract the relevant generalizations. All models are available on Hugging Face [<xref ref-type="bibr" rid="ref35">35</xref>, <xref ref-type="bibr" rid="ref36">36</xref>, <xref ref-type="bibr" rid="ref5">5</xref>, <xref ref-type="bibr" rid="ref37">37</xref>, <xref ref-type="bibr" rid="ref38">38</xref>].</p>
        <p>Minerva-7B-base-v1.0 [<xref ref-type="bibr" rid="ref3">3</xref>] is a causal LLM with 7 billion parameters, based on the Mistral architecture (32 layers, hidden size 4096, 32 attention heads, context window of 4096 tokens), trained on ~2.48 trillion tokens (1.14T Italian, 1.14T English, 200B code) with a 51,200-token vocabulary.</p>
        <p>Bert-base-italian-xxl-uncased is the Italian version of the BERT base model (uncased), a masked LLM trained with masked language modeling and next sentence prediction objectives. The model has 111M parameters, and its training data consist of the OPUS corpus (https://opus.nlpl.eu/) extended with additional content from the Italian portion of the OSCAR corpus, for a final training corpus of 81GB and 13,138,379,147 tokens.</p>
        <p>GroNLP/GPT2-medium-italian-embeddings [<xref ref-type="bibr" rid="ref4">4</xref>] is built on the GPT-2 medium architecture, with 359M parameters, with the lexical layer retrained to support Italian.</p>
        <p>GroNLP/GPT2-small-italian [<xref ref-type="bibr" rid="ref4">4</xref>] is a smaller causal Transformer with 121 million parameters, built on the GPT-2 small architecture and retrained in Italian.</p>
        <p>GePpeTto [<xref ref-type="bibr" rid="ref6">6</xref>] has a GPT2-small configuration (~117 million parameters) and has been trained on the Italian corpora OSCAR (https://huggingface.co/datasets/oscar-corpus/oscar), PAISÀ (https://www.corpusitaliano.it/en/) and Wikipedia. Similarly based on the GPT2-small architecture, GePpeTto employs a BPE tokenizer with a reduced vocabulary of 30,000 tokens, specifically adapted for Italian linguistic data.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. LLMs Evaluation</title>
          <p>The LM-eval platform [<xref ref-type="bibr" rid="ref39">39</xref>] was adopted to perform the minimal pair tests. A total of 610,500 minimal pairs were generated and divided into 13 groups, as described in §4.1, and assessed by all the selected models. For each experiment we computed the mean accuracy and standard deviation (Table 3), leaving further statistical analyses for the future. For unknown reasons, some models failed to complete certain evaluation tasks without producing any intelligible error messages.</p>
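          <p>In a minimal pair test, a model is credited with an accurate response when it assigns a higher probability to the grammatical variant than to its ungrammatical counterpart. The following self-contained sketch illustrates this scoring logic with Hugging Face transformers; it is an assumption-laden stand-in for the LM-eval harness [39], using GroNLP/gpt2-small-italian and the clitic climbing contrast from §2.1 as examples.</p>
          <preformat># Sketch of minimal-pair scoring with a causal LM (illustrative;
# the reported results were obtained with the LM-eval harness [39]).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
model = AutoModelForCausalLM.from_pretrained("GroNLP/gpt2-small-italian")
model.eval()

def sentence_logprob(sentence):
    """Total log-probability of a sentence under the causal LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens
    return -out.loss.item() * (ids.size(1) - 1)

good = "Marco lo vuole mangiare."     # restructuring verb: clitic climbing allowed
bad = "Marco lo decide di mangiare."  # control verb: clitic climbing blocked
accurate = sentence_logprob(good) &gt; sentence_logprob(bad)
print(accurate)  # counts toward mean accuracy for this experiment</preformat>
          <p>For the masked model (Bert-base-italian-xxl-uncased), sentence-level probabilities are not directly defined; a standard workaround, sketched below only as a plausible option rather than the procedure actually used, is the pseudo-log-likelihood obtained by masking and scoring each token in turn.</p>
          <preformat># Pseudo-log-likelihood scoring for a masked LM (a common choice;
# whether the harness scores BERT this way is an assumption).
from transformers import AutoModelForMaskedLM

mtok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")
mlm.eval()

def pseudo_logprob(sentence):
    """Sum log P(token | rest) with each position masked in turn."""
    ids = mtok(sentence, return_tensors="pt").input_ids
    total = 0.0
    for i in range(1, ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, i] = mtok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked).logits
        total += torch.log_softmax(logits[0, i], -1)[ids[0, i]].item()
    return total</preformat>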
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We organize our results around the three core research questions, which reflect different dimensions of the models’ syntactic generalizations with respect to restructuring verbs, control verbs, and infinitive-selecting pseudoverbs. For each question, we present the relevant experimental conditions and summarize the performance of all tested LLMs in terms of mean accuracy and standard deviation. To assess whether models internalize the syntactic hierarchy of restructuring verbs proposed by [<xref ref-type="bibr" rid="ref1">1</xref>] (RQ1), Exp. 1 and 2 tested sequences of two restructuring verbs in the correct vs. incorrect hierarchical order, with and without clitic pronouns. Mean accuracies in these experiments were consistently low (Minerva: 36–37%, GePpeTto: 36–38%, GPT2: 46–48%), with SD close to 0.5. BERT, however, performed moderately above chance (Exp. 1: 64.6%, Exp. 2: 56.9%), suggesting that it may encode some sensitivity to hierarchical ordering, although not robustly.</p>
      <p>The presence of clitics in Exp. 2 did not alter model behavior compared to Exp. 1. Models show no evidence of having acquired the hierarchical layout of restructuring verbs, BERT’s results aside. However, their responses may correlate with verb distance or hierarchical ordering, which we leave for further research.</p>
      <p>To evaluate whether models distinguish between restructuring and control verbs based on syntactic diagnostics (RQ2), we considered two properties: clitic climbing (Exp. 3, 4, 5, 6, 7, 8, 9, 11) and auxiliary switch (Exp. 10, 12, 13). In Exp. 3 and 4, which tested restructuring–control verb sequences with clitics, models consistently failed to block clitic climbing where it was expected to be ungrammatical. GePpeTto and Minerva almost systematically chose the ungrammatical option (18–28%), while GPT2-small showed slightly better performance (42–46%) but with high variability; BERT performed near floor (5–6%). A model that shows a bias over 75–80% can in fact be considered structurally coherent, even though it picks the ungrammatical option [<xref ref-type="bibr" rid="ref40">40</xref>].</p>
      <p>In Exp. 7 and 8, which paired control verbs with pseudoverbs, models again failed to systematically block clitic climbing. GPT2 reached 48–57% accuracy, while GePpeTto and Minerva remained well below chance (12–18%).</p>
      <p>In Exp. 9 and 11, which included only pseudoverbs, GePpeTto consistently preferred proclitic constructions (low accuracy = proclisis favored), while Minerva and GPT2-small showed no clear preferences, again reflecting indecision or inconsistency.</p>
      <p>As for auxiliary selection, the results reveal a further lack of syntactic differentiation: in Exp. 10, GePpeTto systematically selected essere (7% accuracy), suggesting it interpreted pseudoverbs as restructuring verbs. GPT2-small showed more balanced choices (47%), compatible with the ambiguity characteristic of some restructuring verbs, which allow both avere and essere.</p>
      <p>In Exp. 12, in fact, testing modal auxiliaries, models should ideally show 50% accuracy, given the optionality of auxiliary selection; instead, both GPT2-small and GePpeTto showed categorical but divergent choices, with accuracies around 5%.</p>
      <p>In Exp. 13 (control verbs), only Minerva performed above chance (57%), while GePpeTto and GPT2-small selected the incorrect auxiliary (essere) almost categorically (1% accuracy), and BERT was the only model to outperform Minerva (63%).</p>
      <p>As a result, models largely fail to generalize the syntactic constraints of restructuring and control verbs. Clitic climbing is not consistently blocked by control verbs, and auxiliary selection does not reliably reflect the transparency effects typical of restructuring verbs nor the ambiguity intrinsic to them. Only GPT2-small shows partial sensitivity in some control constructions, while GePpeTto tends toward an overgeneralization of restructuring syntax (e.g. by overselecting essere as an auxiliary).</p>
      <p>Finally, a central question of this study addresses how models categorize pseudoverbs — novel verbs not seen during training but constructed to select infinitival complements — and whether they are interpreted as control or restructuring verbs.</p>
      <p>In Exp. 5 and 6, pseudoverbs appeared in sequences with restructuring verbs, with proclitic vs. enclitic alternations. Minerva showed a slight preference for the enclitic form (23–29% accuracy), suggesting a bias toward control-like syntax. GePpeTto strongly preferred the proclitic form (17% accuracy = 83% proclisis), indicating a restructuring-like interpretation. GPT2-small was ambivalent. Since the three pseudoverbs differ in whether they select a preposition, mirroring the variation found among restructuring verbs, further analyses will investigate this property as a potential factor.</p>
      <p>Exp. 9 and 11, which tested proclitic/enclitic preferences with pseudoverb–pseudoverb sequences, reinforced these trends: GePpeTto showed a consistent preference for proclitic constructions (11–15% accuracy), while GPT2-small and Minerva again showed no strong preference.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Mean accuracy (Mean) and standard deviation (Std) per experiment for Minerva-7B-base-v1.0.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Experiment</th><th>UID</th><th>Mean</th><th>Std</th></tr>
          </thead>
          <tbody>
            <tr><td>Exp. 1</td><td>sequence of two restructuring verbs testing only linear order</td><td>0.3646</td><td>0.4813</td></tr>
            <tr><td>Exp. 2</td><td>sequence_pairs_with_clitics</td><td>0.3762</td><td>0.4844</td></tr>
            <tr><td>Exp. 3</td><td>restructuring_and_control_plus_clitics</td><td>0.2854</td><td>0.4516</td></tr>
            <tr><td>Exp. 4</td><td>control_and_restructuring_plus_clitics</td><td>0.2253</td><td>0.4178</td></tr>
            <tr><td>Exp. 5</td><td>pseudo_and_restructuring_plus_clitics</td><td>0.2336</td><td>0.4231</td></tr>
            <tr><td>Exp. 6</td><td>restructuring_and_pseudo_plus_clitics</td><td>0.2857</td><td>0.4518</td></tr>
            <tr><td>Exp. 7</td><td>control_and_pseudo_plus_clitics</td><td>0.1569</td><td>0.3637</td></tr>
            <tr><td>Exp. 8</td><td>pseudo_and_control_plus_clitics</td><td>0.1810</td><td>0.3850</td></tr>
            <tr><td>Exp. 9</td><td>pairs_of_pseudo_verbs_plus_clitics</td><td>0.5583</td><td>0.4966</td></tr>
            <tr><td>Exp. 10</td><td>auxiliary_switch_with_pseudoverbs</td><td>—</td><td>—</td></tr>
            <tr><td>Exp. 11</td><td>pseudo_verbs_plus_clitics</td><td>0.2267</td><td>0.4187</td></tr>
            <tr><td>Exp. 12</td><td>auxiliary_switch_with_modals</td><td>—</td><td>—</td></tr>
            <tr><td>Exp. 13</td><td>auxiliary_switch_with_control_verbs</td><td>0.5700</td><td>0.4951</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In Exp. 10, which tested auxiliary selection with pseudoverbs, GePpeTto again opted overwhelmingly for essere, consistent with restructuring behavior, while GPT2-small distributed responses more evenly. BERT distributed its choices roughly evenly (around 53.7% accuracy), suggesting some awareness of optionality, though this may be an artifact of random choice.</p>
      <p>These results suggest that GePpeTto interprets novel infinitive-selecting verbs as restructuring verbs by default (although without expressing the available optionality with avere), consistently favoring proclisis and the auxiliary essere. In contrast, GPT2-small and Minerva exhibit uncertainty or mixed behavior, with no consistent syntactic categorization of pseudoverbs.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Overall, the findings reveal that the models’ behavior does not align with the predictions raised by the framework of [<xref ref-type="bibr" rid="ref1">1</xref>], nor with the grammatical requirements characteristic of the syntax of non-finite complements in Italian. Instead, their choices are often inconsistent, insensitive to syntactic structure, or driven by superficial factors. The first research question addressed whether models generalize the hierarchical structure of restructuring verbs as observed in the syntactic literature ([<xref ref-type="bibr" rid="ref1">1, 11</xref>]). Our results clearly indicate that no such hierarchy is reflected in the models’ performance. Accuracies were consistently low, and variability high. These findings echo previous results showing that LLMs often fail to internalize syntactic hierarchies when such structures are not directly observed during training or explicitly encoded [30]. Even BERT, which slightly outperformed other models on restructuring verb order, failed across the board on clitic-related diagnostics. This has implications for how much syntactic theory — especially fine-grained distinctions like cartographic hierarchies — is learnable from surface patterns alone.</p>
      <p>In the second set of questions, we tested whether models are able to handle clitic climbing and auxiliary selection, two classical diagnostics that distinguish restructuring from control. Across all clitic-related experiments, models consistently failed to block clitic climbing where it should be ungrammatical, especially in the presence of control verbs. This strongly suggests that models do not encode the syntactic opacity of control verbs. A potential explanation for these results lies in tokenization artifacts. Unlike proclitics (e.g., lo ha visto ‘it.obj has seen’), enclitics (e.g., vederlo ‘see-it.obj’) should be tokenized as subword fragments. If models fail to treat enclitics as distinct morphemes, this may increase their preference for proclitic constructions simply because the latter are tokenized as independent words, easily recognizable as syntactic objects.</p>
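      <p>This tokenization asymmetry is easy to inspect directly. The snippet below, a quick illustration rather than part of our experimental pipeline, prints how one of the evaluated tokenizers segments a proclitic versus an enclitic string; the exact segmentation it returns is an empirical question, not something we assume here.</p>
      <preformat># Inspecting how an evaluated tokenizer segments proclisis vs. enclisis
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
print(tok.tokenize("lo ha visto"))  # proclitic: "lo" appears as a standalone token
print(tok.tokenize("vederlo"))      # enclitic: "lo" can only surface inside a subword split</preformat>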
Our results clearly indicate that no such hierarchy is not reliably encode the syntactic transparency of
restrucreflected in the models’ performance. Accuracies were turing verbs nor the obligatory opacity of control verbs.
consistently low, and variability high. These findings Syntactic features that are not overtly marked in
surecho previous results showing that LLMs often fail to face form — such as whether a verb transmits argument
internalize syntactic hierarchies when such structures structure or allows clitic climbing — appear to be dificult
are not directly observed during training or explicitly for models to capture, even when such distinctions are
encoded [30]. Even BERT, which slightly outperformed central to grammaticality.
other models on restructuring verb order, failed across the
board on clitic-related diagnostics. This has implications
for how much syntactic theory — especially fine-grained 7. Conclusions
distinctions like cartographic hierarchies — is learnable
from surface patterns alone. This study investigated whether LLMs encode abstract
      </p>
      <p>This study investigated whether LLMs encode abstract syntactic generalizations by testing their sensitivity to the restructuring verb hierarchy in Italian. Using a suite of controlled minimal pair experiments targeting verb order, clitic placement, and auxiliary selection, we assessed models’ ability to capture structural dependencies that go beyond linear surface patterns.</p>
      <p>The models tested — GPT2-small-italian, GPT2-medium-italian-embeddings, GePpeTto, Bert-base-italian-xxl-uncased and Minerva-7B-base-v1.0 — showed limited sensitivity to the syntactic hierarchy of restructuring verbs, failed to consistently distinguish restructuring from control verbs based on key syntactic diagnostics, and did not consistently categorize novel infinitive-taking verbs based on the non-finite embedding typology available in Italian. These findings highlight fundamental limitations in the syntactic abstraction capacities of current models, particularly in domains where structural contrasts are not overtly marked in surface form.</p>
      <p>While none of the models fully internalize the hierarchical structure of restructuring verbs, some results (such as BERT’s above-chance accuracy in distinguishing hierarchy-respecting sequences in Exp. 1) suggest at least some limited sensitivity to structural cues. However, this sensitivity is neither robust nor consistent across models or conditions, and most importantly it does not translate into reliable grammaticality judgments. For example, clitic placement’s explicit cues for restructuring failed to improve performance, and models consistently failed to block ungrammatical clitic climbing or the essere auxiliary selection in the context of control verbs. These findings indicate that, to the extent models are sensitive to structural hierarchies, in the domain of cartographic generalizations this sensitivity remains shallow and insufficient for capturing the related grammatical distinctions.</p>
      <p>Addressing these limitations will require new approaches to model design, training, and evaluation that go beyond surface-level pattern recognition, and may involve encoding linguistic biases into model architectures — much like cartographic hierarchies are hypothesized to be innately hardwired in human cognition.</p>
    </sec>
    <sec id="sec-3">
      <title>8. Limitations</title>
      <p>The main limitation of the current research lies in the exclusive usage of publicly available pre-trained models, as outlined in §4.2. To obtain a fine-grained understanding of models’ capacity for syntactic generalization, future work will employ models trained from scratch, with a training regimen reproducing human language acquisition stages (see §2). The alignment between learning trajectories and the implementation of more structured training methodologies and inductive biases (see §3) will hopefully improve models’ performance in syntactic tasks [<xref ref-type="bibr" rid="ref41">41</xref>, <xref ref-type="bibr" rid="ref42">42</xref>].</p>
      <p>Moreover, we are in the process of designing an acceptability judgment task to present these contrasts to native speakers and properly compare LLM performance with human data. Further analyses, currently underway, are required to provide a more comprehensive understanding of the syntactic behaviors tested. These will be reported in future work.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.1, Call for tender No. 104 published on 2.2.2022 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU – Project Title T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics – CUP I53D23003900006 – Grant Assignment Decree No. 104 adopted on 2 February 2022 by the Italian Ministry of University and Research (MUR). PI: CC.</p>
    </sec>
    <sec id="sec-genai">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cinque</surname>
          </string-name>
          , Restructuring and functional heads,
          <source>Cartography of Syntactic Structures (Hardcover)</source>
          , Oxford University Press, Cary,
          <string-name>
            <surname>NC</surname>
          </string-name>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cinque</surname>
          </string-name>
          ,
          <article-title>Adverbs and functional heads: A crosslinguistic perspective</article-title>
          , Oxford University Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          , G. Fiameni,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>Minerva llms: The first family of large language models trained from scratch on italian data, in: Proceedings of the 10th Italian conference on computational linguistics (CLiC-it</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>W. De Vries</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>As good as new. how to successfully recycle english gpt-2 to make models for other languages, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP</article-title>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>836</fpage>
          -
          <lpage>846</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . findings-acl.
          <volume>74</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>DBMDZ</given-names>
            <surname>- Bavarian State</surname>
          </string-name>
          <string-name>
            <surname>Library</surname>
          </string-name>
          ,
          <article-title>Bert-base italian xxl uncased</article-title>
          , https://huggingface.co/dbmdz/ bert-base
          <article-title>-italian-xxl-</article-title>
          <string-name>
            <surname>uncased</surname>
          </string-name>
          ,
          <year>2020</year>
          . Accessed:
          <fpage>2025</fpage>
          -08-01.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerini</surname>
          </string-name>
          ,
          <article-title>Geppetto carves italian into a language model</article-title>
          ,
          <source>in: Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-It</source>
          <year>2020</year>
          , Bologna,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          , E. Dupoux,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Assessing the ability of lstms to learn syntax-sensitive dependencies, Transactions of the Association for Computational Linguistics 4 (</article-title>
          <year>2016</year>
          )
          <fpage>521</fpage>
          -
          <lpage>535</lpage>
          . L.
          <string-name>
            <surname>Sutawika</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Thite</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A. Zou,</given-names>
          </string-name>
          <article-title>The language model evaluation harness, 2024</article-title>
          . URL: https://zenodo.org/records/12608602. doi:
          <volume>10</volume>
          .5281/zenodo.12608602.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L. P.</given-names>
            <surname>Bianchessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bressan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fusco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          , T. Sgrizzi,
          <article-title>From recursion to incrementality: Return to recurrent neural networks, Linguistic Vanguard (forthcoming).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>L.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Gul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jumelet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          , et al.,
          <article-title>Babylm turns 3: Call for papers for the 2025 babylm workshop</article-title>
          , arXiv preprint arXiv:
          <volume>2502</volume>
          .10645 (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fusco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L. P.</given-names>
            <surname>Bianchessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bressan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sgrizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <article-title>Recurrent networks are (linguistically) better? an (ongoing) experiment on small-lm training on child-directed speech in italian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>