<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Structural sensitivity does not entail grammaticality: assessing LLMs against the Universal Functional Hierarchy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Sgrizzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asya Zanollo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory for Neurocognition</institution>
          ,
          <addr-line>Epistemology, and Theoretical Syntax - NeTS-IUSS Pavia</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University School for Advanced Studies IUSS Pavia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper investigates whether large language models (LLMs) generalize core syntactic properties associated with restructuring verbs in Italian, a domain tied to the universal hierarchy of functional heads proposed by Cinque [1, 2]. Specifically, we examine whether LLMs distinguish between restructuring and control verbs based on canonical syntactic diagnostics: verb ordering, clitic climbing, and auxiliary selection. We also probe how models interpret novel infinitive-selecting pseudoverbs, testing whether they default to restructuring- or control-like behavior. Using controlled minimal pairs, we evaluate five models of different sizes: Minerva-7B-base-v1.0 [3], GPT2-medium-italian-embeddings [4], Bert-base-italian-xxl-uncased [5], GPT2-small-italian [4], and GePpeTto [6]. Our findings reveal that none of the models internalize the functional hierarchy: they do not systematically block clitic climbing for control verbs, nor are they sensitive to the auxiliary selection variability of the restructuring and control classes. These results highlight fundamental limitations in the syntactic generalization abilities of current LLMs, particularly in domains where structural contrasts are not overtly marked in the input.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language models (LLMs)</kwd>
        <kwd>Cognitive plausibility</kwd>
        <kwd>Syntactic evaluation</kwd>
        <kwd>Universal hierarchy of functional heads</kwd>
        <kwd>Restructuring verbs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) have achieved remarkable success across a wide range of natural language understanding tasks, reigniting interest in their syntactic abilities and sparking a vigorous debate regarding the cognitive plausibility of the linguistic generalizations they acquire from data ([<xref ref-type="bibr" rid="ref7">7</xref>], a.o.). Recent research has begun to probe the extent to which LLMs implicitly encode hierarchical syntactic structure [8, 9, 10], examining their sensitivity to phenomena such as long-distance dependencies and subject-verb agreement. This paper contributes to this growing body of work by investigating whether LLMs are sensitive to a crosslinguistically robust constraint governing the hierarchical distribution of functional verbs in Italian ([<xref ref-type="bibr" rid="ref1">1, 11</xref>]). Given the broad cross-linguistic relevance of this phenomenon ([12, 13]), our investigation directly addresses the question of the coherence of linguistic structural representations in LLMs: can these models learn and represent aspects of Cinque’s hierarchy from the data they are trained on? We considered two aspects: model size and training language, in order to observe whether, keeping size constant, a model trained on Italian would perform better in a task specific to Italian. In terms of size, we compared larger, medium and smaller models — Minerva-7B-base-v1.0 [<xref ref-type="bibr" rid="ref3">3</xref>], GPT2-medium-italian-embeddings [<xref ref-type="bibr" rid="ref4">4</xref>], Bert-base-italian-xxl-uncased [<xref ref-type="bibr" rid="ref5">5</xref>], GPT2-small-italian [<xref ref-type="bibr" rid="ref4">4</xref>] and GePpeTto [<xref ref-type="bibr" rid="ref6">6</xref>] — to see if a greater number of parameters and more training data lead to better generalization in terms of abstracting linguistic rules. The research questions (RQs) that guide this study can be framed as:</p>
      <p>• RQ1: To what extent do LLMs generalize the verb ordering hierarchy proposed by Cinque (2006) for restructuring verbs?</p>
      <p>• RQ2: Can LLMs differentiate the underlying structural ambiguity inherent in restructuring versus control verb constructions?</p>
      <p>• RQ3: What is the syntactic structure assigned by LLMs to novel verbs which introduce non-finite complements?</p>
      <sec id="sec-1-1">
        <title>For instance, as far as RQ1 is concerned, the follow</title>
        <p>ing contrast shows that the incorrect hierarchical order
— which directly reflects into linear order — of provare
‘try’ (AspConative) and volere ‘want’ (ModVolition) leads to
ungrammaticality.
(1) a.</p>
        <sec id="sec-1-1-1">
          <title>Gianni lo vuole provare a riparare.</title>
          <p>Gianni it.cl wants to try to fix
‘Gianni wants to try to fix it.’
b. * Gianni lo prova a voler riparare.</p>
          <p>Gianni it.cl tries to wants to fix</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Intended: ‘Gianni tries to want to fix it.’</title>
      </sec>
      <sec id="sec-1-3">
        <title>Regarding RQ2, consider the fact that only restructuring verbs allow clitic climbing (2) and auxiliary switch (3), as shown in the examples below. (2) a.</title>
        <sec id="sec-1-3-1">
          <title>Gianni lo comincia a riparare.</title>
          <p>
            Gianni it.cl begins to fix
diverse languages, adverbs and verbal morphology
appear in a constrained order that reflects an underlying
sequence of functional heads encoding modality, aspect,
tense, and voice [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. A well-known example involves
the relative positions of epistemic and aspectual adverbs.
Consider the following contrast.
(4) a. John probably has again read the book.
          </p>
          <p>b. * John again has probably read the book.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>This contrast reflects a deeper generalization: epis</title>
        <p>
          ‘Gianni begins to fix it.’ temic adverbs like probably structurally precede
aspecb. * Gianni lo corre a riparare. tual adverbs like again in the functional hierarchy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Gianni it.cl runs to fix This ordering is also mirrored in other languages, such as</p>
        <sec id="sec-1-4-1">
          <title>Italian (Giovanni probabilmente ha di nuovo letto il libro</title>
          <p>‘Gianni runs to fix it.’ vs. ?Giovanni di nuovo ha probabilmente letto il libro),
and even when surface word orders vary, (constrained)
movement analyses do preserve the underlying
hierar(3) a. Gianni ha/è voluto partire. chy. In fact, attested orders tend to be derivable from the
Gianni has/is wanted to base sequence via movement operations constrained by
‘Gianni wanted to leave’ Universal Grammar, while unattested orders — such as
stacking adverbs in reverse (again &gt; probably) are rarely,
b. * Gianni ha/*è preferito partire. if ever, observed without resulting in degraded
acceptabilGianni has/*is preferred to ity (see also [15, 16, 17] for a diferent view on ordering
‘Gianni preferred to leave’ constraints yet still rooted in cognitive principles).</p>
          <p>Similarly, in the nominal domain, elements such as</p>
          <p>Finally, RQ3 can be investigated through the syntactic demonstrations, numerals, adjectives, and nouns tend to
ingredients laid out above, using both clitic climbing and conform to the base order Demonstrative &gt; Numeral &gt;
auxiliary switch as diagnostics for a restructuring-like, Adjective &gt; Noun [18]. Using English again for
illustraor a control-like representation of infinitive-taking verbs. tion, the sequence those three books is allowed, but not
Consider a pseudo-verb like grabbare, if models have red three those books. These generalizations suggest that
clear the diference between restructuring and control, natural languages are not arbitrarily diverse but
instanthey would either block or allow clitic climbing across it, tiate a shared blueprint with tightly delimited variation,
and either block or allow auxiliary switch. a claim supported by decades of comparative research</p>
          <p>In the next section, we will introduce the empirical [19, 20, 21, 22].
domain of restructuring and the relevance of the carto- Crucially, these cartographic universals are not merely
graphic enterprise as valid heuristics to test the cognitive typological observations; they reflect deep structural
plausibility of syntactic generalizations. constraints on human language, likely rooted in
cognitive and interface-driven pressures such as learnability,
interpretability, and communicative eficiency (see a.o.
2. Universal Functional Hierarchy [23, 24, 25]). As such, they ofer a highly structured
benchmark for evaluating whether LLMs reflect the underlying
In formal linguistics, the cartographic approach refers principles of natural language cognition or simply
reproto the efort to systematically map out the functional duce surface-level statistical patterns. Assessing
cartostructure of the clause. Much like a geographical map re- graphic generalizations in LLMs thus becomes another
veals detailed topography, syntactic cartography seeks to valuable diagnostic tool for determining whether their
inuncover the fine-grained architecture of language, iden- ternal representations exhibit the kind of compositional
tifying a universal and richly articulated hierarchy of and hierarchical structure found in human language.
functional projections that determine the order of con- Importantly, the utility of cartographic diagnostics
stituents in natural language [14]. This enterprise, de- does not presuppose that LLMs use the same
mechaveloped over the past three decades, has shown striking nisms as human language acquisition. Instead, it
posicross-linguistic consistency: while surface word orders tions cartographic constraints as a structural target: a
vary dramatically across languages, the underlying struc- gold standard against which to assess the depth of
lintural relations often conform to highly constrained and guistic generalization in artificial systems. If LLMs are to
universal hierarchies. For instance, across typologically be considered cognitively plausible models of language
([26], a.o.), they should, at a minimum, capture the
universal constraints that human learners internalize from
fragmented, language-specific input. Testing for
cartographic efects in LLMs therefore ofers a window into
the extent to which their representations are not only
successful at surface prediction but aligned with the
hidden universals that define natural language competence.</p>
          <p>In this sense, cartography closes the gap between
linguistically informed evaluation and cognitively grounded
modeling. By operationalizing syntactic universals as
testable hypotheses in LLMs, we move closer to
understanding not just whether these models can generate
human-like language, but whether they have abstracted
the kinds of structure that make human language what
it is.
2.1. The empirical domain: the case of
restructuring verbs in Italian
these restructuring configurations constrain deeper
syntactic dependencies. Besides CC, restructuring verbs like
potere ’can’, volere ’want’, and dovere ’must’, can in fact
optionally allow the infinitival verb to pick the auxiliary
(essere ’be’, or avere ’have’), as in the case of unaccusative
verbs.
(5)</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>Marco ha/è dovuto partire.</title>
          <p>
            Marco has/is must.pstprt leave.inf
A particularly revealing case study for testing structural Marco had to leave.
representations from a cartographic perspective in LLMs
comes from the domain of restructuring verbs in Ital- Restucturing verbs then present an ideal testing
ian, as discussed in [
            <xref ref-type="bibr" rid="ref1">1, 11</xref>
            ]. Restructuring verbs — such ground for evaluating whether LLMs encode abstract
as potere ‘can’, dovere ‘must’, volere ‘want’, continuare syntactic structures from cartographic generalizations,
‘continue’, cominciare ‘begin’, are verbs that, despite se- or merely track co-occurrence frequencies. While Marco
lecting an infinitival complement, do not behave as if lo finisce di mangiare in fretta (‘Marco finishes eating it
they embed a full clause (cf. [13, 12, 27], a.o.). Instead, quickly’) is structurally monoclausal and allows clitic
they participate in a monoclausal structure, lacking the climbing, its control verb counterpart *Marco lo decide di
full complement of functional projections found in fully mangiare in fretta is ungrammatical precisely because the
embedded (i.e., biclausal) contexts. This has observable clitic cannot climb out of a true embedded clause. These
syntactic consequences: only restructuring verbs permit subtle distinctions, masked by similar surface forms,
removement of the object clitic from the complement po- flect two diferent structural representations,
underscorsition of the infinitive up to the matrix verb (e.g., Marco ing the need to go beyond linearity when assessing
synlo vuole mangiare ‘Marco wants to eat it’), while con- tactic competence in artificial models. Furthermore,
evitrol verbs, which are superficially similar, do not (e.g., dence from language development [28] shows that the
*Marco lo decide di mangiare ‘Marco decides to eat it’). distinction between restructuring and control syntax,
Clitic placement (Clitic Climbing; CC) thus ofers a fruit- and the fixed ordering constrain of restructuring verbs,
ful diagnostic for the underlying syntactic structure of a are acquired very early on. This suggests that children
restructuring configuration. have a clear representation of the diference between
          </p>
          <p>
            More specifically, the working hypothesis that we are control and restructuring verbs, and when encountering
adopting here ([
            <xref ref-type="bibr" rid="ref1">1, 11</xref>
            ]) views restructuring verbs as func- a novel infinitive-taking verb, some preliminary corpus
tional heads occupying a fixed hierarchy (e.g., from lower data suggest they tend to prefer a restructuring
interpreto higher, Aspectual &gt; Modal &gt; Temporal), with each tation over a control one [29]. A natural question, then, is
verb spelling out a specific functional projection (Fig. 1) whether LLMs also encode such a clear distinction when
rooted in the cartographic representation of the inflec- processing previously unseen infinitive-taking verbs. In
tional domain. summary, we can use at least three solid tests to probe
          </p>
          <p>Restructuring verbs obey in fact strict ordering con- linguistic competence when comparing restructuring and
straints within sequences: for example, Marco lo suole control verbs: (i) the first (restructuring), but not the
secvoler mangiare spesso ‘Marco usually wants to eat it often’ ond (control), allows Clitic Climbing (CC); (ii) the order
is grammatical, while reversing the restructuring verbs of predicates lexicalizing positions in the functional
hierblocks clitic climbing (*Marco lo vuole soler mangiare archy is rigid; and (iii) restructuring predicates can take
spesso) as it is a violation of the hierarchical sequence both be and have as auxiliaries.
of functional heads (*ModVolition &gt; AspFrequentative). Unlike
linear word orders of adjectives or adverbs, which LLMs
might learn through surface-level statistical regularities,</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Generalization in LLMs</title>
      <p>Despite the impressive performance of state-of-the-art LLMs, it remains an open question whether their enhanced predictive capabilities reflect genuine syntactic knowledge. LLMs are said to exhibit syntactic generalization insofar as they can abstract structural rules from data and apply them to novel grammatical contexts beyond their training input. Wilson et al. (2023) [30] theorize three forms of generalization, differentiating the ability to learn word distributions and the distributions in contexts from the ability to abstract generalizations independently of training data. Their findings highlight that, while excelling in transferring distributions across syntactically similar contexts, LLMs struggle in extracting structural hierarchical rules, relying primarily on linear order instead. Accordingly, their linguistic knowledge appears to be of a semantic and probabilistic nature, and the emergence of human-like abstraction correlates with the increase of training data, radically differentiating LLMs from human linguistic competence. The issue of LLMs’ grammatical knowledge is tackled by the linguistic community through different approaches relying on controlled experimental settings, probing LLMs’ performances on minimal pair sentences, and evaluating the internalization of deep hierarchical dependencies of the underlying linguistic structures. BLiMP [31] evaluates LLMs with minimal pairs, finding that — while learning basic dependencies and surface-level patterns — models still cannot encode universal constraints like argument structure, even in a high-resource language like English. Training models on larger corpora leads to better performances, suggesting that data play a major role compared to the architecture.</p>
      <p>The very same result is obtained in another benchmark, BIG-bench [32], comprising 204 tasks designed to assess linguistic, reasoning, and knowledge-based abilities. Even if larger models show an improvement in syntactic generalization, this can be explained in terms of memorization rather than grammatical abstraction. Deep-structure constraints still represent a challenge.</p>
      <p>In a recent study, [33] confirms the relevance of training data size in improving generalization, taking the case of a syntactic universal such as the Final-over-Final Constraint (FOFC) — the rule governing word order variation crosslinguistically. They tested models with low-resource languages and found that models fail to learn this constraint when dealing with languages like Basque. A super-human amount of training examples improves syntactic generalization, but models do not acquire abstract rules of grammar.</p>
      <p>Taken together, these studies point to the necessity of incorporating more structured training methodologies and inductive biases, especially in light of the fact that human language acquisition occurs with far less data. Current models remain fundamentally data-dependent rather than rule-based, and simply increasing the scale of training does not really improve the possibility of true syntactic generalization.</p>
      <p>In this context, the empirical domain of restructuring verbs provides an ideal testing ground for disentangling linear generalizations from structural rules. On the one hand, restructuring verbs follow specific linear orderings that could, in principle, be learned from surface patterns in the training data. On the other hand, their ordering can either permit or block syntactic phenomena such as clitic climbing (CC), making linear order a surface reflex of deeper structural constraints. Capturing the relevant syntactic generalizations in this domain therefore requires more than sensitivity to word order — it demands an understanding of the underlying hierarchical structure.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>We designed 13 minimal pair experiments targeting various grammatical contrasts involving clitic placement, auxiliary selection, and verb-verb complementation. In these experiments, we manipulated the presence or absence of restructuring environments, the type of matrix verb (restructuring verbs, control verbs, and pseudo-verbs), and the structural distance between multiple occurrences of restructuring verbs, allowing us to probe the models’ syntactic representations under different conditions. First, we coded 14 restructuring verbs and 14 infinitive-taking verbs (which we name, following the syntactic literature, control verbs, cf. [34]). While the coding of control verbs is arbitrary, the numbering of restructuring verbs reflects their position in the functional hierarchy of [<xref ref-type="bibr" rid="ref1">1</xref>], with andare ‘to go’ assigned code 1 as the lowest verb, and solere ‘to be used to’ assigned code 14 as the highest (see Table 1). Verbs higher in the hierarchy occur linearly to the left of lower verbs.</p>
      <p>In addition to the verbs above, we also created three pseudo-verbs (i.e., non-existent words in Italian) to test whether LLMs assign them a restructuring-like or control-like syntactic representation when they take a non-finite complement. One, grabbare, is a bare verb resembling modals (verbs 6, 7, and 12 in Table 1) as well as solere ‘to be used to’ and other control verbs. The other two pseudo-verbs, drommare a and trellare di, take the prepositions a and di, respectively: a feature shared with the remaining restructuring and control verbs.</p>
      <p>To address RQ1 (introduced in Section §1), we constructed minimal pairs of verb sequences that either respect or violate Cinque’s (2006) functional hierarchy. Each item in Exp. 1 presents a grammatical (hierarchy-respecting) sentence alongside a minimally different ungrammatical counterpart, with the two verbs separated by varying degrees of hierarchical distance. This experiment tests whether LLMs prefer the option adhering to the hierarchy, and whether their preferences correlate with the hierarchical distance between verbs.</p>
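      <p>Under this coding, the label of a two-verb sequence reduces to a comparison of codes: a sequence is hierarchy-respecting only if the left (matrix) verb carries the higher code. The following fragment is a minimal sketch of that check; only the two code assignments explicitly mentioned above are filled in, the rest of Table 1 being omitted rather than assumed.</p>
      <preformat># Minimal sketch of the hierarchy check implied by the verb coding.
# Only the codes explicitly stated in the text are included here.
HIERARCHY_CODE = {
    "andare": 1,   # lowest verb in the hierarchy
    "solere": 14,  # highest verb in the hierarchy
}

def respects_hierarchy(left_verb, right_verb):
    """True if the left (matrix) verb sits higher in the functional hierarchy."""
    return HIERARCHY_CODE[left_verb] &gt; HIERARCHY_CODE[right_verb]

print(respects_hierarchy("solere", "andare"))  # True: solere may precede andare
print(respects_hierarchy("andare", "solere"))  # False: a hierarchy violation</preformat>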
      <p>A second experiment (Exp. 2) uses the same verb pairs as in Exp. 1, but includes a proclitic in each sentence. This introduces an explicit syntactic cue for restructuring, allowing us to evaluate whether clitic placement influences the model’s preference for the grammatical, hierarchy-respecting variant.</p>
      <p>To address RQ2, Exp. 3 and Exp. 4 pair control verbs with restructuring verbs, testing them in both possible orders: restructuring+control (Exp. 3) and control+restructuring (Exp. 4). Each minimal pair includes clitics, with the grammatical variant displaying enclisis on the infinitival verb and the ungrammatical one displaying proclisis onto the matrix verb. The latter is ruled out because in both cases the control verb introduces a clausal boundary that blocks clitic climbing.</p>
      <p>To investigate RQ3, we conducted a series of experiments pairing restructuring and control verbs with the three pseudo-verbs introduced earlier. Exp. 5 combines each of the three pseudo-verbs (grabbare, drommare a, trellare di) with all 14 restructuring verbs, presenting two variants per item: one with proclisis onto the matrix verb (suggesting restructuring), and one with enclisis on the infinitival verb. Exp. 6 reverses the order (restructuring + pseudo-verb) but otherwise follows the same design. Since proclisis requires a monoclausal analysis, these experiments test whether the model treats novel verbs as compatible with restructuring. A systematic preference for the proclitic variant would suggest that the model generalizes restructuring behavior to unseen verbs.</p>
      <p>Exp. 7 and Exp. 8 approach the same question from the opposite angle, pairing pseudo-verbs with control verbs. In Exp. 7, the order is control + pseudo-verb, while in Exp. 8, it is pseudo-verb + control. In both cases, only the enclitic variant is grammatical because control verbs block clitic climbing, even if the model assumes the pseudo-verb to be restructuring-compatible. This design offers a strong test of whether the model robustly distinguishes restructuring from control verbs. If the model is sensitive to this contrast, it should reject the proclitic variant in favor of enclisis, indicating a fine-grained syntactic representation of clitic domain boundaries.</p>
      <p>Exp. 9 further probes the syntactic status of pseudo-verbs by pairing them with each other and testing proclitic vs. enclitic placement. This experiment asks whether the model classifies pseudo-verbs as restructuring-like or control-like when they co-occur, shedding light on whether it generalizes clitic behavior within novel verb classes.</p>
      <p>In Exp. 10, we tested pseudo-verbs in isolation, assessing model preferences for auxiliary selection (have vs. be) — another syntactic hallmark of restructuring (see §2.1). For comparison, Exp. 12 and Exp. 13 extend this test to restructuring (modal) and control verbs, respectively.</p>
      <p>Exp. 11 tests pseudo-verbs selecting infinitival complements, presenting both proclitic and enclitic variants. This experiment investigates whether the model prefers proclisis (indicating a restructuring representation, along the lines of Exp. 5) or enclisis, and whether this preference is modulated by the presence or absence of the prepositions di and a.</p>
      <p>Finally, in Exp. 12 and Exp. 13 we tested modal (restructuring) verbs and control verbs with auxiliary selection, respectively (only modals allow both essere ‘to be’ and avere ‘to have’ with unaccusative verbs, while control verbs require avere). This allows us to see whether the fine-grained syntactic distinctions between restructuring and control have been successfully generalized by these models.</p>
      <sec id="sec-4-1">
        <title>4.1. Materials: Minimal Pairs</title>
        <p>The minimal contrasts exemplified in Table 2 have been considered. For each condition internal to each experiment, we generated 100 structurally irrelevant variants displaying different lexical items as subjects, infinitival verbs, and objects (when present). Although some of the items across the experiments were semantically odd, the generalizations are nonetheless still strong, and the contrast within the pairs remains sharp, as in the example below.</p>
        <p>4. il calciatore lo sta riuscendo a finire di ideare
the soccer player it.cl is about to be able to finish to design
5. *il calciatore lo riesce a star finendo di ideare
the soccer player it.cl is able to be about to finish to design</p>
        <p>The script responsible for the generation of the minimal pairs is available on GitHub.</p>
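        <p>To make the design concrete, the following Python fragment sketches how lexically varied copies of a single contrast can be produced from a template. It is a minimal illustration, not the released script: the lexicon, function name, and the Exp. 3-style contrast used here are our own assumptions.</p>
        <preformat># Toy sketch of the variant-generation step (illustrative only;
# the actual generation script is the one released on GitHub).
import itertools
import random

SUBJECTS = ["Marco", "il calciatore", "la ragazza"]   # hypothetical lexicon
INFINITIVES = ["riparare", "mangiare", "ideare"]

def make_pairs(matrix_good, matrix_bad, n=100):
    """Yield up to n lexically varied copies of one minimal contrast."""
    combos = list(itertools.product(SUBJECTS, INFINITIVES))
    random.shuffle(combos)
    for subj, inf in combos[:n]:
        good = f"{subj} lo {matrix_good} {inf}."  # grammatical: clitic climbing allowed
        bad = f"{subj} lo {matrix_bad} {inf}."    # ungrammatical: clitic climbs over a control verb
        yield good, bad

# Exp. 3-style contrast: restructuring (volere) vs. control (decidere di)
for good, bad in make_pairs("vuole", "decide di", n=3):
    print(good, "|", bad)</preformat>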
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiments</title>
        <p>Five LLMs have been employed for the evaluation of syntactic generalization with minimal pair sentences. The selection was driven by two key factors: model size and language of training.</p>
        <p>Correspondingly, we included large, medium and small models — Minerva-7B-base-v1.0, GPT2-medium-italian-embeddings, Bert-base-italian-xxl-uncased, GPT2-small-italian and GePpeTto. All models are trained on Italian corpora, hence they allow us to assess whether exposure to Italian during training enhances syntactic generalization in a typologically relevant domain. This setup enables a direct comparison between different model sizes in the ability to internalize the structural dependencies necessary to abstract the relevant generalizations. All models are available on Hugging Face [<xref ref-type="bibr" rid="ref35">35</xref>, <xref ref-type="bibr" rid="ref36">36</xref>, <xref ref-type="bibr" rid="ref5">5</xref>, <xref ref-type="bibr" rid="ref37">37</xref>, <xref ref-type="bibr" rid="ref38">38</xref>].</p>
        <p>Minerva-7B-base-v1.0 [<xref ref-type="bibr" rid="ref3">3</xref>] is a causal LLM with 7 billion parameters, based on the Mistral architecture (32 layers, hidden size 4096, 32 attention heads, context window of 4096 tokens), trained on ~2.48 trillion tokens (1.14T Italian, 1.14T English, 200B code) with a 51,200-token vocabulary.</p>
        <p>Bert-base-italian-xxl-uncased is the Italian version of the BERT base model (uncased), a masked LLM trained with masked language modeling and next sentence prediction objectives. The model has 111M parameters, and its training data consist of the OPUS corpus (https://opus.nlpl.eu/) extended with additional content from the Italian portion of the OSCAR corpus, for a final training corpus of 81GB and 13,138,379,147 tokens.</p>
        <p>GroNLP/GPT2-medium-italian-embeddings [<xref ref-type="bibr" rid="ref4">4</xref>] is built on the GPT-2 medium architecture, with 359M parameters, with the lexical layer retrained to support Italian.</p>
        <p>GroNLP/GPT2-small-italian [<xref ref-type="bibr" rid="ref4">4</xref>] is a smaller causal Transformer with 121 million parameters, built on the GPT-2 small architecture and retrained in Italian.</p>
        <p>GePpeTto [<xref ref-type="bibr" rid="ref6">6</xref>] has a GPT2-small configuration (~117 million parameters) and has been trained on the Italian corpora OSCAR (https://huggingface.co/datasets/oscar-corpus/oscar), PAISÀ (https://www.corpusitaliano.it/en/) and Wikipedia. Similarly based on the GPT2-small architecture, GePpeTto employs a BPE tokenizer with a reduced vocabulary of 30,000 tokens, specifically adapted for Italian linguistic data.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. LLMs Evaluation</title>
          <p>The LM-eval platform [<xref ref-type="bibr" rid="ref39">39</xref>] was adopted to perform the minimal pair tests. A total of 610,500 minimal pairs were generated and divided into 13 groups, as described in §4.1, and assessed by all the selected models. For each experiment we computed the mean accuracy and standard deviation (Table 3), leaving further statistical analyses for the future. For unknown reasons, some models failed to complete certain evaluation tasks without producing any intelligible error messages.</p>
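          <p>In a minimal pair test, a model is credited with an accurate response when it assigns a higher probability to the grammatical variant than to its ungrammatical counterpart. The following self-contained sketch illustrates this scoring logic with Hugging Face transformers; it is an assumption-laden stand-in for the LM-eval harness [39], using GroNLP/gpt2-small-italian and the clitic climbing contrast from §2.1 as examples.</p>
          <preformat># Sketch of minimal-pair scoring with a causal LM (illustrative;
# the reported results were obtained with the LM-eval harness [39]).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
model = AutoModelForCausalLM.from_pretrained("GroNLP/gpt2-small-italian")
model.eval()

def sentence_logprob(sentence):
    """Total log-probability of a sentence under the causal LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens
    return -out.loss.item() * (ids.size(1) - 1)

good = "Marco lo vuole mangiare."     # restructuring verb: clitic climbing allowed
bad = "Marco lo decide di mangiare."  # control verb: clitic climbing blocked
accurate = sentence_logprob(good) &gt; sentence_logprob(bad)
print(accurate)  # counts toward mean accuracy for this experiment</preformat>
          <p>For the masked model (Bert-base-italian-xxl-uncased), sentence-level probabilities are not directly defined; a standard workaround, sketched below only as a plausible option rather than the procedure actually used, is the pseudo-log-likelihood obtained by masking and scoring each token in turn.</p>
          <preformat># Pseudo-log-likelihood scoring for a masked LM (a common choice;
# whether the harness scores BERT this way is an assumption).
from transformers import AutoModelForMaskedLM

mtok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")
mlm.eval()

def pseudo_logprob(sentence):
    """Sum log P(token | rest) with each position masked in turn."""
    ids = mtok(sentence, return_tensors="pt").input_ids
    total = 0.0
    for i in range(1, ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, i] = mtok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked).logits
        total += torch.log_softmax(logits[0, i], -1)[ids[0, i]].item()
    return total</preformat>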
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We organize our results around the three core research questions, which reflect different dimensions of the models’ syntactic generalizations with respect to restructuring verbs, control verbs, and infinitive-selecting pseudoverbs. For each question, we present the relevant experimental conditions and summarize the performance of all tested LLMs in terms of mean accuracy and standard deviation. To assess whether models internalize the syntactic hierarchy of restructuring verbs proposed by [<xref ref-type="bibr" rid="ref1">1</xref>] (RQ1), Exp. 1 and 2 tested sequences of two restructuring verbs in the correct vs. incorrect hierarchical order, with and without clitic pronouns. Mean accuracies in these experiments were consistently low (Minerva: 36–37%, GePpeTto: 36–38%, GPT2: 46–48%), with SD close to 0.5. BERT, however, performed moderately above chance (Exp. 1: 64.6%, Exp. 2: 56.9%), suggesting that it may encode some sensitivity to hierarchical ordering, although not robustly.</p>
      <p>The presence of clitics in Exp. 2 did not alter model behavior compared to Exp. 1. Models show no evidence of having acquired the hierarchical layout of restructuring verbs, BERT’s results aside. However, their responses may correlate with verb distance or hierarchical ordering, which we leave for further research.</p>
      <p>To evaluate whether models distinguish between restructuring and control verbs based on syntactic diagnostics (RQ2), we considered two properties: clitic climbing (Exp. 3, 4, 5, 6, 7, 8, 9, 11) and auxiliary switch (Exp. 10, 12, 13). In Exp. 3 and 4, which tested restructuring–control verb sequences with clitics, models consistently failed to block clitic climbing where it was expected to be ungrammatical. GePpeTto and Minerva almost systematically chose the ungrammatical option (18–28%), while GPT2-small showed slightly better performance (42–46%) but with high variability; BERT performed near floor (5–6%). A model that shows a bias over 75–80% can in fact be considered structurally coherent, even though it picks the ungrammatical option [<xref ref-type="bibr" rid="ref40">40</xref>].</p>
      <p>In Exp. 7 and 8, which paired control verbs with pseudoverbs, models again failed to systematically block clitic climbing. GPT2 reached 48–57% accuracy, while GePpeTto and Minerva remained well below chance (12–18%).</p>
      <p>In Exp. 9 and 11, which included only pseudoverbs, GePpeTto consistently preferred proclitic constructions (low accuracy = proclisis favored), while Minerva and GPT2-small showed no clear preferences, again reflecting indecision or inconsistency.</p>
      <p>As for auxiliary selection, the results reveal a further lack of syntactic differentiation: in Exp. 10, GePpeTto systematically selected essere (7% accuracy), suggesting it interpreted pseudoverbs as restructuring verbs. GPT2-small showed more balanced choices (47%), compatible with the ambiguity characteristic of some restructuring verbs, which allow both avere and essere.</p>
      <p>In Exp. 12, in fact, testing modal auxiliaries, models should ideally show 50% accuracy, given the optionality of auxiliary selection; instead, both GPT2-small and GePpeTto showed categorical but divergent choices, with accuracies around 5%.</p>
      <p>In Exp. 13 (control verbs), only Minerva performed above chance (57%), while GePpeTto and GPT2-small selected the incorrect auxiliary (essere) almost categorically (1% accuracy), and BERT was the only model to outperform Minerva (63%).</p>
      <p>As a result, models largely fail to generalize the syntactic constraints of restructuring and control verbs. Clitic climbing is not consistently blocked by control verbs, and auxiliary selection does not reliably reflect the transparency effects typical of restructuring verbs nor the ambiguity intrinsic to them. Only GPT2-small shows partial sensitivity in some control constructions, while GePpeTto tends toward an overgeneralization of restructuring syntax (e.g. by overselecting essere as an auxiliary).</p>
      <p>Finally, a central question of this study addresses how models categorize pseudoverbs — novel verbs not seen during training but constructed to select infinitival complements — and whether they are interpreted as control or restructuring verbs.</p>
      <p>In Exp. 5 and 6, pseudoverbs appeared in sequences with restructuring verbs, with proclitic vs. enclitic alternations. Minerva showed a slight preference for the enclitic form (23–29% accuracy), suggesting a bias toward control-like syntax. GePpeTto strongly preferred the proclitic form (17% accuracy = 83% proclisis), indicating a restructuring-like interpretation. GPT2-small was ambivalent. Since the three pseudoverbs differ in whether they select a preposition, mirroring the variation found among restructuring verbs, further analyses will investigate this property as a potential factor.</p>
      <p>Exp. 9 and 11, which tested proclitic/enclitic preferences with pseudoverb–pseudoverb sequences, reinforced these trends: GePpeTto showed a consistent preference for proclitic constructions (11–15% accuracy), while GPT2-small and Minerva again showed no strong preference.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Mean accuracy (Mean) and standard deviation (Std) per experiment for Minerva-7B-base-v1.0.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Experiment</th><th>UID</th><th>Mean</th><th>Std</th></tr>
          </thead>
          <tbody>
            <tr><td>Exp. 1</td><td>sequence of two restructuring verbs testing only linear order</td><td>0.3646</td><td>0.4813</td></tr>
            <tr><td>Exp. 2</td><td>sequence_pairs_with_clitics</td><td>0.3762</td><td>0.4844</td></tr>
            <tr><td>Exp. 3</td><td>restructuring_and_control_plus_clitics</td><td>0.2854</td><td>0.4516</td></tr>
            <tr><td>Exp. 4</td><td>control_and_restructuring_plus_clitics</td><td>0.2253</td><td>0.4178</td></tr>
            <tr><td>Exp. 5</td><td>pseudo_and_restructuring_plus_clitics</td><td>0.2336</td><td>0.4231</td></tr>
            <tr><td>Exp. 6</td><td>restructuring_and_pseudo_plus_clitics</td><td>0.2857</td><td>0.4518</td></tr>
            <tr><td>Exp. 7</td><td>control_and_pseudo_plus_clitics</td><td>0.1569</td><td>0.3637</td></tr>
            <tr><td>Exp. 8</td><td>pseudo_and_control_plus_clitics</td><td>0.1810</td><td>0.3850</td></tr>
            <tr><td>Exp. 9</td><td>pairs_of_pseudo_verbs_plus_clitics</td><td>0.5583</td><td>0.4966</td></tr>
            <tr><td>Exp. 10</td><td>auxiliary_switch_with_pseudoverbs</td><td>—</td><td>—</td></tr>
            <tr><td>Exp. 11</td><td>pseudo_verbs_plus_clitics</td><td>0.2267</td><td>0.4187</td></tr>
            <tr><td>Exp. 12</td><td>auxiliary_switch_with_modals</td><td>—</td><td>—</td></tr>
            <tr><td>Exp. 13</td><td>auxiliary_switch_with_control_verbs</td><td>0.5700</td><td>0.4951</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In Exp. 10, which tested auxiliary selection with pseudoverbs, GePpeTto again opted overwhelmingly for essere, consistent with restructuring behavior, while GPT2-small distributed responses more evenly. BERT distributed its choices roughly evenly (around 53.7% accuracy), suggesting some awareness of optionality, though this may be an artifact of random choice.</p>
      <p>These results suggest that GePpeTto interprets novel infinitive-selecting verbs as restructuring verbs by default (although without expressing the available optionality with avere), consistently favoring proclisis and the auxiliary essere. In contrast, GPT2-small and Minerva exhibit uncertainty or mixed behavior, with no consistent syntactic categorization of pseudoverbs.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Overall, the findings reveal that the models’ behavior does not align with the predictions raised by the framework of [<xref ref-type="bibr" rid="ref1">1</xref>], nor with the grammatical requirements characteristic of the syntax of non-finite complements in Italian. Instead, their choices are often inconsistent, insensitive to syntactic structure, or driven by superficial factors. The first research question addressed whether models generalize the hierarchical structure of restructuring verbs as observed in the syntactic literature ([<xref ref-type="bibr" rid="ref1">1, 11</xref>]). Our results clearly indicate that no such hierarchy is reflected in the models’ performance. Accuracies were consistently low, and variability high. These findings echo previous results showing that LLMs often fail to internalize syntactic hierarchies when such structures are not directly observed during training or explicitly encoded [30]. Even BERT, which slightly outperformed other models on restructuring verb order, failed across the board on clitic-related diagnostics. This has implications for how much syntactic theory — especially fine-grained distinctions like cartographic hierarchies — is learnable from surface patterns alone.</p>
      <p>In the second set of questions, we tested whether models are able to handle clitic climbing and auxiliary selection, two classical diagnostics that distinguish restructuring from control. Across all clitic-related experiments, models consistently failed to block clitic climbing where it should be ungrammatical, especially in the presence of control verbs. This strongly suggests that models do not encode the syntactic opacity of control verbs. A potential explanation for these results lies in tokenization artifacts. Unlike proclitics (e.g., lo ha visto ‘it.obj has seen’), enclitics (e.g., vederlo ‘see-it.obj’) should be tokenized as subword fragments. If models fail to treat enclitics as distinct morphemes, this may increase their preference for proclitic constructions simply because the latter are tokenized as independent words, easily recognizable as syntactic objects.</p>
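      <p>This tokenization asymmetry is easy to inspect directly. The snippet below, a quick illustration rather than part of our experimental pipeline, prints how one of the evaluated tokenizers segments a proclitic versus an enclitic string; the exact segmentation it returns is an empirical question, not something we assume here.</p>
      <preformat># Inspecting how an evaluated tokenizer segments proclisis vs. enclisis
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
print(tok.tokenize("lo ha visto"))  # proclitic: "lo" appears as a standalone token
print(tok.tokenize("vederlo"))      # enclitic: "lo" can only surface inside a subword split</preformat>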
Our results clearly indicate that no such hierarchy is not reliably encode the syntactic transparency of
restrucreflected in the models’ performance. Accuracies were turing verbs nor the obligatory opacity of control verbs.
consistently low, and variability high. These findings Syntactic features that are not overtly marked in
surecho previous results showing that LLMs often fail to face form — such as whether a verb transmits argument
internalize syntactic hierarchies when such structures structure or allows clitic climbing — appear to be dificult
are not directly observed during training or explicitly for models to capture, even when such distinctions are
encoded [30]. Even BERT, which slightly outperformed central to grammaticality.
other models on restructuring verb order, failed across the
board on clitic-related diagnostics. This has implications
for how much syntactic theory — especially fine-grained 7. Conclusions
distinctions like cartographic hierarchies — is learnable
from surface patterns alone. This study investigated whether LLMs encode abstract
      </p>
      <p>This study investigated whether LLMs encode abstract syntactic generalizations by testing their sensitivity to the restructuring verb hierarchy in Italian. Using a suite of controlled minimal pair experiments targeting verb order, clitic placement, and auxiliary selection, we assessed models’ ability to capture structural dependencies that go beyond linear surface patterns.</p>
      <p>The models tested — GPT2-small-italian, GPT2-medium-italian-embeddings, GePpeTto, Bert-base-italian-xxl-uncased and Minerva-7B-base-v1.0 — showed limited sensitivity to the syntactic hierarchy of restructuring verbs, failed to consistently distinguish restructuring from control verbs based on key syntactic diagnostics, and did not consistently categorize novel infinitive-taking verbs based on the non-finite embedding typology available in Italian. These findings highlight fundamental limitations in the syntactic abstraction capacities of current models, particularly in domains where structural contrasts are not overtly marked in surface form.</p>
      <p>While none of the models fully internalize the hierarchical structure of restructuring verbs, some results (such as BERT’s above-chance accuracy in distinguishing hierarchy-respecting sequences in Exp. 1) suggest at least some limited sensitivity to structural cues. However, this sensitivity is neither robust nor consistent across models or conditions, and most importantly it does not translate into reliable grammaticality judgments. For example, clitic placement’s explicit cues for restructuring failed to improve performance, and models consistently failed to block ungrammatical clitic climbing or the essere auxiliary selection in the context of control verbs. These findings indicate that, to the extent models are sensitive to structural hierarchies, in the domain of cartographic generalizations this sensitivity remains shallow and insufficient for capturing the related grammatical distinctions.</p>
      <p>Addressing these limitations will require new approaches to model design, training, and evaluation that go beyond surface-level pattern recognition, and may involve encoding linguistic biases into model architectures — much like cartographic hierarchies are hypothesized to be innately hardwired in human cognition.</p>
    </sec>
    <sec id="sec-3">
      <title>8. Limitations</title>
      <p>The main limitation of the current research lies in the exclusive usage of publicly available pre-trained models, as outlined in §4.2. To obtain a fine-grained understanding of models’ capacity for syntactic generalization, future work will employ models trained from scratch, with a training regimen reproducing human language acquisition stages (see §2). The alignment between learning trajectories and the implementation of more structured training methodologies and inductive biases (see §3) will hopefully improve models’ performance in syntactic tasks [<xref ref-type="bibr" rid="ref41">41</xref>, <xref ref-type="bibr" rid="ref42">42</xref>].</p>
      <p>Moreover, we are in the process of designing an acceptability judgment task to present these contrasts to native speakers and properly compare LLM performance with human data. Further analyses, currently underway, are required to provide a more comprehensive understanding of the syntactic behaviors tested. These will be reported in future work.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.1, Call for tender No. 104 published on 2.2.2022 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU – Project Title T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics – CUP I53D23003900006 – Grant Assignment Decree No. 104 adopted on 2 February 2022 by the Italian Ministry of University and Research (MUR). PI: CC.</p>
    </sec>
    <sec id="sec-genai">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cinque</surname>
          </string-name>
          , Restructuring and functional heads,
          <source>Cartography of Syntactic Structures (Hardcover)</source>
          , Oxford University Press, Cary,
          <string-name>
            <surname>NC</surname>
          </string-name>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cinque</surname>
          </string-name>
          ,
          <article-title>Adverbs and functional heads: A crosslinguistic perspective</article-title>
          , Oxford University Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          , G. Fiameni,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>Minerva llms: The first family of large language models trained from scratch on italian data, in: Proceedings of the 10th Italian conference on computational linguistics (CLiC-it</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>W. De Vries</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>As good as new. how to successfully recycle english gpt-2 to make models for other languages, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP</article-title>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>836</fpage>
          -
          <lpage>846</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . findings-acl.
          <volume>74</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>DBMDZ</given-names>
            <surname>- Bavarian State</surname>
          </string-name>
          <string-name>
            <surname>Library</surname>
          </string-name>
          ,
          <article-title>Bert-base italian xxl uncased</article-title>
          , https://huggingface.co/dbmdz/ bert-base
          <article-title>-italian-xxl-</article-title>
          <string-name>
            <surname>uncased</surname>
          </string-name>
          ,
          <year>2020</year>
          . Accessed:
          <fpage>2025</fpage>
          -08-01.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerini</surname>
          </string-name>
          ,
          <article-title>Geppetto carves italian into a language model</article-title>
          ,
          <source>in: Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-It</source>
          <year>2020</year>
          , Bologna,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          , E. Dupoux,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Assessing the ability of lstms to learn syntax-sensitive dependencies, Transactions of the Association for Computational Linguistics 4 (</article-title>
          <year>2016</year>
          )
          <fpage>521</fpage>
          -
          <lpage>535</lpage>
          . L.
          <string-name>
            <surname>Sutawika</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Thite</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A. Zou,</given-names>
          </string-name>
          <article-title>The language model evaluation harness, 2024</article-title>
          . URL: https://zenodo.org/records/12608602. doi:
          <volume>10</volume>
          .5281/zenodo.12608602.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L. P.</given-names>
            <surname>Bianchessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bressan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fusco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          , T. Sgrizzi,
          <article-title>From recursion to incrementality: Return to recurrent neural networks, Linguistic Vanguard (forthcoming).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>L.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Gul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jumelet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ross</surname>
          </string-name>
          , et al.,
          <article-title>Babylm turns 3: Call for papers for the 2025 babylm workshop</article-title>
          , arXiv preprint arXiv:
          <volume>2502</volume>
          .10645 (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fusco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L. P.</given-names>
            <surname>Bianchessi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bressan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sgrizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <article-title>Recurrent networks are (linguistically) better? an (ongoing) experiment on small-lm training on child-directed speech in italian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>