<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2021.findings-acl.74</article-id>
      <title-group>
        <article-title>Acquisition in Babies and Machines: Comparing the Learning Trajectories of LMs in Terms of Syntactic Structures (ATTracTSS Test Set)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarah Rossi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Formichi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Neri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Sgrizzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asya Zanollo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronica Bressan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Linguistics and Comparative Cultural Studies, Ca' Foscari University of Venice</institution>
          ,
          <addr-line>Fondamenta Tofetti 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IUSS Pavia</institution>
          ,
          <addr-line>P.zza Vittoria 15, 27100, Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NeTS Lab, IUSS Pavia</institution>
          ,
          <addr-line>P.zza Vittoria 15, 27100, Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>57</volume>
      <issue>1</issue>
      <fpage>836</fpage>
      <lpage>846</lpage>
      <abstract>
        <p>A cognitively plausible language model should (i) process language incrementally, (ii) be trained on naturalistic input, and (iii) mirror the developmental stages observed in child language acquisition. This study focuses on the third point by exploring the adherence of language models' developmental patterns to the predictions of two empirically grounded theories of syntactic acquisition, the Growing Trees and the Neo-Emergentist approaches. Using an evaluation method based on perplexity, we test whether small and medium Italian-tuned LMs (two small GPT2 LMs, GePpeTto, and Minerva-7B) show sensitivity to syntactic phenomena corresponding to three acquisitional stages documented in child Italian. Our results suggest that smaller open models only partially reflect the stagewise progression observed in children.</p>
      </abstract>
      <kwd-group>
        <kwd>Language acquisition</kwd>
        <kwd>LMs</kwd>
        <kwd>syntax</kwd>
        <kwd>cognitive plausibility</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>linguistics—can contribute meaningfully to linguistic
inquiry, provided that certain conditions on the cognitive
State-of-the-art Large Language Models (LLMs) demon- plausibility of the model are met.
strate remarkable success on various linguistic bench- A language model (LM) that aspires to linguistic
cogmarks ([1], inter alia). However, from a linguistic per- nitive plausibility should meet at least three key criteria.
spective, they remain uninteresting from the point of First, it should process linguistic input incrementally,
review of their cognitive plausibility. In fact, their archi- flecting the word-by-word, real-time parsing observed
tecture and learning dynamics difer fundamentally from in human sentence production and comprehension [4].
those of human learners, raising doubts about their rele- Second, it should be exposed to naturalistic training
invance to linguistic inquiry [2, 3]. put, approximating the kind and distribution of linguistic</p>
      <p>Nonetheless, following [4], we argue that language data encountered by human learners (PoS argument, [5]).
modeling—despite often being overlooked in theoretical Third—and this is the focus of the present study—it should
reproduce the developmental trajectory observed in first
CLiC-it 2025: Eleventh Italian Conference on Computational Linguis- language acquisition, where syntactic competence
foltics, September 24 — 26, 2025, Cagliari, Italy lows structured and empirically documented stages. In
* Corresponding author. line with this, we investigate whether LMs exhibit
cog† These authors contributed equally. nitive plausibility with respect to syntax by examining
$ sarah.rossi@iusspavia.it (S. Rossi); guido.formichi@iusspavia.it whether they reflect insights from linguistic theory on
(tGom.Fmoarsmoi.csghri)i;zzsoi@fia.inuesrsip@aivuiass.ipta(vTi.aS.igtr(iSz.ziN);eri); how humans acquire and process syntactic knowledge.
asya.zanollo@iusspavia.it (A. Zanollo); veronica.bressan@unive.it We compare two prominent approaches to syntactic
(V. Bressan); cristiano.chesi@iusspavia.it (C. Chesi) development: the Growing Trees approach (GT) [6] and
 https://rossisarah.github.io/ (S. Rossi); the Neo-Emergentist approach (NE) [7]. We argue that
exhttps://www.iusspavia.it/en/contacts/guido-formichi (G. Formichi); plicit, theoretically informed, and empirically grounded
hhttttppss::////twowmwsg.iruizszsip.gaivtihau.ibt/.iiot//ru(Tb.riScgar/iszozfiai)-;neri (S. Neri); theories of language acquisition can serve as efective
https://www.iusspavia.it/it/rubrica/asya-zanollo/ (A. Zanollo); testing grounds for the evaluation of linguistic
plausibilhttps://www.unive.it/data/people/28977262 (V. Bressan); ity of LMs.
https://github.com/cristianochesi (C. Chesi) We propose an efective method for evaluating the
ac0009-0007-2525-2457 (S. Rossi); 0009-0007-1307-194X quisition stages reflected in various (L)LMs by collecting
((TG.. SFgorrimzziic);h0i)0;0090-0090-0010-0339-8574-5468-4035(5A6.(SZ.aNnoelrlio));;00000000--00000033--13307752--17395697 their perplexity estimates for sentences corresponding to
(V. Bressan); 0000-0003-1935-1348 (C. Chesi) stages observed in typical Italian first language
develop© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License ment [8]. For our set of experiments, we drew from both
Attribution 4.0 International (CC BY 4.0).</p>
      <p>GT and NE literature to identify 17 core phenomena, each architectures that may embed relevant linguistic
inturepresented by a prototypical structural pattern. To en- itions in the form of inductive biases [4].
rich the dataset, we introduced variations to these struc- Within a linguistic and cognitive perspective, the
distures—e.g., changes in verbal class—resulting in a total tinction between models and theories is well established
of 89 subphenomena. For each subphenomenon, we gen- and relevant to discussions about LLMs. [15]
distinerated 100 lexically neutral instances, yielding a compre- guishes models, tools for simulating or predicting
linguishensive evaluation battery of 8,900 items. We tested three tic data, from theories which, on the other hand, seek to
GPT2 small Italian LMs, ita-baseline-small and NeTS-3M explain underlying cognitive mechanisms. [19] similarly
[9, 10], GePpeTto—117M parameters [11]—and a larger emphasizes that valid theories must provide
mechanisone—Miverva-7B-base, 7B parameters [12]. Results show tic explanations rather than merely replicate behavior.
that Italian language models exhibit a stage-wise syntac- More recently, [20] formalizes this distinction, describing
tic learning trajectory that aligns more closely with the models as devices for representing systems or testing
GT approach, which proves more predictive than the NE specific hypotheses, whereas theories aim to provide
exframework. We conclude that while key asymmetries planatory frameworks to generalize across phenomena.
remain, models trained on a minimal amount of input They argue that during early theory development, when
consisting solely of child-directed speech (e.g., NeTS-3M) empirical testing is limited, plausibility—shaped by
faccan approximate the developmental patterns observed in tors such as computational tractability and theoretical
human language acquisition. invariance—serves a critical criterion for advancing from
models to theories. In sum, while LLMs demonstrate
impressive empirical performance and ofer valuable tools
2. Poverty of Stimulus, LLMs, and for exploring linguistic patterns, their fundamental
diflanguage theories ferences from human cognition, limitations in capturing
graded acceptability, and reliance on vast datasets,
distinA striking diference between LLMs and natural language guish them from genuine linguistic theories. At present,
acquisition lies in the quantity of training data needed LLMs function as tools for hypothesis testing rather than
to achieve adult competence. A robust cross-linguistic as explanatory accounts of language cognitive
foundaobservation in first language acquisition is that children tions.
converge on the adult grammar within a remarkably In this context, we draw on two linguistic theories from
short developmental window—by approximately age 4 to the current literature to support the view that language
6—regardless of the language they are exposed to [13, 14], learning by (L)LMs can be meaningfully assessed—and
and with limited exposure to primary linguistic data, as compared to child language acquisition—using precise
emphasized in the Poverty of Stimulus (PoS) argument linguistic criteria.
[15].</p>
      <p>However, the PoS argument has been recently
challenged by scholars who argue that LLMs represent the 3. Theories of Language
most empirically grounded models of language currently Acquisition
available, and that core features of human linguistic
competence (e.g., recursion, logical inference, and hierarchi- As already mentioned, children converge on adult
gramcal syntactic structure) may emerge spontaneously in mar within a remarkably short time [13, 14]. While there
predictive models trained on unannotated language data can be (moderate) variability in the timing of
acquisi[16, 17]. From this perspective, LLMs question the ne- tion in typically developing children, the developmental
cessity of domain-specific innate mechanisms posited by patterns are consistent across individuals, in two key
Generative Grammar (GG), suggesting that rich linguis- respects. First, all children go through stages in which
tic generalizations may arise from data-driven statistical they make systematic, non-random errors—such as
overlearning, given domain-general cognitive inductive bi- production of computationally lighter structures with a
ases embedded in the artificial neural network architec- smaller number of morphosyntactic elements [21], like
ture [18]. uninflected verbs (infinitives [ 22, 23] and imperatives</p>
      <p>However, the debate concerns not only whether LLMs [24]; e.g., Mangi-a! ‘Eat!, imperative’ vs. Mangi-a-v-ano
exhibit linguistic capacities or the amount of data re- ‘They ate’, past imperfective).
quired, but also whether they can inform a theory that Second, children produce and master certain sentence
accounts for the cognitive underpinnings of natural lan- types before others, and—crucially—the order of
acquiguage. The issue should be addressed from multiple sition appears to be consistent across learners: some
perspectives: by developing fine-grained performance children progress more rapidly than others, but all pass
metrics, creating relevant tasks and benchmarks, paying through the same developmental stages. This provides
attention to the amount of data, and considering model further evidence that the human language faculty
con3.2. Growing Trees Approach
strains the hypothesis space available to learners. This Empirical studies across multiple languages have
study focuses on this second dimension of acquisition: shown that acquisition proceeds in structural bursts or
the order in which diferent sentence types are acquired. “explosions”: at a given point, an entire syntactic
domain (e.g., the vP+TP layer) becomes accessible, and all
3.1. Comparing Competing Theories structures associated with that domain become available
to the child. Crucially, within these domains, there is
We examine two prominent theories in the literature no robust evidence for a fixed internal acquisition
orconcerning the order in which syntactic structures are der, suggesting that what is developmentally primary is
acquired by children. Both seek to answer the same core the availability of the domain itself, not the sequential
question (namely, which structures emerge earlier or later mastery of its substructures. These domains are
straightin child language), but they difer significantly in their forwardly captured by the detailed cartographic structure
empirical methodologies and theoretical assumptions, of the functional spine as it has been drawn by theoretical
leading to divergent predictions that remain under active linguists over the past 30 years [28, 25].
investigation. Given the ongoing nature of this debate, While the foundational empirical work focused on
we consider both approaches in our analysis, without Hebrew, the GT framework has since been extended to
prematurely excluding either. other languages throught both experimental and
corpusbased studies, including Italian [29, 30, 31], English [32],
and others [27].</p>
      <p>The GT approach takes the syntactic tree—a symbolic
and highly formalized representation of sentence struc- 3.3. Neo-Emergentist Approach
ture—as its central object of study. Syntactic trees capture The Neo-Emergentis approach [7, 33, 34] to language
acthe hierarchical relationships among constituents, mak- quisition departs radically from both traditional nativist
ing explicit distinctions that are not evident in surface and certain usage-based models. This approach is
theoretword order, and are therefore indispensable for modeling ically motivated to a maximally impoverished Universal
core properties of natural language. Grammar (UG), in line with Chomskyan “Three Factors”</p>
      <p>The GT hypothesis proposes that syntactic develop- [35]. Rather than positing rich, innate linguistic content
ment unfolds in a layered fashion, reflecting the gradual (Factor 1), this model shifts explanatory weight onto the
availability of diferent regions of the tree. Initially, only interaction between primary linguistic data (PLD; Factor
low structural domains, such as the verb phrase (vP) and 2) and general cognitive learning principles (Factor 3),
inflectional phrase (IP), are accessible to the child, al- thereby advancing a minimalist conception of UG.
lowing for simple subject–verb sentences, for instance. The central claim is that syntactic categories are not
Subsequently, portions of the so-called Left Periphery [? innately specified but are emergent, and that acquisition
25], a high functional layer, become available, supporting proceeds along a learning path where coarser-grained
the production of wh-questions and preposed adverbs. categories are acquired before finer-grained refinements.
Only later does the full functional spine, including higher This involves a successive division algorithm, where the
CP-level structures like embedded clauses, relatives, and child initially makes basic contrasts (such as
predicate/ar“why”-questions, become active. The GT model builds gument) followed by more fine-grained subdivision
(idenupon earlier maturational analyses introduced in the tifying discourse and thematic domain up to
cartograph1990s, notably [26], and further developed in subsequent ically defined syntactic distinctions). Data from Catalan,
work (see [6, 27]). In a cognitively plausible model, one Spanish, Italian, German, and Dutch [33] suggests that
would expect to observe a learning trajectory mirroring basic CP structures (such as wh-questions, V2 word
orthat of human acquirers, in which early-acquired struc- der, illocutionary complementisers, and topicalisation)
tures (e.g., simple S–V sentences) are mastered before emerge at early developmental stages (defined in terms of
later-acquired ones (e.g., embedded clauses). MLUw), challenging models that assume a fixed, innately</p>
      <p>Traditional metrics for assessing language develop- specified hierarchy of syntactic categories [ 6, 26, 36]. In
ment, such as Mean Length of Utterance in words contrast, finer-grained structures (e.g. recursive topics,
(MLUw) alone or average age of acquisition across child multiple left-peripheral elements, V3 orders) seem to
apsamples, have limited explanatory power due to the doc- pear only later (around or after MLUw 2.5). Crucially,
umented high degree of individual variability in acquisi- building on the Peripheral Speaker-Hearer Hypothesis
tion speed [6]. In other words, some children are faster (PSHH), which posits that speaker-hearer perspective is
than others, but all of them follow the same develop- formally encoded at the edges of phasal domains [37], NE
mental path, in that they all acquire various syntactic model predicts that here-and-now and
speaker-hearerstructures in the same order. oriented material functions as key bootstrapping
heuristics in acquisition, and therefore they are expected to be
ID
i
ii
iii
iv
v
vi
vii
viii
ix
x
xi
xii
xiii
xiv
xv
xvi
xvii</p>
      <p>SV simple
SV unaccusative
VS unaccusative</p>
      <p>Imperatives</p>
      <p>Modals</p>
      <p>Root wh-questions
Root yes/no questions</p>
      <p>Preposed Adverbs</p>
      <p>Focus
Illocutionary COMPs</p>
      <p>Why questions</p>
      <p>Topics
Embedded that</p>
      <p>Embedded if</p>
      <p>Subject Relative
Object Relative – intervener
Object Relative + intervener
acquired early. This point is particularly relevant when
modeling the developmental trajectory of a language
model, whose training, by definition, lacks access to
referential stimuli such as here-and-now context (cf. the
symbol grounding problem [38]).</p>
      <sec id="sec-1-1">
        <title>3.4. Predictions</title>
        <p>the TP layer is often a matter of theoretical interpretation,
and currently under scrutiny.</p>
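        <p>For concreteness, the contrast between the two articulations can be encoded as a grouping over the sentence types of Table 1. The sketch below is illustrative only: the i–v / vi–x / xi–xvii grouping into GT stages is an assumption made for exposition, and the NE mapping (partly labeled unknown in our dataset) is omitted.</p>
        <preformat>
# Illustrative Python encoding of a stage articulation over the 17 sentence
# types of Table 1. The grouping below (i-v, vi-x, xi-xvii) is assumed for
# exposition; the actual GT and NE assignments are derived from the
# acquisition literature, with unalignable cases labeled "unknown".
GT_STAGES = {
    1: ["SV simple", "SV unaccusative", "VS unaccusative",
        "Imperatives", "Modals"],
    2: ["Root wh-questions", "Root yes/no questions",
        "Preposed Adverbs", "Focus", "Illocutionary COMPs"],
    3: ["Why questions", "Topics", "Embedded that", "Embedded if",
        "Subject Relative", "Object Relative - intervener",
        "Object Relative + intervener"],
}

def gt_stage(sentence_type):
    """Return the (assumed) GT stage of a sentence type."""
    for stage, types in GT_STAGES.items():
        if sentence_type in types:
            return stage
    raise KeyError(sentence_type)
</preformat>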
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Evidence</title>
      <sec id="sec-2-1">
        <title>4.1. Methods</title>
        <p>To test LMs against the developmental predictions of both the NE and GT frameworks, we defined a problem space designed to capture the full range of potential developmental trajectories a LM might exhibit. Using a test set (cf. next subsection) that targets structurally rich constructions attested at various stages of acquisition, we expect a coherent model (i) to be sensitive to syntactic variations and similarities across different sentence types and to assign probabilities accordingly, and (ii) to align with one of the two developmental hypotheses by assigning higher perplexity scores to items corresponding to later stages of acquisition. To obtain perplexity measures and standard errors, we used the lm-evaluation-harness platform [39] and created a custom task consisting of 100 lexically varied instances of each of the syntactic patterns presented in Table 1, further detailed in Appendix A. Items were grouped into three stages to reflect the finer-grained distinctions predicted by the GT framework. If no difference is found between Stage 1 and Stage 2, the LM behavior is consistent with the NE approach; otherwise, if a distinction emerges, this is in line with GT predictions.</p>
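        <p>As a minimal sketch of the underlying measure, the per-sentence perplexity that the harness collects can be reproduced with a HuggingFace causal LM as follows; the model id used here is a stand-in, and the actual runs were performed through lm-evaluation-harness [39].</p>
        <preformat>
# Minimal sketch of the per-sentence perplexity measure, assuming the
# HuggingFace `transformers` library; "GroNLP/gpt2-small-italian" is a
# stand-in model id, not one of the four models evaluated in the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_perplexity(model, tokenizer, sentence):
    """exp of the mean per-token negative log probability of `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # out.loss is the mean token-level NLL

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
lm = AutoModelForCausalLM.from_pretrained("GroNLP/gpt2-small-italian")
print(sentence_perplexity(lm, tok, "Il bambino mangia la mela."))
</preformat>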
      </sec>
      <sec id="sec-2-3">
        <title>4.2. ATTracTSS: A Novel Dataset</title>
        <p>The novel test set we created for evaluating the Acquisition Trajectories of various LMs in Terms of Syntactic Structures is dubbed ATTracTSS. The dataset consists of grammatical sentences representing 17 prototypical syntactic constructions—here referred to as sentence types (e.g., simple SV sentences, wh-questions, topicalizations, embedded clauses)—with 100 lexically diverse items generated for each sentence type.</p>
        <p>We built our dataset based on the phenomena tested by GT and NE. Notably, NE does not provide an explicit list of the specific sentence types it predicts to emerge in a fixed acquisitional order. Therefore, we adapted GT’s classification to the NE framework where possible, deriving stage-based predictions for both hypotheses (Table 1). In cases where alignment was not possible, we assigned the label unknown.</p>
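        <p>A template-based generation procedure along the following lines can produce the lexical variants; the word lists here are invented placeholders, not the actual ATTracTSS lexicon.</p>
        <preformat>
# Illustrative generation of lexically diverse items for one sentence type
# (SV simple). The mini-lexicon is hypothetical; a real lexicon must be
# large enough to yield 100 distinct lexicalizations per sentence type.
import itertools
import random

SUBJECTS = ["il bambino", "la maestra", "il cuoco"]   # hypothetical
VERBS = ["mangia", "guarda", "prepara"]               # hypothetical
OBJECTS = ["la mela", "il libro", "la torta"]         # hypothetical

def sv_simple_items(n=100, seed=0):
    """Fill the SV-simple skeleton with up to n random lexicalizations."""
    combos = list(itertools.product(SUBJECTS, VERBS, OBJECTS))
    random.Random(seed).shuffle(combos)
    return [f"{s} {v} {o}." for s, v, o in combos[:n]]

print(sv_simple_items(n=5))
</preformat>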
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Implementation</title>
        <p>We carry out a perplexity analysis starting from the negative log probabilities assigned by the model to each sentence in the dataset. Perplexity levels are expected to correlate inversely with learnability. Perplexity measures how well a model predicts a given sentence: lower perplexity means the model finds the sentence more predictable (less surprising), while higher perplexity means the model finds it less predictable (more surprising). Given the 100 repetitions of the same syntactic skeleton, we assume that averaging over multiple lexicalizations reduces the impact of individual word-level frequency effects on model perplexity.</p>
        <p>At the stage level, our hypothesis is that different acquisition stages would be characterized not only by different mean perplexity values, but also by similar standard deviations (SD), indicating consistent model confidence within each stage. As for sentence types, if perplexity remains consistently low across lexical variants of a sentence type, and the variation is low, we interpret this as evidence that the model handles the structure with a degree of robustness and consistency, suggesting it has learned to generalize over that syntactic pattern. While this should not be taken to imply that the model has acquired the structure in a human-like or abstract sense, such behavior can nonetheless serve as a useful proxy for comparison with human acquisition data.</p>
        <p>Four models were tested: ita-baseline-small, the pretrained GPT2 baseline model for Italian shared by the BabyLM community on the HuggingFace platform [10]; NeTS-3M, a similar small GPT2 model trained on a custom 3M corpus of child-directed speech [40]; GePpeTto, 117M parameters [11]; and a larger model, Minerva-7B-base, 7B parameters [12]. For the NeTS-3M model we also implemented longitudinal tracking by repeating the log-probability analysis across multiple training epochs, in order to trace whether the model’s familiarization path mirrors human developmental patterns. The same type of analysis could not be carried out on the other models, due to the impossibility of carefully controlling their training.</p>
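        <p>The aggregation step can be sketched as follows; the file name and column names are hypothetical, standing in for whatever export the evaluation produced.</p>
        <preformat>
# Sketch of the aggregation from item-level scores to per-type and
# per-stage means. "attractss_scores.csv" and its columns ("model",
# "sentence_type", "stage_gt", "nll") are hypothetical names.
import numpy as np
import pandas as pd

scores = pd.read_csv("attractss_scores.csv")
scores["ppl"] = np.exp(scores["nll"])  # nll = mean -log p(token) per item

# Averaging over the 100 lexicalizations of each syntactic skeleton is
# assumed to wash out individual word-frequency effects.
by_type = scores.groupby(["model", "sentence_type"])["ppl"].agg(["mean", "std"])
by_stage = scores.groupby(["model", "stage_gt"])["ppl"].agg(["mean", "std"])
print(by_stage)
</preformat>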
      </sec>
      <sec id="sec-2-5">
        <title>4.4. Results</title>
        <p>Mean perplexity and SD values for each stage in GT and NE were derived from the negative log probability values that the four models assigned to each of the items in the dataset, as reported in Table 2 (GT) and Table 3 (NE). Despite numerical differences, perplexity tends to increase coherently with the stage progression in all LMs; SD, instead, tends to grow higher in the latest stages of both GT (Stage 3) and NE (Stage 2), suggesting higher variation within them.</p>
        <p>Table 2. Mean perplexity estimation and SD grouped by GT stages. Columns: ita-baseline-small / NeTS-3M / GePpeTto / Minerva (– marks cells not legible in the source).
Overall: 42.1788 (13.28) / 50.0302 (16.29) / 44.9620 (10.98) / 36.5133 (11.14).
Stage 1: 37.3312 (10.28) / – / – / 32.3002 (8.62).
Stage 2: 48.4068 (9.89) / – / – / 41.3775 (8.41).
Stage 3: 55.2353 (17.35) / 65.0507 (17.23) / 56.5017 (12.70) / 48.5069 (13.62).</p>
        <p>Table 3. Mean perplexity estimation and SD grouped by NE stages (same column order).
Overall: 42.1788 (13.28) / 50.0301 (16.29) / 44.9620 (10.98) / 36.5133 (11.14).
Stage 1: 38.8547 (10.84) / – / – / 33.4046 (8.45).
Stage 2: 54.8328 (15.47) / 66.5525 (16.33) / 57.1496 (13.13) / 48.0398 (14.31).</p>
        <p>Then, a series of linear regressions was run to assess whether negative log probability assignment is significantly predicted across models (i) by the different syntactic structures of the sentence types included in the dataset and, most importantly, (ii) by the articulation in stages proposed by GT and/or by NE. Random intercepts for length (i.e., the number of words in each item of the dataset) were included in all regressions. Likelihood ratio tests (ANOVA) between a null model and a model using sentence types as a fixed effect revealed that these significantly improved model fit in all LMs (ita-baseline-small: χ2(65) = 2622.7, p &lt; .0001; NeTS-3M: χ2(65) = 2953.7, p &lt; .0001; GePpeTto: χ2(65) = 2925.3, p &lt; .0001; Minerva: χ2(65) = 3095.7, p &lt; .0001). As for GT and NE, instead, similar tests revealed a sharp asymmetry in the predictive power of the two accounts. Treating GT’s three-stage articulation as a fixed factor significantly improved model fit (ita-baseline-small: χ2(2) = 10.633, p &lt; .00491; NeTS-3M: χ2(2) = 376.68, p &lt; .0001; GePpeTto: χ2(2) = 9.1605, p &lt; .0001; Minerva: χ2(2) = 35.5, p &lt; .0001), but the same did not apply to NE’s stages (p values &gt; .05 for all LMs). Note, however, that except for NeTS-3M, where all pairwise comparisons between stages reach significance, the contrasts between Stages 2 and 3 and between Stages 1 and 3 vary strongly across LMs (see Appendix B), with Stage 3 being the least stable of the three. For the detailed longitudinal results of the NeTS-3M model, see Appendix C.</p>
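        <p>A likelihood ratio test of this kind can be sketched in Python as follows; the data frame layout is hypothetical, and the original analysis may have been run with different tooling.</p>
        <preformat>
# Sketch of the likelihood ratio comparison: null model vs. a model with
# sentence type (or stage) as fixed effect, with random intercepts for
# item length. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("attractss_scores.csv")  # hypothetical export

null = smf.mixedlm("nll ~ 1", df, groups=df["length"]).fit(reml=False)
full = smf.mixedlm("nll ~ C(sentence_type)", df,
                   groups=df["length"]).fit(reml=False)

lr = 2 * (full.llf - null.llf)                    # chi-squared statistic
dof = len(full.fe_params) - len(null.fe_params)   # added fixed effects
p = stats.chi2.sf(lr, dof)
print(f"chi2({dof}) = {lr:.1f}, p = {p:.4g}")
</preformat>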
      </sec>
      <sec id="sec-2-6">
        <title>4.5. Discussion</title>
        <p>The experiments reported in the previous sections were conducted to address the issue of language development in LMs, i.e., to assess whether the way LMs “learn” their language may be compared to the process of natural language acquisition in children. Specifically, we compared the stage-wise developmental predictions of two competing theories, GT and NE, against the performance of some Italian LMs. We did that by looking at the perplexity associated with a varied set of sentences in a novel dataset (the ATTracTSS test set), both from a cross-sectional perspective, looking at four different Italian models (ita-baseline-small, NeTS-3M, GePpeTto, Minerva), and from a longitudinal perspective, focusing on the performance of one of these models (NeTS-3M) across training epochs.</p>
        <p>As for the cross-sectional study, we observed a general alignment of all our LMs with the linguistic development observed in children. Perplexity values tended to grow with the progression of stages in both GT and NE, suggesting that the syntactic structures that children struggle with the most—and therefore take longer to be acquired—roughly overlap with the sentence types that LMs find less predictable. Nevertheless, closer inspection of mean perplexity values per sentence type revealed some variation within the stages, especially in GT Stage 3 and NE Stage 2: some late structures for children, like why-questions, receive very low perplexity (~30), while Stage 1 transitive clauses are assigned higher-than-expected perplexity (~52). These observations suggest that caution is needed when comparing humans with LMs: while the general learning trends align with human acquisition, some important asymmetries remain.</p>
        <p>Moreover, and as a general consideration, our results show consistently higher perplexity if compared to standard benchmarks (e.g., ~20 perplexity for GPT-3 [41]). This may stem from the absence of licensing contexts in the test items, or suggest that the models resolve, for instance, certain non-local dependencies—especially those in the Left Periphery—via strategies that diverge from native-like structural processing. This also suggests that our assessment task is far from trivial, and it highlights the need for further exploration of training regimens to determine whether specific language models exhibit learning trajectories consistent with those observed in human language acquisition.</p>
        <p>Another interesting result concerns the difference in the predictive power of GT and NE with respect to LM performance. While the single structures/sentence types always qualify as good predictors of LM perplexity, ratifying some sort of syntactic representation ability, grouping the phenomena into the three-stage articulation of GT always returns better results than the coarser two-stage subdivision of NE. This pattern holds both across models and in the longitudinal evaluation across training epochs of NeTS-3M, a small-scale transformer trained on 3M tokens of child-directed speech. Even though they were proposed on independent grounds, then, the linguistic stages grounded in the GT framework may offer a useful lens for interpreting LM behavior, especially in cognitively oriented settings.</p>
        <p>Finally, two more relevant considerations may be drawn especially from the epoch-by-epoch analysis, which we could perform on NeTS-3M, the only model for which we could strictly control architecture, training regimen, and training set (a 3M-token corpus including child-directed speech only, [40]; see Appendix C). First, the model shows evidence of learning, gradually reducing the perplexity gap between items from GT Stage 2 and Stage 3, although these stages remain distinguishable. However, and in line with the results of the pairwise comparisons between stages across models, the most pronounced distinction for the model clearly lies between Stage 1 and Stage 2. Second, our findings suggest that minimal (3M tokens for NeTS-3M vs. ≥10M tokens for the other LMs) but curated input allows a transformer model to approximate early, mid, and late stages of language acquisition, in line with the empirically attested developmental patterns of a linguistic theory (the GT approach): this is confirmed not only by the general snapshot of perplexity estimation across stages, where NeTS-3M is the only LM strongly differentiating Stages 1, 2, and 3, but crucially also along training epochs simulating child linguistic development.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Concluding remarks</title>
      <p>In this paper we presented ATTracTSS, a novel dataset to assess the Acquisition Trajectories in Terms of Syntactic Structures, inspired by language acquisition studies and by two competing empirically grounded theories—the Growing Trees approach and the Neo-Emergentist framework. Both theories argue for a stage-wise acquisition of syntax in children, but crucially differ in the size and internal composition of these stages.</p>
      <p>We conducted out-of-the-box evaluations on three “small” language models (up to 124M parameters)—ita-baseline-small, the pretrained GPT2 baseline model for Italian shared by the BabyLM community; NeTS-3M, a similar small GPT2 model trained on a custom 3M corpus; GePpeTto, 117M parameters—and a larger model, Minerva-7B-base, 7B parameters. We measured the perplexity that these LMs assigned to each sentence in the test set and compared it against the three-stage predictions of GT and the two-stage articulation of NE.</p>
      <p>In our experimental results, we observe that small-scale, fully open Italian-tuned models show alignment with theories of language acquisition. Among the theoretical approaches tested, the GT-based stage theory yields more accurate predictions than the NE-based approach. This work demonstrates the benefits of a sufficiently rich grammatical theory in accounting for how language acquisition unfolds in children, and shows how this developmental trajectory can serve as a metric to compare natural, instinct-driven acquisition in humans with the learning processes of LMs. In children, acquisition proceeds incrementally, through identifiable phases or stages. A robust theory is necessary to explicitly determine which linguistic phenomena emerge at which stage, in a principled and non-impressionistic way. Such reflection is crucial for linguists, but it may also have broader practical implications, particularly with respect to the sustainability and optimization of model training. Focusing on child language acquisition, and especially on the stages through which it unfolds, offers an additional, more fine-grained metric for evaluating model competence and cognitive plausibility. In this work, we did not address the notion of cognitive coherence, which we consider too general; instead, we focused on strictly linguistic issues and discussed structural coherence with respect to native speaker intuitions—where structural refers specifically to syntactic structures, i.e., sentence types. This ultimately frames the core tension in terms of learning versus acquisition.</p>
    </sec>
    <sec id="sec-3b">
      <title>6. Limitations</title>
      <p>Although our aim was to assess acquisition stages also across different training regimens—naturalistic, conversational, or redundant [5]—using small-scale corpora (10–100M tokens), we could only test the NeTS-3M model under the redundant regimen, trained on a 3M-token corpus of Italian child-directed speech [40]. This is below the 10M-token BabyLM small track threshold [10]. While minimal, this amount approximates the linguistic input received by a 4-year-old child—who has typically acquired structures across all three developmental stages [<xref ref-type="bibr" rid="ref3">42</xref>]—though an additional million tokens would have brought the exposure closer to that developmental window.</p>
      <p>Moreover, the model architecture—GPT-2 (see model card [<xref ref-type="bibr" rid="ref5">43</xref>])—is not cognitively plausible, as it relies on a non-incremental, parallel attention mechanism that does not reflect human-like structure building [2, 5].</p>
      <p>A further limitation is that the dataset is not fully balanced in terms of the number of phenomena per item; future iterations will aim to expand the dataset and ensure a more uniform distribution across phenomena (see the material in Appendix A).</p>
      <p>Finally, although we draw on attested acquisition patterns from Growing Trees and Neo-Emergentism, we lack adult acceptability data for the same structures. Such data will be essential in future studies to assess whether model outputs at later training stages simulate adult linguistic competence.</p>
    </sec>
    <sec id="sec-3c">
      <title>Acknowledgments</title>
      <p>We acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.1, Call for tender No. 104 published on 2.2.2022 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU – Project Title T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics – CUP I53D23003900006 – Grant Assignment Decree No. 104 adopted on the 2nd February 2022 by the Italian Ministry of University and Research (MUR). PI: Cristiano Chesi.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Online Resources</title>
      <p>Additional resources, including the full ATTracTSS dataset and supporting materials, are available at:
• ATTracTSS GitHub repository</p>
      <p>To stay up to date with future developments from our lab, visit:
• NeTS Lab - Computational Projects
• NeTS Lab - General website</p>
    </sec>
    <sec id="sec-5">
      <title>B. Pairwise Comparisons</title>
      <p>This appendix reports the results of pairwise statistical comparisons between the estimated probabilities associated with each GT stage. See Table 4.</p>
    </sec>
    <sec id="sec-6">
      <title>C. NeTS-3M Model Results Across Epochs</title>
      <p>This appendix reports the results of the NeTS-3M model across epochs (see Table 5). We show performance over 10 training epochs, with predictions evaluated by linguistic phenomenon, by the GT approach, and by the NE approach. Table 5 reports perplexity (derived from -log(probability), where higher values indicate greater model uncertainty) and χ2 values (where higher values reflect stronger model predictions).</p>
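      <p>The longitudinal tracking amounts to re-scoring the full test set with the checkpoint saved after each epoch, as in the sketch below; the checkpoint paths are hypothetical.</p>
      <preformat>
# Sketch of the epoch-by-epoch evaluation used for NeTS-3M, the only model
# whose training we fully controlled. Checkpoint paths are hypothetical.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(model, tokenizer, sentences):
    """Average per-token negative log probability over a list of sentences."""
    total = 0.0
    with torch.no_grad():
        for s in sentences:
            enc = tokenizer(s, return_tensors="pt")
            total += model(**enc, labels=enc["input_ids"]).loss.item()
    return total / len(sentences)

items = ["Il bambino mangia la mela."]  # placeholder for the 8,900 items
tok = AutoTokenizer.from_pretrained("nets-3m/epoch-1")  # hypothetical path
for epoch in range(1, 11):
    lm = AutoModelForCausalLM.from_pretrained(f"nets-3m/epoch-{epoch}")
    print(epoch, math.exp(mean_nll(lm, tok, items)))
</preformat>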
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref20"><mixed-citation>[20] G. Baggio, A. De Santo, N. A. Nuñez, Plausibility and Early Theory in Linguistics and Cognitive Science, Computational Brain &amp; Behavior 7 (2024) 535–547. URL: https://link.springer.com/10.1007/s42113-024-00196-7. doi:10.1007/s42113-024-00196-7.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] L. Rizzi, Grammatically-Based Target-Inconsistencies in Child Language, in: K. U. Deen, J. Nomura, B. Schulz, B. D. Schwartz (Eds.), The Proceedings of the Inaugural Conference on Generative Approaches to Language Acquisition–North America, MIT Working Papers in Linguistics, 2006.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] L. Rizzi, Some Notes on Linguistic Theory and Language Development: The Case of Root Infinitives, Language Acquisition 3 (1993) 371–393. URL: http://www.tandfonline.com/doi/abs/10.1207/s15327817la0304_2. doi:10.1207/s15327817la0304_2.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] L. Haegeman, Root Infinitives, Tense, and Truncated Structures in Dutch, Language Acquisition 4 (1995) 205–255. URL: http://www.tandfonline.com/doi/abs/10.1207/s15327817la0403_2. doi:10.1207/s15327817la0403_2.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] M. Salustri, N. Hyams, Looking for the universal core of the RI stage, in: V. Torrens, L. Escobar (Eds.), Language Acquisition and Language Disorders, volume 41, John Benjamins Publishing Company, Amsterdam, 2006, pp. 159–182. URL: https://benjamins.com/catalog/lald.41.09sal. doi:10.1075/lald.41.09sal.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] L. Rizzi, G. Bocci, Left Periphery of the Clause: Primarily Illustrated for Italian, in: M. Everaert, H. C. Riemsdijk (Eds.), The Wiley Blackwell Companion to Syntax, Second Edition, 1 ed., Wiley, 2017, pp. 1–30. URL: https://onlinelibrary.wiley.com/doi/10.1002/9781118358733.wbsyncom104. doi:10.1002/9781118358733.wbsyncom104.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Radford, Syntactic theory and the acquisition of English syntax: The nature of early child grammars of English, Blackwell, Oxford, 1990.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] A. Belletti, N. Friedmann, L. Rizzi, Growing trees in child grammars: Cartography as an analytic tool for syntactic development, in: S. Wolfe (Ed.), The Oxford Handbook of Syntactic Cartography, ????.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] G. Cinque, L. Rizzi, The cartography of syntactic structures, in: B. Heine, H. Narrog (Eds.), The Oxford Handbook of Linguistic Analysis, Oxford University Press, Oxford, 2010, pp. 65–78.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] S. Rossi, Italian/Romance imperatives as radically reduced structures: a corpus CHILDES study, RGG 45 (2023) 1–39.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] E. Casadei, A New Sentence Repetition Task Tool to Investigate The Acquisition of Syntactic Structures in Typical and Atypical Development: A View From Growing Trees and Syntactic Cartography, Master's thesis, University of Siena, Siena, 2024.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] T. Sgrizzi, When infinitives are not under control: the Growing Trees Hypothesis and the developmental advantage of restructuring verbs, RGG 46 (2024) 1–39.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] A. A. Robiatu, A Computational Perspective on The Growing Tree Approach: Design and Implementation of A Rule-Based System, Master's thesis, University of Siena, 2025.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] N. Bosch, T. Biberauer, Emergent Syntactic Categories and Increasing Granularity: Evidence from a Multilingual Corpus Study, in: Proceedings of the 48th Boston University Conference on Language Development (BUCLD), Cascadilla Proceedings Project, Somerville, MA, 2024, pp. 101–116.</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[34] N. Bosch, Not all topics are equal: syntactic complexity and its effect on the acquisition of left-peripheral structures, in: Proceedings of NELS 55, 2024.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[35] N. Chomsky, Three Factors in Language Design, Linguistic Inquiry 36 (2005) 1–22. URL: https://direct.mit.edu/ling/article/36/1/1-22/250. doi:10.1162/0024389052993655.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[36] L. Rizzi, Early null subjects and root null subjects, in: Syntactic Theory and First Language Acquisition: Cross-Linguistic Perspectives. Binding, Dependencies, and Learnability, Lawrence Erlbaum Associates Inc., Hillsdale, NJ, 1994.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[37] J. Heim, M. Wiltschko, Rethinking structural growth: Insights from the acquisition of interactional language, Glossa: a journal of general linguistics 10 (2025). URL: https://www.glossa-journal.org/article/id/16396/. doi:10.16995/glossa.16396.</mixed-citation></ref>
      <ref id="ref38"><mixed-citation>[38] J. R. Searle, Minds, brains, and programs, Behavioral and Brain Sciences 3 (1980) 417–424. URL: https://www.cambridge.org/core/product/identifier/S0140525X00005756/type/journal_article. doi:10.1017/S0140525X00005756.</mixed-citation></ref>
      <ref id="ref39"><mixed-citation>[39] L. Sutawika, H. Schoelkopf, L. Gao, B. Abbasi, S. Biderman, J. Tow, B. Fattori, C. Lovering, et al., EleutherAI/lm-evaluation-harness: v0.4.9.1, 2025. doi:10.5281/ZENODO.16737642.</mixed-citation></ref>
      <ref id="ref40"><mixed-citation>[40] A. Fusco, M. Barbini, M. L. Piccini Bianchessi, V. Bressan, S. Neri, S. Rossi, T. Sgrizzi, C. Chesi, Recurrent Networks are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR, Aachen, 2024.</mixed-citation></ref>
      <ref id="ref41"><mixed-citation>[41] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, 2020. URL: http://arxiv.org/abs/2005.14165. doi:10.48550/arXiv.2005.14165.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[42] N. Friedmann, J. Reznick, Stages rather than ages in the acquisition of movement structures: Data from sentence repetition and 27696 spontaneous clauses, Glossa: a journal of general linguistics 6 (2021). URL: https://www.glossa-journal.org/article/id/5716/. doi:10.16995/glossa.5716.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[43] M. Ö. Gül, babylm-baseline-10m-gpt2, https://huggingface.co/BabyLM-community/babylm-baseline-10m-gpt2, 2025. Model card last updated ca. 1 month before August 2025.</mixed-citation></ref>
    </ref-list>
  </back>
</article>