<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2021.findings-acl.74</article-id>
      <title-group>
        <article-title>Acquisition in Babies and Machines: Comparing the Learning Trajectories of LMs in Terms of Syntactic Structures (ATTracTSS Test Set)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarah Rossi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Formichi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Neri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Sgrizzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asya Zanollo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronica Bressan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Linguistics and Comparative Cultural Studies, Ca' Foscari University of Venice</institution>
          ,
          <addr-line>Fondamenta Tofetti 1075, 30123 Venice</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IUSS Pavia</institution>
          ,
          <addr-line>P.zza Vittoria 15, 27100, Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NeTS Lab, IUSS Pavia</institution>
          ,
          <addr-line>P.zza Vittoria 15, 27100, Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>57</volume>
      <issue>1</issue>
      <fpage>836</fpage>
      <lpage>846</lpage>
      <abstract>
        <p>A cognitively plausible language model should (i) process language incrementally, (ii) be trained on naturalistic input, and (iii) mirror the developmental stages observed in child language acquisition. This study focuses on the third point by exploring the adherence of language models' developmental patterns to the predictions of two empirically grounded theories of syntactic acquisition, the Growing Trees and the Neo-Emergentist approaches. Using an evaluation method based on perplexity, we test whether small and medium Italian-tuned LMs (two small GPT2 LMs, GePpeTto, and Minerva-7B) show sensitivity to syntactic phenomena corresponding to three acquisitional stages documented in child Italian. Our results suggest that smaller open models only partially reflect the stagewise progression observed in children.</p>
      </abstract>
      <kwd-group>
        <kwd>Language acquisition</kwd>
        <kwd>LMs</kwd>
        <kwd>syntax</kwd>
        <kwd>cognitive plausibility</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>linguistics—can contribute meaningfully to linguistic
inquiry, provided that certain conditions on the cognitive
State-of-the-art Large Language Models (LLMs) demon- plausibility of the model are met.
strate remarkable success on various linguistic bench- A language model (LM) that aspires to linguistic
cogmarks ([1], inter alia). However, from a linguistic per- nitive plausibility should meet at least three key criteria.
spective, they remain uninteresting from the point of First, it should process linguistic input incrementally,
review of their cognitive plausibility. In fact, their archi- flecting the word-by-word, real-time parsing observed
tecture and learning dynamics difer fundamentally from in human sentence production and comprehension [4].
those of human learners, raising doubts about their rele- Second, it should be exposed to naturalistic training
invance to linguistic inquiry [2, 3]. put, approximating the kind and distribution of linguistic</p>
      <p>Nonetheless, following [4], we argue that language data encountered by human learners (PoS argument, [5]).
modeling—despite often being overlooked in theoretical Third—and this is the focus of the present study—it should
reproduce the developmental trajectory observed in first
CLiC-it 2025: Eleventh Italian Conference on Computational Linguis- language acquisition, where syntactic competence
foltics, September 24 — 26, 2025, Cagliari, Italy lows structured and empirically documented stages. In
* Corresponding author. line with this, we investigate whether LMs exhibit
cog† These authors contributed equally. nitive plausibility with respect to syntax by examining
$ sarah.rossi@iusspavia.it (S. Rossi); guido.formichi@iusspavia.it whether they reflect insights from linguistic theory on
(tGom.Fmoarsmoi.csghri)i;zzsoi@fia.inuesrsip@aivuiass.ipta(vTi.aS.igtr(iSz.ziN);eri); how humans acquire and process syntactic knowledge.
asya.zanollo@iusspavia.it (A. Zanollo); veronica.bressan@unive.it We compare two prominent approaches to syntactic
(V. Bressan); cristiano.chesi@iusspavia.it (C. Chesi) development: the Growing Trees approach (GT) [6] and
 https://rossisarah.github.io/ (S. Rossi); the Neo-Emergentist approach (NE) [7]. We argue that
exhttps://www.iusspavia.it/en/contacts/guido-formichi (G. Formichi); plicit, theoretically informed, and empirically grounded
hhttttppss::////twowmwsg.iruizszsip.gaivtihau.ibt/.iiot//ru(Tb.riScgar/iszozfiai)-;neri (S. Neri); theories of language acquisition can serve as efective
https://www.iusspavia.it/it/rubrica/asya-zanollo/ (A. Zanollo); testing grounds for the evaluation of linguistic
plausibilhttps://www.unive.it/data/people/28977262 (V. Bressan); ity of LMs.
https://github.com/cristianochesi (C. Chesi) We propose an efective method for evaluating the
ac0009-0007-2525-2457 (S. Rossi); 0009-0007-1307-194X quisition stages reflected in various (L)LMs by collecting
((TG.. SFgorrimzziic);h0i)0;0090-0090-0010-0339-8574-5468-4035(5A6.(SZ.aNnoelrlio));;00000000--00000033--13307752--17395697 their perplexity estimates for sentences corresponding to
(V. Bressan); 0000-0003-1935-1348 (C. Chesi) stages observed in typical Italian first language
develop© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License ment [8]. For our set of experiments, we drew from both
Attribution 4.0 International (CC BY 4.0).</p>
      <p>GT and NE literature to identify 17 core phenomena, each architectures that may embed relevant linguistic
inturepresented by a prototypical structural pattern. To en- itions in the form of inductive biases [4].
rich the dataset, we introduced variations to these struc- Within a linguistic and cognitive perspective, the
distures—e.g., changes in verbal class—resulting in a total tinction between models and theories is well established
of 89 subphenomena. For each subphenomenon, we gen- and relevant to discussions about LLMs. [15]
distinerated 100 lexically neutral instances, yielding a compre- guishes models, tools for simulating or predicting
linguishensive evaluation battery of 8,900 items. We tested three tic data, from theories which, on the other hand, seek to
GPT2 small Italian LMs, ita-baseline-small and NeTS-3M explain underlying cognitive mechanisms. [19] similarly
[9, 10], GePpeTto—117M parameters [11]—and a larger emphasizes that valid theories must provide
mechanisone—Miverva-7B-base, 7B parameters [12]. Results show tic explanations rather than merely replicate behavior.
that Italian language models exhibit a stage-wise syntac- More recently, [20] formalizes this distinction, describing
tic learning trajectory that aligns more closely with the models as devices for representing systems or testing
GT approach, which proves more predictive than the NE specific hypotheses, whereas theories aim to provide
exframework. We conclude that while key asymmetries planatory frameworks to generalize across phenomena.
remain, models trained on a minimal amount of input They argue that during early theory development, when
consisting solely of child-directed speech (e.g., NeTS-3M) empirical testing is limited, plausibility—shaped by
faccan approximate the developmental patterns observed in tors such as computational tractability and theoretical
human language acquisition. invariance—serves a critical criterion for advancing from
models to theories. In sum, while LLMs demonstrate
impressive empirical performance and ofer valuable tools
2. Poverty of Stimulus, LLMs, and for exploring linguistic patterns, their fundamental
diflanguage theories ferences from human cognition, limitations in capturing
graded acceptability, and reliance on vast datasets,
distinA striking diference between LLMs and natural language guish them from genuine linguistic theories. At present,
acquisition lies in the quantity of training data needed LLMs function as tools for hypothesis testing rather than
to achieve adult competence. A robust cross-linguistic as explanatory accounts of language cognitive
foundaobservation in first language acquisition is that children tions.
converge on the adult grammar within a remarkably In this context, we draw on two linguistic theories from
short developmental window—by approximately age 4 to the current literature to support the view that language
6—regardless of the language they are exposed to [13, 14], learning by (L)LMs can be meaningfully assessed—and
and with limited exposure to primary linguistic data, as compared to child language acquisition—using precise
emphasized in the Poverty of Stimulus (PoS) argument linguistic criteria.
[15].</p>
      <p>However, the PoS argument has been recently
challenged by scholars who argue that LLMs represent the 3. Theories of Language
most empirically grounded models of language currently Acquisition
available, and that core features of human linguistic
competence (e.g., recursion, logical inference, and hierarchi- As already mentioned, children converge on adult
gramcal syntactic structure) may emerge spontaneously in mar within a remarkably short time [13, 14]. While there
predictive models trained on unannotated language data can be (moderate) variability in the timing of
acquisi[16, 17]. From this perspective, LLMs question the ne- tion in typically developing children, the developmental
cessity of domain-specific innate mechanisms posited by patterns are consistent across individuals, in two key
Generative Grammar (GG), suggesting that rich linguis- respects. First, all children go through stages in which
tic generalizations may arise from data-driven statistical they make systematic, non-random errors—such as
overlearning, given domain-general cognitive inductive bi- production of computationally lighter structures with a
ases embedded in the artificial neural network architec- smaller number of morphosyntactic elements [21], like
ture [18]. uninflected verbs (infinitives [ 22, 23] and imperatives</p>
      <p>However, the debate concerns not only whether LLMs [24]; e.g., Mangi-a! ‘Eat!, imperative’ vs. Mangi-a-v-ano
exhibit linguistic capacities or the amount of data re- ‘They ate’, past imperfective).
quired, but also whether they can inform a theory that Second, children produce and master certain sentence
accounts for the cognitive underpinnings of natural lan- types before others, and—crucially—the order of
acquiguage. The issue should be addressed from multiple sition appears to be consistent across learners: some
perspectives: by developing fine-grained performance children progress more rapidly than others, but all pass
metrics, creating relevant tasks and benchmarks, paying through the same developmental stages. This provides
attention to the amount of data, and considering model further evidence that the human language faculty
con3.2. Growing Trees Approach
strains the hypothesis space available to learners. This Empirical studies across multiple languages have
study focuses on this second dimension of acquisition: shown that acquisition proceeds in structural bursts or
the order in which diferent sentence types are acquired. “explosions”: at a given point, an entire syntactic
domain (e.g., the vP+TP layer) becomes accessible, and all
3.1. Comparing Competing Theories structures associated with that domain become available
to the child. Crucially, within these domains, there is
We examine two prominent theories in the literature no robust evidence for a fixed internal acquisition
orconcerning the order in which syntactic structures are der, suggesting that what is developmentally primary is
acquired by children. Both seek to answer the same core the availability of the domain itself, not the sequential
question (namely, which structures emerge earlier or later mastery of its substructures. These domains are
straightin child language), but they difer significantly in their forwardly captured by the detailed cartographic structure
empirical methodologies and theoretical assumptions, of the functional spine as it has been drawn by theoretical
leading to divergent predictions that remain under active linguists over the past 30 years [28, 25].
investigation. Given the ongoing nature of this debate, While the foundational empirical work focused on
we consider both approaches in our analysis, without Hebrew, the GT framework has since been extended to
prematurely excluding either. other languages throught both experimental and
corpusbased studies, including Italian [29, 30, 31], English [32],
and others [27].</p>
      <p>The GT approach takes the syntactic tree—a symbolic
and highly formalized representation of sentence struc- 3.3. Neo-Emergentist Approach
ture—as its central object of study. Syntactic trees capture The Neo-Emergentis approach [7, 33, 34] to language
acthe hierarchical relationships among constituents, mak- quisition departs radically from both traditional nativist
ing explicit distinctions that are not evident in surface and certain usage-based models. This approach is
theoretword order, and are therefore indispensable for modeling ically motivated to a maximally impoverished Universal
core properties of natural language. Grammar (UG), in line with Chomskyan “Three Factors”</p>
      <p>The GT hypothesis proposes that syntactic develop- [35]. Rather than positing rich, innate linguistic content
ment unfolds in a layered fashion, reflecting the gradual (Factor 1), this model shifts explanatory weight onto the
availability of diferent regions of the tree. Initially, only interaction between primary linguistic data (PLD; Factor
low structural domains, such as the verb phrase (vP) and 2) and general cognitive learning principles (Factor 3),
inflectional phrase (IP), are accessible to the child, al- thereby advancing a minimalist conception of UG.
lowing for simple subject–verb sentences, for instance. The central claim is that syntactic categories are not
Subsequently, portions of the so-called Left Periphery [? innately specified but are emergent, and that acquisition
25], a high functional layer, become available, supporting proceeds along a learning path where coarser-grained
the production of wh-questions and preposed adverbs. categories are acquired before finer-grained refinements.
Only later does the full functional spine, including higher This involves a successive division algorithm, where the
CP-level structures like embedded clauses, relatives, and child initially makes basic contrasts (such as
predicate/ar“why”-questions, become active. The GT model builds gument) followed by more fine-grained subdivision
(idenupon earlier maturational analyses introduced in the tifying discourse and thematic domain up to
cartograph1990s, notably [26], and further developed in subsequent ically defined syntactic distinctions). Data from Catalan,
work (see [6, 27]). In a cognitively plausible model, one Spanish, Italian, German, and Dutch [33] suggests that
would expect to observe a learning trajectory mirroring basic CP structures (such as wh-questions, V2 word
orthat of human acquirers, in which early-acquired struc- der, illocutionary complementisers, and topicalisation)
tures (e.g., simple S–V sentences) are mastered before emerge at early developmental stages (defined in terms of
later-acquired ones (e.g., embedded clauses). MLUw), challenging models that assume a fixed, innately</p>
      <p>Traditional metrics for assessing language develop- specified hierarchy of syntactic categories [ 6, 26, 36]. In
ment, such as Mean Length of Utterance in words contrast, finer-grained structures (e.g. recursive topics,
(MLUw) alone or average age of acquisition across child multiple left-peripheral elements, V3 orders) seem to
apsamples, have limited explanatory power due to the doc- pear only later (around or after MLUw 2.5). Crucially,
umented high degree of individual variability in acquisi- building on the Peripheral Speaker-Hearer Hypothesis
tion speed [6]. In other words, some children are faster (PSHH), which posits that speaker-hearer perspective is
than others, but all of them follow the same develop- formally encoded at the edges of phasal domains [37], NE
mental path, in that they all acquire various syntactic model predicts that here-and-now and
speaker-hearerstructures in the same order. oriented material functions as key bootstrapping
heuristics in acquisition, and therefore they are expected to be
ID
i
ii
iii
iv
v
vi
vii
viii
ix
x
xi
xii
xiii
xiv
xv
xvi
xvii</p>
      <p>SV simple
SV unaccusative
VS unaccusative</p>
      <p>Imperatives</p>
      <p>Modals</p>
      <p>Root wh-questions
Root yes/no questions</p>
      <p>Preposed Adverbs</p>
      <p>Focus
Illocutionary COMPs</p>
      <p>Why questions</p>
      <p>Topics
Embedded that</p>
      <p>Embedded if</p>
      <p>Subject Relative
Object Relative – intervener
Object Relative + intervener
acquired early. This point is particularly relevant when
modeling the developmental trajectory of a language
model, whose training, by definition, lacks access to
referential stimuli such as here-and-now context (cf. the
symbol grounding problem [38]).</p>
      <sec id="sec-1-1">
        <title>3.4. Predictions</title>
        <p>the TP layer is often a matter of theoretical interpretation,
and currently under scrutiny.</p>
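        <p>For concreteness, the contrast between the two articulations can be encoded as a grouping over the sentence types of Table 1. The sketch below is illustrative only: the i–v / vi–x / xi–xvii grouping into GT stages is an assumption made for exposition, and the NE mapping (partly labeled unknown in our dataset) is omitted.</p>
        <preformat>
# Illustrative Python encoding of a stage articulation over the 17 sentence
# types of Table 1. The grouping below (i-v, vi-x, xi-xvii) is assumed for
# exposition; the actual GT and NE assignments are derived from the
# acquisition literature, with unalignable cases labeled "unknown".
GT_STAGES = {
    1: ["SV simple", "SV unaccusative", "VS unaccusative",
        "Imperatives", "Modals"],
    2: ["Root wh-questions", "Root yes/no questions",
        "Preposed Adverbs", "Focus", "Illocutionary COMPs"],
    3: ["Why questions", "Topics", "Embedded that", "Embedded if",
        "Subject Relative", "Object Relative - intervener",
        "Object Relative + intervener"],
}

def gt_stage(sentence_type):
    """Return the (assumed) GT stage of a sentence type."""
    for stage, types in GT_STAGES.items():
        if sentence_type in types:
            return stage
    raise KeyError(sentence_type)
</preformat>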
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Evidence</title>
      <sec id="sec-2-1">
        <title>4.1. Methods</title>
        <p>To test LMs against the developmental predictions of both the NE and GT frameworks, we defined a problem space designed to capture the full range of potential developmental trajectories a LM might exhibit. Using a test set (cf. next subsection) that targets structurally rich constructions attested at various stages of acquisition, we expect a coherent model (i) to be sensitive to syntactic variations and similarities across different sentence types and to assign probabilities accordingly, and (ii) to align with one of the two developmental hypotheses by assigning higher perplexity scores to items corresponding to later stages of acquisition. To obtain perplexity measures and standard errors, we used the lm-evaluation-harness platform [39] and created a custom task consisting of 100 lexically varied instances of each of the syntactic patterns presented in Table 1, further detailed in Appendix A. Items were grouped into three stages to reflect the finer-grained distinctions predicted by the GT framework. If no difference is found between Stage 1 and Stage 2, the LM behavior is consistent with the NE approach; otherwise, if a distinction emerges, this is in line with GT predictions.</p>
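        <p>As a minimal sketch of the underlying measure, the per-sentence perplexity that the harness collects can be reproduced with a HuggingFace causal LM as follows; the model id used here is a stand-in, and the actual runs were performed through lm-evaluation-harness [39].</p>
        <preformat>
# Minimal sketch of the per-sentence perplexity measure, assuming the
# HuggingFace `transformers` library; "GroNLP/gpt2-small-italian" is a
# stand-in model id, not one of the four models evaluated in the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_perplexity(model, tokenizer, sentence):
    """exp of the mean per-token negative log probability of `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # out.loss is the mean token-level NLL

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
lm = AutoModelForCausalLM.from_pretrained("GroNLP/gpt2-small-italian")
print(sentence_perplexity(lm, tok, "Il bambino mangia la mela."))
</preformat>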
      </sec>
      <sec id="sec-2-3">
        <title>4.2. ATTracTSS: A Novel Dataset</title>
        <p>The novel test set we created for evaluating the Acquisition Trajectories of various LMs in Terms of Syntactic Structures is dubbed ATTracTSS. The dataset consists of grammatical sentences representing 17 prototypical syntactic constructions—here referred to as sentence types (e.g., simple SV sentences, wh-questions, topicalizations, embedded clauses)—with 100 lexically diverse items generated for each sentence type.</p>
        <p>We built our dataset based on the phenomena tested by GT and NE. Notably, NE does not provide an explicit list of the specific sentence types it predicts to emerge in a fixed acquisitional order. Therefore, we adapted GT’s classification to the NE framework where possible, deriving stage-based predictions for both hypotheses (Table 1). In cases where alignment was not possible, we assigned the label unknown.</p>
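        <p>A template-based generation procedure along the following lines can produce the lexical variants; the word lists here are invented placeholders, not the actual ATTracTSS lexicon.</p>
        <preformat>
# Illustrative generation of lexically diverse items for one sentence type
# (SV simple). The mini-lexicon is hypothetical; a real lexicon must be
# large enough to yield 100 distinct lexicalizations per sentence type.
import itertools
import random

SUBJECTS = ["il bambino", "la maestra", "il cuoco"]   # hypothetical
VERBS = ["mangia", "guarda", "prepara"]               # hypothetical
OBJECTS = ["la mela", "il libro", "la torta"]         # hypothetical

def sv_simple_items(n=100, seed=0):
    """Fill the SV-simple skeleton with up to n random lexicalizations."""
    combos = list(itertools.product(SUBJECTS, VERBS, OBJECTS))
    random.Random(seed).shuffle(combos)
    return [f"{s} {v} {o}." for s, v, o in combos[:n]]

print(sv_simple_items(n=5))
</preformat>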
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Implementation</title>
        <p>We carry out a perplexity analysis starting from the negative log probabilities assigned by the model to each sentence in the dataset. Perplexity levels are expected to correlate inversely with learnability. Perplexity measures how well a model predicts a given sentence: lower perplexity means the model finds the sentence more predictable (less surprising), while higher perplexity means the model finds it less predictable (more surprising). Given the 100 repetitions of the same syntactic skeleton, we assume that averaging over multiple lexicalizations reduces the impact of individual word-level frequency effects on model perplexity.</p>
        <p>At the stage level, our hypothesis is that different acquisition stages would be characterized not only by different mean perplexity values, but also by similar standard deviations (SD), indicating consistent model confidence within each stage. As for sentence types, if perplexity remains consistently low across lexical variants of a sentence type, and the variation is low, we interpret this as evidence that the model handles the structure with a degree of robustness and consistency, suggesting it has learned to generalize over that syntactic pattern. While this should not be taken to imply that the model has acquired the structure in a human-like or abstract sense, such behavior can nonetheless serve as a useful proxy for comparison with human acquisition data.</p>
        <p>Four models were tested: ita-baseline-small, the pretrained GPT2 baseline model for Italian shared by the BabyLM community on the HuggingFace platform [10]; NeTS-3M, a similar small GPT2 model trained on a custom 3M corpus of child-directed speech [40]; GePpeTto, 117M parameters [11]; and a larger model, Minerva-7B-base, 7B parameters [12]. For the NeTS-3M model we also implemented longitudinal tracking by repeating the log-probability analysis across multiple training epochs, in order to trace whether the model’s familiarization path mirrors human developmental patterns. The same type of analysis could not be carried out on the other models, due to the impossibility of carefully controlling their training.</p>
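        <p>The aggregation step can be sketched as follows; the file name and column names are hypothetical, standing in for whatever export the evaluation produced.</p>
        <preformat>
# Sketch of the aggregation from item-level scores to per-type and
# per-stage means. "attractss_scores.csv" and its columns ("model",
# "sentence_type", "stage_gt", "nll") are hypothetical names.
import numpy as np
import pandas as pd

scores = pd.read_csv("attractss_scores.csv")
scores["ppl"] = np.exp(scores["nll"])  # nll = mean -log p(token) per item

# Averaging over the 100 lexicalizations of each syntactic skeleton is
# assumed to wash out individual word-frequency effects.
by_type = scores.groupby(["model", "sentence_type"])["ppl"].agg(["mean", "std"])
by_stage = scores.groupby(["model", "stage_gt"])["ppl"].agg(["mean", "std"])
print(by_stage)
</preformat>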
      </sec>
      <sec id="sec-2-5">
        <title>4.4. Results</title>
        <p>Mean perplexity and SD values for each stage in GT and NE were derived from the negative log probability values that the four models assigned to each of the items in the dataset, as reported in Table 2 (GT) and Table 3 (NE). Despite numerical differences, perplexity tends to increase coherently with the stage progression in all LMs; SD, instead, tends to grow higher in the latest stages of both GT (Stage 3) and NE (Stage 2), suggesting higher variation within them.</p>
        <p>Table 2. Mean perplexity estimation and SD grouped by GT stages. Columns: ita-baseline-small / NeTS-3M / GePpeTto / Minerva (– marks cells not legible in the source).
Overall: 42.1788 (13.28) / 50.0302 (16.29) / 44.9620 (10.98) / 36.5133 (11.14).
Stage 1: 37.3312 (10.28) / – / – / 32.3002 (8.62).
Stage 2: 48.4068 (9.89) / – / – / 41.3775 (8.41).
Stage 3: 55.2353 (17.35) / 65.0507 (17.23) / 56.5017 (12.70) / 48.5069 (13.62).</p>
        <p>Table 3. Mean perplexity estimation and SD grouped by NE stages (same column order).
Overall: 42.1788 (13.28) / 50.0301 (16.29) / 44.9620 (10.98) / 36.5133 (11.14).
Stage 1: 38.8547 (10.84) / – / – / 33.4046 (8.45).
Stage 2: 54.8328 (15.47) / 66.5525 (16.33) / 57.1496 (13.13) / 48.0398 (14.31).</p>
        <p>Then, a series of linear regressions was run to assess whether negative log probability assignment is significantly predicted across models (i) by the different syntactic structures of the sentence types included in the dataset and, most importantly, (ii) by the articulation in stages proposed by GT and/or by NE. Random intercepts for length (i.e., the number of words in each item of the dataset) were included in all regressions. Likelihood ratio tests (ANOVA) between a null model and a model using sentence types as a fixed effect revealed that these significantly improved model fit in all LMs (ita-baseline-small: χ2(65) = 2622.7, p &lt; .0001; NeTS-3M: χ2(65) = 2953.7, p &lt; .0001; GePpeTto: χ2(65) = 2925.3, p &lt; .0001; Minerva: χ2(65) = 3095.7, p &lt; .0001). As for GT and NE, instead, similar tests revealed a sharp asymmetry in the predictive power of the two accounts. Treating GT’s three-stage articulation as a fixed factor significantly improved model fit (ita-baseline-small: χ2(2) = 10.633, p &lt; .00491; NeTS-3M: χ2(2) = 376.68, p &lt; .0001; GePpeTto: χ2(2) = 9.1605, p &lt; .0001; Minerva: χ2(2) = 35.5, p &lt; .0001), but the same did not apply to NE’s stages (p values &gt; .05 for all LMs). Note, however, that except for NeTS-3M, where all pairwise comparisons between stages reach significance, the contrasts between Stages 2 and 3 and between Stages 1 and 3 vary strongly across LMs (see Appendix B), with Stage 3 being the least stable of the three. For the detailed longitudinal results of the NeTS-3M model, see Appendix C.</p>
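        <p>A likelihood ratio test of this kind can be sketched in Python as follows; the data frame layout is hypothetical, and the original analysis may have been run with different tooling.</p>
        <preformat>
# Sketch of the likelihood ratio comparison: null model vs. a model with
# sentence type (or stage) as fixed effect, with random intercepts for
# item length. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("attractss_scores.csv")  # hypothetical export

null = smf.mixedlm("nll ~ 1", df, groups=df["length"]).fit(reml=False)
full = smf.mixedlm("nll ~ C(sentence_type)", df,
                   groups=df["length"]).fit(reml=False)

lr = 2 * (full.llf - null.llf)                    # chi-squared statistic
dof = len(full.fe_params) - len(null.fe_params)   # added fixed effects
p = stats.chi2.sf(lr, dof)
print(f"chi2({dof}) = {lr:.1f}, p = {p:.4g}")
</preformat>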
      </sec>
      <sec id="sec-2-6">
        <title>4.5. Discussion</title>
        <p>The experiments reported in the previous sections were conducted to address the issue of language development in LMs, i.e., to assess whether the way LMs “learn” their language may be compared to the process of natural language acquisition in children. Specifically, we compared the stage-wise developmental predictions of two competing theories, GT and NE, against the performance of some Italian LMs. We did that by looking at the perplexity associated with a varied set of sentences in a novel dataset (the ATTracTSS test set), both from a cross-sectional perspective, looking at four different Italian models (ita-baseline-small, NeTS-3M, GePpeTto, Minerva), and from a longitudinal perspective, focusing on the performance of one of these models (NeTS-3M) across training epochs.</p>
        <p>As for the cross-sectional study, we observed a general alignment of all our LMs with the linguistic development observed in children. Perplexity values tended to grow with the progression of stages in both GT and NE, suggesting that the syntactic structures that children struggle with the most—and therefore take longer to be acquired—roughly overlap with the sentence types that LMs find less predictable. Nevertheless, closer inspection of mean perplexity values per sentence type revealed some variation within the stages, especially in GT Stage 3 and NE Stage 2: some late structures for children, like why-questions, receive very low perplexity (~30), while Stage 1 transitive clauses are assigned higher-than-expected perplexity (~52). These observations suggest that caution is needed when comparing humans with LMs: while the general learning trends align with human acquisition, some important asymmetries remain.</p>
        <p>Moreover, and as a general consideration, our results show consistently higher perplexity if compared to standard benchmarks (e.g., ~20 perplexity for GPT-3 [41]). This may stem from the absence of licensing contexts in the test items, or suggest that the models resolve, for instance, certain non-local dependencies—especially those in the Left Periphery—via strategies that diverge from native-like structural processing. This also suggests that our assessment task is far from trivial, and it highlights the need for further exploration of training regimens to determine whether specific language models exhibit learning trajectories consistent with those observed in human language acquisition.</p>
        <p>Another interesting result concerns the difference in the predictive power of GT and NE with respect to LM performance. While the single structures/sentence types always qualify as good predictors of LM perplexity, ratifying some sort of syntactic representation ability, grouping the phenomena into the three-stage articulation of GT always returns better results than the coarser two-stage subdivision of NE. This pattern holds both across models and in the longitudinal evaluation across training epochs of NeTS-3M, a small-scale transformer trained on 3M tokens of child-directed speech. Even though they were proposed on independent grounds, then, the linguistic stages grounded in the GT framework may offer a useful lens for interpreting LM behavior, especially in cognitively oriented settings.</p>
        <p>Finally, two more relevant considerations may be drawn especially from the epoch-by-epoch analysis, which we could perform on NeTS-3M, the only model for which we could strictly control architecture, training regimen, and training set (a 3M-token corpus including child-directed speech only, [40]; see Appendix C). First, the model shows evidence of learning, gradually reducing the perplexity gap between items from GT Stage 2 and Stage 3, although these stages remain distinguishable. However, and in line with the results of the pairwise comparisons between stages across models, the most pronounced distinction for the model clearly lies between Stage 1 and Stage 2. Second, our findings suggest that minimal (3M tokens for NeTS-3M vs. ≥10M tokens for the other LMs) but curated input allows a transformer model to approximate early, mid, and late stages of language acquisition, in line with the empirically attested developmental patterns of a linguistic theory (the GT approach): this is confirmed not only by the general snapshot of perplexity estimation across stages, where NeTS-3M is the only LM strongly differentiating Stages 1, 2, and 3, but crucially also along training epochs simulating child linguistic development.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Concluding remarks</title>
      <p>In this paper we presented ATTracTSS, a novel dataset to assess the Acquisition Trajectories in Terms of Syntactic Structures, inspired by language acquisition studies and by two competing empirically grounded theories—the Growing Trees approach and the Neo-Emergentist framework. Both theories argue for a stage-wise acquisition of syntax in children, but crucially differ in the size and internal composition of these stages.</p>
      <p>We conducted out-of-the-box evaluations on three “small” language models (up to 124M parameters)—ita-baseline-small, the pretrained GPT2 baseline model for Italian shared by the BabyLM community; NeTS-3M, a similar small GPT2 model trained on a custom 3M corpus; GePpeTto, 117M parameters—and a larger model, Minerva-7B-base, 7B parameters. We measured the perplexity that these LMs assigned to each sentence in the test set and compared it against the three-stage predictions of GT and the two-stage articulation of NE.</p>
      <p>In our experimental results, we observe that small-scale, fully open Italian-tuned models show alignment with theories of language acquisition. Among the theoretical approaches tested, the GT-based stage theory yields more accurate predictions than the NE-based approach. This work demonstrates the benefits of a sufficiently rich grammatical theory in accounting for how language acquisition unfolds in children, and shows how this developmental trajectory can serve as a metric to compare natural, instinct-driven acquisition in humans with the learning processes of LMs. In children, acquisition proceeds incrementally, through identifiable phases or stages. A robust theory is necessary to explicitly determine which linguistic phenomena emerge at which stage, in a principled and non-impressionistic way. Such reflection is crucial for linguists, but it may also have broader practical implications, particularly with respect to the sustainability and optimization of model training. Focusing on child language acquisition, and especially on the stages through which it unfolds, offers an additional, more fine-grained metric for evaluating model competence and cognitive plausibility. In this work, we did not address the notion of cognitive coherence, which we consider too general; instead, we focused on strictly linguistic issues and discussed structural coherence with respect to native speaker intuitions—where structural refers specifically to syntactic structures, i.e., sentence types. This ultimately frames the core tension in terms of learning versus acquisition.</p>
    </sec>
    <sec id="sec-3b">
      <title>6. Limitations</title>
      <p>Although our aim was to assess acquisition stages also across different training regimens—naturalistic, conversational, or redundant [5]—using small-scale corpora (10–100M tokens), we could only test the NeTS-3M model under the redundant regimen, trained on a 3M-token corpus of Italian child-directed speech [40]. This is below the 10M-token BabyLM small track threshold [10]. While minimal, this amount approximates the linguistic input received by a 4-year-old child—who has typically acquired structures across all three developmental stages [<xref ref-type="bibr" rid="ref3">42</xref>]—though an additional million tokens would have brought the exposure closer to that developmental window.</p>
      <p>Moreover, the model architecture—GPT-2 (see model card [<xref ref-type="bibr" rid="ref5">43</xref>])—is not cognitively plausible, as it relies on a non-incremental, parallel attention mechanism that does not reflect human-like structure building [2, 5].</p>
      <p>A further limitation is that the dataset is not fully balanced in terms of the number of phenomena per item; future iterations will aim to expand the dataset and ensure a more uniform distribution across phenomena (see the material in Appendix A).</p>
      <p>Finally, although we draw on attested acquisition patterns from Growing Trees and Neo-Emergentism, we lack adult acceptability data for the same structures. Such data will be essential in future studies to assess whether model outputs at later training stages simulate adult linguistic competence.</p>
    </sec>
    <sec id="sec-3c">
      <title>Acknowledgments</title>
      <p>We acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.1, Call for tender No. 104 published on 2.2.2022 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU – Project Title T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics – CUP I53D23003900006 – Grant Assignment Decree No. 104 adopted on the 2nd February 2022 by the Italian Ministry of University and Research (MUR). PI: Cristiano Chesi.</p>
    </sec>
    <sec id="sec-4">
      <title>A. Online Resources</title>
      <p>Additional resources, including the full ATTracTSS dataset and supporting materials, are available at:
• ATTracTSS GitHub repository</p>
      <p>To stay up to date with future developments from our lab, visit:
• NeTS Lab - Computational Projects
• NeTS Lab - General website</p>
    </sec>
    <sec id="sec-5">
      <title>B. Pairwise Comparisons</title>
      <p>This appendix reports the results of pairwise statistical comparisons between the estimated probabilities associated with each GT stage. See Table 4.</p>
    </sec>
    <sec id="sec-6">
      <title>C. NeTS-3M Model Results Across Epochs</title>
      <p>This appendix reports the results of the NeTS-3M model across epochs (see Table 5). We show performance over 10 training epochs, with predictions evaluated by linguistic phenomenon, by the GT approach, and by the NE approach. Table 5 reports perplexity (derived from -log(probability), where higher values indicate greater model uncertainty) and χ2 values (where higher values reflect stronger model predictions).</p>
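      <p>The longitudinal tracking amounts to re-scoring the full test set with the checkpoint saved after each epoch, as in the sketch below; the checkpoint paths are hypothetical.</p>
      <preformat>
# Sketch of the epoch-by-epoch evaluation used for NeTS-3M, the only model
# whose training we fully controlled. Checkpoint paths are hypothetical.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(model, tokenizer, sentences):
    """Average per-token negative log probability over a list of sentences."""
    total = 0.0
    with torch.no_grad():
        for s in sentences:
            enc = tokenizer(s, return_tensors="pt")
            total += model(**enc, labels=enc["input_ids"]).loss.item()
    return total / len(sentences)

items = ["Il bambino mangia la mela."]  # placeholder for the 8,900 items
tok = AutoTokenizer.from_pretrained("nets-3m/epoch-1")  # hypothetical path
for epoch in range(1, 11):
    lm = AutoModelForCausalLM.from_pretrained(f"nets-3m/epoch-{epoch}")
    print(epoch, math.exp(mean_nll(lm, tok, items)))
</preformat>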
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref20"><mixed-citation>[20] G. Baggio, A. De Santo, N. A. Nuñez, Plausibility and Early Theory in Linguistics and Cognitive Science, Computational Brain &amp; Behavior 7 (2024) 535–547. URL: https://link.springer.com/10.1007/s42113-024-00196-7. doi:10.1007/s42113-024-00196-7.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] L. Rizzi, Grammatically-Based Target-Inconsistencies in Child Language, in: K. U. Deen, J. Nomura, B. Schulz, B. D. Schwartz (Eds.), The Proceedings of the Inaugural Conference on Generative Approaches to Language Acquisition–North America, MIT Working Papers in Linguistics, 2006.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] L. Rizzi, Some Notes on Linguistic Theory and Language Development: The Case of Root Infinitives, Language Acquisition 3 (1993) 371–393. URL: http://www.tandfonline.com/doi/abs/10.1207/s15327817la0304_2. doi:10.1207/s15327817la0304_2.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] L. Haegeman, Root Infinitives, Tense, and Truncated Structures in Dutch, Language Acquisition 4 (1995) 205–255. URL: http://www.tandfonline.com/doi/abs/10.1207/s15327817la0403_2. doi:10.1207/s15327817la0403_2.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] M. Salustri, N. Hyams, Looking for the universal core of the RI stage, in: V. Torrens, L. Escobar (Eds.), Language Acquisition and Language Disorders, volume 41, John Benjamins Publishing Company, Amsterdam, 2006, pp. 159–182. URL: https://benjamins.com/catalog/lald.41.09sal. doi:10.1075/lald.41.09sal.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] L. Rizzi, G. Bocci, Left Periphery of the Clause: Primarily Illustrated for Italian, in: M. Everaert, H. C. Riemsdijk (Eds.), The Wiley Blackwell Companion to Syntax, Second Edition, 1 ed., Wiley, 2017, pp. 1–30. URL: https://onlinelibrary.wiley.com/doi/10.1002/9781118358733.wbsyncom104. doi:10.1002/9781118358733.wbsyncom104.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Radford, Syntactic theory and the acquisition of English syntax: The nature of early child grammars of English, Blackwell, Oxford, 1990.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] A. Belletti, N. Friedmann, L. Rizzi, Growing trees in child grammars: Cartography as an analytic tool for syntactic development, in: S. Wolfe (Ed.), The Oxford Handbook of Syntactic Cartography, ????.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] G. Cinque, L. Rizzi, The cartography of syntactic structures, in: B. Heine, H. Narrog (Eds.), The Oxford Handbook of Linguistic Analysis, Oxford University Press, Oxford, 2010, pp. 65–78.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] S. Rossi, Italian/Romance imperatives as radically reduced structures: a corpus CHILDES study, RGG 45 (2023) 1–39.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] E. Casadei, A New Sentence Repetition Task Tool to Investigate The Acquisition of Syntactic Structures in Typical and Atypical Development: A View From Growing Trees and Syntactic Cartography, Master's thesis, University of Siena, Siena, 2024.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] T. Sgrizzi, When infinitives are not under control: the Growing Trees Hypothesis and the developmental advantage of restructuring verbs, RGG 46 (2024) 1–39.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] A. A. Robiatu, A Computational Perspective on The Growing Tree Approach: Design and Implementation of A Rule-Based System, Master's thesis, University of Siena, 2025.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] N. Bosch, T. Biberauer, Emergent Syntactic Categories and Increasing Granularity: Evidence from a Multilingual Corpus Study, in: Proceedings of the 48th Boston University Conference on Language Development (BUCLD), Cascadilla Proceedings Project, Somerville, MA, 2024, pp. 101–116.</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[34] N. Bosch, Not all topics are equal: syntactic complexity and its effect on the acquisition of left-peripheral structures, in: Proceedings of NELS 55, 2024.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[35] N. Chomsky, Three Factors in Language Design, Linguistic Inquiry 36 (2005) 1–22. URL: https://direct.mit.edu/ling/article/36/1/1-22/250. doi:10.1162/0024389052993655.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[36] L. Rizzi, Early null subjects and root null subjects, in: Syntactic Theory and First Language Acquisition: Cross-Linguistic Perspectives. Binding, Dependencies, and Learnability, Lawrence Erlbaum Associates Inc., Hillsdale, NJ, 1994.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[37] J. Heim, M. Wiltschko, Rethinking structural growth: Insights from the acquisition of interactional language, Glossa: a journal of general linguistics 10 (2025). URL: https://www.glossa-journal.org/article/id/16396/. doi:10.16995/glossa.16396.</mixed-citation></ref>
      <ref id="ref38"><mixed-citation>[38] J. R. Searle, Minds, brains, and programs, Behavioral and Brain Sciences 3 (1980) 417–424. URL: https://www.cambridge.org/core/product/identifier/S0140525X00005756/type/journal_article. doi:10.1017/S0140525X00005756.</mixed-citation></ref>
      <ref id="ref39"><mixed-citation>[39] L. Sutawika, H. Schoelkopf, L. Gao, B. Abbasi, S. Biderman, J. Tow, B. Fattori, C. Lovering, et al., EleutherAI/lm-evaluation-harness: v0.4.9.1, 2025. doi:10.5281/ZENODO.16737642.</mixed-citation></ref>
      <ref id="ref40"><mixed-citation>[40] A. Fusco, M. Barbini, M. L. Piccini Bianchessi, V. Bressan, S. Neri, S. Rossi, T. Sgrizzi, C. Chesi, Recurrent Networks are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR, Aachen, 2024.</mixed-citation></ref>
      <ref id="ref41"><mixed-citation>[41] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, 2020. URL: http://arxiv.org/abs/2005.14165. doi:10.48550/arXiv.2005.14165.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[42] N. Friedmann, J. Reznick, Stages rather than ages in the acquisition of movement structures: Data from sentence repetition and 27696 spontaneous clauses, Glossa: a journal of general linguistics 6 (2021). URL: https://www.glossa-journal.org/article/id/5716/. doi:10.16995/glossa.5716.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[43] M. Ö. Gül, babylm-baseline-10m-gpt2, https://huggingface.co/BabyLM-community/babylm-baseline-10m-gpt2, 2025. Model card last updated ca. 1 month before August 2025.</mixed-citation></ref>
    </ref-list>
  </back>
</article>