<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recurrent Networks are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Achille Fusco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matilde Barbini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Letizia Piccini Bianchessi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronica Bressan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sofia Neri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarah Rossi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Sgrizzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <email>cristiano.chesi@iusspavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
          ,
          <addr-line>Dec 04 - 06, 2024, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NeTS Lab, IUSS Pavia</institution>
          ,
          <addr-line>P.zza Vittoria 15 27100 Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Here we discuss strategies and results of a small-sized training program based on Italian child-directed speech (less than 3M tokens) for various network architectures. The rationale behind these experiments [1] lies in the attempt to understand the effect of this naturalistic training diet on different model architectures. Preliminary findings lead us to conclude that: (i) different tokenization strategies produce only mildly significant improvements overall, although segmentation aligns more closely with linguistic intuitions in some cases but not in others; (ii) modified LSTM networks (the eMG-RNN variant) with a single layer and a structurally more controlled cell state perform slightly worse in training loss (compared to standard one- and two-layered LSTM models) but better on linguistically critical contrasts. This suggests that standard loss/accuracy metrics in autoregressive training procedures are linguistically irrelevant and, more generally, misleading, since the best-trained models produce poorer linguistic predictions ([2], pace [3]). Overall, the performance of these models remains significantly lower than that of 7-year-old native-speaker children on the relevant linguistic contrasts we considered [4].</p>
      </abstract>
      <kwd-group>
<kwd>LSTM</kwd>
        <kwd>Transformers</kwd>
        <kwd>Small Language Models (SLM)</kwd>
        <kwd>tokenization</kwd>
        <kwd>cell state control</kwd>
        <kwd>LM evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        According to the mainstream LLM development
pipeline, Transformer-based architectures [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
outperform sequential training models, like LSTM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in
various NLP tasks. When small-sized training data are
available, optimization becomes necessary [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], but
common optimization techniques neglect the
linguistically relevant fact that these models (i) conflate
semantic/world knowledge with morpho-syntactic
competence, (ii) require unreasonable training data
compared to that needed by children during language
acquisition, and (iii) yield diminishing returns in cognitive/linguistic terms as their raw performance increases [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this
paper we address these three issues, starting from the
observation that while world knowledge benefits from all the training data available (the more, the better),
structural (morpho-syntactic and compositional
semantic) knowledge might require a much smaller
dataset (from 10 to 100 million words, according to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]).
We explore this intuition further and, based on prolific
literature from the ‘80s showing that typical child errors
are structurally sensitive and never random [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we shape the networks’ architecture to bias learning towards
plausible structural configurations, possibly preventing
these “small” language models (SLM) from producing
wrong linguistic generalizations. We started from a mild
revision of the LM training and evaluation pipeline for
Italian including alternative approaches to tokenization
based on pseudo-morphological decomposition (§2.2);
we then approached a more structurally-driven update
of the cell state in LSTM networks, which we will call
eMG-RNN variants (§2.3); we finally adopted a precise
testing benchmark for specific linguistic contrasts in
Italian following BLiMP design [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (§2.4). We will first set the stage in §2 and discuss an alternative tokenization strategy (MorPiece); a simple modification to the LSTM gating system is proposed that mimics certain linguistic constraints. Then, we will describe the relevant experiments we have run (§3) and draw some conclusions based on the observed results (§4). A general discussion with a description of the next steps will conclude this paper (§5).
      </p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Revisiting LM training pipeline</title>
      <p>
        The LM training pipeline is relatively rigid: after corpus cleaning (i), the data are prepared and optimized for tokenization (ii), then the tokenized input is batched for training autoregressive models (iii), mostly feeding Transformer-based architectures (iv). Once the models are trained, the evaluation step requires assessing them on some standard tasks (v). In the next subsections, we identify various critical points in this pipeline and propose strategies to mitigate these problems, ultimately training linguistically more informative SLMs.
      </p>
    </sec>
    <sec id="sec-1-2">
      <title>2.1. Corpus creation and cleaning</title>
      <p>
        The primary data we collected for Italian replicates
plausible linguistic input that children may be exposed
to during acquisition, in line with [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It consists of about
3M tokens divided into child-directed speech (CHILDES
Italian section), child movie subtitles (from
OpenSubtitles), child songs (from Zecchino D’Oro
repository), telephone conversations (VoLIP corpus,
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), and fairy tales (all from copyright expired
sources). Simple cleaning consisted of removing
children’s productions from CHILDES files as well as
any other metalinguistic annotation (speakers’
identification, headers, time stamps, tags, links, etc.).
Dimension and rough lexical richness of each section are
reported in Table 1 (Type-Token Ratio, TTR) before and
after the cleaning procedure.
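A minimal sketch of this cleaning step, assuming CHAT-formatted CHILDES transcripts (a simplified stand-in, not the exact scripts we used; see Appendix A for the released resources), is the following:
        <preformat>
# Toy sketch of the CHILDES cleaning step (assumes CHAT-formatted .cha
# transcripts): headers, dependent tiers and the child's own productions are
# dropped; only the other speakers' utterance text is kept.
import re

def clean_chat(lines, child_tier="*CHI:"):
    kept = []
    for line in lines:
        if line.startswith("@") or line.startswith("%"):   # headers / annotation tiers
            continue
        if line.startswith(child_tier):                     # children's productions
            continue
        if line.startswith("*"):                            # remaining speaker tiers
            text = line.split(":", 1)[1]
            text = re.sub(r"\[[^\]]*\]", " ", text)         # bracketed CHAT annotations
            text = re.sub(r"\s+", " ", text).strip()
            if text:
                kept.append(text)
    return kept
        </preformat>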
      </p>
    </sec>
    <sec id="sec-1-3">
      <title>2.2. Tokenization</title>
      <p>
        Popular LLMs use either Byte-Pair Encoding (BPE)
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or (fast)WordPiece (fWP) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] algorithms for
tokenization. The simplicity and computational
efficiency of these approaches contrast with the limited
morphological analysis they provide. In rich inflectional
languages (e.g., Italian) and agglutinative languages
(e.g., Finnish), this might induce linguistically unsound
generalizations. Here, we explore a more
morphologically informed strategy, inspired by the
Tolerance Principle (TP) and Sufficiency Principle (SP)
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], aiming to break words into potentially relevant
morphemes without relying on morpheme tables [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
The experiments we conduct compare the impact of
different strategies when integrated into various
network architectures. We refer to MorPiece (MoP) as a
TP/SP-based strategy, which can be algorithmically
described as follows: each token is traversed from left to
right to create a “root trie,” and from right to left to
create an “inflectional trie” [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Each time a node N of
the trie is traversed (corresponding to the current
character path in the word), the frequency counter
associated with this node (N<sub>c</sub>) is updated (+1). Nodes
corresponding to token endings (characters before white
spaces or punctuation) are flagged. Once both tries are
created, the optimization procedure explores each
descendant, and for every daughter node D<sub>k</sub> its frequency k is compared to H<sub>N</sub>, the approximation of the harmonic number for N used both in TP and SP [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
where c is the frequency of the mother node N (stored in its counter N<sub>c</sub>):
H<sub>N</sub> = c / ln(c)
(F1)
If k &gt; H<sub>N</sub> and c ≠ k, a productive boundary break is postulated (based on the inference that, since there are different continuations and some of them are productive, i.e. sufficiently frequent according to SP, those might be real independent morphemes). We can check whether this break respects H<sub>D</sub> for the relevant nodes D<sub>j</sub> and N<sub>i</sub> in the “inflectional trie”. This means that there exists a path where the frequency i of the daughter node N<sub>i</sub> (in the “inflectional trie” the dependency between D and N is reversed) is lower than j/ln(j), where j is the frequency of the mother node D<sub>j</sub>. If this is the case, the continuation is not considered “an exception”, in the sense of TP [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
suggesting that the continuation is, in fact, a productive
independent morpheme. A “++” root node is then
activated, the node Dk linked to it, and so on recursively,
following the FastWordPiece tokenization strategy [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
During recognition, the LinMaxMatch identification
approach is adopted, as in FastWordPiece. Figure 1
illustrates the relevant morpheme breaks (indicated as
“||”) obtained by applying this morpheme-breaking
procedure in the root and infl tries fragments.
      </p>
      <p>[Figure 1: fragments of the “root” and “infl” tries for “cerca”, with the frequency count of each node; “||” marks the postulated morpheme breaks.]</p>
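      <p>For the recognition step just mentioned, a naive greedy longest-prefix match over the learned lexicon behaves as sketched below (LinMaxMatch, as in FastWordPiece, computes the same segmentation in a single linear pass; the toy vocabulary is only illustrative):</p>
      <preformat>
# Toy greedy longest-prefix segmentation over a learned lexicon; LinMaxMatch
# (as in FastWordPiece) yields the same output in linear time. The vocabulary
# below is a toy example, not the MorPiece lexicon trained on our corpus.
def segment(word, vocab):
    pieces, i = [], 0
    while i &lt; len(word):
        for j in range(len(word), i, -1):       # try the longest candidate first
            piece = word[i:j] if i == 0 else "+" + word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            return [word]                       # no full segmentation: keep the token whole
    return pieces

print(segment("farlo", {"far", "+lo", "cerc", "+a"}))   # ['far', '+lo']
      </preformat>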
      <p>Various parametric controls have been considered to
tune this procedure: (i) a branching factor (bf) parameter
that excludes nodes with an excessively high number (&gt;
bf) of continuations (the rationale being that when too
many continuations are present, they are unlikely to
correspond to inflections; this often happens near the
root of each trie); (ii) a cutoff parameter indicating the
lower frequency boundary for a mother node (this is
necessary to ensure a minimum number of observations;
for example, if cutoff = 8, we exclude from the “root” trie
any branching daughter with a frequency &lt; 5). As in
BPE, minimum frequency control for tokens is also
implemented to exclude infrequent dictionary entries.</p>
      <p>Consider the word “cerca” (“to search for”) as represented in the “root” trie. In the final “c”-“a” transition, the relation between H<sub>N</sub> (computed for the node “c”) and the frequency of “a” indicates that a break might exist between the nodes “c” (frequency = 1813) and “a” (frequency = 1307), since H<sub>N</sub> = 1813/ln(1813) ≈ 242 and 1307 &gt; H<sub>N</sub>. This hypothesis is confirmed by the failure of the corresponding check on the relevant “a”-“c” segment of the “infl” trie (“a” frequency = 10121, “c” frequency = 466619): 10121 &lt; 466619/ln(466619) ≈ 35746. Had H<sub>N</sub> (for “c” in the root trie) been greater than the frequency of “a”, no segmentation advantage would have been observable.</p>
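      <p>A minimal sketch of the boundary-postulation step described above (a toy reading of the TP/SP check, with the infl-trie confirmation omitted, as discussed below; the bf and cutoff parameters follow the description above) is the following:</p>
      <preformat>
# Toy sketch of the TP/SP boundary check over a character trie
# (not the released MorPiece implementation).
import math

class Node:
    def __init__(self):
        self.count = 0
        self.children = {}          # maps each character to a child Node

def build_trie(tokens, reverse=False):
    """Build the "root" trie (left-to-right) or the "infl" trie (reverse=True)."""
    root = Node()
    for tok in tokens:
        node = root
        node.count += 1
        for ch in (tok[::-1] if reverse else tok):
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root

def harmonic(c):
    """Approximation of the harmonic number used by TP/SP, H_N = c/ln(c) (F1)."""
    return c / math.log(c) if c &gt; 1 else float("inf")

def breaks(node, prefix="", bf=10, cutoff=100):
    """Yield the prefixes after which a productive boundary break is postulated."""
    if node.count &gt;= cutoff and len(node.children) &lt;= bf:
        for ch, child in node.children.items():
            if child.count &gt; harmonic(node.count) and child.count != node.count:
                yield prefix                    # e.g. "cerc" || "a"
                break
    for ch, child in node.children.items():
        yield from breaks(child, prefix + ch, bf, cutoff)
      </preformat>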
      <p>The proposed algorithm has linear time complexity, O(n): each of the two tries is explored deterministically exactly once to evaluate the H<sub>N</sub>/H<sub>D</sub> frequency relations. The best linguistic results (relatively linguistically coherent segmentations) for our Italian corpus were obtained with cutoff=100 and bf=10. We found it unnecessary to filter the proposed inflectional breaks through the infl-trie double check (TP), since the LinMaxMatch strategy already filtered out the initially overestimated breaks efficiently. However, as an anonymous reviewer correctly pointed out, this strategy does not guarantee full coverage of every token in our training corpus (in contrast to BPE, for instance). We acknowledge this limitation, but we emphasize that our goal was to produce a smaller, potentially more efficient lexicon. In our experiments, while BPE generated a lexicon of 96028 tokens (67169 when the minimum lexical frequency was set to 2), MoP produced a lexicon of just 55049 tokens (cutoff=100, bf=10).</p>
    </sec>
    <sec id="sec-1-4">
      <title>2.3. Revisiting LSTM architecture</title>
      <p>
        Despite many variants of the standard LSTM
architectures, notably Gated Recurrent Units [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] or
LSTM augmented with peephole connections [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and
the discouraging equivalence results for these variations
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], we observe a recent revival of RNN-based model
architectures [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. We believe, in fact, that the core
intuition behind the LSTM architecture may be
linguistically relevant and worth exploring further,
although generally more performant models (for
instance in terms of GLUE benchmark, [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]) are usually
preferred [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. The linguistic intuition is that the “long-term memory” (the cell state C in Figure 2) of LSTM networks could effectively model various types of non-local dependencies with a single mechanism. Linguistically speaking, filler-gap dependencies (1) and co-referential dependencies (2) are both “non-local dependencies”, but they are subject to non-identical locality conditions:
(1) a. cosa<sub>i</sub> credi che abbia riposto _<sub>i</sub>?
    what (you) believe that (he) shelved?
    ‘what do you believe he shelved?’
    b. *cosa<sub>i</sub> credi che abbia riposto il libro [AdvP senza leggere _<sub>i</sub>]?
    b'. cosa<sub>i</sub> credi che abbia riposto _<sub>i</sub> [AdvP senza leggere _<sub>i</sub>]?
    ‘what do you believe he shelved (*the book) without reading?’
(2) a. [il panino]<sub>i</sub>, chi credi che lo<sub>i</sub> abbia mangiato?
    the sandwich, who (you) believe it has eaten?
    b. *[il panino]<sub>i</sub>, chi credi che _<sub>i</sub> abbia mangiato?
    the sandwich, who (you) believe has eaten?
    ‘the sandwich, who do you believe has eaten *(it)?’
      </p>
      <p>
        While both dependencies require
C(onstituent)-command generalizations to be captured [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], the
adjunct island in (1), [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], but not clitic left-dislocation in
(2), [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], can, for instance, be licensed with a(n extra) gap, as in (1b'). Aware of these differences, we decided to simply
alter the gating system to allow the LSTM to create
distinct pathways: one to “merge” new tokens, the other
to decide if a long-distance dependency is necessary, and
subsequently to “move” the relevant items [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The
processing implementation of these operations is
inspired by expectation-based Minimalist Grammars
formalism, eMG [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], and it is then named eMG-RNN.
      </p>
      <p>Following this implementation, merge applies incrementally, token by token, and move means “retain in memory”. In more detail, the cell of an eMG-RNN network performs the forward processing described in the computational graph in Figure 2: the input at time t (x<sub>t</sub>) is linearly transformed into a lower-dimensional vector (E, loosely used for “embedding”) and then concatenated (C) with the previous hidden state/output, if any (h<sub>t-1</sub>). Two pathways, both transformed by a sigmoid function (σ), lead, on the one hand, to the move gate and, on the other, to the merge gate. In the first case, the result of the sigmoid transformation is multiplied (⊙, the Hadamard product) with the input: this either erases or lets through components of the original vector, which are then added (+) to the previous context/cell state (c<sub>t-1</sub>), if any, as in the LSTM forget gate. The merge gate, in the other direction, weighs the new token against the memory: when the sigmoid combination of the incoming token and the previous hidden state is low, the complementary weight (1 - this activation, as in the GRU update gate) privileges the new token; when it is high, the items in the context/cell state (transformed through a tanh function to simulate memory decay) are favored.</p>
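      <p>A minimal sketch of an eMG-RNN-style cell implementing the forward step just described (our reading of the gating directions; the wiring and names are illustrative, not the exact released implementation) is the following:</p>
      <preformat>
# Sketch of an eMG-RNN-style cell as described above (illustrative reading,
# not the released implementation): a "move" gate decides what enters the
# cell state, a "merge" gate weighs the new token against the decayed memory.
import torch
import torch.nn as nn

class EMGRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embed = nn.Linear(input_size, hidden_size)       # E: project x_t
        self.move = nn.Linear(2 * hidden_size, hidden_size)   # move gate
        self.merge = nn.Linear(2 * hidden_size, hidden_size)  # merge gate

    def forward(self, x_t, h_prev, c_prev):
        e = self.embed(x_t)                        # E(x_t)
        z = torch.cat([e, h_prev], dim=-1)         # C: concatenate with h_{t-1}
        m = torch.sigmoid(self.move(z))            # move pathway (forget-like)
        g = torch.sigmoid(self.merge(z))           # merge pathway (update-like)
        c_t = c_prev + m * e                       # erase/keep input, add to cell state
        h_t = (1.0 - g) * e + g * torch.tanh(c_t)  # low g privileges the new token
        return h_t, c_t
      </preformat>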
      <p>[Figure 2: computational graph of the eMG-RNN cell: x<sub>t</sub> is embedded (E) and concatenated (C) with h<sub>t-1</sub>; two sigmoid (σ) pathways feed the move and merge gates, which update the cell state c<sub>t-1</sub> (Hadamard products, addition, tanh decay) and produce c<sub>t</sub> and h<sub>t</sub>.]</p>
    </sec>
    <sec id="sec-1-5">
      <title>2.4. A linguistically informed evaluation</title>
      <p>
        The last step in the pipeline requires a linguistically
advanced set of oppositions to verify that the structural
generalizations can be captured coherently. We adopted
the lm-eval package [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] and we included a specific task
based on English BLiMP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Most of the contrasts are
derived from the COnVERSA test [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They consist of
minimal pairs ordered following an increasing
complexity metric that considers the number of
operations necessary to establish a dependency and the
locality of such dependency. The examples below
illustrate this point by comparing a local agreement
dependency with (3b) or without (3a) a (linear) intervener, and a more complex dependency that requires processing an object relative clause (4):
(3) a. Il piatto è pieno. Vs. Il piatto è piena.
      </p>
      <p>the dish.S.M is full.S.M … full.S.F
b. Il muro della casa è rosso
the wall.S.M of the house is red.S.M
Vs. Il muro della casa è rossa.</p>
      <p>the wall.S.M of the house is red.S.F
(4) Ci sono due maestri. Uno insegna ed è ascoltato
dagli studenti, l'altro si riposa. Quale maestro
insegna? There are two teachers. One teaches and
he’s listened to by the students, the other rests.
Which one teaches?</p>
      <p>Quello che gli studenti ascoltano.</p>
      <p>The one who the students listen to
Vs. Quello che ascolta gli studenti.</p>
      <p>The one who listens to the students</p>
      <p>
        Four kinds of dependency (agreement, thematic role assignment, pronominal form usage, question formation and answering) are considered over a set of 32 distinct syntactic configurations (a total of 344 minimal pairs to be judged, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
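      <p>For illustration, a single minimal pair can be scored outside lm-eval by comparing summed token log-probabilities under an autoregressive model; the snippet below uses the publicly available GroNLP Italian GPT-2 checkpoint and is only a stand-alone sketch (the reported results are obtained through the lm-eval tasks):</p>
      <preformat>
# Sketch of a BLiMP/COnVERSA-style minimal-pair judgment: the model is said
# to "prefer" the sentence with the higher summed log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
model = AutoModelForCausalLM.from_pretrained("GroNLP/gpt2-small-italian").eval()

def logprob(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # mean NLL over predicted tokens
    return -loss.item() * (ids.size(1) - 1)       # summed log-probability

good, bad = "Il piatto è pieno.", "Il piatto è piena."
print("correct preference:", logprob(good) &gt; logprob(bad))
      </preformat>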
    </sec>
    <sec id="sec-2">
      <title>3. Materials and Methods</title>
      <p>
        We trained our models on the IUSS High-Performance Cluster with 2 GPU nodes, each with 4 NVIDIA A100 devices and 1 TB of RAM. Each network was trained on the full corpus using one of the following batching strategies: (i)
Naturalistic, line-by-line, single exposure to each
sentence in the corpus (each epoch corresponds to an
exposure of about 3M tokens); (ii) Conversational, two
sequential lines are used for the input, that is, [line 1,
line 2], [line 2, line 3], etc. are batched; this guarantees
that a minimal conversational context for each sentence
is provided. In this case, each epoch corresponds to an
exposure of 6M tokens; (iii) fixed sequence length,
considering the average sentence length of 54 words per
sentence, a window of 60 tokens is used, that is, [tok_1,
tok_2 … tok_60], [tok_2, tok_3 … tok_61] … are batched;
with this regimen, each epoch corresponds to an
exposure of 180M tokens. Roughly speaking, the bare
amount of data processed by a 7 y.o. child ranges from 7
to 70M tokens, [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]; hence, training the networks with a naturalistic or conversational regimen for 3-10 epochs would result in a comparable exposure. We trained the networks using torch.optim.lr_scheduler.StepLR (step_size=5, gamma=0.1) and the Adam optimizer (lr=0.001), with 16-bit automatic mixed precision to speed up the (parallel) training, for a maximum of 100 epochs. The networks were implemented in PyTorch (v2.3.1) and wrapped in Transformers structures (v4.42.4) to maximize compatibility with the lm-eval (v0.4.3) environment. CUDA drivers v12.4 were used. The most relevant configurations tested are discussed in the next section.
      </p>
    </sec>
    <sec id="sec-2-1">
      <title>3.1. Configurations tested</title>
      <p>
        Three different tokenization strategies (BPE,
FastWordPiece, and MorPiece) are compared using the
best-performing LSTM network [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], which consists of 650 units for the embedding layer and 650 nodes for each of the two hidden layers. Five different network architectures are compared, with the GroNLP GPT-2 small pretrained model [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] constituting our “top LLM
performer”. This model was re-adapted to Italian from the English GPT-2 model, which was originally trained on a corpus of approximately 10 billion tokens, i.e., several orders of magnitude larger than ours. We then trained on our corpus a comparable
bidirectional transformer (BERT), two LSTM networks,
respectively with 1 and 2 LSTM layers, and a one-layer
eMG-RNN network (Table 2), as described in §2.3.
      </p>
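      <p>A minimal sketch of the training setup just described (Adam, StepLR scheduler, 16-bit automatic mixed precision; the model, assumed to return next-token logits, and the batched data loader are placeholders) is the following:</p>
      <preformat>
# Sketch of the training setup described in Section 3: Adam (lr=0.001),
# StepLR(step_size=5, gamma=0.1) and 16-bit automatic mixed precision.
# `model` and `dataloader` (e.g. sliding windows of 60 tokens) are assumptions.
import torch

def windows(token_ids, size=60):
    """Fixed-sequence-length regimen: overlapping windows of `size` tokens."""
    for i in range(len(token_ids) - size):
        yield token_ids[i:i + size], token_ids[i + 1:i + size + 1]

def train(model, dataloader, device="cuda", max_epochs=100):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    scaler = torch.cuda.amp.GradScaler()
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                logits = model(inputs)                       # next-token logits
                loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()
      </preformat>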
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Network architectures compared: number of parameters and structure.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Model</th>
              <th>Parameters</th>
              <th>Structure</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>GPT-2 (GroNLP)</td><td>121M</td><td>12 attention heads + 768 hidden units</td></tr>
            <tr><td>BERT</td><td>113M</td><td>12 attention heads + 768 hidden units</td></tr>
            <tr><td>LSTM x2</td><td>65M</td><td>650 embedding + 2 LSTM layers (650)</td></tr>
            <tr><td>LSTM x1</td><td>36M</td><td>650 embedding + 1 LSTM layer (650)</td></tr>
            <tr><td>eMG-RNN</td><td>73M</td><td>650 embedding + 1 eMG-RNN layer (650)</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Comparing BERT and LSTM architectures, LSTMx1
qualifies as the most performant configuration (both in
training and in minimal pair judgments). Considering
training, the only batching regimen performing
sufficiently well is the fixed sequence length (loss=0.8877
with LSTMx1 vs. conversational loss=4.0240 or
naturalistic regimen loss=4.5884). All networks reached
a learning plateau around 10-12 epochs. Comparing the performance on COnVERSA, we realized that the results do not improve after 3 epochs of the fixed-sequence-length (60 tokens) training regimen (this result is compatible with the overfitting hypothesis, [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]).
Focusing on tokenizer training results with LSTMx1, we observed that BPE and FastWordPiece have comparable performance. MorPiece performs slightly worse, even though its tokenization seems linguistically more coherent (e.g., “farlo”, “to do it”, is tokenized as a single token by both BPE and fWP, while it is split in two by MorPiece: “far” “+lo”) and its training is faster (Table 3). This, however, has only a marginal impact on the minimal-pair contrast judgments, with MorPiece performing slightly better, overall, only in certain agreement cases.
      </p>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion</title>
      <p>
        Overall, LSTM networks significantly outperform
Bidirectional Transformers in this minimal pairs test on
Italian. This finding is consistent with results previously
discussed in the literature and suggests a clear
advantage of recurrent, sequential model architectures
(e.g., LSTM) over Bidirectional Transformers in terms of
linguistic generalizations [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] and partially justifies the renewed interest in RNN networks that has been observed in the last couple of years [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. As far as
the tokenization procedure is concerned, it is somewhat
premature to draw definitive conclusions from our
experiments, as MorPiece has not yet been fully
optimized or tested. Specifically, the optimal cut-off
threshold and minimum branching factor have not been
systematically evaluated. Nevertheless, a more
morphologically coherent segmentation is expected to
enhance sensitivity in certain minimal contrasts.
      </p>
      <p>Similarly, the eMG-RNN architecture could be
further explored and optimized, particularly considering
specific contrasts, which may help determine whether
our linguistic modeling is on the right track. Evidence to
the contrary is attested by the judgments of sentences
with missing thematic roles, which are often incorrectly
preferred by most models, including our eMG-RNN.</p>
      <p>
        In the end, our results suggest that Loss/Accuracy
performance registered in training is not a significant
predictor of the performance on the COnVERSA test, or
more generally, of the linguistic coherence of the LM
trained. Likewise, the models’ dimension is not a clear
predictor either: Transformers trained on the same small dataset perform randomly (their performance is around 50% across all dimensions), while eMG-RNN, which has a number of parameters similar to LSTM-2, outperforms both LSTM-2 and LSTM-1 (half the size of eMG-RNN). The training size remains a striking difference compared to the input received by children: this difference of one order of magnitude suggests that the biases built into eMG-RNN are not yet satisfactory and that our Language Acquisition Device is still more efficient; in this sense, the Poverty of Stimulus Hypothesis remains unrefuted [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] by these results. Next steps will consider extending the training corpus to 10M tokens (to match the English counterpart [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) and further exploring the effects of optimized tokenization procedures, as well as other minimal modifications and optimizations [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] of
recurrent neural networks.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This project is partially supported by T-GRA2L (Testing GRAdeness and GRAmmaticality in Linguistics), a PRIN 2022 project funded by Next Generation EU (202223PL4N). National coordinator: C. Chesi.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Online Resources</title>
      <p>Resources (corpus information, tokenizer, network
architectures and lm_eval tasks) are available at
https://github.com/cristianochesi/babylm-2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          et al., Eds.,
          <source>Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning. Singapore: Association for Computational Linguistics</source>
          ,
          <year>2023</year>
          . [Online]. Available: https://aclanthology.org/
          <year>2023</year>
          .conllbabylm.0
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Katzir</surname>
          </string-name>
          , “
          <article-title>Why large language models are poor theories of human linguistic cognition. A reply to Piantadosi (</article-title>
          <year>2023</year>
          ),”
          <year>2023</year>
          . [Online].
          <source>Available: lingbuzz/007190</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Piantadosi</surname>
          </string-name>
          , “
          <article-title>Modern language models refute Chomsky's approach to language,” Lingbuzz Preprint, lingbuzz</article-title>
          , vol.
          <volume>7180</volume>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghersi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Musella</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Musola</surname>
          </string-name>
          ,
          <article-title>COnVERSA: Test di Comprensione delle Opposizioni morfo-sintattiche VERbali attraverso la ScritturA</article-title>
          .
          <source>Firenze: Hogrefe</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          et al., “Attention Is All You Need,” arXiv:
          <fpage>1706</fpage>
          .03762 [cs],
          <source>Dec</source>
          .
          <year>2017</year>
          , Accessed: Mar.
          <volume>26</volume>
          ,
          <year>2022</year>
          . [Online]. Available: http://arxiv.org/abs/1706.03762
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>“Long shortterm memory,” Neural computation</article-title>
          , vol.
          <volume>9</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L. G. G.</given-names>
            <surname>Charpentier</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Samuel</surname>
          </string-name>
          , “
          <article-title>Not all layers are equally as important: Every Layer Counts BERT</article-title>
          ,”
          <source>in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</source>
          ,
          <source>Singapore: Association for Computational Linguistics</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>224</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .conll-babylm.
          <volume>20</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rudra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          , “
          <article-title>FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,”</article-title>
          <source>Jun. 23</source>
          ,
          <year>2022</year>
          , arXiv: arXiv:
          <fpage>2205</fpage>
          .14135. Accessed: Jun.
          <volume>12</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/2205.14135
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Steuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mosbach</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          , “
          <article-title>Large GPTlike Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence</article-title>
          and Psycholinguistic Measures,”
          <source>in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</source>
          ,
          <source>Singapore: Association for Computational Linguistics</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>129</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .conll-babylm.
          <volume>12</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-S.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , “
          <article-title>When Do You Need Billions of Words of Pretraining Data?</article-title>
          ,
          <source>” Nov. 10</source>
          ,
          <year>2020</year>
          , arXiv: arXiv:
          <year>2011</year>
          .04946. Accessed: Jan.
          <volume>10</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/
          <year>2011</year>
          .04946
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Crain</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Nakayama</surname>
          </string-name>
          , “Structure Dependence in Grammar Formation,
          <source>” Language</source>
          , vol.
          <volume>63</volume>
          , no.
          <issue>3</issue>
          , p.
          <fpage>522</fpage>
          ,
          <string-name>
            <surname>Sep</surname>
          </string-name>
          .
          <year>1987</year>
          , doi: 10.2307/415004.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          et al.,
          <article-title>“BLiMP: The Benchmark of Linguistic Minimal Pairs for English,” Transactions of the Association for Computational Linguistics</article-title>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>377</fpage>
          -
          <lpage>392</lpage>
          , Dec.
          <year>2020</year>
          , doi: 10.1162/tacl_a_
          <fpage>00321</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Alfano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cutugno</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. De Rosa</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Iacobini</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Savy</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Voghera</surname>
          </string-name>
          , “
          <article-title>VOLIP: a corpus of spoken Italian and a virtuous example of reuse of linguistic resources</article-title>
          ,”
          <source>in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Loftsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          , and S. Piperidis, Eds., Reykjavik, Iceland: European Language Resources Association (ELRA),
          <source>May</source>
          <year>2014</year>
          , pp.
          <fpage>3897</fpage>
          -
          <lpage>3901</lpage>
          . [Online]. Available: http://www.lrecconf.org/proceedings/lrec2014/pdf/906_Paper.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          et al., “
          <article-title>Language Models are Few-Shot Learners</article-title>
          ,” arXiv:
          <year>2005</year>
          .14165 [cs],
          <source>Jul</source>
          .
          <year>2020</year>
          , Accessed: Apr.
          <volume>21</volume>
          ,
          <year>2021</year>
          . [Online]. Available: http://arxiv.org/abs/
          <year>2005</year>
          .14165
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gage</surname>
          </string-name>
          , “
          <article-title>A new algorithm for data compression,” C Users Journal</article-title>
          , vol.
          <volume>12</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>38</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , “Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,” arXiv preprint arXiv:
          <year>1810</year>
          .04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>C. D. Yang</surname>
          </string-name>
          ,
          <article-title>The price of linguistic productivity: how children learn to break the rules of language</article-title>
          . Cambridge, MA: MIT Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jabbar</surname>
          </string-name>
          , “
          <article-title>MorphPiece : A Linguistic Tokenizer for Large Language Models</article-title>
          ,” Feb.
          <volume>03</volume>
          ,
          <year>2024</year>
          , arXiv: arXiv:
          <fpage>2307</fpage>
          .07262. Accessed: Jun.
          <volume>23</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/2307.07262
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fredkin</surname>
          </string-name>
          , “Trie memory,
          <source>” Commun. ACM</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>490</fpage>
          -
          <lpage>499</lpage>
          , Sep.
          <year>1960</year>
          , doi: 10.1145/367390.367400.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salcianu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dopson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , “Fast WordPiece Tokenization,” Oct.
          <volume>05</volume>
          ,
          <year>2021</year>
          , arXiv: arXiv:
          <year>2012</year>
          .15524. Accessed: Jun.
          <volume>13</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/
          <year>2012</year>
          .15524
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          et al.,
          <article-title>“Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          ,” Sep.
          <volume>02</volume>
          ,
          <year>2014</year>
          , arXiv: arXiv:
          <fpage>1406</fpage>
          .1078. Accessed: Jun.
          <volume>12</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/1406.1078
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Gers</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          , “
          <article-title>Recurrent nets that time and count</article-title>
          ,”
          <source>in Proceedings of the IEEE-INNSENNS International Joint Conference on Neural Networks. IJCNN 2000</source>
          .
          <article-title>Neural Computing: New Challenges and Perspectives for the New Millennium</article-title>
          , Como, Italy: IEEE,
          <year>2000</year>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>194</lpage>
          vol.
          <volume>3</volume>
          . doi:
          <volume>10</volume>
          .1109/IJCNN.
          <year>2000</year>
          .
          <volume>861302</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Greff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Koutník</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Steunebrink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          , “
          <article-title>LSTM: A Search Space Odyssey</article-title>
          ,”
          <source>IEEE Trans. Neural Netw. Learning Syst.</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>2222</fpage>
          -
          <lpage>2232</lpage>
          , Oct.
          <year>2017</year>
          , doi: 10.1109/TNNLS.
          <year>2016</year>
          .
          <volume>2582924</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajimirsadegh</surname>
          </string-name>
          , “Were RNNs All We Needed?,” Oct.
          <volume>04</volume>
          ,
          <year>2024</year>
          , arXiv: arXiv:
          <fpage>2410</fpage>
          .01201. Accessed: Oct.
          <volume>18</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/2410.01201
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , “
          <article-title>GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,”</article-title>
          <source>Feb. 22</source>
          ,
          <year>2019</year>
          , arXiv: arXiv:
          <year>1804</year>
          .07461. Accessed: Jul.
          <volume>20</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/
          <year>1804</year>
          .07461
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Dao</surname>
          </string-name>
          , “Mamba:
          <article-title>Linear-Time Sequence Modeling with Selective State Spaces</article-title>
          ,” May 31,
          <year>2024</year>
          , arXiv: arXiv:
          <fpage>2312</fpage>
          .00752. Accessed: Oct.
          <volume>20</volume>
          ,
          <year>2024</year>
          . [Online]. Available: http://arxiv.org/abs/2312.00752
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Reinhart</surname>
          </string-name>
          , “
          <article-title>The syntactic domain of anaphora</article-title>
          ,” Massachusetts Institute of Technology, Cambridge (MA),
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Ross</surname>
          </string-name>
          , “
          <article-title>Constraints on variables in syntax</article-title>
          .,
          <string-name>
            <surname>”</surname>
            <given-names>MIT</given-names>
          </string-name>
          , Cambridge (MA),
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cecchetto</surname>
          </string-name>
          , “
          <article-title>A Comparative Analysis of Left and Right Dislocation in Romance,” Studia Linguistica</article-title>
          , vol.
          <volume>53</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>67</lpage>
          , Apr.
          <year>1999</year>
          , doi: 10.1111/
          <fpage>1467</fpage>
          -
          <lpage>9582</lpage>
          .
          <fpage>00039</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chomsky</surname>
          </string-name>
          et al.,
          <source>Merge and the Strong Minimalist Thesis</source>
          , 1st ed. Cambridge University Press,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .1017/9781009343244.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          , “
          <article-title>Expectation-based Minimalist Grammars</article-title>
          ,” arXiv:
          <fpage>2109</fpage>
          .13871 [cs],
          <source>Sep</source>
          .
          <year>2021</year>
          , Accessed: Nov.
          <volume>02</volume>
          ,
          <year>2021</year>
          . [Online]. Available: http://arxiv.org/abs/2109.13871
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chesi</surname>
          </string-name>
          et al., “Different Ways to Forget: Linguistic Gates in Recurrent Neural Networks,”
          <source>in Proceedings of the BabyLM Challenge at the 28th Conference on Computational Natural Language Learning</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          et al.,
          <article-title>“A framework for few-shot language model evaluation</article-title>
          .” Zenodo, Dec.
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .5281/zenodo.10256836.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hart</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Risley</surname>
          </string-name>
          , “
          <article-title>American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments</article-title>
          .,” Developmental Psychology, vol.
          <volume>28</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>1096</fpage>
          -
          <lpage>1105</lpage>
          , Nov.
          <year>1992</year>
          , doi: 10.1037/
          <fpage>0012</fpage>
          -
          <lpage>1649</lpage>
          .
          <year>28</year>
          .6.1096.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gulordava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          , “
          <article-title>Colorless Green Recurrent Networks Dream Hierarchically,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Volume
          <volume>1</volume>
          (
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          , New Orleans, Louisiana: Association for Computational Linguistics, Jun.
          <year>2018</year>
          , pp.
          <fpage>1195</fpage>
          -
          <lpage>1205</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N18</fpage>
          -1108.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>W.</given-names>
            <surname>de Vries</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          , “
          <article-title>As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages,” in Findings of the Association for Computational Linguistics: ACLIJCNLP</article-title>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>836</fpage>
          -
          <lpage>846</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .findings-acl.
          <volume>74</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>F.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>You</surname>
          </string-name>
          , “To Repeat or Not To Repeat:
          <article-title>Insights from Scaling LLM under Token-Crisis,”</article-title>
          <year>2023</year>
          , arXiv. doi:
          <volume>10</volume>
          .48550/ARXIV.2305.13230.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>E.</given-names>
            <surname>Wilcox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Futrell</surname>
          </string-name>
          , and
          <string-name>
            <surname>R. Levy</surname>
          </string-name>
          , “Using Computational Models to Test Syntactic Learnability,” Linguistic Inquiry, pp.
          <fpage>1</fpage>
          -
          <lpage>44</lpage>
          , Apr.
          <year>2023</year>
          , doi: 10.1162/ling_a_
          <fpage>00491</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Crain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Berwick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chomsky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Bolhuis</surname>
          </string-name>
          , “
          <article-title>The growth of language: Universal Grammar, experience, and principles of computation,” Neuroscience &amp; Biobehavioral Reviews</article-title>
          , vol.
          <volume>81</volume>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>119</lpage>
          , Oct.
          <year>2017</year>
          , doi: 10.1016/j.neubiorev.
          <year>2016</year>
          .
          <volume>12</volume>
          .023.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>