Recurrent Networks Are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian

Achille Fusco1,†, Matilde Barbini1,†, Maria Letizia Piccini Bianchessi1,†, Veronica Bressan1,†, Sofia Neri1,†, Sarah Rossi1,†, Tommaso Sgrizzi1,† and Cristiano Chesi1,∗,†

1 NeTS Lab, IUSS Pavia, P.zza Vittoria 15, 27100 Pavia, Italy

Abstract
Here we discuss strategies and results of a small-sized training program based on Italian child-directed speech (less than 3M tokens) for various network architectures. The rationale behind these experiments [1] lies in the attempt to understand the effect of this naturalistic training diet on different model architectures. Preliminary findings lead us to conclude that: (i) different tokenization strategies produce only mildly significant improvements overall, although the resulting segmentation aligns more closely with linguistic intuitions in some cases but not in others; (ii) modified LSTM networks (the eMG-RNN variant) with a single layer and a structurally more controlled cell state perform slightly worse in training loss (compared to standard one- and two-layered LSTM models) but better on linguistically critical contrasts. This suggests that standard loss/accuracy metrics in autoregressive training procedures are linguistically irrelevant and, more generally, misleading, since the best-trained models produce poorer linguistic predictions ([2], pace [3]). Overall, the performance of these models remains significantly lower than that of 7-year-old native-speaker children on the relevant linguistic contrasts we considered [4].

Keywords
LSTM, Transformers, Small Language Models (SLM), tokenization, cell state control, LM evaluation

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
Email: cristiano.chesi@iusspavia.it (C. Chesi)
ORCID: 0000-0002-5389-8884 (A. Fusco); 0009-0007-7986-2365 (M. Barbini); 0009-0005-8116-3358 (M. L. Piccini Bianchessi); 0000-0003-3072-7967 (V. Bressan); 0009-0003-5456-0556 (S. Neri); 0009-0007-2525-2457 (S. Rossi); 0000-0003-1375-1359 (T. Sgrizzi); 0000-0003-1935-1348 (C. Chesi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1. Introduction

According to the mainstream LLM development pipeline, Transformer-based architectures [5] outperform sequential training models, like LSTMs [6], in various NLP tasks. When only small-sized training data are available, optimization becomes necessary [7], [8], but common optimization techniques neglect the linguistically relevant fact that these models (i) conflate semantic/world knowledge with morpho-syntactic competence, (ii) require unreasonably large amounts of training data compared to what children need during language acquisition, and (iii) yield lower returns in cognitive/linguistic terms the higher their performance gets [9]. In this paper we address these three issues, starting from the observation that while world knowledge exploits all the training data available (and the more the better), structural (morpho-syntactic and compositional-semantic) knowledge might require a much smaller dataset (from 10 to 100 million words, according to [10]). We explore this intuition further and, building on a rich literature from the '80s showing that typical child errors are structurally sensitive and never random [11], we model the networks' architecture so as to bias learning towards plausible structural configurations, possibly preventing these "small" language models (SLMs) from producing wrong linguistic generalizations. We started from a mild revision of the LM training and evaluation pipeline for Italian, including alternative approaches to tokenization based on pseudo-morphological decomposition (§2.2); we then approached a more structurally-driven update


of the cell state in LSTM networks, which we will call eMG-RNN variants (§2.3); we finally adopted a precise testing benchmark for specific linguistic contrasts in Italian, following the BLiMP design [12] (§2.4). We will first set the stage (§2) and discuss one alternative tokenization strategy (MorPiece); a simple modification of the LSTM gating system is then proposed that mimics certain linguistic constraints. We will then describe the relevant experiments we have run (§3) and draw some conclusions based on the observed results (§4). A general discussion with a description of the next steps concludes the paper (§5).

2. Revisiting the LM training pipeline

The LM training pipeline is relatively rigid: after corpus cleaning (i), the data are prepared/optimized for tokenization (ii), then the tokenized input is batched for training autoregressive models (iii), mostly feeding transformer-based architectures (iv). Once the models are trained, the evaluation step requires their assessment on some standard tasks (v). In the next sub-sections, we identify various criticalities in this pipeline, propose strategies to mitigate these problems and, in the end, train linguistically more informative SLMs.

2.1. Corpus creation and cleaning

The primary data we collected for Italian replicate plausible linguistic input that children may be exposed to during acquisition, in line with [1]. The corpus consists of about 3M tokens divided into child-directed speech (the Italian section of CHILDES), child movie subtitles (from OpenSubtitles), child songs (from the Zecchino D'Oro repository), telephone conversations (the VoLIP corpus, [13]), and fairy tales (all from copyright-expired sources). Simple cleaning consisted of removing children's productions from the CHILDES files as well as any other metalinguistic annotation (speaker identification, headers, time stamps, tags, links, etc.). The size and rough lexical richness (Type-Token Ratio, TTR) of each section are reported in Table 1, before and after the cleaning procedure.
Table 1
Corpus profiling before (bc) and after (ac) cleaning.

Section         tokens bc   tokens ac   TTR
Childes           405892      346155    0.03
Subtitles         959026      700729    0.05
Conversations      80826       58039    0.11
Songs             240309      222572    0.08
Fairy tales      1103543     1287826    0.05
Total            2973879     2431038    0.03
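For the CHILDES section, this cleaning amounts to keeping only the adult tiers of each CHAT transcript. A minimal sketch of such a filter is given below; the speaker codes to exclude and the amount of CHAT markup to strip are our assumptions, and the scripts actually used are those released with the corpus (see Appendix A).

```python
import re
from pathlib import Path

def clean_chat_transcript(path, exclude_speakers=("CHI",)):
    """Keep only adult utterances from a CHILDES CHAT (.cha) transcript:
    @-headers, %-dependent tiers and child (*CHI) lines are dropped, and
    time stamps / bracketed annotations are stripped from what remains."""
    kept = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        match = re.match(r"\*([A-Z0-9]+):\t?(.*)", line)
        if not match or match.group(1) in exclude_speakers:
            continue  # header, dependent tier, or a child production
        utterance = re.sub(r"\x15\d+_\d+\x15", "", match.group(2))  # media time stamps
        utterance = re.sub(r"\[.*?\]", "", utterance)               # bracketed annotations
        kept.append(" ".join(utterance.split()))
    return kept
```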
2.2. Tokenization: MorPiece (MoP)

Popular LLMs use either Byte-Pair Encoding (BPE) [14], [15] or (fast)WordPiece (fWP) [16] algorithms for tokenization. The simplicity and computational efficiency of these approaches contrast with the limited morphological analysis they provide. In richly inflected languages (e.g., Italian) and agglutinative languages (e.g., Finnish), this might induce linguistically unsound generalizations. Here, we explore a more morphologically informed strategy, inspired by the Tolerance Principle (TP) and the Sufficiency Principle (SP) [17], aiming to break words into potentially relevant morphemes without relying on morpheme tables [18]. The experiments we conduct compare the impact of different strategies when integrated into various network architectures. We refer to MorPiece (MoP) as a TP/SP-based strategy, which can be algorithmically described as follows: each token is traversed from left to right to create a "root trie", and from right to left to create an "inflectional trie" [19]. Each time a node N of a trie is traversed (corresponding to the current character path in the word), the frequency counter associated with this node (N_c) is updated (+1). Nodes corresponding to token endings (characters before white spaces or punctuation) are flagged. Once both tries are created, the optimization procedure explores each descendant and, for every daughter node D_k, its frequency k is compared to H_N, the approximation of the harmonic number for N used both in TP and SP [17], where c is the frequency N_c of the mother node:

    H_N = c / ln(c)    (F1)

If k > H_N and c ≠ k, a productive boundary break is postulated (based on the inference that, since there are different continuations and some of them are productive, i.e. sufficiently frequent according to SP, those might be real independent morphemes). We can then check whether this break respects H_D for the relevant nodes D_j and N_i in the "inflectional trie": there must exist a path where the frequency i of the daughter node N_i (in the "inflectional trie" the dependency between D and N is reversed) is lower than j/ln(j), where j is the frequency of the mother node D_j. If this is the case, the continuation is not considered "an exception", in the sense of TP [17], suggesting that the continuation is, in fact, a productive independent morpheme. A "++" root node is then activated, the node D_k is linked to it, and so on recursively, following the FastWordPiece tokenization strategy [20]. During recognition, the LinMaxMatch identification approach is adopted, as in FastWordPiece. Figure 1 illustrates the relevant morpheme breaks (indicated as "||") obtained by applying this morpheme-breaking procedure to fragments of the root and infl tries.
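A minimal sketch of the two tries and of the (F1) criterion may help fix ideas. It only covers the frequency bookkeeping and the productivity check: the branching-factor and cutoff controls discussed next, the "++" re-rooting step and the LinMaxMatch recognition phase are omitted, and the released tokenizer (Appendix A) remains the reference implementation.

```python
import math

def build_trie(tokens, reverse=False):
    """Character trie with a frequency counter per node and an end-of-token flag.
    reverse=False builds the "root" trie (left-to-right traversal),
    reverse=True the "inflectional" trie (right-to-left traversal)."""
    root = {"count": 0, "end": 0, "kids": {}}
    for tok in tokens:
        node = root
        for ch in (reversed(tok) if reverse else tok):
            node = node["kids"].setdefault(ch, {"count": 0, "end": 0, "kids": {}})
            node["count"] += 1          # N_c is updated (+1) at every traversal
        node["end"] += 1                # flag token endings
    return root

def productive_break(c, k):
    """(F1): H_N = c/ln(c). A boundary is postulated between a mother node of
    frequency c and a daughter node of frequency k when k > H_N and c != k."""
    return c != k and k > c / math.log(c)

# Worked check on the "cerca" fragment discussed below:
#   root trie: c=1813, k=1307 -> 1307 > 1813/ln(1813) ≈ 242 -> candidate break
#   infl trie: mother j=466619, daughter i=10121 -> 10121 < j/ln(j) ≈ 35748
#              -> the continuation is not an exception, the break is confirmed
```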
Various parametric controls have been considered to tune this procedure: (i) a branching factor (bf) parameter that excludes nodes with an excessively high number (> bf) of continuations (the rationale being that when too many continuations are present, they are unlikely to correspond to inflections; this often happens near the root of each trie); (ii) a cutoff parameter indicating the lower frequency boundary for a mother node (this is necessary to ensure a minimum number of observations; for example, if cutoff = 8, we exclude from the "root" trie any branching daughter with a frequency < 5). As in BPE, a minimum frequency control for tokens is also implemented, to exclude infrequent dictionary entries.

Figure 1: Visualization of a fragment of the "root" and the "infl(ectional)" trie created by MorPiece on our corpus (cutoff=100, bf=10).

Consider the word "cerca" ("to search for") as represented in the "root" trie. In the last "c-a" step, the relation between H_c and the frequency of "a" indicates that a break might exist between the nodes "c" (frequency=1813) and "a" (frequency=1307), since H_c = 1813/ln(1813) ≈ 242 and 1307 > H_c. This hypothesis is confirmed by the failure of the analogous check on the relevant "a-c" segment of the "infl" trie ("a" frequency=10121, "c" frequency=466619): 10121 < 466619/ln(466619) ≈ 35748. Had H_c been greater than the frequency of "a", no segmentation advantage would have been observable.

The proposed algorithm has a linear time complexity of O(2n), as each trie must be explored deterministically exactly once to evaluate the H_N/H_D frequency relations. The best linguistic results (relatively linguistically coherent segmentations) for our Italian corpus were obtained with cutoff=100 and bf=10. We found it unnecessary to filter the proposed inflectional breaks through the infl-trie double check (TP), since the LinMaxMatch strategy already efficiently filtered out the initially overestimated breaks. However, as an anonymous reviewer correctly pointed out, this strategy does not guarantee total inclusion of every token of our training corpus (in contrast to BPE, for instance). We acknowledge this limitation, but we emphasize that our goal was to produce a smaller, potentially more efficient lexicon. In our experiments, while BPE generated a lexicon of 96028 tokens (67169 when the minimum lexical frequency was set to 2), MoP produced a lexicon of just 55049 tokens (cutoff=100, bf=10).
2.3. Revisiting the LSTM architecture

Despite the many variants of the standard LSTM architecture, notably Gated Recurrent Units [21] or LSTMs augmented with peephole connections [22], and despite the discouraging equivalence results for these variations [23], we observe a recent revival of RNN-based model architectures [24]. We believe, in fact, that the core intuition behind the LSTM architecture may be linguistically relevant and worth exploring further, although generally more performant models (for instance in terms of the GLUE benchmark, [25]) are usually preferred [26]. The linguistic intuition is that the "long-term memory" (the cell state C in Figure 2) of LSTM networks could effectively model various types of non-local dependencies using a single mechanism. Linguistically speaking, filler-gap dependencies (1) and co-referential dependencies (2) are both "non-local dependencies", but they are subject to non-identical locality conditions:

(1) a. cosa_i credi che abbia riposto _i?
       what (you) believe that (he) shelved?
       'what do you believe he shelved?'
    b. *cosa_i credi che abbia riposto il libro [AdvP senza leggere _i]?
    b'. cosa_i credi che abbia riposto _i [AdvP senza leggere _i]?
       'what do you believe he shelved (*the book) without reading?'

(2) a. [il panino]_i, chi credi che lo_i abbia mangiato?
       the sandwich, who (you) believe it has eaten?
    b. *[il panino]_i, chi credi che _i abbia mangiato?
       the sandwich, who (you) believe has eaten?
       'the sandwich, who do you believe has eaten *(it)?'

While both dependencies require C(onstituent)-command generalizations to be captured [27], the adjunct island in (1), [28], but not the clitic left-dislocation in (2), [29], can, for instance, be licensed with a(n extra) gap, as in (1)b'. Aware of these differences, we decided to simply alter the gating system so as to allow the LSTM to create distinct pathways: one to "merge" new tokens, the other to decide whether a long-distance dependency is necessary and, subsequently, to "move" the relevant items [30]. The processing implementation of these operations is inspired by the expectation-based Minimalist Grammars formalism, eMG [31]; hence the name eMG-RNN.

Following this implementation, merge applies incrementally, token by token, and move means "retain in memory". In more detail, the cell of an eMG-RNN network performs the forward processing described by the computational graph in Figure 2: (i) the input at time t (x_t) is linearly transformed into a lower-dimensional vector (E, loosely used for "embedding") and then concatenated (C) with the previous hidden state/output, if any (h_t-1). Two pathways, both transformed through a sigmoid function (σ), lead to the move gate on one side and to the merge gate on the other. In the first case, the result of the sigmoid transformation is multiplied (⊙, the Hadamard product) with the input: this either erases or lets through components of the original vector, which are then added (+) to the previous context/cell state (c_t-1), if any, as in the LSTM forget gate. The merge gate, on the other side, privileges the new token if the sigmoid combination of the incoming token and the previous hidden state is low; otherwise (1 − this activation, as in the GRU update gate), it favors the items in the context/cell state (transformed through a tanh function to simulate memory decay).

Figure 2: eMG-RNN cell computational graph.
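Read in this way, the cell admits a compact sketch. The wiring below is only our reading of Figure 2 and of the description above (the projection sizes, the exact placement of the 1 − merge term and all other hyper-parameters are assumptions); the released code ([32], Appendix A) is the reference implementation.

```python
import torch
import torch.nn as nn

class EMGRNNCell(nn.Module):
    """Sketch of the eMG-RNN cell of Figure 2 (one possible reading of the graph)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embed = nn.Linear(input_size, hidden_size)        # E: project x_t down
        self.move  = nn.Linear(2 * hidden_size, hidden_size)   # "move" pathway
        self.merge = nn.Linear(2 * hidden_size, hidden_size)   # "merge" pathway

    def forward(self, x_t, h_prev, c_prev):
        e = self.embed(x_t)                          # E(x_t)
        z = torch.cat([e, h_prev], dim=-1)           # concatenation (C) with h_{t-1}
        move_gate  = torch.sigmoid(self.move(z))     # what may enter the cell state
        merge_gate = torch.sigmoid(self.merge(z))    # new token vs. stored context
        c_t = c_prev + move_gate * e                 # update the "long-term" memory
        h_t = (1.0 - merge_gate) * e + merge_gate * torch.tanh(c_t)  # tanh ≈ decay
        return h_t, c_t
```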
This architecture proved the most performant among the various alternatives we tested for the BabyLM 2024 challenge [32].
2.4. A linguistically informed evaluation

The last step in the pipeline requires a linguistically advanced set of oppositions to verify that structural generalizations are captured coherently. We adopted the lm-eval package [33] and included a specific task based on the English BLiMP design [12]. Most of the contrasts are derived from the COnVERSA test [4]. They consist of minimal pairs ordered following an increasing complexity metric that considers the number of operations necessary to establish a dependency and the locality of such a dependency. The examples below illustrate this point by comparing a local agreement dependency with, (3)b, or without, (3)a, a (linear) intervener, and a more complex dependency that requires processing an object relative clause (4):

(3) a. Il piatto è pieno. vs. Il piatto è piena.
       the dish.S.M is full.S.M … full.S.F
    b. Il muro della casa è rosso. vs. Il muro della casa è rossa.
       the wall.S.M of the house is red.S.M … red.S.F

(4) Ci sono due maestri. Uno insegna ed è ascoltato dagli studenti, l'altro si riposa. Quale maestro insegna?
    'There are two teachers. One teaches and is listened to by the students, the other rests. Which teacher teaches?'
       Quello che gli studenti ascoltano.
       'The one who the students listen to.'
    vs. Quello che ascolta gli studenti.
       'The one who listens to the students.'

Four kinds of dependency (agreement, thematic role assignment, pronominal form usage, and question formation and answering) are considered, over a set of 32 distinct syntactic configurations (a total of 344 minimal pairs to be judged, [4]).
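Following the BLiMP design, a model is typically credited with a minimal pair when it assigns a higher probability to the acceptable member. A minimal sketch of this comparison, assuming an autoregressive model exposing a Hugging Face-style interface (the actual evaluation runs as an lm-eval task, see Appendix A), is:

```python
import torch
import torch.nn.functional as F

def sentence_logprob(model, tokenizer, text):
    """Total log-probability of `text` under an autoregressive LM."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits                      # [1, T, vocab]
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)    # predict token t+1 from t
    gold = ids[:, 1:].unsqueeze(-1)
    return logprobs.gather(-1, gold).sum().item()

def judge_pair(model, tokenizer, acceptable, deviant):
    """The pair counts as correct when the acceptable member is more probable."""
    return sentence_logprob(model, tokenizer, acceptable) > \
           sentence_logprob(model, tokenizer, deviant)

# e.g. (3)a: judge_pair(model, tok, "Il piatto è pieno.", "Il piatto è piena.")
```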
3. Materials and Methods

We trained our models on the IUSS High-Performance Cluster with 2 GPU nodes, each with 4 NVIDIA A100 devices and 1T of RAM. Each network has been trained on the full corpus using various batching strategies: (i) Naturalistic, line-by-line, single exposure to each sentence in the corpus (each epoch corresponds to an exposure of about 3M tokens); (ii) Conversational, where two sequential lines are used as input, that is, [line 1, line 2], [line 2, line 3], etc. are batched; this guarantees that a minimal conversational context is provided for each sentence (each epoch corresponds to an exposure of 6M tokens); (iii) Fixed sequence length, where, considering the average sentence length of 54 words, a window of 60 tokens is used, that is, [tok_1, tok_2 … tok_60], [tok_2, tok_3 … tok_61], etc. are batched (with this regimen, each epoch corresponds to an exposure of 180M tokens). Roughly speaking, the bare amount of data processed by a 7-year-old child ranges from 7 to 70M tokens [34]; training the networks with the naturalistic or conversational regimen for 3-10 epochs would therefore result in a comparable exposure.
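The three regimens amount to three ways of slicing the same corpus; a schematic sketch follows (function names, the line separator and the stride-1 window are our own simplifications):

```python
def naturalistic(lines):
    # (i) one training sequence per corpus line (~3M tokens per epoch)
    for line in lines:
        yield line

def conversational(lines):
    # (ii) overlapping pairs of adjacent lines: [line1+line2], [line2+line3], ...
    #      (~6M tokens per epoch)
    for prev, curr in zip(lines, lines[1:]):
        yield prev + " " + curr

def fixed_window(token_ids, size=60):
    # (iii) sliding 60-token window with stride 1 (~180M tokens per epoch)
    for i in range(len(token_ids) - size + 1):
        yield token_ids[i:i + size]
```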
We trained the networks using torch.optim.lr_scheduler (step_size=5, gamma=0.1) and the Adam optimizer (lr=0.001), with 16-bit automatic mixed precision to speed up the (parallel) training, for a maximum of 100 epochs. The networks have been implemented in PyTorch (v2.3.1) and wrapped in Transformers structures (v4.42.4) to maximize compatibility with the lm-eval (v0.4.3) environment. CUDA drivers v12.4 were used. The most relevant configurations tested are discussed in the next section.
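Put together, the training step can be sketched as follows; StepLR is our assumption for the lr_scheduler settings quoted above, the labels-based loss assumes the Transformers wrapping just mentioned, and model/batches stand for any architecture of §3.1 and any regimen of §3:

```python
import torch

def train(model, batches, max_epochs=100, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    scaler = torch.cuda.amp.GradScaler()          # 16-bit automatic mixed precision
    model.to(device).train()
    for epoch in range(max_epochs):
        for input_ids, labels in batches:         # any regimen from Section 3
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                loss = model(input_ids.to(device), labels=labels.to(device)).loss
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()
```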
3.1. Configurations tested

Three different tokenization strategies (BPE, FastWordPiece, and MorPiece) are compared using the best-performing LSTM network [35], which consists of 650 units for the embedding layer and 650 nodes for each of the two hidden layers. Five different network architectures are compared, with the GroNLP GPT-2-small pretrained model [36] constituting our "top LLM performer". This model was re-adapted to Italian from the English GPT-2 model, which was originally trained on a corpus of approximately 10 billion tokens, namely various orders of magnitude bigger than ours. We then trained on our corpus a comparable bidirectional transformer (BERT), two LSTM networks with 1 and 2 LSTM layers respectively, and a one-layer eMG-RNN network (Table 2), as described in §2.3.

Table 2
Network architectures.

Model                Parameters   Structure
GroNLP GPT-2 small   121M         12 attention heads + 768 hidden units
BERT                 113M         12 attention heads + 768 hidden units
LSTMx2               65M          650 embedding + 2 LSTM layers (650)
LSTMx1               36M          650 embedding + 1 LSTM layer (650)
eMG-RNN              73M          650 embedding + 1 eMG-RNN layer (650)
4. Results

Comparing BERT and LSTM architectures, LSTMx1 qualifies as the most performant configuration (both in training and in minimal pair judgments). Considering training, the only batching regimen performing sufficiently well is the fixed sequence length one (loss=0.8877 with LSTMx1, vs. a conversational loss of 4.0240 and a naturalistic loss of 4.5884). All networks reached a learning plateau around 10-12 epochs. Comparing the performances on COnVERSA, we realized that the results do not improve after 3 epochs of the fixed sequence length (60 tokens) training regimen (a result compatible with the overfitting hypothesis, [37]). Focusing on tokenizer training results with LSTMx1, we observed that BPE and FastWordPiece have comparable performance. MorPiece performs slightly worse, even though its tokenization seems linguistically more coherent (e.g., "farlo", "to do it", is tokenized both by BPE and fWP as a single token, while it is split in two by MorPiece: "far" "+lo") and its training is faster (Table 3). This, however, only marginally impacts the minimal-pair contrast judgments, which improve slightly, overall, only in certain agreement cases.

Table 3
Impact of the tokenization strategy on LSTM training.

Strategy       Vocab size   Training time / epoch   Loss
Corpus types   72931        ~1h                     1.1520
BPE            96028        ~4h                     0.8877
fWP            97162        ~4h                     0.9491
MoP            55049        ~3h                     1.1151

We then adopted the BPE tokenizer for the architectural comparisons. Network training performances are summarized in Table 4 and graphically represented in Figure 3 for a comparison across linguistic dimensions.

Table 4
Network architectures and their performance on training (Loss/Accuracy) and on the COnVERSA test.

Model           Loss/Accuracy    COnVERSA
GroNLP GPT-2s   —                0.73 (±0.02)
BERT            4.5488/0.65471   0.43 (±0.02)
LSTMx2          0.7849/0.8283    0.48 (±0.03)
LSTMx1          0.8784/0.8103    0.52 (±0.03)
eMG-RNN         0.9491/0.7815    0.61 (±0.01)

[Figure 3: radar chart over the COnVERSA syntactic configurations (subject-verb, DP-internal and past-participle agreement, auxiliary selection, theta roles, clitic and reflexive pronouns, psych verbs, person rotation, polar and wh-questions, etc.), comparing LSTMx1, eMG-RNN (BPE) and 7-year-old children.]
Figure 3: Performance of the two best RNN network variants on COnVERSA compared to 7-year-old children.
5. Discussion

Overall, LSTM networks significantly outperform bidirectional Transformers in this minimal-pair test on Italian. This finding is consistent with results previously discussed in the literature; it suggests a clear advantage of recurrent, sequential model architectures (e.g., LSTM) over bidirectional Transformers in terms of linguistic generalizations [38] and partially justifies the renewed interest in RNN architectures observed in the last couple of years [24], [26]. As far as the tokenization procedure is concerned, it is somewhat premature to draw definitive conclusions from our experiments, as MorPiece has not yet been fully optimized or tested. Specifically, the optimal cutoff threshold and minimum branching factor have not been systematically evaluated. Nevertheless, a more morphologically coherent segmentation is expected to enhance sensitivity in certain minimal contrasts.

Similarly, the eMG-RNN architecture could be further explored and optimized, particularly on specific contrasts, which may help determine whether our linguistic modeling is on the right track. Evidence to the contrary is attested by the judgments of sentences with missing thematic roles, which are often incorrectly preferred by most models, including our eMG-RNN.

In the end, our results suggest that the loss/accuracy performance registered in training is not a significant predictor of the performance on the COnVERSA test or, more generally, of the linguistic coherence of the trained LM. Likewise, model size is not a clear predictor either: Transformers trained on the same small dataset perform randomly (in all dimensions their performance is around 50%), while eMG-RNN, which has a number of parameters similar to LSTMx2, outperforms both LSTMx2 and LSTMx1 (half the size of eMG-RNN). The training size remains a striking difference compared to the input received by children: this difference of one order of magnitude suggests that the biases built into eMG-RNN are not yet satisfactory and that our Language Acquisition Device is still more efficient; in this sense, the Poverty of Stimulus Hypothesis remains unrefuted [39] by these results. Next steps will consider extending the training corpus to 10M tokens (to match the English counterpart [1]) and further exploring the effects of optimized tokenization procedures and of other minimal modifications and optimizations [24] of recurrent neural networks.

Acknowledgments

This project is partially supported by T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics, a PRIN 2022 Next Generation EU funded project (202223PL4N). National coordinator: CC.

References

[1] A. Warstadt et al., Eds., Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning. Singapore: Association for Computational Linguistics, 2023. [Online]. Available: https://aclanthology.org/2023.conll-babylm.0
[2] R. Katzir, "Why large language models are poor theories of human linguistic cognition. A reply to Piantadosi (2023)," 2023. [Online]. Available: lingbuzz/007190
[3] S. Piantadosi, "Modern language models refute Chomsky's approach to language," Lingbuzz Preprint, lingbuzz, vol. 7180, 2023.
[4] C. Chesi, G. Ghersi, V. Musella, and D. Musola, COnVERSA: Test di Comprensione delle Opposizioni morfo-sintattiche VERbali attraverso la ScritturA. Firenze: Hogrefe, 2024.
[5] A. Vaswani et al., "Attention Is All You Need," arXiv:1706.03762 [cs], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1706.03762
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] L. G. G. Charpentier and D. Samuel, "Not all layers are equally as important: Every Layer Counts BERT," in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 210–224. doi: 10.18653/v1/2023.conll-babylm.20.
[8] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," Jun. 2022, arXiv: arXiv:2205.14135. [Online]. Available: http://arxiv.org/abs/2205.14135
[9] J. Steuer, M. Mosbach, and D. Klakow, "Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures," in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 114–129. doi: 10.18653/v1/2023.conll-babylm.12.
[10] Y. Zhang, A. Warstadt, H.-S. Li, and S. R. Bowman, "When Do You Need Billions of Words of Pretraining Data?," Nov. 2020, arXiv: arXiv:2011.04946. [Online]. Available: http://arxiv.org/abs/2011.04946
[11] S. Crain and M. Nakayama, "Structure Dependence in Grammar Formation," Language, vol. 63, no. 3, p. 522, Sep. 1987. doi: 10.2307/415004.
[12] A. Warstadt et al., "BLiMP: The Benchmark of Linguistic Minimal Pairs for English," Transactions of the Association for Computational Linguistics, vol. 8, pp. 377–392, Dec. 2020. doi: 10.1162/tacl_a_00321.
[13] I. Alfano, F. Cutugno, A. De Rosa, C. Iacobini, R. Savy, and M. Voghera, "VOLIP: a corpus of spoken Italian and a virtuous example of reuse of linguistic resources," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), N. Calzolari et al., Eds., Reykjavik, Iceland: European Language Resources Association (ELRA), May 2014, pp. 3897–3901. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2014/pdf/906_Paper.pdf
[14] T. B. Brown et al., "Language Models are Few-Shot Learners," arXiv:2005.14165 [cs], Jul. 2020. [Online]. Available: http://arxiv.org/abs/2005.14165
[15] P. Gage, "A new algorithm for data compression," C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[17] C. D. Yang, The price of linguistic productivity: how children learn to break the rules of language. Cambridge, MA: MIT Press, 2016.
[18] H. Jabbar, "MorphPiece: A Linguistic Tokenizer for Large Language Models," Feb. 2024, arXiv: arXiv:2307.07262. [Online]. Available: http://arxiv.org/abs/2307.07262
[19] E. Fredkin, "Trie memory," Communications of the ACM, vol. 3, no. 9, pp. 490–499, Sep. 1960. doi: 10.1145/367390.367400.
[20] X. Song, A. Salcianu, Y. Song, D. Dopson, and D. Zhou, "Fast WordPiece Tokenization," Oct. 2021, arXiv: arXiv:2012.15524. [Online]. Available: http://arxiv.org/abs/2012.15524
[21] K. Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," Sep. 2014, arXiv: arXiv:1406.1078. [Online]. Available: http://arxiv.org/abs/1406.1078
[22] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy: IEEE, 2000, pp. 189–194, vol. 3. doi: 10.1109/IJCNN.2000.861302.
[23] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017. doi: 10.1109/TNNLS.2016.2582924.
[24] L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, and H. Hajimirsadegh, "Were RNNs All We Needed?," Oct. 2024, arXiv: arXiv:2410.01201. [Online]. Available: http://arxiv.org/abs/2410.01201
[25] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," Feb. 2019, arXiv: arXiv:1804.07461. [Online]. Available: http://arxiv.org/abs/1804.07461
[26] A. Gu and T. Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," May 2024, arXiv: arXiv:2312.00752. [Online]. Available: http://arxiv.org/abs/2312.00752
[27] T. Reinhart, "The syntactic domain of anaphora," Massachusetts Institute of Technology, Cambridge (MA), 1976.
[28] J. R. Ross, "Constraints on variables in syntax," MIT, Cambridge (MA), 1967.
[29] C. Cecchetto, "A Comparative Analysis of Left and Right Dislocation in Romance," Studia Linguistica, vol. 53, no. 1, pp. 40–67, Apr. 1999. doi: 10.1111/1467-9582.00039.
[30] N. Chomsky et al., Merge and the Strong Minimalist Thesis, 1st ed. Cambridge University Press, 2023. doi: 10.1017/9781009343244.
[31] C. Chesi, "Expectation-based Minimalist Grammars," arXiv:2109.13871 [cs], Sep. 2021. [Online]. Available: http://arxiv.org/abs/2109.13871
[32] C. Chesi et al., "Different Ways to Forget: Linguistic Gates in Recurrent Neural Networks," in Proceedings of the BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, 2024.
[33] L. Gao et al., "A framework for few-shot language model evaluation," Zenodo, Dec. 2023. doi: 10.5281/zenodo.10256836.
[34] B. Hart and T. R. Risley, "American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments," Developmental Psychology, vol. 28, no. 6, pp. 1096–1105, Nov. 1992. doi: 10.1037/0012-1649.28.6.1096.
[35] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, and M. Baroni, "Colorless Green Recurrent Networks Dream Hierarchically," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1195–1205. doi: 10.18653/v1/N18-1108.
[36] W. de Vries and M. Nissim, "As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 836–846. doi: 10.18653/v1/2021.findings-acl.74.
[37] F. Xue, Y. Fu, W. Zhou, Z. Zheng, and Y. You, "To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis," 2023, arXiv. doi: 10.48550/ARXIV.2305.13230.
[38] E. Wilcox, R. Futrell, and R. Levy, "Using Computational Models to Test Syntactic Learnability," Linguistic Inquiry, pp. 1–44, Apr. 2023. doi: 10.1162/ling_a_00491.
[39] C. Yang, S. Crain, R. C. Berwick, N. Chomsky, and J. J. Bolhuis, "The growth of language: Universal Grammar, experience, and principles of computation," Neuroscience & Biobehavioral Reviews, vol. 81, pp. 103–119, Oct. 2017. doi: 10.1016/j.neubiorev.2016.12.023.


A. Online Resources
Resources (corpus information, tokenizer, network
architectures and lm_eval tasks) are available at
https://github.com/cristianochesi/babylm-2024.