Recurrent Networks are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian

Achille Fusco1,†, Matilde Barbini1,†, Maria Letizia Piccini Bianchessi1,†, Veronica Bressan1,†, Sofia Neri1,†, Sarah Rossi1,†, Tommaso Sgrizzi1,† and Cristiano Chesi1,∗,†

1 NeTS Lab, IUSS Pavia, P.zza Vittoria 15, 27100 Pavia, Italy

Abstract
Here we discuss strategies and results of a small-sized training program based on Italian child-directed speech (less than 3M tokens) for various network architectures. The rationale behind these experiments [1] lies in the attempt to understand the effect of this naturalistic training diet on different model architectures. Preliminary findings lead us to conclude that: (i) different tokenization strategies produce only mildly significant improvements overall, although segmentation aligns more closely with linguistic intuitions in some cases, but not in others; (ii) modified LSTM networks (eMG-RNN variant) with a single layer and a structurally more controlled cell state perform slightly worse in training loss (compared to standard one- and two-layered LSTM models) but better on linguistically critical contrasts. This suggests that standard loss/accuracy metrics in autoregressive training procedures are linguistically irrelevant and, more generally, misleading, since the best-trained models produce poorer linguistic predictions ([2], pace [3]). Overall, the performance of these models remains significantly lower than that of 7-year-old native-speaker children on the relevant linguistic contrasts we considered [4].

Keywords
LSTM, Transformers, Small Language Models (SLM), tokenization, cell state control, LM evaluation

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author: cristiano.chesi@iusspavia.it (C. Chesi). † These authors contributed equally.
ORCID: 0000-0002-5389-8884 (A. Fusco); 0009-0007-7986-2365 (M. Barbini); 0009-0005-8116-3358 (M. L. Piccini Bianchessi); 0000-0003-3072-7967 (V. Bressan); 0009-0003-5456-0556 (S. Neri); 0009-0007-2525-2457 (S. Rossi); 0000-0003-1375-1359 (T. Sgrizzi); 0000-0003-1935-1348 (C. Chesi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

According to the mainstream LLM development pipeline, Transformer-based architectures [5] outperform sequential training models, like LSTM [6], in various NLP tasks. When small-sized training data are available, optimization becomes necessary [7], [8], but common optimization techniques neglect the linguistically relevant fact that these models (i) conflate semantic/world knowledge with morpho-syntactic competence, (ii) require unreasonable amounts of training data compared to that needed by children during language acquisition, and (iii) the higher their performance, the lower their return in cognitive/linguistic terms [9]. In this paper we address these three issues, starting from the observation that while world knowledge uses all training data available, and the more the better, structural (morpho-syntactic and compositional semantic) knowledge might require a much smaller dataset (from 10 to 100 million words, according to [10]). We explore this intuition further and, based on prolific literature from the '80s showing that typical child errors are structurally sensitive and never random [11], we model the networks' architecture to bias learning towards plausible structural configurations, possibly preventing these "small" language models (SLM) from producing wrong linguistic generalizations. We started from a mild revision of the LM training and evaluation pipeline for Italian, including alternative approaches to tokenization based on pseudo-morphological decomposition (§2.2); we then approached a more structurally-driven update of the cell state in LSTM networks, which we will call eMG-RNN variants (§2.3); we finally adopted a precise testing benchmark for specific linguistic contrasts in Italian following the BLiMP design [12] (§2.4). We will first set the stage in section (§2) and discuss one alternative tokenization strategy (MorPiece). A simple modification to the gating system in LSTM is proposed that mimics certain linguistic constraints. Then, we will describe the relevant experiments we have run (§3) and draw some conclusions based on the observed results (§4). A general discussion with a description of the next steps will conclude this paper (§5).
2. Revisiting LM training pipeline

The LM training pipeline is relatively rigid: after corpus cleaning (i), the data are prepared/optimized for tokenization (ii), then the tokenized input is batched for training autoregressive models (iii), mostly feeding transformer-based architectures (iv). Once the models are trained, the evaluation step requires their assessment on some standard tasks (v). In the next sub-sections, we identify various criticalities in this pipeline, eventually proposing strategies to mitigate these problems and, in the end, to train linguistically more informative SLMs.

2.1. Corpus creation and cleaning

The primary data we collected for Italian replicates plausible linguistic input that children may be exposed to during acquisition, in line with [1]. It consists of about 3M tokens divided into child-directed speech (CHILDES, Italian section), child movie subtitles (from OpenSubtitles), child songs (from the Zecchino D'Oro repository), telephone conversations (VoLIP corpus, [13]), and fairy tales (all from copyright-expired sources). Simple cleaning consisted of removing children's productions from the CHILDES files as well as any other metalinguistic annotation (speakers' identification, headers, time stamps, tags, links, etc.). The dimension and rough lexical richness of each section are reported in Table 1 (Type-Token Ratio, TTR) before and after the cleaning procedure.

Table 1
Corpus profiling before (bc) and after (ac) cleaning.

Section       | tokens bc | tokens ac | TTR
Childes       | 405892    | 346155    | 0.03
Subtitles     | 959026    | 700729    | 0.05
Conversations | 80826     | 58039     | 0.11
Songs         | 240309    | 222572    | 0.08
Fairy tales   | 1103543   | 1287826   | 0.05
Total         | 2973879   | 2431038   | 0.03
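For replication purposes, the TTR values in Table 1 can be recomputed with a few lines of Python. This is a minimal sketch that assumes one plain-text file per cleaned section and whitespace tokenization; the file names and the exact tokenization used for Table 1 are not specified in the paper and are placeholders here.

```python
# Minimal sketch: recompute the Type-Token Ratio (TTR) of Table 1.
# Assumes one plain-text file per cleaned corpus section and whitespace
# tokenization; file names are illustrative placeholders.
from pathlib import Path

def ttr(path: str) -> float:
    tokens = Path(path).read_text(encoding="utf-8").lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

for section in ["childes", "subtitles", "conversations", "songs", "fairy_tales"]:
    print(section, round(ttr(f"{section}_cleaned.txt"), 2))
```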
2.2. Tokenization: MorPiece (MoP)

Popular vLLMs use either Byte-Pair Encoding (BPE) [14], [15] or (fast)WordPiece (fWP) [16] algorithms for tokenization. The simplicity and computational efficiency of these approaches contrast with the limited morphological analysis they provide. In richly inflectional languages (e.g., Italian) and agglutinative languages (e.g., Finnish), this might induce linguistically unsound generalizations. Here, we explore a more morphologically informed strategy, inspired by the Tolerance Principle (TP) and Sufficiency Principle (SP) [17], aiming to break words into potentially relevant morphemes without relying on morpheme tables [18]. The experiments we conduct compare the impact of different strategies when integrated into various network architectures. We refer to MorPiece (MoP) as a TP/SP-based strategy, which can be algorithmically described as follows: each token is traversed from left to right to create a "root trie", and from right to left to create an "inflectional trie" [19]. Each time a node N of the trie is traversed (corresponding to the current character path in the word), the frequency counter associated with this node (Nc) is updated (+1). Nodes corresponding to token endings (characters before white spaces or punctuation) are flagged. Once both tries are created, the optimization procedure explores each descendant, and for every daughter node Dk its frequency k is compared to HN, the approximation of the harmonic number for N used both in TP and SP [17], where c is the frequency of the mother node Nc:

HN = c / ln(c)    (F1)

If k > HN and c ≠ k, a productive boundary break is postulated (based on the inference that, since there are different continuations and some of them are productive, i.e. sufficiently frequent according to SP, those might be real independent morphemes). We can then check whether this break respects HD for the relevant nodes Dj and Ni in the "inflectional trie". This means there exists a path where the frequency i of the daughter node Ni (in the "inflectional trie" the dependency between D and N is reversed) is lower than j/ln(j), where j is the frequency of the mother node Dj. If this is the case, the continuation is not considered "an exception", in the sense of TP [17], suggesting that the continuation is, in fact, a productive independent morpheme. A "++" root node is then activated, the node Dk is linked to it, and so on recursively, following the FastWordPiece tokenization strategy [20]. During recognition, the LinMaxMatch identification approach is adopted, as in FastWordPiece. Figure 1 illustrates the relevant morpheme breaks (indicated as "||") obtained by applying this morpheme-breaking procedure to fragments of the root and infl tries.

Various parametric controls have been considered to tune this procedure: (i) a branching factor (bf) parameter that excludes nodes with an excessively high number (> bf) of continuations (the rationale being that when too many continuations are present, they are unlikely to correspond to inflections; this often happens near the root of each trie); (ii) a cutoff parameter indicating the lower frequency boundary for a mother node (this is necessary to ensure a minimum number of observations; for example, if cutoff = 8, we exclude from the "root" trie any branching daughter with a frequency < 5). As in BPE, a minimum frequency control for tokens is also implemented to exclude infrequent dictionary entries.

Figure 1: Visualization of a fragment of the "root" and the "infl(ectional)" trie created by MorPiece on our corpus (cutoff=100, bf=10).

Consider the word "cerca" ("to search for") as represented in the "root" trie. In the last "c-a" segment, the relation between Hfc and the frequency of "a" indicates that a break might exist between the nodes "c" (frequency=1813) and "a" (frequency=1307), since Hfc = 1813/ln(1813) and 1307 > Hfc. This hypothesis is confirmed by the failure of the Hfc check at the relevant "a-c" segment of the "infl" trie ("a" frequency=10121, "c" frequency=466619): 10121 < 466619/ln(466619). If Hfc had been greater than the frequency of "a", then no segmentation advantage would have been observable.

The proposed algorithm has a linear time complexity of O(2n), as each trie must be explored deterministically exactly once to evaluate the HN/D frequency relation. The best linguistic results (relatively linguistically coherent segmentations) for our Italian corpus were obtained with cutoff=100 and bf=10. We found that it was unnecessary to filter the proposed inflectional breaks using the infl-trie double check (TP), since the LinMaxMatch strategy already efficiently filtered out initially overestimated breaks. However, as an anonymous reviewer correctly pointed out, this strategy does not guarantee total inclusion of every token of our training corpus (in contrast to BPE, for instance). We acknowledge this limitation, but we emphasize that our goal was to produce a smaller, potentially more efficient lexicon. In our experiments, while BPE generated a lexicon of 96028 tokens (67169 when the minimum lexical frequency was set to 2), MoP produced a lexicon of just 55049 tokens (cutoff=100, bf=10).
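The sketch below illustrates our reading of the break criterion in (F1); it is a simplified illustration under stated assumptions (dictionary-based tries, the bf/cutoff controls applied as described above), not the released implementation, which is available among the resources listed in Appendix A.

```python
# Simplified sketch of the MorPiece break criterion (F1); our reading of the
# procedure, not the released implementation (see Appendix A).
import math
from collections import defaultdict

def make_node():
    return {"freq": 0, "end": False, "kids": defaultdict(make_node)}

def build_trie(tokens, reverse=False):
    """Root trie (left-to-right) or inflectional trie (reverse=True)."""
    root = make_node()
    for tok in tokens:
        node = root
        for ch in (reversed(tok) if reverse else tok):
            node = node["kids"][ch]
            node["freq"] += 1            # update the node counter Nc
        node["end"] = True               # flag token endings
    return root

def productive_breaks(node, path="", cutoff=100, bf=10):
    """Yield (path, daughter) pairs where a boundary break is postulated."""
    c, kids = node["freq"], node["kids"]
    if c >= max(cutoff, 2) and len(kids) <= bf:      # controls (ii) and (i)
        h_n = c / math.log(c)                        # F1: H_N = c / ln(c)
        for ch, kid in kids.items():
            if kid["freq"] > h_n and kid["freq"] != c:
                yield path, ch                       # break between N and D_k
    for ch, kid in kids.items():
        yield from productive_breaks(kid, path + ch, cutoff, bf)

# Worked check from the "cerca" example above: the "root"-trie node "c"
# (frequency 1813) licenses a break before "a" (frequency 1307) ...
assert 1307 > 1813 / math.log(1813)                  # H_N ≈ 242
# ... while the "infl"-trie check fails (10121 < 466619 / ln(466619) ≈ 35700).
assert 10121 < 466619 / math.log(466619)
```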
2.3. Revisiting LSTM architecture

Despite the many variants of the standard LSTM architecture, notably Gated Recurrent Units [21] or LSTMs augmented with peephole connections [22], and the discouraging equivalence results for these variations [23], we observe a recent revival of RNN-based model architectures [24]. We believe, in fact, that the core intuition behind the LSTM architecture may be linguistically relevant and worth exploring further, although generally more performant models (for instance in terms of the GLUE benchmark, [25]) are usually preferred [26]. The linguistic intuition is that the "long-term memory" (cell state C in Figure 2) in LSTM networks could effectively model various types of non-local dependencies using a single mechanism. Linguistically speaking, filler-gap dependencies (1) and co-referential dependencies (2) are both "non-local dependencies", but they are subject to non-identical locality conditions:

(1) a. cosa_i credi che abbia riposto _i?
       what (you) believe that (he) shelved?
       'what do you believe he shelved?'
    b. *cosa_i credi che abbia riposto il libro [AdvP senza leggere _i]?
    b'. cosa_i credi che abbia riposto _i [AdvP senza leggere _i]?
       'what do you believe he shelved (*the book) without reading?'

(2) a. [il panino]_i, chi credi che lo_i abbia mangiato?
       the sandwich, who (you) believe that it has eaten?
    b. *[il panino]_i, chi credi che _i abbia mangiato?
       the sandwich, who (you) believe that has eaten?
       'the sandwich, who do you believe has eaten *(it)?'

While both dependencies require C(onstituent)-command generalizations to be captured [27], the adjunct island in (1), [28], but not the clitic left-dislocation in (2), [29], can, for instance, be licensed with a(n extra) gap, as in (1).b'. Aware of these differences, we decided to simply alter the gating system to allow the LSTM to create distinct pathways: one to "merge" new tokens, the other to decide whether a long-distance dependency is necessary and, subsequently, to "move" the relevant items [30]. The processing implementation of these operations is inspired by the expectation-based Minimalist Grammars formalism, eMG [31], and the network is therefore named eMG-RNN. Following this implementation, merge applies incrementally, token by token, and move means "retain in memory". In more detail, the cell of an eMG-RNN network performs the forward processing described in the computational graph in Figure 2: (i) the input at time t (xt) is linearly transformed to a lower-dimension vector (E, loosely used for "embedding"), then concatenated (C) with the previous hidden state/output, if any (ht-1). Two pathways, both transformed using a sigmoid function (σ), lead, on the one hand, to the move gate and, on the other, to the merge gate. In the first case, the result of the sigmoid transformation is multiplied (⊙, the Hadamard product) with the input (this either erases or allows some components of the original vector to be added (+) to the previous (if any) context/cell state (ct-1), as in the LSTM forget gate). The merge gate, in the other direction, will privilege the new token if the result of the sigmoid combination of the incoming token and the previous hidden state is low; otherwise (1 − this activation, as in the GRU update gate) it will favor items in the context/cell state (transformed through a tanh function to simulate memory decay).

Figure 2: eMG-RNN cell computational graph.

This architecture is the most performant compared to various alternatives tested for the BabyLM 2024 challenge [32].
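A minimal PyTorch sketch of the cell just described is given below. It reflects our reading of the verbal description of Figure 2; gate sizes and the exact wiring are assumptions, and the released implementation (Appendix A) may differ in detail.

```python
# Minimal sketch of an eMG-RNN-style cell following the description of
# Figure 2; gate sizes and exact wiring are our assumptions, not necessarily
# identical to the released implementation (see Appendix A).
import torch
import torch.nn as nn

class EMGRNNCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Linear(input_size, hidden_size)            # E: reduce x_t
        self.move_gate = nn.Linear(2 * hidden_size, hidden_size)   # σ, "move"
        self.merge_gate = nn.Linear(2 * hidden_size, hidden_size)  # σ, "merge"

    def forward(self, x_t, h_prev, c_prev):
        e_t = self.embed(x_t)                       # embedded input
        cat = torch.cat([e_t, h_prev], dim=-1)      # C: concatenation with h_{t-1}
        move = torch.sigmoid(self.move_gate(cat))
        merge = torch.sigmoid(self.merge_gate(cat))
        c_t = c_prev + move * e_t                   # retain (part of) x_t in memory
        # a low merge activation privileges the new token; a high one favors
        # the tanh-decayed cell state, as in a GRU update gate
        h_t = (1 - merge) * e_t + merge * torch.tanh(c_t)
        return h_t, c_t

# usage: cell = EMGRNNCell(650, 650); h1, c1 = cell(x, h0, c0)
```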
2.4. A linguistically informed evaluation

The last step in the pipeline requires a linguistically advanced set of oppositions to verify that structural generalizations can be captured coherently. We adopted the lm-eval package [33] and included a specific task based on the English BLiMP [12]. Most of the contrasts are derived from the COnVERSA test [4]. They consist of minimal pairs ordered following an increasing complexity metric that considers the number of operations necessary to establish a dependency and the locality of such a dependency. The examples below illustrate this point by comparing a local agreement dependency with, (3).b, or without, (3).a, a (linear) intervener, and a more complex dependency that requires processing an object relative clause (4):

(3) a. Il piatto è pieno. vs. Il piatto è piena.
       the dish.S.M is full.S.M ... full.S.F
    b. Il muro della casa è rosso. vs. Il muro della casa è rossa.
       the wall.S.M of the house is red.S.M ... red.S.F

(4) Ci sono due maestri. Uno insegna ed è ascoltato dagli studenti, l'altro si riposa. Quale maestro insegna?
    'There are two teachers. One teaches and he's listened to by the students; the other rests. Which teacher teaches?'
    Quello che gli studenti ascoltano. vs. Quello che ascolta gli studenti.
    'The one who the students listen to.' vs. 'The one who listens to the students.'

Four kinds of dependency (agreement, thematic role assignment, pronominal form usage, question formation and answering) are considered for a set of 32 distinct syntactic configurations (a total of 344 minimal pairs to be judged, [4]).
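As an illustration of how such contrasts can be scored, the sketch below assigns a summed log-probability to each member of a minimal pair with an autoregressive model and counts the pair as correct when the grammatical member wins. The model name is only an example, and the actual evaluation runs through the lm-eval tasks released with the paper (Appendix A).

```python
# Sketch: score a minimal pair with an autoregressive LM; the pair counts as
# correct if the grammatical sentence gets the higher summed log-probability.
# Model/tokenizer names are illustrative; the paper's evaluation uses lm-eval [33].
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("GroNLP/gpt2-small-italian")
model = AutoModelForCausalLM.from_pretrained("GroNLP/gpt2-small-italian").eval()

def logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss        # mean NLL over next-token targets
    return -loss.item() * (ids.size(1) - 1)       # summed log-probability

good, bad = "Il piatto è pieno.", "Il piatto è piena."   # pair (3).a above
print("correct" if logprob(good) > logprob(bad) else "wrong")
```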
3. Materials and Methods

We trained our models on the IUSS High-Performance Cluster with 2 GPU nodes, each with 4 NVIDIA A100 devices and 1T of RAM. Each network has been trained on the full corpus using various batching strategies: (i) naturalistic, line-by-line, single exposure to each sentence in the corpus (each epoch corresponds to an exposure of about 3M tokens); (ii) conversational, where two sequential lines are used for the input, that is, [line 1, line 2], [line 2, line 3], etc. are batched; this guarantees that a minimal conversational context for each sentence is provided (in this case, each epoch corresponds to an exposure of 6M tokens); (iii) fixed sequence length, where, considering the average sentence length of 54 words per sentence, a window of 60 tokens is used, that is, [tok_1, tok_2 … tok_60], [tok_2, tok_3 … tok_61], … are batched; with this regimen, each epoch corresponds to an exposure of 180M tokens. Roughly speaking, the bare amount of data processed by a 7-year-old child ranges from 7 to 70M tokens [34]; training the networks with a naturalistic or conversational regimen for 3-10 epochs would therefore result in a comparable exposure. We trained the networks using torch.optim.lr_scheduler (step_size=5, gamma=0.1) and the Adam optimizer (lr=0.001) with 16-bit automatic mixed precision to speed up the (parallel) training, for a maximum of 100 epochs. The networks have been implemented in PyTorch (v2.3.1) and wrapped in Transformers structures (4.42.4) to maximize compatibility with the lm-eval (v.0.4.3) environment. CUDA drivers v.12.4 were used. The most relevant configurations tested are discussed in the next section.
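The sketch below summarizes the three batching regimens and the optimization settings just listed. Batch assembly, padding and data-loading details are simplified assumptions; the optimizer, scheduler and mixed-precision settings follow the values reported above.

```python
# Sketch of the three batching regimens (i)-(iii) and the optimization setup
# reported above; padding/batch-size details are simplified assumptions.
import torch

def naturalistic(lines):                      # (i) one sentence per example
    return list(lines)

def conversational(lines):                    # (ii) overlapping two-line windows
    return [lines[i] + " " + lines[i + 1] for i in range(len(lines) - 1)]

def fixed_length(tokens, window=60):          # (iii) sliding 60-token windows
    return [tokens[i:i + window] for i in range(len(tokens) - window + 1)]

def train(model, batches, epochs=100, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    scaler = torch.cuda.amp.GradScaler()      # 16-bit automatic mixed precision
    for _ in range(epochs):
        for input_ids in batches:             # assumed: batched LongTensors of ids
            optimizer.zero_grad()
            with torch.autocast(device_type=device, dtype=torch.float16):
                loss = model(input_ids, labels=input_ids).loss
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()
```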
3.1. Configurations tested

Three different tokenization strategies (BPE, FastWordPiece, and MorPiece) are compared using the best-performing LSTM network [35], which consists of 650 units for the embedding layer and 650 nodes for each of the two hidden layers. Five different network architectures are compared, with the GroNLP GPT-2-small pretrained model [36] constituting our "top LLM performer". This model was re-adapted to Italian from the English-trained GPT-2 model, which was originally trained on a corpus of approximately 10 billion tokens, namely various orders of magnitude bigger than our corpus. We then trained on our corpus a comparable bidirectional transformer (BERT), two LSTM networks, respectively with 1 and 2 LSTM layers, and a one-layer eMG-RNN network (Table 2), as described in §2.3.

Table 2
Network architectures

Model              | Parameters | Structure
GroNLP GPT-2 small | 121M       | 12 attention heads + 768 hidden units
BERT               | 113M       | 12 attention heads + 768 hidden units
LSTMx2             | 65M        | 650 embedding + 2 LSTM layers (650)
LSTMx1             | 36M        | 650 embedding + 1 LSTM layer (650)
eMG-RNN            | 73M        | 650 embedding + 1 eMG-RNN layer (650)
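For concreteness, a minimal definition of the LSTMx1 configuration in Table 2 could look as follows; this is a sketch under the stated sizes, and dropout, weight tying and initialization details are not reported in the paper and are omitted here.

```python
# Minimal sketch of the LSTMx1 configuration in Table 2 (650-dimensional
# embeddings, one 650-unit LSTM layer, vocabulary-sized softmax); LSTMx2 sets
# num_layers=2, and the eMG-RNN variant replaces nn.LSTM with the cell of §2.3.
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int = 650,
                 hidden_size: int = 650, num_layers: int = 1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, hidden=None):
        out, hidden = self.lstm(self.embedding(input_ids), hidden)
        return self.decoder(out), hidden       # next-token logits
```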
4. Results

Comparing BERT and LSTM architectures, LSTMx1 qualifies as the most performant configuration (both in training and in minimal pair judgments). Considering training, the only batching regimen performing sufficiently well is the fixed sequence length one (loss=0.8877 with LSTMx1 vs. loss=4.0240 with the conversational regimen and loss=4.5884 with the naturalistic regimen). All networks reached a learning plateau around 10-12 epochs. Comparing the performances on COnVERSA, we realized that the LSTMx1 results do not improve after 3 epochs of the fixed sequence length (60 tokens) training regimen (a result compatible with the overfitting hypothesis, [37]). Focusing on tokenizer training results with LSTMx1, we observed that BPE and FastWordPiece have comparable performance. MorPiece performs slightly worse, even though its tokenization seems linguistically more coherent (e.g., "farlo" – "to do it" – is tokenized both by BPE and fWP as a single token, while it is split in two by MorPiece: "far" "+lo") and its training is faster (Table 3). This, however, only marginally impacts the minimal pair contrast judgments, with MorPiece performing slightly better, overall, just in certain agreement cases.

Table 3
Impact of the tokenization strategy on LSTM training

Strategy     | Vocab size | Training time per epoch | Loss
Corpus types | 72931      | ~1h                     | 1.1520
BPE          | 96028      | ~4h                     | 0.8877
fWP          | 97162      | ~4h                     | 0.9491
MoP          | 55049      | ~3h                     | 1.1151

We then adopted the BPE tokenizer for the architectural comparisons. Network training performances are summarized in Table 4 and graphically represented in Figure 3 for the comparison across linguistic dimensions.

Table 4
Network architectures and their performance on training (Loss/Accuracy) and on the COnVERSA test

Model         | Loss/Accuracy  | COnVERSA
GroNLP GPT-2s | –              | 0.73 (±0.02)
BERT          | 4.5488/0.65471 | 0.43 (±0.02)
LSTMx2        | 0.7849/0.8283  | 0.48 (±0.03)
LSTMx1        | 0.8784/0.8103  | 0.52 (±0.03)
eMG-RNN       | 0.9491/0.7815  | 0.61 (±0.01)

Figure 3: Performance of the 2 best RNN network variants (LSTMx1 and eMG-RNN with BPE) on COnVERSA compared to the 7 y.o. children, broken down by individual contrast (agreement, auxiliary selection, clitic pronouns, theta roles, reflexives, wh- and polar questions, etc.).

5. Discussion

Overall, LSTM networks significantly outperform bidirectional Transformers in this minimal pairs test on Italian. This finding is consistent with results previously discussed in the literature and suggests a clear advantage of recurrent, sequential model architectures (e.g., LSTM) over bidirectional Transformers in terms of linguistic generalizations [38]; it also partially justifies the renewed interest in RNN networks that we have observed in the last couple of years [24], [26]. As far as the tokenization procedure is concerned, it is somewhat premature to draw definitive conclusions from our experiments, as MorPiece has not yet been fully optimized or tested. Specifically, the optimal cut-off threshold and minimum branching factor have not been systematically evaluated. Nevertheless, a more morphologically coherent segmentation is expected to enhance sensitivity in certain minimal contrasts. Similarly, the eMG-RNN architecture could be further explored and optimized, particularly considering specific contrasts, which may help determine whether our linguistic modeling is on the right track. Evidence to the contrary is attested by the judgments of sentences with missing thematic roles, which are often incorrectly preferred by most models, including our eMG-RNN.

In the end, our results suggest that the Loss/Accuracy performance registered in training is not a significant predictor of the performance on the COnVERSA test or, more generally, of the linguistic coherence of the trained LM. Likewise, the models' dimension is not a clear predictor either: Transformers trained on the same small dataset perform randomly (in all dimensions their performance is around 50%), while eMG-RNN, which has a number of parameters similar to LSTMx2, outperforms both LSTMx2 and LSTMx1 (half the size of eMG-RNN). The training size remains a striking difference compared to the input received by children: this difference of one order of magnitude suggests that the biases considered in eMG-RNN are not yet satisfactory and that our Language Acquisition Device is still more efficient; in this sense, the Poverty of Stimulus Hypothesis remains unrefuted [39] by these results. Next steps will consider extending the training corpus to 10M tokens (to match the English counterpart [1]) and further exploring the effects of optimized tokenization procedures or other minimal modifications, and optimizations [24], of recurrent neural networks.

Acknowledgments

This project is partially supported by T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics, a PRIN 2022 Next Generation EU funded project (202223PL4N). National coordinator: CC.
References

[1] A. Warstadt et al., Eds., Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning. Singapore: Association for Computational Linguistics, 2023. [Online]. Available: https://aclanthology.org/2023.conll-babylm.0
[2] R. Katzir, "Why large language models are poor theories of human linguistic cognition. A reply to Piantadosi (2023)," 2023. [Online]. Available: lingbuzz/007190
[3] S. Piantadosi, "Modern language models refute Chomsky's approach to language," Lingbuzz Preprint, lingbuzz, vol. 7180, 2023.
[4] C. Chesi, G. Ghersi, V. Musella, and D. Musola, COnVERSA: Test di Comprensione delle Opposizioni morfo-sintattiche VERbali attraverso la ScritturA. Firenze: Hogrefe, 2024.
[5] A. Vaswani et al., "Attention Is All You Need," arXiv:1706.03762 [cs], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1706.03762
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] L. G. G. Charpentier and D. Samuel, "Not all layers are equally as important: Every Layer Counts BERT," in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 210–224. doi: 10.18653/v1/2023.conll-babylm.20.
[8] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," Jun. 23, 2022, arXiv:2205.14135. [Online]. Available: http://arxiv.org/abs/2205.14135
[9] J. Steuer, M. Mosbach, and D. Klakow, "Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures," in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 114–129. doi: 10.18653/v1/2023.conll-babylm.12.
[10] Y. Zhang, A. Warstadt, H.-S. Li, and S. R. Bowman, "When Do You Need Billions of Words of Pretraining Data?," Nov. 10, 2020, arXiv:2011.04946. [Online]. Available: http://arxiv.org/abs/2011.04946
[11] S. Crain and M. Nakayama, "Structure Dependence in Grammar Formation," Language, vol. 63, no. 3, p. 522, Sep. 1987, doi: 10.2307/415004.
[12] A. Warstadt et al., "BLiMP: The Benchmark of Linguistic Minimal Pairs for English," Transactions of the Association for Computational Linguistics, vol. 8, pp. 377–392, Dec. 2020, doi: 10.1162/tacl_a_00321.
[13] I. Alfano, F. Cutugno, A. De Rosa, C. Iacobini, R. Savy, and M. Voghera, "VOLIP: a corpus of spoken Italian and a virtuous example of reuse of linguistic resources," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, Eds., Reykjavik, Iceland: European Language Resources Association (ELRA), May 2014, pp. 3897–3901. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2014/pdf/906_Paper.pdf
[14] T. B. Brown et al., "Language Models are Few-Shot Learners," arXiv:2005.14165 [cs], Jul. 2020. [Online]. Available: http://arxiv.org/abs/2005.14165
[15] P. Gage, "A new algorithm for data compression," C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[17] C. D. Yang, The price of linguistic productivity: how children learn to break the rules of language. Cambridge, MA: MIT Press, 2016.
[18] H. Jabbar, "MorphPiece: A Linguistic Tokenizer for Large Language Models," Feb. 03, 2024, arXiv:2307.07262. [Online]. Available: http://arxiv.org/abs/2307.07262
[19] E. Fredkin, "Trie memory," Communications of the ACM, vol. 3, no. 9, pp. 490–499, Sep. 1960, doi: 10.1145/367390.367400.
[20] X. Song, A. Salcianu, Y. Song, D. Dopson, and D. Zhou, "Fast WordPiece Tokenization," Oct. 05, 2021, arXiv:2012.15524. [Online]. Available: http://arxiv.org/abs/2012.15524
[21] K. Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," Sep. 02, 2014, arXiv:1406.1078. [Online]. Available: http://arxiv.org/abs/1406.1078
[22] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000). Neural Computing: New Challenges and Perspectives for the New Millennium, Como, Italy: IEEE, 2000, pp. 189–194, vol. 3. doi: 10.1109/IJCNN.2000.861302.
[23] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017, doi: 10.1109/TNNLS.2016.2582924.
[24] L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, and H. Hajimirsadegh, "Were RNNs All We Needed?," Oct. 04, 2024, arXiv:2410.01201. [Online]. Available: http://arxiv.org/abs/2410.01201
[25] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," Feb. 22, 2019, arXiv:1804.07461. [Online]. Available: http://arxiv.org/abs/1804.07461
[26] A. Gu and T. Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," May 31, 2024, arXiv:2312.00752. [Online]. Available: http://arxiv.org/abs/2312.00752
[27] T. Reinhart, "The syntactic domain of anaphora," Massachusetts Institute of Technology, Cambridge (MA), 1976.
[28] J. R. Ross, "Constraints on variables in syntax," MIT, Cambridge (MA), 1967.
[29] C. Cecchetto, "A Comparative Analysis of Left and Right Dislocation in Romance," Studia Linguistica, vol. 53, no. 1, pp. 40–67, Apr. 1999, doi: 10.1111/1467-9582.00039.
[30] N. Chomsky et al., Merge and the Strong Minimalist Thesis, 1st ed. Cambridge University Press, 2023. doi: 10.1017/9781009343244.
[31] C. Chesi, "Expectation-based Minimalist Grammars," arXiv:2109.13871 [cs], Sep. 2021. [Online]. Available: http://arxiv.org/abs/2109.13871
[32] C. Chesi et al., "Different Ways to Forget: Linguistic Gates in Recurrent Neural Networks," in Proceedings of the BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, 2024.
[33] L. Gao et al., "A framework for few-shot language model evaluation." Zenodo, Dec. 2023. doi: 10.5281/zenodo.10256836.
[34] B. Hart and T. R. Risley, "American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments," Developmental Psychology, vol. 28, no. 6, pp. 1096–1105, Nov. 1992, doi: 10.1037/0012-1649.28.6.1096.
[35] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, and M. Baroni, "Colorless Green Recurrent Networks Dream Hierarchically," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1195–1205. doi: 10.18653/v1/N18-1108.
[36] W. de Vries and M. Nissim, "As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 836–846. doi: 10.18653/v1/2021.findings-acl.74.
[37] F. Xue, Y. Fu, W. Zhou, Z. Zheng, and Y. You, "To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis," 2023, arXiv. doi: 10.48550/ARXIV.2305.13230.
[38] E. Wilcox, R. Futrell, and R. Levy, "Using Computational Models to Test Syntactic Learnability," Linguistic Inquiry, pp. 1–44, Apr. 2023, doi: 10.1162/ling_a_00491.
[39] C. Yang, S. Crain, R. C. Berwick, N. Chomsky, and J. J. Bolhuis, "The growth of language: Universal Grammar, experience, and principles of computation," Neuroscience & Biobehavioral Reviews, vol. 81, pp. 103–119, Oct. 2017, doi: 10.1016/j.neubiorev.2016.12.023.

A. Online Resources

Resources (corpus information, tokenizer, network architectures and lm_eval tasks) are available at https://github.com/cristianochesi/babylm-2024.