<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Recurrent Networks are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Achille</forename><surname>Fusco</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matilde</forename><surname>Barbini</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maria Letizia</forename><surname>Piccini Bianchessi</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Veronica</forename><surname>Bressan</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sofia</forename><surname>Neri</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sarah</forename><surname>Rossi</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tommaso</forename><surname>Sgrizzi</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Cristiano</forename><surname>Chesi</surname></persName>
							<email>cristiano.chesi@iusspavia.it</email>
							<affiliation key="aff0">
								<orgName type="laboratory">NeTS Lab</orgName>
								<orgName type="institution">IUSS Pavia</orgName>
								<address>
									<addrLine>P.zza Vittoria 15</addrLine>
									<postCode>27100</postCode>
									<settlement>Pavia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Recurrent Networks are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">A72235413C076F278FAC7B4CA2DAE1A9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>LSTM</term>
					<term>Transformers</term>
					<term>Small Language Models (SLM)</term>
					<term>tokenization</term>
					<term>cell state control</term>
					<term>LM evaluation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Here we discuss strategies and results of a small-sized training program based on Italian child-directed speech (less than 3M tokens) for various network architectures. The rationale behind these experiments <ref type="bibr" target="#b0">[1]</ref> lies in the attempt to understand the effect of this naturalistic training diet on different model architectures. Preliminary findings lead us to conclude that: (i) different tokenization strategies produce mildly significant improvements overall, although segmentation aligns more closely with linguistic intuitions in some cases but not in others; (ii) modified LSTM networks (the eMG-RNN variant) with a single layer and a structurally more controlled cell state perform slightly worse in training loss (compared to standard one- and two-layered LSTM models) but better on linguistically critical contrasts. This suggests that standard loss/accuracy metrics in autoregressive training procedures are linguistically irrelevant and, more generally, misleading, since the best-trained models produce poorer linguistic predictions ([2], pace [3]). Overall, the performance of these models remains significantly lower than that of 7-year-old native-speaker children on the relevant linguistic contrasts we considered <ref type="bibr" target="#b3">[4]</ref>.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>According to the mainstream LLM development pipeline, Transformer-based architectures <ref type="bibr" target="#b4">[5]</ref> outperform sequential models, like LSTMs <ref type="bibr" target="#b5">[6]</ref>, in various NLP tasks. When only small-sized training data are available, optimization becomes necessary <ref type="bibr" target="#b6">[7]</ref>, <ref type="bibr" target="#b7">[8]</ref>, but common optimization techniques neglect three linguistically relevant facts: these models (i) conflate semantic/world knowledge with morpho-syntactic competence, (ii) require unreasonably large amounts of training data compared to the input children receive during language acquisition, and (iii) yield diminishing cognitive/linguistic returns as their performance increases <ref type="bibr" target="#b8">[9]</ref>. In this paper we address these three issues, starting from the observation that while world knowledge exploits all the training data available, the more the better, structural (morpho-syntactic and compositional semantic) knowledge might require a much smaller dataset (from 10 to 100 million words, according to <ref type="bibr" target="#b9">[10]</ref>). We explore this intuition further and, based on a rich literature from the '80s showing that typical child errors are structurally sensitive and never random <ref type="bibr" target="#b10">[11]</ref>, we shape the networks' architecture to bias learning towards plausible structural configurations, possibly preventing these "small" language models (SLMs) from producing wrong linguistic generalizations. We started from a mild revision of the LM training and evaluation pipeline for Italian, including alternative approaches to tokenization based on pseudo-morphological decomposition ( §2.2); we then approached a more structurally-driven update of the cell state in LSTM networks, which we will call eMG-RNN variants ( §2.3); we finally adopted a precise testing benchmark for specific linguistic contrasts in Italian following the BLiMP design <ref type="bibr" target="#b11">[12]</ref> ( §2.4). We will first set the stage ( §2), discuss an alternative tokenization strategy (MorPiece), and propose a simple modification of the LSTM gating system that mimics certain linguistic constraints. We will then describe the relevant experiments we have run ( §3) and draw some conclusions from the observed results ( §4). A general discussion with a description of the next steps will conclude the paper ( §5).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Revisiting the LM training pipeline</head><p>The LM training pipeline is relatively rigid: after corpus cleaning (i), the data are prepared/optimized for tokenization (ii), then the tokenized input is batched for training autoregressive models (iii), mostly feeding Transformer-based architectures (iv). Once the models are trained, the evaluation step requires assessing them on some standard tasks (v). In the next sub-sections, we identify various criticalities in this pipeline, propose strategies to mitigate these problems and, in the end, train linguistically more informative SLMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Corpus creation and cleaning</head><p>The primary data we collected for Italian replicate the plausible linguistic input children may be exposed to during acquisition, in line with <ref type="bibr" target="#b0">[1]</ref>. The corpus consists of about 3M tokens divided into child-directed speech (CHILDES, Italian section), child movie subtitles (from OpenSubtitles), child songs (from the Zecchino D'Oro repository), telephone conversations (VoLIP corpus, <ref type="bibr" target="#b12">[13]</ref>), and fairy tales (all from copyright-expired sources). Simple cleaning consisted of removing children's productions from the CHILDES files as well as any other metalinguistic annotation (speaker identification, headers, time stamps, tags, links, etc.). The size and a rough measure of lexical richness (Type-Token Ratio, TTR) of each section are reported in Table <ref type="table">1</ref>, before and after the cleaning procedure.</p></div>
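<div xmlns="http://www.tei-c.org/ns/1.0"><p>For illustration, the sketch below shows the kind of filtering applied to the CHILDES files. It is a minimal sketch assuming standard CHAT conventions (@ headers, %-prefixed dependent tiers, *SPK: utterance lines); the regular expressions are illustrative, not our exact cleaning script.</p><code>
import re

def clean_chat_file(text):
    """Keep adult utterances; drop the child's productions (*CHI:),
    headers, dependent tiers, and inline annotation."""
    kept = []
    for line in text.splitlines():
        if line.startswith("@") or line.startswith("%"):
            continue  # @Begin/@ID headers, %mor/%gra tiers, etc.
        m = re.match(r"\*([A-Z0-9]{3}):\s*(.*)", line)
        if not m:
            continue
        speaker, utterance = m.groups()
        if speaker == "CHI":
            continue  # remove the child's own productions
        utterance = re.sub(r"\x15[^\x15]*\x15", "", utterance)  # time stamps
        utterance = re.sub(r"\[[^\]]*\]", "", utterance)        # bracketed codes
        kept.append(" ".join(utterance.split()))
    return kept
</code></div>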
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Corpus profiling before (bc) and after (ac) cleaning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Tokenization: MorPiece (MoP)</head><p>Popular LLMs use either Byte-Pair Encoding (BPE) <ref type="bibr" target="#b13">[14]</ref>, <ref type="bibr" target="#b14">[15]</ref> or (fast)WordPiece (fWP) <ref type="bibr" target="#b15">[16]</ref> algorithms for tokenization. The simplicity and computational efficiency of these approaches contrast with the limited morphological analysis they provide. In richly inflected languages (e.g., Italian) and agglutinative languages (e.g., Finnish), this might induce linguistically unsound generalizations. Here, we explore a more morphologically informed strategy, inspired by the Tolerance Principle (TP) and Sufficiency Principle (SP) <ref type="bibr" target="#b16">[17]</ref>, aiming to break words into potentially relevant morphemes without relying on morpheme tables <ref type="bibr" target="#b17">[18]</ref>. The experiments we conduct compare the impact of different strategies when integrated into various network architectures. We refer to MorPiece (MoP) as a TP/SP-based strategy, which can be algorithmically described as follows: each token is traversed from left to right to create a "root trie", and from right to left to create an "inflectional trie" <ref type="bibr" target="#b18">[19]</ref>. Each time a node N of the trie is traversed (corresponding to the current character path in the word), the frequency counter associated with this node (N_c) is updated (+1). Nodes corresponding to token endings (characters before white spaces or punctuation) are flagged. Once both tries are created, the optimization procedure explores each descendant and, for every daughter node D_k, its frequency k is compared to H_N, the approximation of the harmonic number for N used both in TP and SP <ref type="bibr" target="#b16">[17]</ref>, where c is the frequency of the mother node N_c:</p><formula xml:id="formula_0">H_N = c / ln(c) (F1)</formula><p>If k &gt; H_N and c ≠ k, a productive boundary break is postulated (based on the inference that, since there are different continuations and some of them are productive, i.e. sufficiently frequent according to SP, those might be real independent morphemes). We can then check whether this break respects H_D for the relevant nodes D_j and N_i in the "inflectional trie": there must exist a path where the frequency i of the daughter node N_i (in the "inflectional trie" the dependency between D and N is reversed) is lower than j/ln(j), where j is the frequency of the mother node D_j. If this is the case, the continuation is not considered "an exception" in the sense of TP <ref type="bibr" target="#b16">[17]</ref>, suggesting that the continuation is, in fact, a productive independent morpheme. A "++" root node is then activated, the node D_k linked to it, and so on recursively, following the FastWordPiece tokenization strategy <ref type="bibr" target="#b19">[20]</ref>.</p><p>During recognition, the LinMaxMatch identification approach is adopted, as in FastWordPiece. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the relevant morpheme breaks (indicated as "||") obtained by applying this morpheme-breaking procedure to the root and infl trie fragments. 
Various parametric controls have been considered to tune this procedure: (i) a branching factor (bf) parameter that excludes nodes with an excessively high number (&gt; bf) of continuations (the rationale being that when too many continuations are present, they are unlikely to correspond to inflections; this often happens near the root of each trie); (ii) a cutoff parameter indicating the lower frequency boundary for a mother node (this is necessary to ensure a minimum number of observations; for example, if cutoff = 5, we exclude from the "root" trie any branching daughter with a frequency &lt; 5). As in BPE, a minimum frequency control for tokens is also implemented to exclude infrequent dictionary entries. Consider the word "cerca" ("to search for") represented in the "root" trie. In the last "c-a" segment, the relation between H_c and the frequency of "a" indicates that a break might exist between the nodes "c" (frequency = 1813) and "a" (frequency = 1307), since H_c = 1813/ln(1813) ≈ 241.6 and 1307 &gt; H_c. This hypothesis is confirmed by the check at the relevant "infl" "a-c" segment ("a" frequency = 10121, "c" frequency = 466619): 10121 &lt; 466619/ln(466619) ≈ 35756. If H_c had been greater than the frequency of "a", no segmentation advantage would have been observable.</p><p>The proposed algorithm has linear time complexity (O(2n), i.e., O(n)), as each trie must be explored deterministically exactly once to evaluate the H_N/H_D frequency relations. The best linguistic results (relatively linguistically coherent segmentations) for our Italian corpus were obtained with cutoff = 100 and bf = 10. We found it unnecessary to filter the proposed inflectional breaks with the infl trie double check (TP), since the LinMaxMatch strategy already efficiently filtered out initially overestimated breaks. However, as an anonymous reviewer correctly pointed out, this strategy does not guarantee full coverage of every token in our training corpus (in contrast to BPE, for instance). We acknowledge this limitation, but we emphasize that our goal was to produce a smaller, potentially more efficient lexicon. In our experiments, while BPE generated a lexicon of 96,028 tokens (67,169 when the minimum lexical frequency was set to 2), MoP produced a lexicon of just 55,049 tokens (cutoff = 100, bf = 10).</p></div>
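<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the two tries and of the (F1) productivity test follows, using the "cerca" frequencies above as a usage check; the cutoff/bf controls, the "++" root activation, and LinMaxMatch recognition are omitted, and all names are ours.</p><code>
import math

def build_trie(tokens, reverse=False):
    # Each node stores its traversal frequency (N_c) and an end-of-token flag;
    # reverse=True builds the right-to-left "inflectional" trie.
    root = {"count": 0, "children": {}, "end": False}
    for tok in tokens:
        node = root
        for ch in (tok[::-1] if reverse else tok):
            node = node["children"].setdefault(
                ch, {"count": 0, "children": {}, "end": False})
            node["count"] += 1
        node["end"] = True
    return root

def productive_break(c, k):
    # F1: postulate a boundary when the daughter frequency k exceeds
    # H_N = c / ln(c), with c the mother frequency and c != k.
    return c &gt; 1 and k != c and k &gt; c / math.log(c)

# "cerca" example from the text: the root trie suggests a break before "a" ...
assert productive_break(1813, 1307)        # 1307 &gt; 1813/ln(1813) ≈ 241.6
# ... and the "infl" check does not treat the continuation as an exception:
assert 10121 &lt; 466619 / math.log(466619)   # ≈ 35756, as reported above
</code></div>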
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Revisiting the LSTM architecture</head><p>Despite the many variants of the standard LSTM architecture, notably Gated Recurrent Units <ref type="bibr" target="#b20">[21]</ref> and LSTMs augmented with peephole connections <ref type="bibr" target="#b21">[22]</ref>, and the discouraging equivalence results for these variations <ref type="bibr" target="#b22">[23]</ref>, we observe a recent revival of RNN-based model architectures <ref type="bibr" target="#b23">[24]</ref>. We believe, in fact, that the core intuition behind the LSTM architecture may be linguistically relevant and worth exploring further, although generally more performant models (for instance, in terms of the GLUE benchmark, <ref type="bibr" target="#b24">[25]</ref>) are usually preferred <ref type="bibr" target="#b25">[26]</ref>. The linguistic intuition is that the "long-term memory" (cell state C in Figure <ref type="figure" target="#fig_3">2</ref>) in LSTM networks could effectively model various types of non-local dependencies using a single mechanism. Linguistically speaking, filler-gap dependencies (1) and co-referential dependencies (2) are both "non-local dependencies", but they are subject to non-identical locality conditions: (2) a. [il panino]_i, chi credi che lo_i abbia mangiato? the sandwich, who (you) believe it has eaten? b. *[il panino]_i, chi credi che _i abbia mangiato? the sandwich, who (you) believe has eaten? 'the sandwich, who do you believe has eaten *(it)?'</p><p>While both dependencies require C(onstituent)-command generalizations to be captured <ref type="bibr" target="#b26">[27]</ref>, the adjunct island in (1), <ref type="bibr" target="#b27">[28]</ref>, but not the clitic left-dislocation in (2), <ref type="bibr" target="#b28">[29]</ref>, can, for instance, be licensed by a(n extra) gap, (1)b'. Aware of these differences, we decided to simply alter the gating system to allow the LSTM to create distinct pathways: one to "merge" new tokens, the other to decide whether a long-distance dependency is necessary and, subsequently, to "move" the relevant items <ref type="bibr" target="#b29">[30]</ref>. The processing implementation of these operations is inspired by the expectation-based Minimalist Grammars formalism, eMG <ref type="bibr" target="#b30">[31]</ref>, hence the name eMG-RNN. Following this implementation, merge applies incrementally, token by token, and move means "retain in memory". In more detail, the cell of an eMG-RNN network performs the forward processing described in the computational graph in Figure <ref type="figure" target="#fig_3">2</ref>: (i) the input at time t (x_t) is linearly transformed to a lower-dimension vector (E, loosely used for "embedding"), then concatenated (C) with the previous hidden state/output, if any (h_{t-1}). Two pathways, both transformed using a sigmoid function (σ), lead to the move gate on the one hand and to the merge gate on the other. In the first case, the result of the sigmoid transformation is multiplied (⊙, the Hadamard product) with the input (this either erases or allows some component of the original vector to be added (+) to the previous, if any, context/cell state (c_{t-1}), as in the LSTM forget gate). The merge gate, in the other direction, will privilege the new token if the result of the sigmoid combination of the incoming token and the previous hidden state is low; otherwise (1 - this activation, as in the GRU update gate), it will favor the items in the context/cell state (transformed through a tanh function to simulate memory decay). 
This architecture proved the most performant among the various alternatives tested for the BabyLM 2024 challenge <ref type="bibr" target="#b32">[32]</ref>.</p></div>
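<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal PyTorch sketch of the cell's forward pass, as described above, is given below; the exact dimensions and the extra projection aligning E with the cell state are implementation choices not fixed by the prose, and all parameter names are ours.</p><code>
import torch
import torch.nn as nn

class EMGRNNCell(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size):
        super().__init__()
        self.embed = nn.Linear(input_size, embed_size)   # linear map from x_t to E
        self.move  = nn.Linear(embed_size + hidden_size, hidden_size)
        self.merge = nn.Linear(embed_size + hidden_size, hidden_size)
        self.proj  = nn.Linear(embed_size, hidden_size)  # align E with c/h

    def forward(self, x_t, h_prev, c_prev):
        e = self.embed(x_t)                  # lower-dimensional embedding E
        z = torch.cat([e, h_prev], dim=-1)   # concatenation with h_{t-1}
        mv = torch.sigmoid(self.move(z))     # "move" gate
        mg = torch.sigmoid(self.merge(z))    # "merge" gate
        # move: erase/keep components of the input, add them to the cell state
        c_t = c_prev + mv * self.proj(e)
        # merge: a low gate favors the new token, a high gate the decayed context
        h_t = (1 - mg) * self.proj(e) + mg * torch.tanh(c_t)
        return h_t, c_t
</code></div>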
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">A linguistically informed evaluation</head><p>The last step in the pipeline requires a linguistically advanced set of oppositions to verify that structural generalizations are captured coherently. We adopted the lm-eval package <ref type="bibr" target="#b33">[33]</ref> and included a specific task modeled on the English BLiMP <ref type="bibr" target="#b11">[12]</ref>. Most of the contrasts are derived from the COnVERSA test <ref type="bibr" target="#b3">[4]</ref>. They consist of minimal pairs ordered by an increasing complexity metric that considers the number of operations necessary to establish a dependency and the locality of such a dependency. The examples below illustrate this point by comparing a local agreement dependency with, (3)b, or without, (3)a, a (linear) intervener, and a more complex dependency that requires processing an object relative clause (4): Quello che gli studenti ascoltano ("The one who the students listen to") vs. Quello che ascolta gli studenti ("The one who listens to the students"). Four kinds of dependency (agreement, thematic role assignment, pronominal form usage, question formation and answering) are considered for a set of 32 distinct syntactic configurations (a total of 344 minimal pairs to be judged, <ref type="bibr" target="#b3">[4]</ref>).</p></div>
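<div xmlns="http://www.tei-c.org/ns/1.0"><p>Operationally, each minimal pair is judged by comparing sentence-level log-likelihoods, as in BLiMP. Below is a minimal sketch assuming a HuggingFace-style causal-LM interface (the wrapper format we use for lm-eval compatibility, see §3.1).</p><code>
import torch

def judge_pair(model, tokenizer, good, bad):
    """Return True when the model prefers the grammatical sentence,
    i.e. assigns it a higher mean token log-likelihood."""
    def logprob(sentence):
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(input_ids=ids, labels=ids)  # causal-LM loss
        return -out.loss.item()
    return logprob(good) &gt; logprob(bad)

# e.g. pair (3)a: judge_pair(model, tok, "Il piatto è pieno.", "Il piatto è piena.")
</code></div>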
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Materials and Methods</head><p>We trained our models on the IUSS High-Performance Cluster with 2 GPU nodes, each with 4 NVIDIA A100 devices and 1 TB of RAM. Each network was trained on the full corpus using various batching strategies. CUDA drivers v12.4 were used. The most relevant configurations tested are discussed in the next section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Configurations tested</head><p>Three different tokenization strategies (BPE, FastWordPiece, and MorPiece) are compared using the best-performing LSTM network <ref type="bibr" target="#b35">[35]</ref>, which consists of 650 units for the embedding layer and 650 nodes for each of the two hidden layers. Five different network architectures are compared, with the GroNLP GPT-2 small pretrained model <ref type="bibr" target="#b36">[36]</ref> constituting our "top LLM performer". This model was re-adapted to Italian from the English GPT-2 model, which was originally trained on a corpus of approximately 10 billion tokens, that is, several orders of magnitude larger than ours. We then trained on our corpus a comparable bidirectional Transformer (BERT), two LSTM networks with 1 and 2 LSTM layers respectively, and a one-layer eMG-RNN network (Table <ref type="table">2</ref>), as described in §2.3.</p></div>
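<div xmlns="http://www.tei-c.org/ns/1.0"><p>For concreteness, the LSTM configurations just described reduce to the following sketch; the vocabulary size depends on the chosen tokenizer (the BPE figure from §2.2 is used here purely as an example).</p><code>
import torch.nn as nn

EMB, HID = 650, 650   # embedding units and nodes per hidden layer, as above

class LSTMLm(nn.Module):
    def __init__(self, vocab_size, num_layers):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB)
        self.rnn = nn.LSTM(EMB, HID, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(HID, vocab_size)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.out(h)   # next-token logits

lstm_x1 = LSTMLm(vocab_size=67169, num_layers=1)  # e.g. the BPE lexicon (min freq 2)
lstm_x2 = LSTMLm(vocab_size=67169, num_layers=2)
</code></div>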
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Network architectures</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Comparing BERT and LSTM architectures, LSTMx1 qualifies as the most performant configuration (both in training and in minimal-pair judgments). Considering training, the only batching regimen performing sufficiently well is the fixed sequence length one (loss = 0.8877 with LSTMx1 vs. conversational loss = 4.0240 or naturalistic loss = 4.5884). All networks reached a learning plateau around 10-12 epochs. Comparing performances on COnVERSA, we realized that the results do not improve after 3 epochs of the fixed-sequence-length (60 tokens) training regimen (a result compatible with the overfitting hypothesis, <ref type="bibr" target="#b37">[37]</ref>). Focusing on tokenizer training results with LSTMx1, we observed that BPE and FastWordPiece have comparable performance. MorPiece performs slightly worse, even though its tokenization seems linguistically more coherent (e.g., "farlo", "to do it", is tokenized by both BPE and fWP as a single token, while MorPiece splits it in two: "far" "+lo") and its training is faster (Table <ref type="table">3</ref>). This, however, only marginally impacts the minimal-pair contrast judgments, with slightly better overall performance just in certain agreement cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3 Impact of the tokenization strategy on LSTM training</head><p>We then adopted the BPE tokenizer for the architectural comparisons. Network training performances are summarized in Table <ref type="table" target="#tab_1">4</ref> and graphically represented in Figure <ref type="figure" target="#fig_6">3</ref> for the comparison across linguistic dimensions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>Overall, LSTM networks significantly outperform bidirectional Transformers in this minimal-pair test on Italian. This finding is consistent with results previously discussed in the literature; it suggests a clear advantage of recurrent, sequential model architectures (e.g., LSTM) over bidirectional Transformers in terms of linguistic generalizations <ref type="bibr" target="#b38">[38]</ref> and partially justifies the renewed interest in RNN networks observed in the last couple of years <ref type="bibr" target="#b23">[24]</ref>, <ref type="bibr" target="#b25">[26]</ref>. As far as the tokenization procedure is concerned, it is somewhat premature to draw definitive conclusions from our experiments, as MorPiece has not yet been fully optimized or tested. Specifically, the optimal cutoff threshold and minimum branching factor have not been systematically evaluated. Nevertheless, a more morphologically coherent segmentation is expected to enhance sensitivity to certain minimal contrasts.</p><p>Similarly, the eMG-RNN architecture could be further explored and optimized, particularly with respect to specific contrasts, which may help determine whether our linguistic modeling is on the right track. Evidence to the contrary comes from the judgments of sentences with missing thematic roles, which are often incorrectly preferred by most models, including our eMG-RNN.</p><p>In the end, our results suggest that the Loss/Accuracy performance registered during training is not a significant predictor of performance on the COnVERSA test or, more generally, of the linguistic coherence of the trained LM. Likewise, model size is not a clear predictor either: Transformers trained on the same small dataset perform randomly (around 50% in all dimensions), while eMG-RNN, which has a number of parameters similar to LSTMx2, outperforms both LSTMx2 and LSTMx1 (half the size of eMG-RNN). The training size remains strikingly different from the input received by children: this difference of one order of magnitude suggests that the biases considered in eMG-RNN are not yet satisfactory and that our Language Acquisition Device is still more efficient; in this sense, the Poverty of Stimulus Hypothesis remains unrefuted <ref type="bibr" target="#b39">[39]</ref> by these results. Next steps will include extending the training corpus to 10M tokens (to match the English counterpart <ref type="bibr" target="#b0">[1]</ref>) and further exploring the effects of optimized tokenization procedures and other minimal modifications and optimizations <ref type="bibr" target="#b23">[24]</ref> of recurrent neural networks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Visualization of a fragment of the "root" and the "infl(ectional)" trie created by MorPiece on our corpus (cutoff=100, bf=10).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>( 1 )</head><label>1</label><figDesc>a. cosa_i credi che abbia riposto _i? what (you) believe that (he) shelved? what do you believe he shelved? b. *cosa_i credi che abbia riposto il libro [AdvP senza leggere _i]? b'. cosa_i credi che abbia riposto _i [AdvP senza leggere _i]? what do you believe he shelved (*the book) without reading?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: eMG-RNN cell computational graph.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>( 3 )</head><label>3</label><figDesc>a. Il piatto è pieno. vs. Il piatto è piena. the dish.S.M is full.S.M … full.S.F b. Il muro della casa è rosso. the wall.S.M of the house is red.S.M vs. Il muro della casa è rossa. the wall.S.M of the house is red.S.F (4) Ci sono due maestri. Uno insegna ed è ascoltato dagli studenti, l'altro si riposa. Quale maestro insegna? There are two teachers. One teaches and is listened to by the students; the other rests. Which one teaches? Quello che gli studenti ascoltano. (The one who the students listen to.)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>(i) Naturalistic: line-by-line, single exposure to each sentence in the corpus (each epoch corresponds to an exposure of about 3M tokens); (ii) Conversational: two sequential lines are used for the input, that is, [line 1, line 2], [line 2, line 3], etc. are batched; this guarantees that a minimal conversational context is provided for each sentence. In this case, each epoch corresponds to an exposure of 6M tokens; (iii) Fixed sequence length: considering the average sentence length of 54 words per sentence, a window of 60 tokens is used, that is, [tok_1, tok_2 … tok_60], [tok_2, tok_3 … tok_61], … are batched; with this regimen, each epoch corresponds to an exposure of 180M tokens. Roughly speaking, the raw amount of data processed by a 7 y.o. child ranges from 7 to 70M tokens [34]; training the networks with a naturalistic or conversational regimen for 3-10 epochs would therefore result in a comparable exposure. We trained the networks using a torch.optim.lr_scheduler (step_size=5, gamma=0.1) and the Adam optimizer (lr=0.001), with 16-bit automatic mixed precision to speed up the (parallel) training, for a maximum of 100 epochs. The networks were implemented in PyTorch (v2.3.1) and wrapped in Transformers structures (v4.42.4) to maximize compatibility with the lm-eval (v0.4.3) environment.</figDesc></figure>
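<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the three batching regimens described above (tokenization and tensor packing omitted):</p><code>
def naturalistic(lines):
    # (i) one example per corpus line: about 3M tokens per epoch
    return list(lines)

def conversational(lines):
    # (ii) overlapping two-line windows: [line 1, line 2], [line 2, line 3], ...
    return [lines[i] + " " + lines[i + 1] for i in range(len(lines) - 1)]

def fixed_length(token_ids, window=60):
    # (iii) stride-1 sliding window over the whole token stream (about 180M tokens/epoch)
    return [token_ids[i:i + window]
            for i in range(len(token_ids) - window + 1)]
</code></div>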
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 4</head><label>4</label><figDesc>Network architectures and their performance on training (Loss/Accuracy) and on the COnVERSA test</figDesc><table><row><cell>Model</cell><cell>Loss/Accuracy</cell><cell>COnVERSA</cell></row><row><cell>GroNLP GPT-2s</cell><cell>n/a (pretrained)</cell><cell>0.73 (±0.02)</cell></row><row><cell>BERT</cell><cell>4.5488/0.65471</cell><cell>0.43 (±0.02)</cell></row><row><cell>LSTMx2</cell><cell>0.7849/0.8283</cell><cell>0.48 (±0.03)</cell></row><row><cell>LSTMx1</cell><cell>0.8784/0.8103</cell><cell>0.52 (±0.03)</cell></row><row><cell>eMG-RNN</cell><cell>0.9491/0.7815</cell><cell>0.61 (±0.01)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Performance of the 2 best RNN network variants (LSTMx1 and eMG-RNN with BPE) on COnVERSA compared to 7 y.o. children. [Radar chart, scale 0-1; axes cover the COnVERSA dimensions: DP/Subj-AP/Subj-V agreement (with and without attractors), past participle agreement, auxiliary selection, theta roles, psych verbs, clitic pronouns, reflexives, wh-/why-/polar questions, and person rotation.]</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This project is partially supported by T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics, a PRIN 2022 project funded by Next Generation EU (202223PL4N). National coordinator: CC.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>0000-0002-5389-8884 (A. Fusco); 0009-0007-7986-2365 (M. Barbini); 0009-0005-8116-3358 (M. L. Piccini Bianchessi); 0000-0003-3072-7967 (V. Bressan); 0009-0003-5456-0556 (S. Neri); 0009-0007-2525-2457 (S. Rossi); 0000-0003-1375-1359 (T. Sgrizzi); 0000-0003-1935-1348 (C. Chesi)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Warstadt</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.conll-babylm.0" />
		<title level="m">Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</title>
				<meeting>the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>: Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Why large language models are poor theories of human linguistic cognition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Katzir</surname></persName>
		</author>
		<idno>lingbuzz/007190</idno>
		<imprint>
			<date type="published" when="2023">2023. 2023</date>
		</imprint>
	</monogr>
	<note>A reply to Piantadosi</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Modern language models refute Chomsky&apos;s approach to language</title>
		<author>
			<persName><forename type="first">S</forename><surname>Piantadosi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Lingbuzz Preprint, lingbuzz</title>
		<imprint>
			<biblScope unit="volume">7180</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">COnVERSA: Test di Comprensione delle Opposizioni morfo-sintattiche VERbali attraverso la ScritturA</title>
		<author>
			<persName><forename type="first">C</forename><surname>Chesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ghersi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Musella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Musola</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
			<publisher>Hogrefe</publisher>
			<pubPlace>Firenze</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Attention Is All You Need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.03762</idno>
		<ptr target="http://arxiv.org/abs/1706.03762" />
		<imprint>
			<date type="published" when="2017-12">Dec. 2017. Mar. 26, 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Not all layers are equally as important: Every Layer Counts BERT</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">G G</forename><surname>Charpentier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Samuel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.conll-babylm.20</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</title>
				<meeting>the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="210" to="224" />
		</imprint>
	</monogr>
	<note>: Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ermon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rudra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ré</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.14135</idno>
		<ptr target="http://arxiv.org/abs/2205.14135" />
		<imprint>
			<date type="published" when="2022-06-23">Jun. 23, 2022. Jun. 12, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Large GPTlike Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures</title>
		<author>
			<persName><forename type="first">J</forename><surname>Steuer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mosbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Klakow</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.conll-babylm.12</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning</title>
				<meeting>the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="114" to="129" />
		</imprint>
	</monogr>
	<note>: Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">When Do You Need Billions of Words of Pretraining Data?</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Warstadt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2011.04946</idno>
		<ptr target="http://arxiv.org/abs/2011.04946" />
		<imprint>
			<date type="published" when="2020-11-10">Nov. 10, 2020. Jan. 10, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Structure Dependence in Grammar Formation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Crain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nakayama</surname></persName>
		</author>
		<idno type="DOI">10.2307/415004</idno>
	</analytic>
	<monogr>
		<title level="j">Language</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">522</biblScope>
			<date type="published" when="1987-09">Sep. 1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">BLiMP: The Benchmark of Linguistic Minimal Pairs for English</title>
		<author>
			<persName><forename type="first">A</forename><surname>Warstadt</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00321</idno>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="377" to="392" />
			<date type="published" when="2020-12">Dec. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">VOLIP: a corpus of spoken Italian and a virtuous example of reuse of linguistic resources</title>
		<author>
			<persName><forename type="first">I</forename><surname>Alfano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">De</forename><surname>Rosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Iacobini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Savy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Voghera</surname></persName>
		</author>
		<ptr target="http://www.lrec-conf.org/proceedings/lrec2014/pdf/906_Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Choukri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Declerck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Loftsson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Ninth International Conference on Language Resources and Evaluation (LREC&apos;14<address><addrLine>Reykjavik, Iceland</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2014-05">May 2014</date>
			<biblScope unit="page" from="3897" to="3901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Language Models are Few-Shot Learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.14165</idno>
		<ptr target="http://arxiv.org/abs/2005.14165" />
		<imprint>
			<date type="published" when="2020-07">Jul. 2020. Apr. 21, 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A new algorithm for data compression</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gage</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">C Users Journal</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="23" to="38" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">The price of linguistic productivity: how children learn to break the rules of language</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Yang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge, MA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">MorphPiece : A Linguistic Tokenizer for Large Language Models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Jabbar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.07262</idno>
		<ptr target="http://arxiv.org/abs/2307.07262" />
		<imprint>
			<date type="published" when="2024-02-03">Feb. 03, 2024. Jun. 23, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Trie memory</title>
		<author>
			<persName><forename type="first">E</forename><surname>Fredkin</surname></persName>
		</author>
		<idno type="DOI">10.1145/367390.367400</idno>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="490" to="499" />
			<date type="published" when="1960-09">Sep. 1960</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Fast WordPiece Tokenization</title>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Salcianu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dopson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2012.15524</idno>
		<ptr target="http://arxiv.org/abs/2012.15524" />
		<imprint>
			<date type="published" when="2021-10-05">Oct. 05, 2021. Jun. 13, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1406.1078</idno>
		<ptr target="http://arxiv.org/abs/1406.1078" />
		<imprint>
			<date type="published" when="2014-09-02">Sep. 02, 2014. Jun. 12, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Recurrent nets that time and count</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Gers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
		<idno type="DOI">10.1109/IJCNN.2000.861302</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium</title>
				<meeting>the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium<address><addrLine>Como, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="189" to="194" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">LSTM: A Search Space Odyssey</title>
		<author>
			<persName><forename type="first">K</forename><surname>Greff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Koutník</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">R</forename><surname>Steunebrink</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
		<idno type="DOI">10.1109/TNNLS.2016.2582924</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Neural Netw. Learning Syst</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="2222" to="2232" />
			<date type="published" when="2017-10">Oct. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Were RNNs All We Needed?</title>
		<author>
			<persName><forename type="first">L</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">O</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajimirsadeghi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2410.01201</idno>
		<ptr target="http://arxiv.org/abs/2410.01201" />
		<imprint>
			<date type="published" when="2024-10-04">Oct. 04, 2024. Oct. 18, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Michael</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.07461</idno>
		<ptr target="http://arxiv.org/abs/1804.07461" />
		<imprint>
			<date type="published" when="2019-02-22">Feb. 22, 2019. Jul. 20, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Mamba: Linear-Time Sequence Modeling with Selective State Spaces</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.00752</idno>
		<ptr target="http://arxiv.org/abs/2312.00752" />
		<imprint>
			<date type="published" when="2024-05-31">May 31, 2024. Oct. 20, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">The syntactic domain of anaphora</title>
		<author>
			<persName><forename type="first">T</forename><surname>Reinhart</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1976">1976</date>
			<publisher>Massachusetts Institute of Technology</publisher>
			<pubPlace>Cambridge (MA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">Constraints on variables in syntax</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Ross</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1967">1967</date>
			<pubPlace>Cambridge (MA</pubPlace>
		</imprint>
		<respStmt>
			<orgName>MIT</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">A Comparative Analysis of Left and Right Dislocation in Romance</title>
		<author>
			<persName><forename type="first">C</forename><surname>Cecchetto</surname></persName>
		</author>
		<idno type="DOI">10.1111/1467-9582.00039</idno>
	</analytic>
	<monogr>
		<title level="j">Studia Linguistica</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="40" to="67" />
			<date type="published" when="1999-04">Apr. 1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Merge and the Strong Minimalist Thesis</title>
		<author>
			<persName><forename type="first">N</forename><surname>Chomsky</surname></persName>
		</author>
		<idno type="DOI">10.1017/9781009343244</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
	<note>1st ed.</note>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Expectation-based Minimalist</title>
		<author>
			<persName><forename type="first">C</forename><surname>Chesi</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<idno type="arXiv">arXiv:2109.13871</idno>
		<ptr target="http://arxiv.org/abs/2109.13871" />
		<title level="m">Grammars</title>
				<imprint>
			<date type="published" when="2021-02">Sep. 2021. Nov. 02, 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Different Ways to Forget: Linguistic Gates in Recurrent Neural Networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Chesi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the BabyLM Challenge at the 28th Conference on Computational Natural Language Learning</title>
		<meeting>the BabyLM Challenge at the 28th Conference on Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">A framework for few-shot language model evaluation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.10256836</idno>
	</analytic>
	<monogr>
		<title level="j">Zenodo</title>
		<imprint>
			<date type="published" when="2023-12">Dec. 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments</title>
		<author>
			<persName><forename type="first">B</forename><surname>Hart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">R</forename><surname>Risley</surname></persName>
		</author>
		<idno type="DOI">10.1037/0012-1649.28.6.1096</idno>
	</analytic>
	<monogr>
		<title level="j">Developmental Psychology</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1096" to="1105" />
			<date type="published" when="1992-11">Nov. 1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Colorless Green Recurrent Networks Dream Hierarchically</title>
		<author>
			<persName><forename type="first">K</forename><surname>Gulordava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Baroni</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-1108</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-06">Jun. 2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1195" to="1205" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages</title>
		<author>
			<persName><forename type="first">W</forename><surname>Vries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-acl.74</idno>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</title>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="836" to="846" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis</title>
		<author>
			<persName><forename type="first">F</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>You</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2305.13230</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Using Computational Models to Test Syntactic Learnability</title>
		<author>
			<persName><forename type="first">E</forename><surname>Wilcox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Futrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Levy</surname></persName>
		</author>
		<idno type="DOI">10.1162/ling_a_00491</idno>
	</analytic>
	<monogr>
		<title level="m">Linguistic Inquiry</title>
				<imprint>
			<date type="published" when="2023-04">Apr. 2023</date>
			<biblScope unit="page" from="1" to="44" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">The growth of language: Universal Grammar, experience, and principles of computation</title>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Crain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Berwick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chomsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Bolhuis</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neubiorev.2016.12.023</idno>
	</analytic>
	<monogr>
		<title level="j">Neuroscience &amp; Biobehavioral Reviews</title>
		<imprint>
			<biblScope unit="volume">81</biblScope>
			<biblScope unit="page" from="103" to="119" />
			<date type="published" when="2017-10">Oct. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">A</forename></persName>
		</author>
		<ptr target="https://github.com/cristianochesi/babylm-2024" />
		<title level="m">Online Resources Resources (corpus information, tokenizer, network architectures and lm_eval tasks)</title>
				<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
