Recurrent Networks Are (Linguistically) Better? An Experiment on Small-LM Training on Child-Directed Speech in Italian

Achille Fusco1,†, Matilde Barbini1,†, Maria Letizia Piccini Bianchessi1,†, Veronica Bressan1,†, Sofia Neri1,†, Sarah Rossi1,†, Tommaso Sgrizzi1,† and Cristiano Chesi1,∗,†

1 NeTS Lab, IUSS Pavia, P.zza Vittoria 15, 27100 Pavia, Italy

Abstract
Here we discuss strategies and results of a small-sized training program based on Italian child-directed speech (less than 3M tokens) for various network architectures. The rationale behind these experiments [1] lies in the attempt to understand the effect of this naturalistic training diet on different model architectures. Preliminary findings lead us to conclude that: (i) different tokenization strategies produce only mildly significant improvements overall, although the resulting segmentation aligns more closely with linguistic intuitions in some cases but not in others; (ii) modified LSTM networks (the eMG-RNN variant) with a single layer and a structurally more controlled cell state perform slightly worse in training loss (compared to standard one- and two-layered LSTM models) but better on linguistically critical contrasts. This suggests that standard loss/accuracy metrics in autoregressive training procedures are linguistically irrelevant and, more generally, misleading, since the best-trained models produce poorer linguistic predictions ([2], pace [3]). Overall, the performance of these models remains significantly lower than that of 7-year-old native-speaker children on the relevant linguistic contrasts we considered [4].

Keywords
LSTM, Transformers, Small Language Models (SLM), tokenization, cell state control, LM evaluation

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
Email: cristiano.chesi@iusspavia.it (C. Chesi)
ORCID: 0000-0002-5389-8884 (A. Fusco); 0009-0007-7986-2365 (M. Barbini); 0009-0005-8116-3358 (M. L. Piccini Bianchessi); 0000-0003-3072-7967 (V. Bressan); 0009-0003-5456-0556 (S. Neri); 0009-0007-2525-2457 (S. Rossi); 0000-0003-1375-1359 (T. Sgrizzi); 0000-0003-1935-1348 (C. Chesi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1. Introduction

According to the mainstream LLM development pipeline, Transformer-based architectures [5] outperform sequential training models, like LSTMs [6], in various NLP tasks. When only small-sized training data are available, optimization becomes necessary [7], [8], but common optimization techniques neglect the linguistically relevant fact that these models (i) conflate semantic/world knowledge with morpho-syntactic competence, (ii) require unreasonably large amounts of training data compared to what children need during language acquisition, and (iii) yield lower returns in cognitive/linguistic terms the higher their performance gets [9]. In this paper we address these three issues, starting from the observation that while world knowledge exploits all the training data available (and the more the better), structural (morpho-syntactic and compositional-semantic) knowledge might require a much smaller dataset (from 10 to 100 million words, according to [10]). We explore this intuition further and, building on a rich literature from the '80s showing that typical child errors are structurally sensitive and never random [11], we model the networks' architecture so as to bias learning towards plausible structural configurations, possibly preventing these "small" language models (SLMs) from producing wrong linguistic generalizations. We started from a mild revision of the LM training and evaluation pipeline for Italian, including alternative approaches to tokenization based on pseudo-morphological decomposition (§2.2); we then approached a more structurally-driven update


of the cell state in LSTM networks, which we will call eMG-RNN variants (§2.3); we finally adopted a precise testing benchmark for specific linguistic contrasts in Italian, following the BLiMP design [12] (§2.4). We will first set the stage (§2) and discuss one alternative tokenization strategy (MorPiece); a simple modification of the LSTM gating system is then proposed that mimics certain linguistic constraints. We will then describe the relevant experiments we have run (§3) and draw some conclusions based on the observed results (§4). A general discussion with a description of the next steps concludes the paper (§5).

2. Revisiting the LM training pipeline

The LM training pipeline is relatively rigid: after corpus cleaning (i), the data are prepared/optimized for tokenization (ii), then the tokenized input is batched for training autoregressive models (iii), mostly feeding transformer-based architectures (iv). Once the models are trained, the evaluation step requires their assessment on some standard tasks (v). In the next sub-sections, we identify various criticalities in this pipeline, propose strategies to mitigate these problems and, in the end, train linguistically more informative SLMs.

2.1. Corpus creation and cleaning

The primary data we collected for Italian replicate plausible linguistic input that children may be exposed to during acquisition, in line with [1]. The corpus consists of about 3M tokens divided into child-directed speech (the Italian section of CHILDES), child movie subtitles (from OpenSubtitles), child songs (from the Zecchino D'Oro repository), telephone conversations (the VoLIP corpus, [13]), and fairy tales (all from copyright-expired sources). Simple cleaning consisted of removing children's productions from the CHILDES files as well as any other metalinguistic annotation (speaker identification, headers, time stamps, tags, links, etc.). The size and rough lexical richness (Type-Token Ratio, TTR) of each section are reported in Table 1, before and after the cleaning procedure.
Table 1
Corpus profiling before (bc) and after (ac) cleaning.

Section         tokens bc   tokens ac   TTR
Childes           405892      346155    0.03
Subtitles         959026      700729    0.05
Conversations      80826       58039    0.11
Songs             240309      222572    0.08
Fairy tales      1103543     1287826    0.05
Total            2973879     2431038    0.03
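For the CHILDES section, this cleaning amounts to keeping only the adult tiers of each CHAT transcript. A minimal sketch of such a filter is given below; the speaker codes to exclude and the amount of CHAT markup to strip are our assumptions, and the scripts actually used are those released with the corpus (see Appendix A).

```python
import re
from pathlib import Path

def clean_chat_transcript(path, exclude_speakers=("CHI",)):
    """Keep only adult utterances from a CHILDES CHAT (.cha) transcript:
    @-headers, %-dependent tiers and child (*CHI) lines are dropped, and
    time stamps / bracketed annotations are stripped from what remains."""
    kept = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        match = re.match(r"\*([A-Z0-9]+):\t?(.*)", line)
        if not match or match.group(1) in exclude_speakers:
            continue  # header, dependent tier, or a child production
        utterance = re.sub(r"\x15\d+_\d+\x15", "", match.group(2))  # media time stamps
        utterance = re.sub(r"\[.*?\]", "", utterance)               # bracketed annotations
        kept.append(" ".join(utterance.split()))
    return kept
```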
2.2. Tokenization: MorPiece (MoP)

Popular LLMs use either Byte-Pair Encoding (BPE) [14], [15] or (fast)WordPiece (fWP) [16] algorithms for tokenization. The simplicity and computational efficiency of these approaches contrast with the limited morphological analysis they provide. In richly inflected languages (e.g., Italian) and agglutinative languages (e.g., Finnish), this might induce linguistically unsound generalizations. Here, we explore a more morphologically informed strategy, inspired by the Tolerance Principle (TP) and the Sufficiency Principle (SP) [17], aiming to break words into potentially relevant morphemes without relying on morpheme tables [18]. The experiments we conduct compare the impact of different strategies when integrated into various network architectures. We refer to MorPiece (MoP) as a TP/SP-based strategy, which can be algorithmically described as follows: each token is traversed from left to right to create a "root trie", and from right to left to create an "inflectional trie" [19]. Each time a node N of a trie is traversed (corresponding to the current character path in the word), the frequency counter associated with this node (N_c) is updated (+1). Nodes corresponding to token endings (characters before white spaces or punctuation) are flagged. Once both tries are created, the optimization procedure explores each descendant and, for every daughter node D_k, its frequency k is compared to H_N, the approximation of the harmonic number for N used both in TP and SP [17], where c is the frequency N_c of the mother node:

    H_N = c / ln(c)    (F1)

If k > H_N and c ≠ k, a productive boundary break is postulated (based on the inference that, since there are different continuations and some of them are productive, i.e. sufficiently frequent according to SP, those might be real independent morphemes). We can then check whether this break respects H_D for the relevant nodes D_j and N_i in the "inflectional trie": there must exist a path where the frequency i of the daughter node N_i (in the "inflectional trie" the dependency between D and N is reversed) is lower than j/ln(j), where j is the frequency of the mother node D_j. If this is the case, the continuation is not considered "an exception", in the sense of TP [17], suggesting that the continuation is, in fact, a productive independent morpheme. A "++" root node is then activated, the node D_k is linked to it, and so on recursively, following the FastWordPiece tokenization strategy [20]. During recognition, the LinMaxMatch identification approach is adopted, as in FastWordPiece. Figure 1 illustrates the relevant morpheme breaks (indicated as "||") obtained by applying this morpheme-breaking procedure to fragments of the root and infl tries.
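A minimal sketch of the two tries and of the (F1) criterion may help fix ideas. It only covers the frequency bookkeeping and the productivity check: the branching-factor and cutoff controls discussed next, the "++" re-rooting step and the LinMaxMatch recognition phase are omitted, and the released tokenizer (Appendix A) remains the reference implementation.

```python
import math

def build_trie(tokens, reverse=False):
    """Character trie with a frequency counter per node and an end-of-token flag.
    reverse=False builds the "root" trie (left-to-right traversal),
    reverse=True the "inflectional" trie (right-to-left traversal)."""
    root = {"count": 0, "end": 0, "kids": {}}
    for tok in tokens:
        node = root
        for ch in (reversed(tok) if reverse else tok):
            node = node["kids"].setdefault(ch, {"count": 0, "end": 0, "kids": {}})
            node["count"] += 1          # N_c is updated (+1) at every traversal
        node["end"] += 1                # flag token endings
    return root

def productive_break(c, k):
    """(F1): H_N = c/ln(c). A boundary is postulated between a mother node of
    frequency c and a daughter node of frequency k when k > H_N and c != k."""
    return c != k and k > c / math.log(c)

# Worked check on the "cerca" fragment discussed below:
#   root trie: c=1813, k=1307 -> 1307 > 1813/ln(1813) ≈ 242 -> candidate break
#   infl trie: mother j=466619, daughter i=10121 -> 10121 < j/ln(j) ≈ 35748
#              -> the continuation is not an exception, the break is confirmed
```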
Various parametric controls have been considered to tune this procedure: (i) a branching factor (bf) parameter that excludes nodes with an excessively high number (> bf) of continuations (the rationale being that when too many continuations are present, they are unlikely to correspond to inflections; this often happens near the root of each trie); (ii) a cutoff parameter indicating the lower frequency boundary for a mother node (this is necessary to ensure a minimum number of observations; for example, if cutoff = 8, we exclude from the "root" trie any branching daughter with a frequency < 5). As in BPE, a minimum frequency control for tokens is also implemented, to exclude infrequent dictionary entries.

Figure 1: Visualization of a fragment of the "root" and the "infl(ectional)" trie created by MorPiece on our corpus (cutoff=100, bf=10).

Consider the word "cerca" ("to search for") as represented in the "root" trie. In the last "c-a" step, the relation between H_c and the frequency of "a" indicates that a break might exist between the nodes "c" (frequency=1813) and "a" (frequency=1307), since H_c = 1813/ln(1813) ≈ 242 and 1307 > H_c. This hypothesis is confirmed by the failure of the analogous check on the relevant "a-c" segment of the "infl" trie ("a" frequency=10121, "c" frequency=466619): 10121 < 466619/ln(466619) ≈ 35748. Had H_c been greater than the frequency of "a", no segmentation advantage would have been observable.

The proposed algorithm has a linear time complexity of O(2n), as each trie must be explored deterministically exactly once to evaluate the H_N/H_D frequency relations. The best linguistic results (relatively linguistically coherent segmentations) for our Italian corpus were obtained with cutoff=100 and bf=10. We found it unnecessary to filter the proposed inflectional breaks through the infl-trie double check (TP), since the LinMaxMatch strategy already efficiently filtered out the initially overestimated breaks. However, as an anonymous reviewer correctly pointed out, this strategy does not guarantee total inclusion of every token of our training corpus (in contrast to BPE, for instance). We acknowledge this limitation, but we emphasize that our goal was to produce a smaller, potentially more efficient lexicon. In our experiments, while BPE generated a lexicon of 96028 tokens (67169 when the minimum lexical frequency was set to 2), MoP produced a lexicon of just 55049 tokens (cutoff=100, bf=10).
2.3. Revisiting the LSTM architecture

Despite the many variants of the standard LSTM architecture, notably Gated Recurrent Units [21] or LSTMs augmented with peephole connections [22], and despite the discouraging equivalence results for these variations [23], we observe a recent revival of RNN-based model architectures [24]. We believe, in fact, that the core intuition behind the LSTM architecture may be linguistically relevant and worth exploring further, although generally more performant models (for instance in terms of the GLUE benchmark, [25]) are usually preferred [26]. The linguistic intuition is that the "long-term memory" (the cell state C in Figure 2) of LSTM networks could effectively model various types of non-local dependencies using a single mechanism. Linguistically speaking, filler-gap dependencies (1) and co-referential dependencies (2) are both "non-local dependencies", but they are subject to non-identical locality conditions:

(1) a. cosa_i credi che abbia riposto _i?
       what (you) believe that (he) shelved?
       'what do you believe he shelved?'
    b. *cosa_i credi che abbia riposto il libro [AdvP senza leggere _i]?
    b'. cosa_i credi che abbia riposto _i [AdvP senza leggere _i]?
       'what do you believe he shelved (*the book) without reading?'

(2) a. [il panino]_i, chi credi che lo_i abbia mangiato?
       the sandwich, who (you) believe it has eaten?
    b. *[il panino]_i, chi credi che _i abbia mangiato?
       the sandwich, who (you) believe has eaten?
       'the sandwich, who do you believe has eaten *(it)?'

While both dependencies require C(onstituent)-command generalizations to be captured [27], the adjunct island in (1), [28], but not the clitic left-dislocation in (2), [29], can, for instance, be licensed with a(n extra) gap, as in (1)b'. Aware of these differences, we decided to simply alter the gating system so as to allow the LSTM to create distinct pathways: one to "merge" new tokens, the other to decide whether a long-distance dependency is necessary and, subsequently, to "move" the relevant items [30]. The processing implementation of these operations is inspired by the expectation-based Minimalist Grammars formalism, eMG [31]; hence the name eMG-RNN.

Following this implementation, merge applies incrementally, token by token, and move means "retain in memory". In more detail, the cell of an eMG-RNN network performs the forward processing described by the computational graph in Figure 2: (i) the input at time t (x_t) is linearly transformed into a lower-dimensional vector (E, loosely used for "embedding") and then concatenated (C) with the previous hidden state/output, if any (h_t-1). Two pathways, both transformed through a sigmoid function (σ), lead to the move gate on one side and to the merge gate on the other. In the first case, the result of the sigmoid transformation is multiplied (⊙, the Hadamard product) with the input: this either erases or lets through components of the original vector, which are then added (+) to the previous context/cell state (c_t-1), if any, as in the LSTM forget gate. The merge gate, on the other side, privileges the new token if the sigmoid combination of the incoming token and the previous hidden state is low; otherwise (1 − this activation, as in the GRU update gate), it favors the items in the context/cell state (transformed through a tanh function to simulate memory decay).

Figure 2: eMG-RNN cell computational graph.
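Read in this way, the cell admits a compact sketch. The wiring below is only our reading of Figure 2 and of the description above (the projection sizes, the exact placement of the 1 − merge term and all other hyper-parameters are assumptions); the released code ([32], Appendix A) is the reference implementation.

```python
import torch
import torch.nn as nn

class EMGRNNCell(nn.Module):
    """Sketch of the eMG-RNN cell of Figure 2 (one possible reading of the graph)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embed = nn.Linear(input_size, hidden_size)        # E: project x_t down
        self.move  = nn.Linear(2 * hidden_size, hidden_size)   # "move" pathway
        self.merge = nn.Linear(2 * hidden_size, hidden_size)   # "merge" pathway

    def forward(self, x_t, h_prev, c_prev):
        e = self.embed(x_t)                          # E(x_t)
        z = torch.cat([e, h_prev], dim=-1)           # concatenation (C) with h_{t-1}
        move_gate  = torch.sigmoid(self.move(z))     # what may enter the cell state
        merge_gate = torch.sigmoid(self.merge(z))    # new token vs. stored context
        c_t = c_prev + move_gate * e                 # update the "long-term" memory
        h_t = (1.0 - merge_gate) * e + merge_gate * torch.tanh(c_t)  # tanh ≈ decay
        return h_t, c_t
```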
This architecture proved the most performant among the various alternatives we tested for the BabyLM 2024 challenge [32].
2.4. A linguistically informed evaluation

The last step in the pipeline requires a linguistically advanced set of oppositions to verify that structural generalizations are captured coherently. We adopted the lm-eval package [33] and included a specific task based on the English BLiMP design [12]. Most of the contrasts are derived from the COnVERSA test [4]. They consist of minimal pairs ordered following an increasing complexity metric that considers the number of operations necessary to establish a dependency and the locality of such a dependency. The examples below illustrate this point by comparing a local agreement dependency with, (3)b, or without, (3)a, a (linear) intervener, and a more complex dependency that requires processing an object relative clause (4):

(3) a. Il piatto è pieno. vs. Il piatto è piena.
       the dish.S.M is full.S.M … full.S.F
    b. Il muro della casa è rosso. vs. Il muro della casa è rossa.
       the wall.S.M of the house is red.S.M … red.S.F

(4) Ci sono due maestri. Uno insegna ed è ascoltato dagli studenti, l'altro si riposa. Quale maestro insegna?
    'There are two teachers. One teaches and is listened to by the students, the other rests. Which teacher teaches?'
       Quello che gli studenti ascoltano.
       'The one who the students listen to.'
    vs. Quello che ascolta gli studenti.
       'The one who listens to the students.'

Four kinds of dependency (agreement, thematic role assignment, pronominal form usage, and question formation and answering) are considered, over a set of 32 distinct syntactic configurations (a total of 344 minimal pairs to be judged, [4]).
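Following the BLiMP design, a model is typically credited with a minimal pair when it assigns a higher probability to the acceptable member. A minimal sketch of this comparison, assuming an autoregressive model exposing a Hugging Face-style interface (the actual evaluation runs as an lm-eval task, see Appendix A), is:

```python
import torch
import torch.nn.functional as F

def sentence_logprob(model, tokenizer, text):
    """Total log-probability of `text` under an autoregressive LM."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits                      # [1, T, vocab]
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)    # predict token t+1 from t
    gold = ids[:, 1:].unsqueeze(-1)
    return logprobs.gather(-1, gold).sum().item()

def judge_pair(model, tokenizer, acceptable, deviant):
    """The pair counts as correct when the acceptable member is more probable."""
    return sentence_logprob(model, tokenizer, acceptable) > \
           sentence_logprob(model, tokenizer, deviant)

# e.g. (3)a: judge_pair(model, tok, "Il piatto è pieno.", "Il piatto è piena.")
```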
3. Materials and Methods

We trained our models on the IUSS High-Performance Cluster with 2 GPU nodes, each with 4 NVIDIA A100 devices and 1T of RAM. Each network has been trained on the full corpus using various batching strategies: (i) Naturalistic, line-by-line, single exposure to each sentence in the corpus (each epoch corresponds to an exposure of about 3M tokens); (ii) Conversational, where two sequential lines are used as input, that is, [line 1, line 2], [line 2, line 3], etc. are batched; this guarantees that a minimal conversational context is provided for each sentence (each epoch corresponds to an exposure of 6M tokens); (iii) Fixed sequence length, where, considering the average sentence length of 54 words, a window of 60 tokens is used, that is, [tok_1, tok_2 … tok_60], [tok_2, tok_3 … tok_61], etc. are batched (with this regimen, each epoch corresponds to an exposure of 180M tokens). Roughly speaking, the bare amount of data processed by a 7-year-old child ranges from 7 to 70M tokens [34]; training the networks with the naturalistic or conversational regimen for 3-10 epochs would therefore result in a comparable exposure.
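The three regimens amount to three ways of slicing the same corpus; a schematic sketch follows (function names, the line separator and the stride-1 window are our own simplifications):

```python
def naturalistic(lines):
    # (i) one training sequence per corpus line (~3M tokens per epoch)
    for line in lines:
        yield line

def conversational(lines):
    # (ii) overlapping pairs of adjacent lines: [line1+line2], [line2+line3], ...
    #      (~6M tokens per epoch)
    for prev, curr in zip(lines, lines[1:]):
        yield prev + " " + curr

def fixed_window(token_ids, size=60):
    # (iii) sliding 60-token window with stride 1 (~180M tokens per epoch)
    for i in range(len(token_ids) - size + 1):
        yield token_ids[i:i + size]
```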
We trained the networks using torch.optim.lr_scheduler (step_size=5, gamma=0.1) and the Adam optimizer (lr=0.001), with 16-bit automatic mixed precision to speed up the (parallel) training, for a maximum of 100 epochs. The networks have been implemented in PyTorch (v2.3.1) and wrapped in Transformers structures (v4.42.4) to maximize compatibility with the lm-eval (v0.4.3) environment. CUDA drivers v12.4 were used. The most relevant configurations tested are discussed in the next section.
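Put together, the training step can be sketched as follows; StepLR is our assumption for the lr_scheduler settings quoted above, the labels-based loss assumes the Transformers wrapping just mentioned, and model/batches stand for any architecture of §3.1 and any regimen of §3:

```python
import torch

def train(model, batches, max_epochs=100, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    scaler = torch.cuda.amp.GradScaler()          # 16-bit automatic mixed precision
    model.to(device).train()
    for epoch in range(max_epochs):
        for input_ids, labels in batches:         # any regimen from Section 3
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                loss = model(input_ids.to(device), labels=labels.to(device)).loss
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()
```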
3.1. Configurations tested

Three different tokenization strategies (BPE, FastWordPiece, and MorPiece) are compared using the best-performing LSTM network [35], which consists of 650 units for the embedding layer and 650 nodes for each of the two hidden layers. Five different network architectures are compared, with the GroNLP GPT-2-small pretrained model [36] constituting our "top LLM performer". This model was re-adapted to Italian from the English GPT-2 model, which was originally trained on a corpus of approximately 10 billion tokens, namely various orders of magnitude bigger than ours. We then trained on our corpus a comparable bidirectional transformer (BERT), two LSTM networks with 1 and 2 LSTM layers respectively, and a one-layer eMG-RNN network (Table 2), as described in §2.3.

Table 2
Network architectures.

Model                Parameters   Structure
GroNLP GPT-2 small   121M         12 attention heads + 768 hidden units
BERT                 113M         12 attention heads + 768 hidden units
LSTMx2               65M          650 embedding + 2 LSTM layers (650)
LSTMx1               36M          650 embedding + 1 LSTM layer (650)
eMG-RNN              73M          650 embedding + 1 eMG-RNN layer (650)
4. Results

Comparing BERT and LSTM architectures, LSTMx1 qualifies as the most performant configuration (both in training and in minimal pair judgments). Considering training, the only batching regimen performing sufficiently well is the fixed sequence length one (loss=0.8877 with LSTMx1, vs. a conversational loss of 4.0240 and a naturalistic loss of 4.5884). All networks reached a learning plateau around 10-12 epochs. Comparing the performances on COnVERSA, we realized that the results do not improve after 3 epochs of the fixed sequence length (60 tokens) training regimen (a result compatible with the overfitting hypothesis, [37]). Focusing on tokenizer training results with LSTMx1, we observed that BPE and FastWordPiece have comparable performance. MorPiece performs slightly worse, even though its tokenization seems linguistically more coherent (e.g., "farlo", "to do it", is tokenized both by BPE and fWP as a single token, while it is split in two by MorPiece: "far" "+lo") and its training is faster (Table 3). This, however, only marginally impacts the minimal-pair contrast judgments, which improve slightly, overall, only in certain agreement cases.

Table 3
Impact of the tokenization strategy on LSTM training.

Strategy       Vocab size   Training time / epoch   Loss
Corpus types   72931        ~1h                     1.1520
BPE            96028        ~4h                     0.8877
fWP            97162        ~4h                     0.9491
MoP            55049        ~3h                     1.1151

We then adopted the BPE tokenizer for the architectural comparisons. Network training performances are summarized in Table 4 and graphically represented in Figure 3 for a comparison across linguistic dimensions.

Table 4
Network architectures and their performance on training (Loss/Accuracy) and on the COnVERSA test.

Model           Loss/Accuracy    COnVERSA
GroNLP GPT-2s   —                0.73 (±0.02)
BERT            4.5488/0.65471   0.43 (±0.02)
LSTMx2          0.7849/0.8283    0.48 (±0.03)
LSTMx1          0.8784/0.8103    0.52 (±0.03)
eMG-RNN         0.9491/0.7815    0.61 (±0.01)

[Figure 3: radar chart over the COnVERSA syntactic configurations (subject-verb, DP-internal and past-participle agreement, auxiliary selection, theta roles, clitic and reflexive pronouns, psych verbs, person rotation, polar and wh-questions, etc.), comparing LSTMx1, eMG-RNN (BPE) and 7-year-old children.]
Figure 3: Performance of the two best RNN network variants on COnVERSA compared to 7-year-old children.
5. Discussion

Overall, LSTM networks significantly outperform bidirectional Transformers in this minimal-pair test on Italian. This finding is consistent with results previously discussed in the literature; it suggests a clear advantage of recurrent, sequential model architectures (e.g., LSTM) over bidirectional Transformers in terms of linguistic generalizations [38] and partially justifies the renewed interest in RNN architectures observed in the last couple of years [24], [26]. As far as the tokenization procedure is concerned, it is somewhat premature to draw definitive conclusions from our experiments, as MorPiece has not yet been fully optimized or tested. Specifically, the optimal cutoff threshold and minimum branching factor have not been systematically evaluated. Nevertheless, a more morphologically coherent segmentation is expected to enhance sensitivity in certain minimal contrasts.

Similarly, the eMG-RNN architecture could be further explored and optimized, particularly on specific contrasts, which may help determine whether our linguistic modeling is on the right track. Evidence to the contrary is attested by the judgments of sentences with missing thematic roles, which are often incorrectly preferred by most models, including our eMG-RNN.

In the end, our results suggest that the loss/accuracy performance registered in training is not a significant predictor of the performance on the COnVERSA test or, more generally, of the linguistic coherence of the trained LM. Likewise, model size is not a clear predictor either: Transformers trained on the same small dataset perform randomly (in all dimensions their performance is around 50%), while eMG-RNN, which has a number of parameters similar to LSTMx2, outperforms both LSTMx2 and LSTMx1 (half the size of eMG-RNN). The training size remains a striking difference compared to the input received by children: this difference of one order of magnitude suggests that the biases built into eMG-RNN are not yet satisfactory and that our Language Acquisition Device is still more efficient; in this sense, the Poverty of Stimulus Hypothesis remains unrefuted [39] by these results. Next steps will consider extending the training corpus to 10M tokens (to match the English counterpart [1]) and further exploring the effects of optimized tokenization procedures and of other minimal modifications and optimizations [24] of recurrent neural networks.

Acknowledgments

This project is partially supported by T-GRA2L: Testing GRAdeness and GRAmmaticality in Linguistics, a PRIN 2022 Next Generation EU funded project (202223PL4N). National coordinator: CC.

References

[1] A. Warstadt et al., Eds., Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning. Singapore: Association for Computational Linguistics, 2023. [Online]. Available: https://aclanthology.org/2023.conll-babylm.0
[2] R. Katzir, "Why large language models are poor theories of human linguistic cognition. A reply to Piantadosi (2023)," 2023. [Online]. Available: lingbuzz/007190
[3] S. Piantadosi, "Modern language models refute Chomsky's approach to language," Lingbuzz Preprint, lingbuzz, vol. 7180, 2023.
[4] C. Chesi, G. Ghersi, V. Musella, and D. Musola, COnVERSA: Test di Comprensione delle Opposizioni morfo-sintattiche VERbali attraverso la ScritturA. Firenze: Hogrefe, 2024.
[5] A. Vaswani et al., "Attention Is All You Need," arXiv:1706.03762 [cs], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1706.03762
[6] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] L. G. G. Charpentier and D. Samuel, "Not all layers are equally as important: Every Layer Counts BERT," in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 210–224. doi: 10.18653/v1/2023.conll-babylm.20.
[8] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," Jun. 2022, arXiv: arXiv:2205.14135. [Online]. Available: http://arxiv.org/abs/2205.14135
[9] J. Steuer, M. Mosbach, and D. Klakow, "Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures," in Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore: Association for Computational Linguistics, 2023, pp. 114–129. doi: 10.18653/v1/2023.conll-babylm.12.
[10] Y. Zhang, A. Warstadt, H.-S. Li, and S. R. Bowman, "When Do You Need Billions of Words of Pretraining Data?," Nov. 2020, arXiv: arXiv:2011.04946. [Online]. Available: http://arxiv.org/abs/2011.04946
[11] S. Crain and M. Nakayama, "Structure Dependence in Grammar Formation," Language, vol. 63, no. 3, p. 522, Sep. 1987. doi: 10.2307/415004.
[12] A. Warstadt et al., "BLiMP: The Benchmark of Linguistic Minimal Pairs for English," Transactions of the Association for Computational Linguistics, vol. 8, pp. 377–392, Dec. 2020. doi: 10.1162/tacl_a_00321.
[13] I. Alfano, F. Cutugno, A. De Rosa, C. Iacobini, R. Savy, and M. Voghera, "VOLIP: a corpus of spoken Italian and a virtuous example of reuse of linguistic resources," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), N. Calzolari et al., Eds., Reykjavik, Iceland: European Language Resources Association (ELRA), May 2014, pp. 3897–3901. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2014/pdf/906_Paper.pdf
[14] T. B. Brown et al., "Language Models are Few-Shot Learners," arXiv:2005.14165 [cs], Jul. 2020. [Online]. Available: http://arxiv.org/abs/2005.14165
[15] P. Gage, "A new algorithm for data compression," C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[17] C. D. Yang, The price of linguistic productivity: how children learn to break the rules of language. Cambridge, MA: MIT Press, 2016.
[18] H. Jabbar, "MorphPiece: A Linguistic Tokenizer for Large Language Models," Feb. 2024, arXiv: arXiv:2307.07262. [Online]. Available: http://arxiv.org/abs/2307.07262
[19] E. Fredkin, "Trie memory," Communications of the ACM, vol. 3, no. 9, pp. 490–499, Sep. 1960. doi: 10.1145/367390.367400.
[20] X. Song, A. Salcianu, Y. Song, D. Dopson, and D. Zhou, "Fast WordPiece Tokenization," Oct. 2021, arXiv: arXiv:2012.15524. [Online]. Available: http://arxiv.org/abs/2012.15524
[21] K. Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," Sep. 2014, arXiv: arXiv:1406.1078. [Online]. Available: http://arxiv.org/abs/1406.1078
[22] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy: IEEE, 2000, pp. 189–194, vol. 3. doi: 10.1109/IJCNN.2000.861302.
[23] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct. 2017. doi: 10.1109/TNNLS.2016.2582924.
[24] L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, and H. Hajimirsadegh, "Were RNNs All We Needed?," Oct. 2024, arXiv: arXiv:2410.01201. [Online]. Available: http://arxiv.org/abs/2410.01201
[25] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," Feb. 2019, arXiv: arXiv:1804.07461. [Online]. Available: http://arxiv.org/abs/1804.07461
[26] A. Gu and T. Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," May 2024, arXiv: arXiv:2312.00752. [Online]. Available: http://arxiv.org/abs/2312.00752
[27] T. Reinhart, "The syntactic domain of anaphora," Massachusetts Institute of Technology, Cambridge (MA), 1976.
[28] J. R. Ross, "Constraints on variables in syntax," MIT, Cambridge (MA), 1967.
[29] C. Cecchetto, "A Comparative Analysis of Left and Right Dislocation in Romance," Studia Linguistica, vol. 53, no. 1, pp. 40–67, Apr. 1999. doi: 10.1111/1467-9582.00039.
[30] N. Chomsky et al., Merge and the Strong Minimalist Thesis, 1st ed. Cambridge University Press, 2023. doi: 10.1017/9781009343244.
[31] C. Chesi, "Expectation-based Minimalist Grammars," arXiv:2109.13871 [cs], Sep. 2021. [Online]. Available: http://arxiv.org/abs/2109.13871
[32] C. Chesi et al., "Different Ways to Forget: Linguistic Gates in Recurrent Neural Networks," in Proceedings of the BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, 2024.
[33] L. Gao et al., "A framework for few-shot language model evaluation," Zenodo, Dec. 2023. doi: 10.5281/zenodo.10256836.
[34] B. Hart and T. R. Risley, "American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments," Developmental Psychology, vol. 28, no. 6, pp. 1096–1105, Nov. 1992. doi: 10.1037/0012-1649.28.6.1096.
[35] K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, and M. Baroni, "Colorless Green Recurrent Networks Dream Hierarchically," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1195–1205. doi: 10.18653/v1/N18-1108.
[36] W. de Vries and M. Nissim, "As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 836–846. doi: 10.18653/v1/2021.findings-acl.74.
[37] F. Xue, Y. Fu, W. Zhou, Z. Zheng, and Y. You, "To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis," 2023, arXiv. doi: 10.48550/ARXIV.2305.13230.
[38] E. Wilcox, R. Futrell, and R. Levy, "Using Computational Models to Test Syntactic Learnability," Linguistic Inquiry, pp. 1–44, Apr. 2023. doi: 10.1162/ling_a_00491.
[39] C. Yang, S. Crain, R. C. Berwick, N. Chomsky, and J. J. Bolhuis, "The growth of language: Universal Grammar, experience, and principles of computation," Neuroscience & Biobehavioral Reviews, vol. 81, pp. 103–119, Oct. 2017. doi: 10.1016/j.neubiorev.2016.12.023.


A. Online Resources
Resources (corpus information, tokenizer, network
architectures and lm_eval tasks) are available at
https://github.com/cristianochesi/babylm-2024.