<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recurrent Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadine El-Naggar</string-name>
          <email>nadine.el-naggar@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pranava Madhyastha</string-name>
          <email>pranava.madhyastha@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tillman Weyde</string-name>
          <email>t.e.weyde@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>City, University of London</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>28</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>Considerable work, both theoretical and empirical, has shown that Recurrent Neural Network (RNN) architectures are capable of learning formal languages under specific conditions. In this study, we investigate the ability of linear and ReLU RNNs to learn Dyck-1 languages in whole sequence classification tasks. We observe that counting bracket sequences is learned but performance on full Dyck-1 recognition is poor. Models for both tasks do not generalise well to longer sequences. We determine correct weights for the given tasks with suitable architectures, but the standard setup for classification surprisingly departs from the correct values. We propose a regression setup with clipping that we find to stabilise correct weights, but it makes learning from random weight initialisation even less efective. Our observations suggest that Dyck-1 languages seem unlikely to be learned by ReLU RNNs for most practical applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Dyck-1 languages</kwd>
        <kwd>Formal language learning</kwd>
        <kwd>Generalisation</kwd>
        <kwd>Classification</kwd>
        <kwd>Systematicity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>4. We provide evidence that using a regression setup causes models to retain correctly
initialised weights, but does not improve their ability to learn the correct weights.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Setup</title>
      <p>
        In our experiments, we evaluate the ability of RNNs to learn Dyck-1 languages in two tasks.
Task 1 (bracket counting) detects if there is a surplus of opening brackets in a string, which is
not full Dyck-1 recognition. Task 2 (Dyck-1 recognition) detects invalid Dyck-1 sequences, i.e.
one that has a surplus of opening brackets overall or that has a surplus of closing brackets at
any previous point, versus valid ones. Both tasks require internally counting up for opening and
down for closing brackets. For Task 1, calculating the diference and comparing to a threshold
is suficient, which can be achieved with a linear network. For Task 2 we need also to flag
negative counter values at any time point for full Dyck-1 recognition. We know that this is
possible with a small RNN with ReLU activation from the theoretical results of Siegelmann et al.
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Leshno et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Weiss et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Our experimental setup is similar to those of Weiss et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Suzgun et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], but there
are some important diferences. Weiss et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] use     sequences and predict the next token.
Suzgun et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] only use valid Dyck-1 sequences and determine at every point the valid next
tokens, like in Gers and Schmidhuber [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], efectively classifying incomplete versus complete
Dyck-1 sequences. While our experiments follow Suzgun et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], our datasets contains also
invalid Dyck-1 sequences, i.e. ones with a surplus of closing brackets, which adds complexity
to the task. We also use shorter sequences than Weiss et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Suzgun et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In addition to standard random weight initialisation, we evaluate the efect of training from
correct weights. In order to determine the necessary size of the NN and for experimenting with
correct weight initialisation, we define two NNs with weights that correctly perform Tasks 1
and 2 as shown in Figure 1. These models operate correctly on sequences of arbitrary length,
within the limits of the numeric representation range.</p>
      <p>
        We train both models using sequences of lengths 2, 4, and 8 tokens and test with sequences
of 10, 20, and 50 tokens. The datasets contain all possible sequences for lengths 2-10, and a
sample of 150 sequences for lengths 20 and 50. All datasets are class balanced. There is a
label at the end of each sequence, (Task 1: ‘1’ - incomplete, ‘0’ - balanced or surplus of closing
brackets; Task 2: ‘1’ - invalid or incomplete, ‘0’ - valid Dyck-1 sequence). This is diferent to
Gers and Schmidhuber [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Weiss et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Suzgun et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], who use labels at every time
step and a target encoding in terms of legal next tokens. We use two diferent setups, standard
classification (sigmoid output activation with cross-entropy loss), and regression setup (clipped
output [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ] with mean squared error (MSE) loss). We use two initialisation schemes: random
weights and correct weights according to Figure 1. We train for 100 epochs using the Adam
optimiser by Kingma and Ba [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and a learning rate of 0.005.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>In Table 1, we observe that Task 1 is learned on the training set, but generalises less for longer
sequences. Longer sequences in the training data do lead to improved generalisation, but still
not perfect in most cases. Interestingly, starting from correct weights with the classification
setup does not lead to better generalisation. This is intriguing, as the training apparently
unlearns the correct weight values. In order to avoid this problem, we developed the regression
setup, which does not change the correct weights (last row in Table 1), but it does not learn well
from random initialisation. In Task 2, the learning from random weights never leads to perfect
training or generalisation. Even with correct initialisation, generalisation is mostly poor.</p>
      <p>The models used are custom designed for their respective tasks, and it is possible for these
models to be used to solve these tasks with the correct weights. However, in practice, they do
not converge to the correct weights. The systematic behaviour to solve these tasks does not
emerge in our models during training, irrespective of the setup used, and when the standard
classification setup is used, the correctly trained models even unlearn the correct weights. When
the regression setup is used, the correctly initialised models do not unlearn the correct weights.
However, we observe that the use of the regression setup does not improve the ability of the
models to learn systematically and generalise more efectively to longer sequences. This leads
us to believe that both of these setups are not suited for this type of task.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>
        We have a few interesting observations in our experiments. Learning Dyck-1 sequences from
random weights is a dificult task and the results do not generalise to long sequences in any
setup not initialised with correct weights. Using longer (and thus more) training sequences
leads to some improvements, but generalisation to longer sequences is still limited. Initialising
the network with correct weights could help, but even that is not efective in a classification
task. The tested approach of using a regression setup (clipping and MSE) can avoid unlearning
of correct weights, but it hinders learning from random weight initialisation, and is thus not
efective either. Given our results it seems unlikely RNNs would learn Dyck-1 languages in a
practical scenario. Using LSTMs could improve the situation, but probably only to a limited
extent, given the results by Weiss et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Gers and Schmidhuber [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Overall, further
studies and designs are needed to address reliable learning of Dyck-1 languages with NNs.
      </p>
      <p>
        The observation is that RNNs fail to learn symbolic patterns and the systematic behaviour
required to count brackets and recognise Dyck-1 sequences. This is consistent with studies that
focus on the abilities of NNs to learn systematic behaviour, such as Fodor and Pylyshyn [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
Marcus et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and Lake and Baroni [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These studies show that NNs struggle with tasks
that are simple for humans to learn from a small number of examples. The dificulty to learn
symbolic patterns in a systematic manner is not exclusive to larger models, and is evidently
exhibited by our very small models. Achieving systematicity in smaller models can potentially
serve as a stepping stone towards achieving systematicity for larger models. In the future we
aim to develop a deeper understanding of the learning dynamics of our models, and develop
methods that can improve systematic learning.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , E. Yahav,
          <article-title>On the practical computational power of finite precision rnns for language recognition</article-title>
          , in: I. Gurevych, Y. Miyao (Eds.),
          <article-title>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</article-title>
          ,
          <string-name>
            <surname>ACL</surname>
          </string-name>
          <year>2018</year>
          , Melbourne, Australia,
          <source>July 15-20</source>
          ,
          <year>2018</year>
          , Volume
          <volume>2</volume>
          :
          <string-name>
            <given-names>Short</given-names>
            <surname>Papers</surname>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>740</fpage>
          -
          <lpage>745</lpage>
          . URL: https://aclanthology.org/P18-2117/.
          <source>doi:1 0 . 1 8</source>
          <volume>6 5 3</volume>
          / v 1 / P 1
          <fpage>8</fpage>
          -
          <lpage>2</lpage>
          1
          <fpage>1</fpage>
          7 .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shieber</surname>
          </string-name>
          ,
          <article-title>LSTM networks can perform dynamic counting</article-title>
          , CoRR abs/
          <year>1906</year>
          .03648 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/
          <year>1906</year>
          .03648.
          <article-title>a r X i v : 1 9 0 6 . 0 3 6 4 8</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Siegelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Horne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Giles</surname>
          </string-name>
          ,
          <article-title>Computational capabilities of recurrent narx neural networks</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          ,
          <string-name>
            <surname>Part</surname>
            <given-names>B</given-names>
          </string-name>
          (
          <year>Cybernetics</year>
          )
          <volume>27</volume>
          (
          <year>1997</year>
          )
          <fpage>208</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Leshno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinkus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schocken</surname>
          </string-name>
          ,
          <article-title>Multilayer feedforward networks with a nonpolynomial activation function can approximate any function</article-title>
          ,
          <source>Neural networks 6</source>
          (
          <year>1993</year>
          )
          <fpage>861</fpage>
          -
          <lpage>867</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Gers</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Schmidhuber,</surname>
          </string-name>
          <article-title>LSTM recurrent networks learn simple context-free and contextsensitive languages</article-title>
          ,
          <source>IEEE Trans. Neural Networks</source>
          <volume>12</volume>
          (
          <year>2001</year>
          )
          <fpage>1333</fpage>
          -
          <lpage>1340</lpage>
          . URL: https://doi. org/10.1109/72.963769.
          <source>doi:1 0 . 1 1 0 9 / 7 2 . 9 6</source>
          <volume>3 7 6 9 .</volume>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          , in: Y. Bengio, Y. LeCun (Eds.),
          <source>3rd International Conference on Learning Representations, ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-
          <issue>9</issue>
          ,
          <year>2015</year>
          , Conference Track Proceedings,
          <year>2015</year>
          . URL: http://arxiv.org/abs/ 1412.6980.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Fodor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. W.</given-names>
            <surname>Pylyshyn</surname>
          </string-name>
          ,
          <article-title>Connectionism and cognitive architecture: A critical analysis</article-title>
          ,
          <source>Cognition</source>
          <volume>28</volume>
          (
          <year>1988</year>
          )
          <fpage>3</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. F.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Vishton</surname>
          </string-name>
          ,
          <article-title>Rule learning by seven-month-old infants</article-title>
          ,
          <source>Science</source>
          <volume>283</volume>
          (
          <year>1999</year>
          )
          <fpage>77</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <article-title>Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks</article-title>
          , in: J. G. Dy,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Krause (Eds.),
          <source>Proceedings of the 35th International Conference on Machine Learning</source>
          ,
          <string-name>
            <surname>ICML</surname>
          </string-name>
          <year>2018</year>
          , Stockholmsmässan, Stockholm, Sweden,
          <source>July 10-15</source>
          ,
          <year>2018</year>
          , volume
          <volume>80</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2879</fpage>
          -
          <lpage>2888</lpage>
          . URL: http://proceedings.mlr.press/v80/lake18a.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>