Recurrent Neural Networks

Nadine El-Naggar

nadine.el-naggar@city.ac.uk

Pranava Madhyastha

pranava.madhyastha@city.ac.uk

Tillman Weyde

t.e.weyde@city.ac.uk

City, University of London, United Kingdom

2022

28 30

Considerable work, both theoretical and empirical, has shown that Recurrent Neural Network (RNN) architectures are capable of learning formal languages under specific conditions. In this study, we investigate the ability of linear and ReLU RNNs to learn Dyck-1 languages in whole-sequence classification tasks. We observe that counting bracket sequences is learned, but performance on full Dyck-1 recognition is poor. Models for both tasks do not generalise well to longer sequences. We determine correct weights for the given tasks with suitable architectures, but training in the standard classification setup surprisingly moves the weights away from the correct values. We propose a regression setup with clipping that we find to stabilise correct weights, but it makes learning from random weight initialisation even less effective. Our observations suggest that Dyck-1 languages are unlikely to be learned by ReLU RNNs in most practical applications.

Dyck-1 languages, Formal language learning, Generalisation, Classification, Systematicity

1. Introduction

4. We provide evidence that using a regression setup causes models to retain correctly initialised weights, but does not improve their ability to learn the correct weights.

2. Experimental Setup

In our experiments, we evaluate the ability of RNNs to learn Dyck-1 languages in two tasks. Task 1 (bracket counting) detects whether a string has a surplus of opening brackets, which is not full Dyck-1 recognition. Task 2 (Dyck-1 recognition) distinguishes invalid Dyck-1 sequences, i.e. those that have a surplus of opening brackets overall or a surplus of closing brackets at any point in the sequence, from valid ones. Both tasks require internally counting up for opening and down for closing brackets. For Task 1, calculating the difference and comparing it to a threshold is sufficient, which can be achieved with a linear network. For Task 2, full Dyck-1 recognition additionally requires flagging negative counter values at any time step. We know that this is possible with a small RNN with ReLU activation from the theoretical results of Siegelmann et al. [3], Leshno et al. [4] and Weiss et al. [1].
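The two task definitions can be made concrete with a small reference labeller. The following is a minimal sketch, assuming brackets are encoded as the characters `(` and `)` and using the label convention stated in the setup below (1 for a surplus or an invalid sequence, 0 otherwise). The function names `task1_label`, `task2_label` and `all_sequences` are hypothetical and do not come from the original experiments.

```python
from itertools import product


def task1_label(seq: str) -> int:
    """Bracket counting: 1 if the whole string has a surplus of opening brackets."""
    count = sum(1 if tok == "(" else -1 for tok in seq)
    return int(count > 0)


def task2_label(seq: str) -> int:
    """Dyck-1 recognition: 0 for a valid sequence, 1 for invalid or incomplete."""
    count = 0
    for tok in seq:
        count += 1 if tok == "(" else -1
        if count < 0:
            # Surplus of closing brackets at some prefix: invalid.
            return 1
    # A remaining surplus of opening brackets means the sequence is incomplete.
    return int(count > 0)


def all_sequences(length: int) -> list[str]:
    """Enumerate every bracket string of the given length."""
    return ["".join(p) for p in product("()", repeat=length)]


if __name__ == "__main__":
    for s in ["()", "(()", "())(", ")("]:
        print(s, task1_label(s), task2_label(s))
```

The string `())(` shows why the tasks differ: its overall count is zero, so Task 1 labels it 0, while Task 2 labels it 1 because the count becomes negative after the third token.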

Our experimental setup is similar to those of Weiss et al. [1] and Suzgun et al. [2], but there are some important differences. Weiss et al. [1] use sequences and predict the next token. Suzgun et al. [2] use only valid Dyck-1 sequences and determine the valid next tokens at every point, as in Gers and Schmidhuber [5], effectively classifying incomplete versus complete Dyck-1 sequences. While our experiments follow Suzgun et al. [2], our datasets also contain invalid Dyck-1 sequences, i.e. ones with a surplus of closing brackets, which adds complexity to the task. We also use shorter sequences than Weiss et al. [1] and Suzgun et al. [2].

In addition to standard random weight initialisation, we evaluate the effect of training from correct weights. To determine the necessary size of the NN and to experiment with correct weight initialisation, we define two NNs with weights that correctly perform Tasks 1 and 2, as shown in Figure 1. These models operate correctly on sequences of arbitrary length, within the limits of the numeric representation range.
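To illustrate what hand-set correct weights can look like, the sketch below builds one linear recurrent unit for Task 1 and a three-unit ReLU RNN for Task 2. This is an illustrative construction under stated assumptions, not necessarily the networks of Figure 1. The one-hot input encoding, the weight matrices `W1`, `U1`, `W2`, `U2`, `V2`, and the function names are all assumptions introduced here.

```python
import numpy as np


def one_hot(seq: str) -> np.ndarray:
    """Assumed input encoding: column 0 marks '(', column 1 marks ')'."""
    return np.array([[1.0, 0.0] if t == "(" else [0.0, 1.0] for t in seq]).reshape(-1, 2)


# Task 1: a single linear unit accumulating (#opening - #closing).
W1 = np.array([[1.0]])         # recurrent weight
U1 = np.array([[1.0, -1.0]])   # input weights


def run_task1(seq: str) -> float:
    h = np.zeros(1)
    for x in one_hot(seq):
        h = W1 @ h + U1 @ x                    # linear activation
    return float(np.clip(h[0], 0.0, 1.0))      # output clipped to [0, 1]


# Task 2: three ReLU units [a, e, b].
#   a_t = ReLU(a_{t-1} + open_t - close_t)   counter, held at 0 from below
#   e_t = ReLU(close_t - a_{t-1})            1 iff a bracket closes at count 0
#   b_t = ReLU(b_{t-1} + e_{t-1})            latches any earlier violation
W2 = np.array([[1.0, 0.0, 0.0],
               [-1.0, 0.0, 0.0],
               [0.0, 1.0, 1.0]])
U2 = np.array([[1.0, -1.0],
               [0.0, 1.0],
               [0.0, 0.0]])
V2 = np.array([1.0, 1.0, 1.0])  # output weights: any active unit signals class 1


def run_task2(seq: str) -> float:
    h = np.zeros(3)
    for x in one_hot(seq):
        h = np.maximum(0.0, W2 @ h + U2 @ x)   # ReLU activation
    return float(np.clip(V2 @ h, 0.0, 1.0))    # output clipped to [0, 1]


if __name__ == "__main__":
    for s in ["()", "(()", "())(", ")("]:
        print(s, run_task1(s), run_task2(s))
```

With integer-valued states, the clipped output of both sketches is exactly 0 or 1, so it coincides with the target label. The violation flag `b` lags one step behind `e`, which is why the readout sums all three units.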

We train both models using sequences of lengths 2, 4, and 8 tokens and test with sequences of 10, 20, and 50 tokens. The datasets contain all possible sequences for lengths 2-10, and a sample of 150 sequences for lengths 20 and 50. All datasets are class balanced. Each sequence has a single label at its end (Task 1: ‘1’ - incomplete, ‘0’ - balanced or surplus of closing brackets; Task 2: ‘1’ - invalid or incomplete, ‘0’ - valid Dyck-1 sequence). This differs from Gers and Schmidhuber [5], Weiss et al. [1] and Suzgun et al. [2], who use labels at every time step and a target encoding in terms of legal next tokens. We use two different setups: standard classification (sigmoid output activation with cross-entropy loss) and a regression setup (output clipped to [0, 1] with mean squared error (MSE) loss). We use two initialisation schemes: random weights and correct weights according to Figure 1. We train for 100 epochs using the Adam optimiser by Kingma and Ba [6] and a learning rate of 0.005.
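The two output setups can be written down per example as follows. This is a minimal sketch of the loss definitions only, assuming a scalar raw network output `z` before the output non-linearity. The function names and the small clamp that keeps the logarithm finite are assumptions made here, not details from the original experiments.

```python
import numpy as np


def classification_loss(z: float, y: int) -> float:
    """Sigmoid output activation followed by binary cross-entropy."""
    p = 1.0 / (1.0 + np.exp(-z))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)   # assumed clamp to avoid log(0)
    return float(-(y * np.log(p) + (1 - y) * np.log(1.0 - p)))


def regression_loss(z: float, y: int) -> float:
    """Output clipped to [0, 1] followed by squared error."""
    out = np.clip(z, 0.0, 1.0)
    return float((out - y) ** 2)


if __name__ == "__main__":
    # Raw score 3 with target 1, and raw score 0 with target 0.
    for z, y in [(3.0, 1), (0.0, 0)]:
        print(z, y, classification_loss(z, y), regression_loss(z, y))
```

For a raw score of 0 with target 0, the clipped squared error is 0, while the sigmoid cross-entropy is log 2 (about 0.693), since the sigmoid of 0 is 0.5. This shows how the two definitions differ on an exact integer score and makes no claim about the training dynamics observed in the experiments.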

3. Results

In Table 1, we observe that Task 1 is learned on the training set, but generalisation degrades for longer sequences. Longer sequences in the training data do improve generalisation, but it is still not perfect in most cases. Interestingly, starting from correct weights with the classification setup does not lead to better generalisation. This is intriguing, as the training apparently unlearns the correct weight values. To avoid this problem, we developed the regression setup, which does not change the correct weights (last row in Table 1), but it does not learn well from random initialisation. In Task 2, learning from random weights never leads to perfect training or generalisation. Even with correct initialisation, generalisation is mostly poor.

The models used are custom designed for their respective tasks, and with the correct weights they solve these tasks. In practice, however, they do not converge to the correct weights. The systematic behaviour needed to solve these tasks does not emerge in our models during training, irrespective of the setup used, and when the standard classification setup is used, the correctly initialised models even unlearn the correct weights. When the regression setup is used, the correctly initialised models do not unlearn the correct weights. However, we observe that the regression setup does not improve the ability of the models to learn systematically and generalise more effectively to longer sequences. This leads us to believe that neither of these setups is suited to this type of task.

4. Conclusion

We make a few interesting observations in our experiments. Learning Dyck-1 sequences from random weights is a difficult task, and the results do not generalise to long sequences in any setup not initialised with correct weights. Using longer (and thus more) training sequences leads to some improvements, but generalisation to longer sequences is still limited. Initialising the network with correct weights could help, but even that is not effective in a classification task. The tested approach of using a regression setup (clipping and MSE) can avoid unlearning of correct weights, but it hinders learning from random weight initialisation, and is thus not effective either. Given our results, it seems unlikely that RNNs would learn Dyck-1 languages in a practical scenario. Using LSTMs could improve the situation, but probably only to a limited extent, given the results by Weiss et al. [1] and Gers and Schmidhuber [5]. Overall, further studies and designs are needed to address reliable learning of Dyck-1 languages with NNs.

Our main observation is that RNNs fail to learn symbolic patterns and the systematic behaviour required to count brackets and recognise Dyck-1 sequences. This is consistent with studies that focus on the abilities of NNs to learn systematic behaviour, such as Fodor and Pylyshyn [7], Marcus et al. [8], and Lake and Baroni [9]. These studies show that NNs struggle with tasks that are simple for humans to learn from a small number of examples. The difficulty of learning symbolic patterns in a systematic manner is not exclusive to larger models, and is evidently exhibited by our very small models. Achieving systematicity in smaller models can potentially serve as a stepping stone towards achieving systematicity in larger models. In the future, we aim to develop a deeper understanding of the learning dynamics of our models and to develop methods that can improve systematic learning.

[1] Weiss, Goldberg, E. Yahav, On the practical computational power of finite precision rnns for language recognition, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, Association for Computational Linguistics, 2018, pp. 740-745. URL: https://aclanthology.org/P18-2117/. doi:10.18653/v1/P18-2117.

[2] Suzgun, Gehrmann, Belinkov, S. M. Shieber, LSTM networks can perform dynamic counting, CoRR abs/1906.03648 (2019). URL: http://arxiv.org/abs/1906.03648. arXiv:1906.03648.

[3] H. T. Siegelmann, B. G. Horne, C. L. Giles, Computational capabilities of recurrent narx neural networks, IEEE Transactions on Systems, Man, and Cybernetics, Part (Cybernetics) 27 (1997) 208-215.

[4] Leshno, V. Y. Lin, Pinkus, Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural networks 6 (1993) 861-867.

[5] F. A. Gers, J. Schmidhuber, LSTM recurrent networks learn simple context-free and context-sensitive languages, IEEE Trans. Neural Networks 12 (2001) 1333-1340. URL: https://doi.org/10.1109/72.963769. doi:10.1109/72.963769.

[6] D. P. Kingma, Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.

[7] J. A. Fodor, Z. W. Pylyshyn, Connectionism and cognitive architecture: A critical analysis, Cognition 28 (1988) 3-71.

[8] G. F. Marcus, Vijayan, S. B. Rao, P. M. Vishton, Rule learning by seven-month-old infants, Science 283 (1999) 77-80.

[9] B. M. Lake, Baroni, Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks, in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 2879-2888. URL: http://proceedings.mlr.press/v80/lake18a.html.