<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explicit Memorization for Recurrent Neural Networks with Autoencoders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Carta</string-name>
          <email>antonio.carta@di.unipi.it</email>
          <aff>Università di Pisa, Italia</aff>
        </contrib>
      </contrib-group>
      <fpage>95</fpage>
      <lpage>100</lpage>
      <abstract>
        <p>Recurrent neural networks are difficult to train due to the vanishing and exploding gradient problem. Most of the solutions in the literature revolve around the design of new models able to mitigate this issue; however, they ignore the training algorithm, relying on gradient descent and end-to-end training. In this extended abstract, we propose a conceptual separation of recurrent models into two components: a feature extractor and a memory. We introduce the Linear Memory Network (LMN), a recurrent model based on this conceptual framework. The separation of the two components allows us to concentrate on the development of better memory models and training algorithms. We exploit this model to design several algorithms to train the LMN and its hierarchical extension, and to initialize the memory based on the optimal solution of the linear autoencoder for sequences. After the initialization, the autoencoder is used as an explicit memory by encoding and decoding the entire sequence of hidden states in its memory. The experimental results show that using these algorithms, designed for memorization, we can improve the results of recurrent models on a variety of tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Recurrent Neural Networks</kwd>
        <kwd>Autoencoders</kwd>
        <kwd>modular neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recurrent neural networks (RNNs) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] solve sequential problems by iteratively updating an internal state at each timestep. Recently, RNNs obtained state-of-the-art results in several sequential domains, such as speech recognition [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and machine translation [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Despite their success, RNNs are extremely difficult to train due to the vanishing gradient problem [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which makes it extremely hard to learn long-term dependencies.
      </p>
      <p>The work outlined in this extended abstract focuses on the development of a novel memorization mechanism for recurrent neural networks. The objectives of this work are:
– the design of novel RNN models with explicit memorization;
– the design of novel training algorithms for the RNN memory;
– the study of the tradeoffs between a pure memorization approach for sequential problems and end-to-end training.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Currently, recurrent architectures are trained end-to-end by stochastic gradient descent. Most of the work in the literature tries to improve current recurrent models by proposing architectural changes. Instead, our proposal is based on two principles: the separation between memory and functional components, and the development of specialized training algorithms for recurrent networks, a field largely ignored by the current literature.</p>
    </sec>
    <sec id="sec-2">
      <title>State of the Art</title>
      <p>
        Several solutions have been proposed to mitigate the vanishing and exploding gradient problem. Gated architectures, like the LSTM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the GRU [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], address it by modifying the model architecture to reduce the vanishing gradient (without, however, eliminating the problem). Orthogonal models [
        <xref ref-type="bibr" rid="ref1 ref14 ref17">14, 17, 1</xref>
        ] solve the problem by exploiting orthogonal linear transformations and linear activation functions, which guarantee constant gradient propagation. However, orthogonal models still perform worse than gated architectures on several tasks.
      </p>
      <p>
        Attention models [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] solve the problem by performing a weighted sum of the entire sequence of previous hidden states. This approach solves the vanishing gradient problem, but it is much more computationally expensive and does not scale to long sequences. Memory-Augmented Neural Networks (MANNs) [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] are a class of models composed of a controller and an external memory, which the controller can read and write through an interface. These models try to solve the memorization problem at the model-definition level. While they obtained highly promising results on some synthetic tasks, they are extremely hard to train. Furthermore, since the model is trained end-to-end, there are no guarantees that the resulting model will use the external memory in any meaningful way.
      </p>
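      <p>As an illustration of the computational cost, the following sketch (our own illustration; not code from any cited work) computes an attention readout as a softmax-weighted sum of all stored hidden states. The per-step cost grows with the number of stored states, which is why the approach does not scale to long sequences.</p>
      <preformat>
```python
import numpy as np

def attention_readout(H, q):
    """Weighted sum of the stored hidden states (content-based attention).

    H: (T, d) matrix of previous hidden states, q: (d,) query vector.
    Every call touches all T states, so cost grows linearly with T.
    """
    scores = H @ q                    # one relevance score per timestep
    w = np.exp(scores - scores.max())
    w = w / w.sum()                   # softmax attention weights
    return w @ H                      # (d,) context vector

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 5))           # 7 stored states of dimension 5
context = attention_readout(H, rng.normal(size=5))
```
      </preformat>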
      <p>
        Linear autoencoders for sequences (LAES) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] are a fundamental component of our proposed model. The optimal solution of a LAES can be easily found with a closed-form expression [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Furthermore, the literature provides some equivalence results which bridge the gap between feedforward networks, which can see the entire sequence at once, and recurrent neural networks [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
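      <p>A compact numeric sketch of the closed-form construction, following our reading of [15] (function and variable names are ours): stack the reversed input prefixes of one sequence into a data matrix, take its SVD, and read the encoder matrices off the right singular vectors; running the induced recurrence then reconstructs each input exactly.</p>
      <preformat>
```python
import numpy as np

def laes_fit(X):
    """Closed-form linear autoencoder for one sequence X of shape (T, a)."""
    T, a = X.shape
    Xi = np.zeros((T, T * a))
    for t in range(T):
        for k in range(t + 1):
            # row t holds the reversed prefix [x_t, x_{t-1}, ..., x_1, 0, ...]
            Xi[t, k * a:(k + 1) * a] = X[t - k]
    _, _, Vt = np.linalg.svd(Xi, full_matrices=False)
    U = Vt.T                  # orthonormal columns spanning the row space
    A = U[:a].T               # input-to-memory matrix
    B = U[a:].T @ U[:-a]      # memory-to-memory matrix
    return A, B, U

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
A, B, U = laes_fit(X)

# Encode with m_t = A x_t + B m_{t-1}; decoding recovers the inputs.
m = np.zeros(A.shape[0])
for t in range(6):
    m = A @ X[t] + B @ m
    assert np.allclose(U[:3] @ m, X[t])   # decode the most recent input
```
      </preformat>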
    </sec>
    <sec id="sec-3">
      <title>Problem Statement</title>
      <p>We separate the problem of processing sequential data into two subproblems:
– the functional problem: extracting informative features from the current input, given the current state of the memory;
– the memorization problem: encoding the features extracted by the functional component into a memory.</p>
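      <p>The separation can be sketched as a minimal code interface (the class names and the tanh nonlinearity are our own illustration, not an interface defined in this work): the functional component maps the current input and memory state to features, and the memory component encodes those features into its state, so the two parts can be designed and trained independently.</p>
      <preformat>
```python
import numpy as np

class FunctionalComponent:
    """Solves the functional problem: features from (input, memory state)."""
    def __init__(self, W):
        self.W = W
    def __call__(self, x, m):
        return np.tanh(self.W @ np.concatenate([x, m]))

class MemoryComponent:
    """Solves the memorization problem: encode features into a state."""
    def __init__(self, A, B):
        self.A, self.B = A, B
    def __call__(self, h, m):
        return self.A @ h + self.B @ m   # linear, as in the LMN memory

rng = np.random.default_rng(0)
n_in, n_h, n_m = 3, 5, 4
func = FunctionalComponent(rng.normal(size=(n_h, n_in + n_m)) * 0.1)
mem = MemoryComponent(rng.normal(size=(n_m, n_h)) * 0.1,
                      rng.normal(size=(n_m, n_m)) * 0.1)

m = np.zeros(n_m)
for x in rng.normal(size=(10, n_in)):    # process a random sequence
    h = func(x, m)
    m = mem(h, m)
```
      </preformat>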
      <p>The memorization module is a recurrent component and can therefore suffer from the vanishing gradient problem. We focus our work on the development of new models and training algorithms for this component. Dividing sequential problems into two separate subproblems allows building separate solutions optimized for the peculiarities of the task, including novel architectures and training algorithms. Furthermore, each component can be simpler, since it must solve only part of the problem. As an example, the Linear Memory Network [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], one of our proposed models, is made of a feedforward network and a linear recurrent network, two components which are less complex than a monolithic RNN.</p>
      <p>[Figure 1: schematic view of the functional and memory components of the LMN and of the Multi-Scale LMN with sampling periods T0 = 1 and T1 = 2.]</p>
      <p>Another interesting property of the separation is that it allows us to concentrate on the memorization task. What are the limits of memorization? How can we learn when and what to forget? These questions become easier to address if we focus only on the memorization subtask.</p>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>
        To investigate the separation of the memory from the model architecture, we introduced the Linear Memory Network (LMN) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a novel recurrent model where the functional component is implemented by a feedforward network and the memorization component by a linear recurrence. The model update is computed as follows:
ht = σ(Wxh xt + Wmh mt-1)
mt = Whm ht + Wmm mt-1,
where ht is the hidden state, mt is the memory state, and σ is a nonlinear activation function. Despite the linearity of the recurrence, the model is equivalent to an RNN. To memorize the sequence of hidden activations h1, …, hT, we can train the memorization component to encode the sequence. Since the memorization component is linear, it can be trained using the LAES algorithm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to obtain the optimal autoencoder of the hidden state sequence. Using this initialization technique, the memory component is able to encode the hidden state sequence, and therefore the memory can be used to represent the entire sequence of extracted features. After the initialization, the model is finetuned with end-to-end training. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we extended the RNN pretraining algorithm in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], based on the LAES algorithm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], to pretrain the LMN. In our experiments, we found that the memory of the LMN, especially when initialized with the LAES, is superior to gated architectures when it comes to learning long-term dependencies. Table 1 shows the frame-level accuracy on the sequence modeling problem on four different MIDI datasets.
      </p>
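      <p>Because the memory recurrence is linear, unrolling it shows that the final memory state is a linear encoding of the whole hidden-state sequence, mT = Σk Wmm^(T-k) Whm hk, which is precisely the kind of map a linear autoencoder can learn to invert. A small numeric check (the sizes and random weights are our own illustration):</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_m, T = 3, 4, 6
Whm = rng.normal(size=(n_m, n_h)) * 0.5
Wmm = rng.normal(size=(n_m, n_m)) * 0.5
H = rng.normal(size=(T, n_h))          # hidden states h_1, ..., h_T

# Run the linear memory recurrence m_t = Whm h_t + Wmm m_{t-1}.
m = np.zeros(n_m)
for t in range(T):
    m = Whm @ H[t] + Wmm @ m

# Unrolled form (0-based index k): m_T = sum_k Wmm^(T-1-k) Whm h_k.
m_unrolled = sum(
    np.linalg.matrix_power(Wmm, T - 1 - k) @ Whm @ H[k] for k in range(T)
)
```
      </preformat>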
      <p>
        In a follow-up work, we extended the LMN by separating the memory component into k separate modules, each one reading the sequence of hidden activations at a different sampling rate. The resulting model, dubbed Multi-Scale LMN, is a hierarchical model inspired by the Clockwork RNN [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Hierarchical models are especially useful to process long sequences that contain long-term dependencies. The architecture of the Multi-Scale LMN can shorten the length of the dependencies between the input elements by subsampling the original sequence. The model is trained incrementally by adding a new memory module after a fixed number of epochs, each one initialized with the LMN pretraining algorithm. The experimental results improve on the state of the art on the sequence generation and common-suffix TIMIT tasks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] compared to equivalent Clockwork RNNs and LSTMs. Figure 1 shows a schematic view of the architectures of the LMN and the Multi-Scale LMN.
      </p>
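      <p>The subsampling scheme can be sketched as follows (the number of modules, the Clockwork-style periods Ti = 2^i, and the weight shapes are our assumptions for illustration): module i updates only every Ti steps, so higher modules read an exponentially shorter subsampled sequence, which shortens the dependency paths.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(2)
n_h = 4
periods = [1, 2, 4]                  # T_i: module i updates every T_i steps
Whm = [rng.normal(size=(n_h, n_h)) * 0.1 for _ in periods]
Wmm = [rng.normal(size=(n_h, n_h)) * 0.1 for _ in periods]
mems = [np.zeros(n_h) for _ in periods]

updates = [0 for _ in periods]
for t in range(8):
    h = rng.normal(size=n_h)         # stand-in for the functional output
    for i, T_i in enumerate(periods):
        if t % T_i == 0:             # module i sees a subsampled sequence
            mems[i] = Whm[i] @ h + Wmm[i] @ mems[i]
            updates[i] += 1
```
      </preformat>
      <p>Over eight steps the three modules update 8, 4, and 2 times respectively, i.e. each higher module processes half the sequence seen by the one below it.</p>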
      <p>
        LMNs are related to another class of recurrent models: orthogonal recurrent networks. By imposing orthogonality on the LMN memory and truncating the gradient, an approach inspired by the original LSTM training algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the network gains constant propagation of the gradient, and therefore the vanishing and exploding gradient problems are provably eliminated. However, compared to orthogonal models, the LMN has two distinct advantages. First, despite the linearity of the memory, the entire network is still nonlinear, since the functional component is a (possibly multi-layer) feedforward network; orthogonal models, instead, can only use a limited class of activation functions to guarantee constant gradient propagation. Second, pretraining with the LAES is much more effective than a random orthogonal initialization, since the resulting model is an optimal autoencoder. These advantages can also be seen in the experimental results, where the LMN achieves better results than any other orthogonal model in the literature on sequential MNIST and permuted MNIST. Other experimental results on TIMIT show a similar trend.
      </p>
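      <p>The constant gradient propagation granted by an orthogonal linear memory can be checked numerically. The sketch below (using a random orthogonal matrix obtained from a QR decomposition, not the LAES initialization discussed above) backpropagates a vector through many linear steps; with an orthogonal recurrence its norm is preserved exactly.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # random orthogonal matrix

v = rng.normal(size=6)
g = v.copy()
for _ in range(1000):   # backprop through 1000 linear recurrent steps
    g = Q.T @ g         # multiply by the transposed Jacobian of m_t = Q m_{t-1}
```
      </preformat>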
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>The results of our research show that recurrent architectures can benefit from better training algorithms and initializations focused on the memorization subtask. The proposed conceptual separation isolates the memory component, making it possible to design novel models and training algorithms for it. An example of this approach is the Linear Memory Network, where the linearity of the memory is exploited to develop a pretraining algorithm that initializes the memory with the optimal autoencoder. The connection with the work on orthogonal models is fundamental to guarantee the properties necessary to learn long-term dependencies. We remark that these results are a consequence of the separation of the model into two components.</p>
      <p>In general, the experimental results show consistent improvements on several challenging datasets. The improvements are especially evident on datasets that require the memorization of long sequences, like complex sequences of notes in MIDI datasets or sequential pixel MNIST. These datasets are especially difficult for RNNs, which suffer from the vanishing gradient problem. They are also difficult for LSTMs, which tend to forget their input after a long sequence, due to the exponential effect of the forget gate. However, it must be noted that on different datasets, like several natural language processing benchmarks, the ability to forget past information seems to be a key component of every successful model. Therefore, we believe it is important to study new approaches that are able to combine the advantages of pure memorization models with architectures that are able to selectively forget.</p>
      <p>This line of research offers several directions for future work. For example, the linearity of the memory makes it a natural target for investigating the application of Hessian-free optimization methods to RNNs.</p>
      <p>Another possible line of research is the application of our approach to other
fields. We are currently evaluating the domain of continual learning for sequential
data as a possible avenue for future research. Continual learning models must
be able to continually learn from new data without forgetting the old samples.
We believe that this is a setting that could benefit from a separate memory, able
to recognize the different samples and account for the differences between each
subtask.</p>
      <p>In conclusion, we believe a stronger focus on the memorization properties of
recurrent models and training algorithms can bring large benefits to the field.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arjovsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Unitary evolution recurrent neural networks</article-title>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1511.06464
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bacciu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sperduti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Linear Memory Networks</article-title>
          . In: ICANN (
          <year>2019</year>
          ), http://arxiv.org/abs/1811.03356
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          .
          <source>CoRR abs/1409.0</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1409.0473
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Boulanger-Lewandowski</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription</article-title>
          . In: ICML (
          <year>2012</year>
          ), http://arxiv.org/abs/1206.6392
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chung</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling</article-title>
          .
          <source>CoRR abs/1412.3</source>
          ,
          <issue>1</issue>
          -
          <fpage>9</fpage>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1412.3555
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Elman</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Finding structure in time</article-title>
          .
          <source>Cognitive science 14</source>
          (
          <issue>2</issue>
          ),
          <fpage>179</fpage>
          -
          <lpage>211</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>A.r.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Speech recognition with deep recurrent neural networks</article-title>
          .
          <source>IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <fpage>6645</fpage>
          -
          <lpage>6649</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wayne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danihelka</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Neural Turing Machines</article-title>
          .
          <source>CoRR abs/1410.5</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1410.5401
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wayne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danihelka</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grabska-Barwińska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colmenarejo</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramalho</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agapiou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badia</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hermann</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zwols</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrovski</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>King</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Summerfield</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blunsom</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassabis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Hybrid computing using a neural network with dynamic external memory</article-title>
          .
          <source>Nature</source>
          <volume>538</volume>
          (
          <issue>7626</issue>
          ),
          <fpage>471</fpage>
          -
          <lpage>476</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1038/nature20101, http://dx.doi.org/10.1038/nature20101
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The vanishing gradient problem during learning recurrent neural nets and problem solutions</article-title>
          .
          <source>International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems</source>
          <volume>6</volume>
          (
          <issue>02</issue>
          ),
          <fpage>107</fpage>
          -
          <lpage>116</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krikun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorat</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viégas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hughes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation</article-title>
          . In: TACL (nov
          <year>2017</year>
          ), http://arxiv.org/abs/1611.04558
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Koutník</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greff</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A Clockwork RNN</article-title>
          .
          <source>arXiv preprint arXiv:1402.3511</source>
          (
          <year>2014</year>
          ), http://proceedings.mlr.press/v32/koutnik14.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mhammedi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellicar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bailey</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections</article-title>
          . In: ICML. pp.
          <fpage>2401</fpage>
          -
          <lpage>2409</lpage>
          (dec
          <year>2017</year>
          ), http://arxiv.org/abs/1612.00188
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Sperduti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Linear autoencoder networks for structured data</article-title>
          .
          <source>In: International Workshop on Neural-Symbolic Learning and Reasoning</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sperduti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Equivalence results between feedforward and recurrent neural networks for sequences</article-title>
          .
          <source>IJCAI International Joint Conference on Artificial Intelligence</source>
          ,
          <fpage>3827</fpage>
          -
          <lpage>3833</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Vorontsov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trabelsi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kadoury</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>On orthogonality and learning recurrent networks with long term dependencies</article-title>
          .
          <source>In: ICML</source>
          . pp.
          <fpage>3570</fpage>
          -
          <lpage>3578</lpage>
          (jan
          <year>2017</year>
          ), http://arxiv.org/abs/1702.00071
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>