<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kyra Ahrens</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Kerzel</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jae Hee Lee</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cornelius Weber</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Wermter</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources compared with existing datasets, yet yields a comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We demonstrate that within a few epochs, the subtasks required to reason over relative directions, such as recognizing and locating objects in a scene and estimating their intrinsic orientations, are learned in the order in which relative directions are intuitively processed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Reasoning to solve complex spatial tasks like grounding directional relations in an intrinsic frame of reference can be decomposed into a set of subtasks that are hierarchically organized. Consider two objects o1 and o2 in an image, where each of the objects has a clear front side and orientation. Learning to answer whether the triple (o1, r, o2) holds for a given directional relation r in a frame of reference that is intrinsic to o2 spans the following stages (see Fig. 1 for an example; a code sketch of the underlying geometry follows the list):
1. Both the target object and the reference object have to be recognized in the image (existence prediction). In other words, an agent must initially be capable of answering questions such as “Is o1 in the image?” or “Is o2 in the image?”.
2. Next, the object’s pose that defines the relative relation has to be discerned, enabling an agent to successfully respond to questions such as “What is the cardinal direction of o2?” (orientation prediction).
3. Predicting the directional relation using the intrinsic frame of reference is learned by combining the two preceding competencies, allowing an agent to answer a question similar to “What is the relation between o1 and o2 from the perspective of o2?” (relation prediction). Likewise, predicting which target object is in a specific relation to some reference object (link prediction) can be answered, e.g., “Taking o2’s perspective, which object is in relation r to it?”.
4. Based on all previous stages, an agent can determine whether a specific directional relationship exists between the two objects (triple classification), thus successfully providing an answer to a question like “From o2’s perspective, is o1 left of o2?”.</p>
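      <p>To make the geometry behind stages 3 and 4 concrete, the following minimal Python sketch decides a triple symbolically from ground-truth positions and orientations. It is our illustration only: the function names and the four-sector binning are assumptions, not code from the GRiD-A-3D release.</p>
      <preformat>
# Illustrative sketch: classify the direction of a target object in the
# intrinsic frame of a reference object, then decide triple classification.
import math

DIRECTIONS = ["front", "left", "behind", "right"]  # counter-clockwise bins

def relative_direction(target_xy, reference_xy, reference_heading):
    """Direction of the target in the reference object's intrinsic frame;
    reference_heading is the angle (in radians) the reference faces."""
    dx = target_xy[0] - reference_xy[0]
    dy = target_xy[1] - reference_xy[1]
    # Angle to the target, rotated into the reference object's frame.
    angle = math.atan2(dy, dx) - reference_heading
    # Shift by 45 degrees so each bin is centered on its axis, then cut
    # the full circle into four 90-degree sectors.
    sector = int(((angle + math.pi / 4) % (2 * math.pi)) // (math.pi / 2))
    return DIRECTIONS[sector]

def triple_holds(target_xy, reference_xy, reference_heading, relation):
    # Triple classification: does (o1, r, o2) hold for relation r?
    return relative_direction(target_xy, reference_xy, reference_heading) == relation
      </preformat>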
      <sec id="sec-1-3">
        <title>In previous work [1], we showed that enabling a VQA</title>
        <p>tial bias that confounds the analysis of reasoning about
relative directions.</p>
      <p>
        In the present work, we introduce GRiD-A-3D, a novel and simplified diagnostic VQA dataset, which allows for a more efficient and targeted analysis of the corresponding reasoning process by removing possible biases from using real-world objects. Subsequently, we report the performance of the two established end-to-end VQA models MAC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and FiLM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on this dataset. With our experiments, we show that, when trained on GRiD-A-3D, both models exhibit a qualitative learning behavior similar to that of their replicas trained on the more complex, non-abstract GRiD-3D [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] dataset. At the same time, training converges up to three times faster, thus allowing more efficient neural experiments.
      </p>
      <p>We summarize the contributions made in this paper as follows:
• We complement our GRiD-3D benchmark suite (https://github.com/knowledgetechnologyuhh/grid-3d) with a novel GRiD-A-3D (Grounding Relative Directions with Abstract objects in 3D) dataset that enables a faster and less biased evaluation of spatial reasoning behavior in VQA compared with the original GRiD-3D dataset.
• We verify our previous research findings with the new dataset, thus underpinning our hypothesis that multi-task learning enables neural models to learn to ground relative directions in VQA.
• Furthermore, we add evidence to our hypothesis that during multi-task learning, the spatial reasoning abilities of a neural model develop along the intuitive order of the corresponding subtasks, thus forming an implicit curriculum.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Aiming to provide a suitable setup to assess the reasoning capabilities of neural models on vision-language tasks, diagnostic datasets have been introduced [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. One of the major advantages of such datasets is that they provide structured and tightly controlled scenes to prevent models from circumventing reasoning by exploiting conditional biases that commonly arise with real-world images. A particular advantage of diagnostic datasets based on synthetic images is that their generation process is scalable, customizable, and therefore allows for a more fine-grained performance analysis.
      </p>
      <p>
        The vast majority of diagnostic VQA datasets is limited to spatial reasoning tasks based on the absolute frame of reference, i.e., object positions are relative to the viewer of the image. Yet, taking into account more realistic scenarios such as multi-agent dialogue in a situated environment, understanding relative directions is a prerequisite for meaningful communication. As a consequence, early models to learn symbolic reasoning with relative directions have been proposed [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. However, they inherently assume the availability of scene annotations in terms of object labels and spatial relations instead of requiring a model to infer such information implicitly.
      </p>
      <p>
        An early synthetic dataset providing a test bed for grounding relative directions is Rel3D [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Since Rel3D is restricted to two objects per scene and one single task, i.e., binary prediction of (object1, relation, object2) triples, GRiD-3D [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was introduced, which combines the advantage of a rich number of tasks and questions as found in traditional synthetic VQA datasets with the challenge of grounding relative directions.
      </p>
      <p>
        GRiD-3D is the first dataset of its kind to target multi-task learning of relative directions in a controlled setting. With this dataset, it was shown that, before learning how to answer the question whether a triple (object1, relation, object2) holds, neural end-to-end VQA models rely on an implicit curriculum of related subtasks such as object detection, orientation estimation, and relation prediction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Objects in GRiD-3D cover a variety of categories, ranging from humanoids and animals to furniture and vehicles. Naturally, such objects differ in terms of proportions, complexity, and, most importantly, symmetry, which can be a crucial determinant of how easily a neural network can infer their orientation (and perform associated tasks).
      </p>
      <p>In this work, we aim to provide a variation of the original dataset that ensures the elimination of such potential distortions (see Fig. 3 for examples), enabling a model to more quickly learn how to ground relative directions, which may be of particular value for few-shot, transfer, and curriculum learning scenarios. Accordingly, we extend the GRiD-3D benchmark suite towards another diagnostic VQA dataset with abstract objects.</p>
    </sec>
    <sec id="sec-3">
      <title>3. GRiD-A-3D Abstract VQA</title>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>
        With the introduction of the GRiD-3D dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we
could show that neural VQA models are capable of
grounding relative directions by implicitly deriving a
curriculum of subtasks. In order to further generalize the
previous findings, we extend our GRiD-3D suite towards
a diagnostic dataset based on abstract objects whose
cardinal direction is indicated by colored arrows.
      </p>
      <p>Overview and statistics. With our new GRiD-A-3D dataset, we address the following six tasks: Existence Prediction, Orientation Prediction, Link Prediction, Relation Prediction, Counting, and Triple Classification. All 8 000 rendered images are split without overlap into 6 400 for training, 800 for validation, and 800 for testing. The 432 948 corresponding input questions follow largely the same 80:10:10 ratio, yielding 346 984, 43 393, and 42 571 questions for each set, respectively. The GRiD-A-3D dataset has an order of magnitude comparable with the GRiD-3D dataset, both in terms of image and question counts.</p>
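      <p>As a purely illustrative sanity check of the quoted statistics, the split sizes and ratios can be verified with a few lines of Python:</p>
      <preformat>
# Dataset statistics quoted above: images and questions per split.
images = {"train": 6400, "val": 800, "test": 800}
questions = {"train": 346984, "val": 43393, "test": 42571}

assert sum(images.values()) == 8000
assert sum(questions.values()) == 432948

# The question splits follow roughly the same 80:10:10 ratio as the images.
for split, n in questions.items():
    print(split, round(n / sum(questions.values()), 3))
# train 0.801 / val 0.1 / test 0.098
      </preformat>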
      <p>Image generation. For each image, we generate a scene by randomly placing three to five distinct objects onto a plane and render the corresponding image with 480x320 pixel resolution via Blender (https://www.blender.org/). We choose a consistent lighting setup across all images, add shadows to each object, and restrict the image generation to a fixed camera angle, thus obtaining one image per scene.</p>
      <p>Our object set comprises gray-coloured polygonal prisms approximating a cylinder shape, each marked with an arrow in one of six different colours: three primary colours (red, blue, and green) and three additive secondary colours (yellow, cyan, and magenta). The tip of each arrow depicts the object’s front side, allowing for distinct relative directions between objects in the image. An overview of all six objects can be found in Fig. 4. Note that the overall object count in the original GRiD-3D dataset is 28, whereas GRiD-A-3D is restricted to six different objects.</p>
      <p>Figure 4: The six abstract objects used in the GRiD-A-3D dataset.</p>
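      <p>A scene-generation step of this kind could be scripted against the Blender Python API (bpy) roughly as follows. This is a hedged sketch of the procedure described above, not the authors’ pipeline: mesh parameters, the arrow markers, and the output path are placeholders.</p>
      <preformat>
# Hypothetical sketch of the scene generation using the Blender Python API.
import random
import bpy

scene = bpy.context.scene
scene.render.resolution_x = 480      # 480x320 pixel resolution
scene.render.resolution_y = 320

for i in range(random.randint(3, 5)):    # three to five objects per scene
    # Approximate a cylinder with a low-vertex polygonal prism; collision
    # checks for distinct, non-overlapping placement are omitted here.
    bpy.ops.mesh.primitive_cylinder_add(
        vertices=12,
        radius=0.5,
        depth=1.0,
        location=(random.uniform(-3, 3), random.uniform(-3, 3), 0.5),
    )
    obj = bpy.context.active_object
    # Random intrinsic orientation; the colored arrow marking the front
    # side would be attached as a child mesh in the actual pipeline.
    obj.rotation_euler = (0.0, 0.0, random.uniform(0.0, 2 * 3.14159))

scene.render.filepath = "/tmp/scene.png"  # placeholder output path
bpy.ops.render.render(write_still=True)   # fixed camera and lighting setup
      </preformat>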
      <p>
        Question generation. In addition to rendering the images from our sampled scenes, we obtain scene graphs equipped with ground truth information such as the absolute position, orientation, and relative directions of objects, which we use to generate questions related to the six tasks contained in GRiD-A-3D. Our question generation builds upon the framework provided with CLEVR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], whose question templates, synonym, and metadata files we tailor to our dataset. Likewise, our question generation pipeline is expressed as a template-based functional program executed on each scene graph.
      </p>
      <p>We follow the depth-first search strategy to determine and instantiate question-answer pairs that comply with the scene information and can therefore be considered valid. We set additional constraints to make sure that answers are uniformly distributed for each task. To ensure a wide variety of natural language questions, we sample from a rich set of differently phrased question templates for each reasoning task and randomly omit utterances or replace words with suitable synonyms.</p>
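      <p>The following sketch illustrates template-based instantiation with a uniform answer distribution for the Existence Prediction task. Template strings, scene-graph fields, and the cap mechanism are hypothetical stand-ins for the tailored CLEVR machinery, not the released code.</p>
      <preformat>
# Illustrative template-based question generation with uniform answers.
import random
from collections import Counter

TEMPLATES = [
    "Is there a {color} object in the image?",
    "Is a {color} object visible?",   # differently phrased variant
]
COLORS = ["red", "blue", "green", "yellow", "cyan", "magenta"]

def existence_questions(scene_graphs, per_answer_cap=1000):
    counts = Counter()
    questions = []
    for scene in scene_graphs:
        present = {obj["color"] for obj in scene["objects"]}
        color = random.choice(COLORS)
        answer = "yes" if color in present else "no"
        if counts[answer] == per_answer_cap:   # keep answers uniform
            continue
        counts[answer] += 1
        template = random.choice(TEMPLATES)    # sample phrasing variants
        questions.append((template.format(color=color), answer))
    return questions
      </preformat>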
    </sec>
    <sec id="sec-4">
      <title>4. Evaluations</title>
      <p>
        For our experiments, we train MAC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and FiLM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], two state-of-the-art neural end-to-end VQA architectures, on our new GRiD-A-3D dataset (cf. Fig. 5). Both architectures take raw RGB images and plain text question-answer pairs as input for training. Image features are extracted by a pretrained ResNet101 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for both models, while questions are encoded by a GRU [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (FiLM) or a bidirectional LSTM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] (MAC), respectively. Subsequently, image and question features are fed to special neural units called residual blocks (FiLM) or MAC cells (MAC). A chain of such units provides the core of the reasoning process.
      </p>
      <p>Figure 5: Neural end-to-end VQA models used for our experiments. The generic units FiLM and MAC (here colored in orange and blue, respectively) control how the question and image features are being processed. (a) FiLM, (b) MAC.</p>
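      <p>For orientation, a minimal FiLM-style residual block in PyTorch could look as follows. This is a sketch in the spirit of [<xref ref-type="bibr" rid="ref3">3</xref>]: the class name, layer sizes, and normalization choice are our assumptions, not the exact published architecture.</p>
      <preformat>
# Minimal sketch of feature-wise linear modulation (FiLM) in a residual block.
import torch.nn as nn
import torch.nn.functional as F

class FiLMBlock(nn.Module):
    def __init__(self, channels, question_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels, affine=False)
        # The question encoding predicts one (gamma, beta) pair per channel.
        self.film = nn.Linear(question_dim, 2 * channels)

    def forward(self, x, q):
        gamma, beta = self.film(q).chunk(2, dim=1)
        h = self.bn(self.conv(x))
        # Scale and shift each feature map conditioned on the question.
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return x + F.relu(h)   # residual connection around the modulated path
      </preformat>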
      <p>
        We use existing PyTorch (https://pytorch.org/) implementations of FiLM and MAC with their default hyperparameters for the published CLEVR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] dataset evaluations, except for the number of MAC cells, which we reduce to four to prevent overfitting. All experiments are run for 100 epochs and repeated three times with different seeds to reduce the impact of the random initialization of the models on the results. Fig. 6 shows the mean and the standard deviation of the evaluations.
      </p>
      <p>Figure 6: Accuracy per training epoch for FiLM and MAC trained on GRiD-3D and GRiD-A-3D, plotted per task (e.g., Counting).</p>
      <p>We interpret our results in the following way: Existence and Orientation Prediction are learned earlier than the other tasks. We explain this observation with the fact that these tasks only require a model to focus on one single object.</p>
      <p>For the most straightforward task of Existence Prediction, we observe similar behavior for the two datasets: Both converge to an accuracy of almost 100% at nearly the same time. For the Orientation Prediction task, we observe convergence to an accuracy of over 80% for both datasets.</p>
      <p>Noticeably, the learning happens faster for the abstract GRiD-A-3D dataset. The shorter learning time can be attributed to the more unequivocal identification of the front and back sides of the abstract objects due to the lack of symmetry-related noise, as shown in Fig. 3. The fact that the accuracy on Orientation Prediction is capped at about 85% can be explained by objects placed close to the border between two cardinal directions, as such cases are difficult for the models to learn and classify (see the border-case sketch following this discussion).</p>
      <p>A similar learning behavior can be observed for the more complex tasks of Relation Prediction, Triple Classification, and Link Prediction, where both models converge faster when trained on GRiD-A-3D and also reach slightly higher accuracy. Similarly to the results on the Orientation Prediction task, the main reason for these observations may lie in the facilitated learning conditions due to the lack of front-back symmetries or strong occlusions with the abstract objects. This effect is most pronounced for Link Prediction, i.e., predicting which target object is in a given relation to some reference object. We attribute this observation to the smaller set of objects in the GRiD-A-3D dataset.</p>
      <p>Finally, we observe a mixed result for the Counting task: While the learning of both VQA models converges faster for the GRiD-A-3D dataset, higher accuracy is reached for the GRiD-3D dataset. We hypothesize that this higher accuracy stems from the more diverse-looking objects in the GRiD-3D dataset, facilitating the models to distinguish and thus count multiple objects in close proximity.</p>
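      <p>The border effect on Orientation Prediction can be made tangible with the same four-sector binning as in the earlier sketch: a small perturbation of a heading that lies on a bin border flips the cardinal label, so such samples carry inherently ambiguous supervision. The labels and margin below are our illustration, not values from the paper.</p>
      <preformat>
# Why near-border orientations are hard: a tiny change flips the label.
import math

CARDINALS = ["east", "north", "west", "south"]   # counter-clockwise order

def cardinal_direction(heading):
    """Bin a heading angle (radians) into one of four cardinal directions."""
    sector = int(((heading + math.pi / 4) % (2 * math.pi)) // (math.pi / 2))
    return CARDINALS[sector]

border = math.pi / 4                        # exactly between east and north
print(cardinal_direction(border - 0.01))    # east
print(cardinal_direction(border + 0.01))    # north
      </preformat>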
      <p>In summary, our results suggest the following two findings: First, the abstract GRiD-A-3D dataset leads to faster learning and can thus enable more computationally efficient experimentation while achieving results comparable to the original GRiD-3D dataset. Second, the results support our assumption of a chronology of subtasks, as Existence Prediction and Orientation Prediction are learned before the models can reason about relative directions.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The authors gratefully acknowledge support from the German Research Foundation DFG for the projects CML TRR169, LeCAREbot and IDEAS.</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <sec id="sec-6-1">
        <title>This work is an extension to previous work on grounding</title>
        <p>relative directions with end-to-end neural VQA
architectures. We provide a comprehensive, simplified
GRiDA-3D dataset with abstract objects that shows similar
behavior to the original GRiD-3D dataset when learned
by the two established VQA models FiLM and MAC. With
our experiments, we show that the learning of tasks that
focus on a single object like object recognition and
orientation prediction happens prior to learning to ground
relative directions and object counting.</p>
      <p>The abstract nature of the dataset eliminates approximate front-back object symmetries that can have a negative impact on object orientation prediction and all reasoning tasks about directional relations that build upon it. Furthermore, the simplification of the object set allows for conducting experiments with a more comprehensive dataset. In future work, this will allow us to conduct fast pilot studies on curriculum and transfer learning based on the intuitive dependency of the different spatial reasoning tasks on one another and the observed implicit curriculum.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kerzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahrens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wermter</surname>
          </string-name>
          ,
          <article-title>What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning</article-title>
          , in:
          <source>Thirty-First International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Compositional attention networks for machine reasoning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Strub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dumoulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <article-title>FiLM: Visual Reasoning with a General Conditioning Layer</article-title>
          ,
          <source>in: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning</article-title>
          , in:
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Moratz</surname>
          </string-name>
          , T. Tenbrink,
          <article-title>Spatial Reference in Linguistic Human-Robot Interaction: Iterative, Empirically Supported Development of a Model of Projective Relations</article-title>
          ,
          <source>Spatial Cognition &amp; Computation</source>
          <volume>6</volume>
          (
          <year>2006</year>
          ). doi:10.1207/s15427633scc0601_3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Renz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <article-title>StarVars: Effective Reasoning About Relative Directions</article-title>
          , in:
          <source>Twenty-Third International Joint Conference on Artificial Intelligence</source>
          , AAAI Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Renz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <article-title>Qualitative Representation and Reasoning over Direction Relations across Different Frames of Reference</article-title>
          ,
          <source>in: Sixteenth International Conference on Principles of Knowledge Representation and Reasoning</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          ,
          <source>in: NIPS 2014 Workshop on Deep Learning</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . doi:10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>