<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIT Manipal at ImageCLEF 2019 Visual Question Answering in Medical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhishek Thanki</string-name>
          <email>abhishek.harish@learner.manipal.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishnamoorthi Makkithaya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Manipal Institute of Technology</institution>
          ,
          <addr-line>Manipal - Udupi District, Karnataka - 576104</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of MIT, Manipal in the ImageCLEF 2019 VQA-Med task. The goal of the task was to build a system that takes as input a medical image and a clinically relevant question, and generates a clinically relevant answer to the question by using the medical image. We explored a different approach compared to most VQA systems and focused on the answer-generation part. We used an encoder-decoder architecture based on deep learning, where a CNN pre-trained on ImageNet was used to extract visual features from the input image, and a combination of word embeddings pre-trained on PubMed articles with a 2-layer LSTM was used to extract textual features from the question. Both visual and textual features were integrated using a simple element-wise multiplication technique. The integrated features were then passed into an LSTM decoder which generated a natural language answer. We submitted a total of 8 runs for this task, and the best model achieved a BLEU score of 0.462.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Question Answering</kwd>
        <kwd>Encoder-Decoder</kwd>
        <kwd>BLEU</kwd>
        <kwd>CNN</kwd>
        <kwd>Word2Vec</kwd>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Visual Question Answering (VQA) is a task which consists of building an AI system
that takes as input an image and a question in natural language, and is expected
to produce a correct answer to the question by using both the visual
and the textual information. This problem lies at the intersection of two important fields
of computer science: Computer Vision (CV) and Natural Language Processing
(NLP). The answers can be as simple as a single word, a simple yes/no or true/false,
or they can consist of multiple words.</p>
      <p>
        The VQA task has so far made great progress in the general domain thanks to
rapid advances in the fields of computer vision and natural language
processing, but the problem is relatively new in the medical domain.
ImageCLEF conducts many tasks related to multimedia retrieval in domains
such as medicine, security, lifelogging, and nature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Visual Question
Answering in the Medical Domain is one such task, and this is the second year that the VQA-Med
task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has been organized, following last year's success. Given a medical image and a
clinically relevant question in natural language about the image, the task was to
build a system that would produce a clinically relevant natural language answer
to the question by using the image.
      </p>
      <p>
        In this paper, we discuss our approach to building such a system, which was
inspired by VQA research in the general domain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and by sequence-generation work in
natural language processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We built an encoder-decoder architecture
using recent advances in deep learning. The model consists of
a Convolutional Neural Network (CNN) pre-trained on ImageNet, a
word2vec model pre-trained on PubMed articles [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to extract word embeddings, and
two Long Short-Term Memory (LSTM) networks. Image features were extracted
using the pre-trained CNN; we experimented with two architectures, VGG-19
and DenseNet-201 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Question features were extracted using the pre-trained
word2vec model and a 2-layer LSTM network. Both visual and textual features
were integrated using element-wise multiplication, and the resulting features were
fed to an LSTM sequence-generation network to produce the output answers.
      </p>
      <p>This paper is organized as follows: Section 2 provides
information about the dataset provided for this challenge. Section 3 presents
related work that inspired our model architecture. Section 4 describes
our encoder-decoder method. Section 5 describes our
experiments and the corresponding results. Finally, we conclude
the paper in Section 6 by discussing the task, our method, and future
improvements.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset Description</title>
      <p>In the VQA-Med 2019 challenge, three datasets were provided:</p>
      <p>The training set consisted of 3,200 medical images with 12,792 question-answer pairs.</p>
      <p>The validation set consisted of 500 medical images with 2,000 question-answer pairs.</p>
      <p>The test set consisted of 500 medical images with 500 question-answer pairs.</p>
      <p>Furthermore, the data in the dataset can be divided into four main categories
as follows:
1. Modality: This category includes questions based on images of structural
or functional parts of the body. For example: ultrasound, CT, etc.
2. Plane: Questions in this category ask about the plane of the medical
image. This is important because different projections allow different
tissues to be depicted. For example: axial, sagittal, etc.
3. Organ System: Questions in this category ask about the different organs
in the human body. For example: breast, skull and contents, etc.
4. Abnormality: Questions in this category consist of detecting any
abnormality present in the input image and identifying the type of abnormality.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        VQA in the general domain is not a new problem, and much more data is available
than for VQA in the medical domain. For these reasons a lot of work has
been done in the general domain. Our work in this paper takes inspiration from
several sources. First, we are inspired by the simplicity of the baseline model
from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which still achieved good accuracy. Second, since the vocabulary of the
medical domain is quite different from general English, a word2vec model trained on
general English text does not produce vectors that best
encode the questions; we therefore used a word2vec model trained on PubMed
articles to encode the question tokens. Third, while many models for
VQA in the general domain use multi-class classification to generate answers, we
chose a different approach of using sequence generation to generate the answers,
since it made more intuitive sense to us.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <p>Our system consists of two main components: an encoder and a decoder. The
encoder consists of 3 sub-components: transfer learning to extract features
from images, word2vec + a 2-layer LSTM to extract features from questions, and
element-wise multiplication to fuse the visual and textual features. The decoder
consists of a sequence-generating LSTM network which generates the output
answers for the input question and image. Fig. 1 shows the high-level architecture
of our system.</p>
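      <p>The following is a minimal, illustrative sketch of this architecture using the Keras functional API. The feature dimension, vocabulary size, maximum question length, and the choice of initialising the decoder state with the fused features are placeholder assumptions for illustration, not the exact configuration of our runs.</p>
      <preformat>
# Illustrative sketch of the encoder-decoder model (tf.keras); all sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_UNITS = 256      # feature dimension (128/256/512 were tried in our runs)
EMB_DIM = 200        # word-embedding dimension (200 or 400 in our runs)
VOCAB_SIZE = 5000    # answer vocabulary size (placeholder assumption)
MAX_Q_LEN = 30       # maximum question length in tokens (placeholder assumption)

# Visual branch: CNN features (e.g. a 4096-d VGG-19 output) projected by a trainable dense layer.
cnn_feats = layers.Input(shape=(4096,), name="cnn_features")
img_vec = layers.Dense(NUM_UNITS, activation="relu")(cnn_feats)

# Textual branch: pre-trained word2vec vectors passed through a 2-layer LSTM.
q_vectors = layers.Input(shape=(MAX_Q_LEN, EMB_DIM), name="question_word2vec")
q_hidden = layers.LSTM(NUM_UNITS, return_sequences=True)(q_vectors)
q_vec = layers.LSTM(NUM_UNITS)(q_hidden)

# Fusion: element-wise multiplication of the visual and textual feature vectors.
fused = layers.Multiply()([img_vec, q_vec])

# Decoder: an LSTM conditioned on the fused features generates the answer tokens
# (teacher forcing during training).
ans_in = layers.Input(shape=(None,), name="answer_tokens")
ans_emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(ans_in)
dec_out = layers.LSTM(NUM_UNITS, return_sequences=True)(ans_emb,
                                                        initial_state=[fused, fused])
probs = layers.TimeDistributed(layers.Dense(VOCAB_SIZE, activation="softmax"))(dec_out)

model = Model([cnn_feats, q_vectors, ans_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
      </preformat>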
      <p>
        The encoder part of the model consists of:
– Pre-trained CNN on ImageNet: Deep CNNs trained on large-scale datasets
such as ImageNet have been shown to be excellent at transfer
learning, which is why we chose transfer learning with a pre-trained CNN
to extract visual features. For this purpose we experimented with the
VGG-19 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and DenseNet-201 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] architectures. For VGG-19, we extracted the
output features from its last hidden layer, while in the case of DenseNet-201
we extracted the output features of the conv5_block32_concat layer
(see the sketch following this list). These extracted features were then passed
through a trainable dense layer to obtain the final visual features in
various dimensions such as 128, 256, and 512.
– Pre-trained word2vec model + 2-layer LSTM network: To extract features
from the input question, we pre-processed it with a custom function
which cleans the input sentence and outputs a list of tokens. These tokens
were then converted to vector form using a pre-trained word2vec model
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ][
        <xref ref-type="bibr" rid="ref9">9</xref>
        ][
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These vectors were then passed through a 2-layer LSTM network
to produce the output textual features. We chose an LSTM
network because of its ability to model complex relationships
within a sentence and because it is not affected by the vanishing
gradient problem [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ][
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
– Feature fusion: Here, we used a simple element-wise multiplication technique
to combine the visual and textual features.
      </p>
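      <p>As an illustration, the sketch below shows how such visual features could be extracted with the Keras implementations of the two backbones. The layer names ("fc2" and "conv5_block32_concat") follow the standard Keras naming; the global average pooling used to flatten the DenseNet-201 feature map, the image size, and the file paths are our assumptions for this example.</p>
      <preformat>
# Illustrative visual-feature extraction with ImageNet pre-trained backbones (tf.keras).
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG19, DenseNet201
from tensorflow.keras.applications.vgg19 import preprocess_input as vgg_preprocess
from tensorflow.keras.applications.densenet import preprocess_input as densenet_preprocess

# VGG-19: output of the last hidden fully connected layer ("fc2", 4096-d).
vgg = VGG19(weights="imagenet")
vgg_extractor = Model(vgg.input, vgg.get_layer("fc2").output)

# DenseNet-201: output of the conv5_block32_concat layer, pooled to a single vector.
densenet = DenseNet201(weights="imagenet", include_top=False)
pooled = layers.GlobalAveragePooling2D()(densenet.get_layer("conv5_block32_concat").output)
densenet_extractor = Model(densenet.input, pooled)

def extract_features(img_path, extractor, preprocess, size=(224, 224)):
    """Load one image, preprocess it, and return its feature vector."""
    img = tf.keras.preprocessing.image.load_img(img_path, target_size=size)
    x = tf.keras.preprocessing.image.img_to_array(img)[np.newaxis, ...]
    return extractor.predict(preprocess(x))[0]

# Example usage (file paths are placeholders):
# vgg_feats = extract_features("image.jpg", vgg_extractor, vgg_preprocess)
# dn_feats  = extract_features("image.jpg", densenet_extractor, densenet_preprocess)
      </preformat>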
      <p>
        The decoder part of the model consists of an LSTM network. This network takes
as input the output features from the encoder part as well as the state of the
second LSTM. The sequence-generation step is started by providing as input
a special token &lt;SOS&gt;. Subsequent output tokens produced by the model are
fed back into the model to produce the next token. This process continues
until a certain number of tokens has been produced or a special token &lt;EOS&gt; is
predicted.
      </p>
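      <p>The snippet below is a minimal sketch of this generation loop. The greedy argmax token choice, the decoder_step helper, the token-index mappings, and the maximum length are illustrative assumptions about how such a loop could be wired up, not the exact implementation used in our runs.</p>
      <preformat>
# Illustrative greedy decoding loop with SOS/EOS tokens; decoder_step is an assumed
# helper that runs one LSTM step and returns (token probabilities, new hidden state).
import numpy as np

SOS, EOS = "&lt;SOS&gt;", "&lt;EOS&gt;"

def generate_answer(decoder_step, fused_state, token_to_id, id_to_token, max_len=20):
    """Generate answer tokens one at a time until EOS or max_len is reached."""
    state = fused_state                    # decoder state initialised from fused features
    token_id = token_to_id[SOS]            # start the sequence with the SOS token
    answer = []
    for _ in range(max_len):
        probs, state = decoder_step(token_id, state)
        token_id = int(np.argmax(probs))   # greedy choice: most probable next token
        token = id_to_token[token_id]
        if token == EOS:
            break
        answer.append(token)
    return " ".join(answer)
      </preformat>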
    </sec>
    <sec id="sec-5">
      <title>Experiments and Results</title>
      <p>We submitted eight runs to ImageCLEF 2019 VQA-Med:
1. VGG19-N128: This run used VGG-19 for transfer learning, and the number
of neurons in the encoder LSTM networks, the 2 dense layers, and the
decoder LSTM network was set to 128. This network was trained for 100 epochs.
2. VGG19-N256: This was the same as run number one except that the number of
neurons was 256 and it was trained for 200 epochs.
3. VGG19-N256-Dropout: This run was the same as run number two except that a
dropout of 0.2 was used in the dense layers and it was trained for 150 epochs.
4. DenseNet201-N256: This run used DenseNet-201 for transfer learning, and
the number of neurons in the encoder LSTM networks, the 2 dense layers,
and the decoder LSTM network was set to 256. This network was trained for 150
epochs.
5. DenseNet201-N256-D400: This run was similar to run five except that
the embedding dimension used was 400 instead of 200, which was used in all
the previous experiments.
6. DenseNet201-N256: This run was similar to run five except that the network
was trained for 200 epochs.
7. DenseNet201-N128: This run was similar to run five except that the number
of neurons was 256.
8. VGG19-N128: This run was identical to the first run.</p>
      <p>The VGG19-N128 model achieves the best BLEU score, while
DenseNet201-N256-Dropout achieves the best strict accuracy. Table 1 shows the results achieved
by all models on the test set.</p>
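      <p>For reference, sentence-level BLEU between a generated answer and the ground-truth answer can be computed, for instance, with NLTK. The snippet below only illustrates the metric; it is not the official VQA-Med evaluation script, and the example answer strings are made up.</p>
      <preformat>
# Illustrative sentence-level BLEU computation with NLTK (not the official evaluation script).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "axial plane".split()   # ground-truth answer tokens (made-up example)
candidate = "axial plane".split()   # model-generated answer tokens (made-up example)

# Smoothing avoids zero scores for short answers that lack higher-order n-grams.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
      </preformat>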
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper describes our participation in the ImageCLEF 2019 VQA-Med
challenge. We used a CNN pre-trained on the ImageNet dataset to extract visual
features, a word2vec + 2-layer LSTM network to extract textual features, and a
sequence-generating LSTM network to generate the output answer tokens. Our
approach differed from most VQA systems in that it focused on using sequence generation to
generate the answers, while using a simple element-wise multiplication technique
to integrate the visual and textual features. We would have liked to try
out attention-based techniques for integrating the visual and textual features, but we
were not able to do so due to time constraints. This is something we
will explore in the future to improve the model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Peteri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimuk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarasau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ben</surname>
            <given-names>Abacha</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.A.</given-names>
            ,
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ,
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.T.</given-names>
            ,
            <surname>Piras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.T.</given-names>
            ,
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.M.</given-names>
            ,
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.G.S.</given-names>
            ,
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Kavallieratou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>del Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.R.</given-names>
            , Rodríguez, C.C.,
            <surname>Vasillopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Karampidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature</article-title>
          . In:
          <article-title>Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the 10th International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ),
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Lugano,
          <source>Switzerland (September 9-12</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.A.</given-names>
            ,
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.V.</given-names>
            ,
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          , Muller, H.:
          <article-title>Vqa-med: Overview of the medical visual question answering task at imageclef 2019</article-title>
          .
          <source>In: CLEF 2019 Working Notes. CEUR Workshop Proceedings (CEURWS.org)</source>
          ,
          CEUR-WS.org &lt;http://ceur-ws.org/Vol-
          <volume>2380</volume>
          /&gt;, Lugano,
          <source>Switzerland (September 9-12</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Antol</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            , J., Mitchell,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : VQA:
          <article-title>Visual Question Answering</article-title>
          . In: International Conference on Computer Vision (ICCV) (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , Van Merrienboer,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Learning phrase representations using rnn encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brokos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Deep relevance ranking using enhanced document-query interactions</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <year>1849</year>
          –
          <year>1860</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          , Brussels, Belgium (Oct-Nov
          <year>2018</year>
          ), https://www.aclweb.org/anthology/D18-1211
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          :
          <article-title>Densely connected convolutional networks</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <volume>4700</volume>
          –
          <issue>4708</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>3111</volume>
          –
          <issue>3119</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yih</surname>
          </string-name>
          , W.t.,
          <string-name>
            <surname>Zweig</surname>
          </string-name>
          , G.:
          <article-title>Linguistic regularities in continuous space word representations</article-title>
          .
          <source>In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <volume>746</volume>
          –
          <issue>751</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <volume>1735</volume>
          –
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frasconi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Gradient flow in recurrent nets: the difficulty of learning long-term dependencies</article-title>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>