<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tlemcen University at ImageCLEF 2019 Visual Question Answering Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rabia Bounaama</string-name>
          <email>rabea.bounaama@univ-tlemcen.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed El Amine Abderrahim</string-name>
          <email>mohammedelamine.abderrahim@univ-tlemcen.dz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Biomedical Engineering Laboratory, Tlemcen University</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratory of Arabic Natural Language Processing, Tlemcen University</institution>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe the participation of the techno team in the ImageCLEF 2019 Medical Visual Question Answering (VQA-Med) task. VQA-Med is a challenge that combines computer vision with Natural Language Processing (NLP) in order to build systems that answer questions about medical images. To solve the task, we used a method that jointly learns from text and images, testing a publicly available VQA network. We applied a neural network with the visual semantic embeddings method to this task. Our approach, based on a CNN and an RNN model, achieved a BLEU score of 0.486.</p>
      </abstract>
      <kwd-group>
        <kwd>CNNs</kwd>
        <kwd>neural networks</kwd>
        <kwd>VQA-Med task</kwd>
        <kwd>RNN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Many complex questions can be asked about medical images. Radiology, which
is very rich in images and textual reports, is a prime area where VQA could assist
radiologists in reporting findings for a complicated patient or benefit trainees who have
questions about the size of a mass or the presence of a fracture [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A VQA system is
expected to reason over both visual and textual information to infer the correct answer
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Medical VQA can thus be defined as a computer vision and Artificial Intelligence
(AI) problem that aims to answer questions asked by health care professionals about
medical images [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Artificial neural network models have been studied for many years
in the hope of achieving human-like performance in several fields, such as speech and
image understanding [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. VQA could be used to improve human-computer interaction
as a natural way to query visual content, and it has garnered a large amount of interest from
the deep learning, computer vision, and NLP communities [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        ImageCLEF provides medical image collections annotated for several
evaluation challenges, including VQA, image captioning, and tuberculosis tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ,
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We
participated in the VQA task in the medical domain.
      </p>
      <p>Participating systems are tasked with answering questions based on the visual
content of the images. The evaluation of the participating VQA-Med systems is conducted
using two metrics: BLEU and accuracy.</p>
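      <p>As an illustration, the sketch below shows one way these two metrics can be
computed for single-word answers. It is a minimal sketch, assuming two parallel lists of
reference and predicted answer strings; it uses NLTK's sentence-level BLEU and is not the
official evaluation script, whose exact parameters we do not reproduce here.</p>
      <preformat>
# Minimal sketch: accuracy and BLEU over predicted answers.
# Assumes parallel lists of reference and predicted answer strings;
# not the official VQA-Med evaluation script.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate(references, predictions):
    smooth = SmoothingFunction().method1  # avoids zero BLEU on very short answers
    exact, bleu_total = 0, 0.0
    for ref, pred in zip(references, predictions):
        ref_tokens, pred_tokens = ref.lower().split(), pred.lower().split()
        if ref_tokens == pred_tokens:
            exact += 1
        bleu_total += sentence_bleu([ref_tokens], pred_tokens,
                                    smoothing_function=smooth)
    n = len(references)
    return exact / n, bleu_total / n

accuracy, bleu = evaluate(["axial", "yes"], ["axial", "no"])
print(f"accuracy={accuracy:.3f}  bleu={bleu:.3f}")
      </preformat>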
      <p>The remainder of this paper is organized as follows. In Section 2 we present some
related work. In Section 3 we describe our approach; more specifically, we present
the dataset and discuss in detail the models and techniques used in our submitted run.
The conclusion and future work perspectives are presented in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Convolutional Neural Networks (CNNs) trained on the ImageNet
classification task are a promising model for medical modality classification, as in the work of [
        <xref ref-type="bibr" rid="ref1 ref6">1,6</xref>
        ], where the authors
of the Novasearch team [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] evaluated a CNN classifier on medical images in order to
build a Medical Image Retrieval System (MIRS) that classifies each subfigure from a
collection of compound figures found in the biomedical literature.
      </p>
      <p>
        Another work on the subfigure classification task at ImageCLEF 2016 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used modern
deep CNNs to predict the modality of a medical image within two main groups:
Diagnostic Images and Generic Biomedical Illustrations. To extract information from
medical images and build their textual features, the authors used Bag-of-Words (BoW) and
Bag of Visual Words (BoVW) approaches.
      </p>
      <p>
        The work of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used Multi-modal Factorized Bilinear (MFB) pooling as well
as Multi-modal Factorized High-order (MFH) pooling to build
a system able to reason over medical images and questions and generate the
corresponding answers for the ImageCLEF Med-VQA 2018 task.
      </p>
      <p>
        The main idea proposed by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] concerns automatically generated questions about
images selected from the literature, based on ImageCLEF data. The authors applied a Stacked
Attention Network (SAN), which was proposed to allow multi-step reasoning for
answer prediction, and Multimodal Compact Bilinear pooling (MCB) with two attention
layers based on CNNs.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced VQA-RAD, a manually constructed VQA dataset in
radiology in which clinicians asked naturally occurring questions about radiology images
and provided reference answers, in order to encourage the community to design VQA
tools with the goal of improving patient care. They used a balanced sample of images
from MedPix. The annotations of the dataset were generated by volunteer clinical trainees
and validated by expert radiologists, and the data were trained using deep learning and
bootstrapping approaches. The data are provided in JSON, XML, and Excel formats. The
final VQA-RAD dataset contains 3515 visual questions in total.
      </p>
      <p>
        Another line of work, in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], focuses on new ways of synthesizing QA pairs from
currently available image description datasets. The authors propose to use neural networks and
visual semantic embeddings, using an LSTM on the MS-COCO dataset. Their final model was
not able to consume image features as large as 4096 dimensions at a single time step, and
the dimensionality reduction loses some useful information.
      </p>
      <p>Deep neural networks have recently achieved very good results in representation
learning and classification of images. Despite all this effort, there is still no widely used
method for constructing these systems. This is due to the fact that the medical domain
requires high accuracy and, in particular, a very low rate of false negatives, so we
studied several VQA networks and selected deep neural network models for our
participation in VQA-Med 2019.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <sec id="sec-3-1">
        <title>3.1 Dataset</title>
        <p>In the scope of the VQA-Med challenge, three datasets were provided:
– The training set contains 12792 question-answer pairs associated with 3200 training
images.
– The validation set contains 2000 question-answer pairs associated with 500
validation images.
– The test set contains 500 questions associated with 500 test images.</p>
        <p>The classes for each question category are: Modality, Plane, Organ system, and
Abnormality (see Table 1).</p>
        <p>Table 1. Example question-answer pairs for each category:
– Modality: Q: what type of imaging modality is used to acquire the image? A: us ultrasound
– Plane: Q: what plane was used? A: axial
– Organ system: Q: what organ system is evaluated primarily? A: face, sinuses, and neck
– Abnormality: Q: is this image normal? A: yes</p>
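        <p>As a hedged illustration of how such data can be consumed, the sketch below reads
question-answer pairs from a pipe-delimited text file of the form image_id|question|answer;
the file name and the exact delimiter are assumptions, not the official distribution layout.</p>
        <preformat>
# Minimal sketch: loading question-answer pairs.
# The file name "train_qa.txt" and the pipe-delimited format
# image_id|question|answer are assumptions, not the official layout.
def load_qa_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            image_id, question, answer = line.rstrip("\n").split("|", 2)
            pairs.append((image_id, question, answer))
    return pairs

train_pairs = load_qa_pairs("train_qa.txt")
print(len(train_pairs), "training question-answer pairs")
        </preformat>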
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Method and Results</title>
        <p>To solve the VQA-Med task at ImageCLEF 2019, we chose to use CNN and RNN
models without intermediate stages such as object detection and image segmentation.</p>
        <p>
          Most existing VQA methods and algorithms consist of three stages:
– extracting image features (image featurization);
– extracting question features (question featurization);
– combining the features to produce an answer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] (see figure 1); a generic version of this pipeline is sketched below.
        </p>
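        <p>A minimal sketch of these three stages as one generic pipeline, where the concrete
encoders are left as parameters (all names here are illustrative, not from the original code):</p>
        <preformat>
# Minimal sketch: the three generic VQA stages as one pipeline.
# featurize_image / featurize_question / combine stand in for any
# concrete encoders and fusion scheme (e.g. CNN, LSTM, softmax classifier).
def answer_question(image, question,
                    featurize_image, featurize_question, combine):
    v = featurize_image(image)        # stage 1: image featurization
    q = featurize_question(question)  # stage 2: question featurization
    return combine(v, q)              # stage 3: fuse features, produce answer
        </preformat>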
        <p>
          In our case, we chose the approach used by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We treat the task as a classification
problem and apply a neural network with the visual semantic embeddings method. We
assume that each answer consists of only a single word.
        </p>
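        <p>Concretely, treating the task as classification amounts to building an answer
vocabulary and mapping each single-word answer to an integer class label. The sketch below,
which builds on the hypothetical loader shown earlier, illustrates this; the variable names
are ours:</p>
        <preformat>
# Minimal sketch: casting answer prediction as N-class classification.
# Each distinct training answer becomes one class; builds on train_pairs
# from the loading sketch above.
from collections import Counter

answers = [answer for (_, _, answer) in train_pairs]
vocab = [a for a, _ in Counter(answers).most_common()]
answer_to_label = {a: i for i, a in enumerate(vocab)}

labels = [answer_to_label[a] for a in answers]  # targets for the softmax layer
num_classes = len(vocab)
print(num_classes, "answer classes")
        </preformat>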
        <p>The process of building the classification model includes preprocessing and the
extraction of visual features from already labelled images and their associated questions. Our system
learns image regions relevant to answering the clinical questions. Images and questions are
represented as global features, which are merged to predict the answers. The
effectiveness of the model is evaluated on new images.</p>
        <p>We used visual semantic embeddings to connect a CNN and a Recurrent
Neural Network (RNN). Our model is built on the basis of the LSTM (Long Short-Term
Memory), a form of RNN that is easier to train. Because of its very
uniform architecture for extracting features from images, we used the 16-layer VGG-16
network, and to encode the input questions we used an RNN, which
is the appropriate technique for sequential data [<xref ref-type="bibr" rid="ref9">9</xref>]. The LSTM outputs are
fed into a softmax layer to generate answers.</p>
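        <p>A condensed Keras sketch of this kind of architecture is given below: frozen VGG-16
image features, an LSTM question encoder, and a softmax classifier over the answer classes.
The hyperparameters, input sizes, and the multiplicative fusion are illustrative assumptions,
not the exact configuration of our submitted run.</p>
        <preformat>
# Hedged sketch: CNN + LSTM visual semantic embedding model.
# Sizes and hyperparameters are illustrative, not the submitted setup.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

vocab_size, max_q_len, num_classes = 5000, 20, 1500  # assumed sizes

# Image branch: frozen VGG-16 features projected into a joint space.
cnn = VGG16(weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False
image_in = layers.Input(shape=(224, 224, 3))
image_emb = layers.Dense(512, activation="tanh")(cnn(image_in))

# Question branch: word embeddings fed to an LSTM encoder.
question_in = layers.Input(shape=(max_q_len,))
q = layers.Embedding(vocab_size, 300)(question_in)
q_emb = layers.Dense(512, activation="tanh")(layers.LSTM(512)(q))

# Fusion: merge the two global features, then classify the answer.
merged = layers.multiply([image_emb, q_emb])
answer_out = layers.Dense(num_classes, activation="softmax")(merged)

model = Model([image_in, question_in], answer_out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
        </preformat>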
        <p>The answer prediction task is modeled as an N-class classification problem and
performed using a one-layer neural network.</p>
        <p>Our model's results are shown in Table 2 below.</p>
        <p>The analysis of the results obtained by all the participants, in terms of accuracy and
BLEU, shows that the best approach is the one used by the Hanlin team. The results
obtained by all the participants vary between 0.624 and 0 in terms of accuracy and
between 0.644 and 0.025 in terms of BLEU. The Hanlin team thus obtained the highest
scores (0.624, 0.644) while the IITISM @ CLEF team obtained the lowest scores (0.0,
0.025).</p>
        <p>The results obtained by our system (0.462, 0.485), compared with the other systems,
are encouraging, and we hope to make improvements in the future.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>In this paper, we presented the techno team's approach to the VQA task at ImageCLEF 2019.
We evaluated a currently existing VQA system by testing a publicly available VQA
network.</p>
      <p>We found that the RNN model based on feature fusion helps to improve
the system's performance, but it is still very naive in many situations.</p>
      <p>It should be noted that we encountered the problem of overfitting, which is a major
problem in neural networks.</p>
      <p>As a result, we achieved a 0.486 BLEU score in the challenge. In the future, we
plan to continue working on this system in order to obtain the optimal deep learning
layer structure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>Jason J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gayen</surname>
          </string-name>
          , Soumya and Abacha, Asma Ben and
          <string-name>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <article-title>Dina. A dataset of clinically generated visual questions and answers about radiology images</article-title>
          .
          <source>Scientific data</source>
          ,
          <volume>5</volume>
          ,
          <issue>180251</issue>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Peng</surname>
          </string-name>
          , Yalei and Liu, Feifan and Rosen,
          <string-name>
            <surname>Max P.UMass at ImageCLEF Medical Visual Question Answering (Med-VQA</surname>
          </string-name>
          )
          <year>2018</year>
          Task (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Antonie</surname>
          </string-name>
          , Maria-Luiza and Zaiane,
          <string-name>
            <surname>Osmar</surname>
            <given-names>R</given-names>
          </string-name>
          and Coman, Alexandru.
          <article-title>Application of data mining techniques for medical image classification</article-title>
          .
          <source>Proceedings of the Second International Conference on Multimedia Data Mining</source>
          ,
          <fpage>94</fpage>
          -
          <lpage>101</lpage>
          .Springer-Verlag.
          <article-title>(</article-title>
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kafle</surname>
          </string-name>
          , Kushal and Kanan, Christopher.
          <article-title>Visual question answering: Datasets, algorithms, and future challenges</article-title>
          .
          <source>Computer Vision</source>
          and Image Understanding,
          <volume>163</volume>
          ,
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          .Elsevier,(
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Asma</given-names>
            <surname>Ben</surname>
          </string-name>
          <article-title>Abacha and Sadid A. Hasan and Vivek V. Datla and Joey Liu and Dina Demner-Fushman and Henning Müller</article-title>
          .
          <article-title>VQA-Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019</article-title>
          .
          <article-title>CLEF2019 Working Notes</article-title>
          .CEUR Workshop Proceedings.CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt;.September 9-12</source>
          .Lugano,
          <string-name>
            <surname>Switzerland</surname>
          </string-name>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Koitka</surname>
          </string-name>
          , Sven and Friedrich,
          <string-name>
            <surname>Christoph M. Traditional</surname>
          </string-name>
          <article-title>Feature Engineering and Deep Learning Approaches at Medical Classification Task of ImageCLEF 2016</article-title>
          .
          <source>CLEF (Working Notes)</source>
          .
          <fpage>304</fpage>
          -
          <lpage>317</lpage>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Abacha</surname>
            , Asma Ben and Gayen, Soumya and Lau,
            <given-names>Jason J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rajaraman</surname>
          </string-name>
          , Sivaramakrishnan and
          <string-name>
            <surname>Demner-Fushman</surname>
          </string-name>
          , Dina. NLM at
          <article-title>ImageCLEF 2018 Visual Question Answering in the Medical Domain (</article-title>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ren</surname>
          </string-name>
          , Mengye and Kiros, Ryan and Zemel, Richard.
          <article-title>Exploring models and data for image question answering</article-title>
          .
          <source>Advances in neural information processing systems</source>
          .
          <volume>2953</volume>
          -
          <fpage>2961</fpage>
          . (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Liu, Pengfei and Qiu, Xipeng and Huang, Xuanjing. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101 (2016).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Ionescu, Bogdan and Müller, Henning and Péteri, Renaud and Dicente Cid, Yashin and Liauchuk, Vitali and Kovalev, Vassili and Klimuk, Dzmitri and Tarasau, Aleh and Ben Abacha, Asma and Hasan, Sadid A. and Datla, Vivek and Liu, Joey and Demner-Fushman, Dina and Dang-Nguyen, Duc-Tien and Piras, Luca and Riegler, Michael and Tran, Minh-Triet and Lux, Mathias and Gurrin, Cathal and Pelka, Obioma and Friedrich, Christoph M. and García Seco de Herrera, Alba and Garcia, Narciso and Kavallieratou, Ergina and del Blanco, Carlos Roberto and Cuevas Rodríguez, Carlos and Vasillopoulos, Nikos and Karampidis, Konstantinos and Chamberlain, Jon and Clark, Adrian and Campello, Antonio. ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature. Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer. Lugano, Switzerland. September 9-12 (2019).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>