<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Captioning for the ImageCLEF 2017 Medical Image Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Lyndon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashnil Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinman Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Technologies, University of Sydney</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Manual image annotation is a major bottleneck in the processing of medical images and the accuracy of these reports varies depending on the clinician's expertise. Automating some or all of this process would have an enormous impact in terms of efficiency, cost and accuracy. Previous approaches to automatically generating captions from images have relied on hand-crafted pipelines of feature extraction and techniques such as templating and nearest neighbour sentence retrieval to assemble likely sentences. Recent deep learning-based approaches to general image captioning use fully differentiable models to learn how to generate captions directly from images. In this paper, we address the challenge of end-to-end medical image captioning by pairing an image-encoding convolutional neural network (CNN) with a language-generating recurrent neural network (RNN). Our method is an adaptation of the NICv2 model that has shown state-of-the-art results in general image captioning. Using only data provided in the training dataset, we were able to attain a BLEU score of 0.0982 on the ImageCLEF 2017 Caption Prediction Challenge and an average F1 score of 0.0958 on the Concept Detection Challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>LSTM</kwd>
        <kwd>CNN</kwd>
        <kwd>RNN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
Generating a textual summary of the insights gleaned from a medical image is
a routine, yet nonetheless time-consuming, task requiring much human effort on
the part of highly trained clinicians. Prior efforts to automate this task relied
on hand-crafted pipelines, employing manually designed feature extraction and
techniques such as templating and sentence retrieval to assemble likely sentences
[
        <xref ref-type="bibr" rid="ref13 ref14 ref9">9, 13, 14</xref>
]. Recent deep learning-based approaches to general image captioning,
however, use fully differentiable models to learn how to generate captions directly
from images. In general, the advantage of such fully learnable models is that any
part of the model can adapt in the manner most useful for the problem at hand,
whereas a hand-designed system is constrained by the assumptions made during
feature extraction, concept detection, and sentence generation.
      </p>
      <p>
In this paper we describe the submission of the University of Sydney's
Biomedical Engineering &amp; Technology (BMET) group to the caption prediction and
concept detection tasks of the ImageCLEF 2017 caption challenge [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
]. This
submission employs a fully differentiable model, pairing an image-encoding CNN
with a language-generating RNN to generate captions for images from a range
of modalities.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
Image captioning, whereby the contents of an image are automatically described
in natural language, is a challenging task in machine learning, requiring methods
from both image and natural language processing. Many early approaches to
this problem involved complex systems comprising visual feature extractors
and rule-based methods for sentence generation. Li et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
] utilise image
feature similarity measures to locate likely n-grams from a large corpus of images
and text, then use a simple sentence template and local search to generate a
caption. Yao et al. [22] extract image features such as SIFT and edges to match
images to a concept database, then apply a graph-based ontology to these
concepts to produce readable sentences. Ordonez et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] use image features and a
ranking-based approach to locate likely sentences in an extremely large database
of images and text. Such methods require a great deal of hand-crafted
optimisation and produce systems which are brittle and limited to specialised domains.
      </p>
      <p>
        Recently, deep learning-based encoder-decoder frameworks for machine
translation [16] have been adapted and applied to the problem of image captioning. By
replacing the language-encoding Long Short-Term Memory (LSTM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] RNN
with an image-encoding CNN, the model is able to learn to generate captions
directly from images. The entire model is completely differentiable, so errors
are propagated to the different components in proportion to their contribution to
the error, allowing them to adapt appropriately. While there were several
precursors that replaced various components of existing image-to-caption frameworks
with trainable RNNs or CNNs, Vinyals et al. [19] proposed the first end-to-end
neural network based approach to captioning with their "Show and Tell" (also
called Neural Image Captioning (NIC)) model. An updated method, NICv2 [20],
won the Microsoft Common Objects in Context (MSCOCO) challenge in 2015.
Qualitative analysis has shown that neural captioning methods are preferred in
comparison with conventional nearest-neighbour sentence lookup approaches [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        There has been limited work in adapting such methods to the medical
domain, despite the large volume of image and text data found in Picture
Archiving and Communication Systems (PACS). Schlegl et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
] present the first such work, leveraging text reports to improve the
classification accuracy of a CNN applied to Optical Coherence Tomography (OCT)
images. Syeda-Mahmood et al. [17] present a method that uses hand-coded topic
extraction, hand-coded image features and an SVM-based correlation system. Shin
et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
] document efforts to mine an extremely large database of images and
text extracted from the PACS of the National Institutes of Health Clinical
Center (approximately 216 thousand images), using latent Dirichlet allocation (LDA)
to extract topics from the raw text and then correlate these topics to image
features. Kisilev et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
] propose an SVM-based approach to highlight regions of
interest (ROIs) and generate template-based captions for the Digital Database
for Screening Mammography (DDSM). This is extended using a multi-task loss
CNN in a later work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>To the best of our knowledge, only one published work exists for applying
neural image captioning to a medical dataset [15]. In this work the authors
employ an architecture similar to Vinyals et al. [19] to generate an array of
keywords for a radiological dataset.
</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
<p>Unless otherwise specified, the same method was applied for both the caption
prediction and concept detection tasks. The set of concepts assigned to an image
in the concept detection task is considered to be a caption where each concept
label is a word in the sentence. In both cases only the supplied training dataset
was used to train the models.</p>
      <sec id="sec-3-1">
        <title>Preprocessing</title>
        <p>
In order to simplify the task, each caption in the training set was preprocessed in
accordance with the task's evaluation preprocessing specifications. This involved
converting the caption to lower case, removing all punctuation (some captions
contained multiple sentences; after this step each caption became a
single sentence), removing stopwords using the NLTK [
          <xref ref-type="bibr" rid="ref1">1</xref>
] English stopword list, and
finally applying stemming using NLTK's Snowball stemmer. No preprocessing
was applied to the 'sentences' for the concept detection task. After this
preprocessing, the count of each unique word in the training corpus was taken. Words
that appeared fewer than 4 times were discarded, resulting in a dictionary
of 25237 distinct words. For the RNN framework described below, two reserved
words indicating the start and end of sentences were added to the dictionary and
used to prepend and append each sentence.
        </p>
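        <p>A minimal sketch, in Python, of this caption preprocessing and vocabulary construction is given below; the function names and the reserved start/end token strings are illustrative rather than drawn from our implementation, and NLTK's stopword corpus is assumed to be installed.</p>
        <preformat>
import string
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_caption(caption):
    # Lower-case and strip punctuation (collapsing multi-sentence captions
    # into a single sentence), drop stopwords, then stem what remains.
    words = caption.lower().translate(PUNCT_TABLE).split()
    return [STEMMER.stem(w) for w in words if w not in STOPWORDS]

def build_vocabulary(captions, min_count=4):
    # Count each unique word, discard those seen fewer than min_count
    # times, and reserve start/end-of-sentence tokens (illustrative names).
    counts = Counter(w for c in captions for w in preprocess_caption(c))
    words = ["SENT_START", "SENT_END"]
    words += [w for w, n in counts.items() if n >= min_count]
    return {word: index for index, word in enumerate(words)}
        </preformat>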
        <p>
The images are first resized to 324x324px, and a 299x299px crop is then
selected. During training this is a random crop, but during evaluation a central
crop is used. We apply image augmentation during training to regularise the
model [
          <xref ref-type="bibr" rid="ref10">10</xref>
]. This augmentation consists of distorting the image, first by randomly
flipping it horizontally and then randomly adjusting the brightness, saturation, hue
and contrast. The random cropping and distortion are performed each time an
image is passed into the model, meaning that it is extremely rare for exactly
the same image to be seen twice.
        </p>
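        <p>A minimal sketch of this resize, crop and distortion pipeline using TensorFlow 1.x tf.image operations follows; the distortion magnitudes are illustrative, as the exact thresholds are not specified above.</p>
        <preformat>
import tensorflow as tf

def preprocess_image(image, training=True):
    # Resize to 324x324, then take a 299x299 crop: random while training,
    # central at evaluation time.
    image = tf.image.resize_images(image, [324, 324])
    if training:
        image = tf.random_crop(image, [299, 299, 3])
        # Distortion: random horizontal flip, then random brightness,
        # saturation, hue and contrast (magnitudes are illustrative).
        image = tf.image.random_flip_left_right(image)
        image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
        image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
        image = tf.image.random_hue(image, max_delta=0.032)
        image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    else:
        image = tf.image.resize_image_with_crop_or_pad(image, 299, 299)
    return image
        </preformat>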
<p>A validation set was provided by the organisers of the task and was entirely
reserved for validation. No part of it was used for training; in particular, we
did not use it to build the dictionary of unique words.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Model</title>
        <p>
Our method extends Vinyals et al.'s NICv2 model [20], as per the TensorFlow-Slim
implementation available at https://github.com/tensorflow/models/tree/master/im2txt.
The NICv2 model consists of two different types of neural networks paired together
to form an image-to-language encoder-decoder pair. A CNN, specifically the
InceptionV3 [18] architecture, is used as the image encoder. InceptionV3 is one of the
most accurate architectures for general image classification according to the ImageNet [
          <xref ref-type="bibr" rid="ref2">2</xref>
] benchmark, but is significantly more computationally efficient than alternatives such
as Residual Networks [
          <xref ref-type="bibr" rid="ref5">5</xref>
]. We utilised an RNN based on LSTM units as the
language decoder, as per the original paper; however, we doubled the number of
units from 512 to 1024 as this showed improved results in our experiments.
        </p>
<p>An image is first preprocessed as described above and then fed to the input
of the CNN. The logits of the CNN are passed into a single-layer fully-connected
neural network which functions as an image embedding layer. This image
embedding then becomes the initial state of the LSTM network. As per [20], the
embedding is passed only at the initial state and is not used subsequently. At
each subsequent time step, the LSTM's output is passed to a word embedding
layer and then to a softmax layer; at each time step the output of the softmax
is the probability of each word in the dictionary. For two of the caption
prediction experiments (PRED2 &amp; PRED4) we modified the baseline language
model to use a 3-layer LSTM with a single dropout layer on the output.
Increasing the number of LSTM layers improves the ability of the language model
to represent complex sentences and long-term dependencies; industrial neural
machine translation models have been demonstrated to use decoders with up
to 8 layers [21].</p>
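        <p>The wiring described above can be sketched in TensorFlow 1.x as follows; the dimensions follow the text (1024 embedding units, 1024 LSTM units), while the variable and placeholder names are illustrative rather than those of the actual implementation.</p>
        <preformat>
import tensorflow as tf

vocab_size = 25239       # 25237 words plus the start/end tokens
embedding_size = 1024
num_lstm_units = 1024

image_features = tf.placeholder(tf.float32, [None, 2048])  # CNN logits
input_words = tf.placeholder(tf.int32, [None, None])       # caption word ids

# Single fully-connected layer acting as the image embedding.
image_embedding = tf.layers.dense(image_features, embedding_size)

# Word embedding lookup for the caption words.
word_embedding = tf.get_variable("word_emb", [vocab_size, embedding_size])
word_inputs = tf.nn.embedding_lookup(word_embedding, input_words)

lstm = tf.nn.rnn_cell.LSTMCell(num_lstm_units)

with tf.variable_scope("lstm") as scope:
    # The image embedding is fed once, to set the initial LSTM state,
    # and is not used at any later time step.
    batch = tf.shape(image_embedding)[0]
    _, initial_state = lstm(image_embedding, lstm.zero_state(batch, tf.float32))
    scope.reuse_variables()
    # Unroll over the caption words from that initial state.
    outputs, _ = tf.nn.dynamic_rnn(lstm, word_inputs,
                                   initial_state=initial_state, scope=scope)

# At every time step, a softmax over the dictionary gives the
# probability of each word.
logits = tf.layers.dense(outputs, vocab_size)
word_probabilities = tf.nn.softmax(logits)
        </preformat>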
<p>In all our models we used 1024 units for both the image and word
embedding layers. The CNN was initialised using weights from a model trained on the
ImageNet dataset, while the weights for the LSTM were initialised from a
random uniform distribution with values between -0.08 and 0.08 as per [16]. For the
final caption prediction experiment (PRED4) we attempted to transfer the
updated CNN weights from the DET2 concept detection model. This was
attempted to avoid corruption of the CNN during end-to-end training (discussed in
Sect. 4).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Training</title>
<p>The loss optimised during training is the summed cross entropy of the output of
the softmax compared to the one-hot encoding of the next word in the ground
truth sentence. This loss was minimised with standard Stochastic Gradient
Descent (SGD) using an initial learning rate of 1.0 and a decay procedure that
reduced the learning rate by half every 8 epochs (there were 164541 examples
in the training set and a batch size of 16, so each epoch contains 10284
minibatches). Gradients were clipped to 5.0 for all experiments.</p>
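        <p>A minimal sketch of this objective and optimisation schedule in TensorFlow 1.x is shown below; the placeholders stand in for the decoder outputs and target words, and their names and shapes are illustrative.</p>
        <preformat>
import tensorflow as tf

vocab_size = 25239
targets = tf.placeholder(tf.int32, [None, None])        # next-word ids
mask = tf.placeholder(tf.float32, [None, None])         # 1.0 on real words
lstm_outputs = tf.placeholder(tf.float32, [None, None, 1024])

# Per-step logits over the dictionary (standing in for the decoder output).
logits = tf.layers.dense(lstm_outputs, vocab_size)

# Summed cross entropy against the one-hot encoding of the next word.
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=logits)
loss = tf.reduce_sum(cross_entropy * mask)

# SGD at an initial rate of 1.0, halved every 8 epochs
# (10284 minibatches per epoch).
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    1.0, global_step, decay_steps=8 * 10284, decay_rate=0.5, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)

# Clip gradients to 5.0 before applying them.
grads_and_vars = optimizer.compute_gradients(loss)
grads, tvars = zip(*grads_and_vars)
grads, _ = tf.clip_by_global_norm(grads, 5.0)
train_op = optimizer.apply_gradients(zip(grads, tvars), global_step=global_step)
        </preformat>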
        <p>[Fig. 1: Overview of the model, showing a CNN vision encoder feeding an LSTM language generator, illustrated with a sample image and its generated (preprocessed) caption: "orthomorpha unk sp n holotyp b right gonopod mesal later view respect cf distal part right gonopod mesal later subor subcaud view respect scale bar 02 mm".]</p>
        <sec id="sec-3-3-1">
          <title>LSTM</title>
        </sec>
        <sec id="sec-3-3-2">
          <title>Language</title>
        </sec>
        <sec id="sec-3-3-3">
          <title>Generator</title>
        </sec>
        <sec id="sec-3-3-4">
          <title>CNN Vision Encoder</title>
        <p>As suggested by Vinyals et al. [19], we use beam search to generate sentences
at inference time. This avoids the non-trivial issue that greedily selecting the
most probable word at each time step may result in a sentence which is itself of
low probability. Ideally we would search the entire space for the most probable
sentence; however, this has an exponential computational cost, as a forward pass
through the entire model must be made for each node of the search tree. Some
search procedure is therefore required to find the most probable sentence given
limited computational resources. Our best results were achieved with a beam size
of 3 and a maximum caption length of 50. The sentence output of the concept
detection task was converted to an ordered set of concept labels.</p>
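        <p>A minimal sketch of this beam search in plain Python follows; the step function, assumed to return candidate next words with their log-probabilities for a partial sentence, stands in for a forward pass through the model.</p>
        <preformat>
import heapq

def beam_search(step, start_id, end_id, beam_size=3, max_len=50):
    beams = [(0.0, [start_id])]     # (log probability, partial sentence)
    complete = []
    for _ in range(max_len):
        candidates = []
        for log_prob, seq in beams:
            if seq[-1] == end_id:   # finished sentences leave the beam
                complete.append((log_prob, seq))
                continue
            # One forward pass per expanded hypothesis: exhaustive search
            # would require exponentially many of these.
            for word_id, word_log_prob in step(seq):
                candidates.append((log_prob + word_log_prob, seq + [word_id]))
        if not candidates:          # every hypothesis has finished
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    else:
        complete.extend(beams)      # hypotheses cut off at max_len
    return max(complete, key=lambda c: c[0])[1]
        </preformat>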
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>Tables 3 &amp; 4 detail the training, validation and test results of the various runs.
Please note that, due to the high cost of inference, the training scores are
estimated based on a random sample of 10000 images from the training set.</p>
      <p>[Examples of actual vs. predicted captions (in preprocessed form): actual
"orthomorpha latiterga sp n holotyp b right gonopod mesal later view respect cf
distal part right gonopod mesal later subor subcaud view respect scale bar 02 mm"
vs. predicted "orthomorpha unk sp n holotyp b right gonopod mesal later view
respect cf distal part right gonopod mesal later subor subcaud view respect scale
bar 02 mm" (BLEU: 0.9304, BLEU1: 0.9629, BLEU2: 0.92, BLEU3: 0.9231, BLEU4:
0.9167); actual "preoper later radiograph right knee" vs. predicted "later radiograph
right knee" (BLEU: 0.7788, BLEU1: 1.0, BLEU2: 1.0, BLEU3: 1.0).]</p>
      <p>We attempted a two-phase training procedure, as suggested by Vinyals et
al. [20], for DET2 and some unsubmitted experiments. In the first phase we froze
the CNN weights and trained only the LSTM and embedding layers. Then, once
the language model had begun to converge, we trained the entire model
end-to-end with a very small learning rate (1e-5). Vinyals et al. suggest this is
necessary as otherwise the CNN will become corrupted and never recover. However,
we found that despite training the LSTM for a very long time in the first phase
and using a very small learning rate in the second phase, we would very quickly
corrupt the CNN, as evidenced by a sharp increase in dead ReLUs and a large
decrease in BLEU score. We found that BLEU scores would eventually return to
those achieved in the first phase of training; however, the dead ReLUs did not
revive. We believe that the underlying issue is that the degree of domain transfer
required to go from general images to medical images is vastly greater than that
required to go from one collection of general images to another (i.e. ImageNet
to MSCOCO).</p>
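      <p>This two-phase schedule can be sketched in TensorFlow 1.x as below, using a toy loss over stand-in variables; the scope names and the loss are illustrative only.</p>
      <preformat>
import tensorflow as tf

# Toy stand-ins: "cnn" weights (frozen in phase one) and "lstm" weights.
with tf.variable_scope("cnn"):
    cnn_w = tf.get_variable("w", [8, 8])
with tf.variable_scope("lstm"):
    lstm_w = tf.get_variable("w", [8, 8])
loss = tf.reduce_sum(tf.square(tf.matmul(cnn_w, lstm_w)))

# Phase one: the CNN is frozen; only LSTM/embedding variables are updated.
lstm_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="lstm")
phase1_op = tf.train.GradientDescentOptimizer(1.0).minimize(
    loss, var_list=lstm_vars)

# Phase two: once the language model has begun to converge, train the
# whole model end-to-end with a very small learning rate.
phase2_op = tf.train.GradientDescentOptimizer(1e-5).minimize(loss)
      </preformat>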
      <p>Based on the small variance between our training and validation scores, we do
not believe that the models were overfitting; however, the large variance between
validation and test scores indicates that there was a large disparity between the
training and validation data and the test data. Given this lack of overfitting, we
could potentially train a much larger vision model for longer and improve the
overall performance.</p>
      <p>Additionally, the fact that we could not successfully train the vision model
without corrupting the network was a major limiting factor in our experiments.</p>
          <p>Future work will investigate the potential of larger language models and
devising a training regime that allows true end-to-end training for medical images.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>The authors are grateful to the NVIDIA Corporation for their donation of the
Titan X GPU used in this research.
15. Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J., Summers, R.M.:</p>
      <sec id="sec-4-1">
        <title>Learning to read chest x-rays: recurrent neural cascade model for automated im</title>
        <p>age annotation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 2497{2506 (2016)
16. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger,</p>
      </sec>
      <sec id="sec-4-2">
        <title>K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 3104{3112.</title>
        <p>Curran Associates, Inc. (2014)
17. Syeda-Mahmood, T., Kumar, R., Compas, C.: Learning the correlation between
images and disease labels using ambiguous learning. In: Medical Image Computing
and Computer-Assisted Intervention { MICCAI 2015. pp. 185{193. Springer, Cham
(Oct 2015)
18. Szegedy, C., Vanhoucke, V., Io e, S., Shlens, J., Wojna, Z.: Rethinking the
inception architecture for computer vision (Dec 2015)
19. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image
caption generator (Nov 2014)
20. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: Lessons learned from
the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach.</p>
        <p>Intell. (Jul 2016)
21. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,</p>
      </sec>
      <sec id="sec-4-3">
        <title>M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X.,</title>
      </sec>
      <sec id="sec-4-4">
        <title>Kaiser, ., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G.,</title>
      </sec>
      <sec id="sec-4-5">
        <title>Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O.,</title>
      </sec>
      <sec id="sec-4-6">
        <title>Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system:</title>
      </sec>
      <sec id="sec-4-7">
        <title>Bridging the gap between human and machine translation (Sep 2016)</title>
        <p>22. Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2t: Image parsing to text
description. Proc. IEEE 98(8), 1485{1508 (Aug 2010)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>NLTK: The natural language toolkit</article-title>
          .
          <source>In: Proceedings of the COLING/ACL on Interactive Presentation Sessions</source>
          . pp.
          <volume>69</volume>
–
          <fpage>72</fpage>
          . COLING-ACL '
          <fpage>06</fpage>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In: 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>248</volume>
–
          <issue>255</issue>
          (Jun
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            , R., Mitchell,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Exploring nearest neighbor approaches for image captioning</article-title>
          (May
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
<surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , García Seco de Herrera, A., Müller, H.:
          <article-title>Overview of ImageCLEFcaption 2017 - image caption prediction and concept detection for biomedical images</article-title>
          .
          <source>In: CLEF 2017 Labs Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceur-ws.org&gt;, Dublin, Ireland (September 11-14
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Deep residual learning for image recognition</article-title>
          (
          <year>Dec 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Comput</source>
          .
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <volume>1735</volume>
–1780 (Nov
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dicente Cid</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia Seco de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Islam</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of ImageCLEF 2017: Information extraction from images</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF 2017. Lecture Notes in Computer Science</source>
          , vol.
          <volume>10456</volume>
          . Springer, Dublin, Ireland (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kisilev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sason</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barkan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hashoul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Medical image description using multi-task-loss CNN</article-title>
          .
          <source>In: Deep Learning and Data Labeling for Medical Applications</source>
          , pp.
          <volume>121</volume>
–
          <fpage>129</fpage>
          . Springer, Cham (Oct
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kisilev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walach</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hashoul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barkan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ophir</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alpert</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
<article-title>Semantic description of medical image findings: structured learning approach</article-title>
          .
<source>In: Proceedings of the British Machine Vision Conference</source>
          <year>2015</year>
          . pp.
          <volume>171</volume>
          .
          <issue>1</issue>
–
          <fpage>171</fpage>
          .11.
          <string-name>
            <surname>British Machine Vision Association</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lyndon</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fulham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
<article-title>An ensemble of Fine-Tuned convolutional neural networks for medical image classification</article-title>
          .
          <source>IEEE journal of biomedical and health informatics 21(1)</source>
          ,
          <volume>31</volume>
–40 (Jan
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Composing simple image descriptions using web-scale n-grams</article-title>
          .
          <source>In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning</source>
          . pp.
          <volume>220</volume>
–
          <fpage>228</fpage>
          . CoNLL '11,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ordonez</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          :
          <article-title>Im2Text: Describing images using 1 million captioned photographs</article-title>
          . In:
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bartlett</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.Q</given-names>
          </string-name>
          . (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>24</volume>
          , pp.
          <volume>1143</volume>
–
          <fpage>1151</fpage>
          . Curran Associates, Inc. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Schlegl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waldstein</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vogl</surname>
            ,
            <given-names>W.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt-Erfurth</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langs</surname>
          </string-name>
          , G.:
          <article-title>Predicting semantic descriptions from medical images with convolutional neural networks</article-title>
          .
          <source>In: Information Processing in Medical Imaging</source>
          . pp.
          <volume>437</volume>
–
          <fpage>448</fpage>
          . Springer, Cham (Jun
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
<surname>Seff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Summers</surname>
            ,
            <given-names>R.M.:</given-names>
          </string-name>
          <article-title>Interleaved text/image deep mining on a very large-scale radiology database</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>1090</volume>
–
          <issue>1099</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Shin, H.C., Roberts, K., Lu, L., Demner-Fushman, D., Yao, J., Summers, R.M.: Learning to read chest x-rays: recurrent neural cascade model for automated image annotation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2497–2506 (2016)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc. (2014)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Syeda-Mahmood, T., Kumar, R., Compas, C.: Learning the correlation between images and disease labels using ambiguous learning. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. pp. 185–193. Springer, Cham (Oct 2015)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision (Dec 2015)</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator (Nov 2014)</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. (Jul 2016)</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>21. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system: Bridging the gap between human and machine translation (Sep 2016)</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2t: Image parsing to text description. Proc. IEEE 98(8), 1485–1508 (Aug 2010)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>