<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recurrent Attention Networks for Medical Concept Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sam Maksoud</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnold Wiliem</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Lovell</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Technology and Electrical Engineering, The University of Queensland</institution>
          ,
          <addr-line>Brisbane QLD</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the working notes for the CRADLE group's participation in the ImageCLEF2019 medical competition. Our group focused on the concept detection task which challenged participants to approximate the mapping from radiology images to concept labels. Traditionally, such a task is often modelled as an image tagging or image retrieval problem. However, we empirically discovered that many concept labels had weak visual connotations; hence, image features alone are insufficient for this task. To this end, we utilize a recurrent neural network architecture which enables our model to capture the relational dependencies among concepts in a label set to supplement visual grounding when their association to image features is weak or unclear. We also exploit soft attention and visual gating mechanisms to enable our network to dynamically regulate “where” and “when” to extract visual data for concept generation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In 2019, ImageCLEF [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] hosted its 3rd edition of the medical image captioning task.
The participants of this task were challenged to develop a method for generating
Concept Unique Identifiers (CUI) to describe the contents of a radiology image [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In
contrast to natural language captions, CUIs parse out standardized concept terms from
the medical texts. Resolving captions into key concepts alleviates the constraint of
modelling the syntactic structures of free text. Removing the language modelling component
results in a task akin to image tagging, i.e. identifying the presence of a label (CUI) by
its most distinguishable visual features.
      </p>
      <p>
        However, a considerable number of CUI terms in the supplied subset of the ROCO
dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have no obvious association to visual features. This is due to the fact that
the CUIs were extracted automatically from the figure captions using only natural
language processing tools; there was no constraint for the CUIs to be associated with
visual features. Consequently, concepts with weak visual connotations such as “study”,
“rehab” and “supplement” are abundant throughout the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; it is
unreasonable to assume that a model can learn general visual features to reliably identify
such concepts. While accurately identifying these non-visual concepts in isolation is
unlikely, if not impossible, their relevance to an image can be indirectly
estimated by modelling relational dependencies to other CUIs in the set of concept labels.
This is because all CUIs in a set of concepts are derived from a common source: the
original figure caption.
      </p>
      <p>
        Under these conditions, our group concluded it would be best to model the problem
as an image to sequence translation task; emphasizing the need to map an image to a
set of concepts, rather than mapping individual CUIs directly to image features. Thus,
we design our model as a recurrent neural network (RNN), given such networks’ unrivalled
performance in capturing long-term dependencies in sequential data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Our proposed
RNN is conditioned on features from both the image and CUI labels. We utilize a soft
attention mechanism [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] which dynamically attends to different regions of an image in
order to select the most distinguishable visual features for each CUI. In situations where
a CUI has weak visual connotations, a visual feature gating mechanism [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] allows the
model to focus on textual features as they are likely to provide greater discriminatory
power in such contexts.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset Challenges</title>
      <p>
        In order to design an appropriate model for the task, our group carried out an extensive
investigation of the supplied subset of the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. During this investigation
we identified several challenges that would complicate the task of mapping text to
visual features. These challenges pertain to incidences of redundant, inconsistent, and/or
nonsensical assignment of CUIs to an image. We describe these challenges and how
they influenced our approach to this task in detail below.
      </p>
      <p>
        First and foremost, we identified that a majority of concepts redundantly describe generic
radiology images. In Table 1 we list the top 10 most frequent concepts in the training
dataset. Eight out of the top 10 concepts (all but “alesion” and “thoracics”) could
arguably describe most of the radiology images in the dataset. The ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
exclusively contains radiological images; a concept such as “radiograph” would be
appropriate for all of the images. However, since the umbrella concept of “radiograph”
can be expressed using a variety of different CUIs, we are forced to find arbitrary
features to distinguish these cognate identifiers. The CUIs C1548003 and C1962945
describe “radiograph” as a diagnostic procedure and a diagnostic service ID respectively.
While distinguishing these different types of “radiograph” is trivial in natural language
contexts, identifying discriminating visual features is an extremely dubious pursuit. As
such, any model tasked with learning the haphazard distribution of these semantically
interchangeable (and often universal) concepts in the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is expected to
have limited generalizability.
      </p>
      <p>
        This property of the dataset has implications on the F1 score used to evaluate this
task. The F1 score will penalize models for misidentifying the arbitrary instances or
absences of these CUIs in the test data. This is because of the inherent stochasticity of
the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], where unobservable variations in source figure captions
determine the CUIs assigned to a sample.
      </p>
      <p>
        In the supplied subset of the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we observe recurring patterns of these
semantically similar CUIs. For example, C0043299 and C1962945 frequently occur as
a pair, but they also regularly occur as a triplet alongside C1548003. An RNN
architecture enables our model to exploit the statistical co-occurrence of these concepts
when modelling probability distributions for a set of CUIs [
        <xref ref-type="bibr" rid="ref11 ref7">7, 11</xref>
        ]. To achieve a
competitive F1 score, a model must not only learn “what” visual features best represent a
CUI, but also “when” that CUI is most likely to occur in a given set of labels. Since all
CUIs in a set of labels are derived from the same figure caption, modelling their
interdependencies will ensure our model is more robust to the unobservable variations in the
original figure caption. This enables our model to more reliably predict “when” a label
is assigned to an image based on the learned co-occurrence statistics with previously
generated concepts.
      </p>
      <p>
        Another challenge encountered in this task is the assignment of nonsensical CUIs by the
quickUMLS system [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] used to create the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The quickUMLS system
utilizes the CPMerge algorithm for dictionary mapping [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. CPMerge uses character
trigrams as features and maps terms to a dictionary based on overlapping features [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
This method introduces a significant source of error resulting in random and
nonsensical CUIs being extracted from a medical figure caption. Table 3 showcases examples
of when trigram feature matching resulted in nonsensical or redundant CUIs being
assigned to an image. This presents a major obstacle for multi-modal retrieval as minor
changes in descriptive syntax result in significant and erratic variations of the CUIs
extracted from the figure caption. For example, one could rephrase the last sentence for
ROCO CLEF ID 25756 in Table 3 as “Visualization of the proximal ACL is poor,
suggesting an ACL rupture”. The arbitrary decision to remove the word “substance” would
no longer produce the erroneous CUI describing 11-Deoxycortisol (C0075414) based
on its common name “Reichstein’s Substance S”.
      </p>
      <p>
        To satisfy our scientific curiosity, we compared CUIs extracted using quickUMLS to
those extracted by MetaMap [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as MetaMap is a commonly used alternative for
automatic concept extraction. In the particular instance shown in Table 2, the CUIs
produced by MetaMap are undoubtedly of higher quality than those produced by
quickUMLS. This is likely due to the fact that MetaMap does not use trigram character
matching [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and so it accurately captures C0007133 (Papillary Carcinoma) instead
of C0226964 (Papilla of tongue). In the original paper describing quickUMLS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
the authors claim the quickUMLS system could outperform MetaMap in certain tasks.
However, an important caveat of this claim is that they used SpaCy models to pre-process
texts instead of MetaMap’s inbuilt pre-processing tools [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. SpaCy pre-processing
models are trained on a general text corpus whereas MetaMap utilizes the
SPECIALIST lexicon [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Medical terms feature prominently in the SPECIALIST lexicon [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; its
lexemes are likely to be more representative of those seen in radiology figure captions.
Thus, substituting MetaMap’s pre-processing tools with SpaCy’s may not accurately
reflect the performance of the end-to-end MetaMap system.
      </p>
      <p>Furthermore, we empirically discovered that certain semantic types were more prone
to erroneous assignment; the majority of nonsensical CUIs encountered were
chemical names and abbreviations. In light of this issue, it may be worthwhile to investigate
which semantic types are more prone to error and to identify those with the strongest visual
connotations. This could assist our multi-modal retrieval models in determining how to
weigh the importance of visual features and CUI relational dependencies based on the
semantic type of a concept.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>To overcome the challenges described in Section 2 we seek to construct a model that
satisfies the following requirements:
1. It must be able to identify the most distinguishable visual characteristics for a CUI;
2. It must capture interdependencies among CUIs in a set of labels; and
3. It must be able to regulate the weight of visual features based on the variable
strength of a CUI’s visual connotation.</p>
      <p>
        To this end, the proposed methodology borrows many features from the works of Xu et
al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Although their architecture was originally designed for use in image
captioning tasks, the dynamic soft attention mechanism, the recurrent inductive bias of
long short-term memory (LSTM) networks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and deterministic visual gating mechanism can be
exploited to satisfy requirements (1), (2) and (3) respectively. We describe our
methodology in detail below.
      </p>
      <p>
Firstly, we resize all images to 224x224x3 pixels in order to exploit a VGG16 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
convolutional neural network (CNN) pre-trained on ImageNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Although the
distribution of images in the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] differs greatly from the ImageNet dataset, there
is empirical evidence to suggest that ImageNet-trained CNNs produce state-of-the-art
results when transfer learning techniques are applied to smaller datasets [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Given
that the ROCO dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is over 200x smaller than ImageNet, we exploit
ImageNet-trained VGG16 models to benefit from this effect of transfer learning. Thus, we use the
Keras [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] implementation of a VGG16 model with pre-trained ImageNet weights and
extract the 14x14x512 vector from the “block-4 max-pooling” intermediary layer to
represent the image features. We keep the weights of the CNN fixed during training to
limit the number of trainable parameters; hence reducing the complexity of our model.
The image features are then passed into a recurrent network where each CUI is
processed one at a time until maximum time T has passed. The unconstrained maximum
number of CUIs in the training data is 72; however, we observe that we can reduce the
number of time steps by 74% and retain 99% of the training data if we constrain the
maximum number of CUIs to 19. Hence, to maximize efficiency, we exclude samples
with more than 19 CUIs. We add “START” and “END” tokens respectively to the
beginning and end of each label set; NULL tokens are added to sets with fewer than 19
CUIs to attain a fixed length time sequence T = 21.
      </p>
      <p>
We pre-process each label set such that each CUI is represented by its unique index
in the concept vocabulary of size V = 5531. To represent concept features, we train an
embedding space E ∈ R^(V×d), where d is the concept vector dimension. At the beginning of
every time step, the CUI index at position t is used to retrieve its vector representation
Xt ∈ R^(1×d) from the embedding. Meanwhile, the attention mechanism takes the LSTM
hidden state vector ht to construct a probability distribution At over the 14x14 spatial
dimensions of the image; we multiply the feature vector by At and average over the spatial
dimensions to produce a visual context vector Ct ∈ R^(1×512) as per [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
A visual sentinel learns to estimate a gating scalar St ∈ [0, 1] from ht to dynamically
assign an attention weighting to Ct; this process is described in depth in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. As Ct
and St are both produced as a function of ht, the network learns “where” to look for
discriminatory visual features and how important those visual features are in generating
the CUI at time t + 1.
      </p>
      <p>Once we multiply the context vector Ct by the gating scalar St, we concatenate the image
features with the CUI feature vector Xt along the last dimension to produce the (512 + d)-dimensional
input to the LSTM network. A fully connected layer with ReLU activation
reduces the D-dimensional output of the LSTM network to a vector with d
dimensions; a residual connection to the previous CUI is added by adding Xt−1 to the output.
We then multiply the resulting vector by P ∈ R^(d×V) and apply a softmax function to
construct a probability distribution over all the concepts in the vocabulary. At t = 0,
the LSTM is initialized on a global image feature vector G. To produce G, image features
are averaged along their 14x14 spatial dimensions and pushed through a fully connected
layer to create a vector with the same dimensions as the LSTM input, i.e. 512 + d.
The protocol described above represents the general framework for all 6 model
variants used in this task. The learning rate lr = 0.0001 and batch size n = 125 were
fixed across all variant training protocols and their performance was evaluated on the
validation dataset after 20 epochs. We now describe each model variant in detail below.
</p>
      <sec id="sec-2-1">
        <title>Model A</title>
        <p>
          Model A is the standard implementation of our model. We set the dimensions D and d to
1024 and 512 respectively. The loss at each time step is calculated as the cross entropy
between the estimated probability distribution and the ground truth concept label. In
addition to using cross entropy loss, we use an alpha regularizing strategy described in
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to regulate the outputs of the attention mechanism. When no constraints are placed
on an attention network, a neural network can output nonsensical weights to optimize
performance on training data. To ensure the attention mechanism produces salient
attention weights, we first construct an attention matrix α ∈ R^(196×T) from the probability
distributions over the 14x14 spatial dimensions at each time step. As described in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
we calculate the alpha regularizing term Lalpha from α as follows:
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>L_{xu} = \sum_{i}^{N} \Big( 1 - \sum_{t=0}^{C} \alpha_{ti} \Big)^{2}</tex-math>
          </disp-formula>
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math>L_{SAL} = \frac{1}{C} \sum_{t}^{C} \frac{\max_{i}(\alpha_{ti}) - \operatorname{mean}_{i}(\alpha_{ti})}{\operatorname{mean}_{i}(\alpha_{ti})}</tex-math>
          </disp-formula>
          <disp-formula id="eq3">
            <label>(3)</label>
            <tex-math>L_{TD} = \frac{1}{N} \sum_{i=0}^{N} \frac{\operatorname{std}_{t}(\alpha_{ti})}{\operatorname{mean}_{t}(\alpha_{ti})}</tex-math>
          </disp-formula>
where Lxu is the alpha regularising term in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], t represents the time axis, i represents
the probability distribution axis (N = 196 spatial regions and C time steps), maxi and
meani are the maximum and mean values along the column axis, and stdt and meant are
the standard deviation and mean along the row axis of αti. Lxu ensures all image regions
receive attention over the course of generating each CUI, LSAL ensures the attention
mechanism produces salient attention maps at each time step, and LTD ensures that the
attention mechanism is not biased towards any particular image region over the course
of generation. The final alpha term can thus be written as:
          <disp-formula id="eq4">
            <label>(4)</label>
            <tex-math>L_{alpha} = \lambda_{1} L_{xu} + \frac{\lambda_{2}}{\max(\varepsilon, L_{SAL})} + \frac{\lambda_{3}}{\max(\varepsilon, L_{TD})}</tex-math>
          </disp-formula>
where λ1, λ2 and λ3 are hyper-parameters that scale the contribution of each term, and
ε is used to avoid zero division and exploding gradients in the initial training steps. This
loss term is then added to the total cross entropy loss and we perform standard
backpropagation with ADAM optimisation [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
During training, we implement a teacher forcing training protocol [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] where we feed
the ground truth CUI to the LSTM at every time step. During inference, the “START”
token is fed into the LSTM network to get the probability distribution for the first CUI;
the index with the highest probability estimate is used to generate the CUI for that time
step. This process is repeated until a terminal ”END” sequence token is produced or
maximum time steps T have passed.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Model B 3.3</title>
      </sec>
      <sec id="sec-2-3">
        <title>Model C</title>
        <p>Model B is a standard implementation of our model. The protocol is identical to A
except we restrict the dimension of the CUI feature vectors to d = 300. This was due
to concerns that an embedding size of 512 may over-fit to the training distribution.
Model C is a standard implementation of our model. The protocol is identical to A
except we restrict the dimension of the concept vectors to D = 512. This was due to
concerns that an LSTM hidden state size of 1024 may over-fit to the training
distribution.
3.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Model D</title>
        <p>Model D seeks to address the problem of cumulative error resulting in a bias towards
learning samples with longer CUI sequences. In the standard implementation of our
model, the maximum error for each sample is constrained by the number of CUIs in
the set. This is because cross entropy error is calculated on a per concept basis (at each
time step), not a per sample basis. To ensure each sample has equal weighting in the
objective function, we divide the error at every time step by the total number of CUIs
for each sample and multiply the result by the maximum number of CUIs (19). This
ensures that every sample has the same theoretical maximum error and that the error
incurred for each incorrect concept is relative to the total number of concepts in the
set. Aside from the new weighted cross entropy loss function, Model D is otherwise
identical to Model A.
</p>
      </sec>
      <sec id="sec-2-5">
        <title>Model E 3.6</title>
      </sec>
      <sec id="sec-2-6">
        <title>Model F</title>
        <p>Model E assesses the performance of our standard implementation without any
constraints on our attention mechanism. Here, we use the standard implementation
described in Model A except only the cross entropy error is used to train the network.
This was done to ensure the alpha regularisation strategy is appropriate for this task and
not over regulating our network.</p>
        <p>Model F assesses the performance of our standard implementation without the visual
sentinel. Here, we use a similar implementation to that described in Model A; however,
we remove the step of estimating the visual gating scalar St and allow the LSTM to
be conditioned on the unscaled Ct vector. This can be interpreted as equally representing
features from Xt and Ct at every time step; meaning that the network no longer has
the capability of dynamically assessing the importance of visual features for each CUI.
This was done to ensure that the gating scalars produced by the network in Model A
actually resulted in improved outcomes with regards to performance on the validation
dataset.
</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>This section provides the results for our own internal evaluations on the validation
dataset supplied for this task; these are tabulated in Table 4. We submitted Model A
for evaluation on test data as it achieved the highest F1 score. We
also decided to submit Model D, as it achieved a competitive result with a
surprisingly small average number of concepts per sample; we were curious to see the performance of a
more conservative model on the test distribution. Model A and Model D achieved F1
scores of 0.1749349 (rank 22) and 0.1640647 (rank 27) respectively on the test dataset.
The performance of the proposed methods placed the CRADLE group 6th out of 12
participating teams in the ImageCLEF 2019 medical concept detection task. The baseline
architecture “Model A” achieved the highest performance. “Model B” and “Model C”
did not improve the F1 score which suggests that the proposed dimensionality of hidden
state and word embedding vectors in “Model A” is not resulting in over-fitting to the
training distribution.</p>
      <p>As evidenced by the reduced performance of “Model D”, resolving disparities in
concept distributions by normalising per-sample error has an adverse effect on training.
This is contrary to what was hypothesized in Section 3.4. In retrospect, normalizing
per-sample error in fact introduces a bias towards samples with fewer concepts. This is because
the “disproportionate” increase in per-sample error for longer concept sequences would
occur at time steps where operations are exclusive to those longer sequences. Once the
“END” token is generated for a sample, the error at later time steps for this sample
should in fact be zero. The normalization methods described in Section 3.4 would
unfairly disadvantage longer sequences by reducing the relative error at each time step.
Subduing error in operations common to all samples to resolve disparities in total error
due to exclusive operations in longer samples is counter-productive and is likely to
explain the reduced performance of “Model D”.</p>
      <p>
The reduced performance of “Model E” confirms that unregulated attention mechanisms
result in reduced performance and that the general constraints described in Section 3.1
are capable of improving attention and overall performance. “Model F” achieved one
of the lowest F1 scores, highlighting the importance of regulating the weight of visual
features depending on the visual connotations of each CUI. Future work will attempt
to address the challenges described in Section 2 by studying the association of CUI
semantic type to visual connotation. This will be achieved by retrieving CUI meta-data
from the UMLS metathesaurus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This project has been funded by Sullivan Nicolaides Pathology and the Australian
Research Council (ARC) Linkage Project [Grant number LP160101797].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aronson</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          :
          <article-title>Effective mapping of biomedical text to the umls metathesaurus: the metamap program</article-title>
          .
          <source>In: Proceedings of the AMIA Symposium</source>
          . p.
          <fpage>17</fpage>
          . American Medical Informatics Association (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bodenreider</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The unified medical language system (umls): integrating biomedical terminology</article-title>
          .
          <source>Nucleic acids research 32(suppl 1)</source>
          ,
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Browne</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCray</surname>
            ,
            <given-names>A.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The specialist lexicon</article-title>
          .
          <source>National Library of Medicine Technical Reports</source>
          pp.
          <fpage>18</fpage>
          -
          <lpage>21</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.: Keras. https://keras.io (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohli</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosenman</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shooshan</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thoma</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Preparing a collection of radiology examinations for distribution and retrieval</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>23</volume>
          (
          <issue>2</issue>
          ),
          <fpage>304</fpage>
          -
          <lpage>310</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          .
          <source>In: CVPR</source>
          . pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . IEEE (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Framewise phoneme classification with bidirectional lstm and other neural network architectures</article-title>
          .
          <source>Neural Networks</source>
          <volume>18</volume>
          (
          <issue>5-6</issue>
          ),
          <fpage>602</fpage>
          -
          <lpage>610</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Müller, H., Péteri, R.,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimuk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarasau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavallieratou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Blanco</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          , Rodríguez,
          <string-name>
            <given-names>C.C.</given-names>
            ,
            <surname>Vasillopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Karampidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          :
          <article-title>ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature</article-title>
          . In:
          <article-title>Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the 10th International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ),
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Lugano,
          <source>Switzerland (September 9-12</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Maksoud</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiliem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Lovell</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.C.</surname>
          </string-name>
          :
          <article-title>Coral8: Concurrent object regression for area localization in medical image panels</article-title>
          .
          <source>In: International Conference on Medical Image Computing and Computer-Assisted Intervention</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Karafiát, M.,
          <string-name>
            <surname>Burget</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , Černocký,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Khudanpur</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          :
          <article-title>Recurrent neural network based language model</article-title>
          . In:
          <article-title>Eleventh annual conference of the international speech communication association (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Oquab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laptev</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sivic</surname>
          </string-name>
          , J.:
          <article-title>Learning and transferring mid-level image representations using convolutional neural networks</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>1717</fpage>
          -
          <lpage>1724</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          , García Seco de Herrera,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , Müller, H.:
          <article-title>Overview of the ImageCLEFmed 2019 concept prediction task</article-title>
          .
          <source>In: CLEF2019 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Lugano,
          <source>Switzerland (September 9-12</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koitka</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Rückert, J.,
          <string-name>
            <surname>Nensa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.:</given-names>
          </string-name>
          <article-title>Radiology objects in context (roco): A multimodal image dataset</article-title>
          .
          <source>In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis</source>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Soldaini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goharian</surname>
          </string-name>
          , N.:
          <article-title>Quickumls: a fast, unsupervised approach for medical concept extraction</article-title>
          . In: MedIR Workshop, SIGIR (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zipser</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>A learning algorithm for continually running fully recurrent neural networks</article-title>
          .
          <source>Neural computation 1(2)</source>
          ,
          <fpage>270</fpage>
          -
          <lpage>280</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhudinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>In: ICML</source>
          . pp.
          <fpage>2048</fpage>
          -
          <lpage>2057</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>