Recurrent Attention Networks for Medical Concept Prediction

Sam Maksoud, Arnold Wiliem, and Brian Lovell
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane QLD, Australia

Abstract. This paper presents the working notes for the CRADLE group's participation in the ImageCLEF 2019 medical competition. Our group focused on the concept detection task, which challenged participants to approximate the mapping from radiology images to concept labels. Traditionally, such a task is modelled as an image tagging or image retrieval problem. However, we empirically discovered that many concept labels have weak visual connotations; hence, image features alone are insufficient for this task. To this end, we utilize a recurrent neural network architecture which enables our model to capture the relational dependencies among concepts in a label set, supplementing visual grounding when the association between a concept and the image features is weak or unclear. We also exploit soft attention and visual gating mechanisms to enable our network to dynamically regulate "where" and "when" to extract visual data for concept generation.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

In 2019, ImageCLEF [8] hosted the 3rd edition of its medical image captioning task. Participants were challenged to develop a method for generating Concept Unique Identifiers (CUIs) to describe the contents of a radiology image [13]. In contrast to natural language captions, CUIs parse out standardized concept terms from the medical texts. Resolving captions into key concepts alleviates the constraint of modelling the syntactic structures of free text.

Removing the language modelling component results in a task akin to image tagging, i.e. identifying the presence of a label (CUI) by its most distinguishable visual features. However, a considerable number of CUI terms in the supplied subset of the ROCO dataset [14] have no obvious association with visual features. This is because the CUIs were extracted automatically from the figure captions using only natural language processing tools; there was no constraint for the CUIs to be associated with visual features. Consequently, concepts with weak visual connotations such as "study", "rehab" and "supplement" are abundant throughout the ROCO dataset [14]; it is unreasonable to assume that a model can learn general visual features to reliably identify such concepts. While accurately identifying these non-visual concepts in isolation is unlikely or impossible, their relevance to an image can be indirectly estimated by modelling their relational dependencies to the other CUIs in the set of concept labels. This is because all CUIs in a set of concepts are derived from a common source: the original figure caption.

Under these conditions, our group concluded it would be best to model the problem as an image-to-sequence translation task, emphasizing the need to map an image to a set of concepts rather than mapping individual CUIs directly to image features. Thus, we design our model as a recurrent neural network (RNN), given their unrivalled performance in capturing long-term dependencies in sequential data [11]. Our proposed RNN is conditioned on features from both the image and the CUI labels.
We utilize a soft attention mechanism [18] which dynamically attends to different regions of an image in order to select the most distinguishable visual features for each CUI. In situations where a CUI has weak visual connotations, a visual feature gating mechanism [18] allows the model to focus on textual features, as they are likely to provide greater discriminatory power in such contexts.

2 Dataset Challenges

In order to design an appropriate model for the task, our group carried out an extensive investigation of the supplied subset of the ROCO dataset [14]. During this investigation we identified several challenges that would complicate the task of mapping text to visual features. These challenges pertain to incidences of redundant, inconsistent, and/or nonsensical assignment of CUIs to an image. We describe these challenges, and how they influenced our approach to this task, in detail below.

Table 1. Top 10 most frequent concepts in the training data.

RANK  CUI       FREQUENCY  CONCEPT
1     C0040395  6033       tomogr
2     C0034579  6002       pantomogr
3     C0043299  5830       x-ray procedure
4     C0441633  5283       diagnostic scanning
5     C1548003  5045       radiograph
6     C1962945  5044       radiogr
7     C0817096  4794       thoracics
8     C0772294  4372       alesion
9     C0040405  3113       x-ray computer assisted tomography
10    C0009924  2771       materials/contrast media

First and foremost, we identified that a majority of concepts redundantly describe generic radiology images. In Table 1 we list the top 10 most frequent concepts in the training dataset. Eight of the top 10 concepts (all but "alesion" and "thoracics") could arguably describe most of the radiology images in the dataset. The ROCO dataset [14] exclusively contains radiological images; a concept such as "radiograph" would be appropriate for all of the images. However, since the umbrella concept of "radiograph" can be expressed using a variety of different CUIs, we are forced to find arbitrary features to distinguish these cognate identifiers. The CUIs C1548003 and C1962945 describe "radiograph" as a diagnostic procedure and a diagnostic service ID respectively. While distinguishing these different types of "radiograph" is trivial in natural language contexts, identifying discriminating visual features is an extremely dubious pursuit. As such, any model tasked with learning the haphazard distribution of these semantically interchangeable (and often universal) concepts in the ROCO dataset [14] is expected to have limited generalizability.

This property of the dataset has implications for the F1 score used to evaluate this task. The F1 score will penalize models for misidentifying the arbitrary instances or absences of these CUIs in the test data. This is because of the inherent stochasticity of the ROCO dataset [14], where unobservable variations in the source figure captions determine the CUIs assigned to a sample.

In the supplied subset of the ROCO dataset [14], we observe recurring patterns of these semantically similar CUIs. For example, C0043299 and C1962945 frequently occur as a pair, but they also regularly occur as a tripartite alongside C1548003. An RNN architecture enables our model to exploit the statistical co-occurrence of these concepts when modelling probability distributions for a set of CUIs [7, 11]. To achieve a competitive F1 score, a model must not only learn "what" visual features best represent a CUI, but also "when" that CUI is most likely to occur in a given set of labels. Since all CUIs in a set of labels are derived from the same figure caption, modelling their interdependencies makes our model more robust to the unobservable variations in the original figure caption. This enables our model to more reliably predict "when" a label is assigned to an image based on the learned co-occurrence statistics with previously generated concepts.
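These co-occurrence statistics are straightforward to inspect. The sketch below counts how often pairs of CUIs appear together in the training label sets; the file name and its one-sample-per-line layout are hypothetical stand-ins for the supplied training annotations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical layout: one sample per line, "<image_id>\tCUI1;CUI2;...".
pair_counts = Counter()
with open("train_concepts.csv") as f:
    for line in f:
        cuis = sorted(set(line.rstrip("\n").split("\t")[1].split(";")))
        pair_counts.update(combinations(cuis, 2))

# Frequently co-occurring pairs such as (C0043299, C1962945) should appear
# near the top of this list for the ROCO training labels.
for pair, count in pair_counts.most_common(10):
    print(pair, count)
```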
Another challenge encountered in this task is the assignment of nonsensical CUIs by the quickUMLS system [16] used to create the ROCO dataset [14]. The quickUMLS system utilizes the CPMerge algorithm for dictionary mapping [16]. CPMerge uses character trigrams as features and maps terms to a dictionary based on overlapping features [16]. This method introduces a significant source of error, resulting in random and nonsensical CUIs being extracted from a medical figure caption. Table 3 showcases examples where trigram feature matching resulted in nonsensical or redundant CUIs being assigned to an image. This presents a major obstacle for multi-modal retrieval, as minor changes in descriptive syntax result in significant and erratic variations in the CUIs extracted from the figure caption. For example, one could rephrase the last sentence for ROCO CLEF ID 25756 in Table 3 as "Visualization of the proximal ACL is poor, suggesting an ACL rupture". The arbitrary decision to remove the word "substance" would no longer produce the erroneous CUI describing 11-Deoxycortisol (C0075414) based on its common name "Reichstein's Substance S".
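The toy example below illustrates why purely surface-level trigram matching can favour a nonsensical dictionary entry; it uses a simple Jaccard similarity over padded character trigrams and is not the actual CPMerge implementation used by quickUMLS.

```python
def trigrams(term):
    """Character trigrams of a lower-cased, padded term."""
    padded = "##" + term.lower() + "##"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# "alteration" shares most of its trigrams with the chemical name "Teration",
# and "severe" with the verb "sever"; a surface-level matcher can therefore
# rank these spurious entries highly.
print(jaccard("alteration", "teration"))      # high overlap
print(jaccard("severe", "sever"))             # high overlap
print(jaccard("alteration", "modification"))  # much lower overlap
```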
Table 2. CUIs produced by MetaMap [1] compared to those produced by quickUMLS [16] for the following medical text: "Intensity modulated radiotherapy (IMRT) planning axial CT post-contrast showing a residual post-operative cystic nodal metastasis from papillary carcinoma (arrow). The patient underwent a further neck dissection to remove the node before IMRT was performed." (source caption for ROCO CLEF ID 42724).

quickUMLS                            MetaMap
C0034619 (radiother)                 C1512814 (Intensity modulated radiotherapy)
C1522449 (radiation therapy)         C0032074 (Planning)
C2939420 (mets)                      C0205131 (Axial)
C0027530 (collum)                    C3888140 (CT)
C0012737 (dissection procedure)      C1609982 (Residual)
C0226964 (papillae linguales)        C0032790 (Postoperative Period)
C0935624 (capillaris)                C0205207 (Cystic)
C0746922 (noded)                     C0443268 (Nodal)
C1328685 (metastat)                  C0027627 (Metastasis)
C0227296 (papilla)                   C0007133 (Carcinoma, Papillary)
C0027627 (spreading of cancer)       −
C0006901 (smallest blood vessel)     −

To satisfy our scientific curiosity, we compared the CUIs extracted by quickUMLS to those extracted by MetaMap [1], as MetaMap is a commonly used alternative for automatic concept extraction. In the particular instance shown in Table 2, the CUIs produced by MetaMap are undoubtedly of higher quality than those produced by quickUMLS. This is likely because MetaMap does not use trigram character matching [1], and so it accurately captures C0007133 (Papillary Carcinoma) instead of C0226964 (Papilla of tongue). In the original paper describing quickUMLS [16], the authors claim the quickUMLS system can outperform MetaMap in certain tasks. However, an important caveat of this claim is that they use SpaCy models to pre-process texts instead of MetaMap's inbuilt pre-processing tools [16]. SpaCy pre-processing models are trained on a general text corpus, whereas MetaMap utilizes the SPECIALIST lexicon [1]. Medical terms are highly featured in the SPECIALIST lexicon [3], so its lexemes are likely to be more representative of those seen in radiology figure captions. Thus, substituting MetaMap's pre-processing tools with SpaCy's may not accurately reflect the performance of the end-to-end MetaMap system.

Furthermore, we empirically discovered that certain semantic types were more prone to erroneous assignment; the majority of nonsensical CUIs encountered were chemical names and abbreviations. In light of this issue, it may be worthwhile to investigate which semantic types are most prone to error and to identify those with the strongest visual connotations. This could assist our multi-modal retrieval models in determining how to weigh the importance of visual features and CUI relational dependencies based on the CUI's semantic type.

Table 3. Examples of erroneous CUI assignment. To retrieve the original caption, each image was used to query the original figure-caption pair using the Openi image search engine [5].

ROCO CLEF ID 24120
  Original caption: Tc99m pertechnetate thyroid scan did not show any tracer concentration by the thyroid gland.
  Erroneous CUIs: C0004268 (concentration), C0086045 (attention concentration), C3827302 (i can concentrate well)
  Reason for error: The caption refers to the chemistry definition of concentration; however, it has also been mistakenly interpreted as a verb.

ROCO CLEF ID 25756
  Original caption: A twenty five year old female suffering from internal derangement of the left knee. The MRI report described ACL rupture due to poor visualization of the ACL substance.
  Erroneous CUIs: C0075414 (Reichstein's Substance)
  Reason for error: Poor understanding of sentence semantics has resulted in the word "substance" triggering the assignment of the chemical Reichstein's Substance to the image.

ROCO CLEF ID 22356
  Original caption: CECT abdomen showing the lesion.
  Erroneous CUIs: C0772294 (alesion)
  Reason for error: Another example of a chemical drug name, Alesion (an antihistamine), being mistakenly matched to a commonly used term (lesion).

ROCO CLEF ID 24120
  Original caption: Severe Bilateral secretion and concentration alterations.
  Erroneous CUIs: C0076106 (Teration), C1306232 (Sever)
  Reason for error: Poor trigram matching has produced CUIs for Teration (a type of organothiophosphate) and Sever (verb) from "alteration" and "severe" respectively.

3 Proposed Methodology

To overcome the challenges described in Section 2, we seek to construct a model that satisfies the following requirements:

1. It must be able to identify the most distinguishable visual characteristics for a CUI;
2. It must capture the interdependencies among CUIs in a set of labels; and
3. It must be able to regulate the weight of visual features based on the variable strength of a CUI's visual connotation.

To this end, the proposed methodology borrows many features from the work of Xu et al. [18]. Although their architecture was originally designed for image captioning tasks, its dynamic soft attention mechanism, the recurrent inductive bias of long short-term memory (LSTM) networks [7], and its deterministic visual gating mechanism can be exploited to satisfy requirements (1), (2) and (3) respectively. We describe our methodology in detail below.

Firstly, we resize all images to 224x224x3 pixels in order to exploit a VGG16 [15] convolutional neural network (CNN) pre-trained on ImageNet [6]. Although the distribution of images in the ROCO dataset [14] differs greatly from that of ImageNet, there is empirical evidence that ImageNet-trained CNNs produce state-of-the-art results when transfer learning techniques are applied to smaller datasets [12]. Given that the ROCO dataset [14] is over 200x smaller than ImageNet, we exploit ImageNet-trained VGG16 models to benefit from this effect of transfer learning. Thus, we use the Keras [4] implementation of a VGG16 model with pre-trained ImageNet weights and extract the 14x14x512 feature map from the "block-4 max-pooling" intermediate layer to represent the image features. We keep the weights of the CNN fixed during training to limit the number of trainable parameters, hence reducing the complexity of our model.
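A minimal sketch of this fixed feature-extraction step, assuming the TensorFlow Keras implementation of VGG16, whose block4_pool layer corresponds to the "block-4 max-pooling" output described above:

```python
import numpy as np
import tensorflow as tf

# ImageNet-pre-trained VGG16 without the classification head; the weights
# stay frozen so the backbone only provides fixed features.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Truncate the network at the block-4 max-pooling layer, which yields a
# 14x14x512 feature map for a 224x224x3 input.
extractor = tf.keras.Model(
    inputs=base.input, outputs=base.get_layer("block4_pool").output)

# Example: extract features for a batch of already-resized images.
images = np.random.rand(2, 224, 224, 3).astype("float32") * 255.0  # placeholder batch
features = extractor(tf.keras.applications.vgg16.preprocess_input(images))
print(features.shape)  # (2, 14, 14, 512)
```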
The image features are then passed into a recurrent network where the CUIs are processed one at a time until the maximum time T has passed. The unconstrained maximum number of CUIs in the training data is 72; however, we observe that we can reduce the number of time steps by 74% and retain 99% of the training data if we constrain the maximum number of CUIs to 19. Hence, to maximize efficiency, we exclude samples with more than 19 CUIs. We add "START" and "END" tokens respectively to the beginning and end of each label set; "NULL" tokens are added to sets with fewer than 19 CUIs to attain a fixed-length time sequence of T = 21.

We pre-process each label set such that each CUI is represented by its unique index in the concept vocabulary of size V = 5531. To represent concept features, we train an embedding space E ∈ R^{V×d}, where d is the concept vector dimension. At the beginning of every time step, the CUI index at position t is used to retrieve its 1×d vector representation X_t from the embedding. Meanwhile, the attention mechanism takes the LSTM hidden state vector h_t and constructs a probability distribution A_t over the 14x14 spatial dimensions of the image; we multiply the feature map by A_t and average over the spatial dimensions to produce a 1×512 visual context vector C_t, as per [18]. A visual sentinel learns to estimate a gating scalar S_t ∈ [0, 1] from h_t to dynamically assign an attention weighting to C_t; this process is described in depth in [18]. As C_t and S_t are both produced as functions of h_t, the network learns "where" to look for discriminatory visual features and how important those visual features are in generating the CUI at time t + 1. Once we multiply the gating scalar S_t with the context vector C_t, we concatenate the gated image features with the CUI feature vector X_t along the last dimension to produce the (512 + d)-dimensional input to the LSTM network. A fully connected layer with ReLU activation reduces the D-dimensional output of the LSTM network to a vector with d dimensions; a residual connection to the previous CUI is added by summing X_{t−1} with this output. We then multiply the resulting vector by P ∈ R^{d×V} and apply a softmax function to construct a probability distribution over all the concepts in the vocabulary. At t = 0, the LSTM is initialized on a global image feature vector G. To produce G, the image features are averaged along their 14x14 spatial dimensions and passed through a fully connected layer to create a vector with the same dimensions as the LSTM input, i.e. 512 + d.
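The NumPy sketch below walks through one decoding step of the pipeline just described, using Model A's dimensions (D = 1024, d = 512). The additive attention scoring function, the random parameter initialisation, and the omission of the global-feature initialisation at t = 0 are simplifying assumptions rather than the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, D, R = 5531, 512, 1024, 14 * 14      # vocabulary, embedding, LSTM, regions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative parameters (randomly initialised here, learned in practice).
E   = rng.normal(size=(V, d)) * 0.01            # CUI embedding matrix
W_f = rng.normal(size=(512, 1)) * 0.01          # attention score: image features
W_h = rng.normal(size=(D, 1)) * 0.01            # attention score: hidden state
w_s = rng.normal(size=(D,)) * 0.01              # visual sentinel (gating scalar)
W_i = rng.normal(size=(512 + d, 4 * D)) * 0.01  # LSTM input weights
U_h = rng.normal(size=(D, 4 * D)) * 0.01        # LSTM recurrent weights
b_l = np.zeros(4 * D)
W_o = rng.normal(size=(D, d)) * 0.01            # LSTM output -> d dimensions
P   = rng.normal(size=(d, V)) * 0.01            # projection to the concept vocabulary

def lstm_step(x, h, c):
    z = x @ W_i + h @ U_h + b_l
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def decode_step(features, h, c, cui_idx, prev_cui_idx):
    """One time step; features is the flattened 14x14x512 map, shape (196, 512)."""
    X_t = E[cui_idx]                                  # concept feature vector
    scores = (features @ W_f + h @ W_h).ravel()       # one score per image region
    A_t = softmax(scores)                             # attention over the 196 regions
    C_t = (A_t[:, None] * features).mean(axis=0)      # visual context vector (512,)
    S_t = sigmoid(w_s @ h)                            # gating scalar in [0, 1]
    h, c = lstm_step(np.concatenate([S_t * C_t, X_t]), h, c)
    out = np.maximum(h @ W_o, 0.0) + E[prev_cui_idx]  # ReLU FC + residual to X_{t-1}
    return softmax(out @ P), h, c                     # distribution over the V concepts

features = rng.normal(size=(R, 512))                  # stand-in for the CNN feature map
probs, h, c = decode_step(features, np.zeros(D), np.zeros(D), 3, 2)
print(probs.shape, probs.sum())                       # (5531,) ~1.0
```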
The protocol described above represents the general framework for all 6 model variants used in this task. The learning rate lr = 0.0001 and batch size n = 125 were fixed across all variants, and their performance was evaluated on the validation dataset after 20 epochs. We now describe each model variant in detail below.

3.1 Model A

Model A is the standard implementation of our model. We set the dimensions D and d to 1024 and 512 respectively. The loss at each time step is calculated as the cross entropy between the estimated probability distribution and the ground truth concept label. In addition to the cross entropy loss, we use the alpha regularizing strategy described in [10] to regulate the outputs of the attention mechanism. When no constraints are placed on an attention network, a neural network can output nonsensical attention weights to optimize performance on the training data. To ensure the attention mechanism produces salient attention weights, we first construct an attention matrix α ∈ R^{196×T} from the probability distributions over the 14x14 spatial dimensions at each time step. As described in [10], we calculate the alpha regularizing term L_alpha from α as follows:

L_{xu} = \sum_{i}^{N} \Big( 1 - \sum_{t}^{C} \alpha_{ti} \Big)^{2}    (1)

L_{SAL} = \frac{1}{C} \sum_{t=0}^{C} \left( \frac{\max_i(\alpha_{ti}) - \mathrm{mean}_i(\alpha_{ti})}{\mathrm{mean}_i(\alpha_{ti})} \right)    (2)

L_{TD} = \frac{1}{N} \sum_{i=0}^{N} \left( \frac{\mathrm{std}_t(\alpha_{ti})}{\mathrm{mean}_t(\alpha_{ti})} \right)    (3)

where L_xu is the alpha regularising term from [18], t indexes the time axis, i indexes the spatial axis of the probability distributions, max_i and mean_i are the maximum and mean along the spatial (column) axis, and std_t and mean_t are the standard deviation and mean along the time (row) axis of α. L_xu ensures that all image regions receive attention over the course of generating the CUI sequence, L_SAL ensures that the attention mechanism produces salient attention maps at each time step, and L_TD ensures that the attention mechanism is not biased towards any particular image region over the course of generation. The final alpha term can thus be written as:

L_{alpha} = \lambda_1 L_{xu} + \frac{\lambda_2}{\max(\delta, L_{SAL})} + \frac{\lambda_3}{\max(\delta, L_{TD})}    (4)

where λ_1, λ_2 and λ_3 are hyper-parameters that scale the contribution of each term, and δ is used to avoid division by zero and exploding gradients in the initial training steps. This loss term is then added to the total cross entropy loss and we perform standard back-propagation with ADAM optimisation [9].
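For concreteness, the sketch below computes the regularisation terms of Eqs. (1)-(4) from an attention matrix of shape (196, T); the λ and δ values are illustrative assumptions rather than the values used in training.

```python
import numpy as np

def alpha_regulariser(alpha, lam=(1.0, 1.0, 1.0), delta=1e-3):
    """alpha: attention matrix of shape (N=196 regions, C=T time steps)."""
    lam1, lam2, lam3 = lam

    # Eq. (1): every region should receive attention over the whole sequence.
    L_xu = np.sum((1.0 - alpha.sum(axis=1)) ** 2)

    # Eq. (2): at each time step the peak attention should stand out from the mean.
    per_step_mean = alpha.mean(axis=0)
    L_sal = np.mean((alpha.max(axis=0) - per_step_mean) / per_step_mean)

    # Eq. (3): attention on each region should vary over time (no fixed bias).
    L_td = np.mean(alpha.std(axis=1) / alpha.mean(axis=1))

    # Eq. (4): the saliency and temporal-diversity terms are to be maximised,
    # so they enter the loss as clipped reciprocals.
    return lam1 * L_xu + lam2 / max(delta, L_sal) + lam3 / max(delta, L_td)

# Example: a random attention matrix for T = 21 steps, normalised per time step.
rng = np.random.default_rng(0)
raw = rng.random((196, 21))
alpha = raw / raw.sum(axis=0, keepdims=True)   # each column sums to one
print(alpha_regulariser(alpha))
```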
During training, we implement a teacher-forcing training protocol [17] where we feed the ground truth CUI to the LSTM at every time step. During inference, the "START" token is fed into the LSTM network to obtain the probability distribution for the first CUI; the index with the highest probability estimate is used to generate the CUI for that time step. This process is repeated until a terminal "END" token is produced or the maximum number of time steps T has passed.

3.2 Model B

Model B follows the protocol of Model A, except that we restrict the dimension of the CUI feature vectors to d = 300. This was due to concerns that an embedding size of 512 may over-fit to the training distribution.

3.3 Model C

Model C follows the protocol of Model A, except that we restrict the LSTM hidden state dimension to D = 512. This was due to concerns that an LSTM hidden state size of 1024 may over-fit to the training distribution.

3.4 Model D

Model D seeks to address the problem of cumulative error resulting in a bias towards learning samples with longer CUI sequences. In the standard implementation of our model, the maximum error for each sample is constrained by the number of CUIs in the set. This is because the cross entropy error is calculated on a per-concept basis (at each time step), not a per-sample basis. To ensure each sample has equal weighting in the objective function, we divide the error at every time step by the total number of CUIs for that sample and multiply the result by the maximum number of CUIs (19). This ensures that every sample has the same theoretical maximum error and that the error incurred for each incorrect concept is relative to the total number of concepts in the set. Aside from this weighted cross entropy loss function, Model D is otherwise identical to Model A.

3.5 Model E

Model E assesses the performance of our standard implementation without any constraints on the attention mechanism. Here, we use the standard implementation described in Model A, except that only the cross entropy error is used to train the network. This was done to verify that the alpha regularisation strategy is appropriate for this task and is not over-regulating our network.

3.6 Model F

Model F assesses the performance of our standard implementation without the visual sentinel. Here, we use a similar implementation to that described in Model A; however, we remove the step of estimating the visual gating scalar S_t and condition the LSTM on the unscaled C_t vector. This can be interpreted as representing features from X_t and C_t equally at every time step, meaning that the network no longer has the capability of dynamically assessing the importance of visual features for each CUI. This was done to verify that the gating scalars produced by the network in Model A actually improve performance on the validation dataset.

4 Results

This section provides the results of our internal evaluations on the validation dataset supplied for this task; these are tabulated in Table 4. We submitted Model A for evaluation on the test data as it achieved the highest F1 score, as shown in Table 4. We also decided to submit Model D, as it achieved a competitive result with a surprisingly small average number of concepts per sample; we were curious to see the performance of a more conservative model on the test distribution. Model A and Model D achieved F1 scores of 0.1749349 (rank 22) and 0.1640647 (rank 27) on the test dataset.

Table 4. Quantitative performance of each of our models on the validation dataset. F1 refers to the average F1 score on the validation dataset. MIN, MAX, and MEAN respectively refer to the minimum, maximum and mean number of concepts generated for each example in the validation set. Model A achieved the highest F1 score.

MODEL  F1    MIN  MAX  MEAN
A      0.16  1    16   4.3
B      0.15  1    11   4.2
C      0.13  0    14   3.7
D      0.15  0    14   0.4
E      0.12  1    9    2.5
F      0.12  1    9    2.9

5 Conclusion and Future Works

The performance of the proposed methods placed the CRADLE group 6th out of 12 participating teams in the ImageCLEF 2019 medical concept detection task. The baseline architecture "Model A" achieved the highest performance. "Model B" and "Model C" did not improve the F1 score, which suggests that the dimensionality of the hidden state and word embedding vectors in "Model A" does not result in over-fitting to the training distribution.

As evidenced by the reduced performance of "Model D", resolving disparities in concept distributions by normalising per-sample error has an adverse effect on training. This is contrary to what was hypothesized in Section 3.4. In retrospect, normalizing per-sample error in fact creates a bias towards samples with fewer concepts. This is because the "disproportionate" increase in per-sample error for longer concept sequences occurs at time steps whose operations are exclusive to those longer sequences. Once the "END" token is generated for a sample, the error at later time steps for that sample should in fact be zero. The normalization method described in Section 3.4 therefore unfairly disadvantages longer sequences by reducing their relative error at each time step. Subduing the error in operations common to all samples, in order to resolve disparities in total error caused by operations exclusive to longer samples, is counter-productive and is likely to explain the reduced performance of "Model D".
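To make this bias concrete, consider a hypothetical case in which every incorrect concept incurs a unit cross-entropy error under the weighting of Section 3.4:

2 \times \tfrac{19}{2} \times 1 = 19 \quad \text{(2-concept sample)}, \qquad 19 \times \tfrac{19}{19} \times 1 = 19 \quad \text{(19-concept sample)}

Both samples now share the same maximum total error, but each individual mistake in the 19-concept sample is weighted 9.5 times less than a mistake in the 2-concept sample, which is the disadvantage described above.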
The reduced performance of "Model E" confirms that an unregulated attention mechanism degrades performance and that the general constraints described in Section 3.1 are capable of improving both attention and overall performance. "Model F" achieved one of the lowest F1 scores, highlighting the importance of regulating the weight of visual features depending on the visual connotations of each CUI. Future work will attempt to address the challenges described in Section 2 by studying the association between CUI semantic type and visual connotation. This will be achieved by retrieving CUI meta-data from the UMLS Metathesaurus [2].

6 Acknowledgements

This project has been funded by Sullivan Nicolaides Pathology and the Australian Research Council (ARC) Linkage Project [Grant number LP160101797].

References

1. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium. p. 17. American Medical Informatics Association (2001)
2. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1), D267–D270 (2004)
3. Browne, A.C., McCray, A.T., Srinivasan, S.: The SPECIALIST lexicon. National Library of Medicine Technical Reports pp. 18–21 (2000)
4. Chollet, F., et al.: Keras. https://keras.io (2015)
5. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2015)
6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. pp. 248–255. IEEE (2009)
7. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6), 602–610 (2005)
8. Ionescu, B., Müller, H., Péteri, R., Cid, Y.D., Liauchuk, V., Kovalev, V., Klimuk, D., Tarasau, A., Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Garcia, N., Kavallieratou, E., del Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K., Chamberlain, J., Clark, A., Campello, A.: ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer, Lugano, Switzerland (September 9-12 2019)
9. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
10. Maksoud, S., Wiliem, A., Zhao, K., Zhang, T., Wu, L., Lovell, B.C.: CORAL8: Concurrent object regression for area localization in medical image panels. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (2019)
11. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Eleventh Annual Conference of the International Speech Communication Association (2010)
12. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1717–1724 (2014)
13. Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Müller, H.: Overview of the ImageCLEFmed 2019 concept prediction task. In: CLEF2019 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Lugano, Switzerland (September 09-12 2019)
14. Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology Objects in COntext (ROCO): A multimodal image dataset. In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, pp. 180–189. Springer (2018)
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
16. Soldaini, L., Goharian, N.: QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop, SIGIR (2016)
17. Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2), 270–280 (1989)
18. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML. pp. 2048–2057 (2015)