    Informative and Intriguing Visual Features:
             UA.PT Bioinformatics in
            ImageCLEF Caption 2019

        Ana Jorge Gonçalves⋆ , Eduardo Pinho⋆ , and Carlos Costa

       DETI - Institute of Electronics and Informatics Engineering of Aveiro
                           University of Aveiro, Portugal
          {ana.j.v.goncalves,eduardopinho,carlos.costa}@ua.pt




      Abstract. Digital medical imaging has enabled new advances in clinical
      decision support and treatment procedures since its inception. Its widespread
      use leads to the creation of huge amounts of data that are often not fully exploited.
      The development and evaluation of representation learning techniques
      for automatic detection of concepts in medical images can make way
      for improved indexing, processing and retrieval capabilities in medical
      imaging archives.
      This paper discloses several independent approaches for multi-label clas-
      sification of biomedical concepts, in the context of the ImageCLEFmed
      Caption challenge of 2019. We emphasize the use of threshold tuning to
      optimize the quality of sample retrieval, as well as the differences be-
      tween training a convolutional neural network end-to-end for supervised
      image classification, and training unsupervised learning models before
      linear classifiers. In the test results, the best mean F1 -score of 0.206 was
      obtained with the supervised approach, albeit with images of a larger
      resolution than for the dual-stage approaches.

      Keywords: representation learning · deep learning · auto-encoders



1 Introduction

Medical imaging modalities are an essential and well-established medium in
clinical practice. As the amount of medical images grows dramatically, automatic
and semi-automatic algorithms become increasingly pertinent for the extraction
of information from biomedical image data [14]. Deep learning techniques are
therefore becoming increasingly useful for this aim, serving as a key enabler for
the development of representation learning techniques and, ultimately, for
improving the quality of systems in healthcare.

  Copyright © 2019 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 Septem-
  ber 2019, Lugano, Switzerland.
⋆
  Both authors contributed equally, names are in alphabetical order.
    The process of annotating images with useful information in this context is
time-consuming and usually requires medical expertise. The development of pow-
erful representations of images could enable the automatic detection of biomed-
ical concepts in a medical imaging data set. The ImageCLEFmed initiative,
part of ImageCLEF [4], has been focused on automatic concept detection,
diagnosis, and question answering from medical images. In particular, the Im-
ageCLEFmed Caption challenge of 2019 [11] has narrowed its scope to the task
of concept detection, with the goal of recognizing biomedical concepts present
in medical images, using only the visual content.
    This paper presents our solution proposal for the concept detection task, de-
scribing our methodology and evaluating its performance under the ImageCLEF
2019 challenge.


2 Methods

For this task, a data set with a total of 70,786 radiology images of several medical
imaging modalities was provided, drawn from Radiology Objects in Context (ROCO)
[12]. This global set was further split into training (56,629 images), validation
(14,157 images) and test (10,000 images) sets by the organizers. Only the first
two were accompanied by the list of concepts applicable to each image, whereas
the test set’s ground truth was hidden from the participants.
    The ImageCLEF Caption 2019 data set includes an overwhelming number
of 5,216 unique concepts, not all of which can be reasonably considered due to
the very small number of positive samples in the training and validation splits.
In all of the methods described next, we admitted only the 1,100 concepts
with the highest number of samples in which the concept occurs (henceforth
named positive samples). The label vectors were built based on a
direct mapping from the UMLS concept unique identifier (CUI) to an index
in the vector. The reverse mapping was kept for producing the textual list of
concepts.
    Also contributing to this decision was the observed imbalance in the number
of positives per label, as discerned in Figure 1, which makes classifiers difficult
to train and evaluate. By considering the 1,100 most frequent concepts, one
could ensure, in the worst case, a minimum of 29 positive samples in the training
set and 2 positive samples in the validation set. We admit that attempting to
detect any less frequent concept is unlikely to result in useful classifiers.
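    As an illustration of this label construction, the sketch below (in Python, with
hypothetical variable and function names) selects the most frequent CUIs and builds
the corresponding multi-hot label vectors; it is not the exact code used in our
experiments.

```python
from collections import Counter

import numpy as np


def build_label_mapping(train_concepts, k=1100):
    """Keep the k CUIs with the most positive samples and map each one to an index.

    `train_concepts` is assumed to be a list of CUI lists, one per training image.
    """
    counts = Counter(cui for cuis in train_concepts for cui in cuis)
    kept = [cui for cui, _ in counts.most_common(k)]
    cui_to_index = {cui: i for i, cui in enumerate(kept)}
    index_to_cui = {i: cui for cui, i in cui_to_index.items()}  # reverse mapping
    return cui_to_index, index_to_cui


def to_label_vector(cuis, cui_to_index):
    """Multi-hot label vector over the kept concepts; other CUIs are ignored."""
    y = np.zeros(len(cui_to_index), dtype=np.float32)
    for cui in cuis:
        idx = cui_to_index.get(cui)
        if idx is not None:
            y[idx] = 1.0
    return y
```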
    At ImageCLEF 2018, the highest mean F1-score was obtained through unsu-
pervised methods, namely by using the features of an adversarial auto-encoder
followed by logistic regression [13]. However, in relation to last year's challenge,
the number of images in the training set was reduced by 34 % and, since all
the data was annotated, we felt inclined to compare our past approach with the
training of purely supervised methods.
Fig. 1. Number of positives of each label, for the 1,100 most frequent ones. The con-
cepts C0441633 (diagnostic scanning), C1533810 (placed), C0237058 (no hydronephro-
sis), C0205160 (ruled out), C4282132 (malignancy) and C0181209 (hooks) are marked.


    Convolutional neural networks (CNNs) are considered one of the best ap-
proaches for image classification [15]. Unlike a 2-stage approach, where the ex-
tracted feature descriptors serve as input to a trainable classifier, the images
themselves are used directly in the learning process of the CNN. This can lead
to a more focused guidance of the feature learning process across the multiple
layers of the network. It was of our interest to compare this common supervised
image classification method with the 2-stage pipeline involving unsupervised
learning methods and simple classifiers.
    Therefore, we have addressed the concept detection task with multiple inde-
pendent approaches, which can be divided into two major groups:

 – Through image representations, obtained by the implementation of several
   feature extraction methods:
     • Color and edge directivity descriptors, which were used as low-level
       image descriptors.
     • An auto-encoder and an adversarial auto-encoder were trained, and
       features were extracted from their bottleneck vectors.
     • The concatenation of the features obtained from the previous point was
       used for classification.
 – An end-to-end approach, using two deep learning architectures:
     • A simple convolutional neural network model.
     • A residual neural network.

   In every case, some form of optimum threshold tuning was employed, to over-
come the classifier’s focus on accuracy rather than F-measure. Further details
are given in Sections 2.4 and 2.5.
   Neural network training, feature extraction, and logistic regression were con-
ducted using TensorFlow on one of the GPUs of an NVIDIA Tesla K80 graphics
card in an Ubuntu server machine.
2.1 Color and Edge Directivity Descriptors
As traditional visual feature extraction algorithms are still very often considered
in medical image recognition, these techniques contribute to a baseline, which
we expect modern deep learning methods to surpass. For this purpose, we have
extracted Color and Edge Directivity Descriptors (CEDDs) [2] from the images,
using an implementation available on GitHub (https://github.com/Enet4/ACEDD),
after resizing them to a minimum dimension of 256 pixels while keeping the aspect ratio.
These low-level features accumulate color and texture information into a his-
togram of 144 bins per sample, and are known for their appealing accuracy in
image retrieval tasks, when contrasted with their high compactness.
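    For reference, a minimal sketch of this resizing step is shown below; the use of
Pillow and bilinear resampling are assumptions of the sketch, not stated choices of
our pipeline.

```python
from PIL import Image


def resize_min_side(image: Image.Image, min_side: int = 256) -> Image.Image:
    """Resize so that the smaller dimension equals `min_side`, keeping the aspect ratio."""
    w, h = image.size
    scale = min_side / min(w, h)
    return image.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
```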

2.2 Adversarial Auto-encoder
For the unsupervised extraction of visual features from the medical images, an
adversarial auto-encoder (AAE) [9] was trained on the given data set, with im-
ages resized to 64 pixels (64 × 64 × 3 inputs). While functioning as a typical
auto-encoder, which seeks to minimize the information loss of passing samples
through an information bottleneck (Equation 1), a discriminator D is also in-
cluded. The purpose of D is to learn to distinguish latent codes produced by the
encoder E from a prior code created by an arbitrary distribution p(z), whereas
E seeks to fool the code discriminator by approximating its output distribution
to that of p(z) (Equation 2). Based on the concept of Generative Adversarial
Networks (GANs) [3], this min-max game of adversarial components provides
variational inference to the basic auto-encoder structure while leading the en-
coder to match the prior distribution, thus regularizing the encoder. In this work,
we have sampled ϵ ∼ p(z) from a rectified unit-norm Gaussian distribution (i.e.,
N (0, I) with all negative values replaced with zeros), which resulted in
organically sparse latent codes.

\[
x' = G(E(x)), \qquad
L_{rec}(x, x') = \frac{1}{2N} \sum_{i=1}^{N} (x_i - x'_i)^2
\tag{1}
\]

\[
V(E, D) = \min_{E} \max_{D} \; \mathbb{E}_{z \sim p(z)}[\log D(z)] + \mathbb{E}_{x \sim p(x)}[\log (1 - D(E(x)))]
\tag{2}
\]


     Both the encoder and decoder architectures are based on ResNet19 [6], each
component comprising four 2-layer residual blocks. At the end of the encoder,
the final layer was subjected to a ReLU activation and a very light L1 activity
regularization (of factor 10−6), thus contributing to the features’ sparsity without
deviating from the established prior. The code discriminator, on the other hand,
is composed of three 1024-channel wide dense layers with layer normalization
[1], plus an output layer. Dropout with a rate of 25% was also added before the
output layer.
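    In Keras terms, such a bottleneck layer could be sketched as below; the
512-dimensional dense code is an assumption of the sketch, consistent with the
1024-dimensional concatenation described in Section 2.3.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical bottleneck layer of the encoder: ReLU activation plus a very
# light L1 activity regularization (factor 1e-6), assuming a 512-dim code.
bottleneck = layers.Dense(
    512,
    activation='relu',
    activity_regularizer=tf.keras.regularizers.l1(1e-6),
)
```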
    The AAE was trained for 30 epochs on the training set, with images resized
to a minimum dimension of 72 pixels and then randomly cropped to a 64 × 64
square. All three RGB channels were kept, with their values normalized to the
[−1, 1] range with the formula x/127.5 − 1. Each iteration is composed of three
optimization steps: the code discriminator step, the encoder regularization step,
and the reconstruction step. The components were trained in this order with
the Adam optimizer [5], a learning rate of 10−5 , beta parameters β1 = 0.5 and
β2 = 0.999, and a mini-batch size of 32.
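    A minimal sketch of one such three-step iteration in TensorFlow 2 is given below.
The `encoder`, `decoder`, and `code_disc` models are hypothetical stand-ins for the
components described above, and the encoder step uses the common non-saturating
form of the adversarial objective; this illustrates the procedure rather than
reproducing the exact implementation.

```python
import tensorflow as tf

adam_cfg = dict(learning_rate=1e-5, beta_1=0.5, beta_2=0.999)
disc_opt = tf.keras.optimizers.Adam(**adam_cfg)
enc_opt = tf.keras.optimizers.Adam(**adam_cfg)
rec_opt = tf.keras.optimizers.Adam(**adam_cfg)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)


@tf.function
def train_step(x, encoder, decoder, code_disc):
    # 1) Code discriminator step: distinguish prior codes from encoded codes.
    with tf.GradientTape() as tape:
        z_fake = encoder(x, training=True)
        # Rectified Gaussian prior: N(0, I) with negative values set to zero.
        z_prior = tf.nn.relu(tf.random.normal(tf.shape(z_fake)))
        d_real = code_disc(z_prior, training=True)
        d_fake = code_disc(z_fake, training=True)
        d_loss = (bce(tf.ones_like(d_real), d_real)
                  + bce(tf.zeros_like(d_fake), d_fake))
    disc_opt.apply_gradients(zip(tape.gradient(d_loss, code_disc.trainable_variables),
                                 code_disc.trainable_variables))

    # 2) Encoder regularization step: fool the code discriminator
    #    (non-saturating form of the adversarial objective in Eq. 2).
    with tf.GradientTape() as tape:
        d_fake = code_disc(encoder(x, training=True), training=False)
        e_loss = bce(tf.ones_like(d_fake), d_fake)
    enc_opt.apply_gradients(zip(tape.gradient(e_loss, encoder.trainable_variables),
                                encoder.trainable_variables))

    # 3) Reconstruction step: minimize L_rec over encoder and decoder jointly.
    with tf.GradientTape() as tape:
        x_rec = decoder(encoder(x, training=True), training=True)
        rec_loss = 0.5 * tf.reduce_mean(tf.square(x - x_rec))
    ae_vars = encoder.trainable_variables + decoder.trainable_variables
    rec_opt.apply_gradients(zip(tape.gradient(rec_loss, ae_vars), ae_vars))

    return d_loss, e_loss, rec_loss
```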
    Moreover, in order to better understand the influence of the adversarial auto-
encoder’s regularization phase, a separate auto-encoder (AE) was trained with
the same encoder and decoder characteristics as in the adversarial one, but
without the adversarial loss. In this case, each iteration is only composed of the
reconstruction step, mainly influenced by the loss Lrec .

2.3 Early Fusion: Auto-encoder and Adversarial Auto-encoder
The combination of information from different techniques seems intuitively ap-
pealing for improving the performance of the learning process. More recently,
there have been attempts to combine traditional image characteristics with high-level
features [7].
    Although it is known that the combination of low-level features, such as color
and texture in CEDD, is important to the quality of visual features, it is not
as clear whether the combination of high-level features obtained from two auto-
encoders can benefit from an early fusion. Hence, we hereby took the features
obtained from the AE and the AAE, and concatenated them to form a 1024-
dimensional feature set, for its subsequent use in the logistic regression step
alongside the other feature sets.

2.4 Logistic Regression
For each of the previously described set of features, logistic regression classi-
fiers were trained for the chosen labels, for a set of predefined operating point
thresholds: 0.075, 0.1, 0.125, and 0.15. The linear classifiers were trained with
the Adam optimizer [5], with a mini-batch size of 128, until the best F1 -score
among the various thresholds would reach a plateau. Our experiments suggests
that training in this phase with a very small learning rate, often 10−5 in our
experiments, for a large number of epochs (more than 500), helps the training
process to find a more optimal solution.
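    A minimal Keras sketch of one such multi-label linear classifier is shown below,
assuming pre-computed feature vectors; the monitoring of the F1-score at the
predefined thresholds is omitted.

```python
import tensorflow as tf


def build_linear_classifier(feature_dim, n_labels=1100):
    """Multi-label logistic regression over fixed feature vectors (a sketch)."""
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(feature_dim,)),
        tf.keras.layers.Dense(n_labels, activation='sigmoid'),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss='binary_crossentropy')
    return model

# Hypothetical usage:
# clf = build_linear_classifier(feature_dim=512)
# clf.fit(train_features, train_labels, batch_size=128, epochs=500)
```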
    Aware of the presence of concepts with a very low number of positive sam-
ples, one may wonder whether certain labels were not well trained or resulted
in uninformative classifiers. To mitigate this, we calculated the area under the
curve (AUC) of each binary classifier’s receiver operating characteristic curve
(ROC). Afterwards, we tested whether ignoring the predictions of concepts
where the AUC was lower than 0.5 would potentially improve the overall per-
formance. Testing this hypothesis on the validation set revealed that this
would improve the mean F1-score in most cases, albeit only slightly. For the
features of the AAE, as an example, this tweak has only increased the score by
5 × 10−5 . With the features of the simple AE, the score was only improved by
7.3 × 10−4. We did not apply this mechanism to the classifiers trained with the
CEDD feature set.
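    A sketch of this filtering step, assuming scikit-learn and multi-hot arrays of
validation ground truth and predicted probabilities, might read as follows.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mask_uninformative_labels(y_val_true, y_val_prob, y_test_prob):
    """Suppress predictions for labels whose validation ROC AUC falls below 0.5.

    Labels with a single class in the validation ground truth are skipped,
    since the AUC is undefined for them (an assumption of this sketch).
    """
    keep = np.ones(y_val_true.shape[1], dtype=bool)
    for j in range(y_val_true.shape[1]):
        if len(np.unique(y_val_true[:, j])) < 2:
            continue
        if roc_auc_score(y_val_true[:, j], y_val_prob[:, j]) < 0.5:
            keep[j] = False
    return y_test_prob * keep  # masked labels are always predicted "negative"
```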
    Probabilistic classifiers minimizing binary cross-entropy inherently optimize
for the accuracy of predictions. However, accuracy is overoptimistic when labels
have a very low number of positives, as is the case in this task, making it a poor
metric for the classifiers’ usefulness. As recognized by past work in the scope of
ImageCLEF concept detection, adjusting the operating point thresholds to op-
timize the F1-score provides significant improvements in the final metric values,
in spite of the known implications of this practice [8]. In order to adjust the
probabilistic threshold for optimizing the F1-score, the provided validation set
was split into five folds. For each fold, we used a sequential search with a
granularity of 0.01 to identify the threshold resulting in the highest F1-score,
and the median of the five optimizing thresholds was used for prediction over
the test set with the trained classifiers.
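    The sketch below illustrates this tuning procedure, under the assumptions of a
contiguous split of the validation set into folds and the convention that a sample
with no positive labels and no predictions counts as a perfect match.

```python
import numpy as np


def samplewise_f1(y_true, y_prob, threshold):
    """Mean F1-score computed per sample, as in the challenge metric."""
    y_pred = (y_prob >= threshold).astype(np.int32)
    y_true = y_true.astype(np.int32)
    tp = (y_pred & y_true).sum(axis=1)
    denom = y_pred.sum(axis=1) + y_true.sum(axis=1)
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 1.0)
    return float(f1.mean())


def tune_threshold(y_true, y_prob, n_folds=5, step=0.01):
    """Best threshold per fold by sequential search; return the median over folds."""
    folds = zip(np.array_split(y_true, n_folds), np.array_split(y_prob, n_folds))
    grid = np.arange(step, 1.0, step)
    best = [grid[np.argmax([samplewise_f1(t, p, th) for th in grid])]
            for t, p in folds]
    return float(np.median(best))
```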
    In the event that a sample was predicted to have more than 100 concepts
before submission, the list of concepts was trimmed by removing the least frequent
concepts. In practice, this only happened to the linear classifiers trained using CEDD.


2.5 End-to-end Convolutional Neural Network

A simple CNN was designed (Table 1) and trained for multi-label classification,
thus once again treating concepts as labels. Conv2D stands for 2D convolution
layer, GAP for global average pooling and FC for fully connected layer. Training
samples were augmented using random square crops, experimenting with
different crop sizes. In one approach, denoted as CNN-A-256px, we used 256
pixel-wide images and excluded the layer Conv2D-5. In CNN-B, the full CNN was
used with 128 (CNN-B-128px) and 64 (CNN-B-64px) pixel-wide images. Validation
and test samples were simply resized to fit these dimensions, according to the
training process.


                    Table 1. The specification of the CNN used.

         Layer Type    Kernel/Stride   Output Shape     Details
         Conv2D-1      5x5/2           64 × 64 × 64     ReLU activation
         Conv2D-2      3x3/2           32 × 32 × 128    ReLU activation
         Conv2D-3      3x3/2           16 × 16 × 256    ReLU activation
         Conv2D-4      3x3/2           8 × 8 × 512      ReLU activation
         Conv2D-5      3x3/2           4 × 4 × 512      ReLU activation
         GAP           -               512              -
         Dropout       -               -                50 %
         FC            -               1100             sigmoid activation
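    A Keras sketch consistent with Table 1 is given below; the 'same' padding and
the remaining training details are assumptions of the sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_simple_cnn(input_size=128, n_labels=1100, include_conv5=True):
    """A sketch of the CNN specified in Table 1."""
    model = tf.keras.Sequential([
        layers.InputLayer(input_shape=(input_size, input_size, 3)),
        layers.Conv2D(64, 5, strides=2, padding='same', activation='relu'),
        layers.Conv2D(128, 3, strides=2, padding='same', activation='relu'),
        layers.Conv2D(256, 3, strides=2, padding='same', activation='relu'),
        layers.Conv2D(512, 3, strides=2, padding='same', activation='relu'),
    ])
    if include_conv5:  # Conv2D-5 is excluded in the CNN-A-256px configuration
        model.add(layers.Conv2D(512, 3, strides=2, padding='same', activation='relu'))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_labels, activation='sigmoid'))
    return model
```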
    Moreover, to serve as a more intuitive means of comparison, the same archi-
tecture as the encoder in the AAE and AE, based on ResNet19 [6], was trained
for end-to-end classification. It is composed of five ResNet blocks, with the ar-
chitectures depicted in Tables 2 and 3. We employed the same process of data
augmentation as in the previously described CNN, resulting in 64 × 64 images. In
Table 3, BN means batch normalization, and cin and cout represent the input and
output channels of the ResNet block, respectively. The Addition layer adds the
outputs of the two preceding branches: the first with one convolution layer and
the second with two convolution layers.


    Table 2. ResNet architecture.

      Layer Type   Output Shape
      ResBlock     64 × 64 × 64
      ResBlock     32 × 32 × 128
      ResBlock     16 × 16 × 256
      ResBlock     8 × 8 × 256
      ResBlock     4 × 4 × 512
      GAP, ReLU    512
      Dropout      512
      FC           1100

    Table 3. Residual block specification.

      Layer Type   Kernel/Stride   Output Shape
      Conv2D       3×3 / 2         h/2 × w/2 × cout
      BN, ReLU     -               h × w × cin
      Conv2D       3×3 / 1         h × w × cout
      BN, ReLU     -               h × w × cout
      Conv2D, BN   3×3 / 2         h/2 × w/2 × cout
      Addition     -               h/2 × w/2 × cout
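    A functional-API sketch of the residual block in Table 3, again assuming 'same'
padding, could be written as follows.

```python
from tensorflow.keras import layers


def residual_block(x, c_out):
    """A sketch of the residual block specified in Table 3."""
    # Shortcut branch: a single strided 3x3 convolution.
    shortcut = layers.Conv2D(c_out, 3, strides=2, padding='same')(x)
    # Main branch: pre-activation, 3x3 conv, pre-activation, strided 3x3 conv + BN.
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(c_out, 3, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(c_out, 3, strides=2, padding='same')(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([shortcut, y])
```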



    Both end-to-end deep learning models were trained with the AMSGrad op-
timizer [16], with a batch size of 32 and a learning rate of 10−4 with a decay
of 2 × 10−5 over each update, and parameters β1 = 0.9 and β2 = 0.999. For
each model, threshold fine tuning was performed by evaluating the F1 -score per-
formance for multiple thresholds, using the validation data set. Thereafter, we
determined the threshold which would yield the optimal mean F1 -score on the
validation set.


3 Results and Discussion
Alongside the metrics obtained from our submissions, we also present a
brief qualitative analysis as part of our results.

3.1 Qualitative Feature Analysis
The visualizations of the features for five of our models were obtained by fitting
dimensionality reduction algorithms, namely principal component analysis
(PCA) and uniform manifold approximation and projection (UMAP) [10]. The
visualizations are presented in Figures 2 and 3, respectively, using a stratified
portion of 5 % of the training set, with extreme outliers removed from the
figures. The official open-source Python implementation of UMAP (available on
GitHub: https://github.com/lmcinnes/umap) was used, configured with 15 as
the number of neighbors and 0.1 as the minimum distance.
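    For reference, a minimal sketch of these projections with scikit-learn and
umap-learn is shown below; the feature matrix argument is a placeholder for the
extracted features.

```python
import numpy as np
import umap  # official implementation: https://github.com/lmcinnes/umap
from sklearn.decomposition import PCA


def project_2d(features: np.ndarray):
    """Project a (n_samples, n_dims) feature matrix to 2D with PCA and UMAP."""
    pca_2d = PCA(n_components=2).fit_transform(features)
    umap_2d = umap.UMAP(n_neighbors=15, min_dist=0.1,
                        n_components=2).fit_transform(features)
    return pca_2d, umap_2d
```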
    For the CNN, the features were extracted at the GAP layer, whereas for the
remaining models the visualizations depict the features extracted before the
classification step. The points associated with the concepts C0441633 (diagnostic
scanning), C0817096 (thoracics) and C0935598 (sagittal planes set) are labeled
in red, green and blue, respectively, each painted in an additive fashion.




  Fig. 2. The 2D projections of the features, obtained with PCA, for each model.


 Fig. 3. The 2D projections of the features, obtained with UMAP, for each model.


    Comparing the two dimensionality reduction algorithms, we notice that the
representations obtained with PCA have more outliers. In a good representation,
the samples would be linearly separable based on their associated concepts. In
both types of representations, we can identify regions of the manifold in which
points of one of the chosen labels are mostly gathered, which is more perceptible
in the representations obtained with UMAP and for the CNN trained end-to-end.
In fact, the clustering of representations with common labels is highly expected,
even more so for the CNN, since the feature learning in this case was solely
guided by the target labels. In general, these observations are only a rough
approximation of the effective performance of each method.

3.2 Quantitative Results
The metrics on the validation and test sets for each submission are depicted in
Table 4. Both Val F1-score and Test F1-score represent the F1-scores averaged
sample-wise. When applied to the validation set, all concepts in the ground truth
were considered, even though predictions were made only for the 1,100 most
frequent concepts in the training set, thus always predicting "negative" for the
remaining concepts.



Table 4. The final results obtained in the concept detection task, ranked by the list
of all valid submissions.

                                                                    Val        Test
 Rank    Run file name                          Details          F1 -score   F1 -score
 16      simplenet                              CNN-A-256px      0.19103     0.20586
 19      simplenet128x128                       CNN-B-128px      0.17636     0.18934
 20      mix-1100-o0-2019-05-06_1311            AE + AAE         0.16973     0.18254
 21      aae-1100-o0-2019-05-02_1509            AAE              0.16064     0.17601
 24      ae-1100-o0-2019-05-02_1453             AE               0.16021     0.17152
 25      cedd-1100-o0-2019-05-03_0937-trim      CEDD             0.15725     0.16679
 38      simplenet64x64                         CNN-B-64px       0.11747     0.12799
 39      resnet19-cnn                           CNN-RN           0.11813     0.12695




   The scores obtained from the end-to-end CNN models (CNN-A-256px, CNN-B-
128px, CNN-B-64px) were highly varied, which demonstrates the impact of the
input shape, as well as of the neural network architecture, on the performance of
the model. With an image resolution of 64 × 64, this approach did not perform
better than any of the 2-stage procedures. On the other hand, higher resolutions
have contributed to significantly better scores.
   Concerning the unsupervised methods, the mean F1 -score obtained with
CEDDs (CEDD) was lower than with the deep learning architectures, proba-
bly because they lack representation ability for high-level problems, an effect
that was also observed in prior work [13]. Even with a smaller data set than
the previous edition of the concept detection task, unsupervised methods have
pushed the performance limits within the initially proposed input shape.
   The early fusion of the features obtained from the two auto-encoders (AE +
AAE) was also beneficial, resulting in a higher score than either of the two feature
sets independently (AE and AAE), suggesting that this aggregation was not
entirely redundant and provided another useful distribution.
    It is also worth noting that, much unlike in our previous participations in the
same task, the instance-wise mean F1 -scores on the testing set were higher than
on the validation set. This effect is even more noteworthy, since these methods
relied on the validation set for threshold optimization, and as such the classifiers
were fitted to both the training and validation sets. This consistent discrepancy
is due to the fact that the test set did not include any new concepts that were
not present in the overall data set provided at the beginning of the challenge,
whereas the validation set contained some concepts which were not present in
the training set.
4 Conclusion

In the context of the ImageCLEFmed Caption challenge, we assessed feature
learning techniques for concept detection from radiology images of several
medical imaging modalities. The extraction of informative – and intriguing –
visual features holds great potential for multiple use cases in medical imaging
systems, including automated image labelling and content-based retrieval.
    We confirmed the greater potential of deeper architectures for the con-
struction of more powerful representations, in comparison with low-level feature
extraction algorithms. With the data set being significantly smaller in this
edition of the challenge, we saw an opportunity to compare end-to-end
classification models with the use of unsupervised learning methods. The best-
performing CNN model took larger images as input, a factor which counterbalanced
the comparison and allowed it to obtain a better F1-score than the unsupervised
models. In fact, at a late stage of these experiments, we identified that the at-
tempted resolution of 64 × 64 is insufficient to attain better results. In the end, a
simple CNN with a higher input resolution showed the best performance among
these submissions. Time constraints did not allow us to combine the two ideas in
our submissions.
    The quality of the results obtained in this edition may also be attributed to
the use of threshold tuning to optimize the F1-score. Without an adjustment
of the classifiers’ operating point, these methods would focus on the highest
accuracy of a prediction, which is not as useful in the context of information
retrieval. When the number of positive samples is low, the potential retrieval of
less relevant entries is compensated by a significantly greater chance of retrieving
relevant images. Nevertheless, we understand that the focus on a single metric
can distort the perception of quality among the multiple methods in the challenge,
such that a change of performance metric could result in different rankings [8].
Therefore, it may be insightful for future editions to also present other metrics
alongside the main metric, such as the mean precision and recall on the testing set.
    This year presented an increase in participant engagement in the challenge,
which might reflect the interest in solving timely problems in medical information
retrieval and automated medical data analysis. We believe that further invest-
ment in the challenge, both from participants and organizers, will enable the
implementation of these solutions in real-world scenarios.


Acknowledgments

This work was supported by the Integrated Programme of SR&TD “SOCA”
(Ref. CENTRO-01-0145-FEDER-000010), co-funded by Centro 2020 program,
Portugal 2020, European Union, through the European Regional Development
Fund.
References

 1. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization (Jul 2016), http://arxiv.
    org/abs/1607.06450
 2. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: Color and edge directivity descriptor:
    A compact descriptor for image indexing and retrieval. In: International Conference
    on Computer Vision Systems (ICVS). pp. 312–322 (2008). https://doi.org/10.1007/978-3-540-79547-6_30
 3. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
    Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural
    Information Processing Systems. pp. 1–9 (2014)
 4. Ionescu, B., Müller, H., Péteri, R., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran,
    M.T., Lux, M., Gurrin, C., Cid, Y.D., Liauchuk, V., Kovalev, V., Ben Abacha,
    A., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Pelka, O., Friedrich,
    C.M., Chamberlain, J., Clark, A., de Herrera, A.G.S., Garcia, N., Kavallieratou,
    E., del Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K.: Overview
    of ImageCLEF 2019: Challenges, datasets and evaluation. In: Experimental IR
    Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Tenth
    International Conference of the CLEF Association (CLEF 2019), LNCS Lecture
    Notes in Computer Science, Springer, Lugano, Switzerland (September 09-12 2019)
 5. Kingma, D.P., Ba, J.L.: Adam: A method for stochastic optimization. In: Interna-
    tional Conference on Learning Representations (2015), https://arxiv.org/pdf/
    1412.6980.pdf
 6. Kurach, K., Lucic, M., Zhai, X., Michalski, M., Gelly, S.: The GAN landscape:
    Losses, architectures, regularization, and normalization. CoRR abs/1807.04720
    (2018), http://arxiv.org/abs/1807.04720
 7. Lai, Z., Deng, H.: Medical image classification based on deep features ex-
    tracted by deep model and statistic feature fusion with multilayer percep-
    tron. Computational Intelligence and Neuroscience 2018, 1–13 (09 2018).
    https://doi.org/10.1155/2018/2061516
 8. Lipton, Z.C., Elkan, C., Narayanaswamy, B.: Thresholding classifiers to maximize
    F1 score. Machine Learning and Knowledge Discovery in Databases 8725, 225–239
    (2014), http://arxiv.org/abs/1402.1892
 9. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoen-
    coders (nov 2015), http://arxiv.org/abs/1511.05644
10. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approximation and
    Projection for Dimension Reduction. arXiv e-prints arXiv:1802.03426 (Feb 2018)
11. Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Müller, H.: Overview
    of the ImageCLEFmed 2019 concept prediction task. In: CLEF2019 Working
    Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, vol. 2380.
    Lugano, Switzerland (September 09-12 2019)
12. Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology objects
    in context (ROCO): A multimodal image dataset. In: Stoyanov, D., Taylor, Z.,
    Balocco, S., Sznitman, R., Martel, A., Maier-Hein, L., Duong, L., Zahnd, G.,
    Demirci, S., Albarqouni, S., Lee, S.L., Moriconi, S., Cheplygina, V., Mateus, D.,
    Trucco, E., Granger, E., Jannin, P. (eds.) Intravascular Imaging and Computer As-
    sisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label
    Synthesis. pp. 180–189. Springer International Publishing, Cham (2018)
13. Pinho, E., Costa, C.: Feature learning with adversarial networks for concept detec-
    tion in medical images: UA.PT Bioinformatics at ImageCLEF 2018. In: Working
    Notes of CLEF (2018)
14. Pinho, E., Costa, C.: Unsupervised learning for concept detection in
    medical images: A comparative analysis. Applied Sciences 8(8) (2018).
    https://doi.org/10.3390/app8081213
15. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classifica-
    tion: A comprehensive review. Neural Computation 29(9), 2352 – 2449 (2017).
    https://doi.org/10.1162/neco_a_00990
16. Reddi, S.J., Kale, S., Kumar, S.: On the Convergence of Adam and Beyond. In: In-
    ternational Conference on Learning Representations (2018), https://openreview.
    net/forum?id=ryQu7f-RZ