AUEB NLP Group at ImageCLEFmed Caption 2019

Vasiliki Kougia, John Pavlopoulos, and Ion Androutsopoulos
Department of Informatics, Athens University of Economics and Business, Greece
{kouyiav,annis,ion}@aueb.gr

Abstract. We present the systems that AUEB's NLP Group used to participate in the ImageCLEFmed 2019 Caption task. The goal of this task is to automatically select medical concepts related to each image, as a first step towards generating image captions or medical reports, or to help in medical diagnosis. We participated with four systems, all using CNN image encoders. The encoder of each system is combined with an image retrieval method or a feed-forward neural network to predict concepts. Our systems were ranked 1st, 2nd, 3rd, and 5th.

Keywords: Medical Images · Concept Detection · Image Retrieval · Multi-label Classification · Image Captioning · Machine Learning · Deep Learning

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

Deep learning methods are being developed to automatically interpret biomedical images in order to help clinicians who examine large numbers of images daily [10]. The ImageCLEFmed Caption task [12] is part of ImageCLEF 2019 [6] (https://www.imageclef.org/2019/medical/caption/). ImageCLEF is a campaign that proposes novel challenges and develops benchmarking resources for the evaluation of systems operating on images. The ImageCLEFmed Caption task ran for the third year in 2019. It included a Concept Detection sub-task, where the goal was to perform multi-label classification of medical images by automatically selecting the medical concepts that should be assigned to each image. The concepts come from the Unified Medical Language System (UMLS, https://www.nlm.nih.gov/research/umls/). Selecting the appropriate concepts per image can be a first step towards automatically generating image captions and longer medical reports, and can also assist, more generally, in computer-assisted diagnosis [9]. In the two previous years, ImageCLEFmed also included a Caption Prediction (generation) sub-task [2, 4], which was not included this year.

This paper presents the four Concept Detection systems that AUEB's NLP Group used to participate in ImageCLEFmed 2019 Caption. The systems were ranked 1st, 2nd, 3rd, and 5th. The system that was ranked 3rd consists of a DenseNet-121 [5] Convolutional Neural Network (CNN) image encoder and a k-Nearest Neighbors (k-NN) retrieval component that uses the encoding of the image being classified to retrieve similar training images with known concepts; these are then used to assign concepts to the new image. The top-ranked system is a re-implementation of CheXNet [14], with modifications for ImageCLEFmed Caption 2019. CheXNet also uses the DenseNet-121 encoder [5], combined with a feed-forward neural network (FFNN) that performs multi-label classification. The second-best system is an ensemble combining concept probability scores obtained from the CheXNet-based system and image similarity scores produced by k-NN retrieval of similar training images. The system ranked 5th uses the VGG-19 image encoder [15], which was also used by Jing et al. [7], combined with a FFNN for multi-label classification.

2 Data

The ImageCLEFmed Caption 2019 dataset is a subset of the Radiology Objects in COntext (ROCO) dataset [13].
It consists of medical images extracted from open access biomedical journal articles of PubMed Central (https://www.ncbi.nlm.nih.gov/pmc/). Each image was extracted along with its caption. The caption was processed using QuickUMLS [16] to produce the gold UMLS concept unique identifiers (CUIs). An image can be associated with multiple CUIs (Figure 1). Each CUI is accompanied by its corresponding UMLS term.

Fig. 1. Two images from ImageCLEFmed Caption 2019, with their gold CUIs and UMLS terms.

In ImageCLEFmed Caption 2017 [2] and 2018 [4], the datasets were noisy. They included generic and compound images, covering a wide diversity of medical images; there was also a large total number of concepts (111,155), and some of them were too generic and did not appropriately describe the images [18]. In the ROCO dataset, compound and non-radiology images were filtered out using a CNN model. This led to 80,786 radiology images in total, of which 56,629 images were provided as the training set, 14,157 as the validation set, and the remaining 10,000 images were used for testing. In ImageCLEFmed Caption 2019, the total number of UMLS concepts was reduced to 5,528, with 6 concepts assigned to each training image on average. The minimum number of concepts per training image is 1, and the maximum is 72. Table 1 shows the 6 most frequent concepts of the training set and how many training images they were assigned to, according to the gold annotations. We note that 312 of the 5,528 total concepts are not assigned to any training image, and 1,530 concepts are assigned to only one training image.

CUI        UMLS term             Images
C0441633   diagnostic scanning    6,733
C0043299   x-ray procedure        6,321
C1962945   radiogr                6,318
C0040395   tomogr                 6,235
C0034579   pantomogr              6,127
C0817096   thoracics              5,981

Table 1. The 6 most frequent concepts (CUIs) in the training set of ImageCLEFmed Caption 2019 and how many training images they are assigned to, according to the gold annotations.

We randomly selected 20% of the training images and used them as our development set (11,326 images, along with their gold concepts). The models we used to produce the submitted results were trained on the entire training set. The validation set was used for hyper-parameter tuning and early stopping.

3 Methods

This section describes the four methods we developed for ImageCLEFmed Caption 2019.

3.1 System 1: DenseNet-121 Encoder + k-NN Image Retrieval (Ranked 3rd)

In this system, we followed a retrieval approach, extending the 1-NN baseline of our previous work on biomedical image captioning [9]. Given a test image, the previous 1-NN baseline returned the caption of the most similar training image, using a CNN encoder to map each image to a dense vector. For ImageCLEFmed Caption 2019, we retrieve the k most similar training images and use their concepts, as described below.

We use the DenseNet-121 [5] image encoder, a CNN with 121 layers, where all layers are directly connected to each other, improving information flow and avoiding vanishing gradients. We started with DenseNet-121 pre-trained on ImageNet [1] and fine-tuned it on ImageCLEFmed Caption 2019 training images; we used the implementation of https://keras.io/applications/#densenet. The fine-tuning was performed as when training DenseNet-121 in System 2, including data augmentation (Section 3.2). Without fine-tuning, the performance of the pre-trained encoder was worse.

ImageCLEFmed Caption 2019 images were rescaled to 224×224 and normalized with the mean and standard deviation of ImageNet, to match the requirements of DenseNet-121 and how it was pre-trained on ImageNet. Having fine-tuned DenseNet-121, we used it to obtain dense vector encodings, called image embeddings, of all training images. The image embeddings are extracted from the last average pooling layer of DenseNet-121. Given a test image (Fig. 2), we again use the fine-tuned DenseNet-121 to obtain the image's embedding. We then retrieve the k training images with the highest cosine similarity (computed on image embeddings) to the test image, and return the r concepts that are most frequent among the concepts of the k images. We set r to the average number of concepts per image of the particular k retrieved images. We tuned the value of k in the range from 1 to 200 using the validation set, which led to k = 199. Tuning k beyond this range might improve performance further. This system ranked 3rd.

Fig. 2. Illustration of how System 1 (DenseNet-121 and k-NN image retrieval) works at test time.
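To make the retrieval step concrete, the following is a minimal sketch of the k-NN concept selection at test time, assuming the embeddings of all training images have already been extracted with the fine-tuned DenseNet-121. The function and variable names (knn_predict_concepts, train_embeddings, train_concepts) are ours and purely illustrative; this is not the code of the submitted system.

```
import numpy as np
from collections import Counter

def knn_predict_concepts(test_emb, train_embeddings, train_concepts, k=199):
    """Assign concepts to one test image by k-NN retrieval over image embeddings.

    test_emb:         (d,) embedding of the test image (last average pooling layer).
    train_embeddings: (N, d) embeddings of all training images.
    train_concepts:   list of N lists with the gold CUIs of each training image.
    """
    # Cosine similarity between the test image and every training image.
    sims = train_embeddings @ test_emb
    sims = sims / (np.linalg.norm(train_embeddings, axis=1)
                   * np.linalg.norm(test_emb) + 1e-12)

    # Indices of the k most similar training images.
    top_k = np.argsort(-sims)[:k]

    # r = average number of gold concepts of the k retrieved images (rounded).
    r = int(round(np.mean([len(train_concepts[i]) for i in top_k])))

    # Return the r concepts that are most frequent among the retrieved images.
    counts = Counter(c for i in top_k for c in train_concepts[i])
    return [cui for cui, _ in counts.most_common(r)]
```

With k = 199, as tuned on the validation set, this mirrors the behaviour described above.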
3.2 System 2: CheXNet-based, DenseNet-121 Encoder + FFNN (Ranked 1st)

This system, which is based on CheXNet [14], achieved the best results in ImageCLEFmed Caption 2019. In its original form, CheXNet maps X-rays of the ChestX-ray14 dataset [17] to 14 labels. It uses DenseNet-121 [5] to encode images, adding a FFNN to assign one or more of the 14 labels (classes) to each image.

We re-implemented CheXNet in Keras (https://keras.io/) and extended it for the many more labels (5,528 vs. 14) of ImageCLEFmed Caption 2019. The images of ImageCLEFmed Caption 2019 were again rescaled to 224×224 and normalized using the mean and standard deviation values of ImageNet. The training images of ImageCLEFmed Caption 2019 were also augmented by applying random horizontal flips. Image embeddings are again extracted from the last average pooling layer of DenseNet-121. In this system, however, the image embeddings are then passed through a dense layer with 5,528 outputs and sigmoid activations to produce a probability per label. We trained the model by minimizing binary cross-entropy loss. We used Adam [8] with its default hyper-parameters and early stopping on the validation set with a patience of 3 epochs. We also decayed the learning rate by a factor of 10 when the validation loss stopped improving.

At test time, we predict the concepts of each test image using their probabilities, as estimated by the trained model. Each concept (label) is assigned to the test image if the corresponding predicted probability exceeds a threshold t. We use the same t value for all 5,528 concepts. We tuned t on the validation set, which led to t = 0.16.
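As a rough illustration of the architecture just described (not the exact code of our submission), the following Keras sketch stacks a dense sigmoid layer with 5,528 outputs on top of an ImageNet-pretrained DenseNet-121 and trains it with binary cross-entropy. We use the tensorflow.keras API here for self-containedness; the training and prediction calls, augmentation, and callbacks are only indicated in comments.

```
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras import layers, models

NUM_CONCEPTS = 5528

# DenseNet-121 encoder pre-trained on ImageNet; pooling="avg" exposes the
# last average pooling layer, whose output serves as the image embedding.
encoder = DenseNet121(weights="imagenet", include_top=False,
                      input_shape=(224, 224, 3), pooling="avg")

# One sigmoid output per concept, so each concept gets an independent probability.
probs = layers.Dense(NUM_CONCEPTS, activation="sigmoid")(encoder.output)
model = models.Model(inputs=encoder.input, outputs=probs)

# Binary cross-entropy over the 5,528 labels, optimized with Adam's defaults.
model.compile(optimizer="adam", loss="binary_crossentropy")

# Training would add random horizontal flips for augmentation, early stopping
# (patience 3) and learning-rate decay by a factor of 10 on the validation loss:
# model.fit(train_images, train_label_vectors, validation_data=..., callbacks=[...])

# At test time, a concept is assigned if its probability exceeds t (t = 0.16 above):
# predicted = model.predict(test_images) > 0.16
```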
3.3 System 3: Based on Jing et al., VGG-19 Encoder + FFNN (Ranked 5th)

This system is based on the work of Jing et al. [7], who presented an encoder-decoder model to generate tags and medical reports from medical images. Roughly speaking, the full model of Jing et al. uses a VGG-19 [15] image encoder, a multi-label classifier to produce tags (describing concepts) from the images, and a hierarchical LSTM that generates texts by attending on both image and tag embeddings; the top level of the LSTM generates sentence embeddings, and the bottom level generates the words of each sentence. We implemented in Keras a simplified version of the first part of Jing et al.'s model, the part that performs multi-label image classification.

Again, we rescale the ImageCLEFmed Caption 2019 images to 224×224 and normalize them using the mean and standard deviation of ImageNet. We feed the resulting images to the VGG-19 CNN, which has 19 layers and uses small kernels of size 3×3. We used VGG-19 pre-trained on ImageNet (https://keras.io/applications/#vgg19). We feed whole images to VGG-19, unlike Jing et al. [7], who divide each image into regions and encode each region separately. The output of the last fully connected layer of VGG-19 is then given as input to a dense layer with a softmax activation to obtain a probability distribution over the concepts. The model is trained using categorical cross-entropy, which is calculated as:

E = -\sum_{i=1}^{|C|} y_{true,i} \log_2(y_{pred,i})    (1)

where C is the set of |C| = 5,528 concepts, y_{true} is the ground truth binary vector of a training image, and y_{pred} is the predicted softmax probability distribution over the concepts C for the training image. Categorical cross-entropy sums loss terms only for the gold concepts of the image, which have a value of 1 in y_{true}. When using softmax and categorical cross-entropy, usually y_{true} is a one-hot vector and the classes are mutually exclusive (single-label classification). To use softmax with categorical cross-entropy for multi-label classification, where y_{true} is binary but not necessarily one-hot, the loss is divided by the number of gold labels (true concepts) [3, 11]. Jing et al. [7] achieve this by dividing the ground truth binary vector y_{true} by its L1 norm, which equals the number of gold labels. Hence, the categorical cross-entropy loss is computed as follows:

E = -\sum_{i=1}^{|C|} \frac{y_{true,i}}{\lVert y_{true} \rVert_1} \log_2(y_{pred,i}) = -\frac{1}{M} \sum_{j=1}^{M} \log_2(y_{pred,j})    (2)

where M is the number of gold labels (true concepts) of the training image, which is different per training image. In this model, the loss of Eq. 2 achieved better results on the development set than binary cross-entropy with a sigmoid activation per concept. We used the Adam optimizer with an initial learning rate of 1e-5 and early stopping on the validation set with a patience of 3 epochs. Given a test image, we return the six concepts with the highest probability scores, since the average number of gold concepts per training image is 6.

3.4 System 4: Ensemble, k-NN Image Retrieval + CheXNet (Ranked 2nd)

This method is an ensemble of System 1 (DenseNet-121 + k-NN Image Retrieval) and System 2 (CheXNet-based), where System 1 is modified to produce a score for each returned concept. Given a test image g, we use System 1 (Fig. 2) to retrieve the k most similar training images g_1, ..., g_k, their gold concepts, and the cosine similarities s(g, g_1), ..., s(g, g_k) between the test image g and each one of the k retrieved images. Let C again be the set of |C| = 5,528 concepts. For each concept c_j ∈ C, the modified System 1 assigns to c_j the following score:

v_1(c_j, g) = \sum_{i=1}^{k} s(g, g_i) \, \delta(c_j, g_i)    (3)

where \delta(c_j, g_i) = 1 if c_j is a gold concept of the retrieved training image g_i, and \delta(c_j, g_i) = 0 otherwise. In other words, the score of each concept c_j is the sum of the cosine similarities of the retrieved training images that have c_j as a gold concept. For the same test image g, we also obtain concept probabilities from System 2, i.e., a vector of 5,528 probabilities. Let v_2(c_j, g) be the probability of concept c_j being correct for test image g according to System 2. For each c_j ∈ C, the ensemble's score v(c_j, g) of c_j is simply the average of v_1(c_j, g) and v_2(c_j, g). The ensemble returns the six concepts with the highest v(c_j, g) scores, as in System 3, on the grounds that the average number of gold concepts per training image is 6.
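A minimal sketch of the ensemble scoring (Eq. 3 plus the averaging step) is given below, assuming the k-NN similarities and gold concepts of the retrieved images are available from System 1 and the probability vector from System 2. All names (ensemble_scores, concept_index, etc.) are illustrative, not part of the submitted code.

```
import numpy as np

def ensemble_scores(sims_top_k, retrieved_concepts, chexnet_probs, concept_index):
    """Combine k-NN retrieval scores (Eq. 3) with the System 2 probabilities.

    sims_top_k:         cosine similarities s(g, g_i) of the k retrieved images.
    retrieved_concepts: list of k lists with the gold CUIs of the retrieved images.
    chexnet_probs:      (|C|,) vector of concept probabilities from System 2.
    concept_index:      dict mapping each CUI to its position in chexnet_probs.
    """
    v1 = np.zeros_like(chexnet_probs)
    # v1(c_j, g): sum of similarities of retrieved images having c_j as gold concept.
    for sim, concepts in zip(sims_top_k, retrieved_concepts):
        for cui in concepts:
            v1[concept_index[cui]] += sim

    # The ensemble score of each concept is the average of the two scores.
    return (v1 + chexnet_probs) / 2.0

# The six concepts with the highest ensemble scores are returned per test image:
# top6 = np.argsort(-ensemble_scores(...))[:6]
```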
4 Results

Systems were evaluated in ImageCLEFmed Caption 2019 by computing their F1 score on each test image, comparing the image's gold concepts (the binary vector y_{true}) to the concepts predicted by the system, and then averaging over all test images [6]. Table 2 reports the evaluation results of our four systems on the development and test data, as well as their ranking among the approximately 60 systems that participated in the task. The ensemble (System 4) had the best results on the development data, but the CheXNet-based system (System 2) had the best results on the test set.

System                          Description              F1 (Dev)    F1 (Test)   Ranking
S1                              DenseNet [5] + k-NN      0.2575244   0.2740204   3
S2 (CheXNet-based [14])         DenseNet [5] + FFNN      0.2599914   0.2823094   1
S3 (based on Jing et al. [7])   VGG-19 [15] + FFNN       0.2497768   0.2639952   5
S4 (ensemble)                   Combination of S1, S2    0.2644322   0.2792511   2

Table 2. Results of our four systems on development and test data.
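For completeness, the mean per-image F1 evaluation described above can be sketched as follows; this is an illustrative re-implementation using scikit-learn's f1_score, not the official ImageCLEF evaluation script, and the function and variable names are ours.

```
import numpy as np
from sklearn.metrics import f1_score

def mean_image_f1(gold_sets, predicted_sets, all_concepts):
    """Mean per-image F1: compare the predicted and gold concept sets of each image."""
    index = {c: i for i, c in enumerate(all_concepts)}
    scores = []
    for gold, pred in zip(gold_sets, predicted_sets):
        y_true = np.zeros(len(all_concepts), dtype=int)
        y_pred = np.zeros(len(all_concepts), dtype=int)
        y_true[[index[c] for c in gold]] = 1
        y_pred[[index[c] for c in pred if c in index]] = 1
        scores.append(f1_score(y_true, y_pred, zero_division=0))
    return float(np.mean(scores))
```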
5 Conclusions and Future Work

We described the four systems that AUEB's NLP Group used to participate in ImageCLEFmed 2019 Caption. The four systems were ranked 1st, 2nd, 3rd, and 5th. Our top system was a re-implementation of CheXNet [14], with modifications to handle the much larger label set of ImageCLEFmed 2019 Caption and data augmentation. The system that was ranked 3rd used DenseNet [5] to encode images and k-NN retrieval to return the concepts of the most similar training images. Our second-best system was an ensemble of the previous two (CheXNet-based and k-NN based), indicating that the two approaches are complementary. Our weakest system, which nevertheless was ranked 5th, was based on the multi-label classification part of the system of Jing et al. [7], which aims to generate draft medical reports using an encoder-decoder approach. In future work, we aim to experiment with, combine, and improve upon additional methods and datasets for medical image captioning. Towards that direction, we recently published a survey on medical image-to-text methods [9], which we also plan to extend.

References

1. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. Miami Beach, FL, USA (2009)
2. Eickhoff, C., Schwall, I., de Herrera, A.G.S., Müller, H.: Overview of ImageCLEFcaption 2017 - the Image Caption Prediction and Concept Extraction Tasks to Understand Biomedical Images. In: CLEF2017 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Dublin, Ireland (September 11-14 2017)
3. Gong, Y., Jia, Y., Leung, T., Toshev, A., Ioffe, S.: Deep Convolutional Ranking for Multilabel Image Annotation. In: International Conference on Learning Representations (2014)
4. de Herrera, A.G.S., Eickhoff, C., Andrearczyk, V., Müller, H.: Overview of the ImageCLEF 2018 Caption Prediction Tasks. In: CLEF2018 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (September 10-14 2018)
5. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708. Honolulu, HI, USA (2017)
6. Ionescu, B., Müller, H., Péteri, R., Cid, Y.D., Liauchuk, V., Kovalev, V., Klimuk, D., Tarasau, A., Abacha, A.B., Hasan, S.A., Datla, V., Liu, J., Demner-Fushman, D., Dang-Nguyen, D.T., Piras, L., Riegler, M., Tran, M.T., Lux, M., Gurrin, C., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Garcia, N., Kavallieratou, E., del Blanco, C.R., Rodríguez, C.C., Vasillopoulos, N., Karampidis, K., Chamberlain, J., Clark, A., Campello, A.: ImageCLEF 2019: Multimedia Retrieval in Medicine, Lifelogging, Security and Nature. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019), LNCS Lecture Notes in Computer Science, Springer, Lugano, Switzerland (September 9-12 2019)
7. Jing, B., Xie, P., Xing, E.: On the Automatic Generation of Medical Imaging Reports. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers). pp. 2577–2586. Melbourne, Australia (2018)
8. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014)
9. Kougia, V., Pavlopoulos, J., Androutsopoulos, I.: A Survey on Biomedical Image Captioning. In: Workshop on Shortcomings in Vision and Language of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 26–36. Minneapolis, MN, USA (2019)
10. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Laak, J.A.V.D., Ginneken, B.V., Sánchez, C.I.: A Survey on Deep Learning in Medical Image Analysis. Medical Image Analysis 42, 60–88 (2017)
11. Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., van der Maaten, L.: Exploring the Limits of Weakly Supervised Pretraining. In: European Conference on Computer Vision. pp. 181–196. Munich, Germany (2018)
12. Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Müller, H.: Overview of the ImageCLEFmed 2019 Concept Prediction Task. In: CLEF2019 Working Notes. CEUR Workshop Proceedings (ISSN 1613-0073), CEUR-WS.org, Lugano, Switzerland (September 09-12 2019)
13. Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In: MICCAI Workshop on Large-scale Annotation of Biomedical Data and Expert Label Synthesis. pp. 180–189. Granada, Spain (2018)
14. Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., et al.: CheXNet: Radiologist-Level Pneumonia Detection on Chest X-rays with Deep Learning. arXiv:1711.05225 (2017)
15. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 (2014)
16. Soldaini, L., Goharian, N.: QuickUMLS: A Fast, Unsupervised Approach for Medical Concept Extraction. In: MedIR Workshop (2016)
17. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2097–2106. Honolulu, HI, USA (2017)
18. Zhang, Y., Wang, X., Guo, Z., Li, J.: ImageSem at ImageCLEF 2018 Caption Task: Image Retrieval and Transfer Learning. In: CLEF2018 Working Notes. CEUR Workshop Proceedings. Avignon, France (2018)