AUEB NLP Group at ImageCLEFmed Caption 2020

Basil Karatzas1, John Pavlopoulos2,1*[0000-0001-9188-7425], Vasiliki Kougia2,1[0000-0002-0172-6917], and Ion Androutsopoulos1

1 Department of Informatics, Athens University of Economics and Business, Greece
2 Department of Computer and Systems Sciences, Stockholm University, Sweden
karatzas.basil@gmail.com, {annis,kouyiav,ion}@aueb.gr

Abstract. This article concerns the participation of AUEB's NLP Group in the ImageCLEFmed Caption task of 2020. The goal of the task was to identify the medical terms that best describe each image, in order to accelerate and improve the interpretation of medical images by experts and systems. The systems we implemented extend our previous work [7,8,9] on models that employ CNN image encoders combined with an image retrieval method or a feed-forward neural network. Our systems were ranked 1st, 2nd and 6th.

Keywords: Medical Images · Concept Detection · Image Retrieval · Image Captioning · Multi-label Classification · Multimodal · Ensemble · Convolutional Neural Network (CNN) · Machine Learning · Deep Learning

1 Introduction

ImageCLEF [4] is an evaluation campaign held annually since 2003 as part of CLEF,3 and revolves around image analysis and retrieval tasks. ImageCLEFmedical [11] is a collection of ImageCLEF tasks that are associated with the study of medical images. In 2020, it consisted of 3 tasks: VQA-Med, Caption and Tuberculosis.4 The ImageCLEFmed Caption task concerns the automatic assignment of medical terms (called concepts) to medical images. The dataset of ImageCLEFmed Caption 2020 consisted of medical images, which were split into 7 categories according to their radiology modality (see Table 1). Writing a diagnostic report for a medical image is a demanding and very time-consuming task that needs to be handled by medical experts [2,14]. One of the main goals of the ImageCLEFmed Caption task is to assist the development of efficient, multi-label, medical-image tagging models, which could be used to assist medical experts, reduce the time needed for diagnosis, and reduce potential medical errors.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.
* Corresponding author.
3 http://www.clef-initiative.eu/
4 https://www.imageclef.org/2020/medical

Table 1. The names of the seven categories in the dataset (first column), as provided by the organisers, and the description of each category (second column), drawn from [10]. We also give the Concept Unique Identifiers (CUIs) of concepts that appear in every image of the respective category (third column) and their corresponding descriptions from the UMLS Metathesaurus (fourth column).

Name | ROCO Description        | CUI                | UMLS Description
DRAN | ANGIOGRAPHY             | C0002978           | Angiogram
DRCO | COMBINED MODALITIES     | -                  | -
DRCT | COMPUTERIZED TOMOGRAPHY | C0040398, C0040405 | Tomography, X-Ray Computed Tomography
DRMR | MAGNETIC RESONANCE      | C0024485           | Magnetic Resonance Imaging
DRPE | PET                     | C0032743           | Positron-Emission Tomography
DRUS | ULTRASOUND              | C0041618           | Ultrasonography
DRXR | X-RAY, 2D TOMOGRAPHY    | C0043299           | Diagnostic radiologic examination

In this paper we describe the medical image tagging systems of the AUEB NLP Group that were submitted to ImageCLEFmed Caption 2020. Following last year's success [8], our 3 submissions were ranked 1st, 2nd and 6th.5 Overall, our submissions were based on two methods.
The first method was based on the Mean@k-NN system of [9], which assigns concepts to each test image using its k nearest neighbors from the training set. The second method extends the ConceptCXN system of [9] and uses a DenseNet-121 CNN to encode each test image, with a feed-forward neural network classifier on top. The remainder of the paper describes the data, methods, submitted systems and results, followed by conclusions and future directions.
5 Our best performing system will become available in the bioCaption PyPi package.

2 Data

Initially, the ImageCLEFmed Caption datasets comprised a broad variety of clinical images, extracted from figures of scientific articles found in the open-access biomedical literature database PubMed Central.6 Each image was assigned medical terms from the Unified Medical Language System (UMLS) [1]. These terms, called concepts, were extracted from the processed text of the respective figure caption. Since 2019, in order to discard compound or non-radiology images from the initial datasets, the organisers applied filters and also performed a manual revision of their data. A subset of the resulting dataset, called the extended Radiology Objects in COntext (ROCO) dataset [10], was used as the dataset of this year's competition (see Fig. 1). Additionally, this year, images were classified into 7 mutually exclusive categories, as shown in Table 1, depending on the type of the radiology exam.
6 https://www.ncbi.nlm.nih.gov/pmc/

Fig. 1. Two images from ImageCLEFmed Caption 2020, with their gold CUIs. On the left is an image from Computed Tomography (CT) and on the right an image from Positron Emission Tomography (PET).

The number of possible concepts was reduced compared to previous years, by removing concepts with few occurrences, since the large number of concepts in previous years made the task difficult for models [15]. There were 111,156 possible concepts in 2018 [16], 5,528 in 2019 [8] and 3,047 in 2020. We also observed that there were concepts that appeared in every single image of a specific category, but rarely or never appeared in other categories. The Concept Unique Identifiers (CUIs) of these concepts are shown in Table 1.7 These concepts essentially describe the modality of the respective category. For example, DRPE images are always assigned C0032743, whose UMLS term is Positron-Emission Tomography.
7 We used the UMLS Metathesaurus (uts.nlm.nih.gov/home.html) to map each CUI to its term.

The dataset was split by the organisers into a training set of 64,753 images, a validation set of 15,970 images, and a test set of 3,534 images. For our experiments, we merged the provided training and validation sets and used 10% of the merged data as a development set. We will refer to the remaining 90% of the merged dataset as the training set for the rest of the paper.

3 Methods

This section describes the systems that were used in our submissions.

3.1 System 1: 2xCNN+FFNN

CNN+FFNN (a.k.a. ConceptCXN or DenseNet121+FFNN) [9,8] is the system that we submitted last year for the same task, where it was ranked 1st. It is a variation of CheXNet [13] that uses DenseNet-121 [3], a stack of 120 CNN layers, followed by a feed-forward neural network (FFNN) that acts as a classifier layer on top.
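To make this architecture concrete, the following PyTorch sketch shows one way such a model can be assembled: an ImageNet-pretrained DenseNet-121 trunk, global average pooling, and a linear (feed-forward) classification layer with one output per concept, trained with binary cross-entropy. This is only an illustration under assumed settings (e.g., the learning rate and the example number of outputs), not our exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class ConceptTagger(nn.Module):
    """DenseNet-121 encoder with a feed-forward classifier head for
    multi-label concept tagging (illustrative sketch, not the exact system)."""

    def __init__(self, num_concepts: int):
        super().__init__()
        densenet = models.densenet121(pretrained=True)   # ImageNet-pretrained trunk
        self.encoder = densenet.features                  # convolutional layers only
        self.pool = nn.AdaptiveAvgPool2d(1)               # global average pooling
        self.classifier = nn.Linear(1024, num_concepts)   # one logit per concept

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = torch.relu(self.encoder(images))          # (B, 1024, 7, 7) for 224x224 inputs
        pooled = self.pool(feats).flatten(1)               # (B, 1024) image embedding
        return self.classifier(pooled)                      # raw logits; sigmoid is applied in the loss

# Multi-label training setup; the learning rate and output count are illustrative values.
model = ConceptTagger(num_concepts=500)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```

At inference, the sigmoid outputs are compared against a tuned classification threshold to decide which concepts to assign to an image; threshold tuning is described below.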
In the ImageCLEFmed Caption task of 2019, we changed the original FFNN to comprise 5,528 outputs (instead of 14), one per available concept. For the 2020 task, where the images are additionally divided into categories (see Table 1), we followed the same approach per category, i.e., we employed one model per category. For example, the FFNN of the model for category C generates N_C outputs, where N_C is the number of all possible concepts of category C.8 The red bars of Fig. 2 depict the N_C values per category C, computed on the training set.
8 We did not use image augmentation this year due to time restrictions, because of the large number of models we needed to train.

Fig. 2. Statistics regarding the number of images and concepts per category.

We trained each model by minimizing the binary cross-entropy loss. We used Adam [6] as our optimizer and decreased the learning rate by a factor of 10 when the loss showed no improvement, following [8]. We used a batch size of 16 and early stopping with a patience of 3 epochs. For each CNN+FFNN of a specific category (i.e., for each category, we fine-tuned a CNN and an FFNN on top), a single classification threshold for all the concepts of the respective category was tuned by optimising the F1 score. Any concepts whose output values exceeded that threshold were assigned to the corresponding image.

Two of our 2020 submissions consisted of ensembles of CNN+FFNN models, constructed in the following way. We trained 5 models per category and kept the 2 best performing ones, according to their F1 score. We then created two ensembles, using the union and the intersection of the concepts returned by these two models. Hereafter, these two ensembles will be called 2xCNN+FFNN@U and 2xCNN+FFNN@I, respectively.

Fig. 3. Boxplots illustrating the number of gold concepts for the images of each category.

3.2 System 2: CNN+k-NN

Following our previous work [8,7], the goal of our CNN+k-NN model is, for each test image, to retrieve similar images from the training set. The encoder of this model stems from our fine-tuned CNN+FFNN system, hence there is one CNN per category. We employed the output of the last average pooling layer of the CNN to represent each encoded image.9 The encoded test image was then compared to all the images in the training set (encoded offline), using cosine similarity, and the k nearest images were returned.
9 Each image is rescaled to 224x224 and normalised with the mean and standard deviation of ImageNet.

Fig. 4. Illustration of CNN+k-NN [8] for the DRAN category (we use DRAN as an example).

After retrieving the k nearest images, CNN+k-NN returns the r concepts that were most frequently assigned to those k images. We tuned k and r for each category, investigating values from 1 to 200 for k, and values from 1 to 10 for r. During tuning, we also considered two other functions for r. First, we used the average number of concepts of the k images:

r = \frac{1}{k} \sum_{i=1}^{k} n_i   (1)

Second, we used a weighting based on cosine similarity to weigh the concept counts:

r = \sum_{i=1}^{k} \frac{\cos(g, g_i)}{\sum_{j=1}^{k} \cos(g, g_j)} \, n_i   (2)

where n_i is the number of concepts of the i-th retrieved image, g is the (encoded) test image, g_i is the (encoded) i-th closest training image, and \cos(g, g_i) the cosine similarity between g and g_i.

As with CNN+FFNN, our DenseNet-121 CNN was pretrained on ImageNet and fine-tuned on the ImageCLEFmed Caption dataset. We also experimented with adding an attention layer10 [12] to our CNN. We call this model CNN+k-NN@att.
10 https://bit.ly/323coXF
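As a concrete illustration of the retrieval step and of the two alternative definitions of r in Eqs. (1) and (2), the sketch below assigns concepts to one encoded test image. The function name, the data layout (a matrix of training embeddings and a list of CUI lists) and the default values are illustrative assumptions, not our actual code.

```python
import numpy as np
from collections import Counter

def knn_concepts(test_vec, train_vecs, train_concepts, k, r_mode="fixed", r=3):
    """Return concepts for one test image via k-NN retrieval over CNN embeddings.
    train_vecs: (N, d) array of training-image embeddings (encoded offline).
    train_concepts: list of N lists of gold CUIs, aligned with train_vecs."""
    # Cosine similarity between the test embedding and every training embedding.
    sims = train_vecs @ test_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(test_vec) + 1e-12)
    nearest = np.argsort(-sims)[:k]          # indices of the k most similar training images

    n = [len(train_concepts[i]) for i in nearest]
    if r_mode == "average":                  # Eq. (1): mean number of gold concepts
        r = int(round(float(np.mean(n))))
    elif r_mode == "weighted":               # Eq. (2): similarity-weighted number of concepts
        w = sims[nearest] / sims[nearest].sum()
        r = int(round(float(np.dot(w, n))))

    # Keep the r concepts that occur most frequently among the k retrieved images.
    counts = Counter(cui for i in nearest for cui in train_concepts[i])
    return [cui for cui, _ in counts.most_common(r)]
```

Note that Eqs. (1) and (2) generally yield a non-integer value; rounding it to the nearest integer, as above, is one possible interpretation.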
We also experimented with fine-tuning our CNN on a large dataset of radiography images called MIMIC-CXR [5] (before fine-tuning it further on the ImageCLEFmed Caption dataset), but we did not obtain any improvements.

4 Submissions and Results

In order to decide which models to use for the final submissions, we evaluated all models on our development set. Since the images were separated into seven categories, each submission consisted of one model per category, resulting in seven models per submission. Two out of our three submissions employed different instances of the same system (2xCNN+FFNN@U and 2xCNN+FFNN@I), one instance per category, while the third one (called BEST@CATEGORY) combined results from different types of systems for each category.

The official measure of the competition was F1, macro-averaged over the images, without taking into account the different categories. To generate the predictions for the test set, we merged the training with the development set. We used a held-out set (20% of the merged data) to tune the hyper-parameters of the CNN+FFNN and CNN+k-NN models (see Tables 4 and 5 for the final values). As shown in Fig. 5, the best score improves every year. This is probably because the task was indeed simplified each year (see Section 2).

Fig. 5. The top F1 score achieved each year in the ImageCLEFmed Caption task.

Table 2. The F1 scores of our submitted models, measured on the development set.

Submission    | DRAN   | DRPE   | DRCO   | DRCT   | DRMR   | DRUS   | DRXR   | ANY
BEST@CATEGORY | 0.3012 | 0.2485 | 0.1650 | 0.4456 | 0.3413 | 0.3093 | 0.3656 | 0.3715
2xCNN+FFNN@U  | 0.3012 | 0.2485 | 0.1554 | 0.4455 | 0.3388 | 0.3063 | 0.3651 | 0.3704
2xCNN+FFNN@I  | 0.2995 | 0.2388 | 0.1560 | 0.4456 | 0.3400 | 0.3093 | 0.3656 | 0.3710

Table 3. The results and rankings of our systems on the development and test set. The baseline of the last row only predicts the concepts that always appear in the images of each category.

Approach      | F1 (Development) | F1 (Test) | Ranking
BEST@CATEGORY | 0.3715           | 0.3933    | 2
2xCNN+FFNN@U  | 0.3704           | 0.3870    | 6
2xCNN+FFNN@I  | 0.3710           | 0.3940    | 1
BASELINE      | 0.3666           | –         | –

Table 3 presents the scores of our systems on the development set and the official test set, along with the official rankings. 2xCNN+FFNN@I was the best. On the other hand, 2xCNN+FFNN@U, which returns the union (instead of the intersection) of the concepts predicted by the models in the ensemble, was ranked much lower. It is worth mentioning that a baseline, which simply returns the concepts that always appear in each category, achieves a very high F1 score on the development set. The submission that combined the best model per category (see Table 4) was ranked 2nd.

We noticed that our models tended to predict only the concepts that always appear in each category, thus we also report statistics regarding the diversity of our submissions on the test set (Fig. 6). We define diversity as the total number of distinct concepts a model predicted for a specific category, divided by the total number of distinct concepts found in the training set of that category. Observing the output of our models, we noticed that the models with very low diversity only predicted the concepts that appear in every image of the category they were trained on (as shown in Table 1).

Fig. 6. The diversity (number of distinct concepts predicted / number of all possible concepts) for each of our submitted models per category.
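For clarity, the diversity statistic can be computed as in the short sketch below; the argument names and the per-image representation (lists of CUIs) are illustrative.

```python
def diversity(predicted_per_image, training_per_image):
    """Diversity of a submission for one category: the number of distinct concepts
    predicted for the category's test images divided by the number of distinct
    concepts found in the category's training images (illustrative sketch)."""
    predicted = {cui for concepts in predicted_per_image for cui in concepts}
    possible = {cui for concepts in training_per_image for cui in concepts}
    return len(predicted) / len(possible)
```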
Table 4. The best models for each category according to the development scores, along with the hyper-parameter values used for the final submissions. t1 and t2 are the classification thresholds of the two models included in 2xCNN+FFNN, k is the number of nearest images, and r is the number of concepts (or the function that defines the number of concepts).

Category | Best Model   | Hyper-parameters
DRAN     | 2xCNN+FFNN@U | t1 = 0.34, t2 = 0.7
DRPE     | 2xCNN+FFNN@U | t1 = 0.3, t2 = 0.23
DRCO     | CNN+k-NN@att | k = 91, r = average
DRCT     | CNN+k-NN     | k = 5, r = 2
DRMR     | CNN+k-NN     | k = 6, r = 1
DRUS     | 2xCNN+FFNN@I | t1 = 0.18, t2 = 0.22
DRXR     | 2xCNN+FFNN@I | t1 = 0.45, t2 = 0.86

Table 5. The thresholds used in our 2xCNN+FFNN ensembles, one for each CNN+FFNN. The two CNN+FFNN models and their corresponding thresholds are different for each category.

Category | t1   | t2
DRAN     | 0.34 | 0.7
DRPE     | 0.3  | 0.23
DRCO     | 0.14 | 0.08
DRCT     | 0.73 | 0.5
DRMR     | 0.28 | 0.98
DRUS     | 0.18 | 0.22
DRXR     | 0.45 | 0.86

5 Conclusions and Future Work

This article described the submissions of AUEB's NLP Group to the 2020 ImageCLEFmed Caption task. One of our submissions was ranked 1st, while the other two were ranked 2nd and 6th. All of our systems used a DenseNet-121 CNN [3] to encode images. A retrieval-based method achieved the best results in three out of seven categories. However, an ensemble of two CNN+FFNN multi-label classifiers was ranked 1st overall. Future work includes the assessment of our models on more datasets and the improvement of retrieval-based methods, which are still under-explored.

References

1. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(Database issue), 4 (Jan 2004)
2. Chokshi, F.H., Hughes, D.R., Wang, J.M., Mullins, M.E., Hawkins, C.M., Duszak Jr., R.: Diagnostic radiology resident and fellow workloads: a 12-year longitudinal trend analysis using national Medicare aggregate claims data. Journal of the American College of Radiology 12(7), 664–669 (2015)
3. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708. Honolulu, HI, USA (2017)
4. Ionescu, B., Müller, H., Péteri, R., Abacha, A.B., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia retrieval in medical, lifelogging, nature, and internet applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), vol. 12260. Lecture Notes in Computer Science, Springer, Thessaloniki, Greece (September 22-25, 2020)
5. Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., Deng, C.Y., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs (2019)
6. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014)
7. Kougia, V., Pavlopoulos, J., Androutsopoulos, I.: A Survey on Biomedical Image Captioning. In: Workshop on Shortcomings in Vision and Language of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 26–36. Minneapolis, MN, USA (2019)
8. Kougia, V., Pavlopoulos, J., Androutsopoulos, I.: AUEB NLP Group at ImageCLEFmed Caption 2019. In: CLEF2019 Working Notes. CEUR Workshop Proceedings. pp. 9–12. Lugano, Switzerland (2019)
9. Kougia, V., Pavlopoulos, J., Androutsopoulos, I.: Medical Image Tagging by Deep Learning and Retrieval. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020). Thessaloniki, Greece (2020)
10. Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In: MICCAI Workshop on Large-scale Annotation of Biomedical Data and Expert Label Synthesis. pp. 180–189. Granada, Spain (2018)
11. Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Müller, H.: Overview of the ImageCLEFmed 2020 concept prediction task: Medical image understanding. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25, 2020)
12. Raffel, C., Ellis, D.P.W.: Feed-forward networks with attention can solve some long-term memory problems (2015)
13. Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., et al.: CheXNet: Radiologist-Level Pneumonia Detection on Chest X-rays with Deep Learning. arXiv:1711.05225 (2017)
14. Rimmer, A.: Radiologist shortage leaves patient care at risk, warns royal college. British Medical Journal 359 (2017)
15. Singh, S., Karimi, S., Ho-Shon, K., Hamey, L.: Biomedical Concept Detection in Medical Images: MQ-CSIRO at 2019 ImageCLEFmed Caption Task. In: CLEF2019 Working Notes. CEUR Workshop Proceedings. p. 15. Lugano, Switzerland (2019)
16. Zhang, Y., Wang, X., Guo, Z., Li, J.: ImageSem at ImageCLEF 2018 Caption Task: Image Retrieval and Transfer Learning. In: CLEF2018 Working Notes. CEUR Workshop Proceedings. Avignon, France (2018)