=Paper=
{{Paper
|id=Vol-2696/paper_249
|storemode=property
|title=Overview of the ImageCLEFmed 2020 Concept Prediction Task: Medical Image Understanding
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_249.pdf
|volume=Vol-2696
|authors=Obioma Pelka,Christoph M. Friedrich,Alba García Seco De Herrera,Henning Müller
|dblpUrl=https://dblp.org/rec/conf/clef/PelkaFHM20
}}
==Overview of the ImageCLEFmed 2020 Concept Prediction Task: Medical Image Understanding==
Obioma Pelka1,2[0000-0001-5156-4429], Christoph M. Friedrich1,3[0000-0001-7906-0038], Alba G. Seco de Herrera4[0000-0002-6509-5325], and Henning Müller5,6[0000-0001-6800-9878]

1 Department of Computer Science, University of Applied Sciences and Arts Dortmund, Germany {obioma.pelka,christoph.friedrich}@fh-dortmund.de
2 Department of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Germany
3 Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany
4 University of Essex, UK alba.garcia@essex.ac.uk
5 University of Applied Sciences Western Switzerland (HES-SO), Switzerland henning.mueller@hevs.ch
6 University of Geneva, Switzerland

Abstract. This paper describes the ImageCLEFmed 2020 Concept Detection Task. First proposed at ImageCLEF 2017, the task is in its 4th edition this year, as the automatic detection of concepts in medical images remains challenging. In 2020, the format remained the same as in 2019, with a single sub-task. The concept detection task is part of the medical tasks, alongside the tuberculosis and visual question answering tasks. As in the 2019 edition, the data set focuses on radiology images rather than biomedical images in general, but with an increased number of images. The distributed images were extracted from the biomedical open access literature (PubMed Central). The development data consists of 65,753 training and 15,970 validation images. Each image has corresponding Unified Medical Language System (UMLS®) concepts that were extracted from the original article image captions. In this edition, additional image acquisition technique labels were included in the distributed data and were adopted for pre-filtering steps, concept selection and ensemble algorithms. Most approaches applied for the automatic detection of concepts were deep learning based architectures. Long short-term memory (LSTM) recurrent neural networks (RNNs), adversarial auto-encoders, convolutional neural network (CNN) image encoders and transfer learning-based multi-label classification models were adopted. The performance of the submitted models (best score 0.3940) was evaluated using F1-scores computed per image and averaged across all 3,534 test images.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Keywords: Concept Detection · Computer Vision · ImageCLEF 2020 · Image Understanding · Image Modality · Radiology

1 Introduction

In this paper, the approaches for the detection of Unified Medical Language System (UMLS®) concepts present in radiology images are presented. The task is part of the ImageCLEF1 benchmarking campaign, which is part of the Cross Language Evaluation Forum2 (CLEF). Since 2003, the ImageCLEF benchmarking campaign has proposed several image understanding tasks from different domains every year [4, 15, 11]. Detailed information on the other tasks proposed at ImageCLEF 2020 can be found in Ionescu et al. [9].

This year's concept detection task is the fourth edition. At ImageCLEFmed Caption 2017 [3] and ImageCLEFmed Caption 2018 [7], the task comprised two sub-tasks: concept detection and caption prediction.
The format changed in ImageCLEFmed Caption 2019 [16] to the single task of concept detection and remained that way this year at ImageCLEFmed Caption 2020. New in this edition is that the imaging modality is given for each image, in both the development and evaluation sets.

As an increasing number of medical images is available without metadata, for example in the scientific literature, there is an essential need for systems that can automatically generate such information and hence make the content of these data sets more useful. The purpose of the ImageCLEFmed 2020 concept detection task was to create a platform for the evaluation of systems capable of automatically generating UMLS® concepts for a given radiology image. The predicted information is applicable to data sets that are neither labeled nor structured, and also to medical data sets lacking textual metadata, as multi-modal approaches have been shown to obtain better results in several image classification tasks [18, 19]. The manual interpretation and generation of knowledge from medical images is not only time-consuming and prone to error, but also impractical at scale. Modeling systems that can automatically map the visual content of images to concise textual representations is therefore a necessity for efficient information retrieval and image classification.

As development data, both the development and test sets from ImageCLEFmed Caption 2019 [16] were distributed. This data set is a subset of the Radiology Objects in COntext (ROCO) data set [17] and contains solely radiology images that originate from the PubMed Central (PMC) Open Access Subset3 [20]. Several UMLS® Concept Unique Identifiers (CUIs) are assigned to each image. The test set used for the official evaluation was created in the same manner as proposed in Pelka et al. [17], for generalization purposes.

1 http://imageclef.org/ [last accessed: 28.07.2020]
2 http://www.clef-initiative.eu/ [last accessed: 28.07.2020]
3 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ [last accessed: 28.07.2020]

This paper presents an overview of the ImageCLEFmed 2020 Concept Detection Task. Section 2 contains the task description and lists the participating teams. An explorative analysis of the distributed development and test data sets is described in Section 3. The framework used to evaluate the submitted runs is explained in Section 4. Section 5 presents the modeling approaches applied by the participating teams and the obtained scores, followed by discussion and conclusions in Section 6.

2 Task and Participation

Similar to the ImageCLEF caption task in 2019 [16], ImageCLEF Caption 2020 focuses on the automatic detection of concepts in a large corpus of radiology images. The proposed task aims to interpret and summarise insights gained from medical images and thereby provide tools for radiology image understanding. The images distributed in both the development and evaluation data sets originate from biomedical articles extracted from the PubMed Central (PMC) Open Access Subset [20]. UMLS® CUIs are assigned to each radiology image in the distributed data sets. These concepts are generated from the original image captions found in the articles. Figure 1 displays an example of an image in the distributed data sets. In comparison to the previous tasks, the following improvements were made:

– The imaging modality was included.
– The focus remained on radiology images, as in ImageCLEF 2019.
– The number of concepts was decreased by preprocessing the captions prior to concept extraction.

Fig. 1. Example of a radiology image with the corresponding extracted UMLS® CUIs.

The automatic detection of concepts present in images is a fundamental step towards scene understanding and hence image captioning, as applicable biomedical concepts can be detected and located. As the use of multi-modal representations (visual and textual) helps to achieve good performance in image classification tasks [19], the automatically generated concepts can be adopted for this purpose. In addition, the concepts can be used for context-based image analysis as well as for information retrieval. The detected concepts are evaluated image-wise with precision and recall scores against the ground truth, as described in Section 4.

Table 1. Participating groups of the ImageCLEF 2020 Concept Detection Task. Teams with previous participation in 2019 are marked with an asterisk.

Team | Institution | Runs
AUEB NLP Group* [12] | Department of Informatics, Athens University of Economics and Business, Athens, Greece | 3
PwC Healthcare [24] | PricewaterhouseCoopers US Advisory, Mumbai, India | 9
Essex [6] | School of Computer Science and Electronic Engineering, University of Essex, Essex, United Kingdom | 9
IML DFKI [10] | Interactive Machine Learning Group, German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany | 5
TUC MC [25] | Technische Universität Chemnitz, Chemnitz, Germany | 10
Morgan CS [14] | Computer Science Department, Morgan State University, Baltimore, Maryland, United States of America | 10
CSE SSN [2] | Department of Computer Science and Engineering, SSN College of Engineering, Chennai, India | 1

In the ImageCLEF 2020 concept detection task, a total of 23 unique teams registered on AICrowd and downloaded the End-User-Agreement. This license is needed to obtain access to both the development and evaluation data. 57 graded runs were submitted for evaluation by 7 teams from the following countries: Germany, United Kingdom, India, Greece and the United States of America, as listed in Table 1. Each group was allowed 10 graded runs and 5 faulty runs altogether. 10 of the submitted runs were faulty and were not used for the official evaluation.

3 Data Set

As in previous editions, the data set distributed for the task originates from biomedical articles of the PMC Open Access Subset [20]. The development data set contains training and validation sets with 65,753 and 15,970 images, respectively. These images are subsets of the multi-modal image data set Radiology Objects in COntext (ROCO), which is presented in Pelka et al. [17]. ROCO has two classes: Radiology and Out-Of-Class. The first contains 81,825 radiology images and was adopted for the proposed task. It includes several medical imaging modalities such as Computed Tomography (CT), Ultrasound, X-Ray, Fluoroscopy, Positron Emission Tomography (PET), Mammography, Magnetic Resonance Imaging (MRI), Angiography and PET-CT.

The development data of the 2020 task includes the ImageCLEF caption 2019 development data set (archiving date: until 31.01.2018) and the official evaluation set (archiving date: 01.02.2018 - 01.02.2019). To avoid an overlap with images distributed in previous ImageCLEF medical tasks, the test set for ImageCLEF 2020 was created from a subset of PMC Open Access (archiving date: 01.02.2019 - 01.02.2020). The same procedures used for the creation of the ROCO data set were applied for the test set as well.
An analysis of the distributed data can be seen in Table 2.

Table 2. Analysis of the data distribution for the ImageCLEFmed 2020 Concept Detection Task.

Imaging Technique | Train | Validation | Test | Sum
DRAN: Angiography | 4,713 | 1,132 | 325 | 6,170
DRCO: Combined modalities in one image | 487 | 73 | 49 | 609
DRCT: Computerized Tomography | 20,031 | 4,992 | 1,140 | 26,163
DRMR: Magnetic Resonance | 11,447 | 2,848 | 562 | 14,857
DRPE: Positron emission tomography | 502 | 74 | 38 | 614
DRUS: Ultrasound | 8,629 | 2,134 | 502 | 11,265
DRXR: X-Ray, 2D radiography | 18,944 | 4,717 | 918 | 24,579
Sum | 65,753 | 15,970 | 3,534 | 84,257

From the PMC Open Access Subset [20], a total of 6,031,814 image-caption pairs were extracted in January 2018. Compound figures, which are images with more than one subfigure, were removed using deep learning as proposed in Koitka et al. [13]. The non-compound images were further split into radiology and non-radiology images, as the focus was on radiology. Semantic knowledge of the object interplay present in the images was extracted in the form of UMLS® concepts using the QuickUMLS library [23] (a minimal extraction sketch is given after Table 3). The image captions from the biomedical articles served as the basis for the extraction of the concepts. The applied text pre-processing steps are described in Pelka et al. [17]. Using deep learning systems as proposed in Koitka et al. [13], the radiology images were further split into seven imaging modality classes. This information can be used for filtering steps prior to model training, as well as for model fine-tuning. An additional UMLS® CUI denoting the imaging technique was added to each image. Figure 2 shows example images from the development data set according to image modality, with the additional UMLS® CUI.

Similarly to the caption task in 2019 [16], concepts with very high frequency (>13,000) as well as redundant synonyms were removed. This led to a reduction of the concept vocabulary in comparison to the previous year, from 5,528 concepts in 2019 [16] to 3,047 in 2020. Not all concepts in the ground truth are visually recognizable; for example, the concept 'Hole Finding' in Fig. 2 cannot be detected from the image. Images in the training, validation and test sets have 1-140, 1-142 and 1-95 concepts per image, respectively. All concepts in the validation and test sets also exist in the training set.

Fig. 2. Examples of radiology images distributed at the ImageCLEF 2020 concept detection task, showing the seven imaging modalities. All images were randomly selected from the development data set.

Table 3. An excerpt of the Unified Medical Language System® (UMLS®) Concept Unique Identifiers (CUIs) distributed for the task, with their respective occurrences. The concepts were randomly chosen and are listed in descending order of occurrence. All listed concepts were distributed in the training set.

CUI | Concept | Occurrence
C0040398 | Tomography | 20,031
C0040405 | X-Ray Computed Tomography | 20,031
C0043299 | Diagnostic radiologic examination | 18,944
C0024485 | Magnetic Resonance Imaging | 11,447
C0041618 | Ultrasound | 8,629
C0441633 | Scanning | 6,733
C0043299 | Diagnostic radiologic examination | 6,321
C1962945 | Radiographic imaging procedure | 6,318
C0040395 | Tomography | 6,235
C0034579 | Panoramic Radiography | 6,127
C0817096 | Chest | 5,981
C0040405 | X-Ray Computed Tomography | 5,801
C1548003 | Diagnostic Service Section ID - Radiograph | 5,159
... | ... | ...
C0000726 | Abdomen | 2,297
... | ... | ...
C2985765 | Enhancement Description | 1,084
... | ... | ...
C0228391 | Structure of habenulopeduncular tract | 672
C0729233 | Dissecting aneurysm of the thoracic aorta | 652
... | ... | ...
C0771711 | Pancreas extract | 456
... | ... | ...
C1704302 | Permanent premolar tooth | 177
... | ... | ...
C0042070 | Urography | 67
C0085632 | Apathy | 67
C0267716 | Incisional hernia | 67
... | ... | ...
C0081923 | Cardiocrome | 1
C0193959 | Tonsillectomy and adenoidectomy | 1
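To make the extraction step concrete, the following is a minimal sketch of caption-to-CUI mapping with QuickUMLS [23]. The index path is a placeholder (building the index requires a UMLS licence), the matcher settings are the library defaults rather than the exact ROCO configuration [17], and the caption is an invented example.

```python
from quickumls import QuickUMLS

# Placeholder path to a locally built QuickUMLS index; building it
# requires a UMLS licence. The exact ROCO settings follow Pelka et al. [17].
matcher = QuickUMLS('/data/quickumls')

def caption_to_cuis(caption):
    """Map a figure caption to a set of UMLS CUIs, keeping the
    best-scoring candidate for each matched text span."""
    cuis = set()
    for span_candidates in matcher.match(caption, best_match=True):
        best = max(span_candidates, key=lambda c: c['similarity'])
        cuis.add(best['cui'])
    return cuis

print(caption_to_cuis('Axial CT of the abdomen showing an incisional hernia.'))
```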
4 Evaluation Methodology

For all 3,534 radiology images distributed in the test set, UMLS® CUIs had to be predicted automatically by the participating teams. As in the previous years [3, 7, 16], model performance was measured using the balanced precision and recall trade-off in terms of the F1-score. The default implementation of the Python scikit-learn (v0.17.1-2) library was applied to compute the F1-scores per image and average them across all test images. The maximum number of concepts allowed per image was set to 150. This limit was chosen because the training, validation and test sets contain a maximum of 140, 142 and 95 concepts per image, respectively. Each group could submit a maximum of 15 runs, with 10 valid and 5 faulty. Faulty submissions may include:

– the same image ID more than once
– a wrong image ID
– too many concepts
– the same concept more than once
– not all test images included

All submission runs were uploaded by the participating teams and evaluated with AICrowd4. The source code of the evaluation tool is available on the ImageCLEF web page5.

4 https://www.aicrowd.com/challenges/imageclef-2020-caption-concept-detection [last accessed: 26.07.2020]
5 https://www.imageclef.org/system/files/ImageCLEF-ConceptDetection-Evaluation.zip [last accessed: 26.07.2020]
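The metric described above can be re-implemented in a few lines. The sketch below is not the official tool (which is linked above) but a minimal equivalent, assuming a recent scikit-learn version and that `ground_truth` and `predictions` are hypothetical dictionaries mapping image IDs to sets of CUI strings:

```python
from sklearn.metrics import f1_score

def image_f1(gt_cuis, pred_cuis):
    """F1-score for one image: binarize both CUI sets over their union."""
    vocab = sorted(gt_cuis | pred_cuis)
    y_true = [int(c in gt_cuis) for c in vocab]
    y_pred = [int(c in pred_cuis) for c in vocab]
    return f1_score(y_true, y_pred, average='binary', zero_division=0)

def mean_f1(ground_truth, predictions):
    """Average the per-image F1-scores across all test images; images
    missing from a run are scored against an empty prediction set."""
    scores = [image_f1(ground_truth[img], predictions.get(img, set()))
              for img in ground_truth]
    return sum(scores) / len(scores)
```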
5 Results

The overall performances achieved by the concept detection models submitted by the 7 participating teams are listed and discussed in this section. Table 4 shows the best-performing submission run per team. An additional evaluation regarding the imaging modality was done internally, after the official concept detection evaluation process; the accuracy (%) across all images in the test set was computed and is listed in Table 6. Compared to the previous editions, there is an improvement in the F1-scores of the submitted concept detection models, from 0.1583 in ImageCLEF 2017 [3], 0.1108 in ImageCLEF 2018 [7] and 0.2823 in ImageCLEF 2019 [16] to 0.3940 in 2020.

The AUEB NLP Group [12] from the Athens University of Economics achieved the overall highest F1-score of 0.3940 for the detection of concepts in the images of the official evaluation test set. Their three submission runs ranked 1st, 2nd and 6th of all 47 graded runs. The submitted systems are a variation of CheXNet [26] with DenseNet-121 [8], followed by a feed-forward neural network (FFNN) that acts as the classification layer on top [12]. The system was first pre-trained on the ImageNet data set [21] and then fine-tuned on the ImageCLEF 2020 concept detection development data set. Several ensemble methods, such as the intersection and union of predicted concepts, were explored. The system using the intersection of concepts achieved the overall highest F1-score.

The overall 2nd ranked participating team is the PwC Healthcare group from PricewaterhouseCoopers, with a total of nine submitted runs. The adopted approaches range from convolutional neural network (CNN) architectures to natural language processing techniques and clustering algorithms [24]. The group's three best systems ranked 3rd, 4th and 5th. Several pre-processing approaches, such as range and intensity normalization and data augmentation, were adopted prior to training the models [24]. Multi-modal approaches were explored to address the imbalanced concept distribution, and a novel band classification approach was applied. This classification method first clusters the vocabulary of concepts into bands and then creates a classification architecture for each band [24].

Table 4. Performance of the participating teams in the ImageCLEF 2020 concept detection task with regard to correctly predicting the concepts of the images in the test set. The best run per team is selected. Teams with previous participation in 2019 are marked with an asterisk.

Team | Institution | F1-Score
AUEB NLP Group* [12] | Department of Informatics, Athens University of Economics and Business, Athens, Greece | 0.3940
PwC Healthcare [24] | PricewaterhouseCoopers US Advisory, Mumbai, India | 0.3924
Essex [6] | School of Computer Science and Electronic Engineering, University of Essex, Essex, United Kingdom | 0.3808
IML DFKI [10] | Interactive Machine Learning Group, German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany | 0.3745
TUC MC [25] | Technische Universität Chemnitz, Chemnitz, Germany | 0.3512
Morgan CS [14] | Computer Science Department, Morgan State University, Baltimore, Maryland, United States of America | 0.1673
CSE SSN [2] | Department of Computer Science and Engineering, SSN College of Engineering, Chennai, India | 0.1347

The third best participating team was from the University of Essex, with an overall F1-score of 0.3808. The proposed approach adopts pre-trained DenseNet models [8] for the extraction of relevant features. The additional information on the imaging modality was used for fine-tuning by adding a fully connected layer to the DenseNet-121 model, thereby transforming it into a multi-label classification model [6]. Several concept selection strategies, such as distance-based and rank-based methods, were applied to a given query image from the test set. The group's five best runs of the nine submitted ranked 6th to 10th among all submissions.

Five runs were submitted by the IML group from the German Research Center for Artificial Intelligence, which achieved a best F1-score of 0.3745 and ranked as the 4th best team. Multiple deep learning systems, such as VGG16 [22], ResNet50 [5] and DenseNet169 [8], pre-trained on the ImageNet data set, were applied for modeling the concept detection systems. The task was addressed as a multi-hot encoding problem with a final prediction layer of 3,047 sigmoid activation units, and several fine-tuning steps, such as data augmentation and hyper-parameter tuning, were undertaken [10].
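Several of the systems above share the same multi-label pattern: an ImageNet-pretrained CNN backbone topped by one sigmoid unit per CUI, trained with binary cross-entropy against multi-hot target vectors. The following tf.keras sketch illustrates that pattern; the backbone choice, input size and learning rate are illustrative assumptions rather than any team's exact configuration:

```python
import tensorflow as tf

NUM_CONCEPTS = 3047  # size of the 2020 concept vocabulary

# ImageNet-pretrained backbone with global average pooling; the teams
# used VGG16, ResNet50, DenseNet-121/169 or Xception in this role.
backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights='imagenet', pooling='avg',
    input_shape=(224, 224, 3))

model = tf.keras.Sequential([
    backbone,
    # one independent sigmoid unit per CUI -> multi-label predictions
    tf.keras.layers.Dense(NUM_CONCEPTS, activation='sigmoid'),
])

# binary cross-entropy treats every concept as its own binary problem;
# training targets are multi-hot vectors over the 3,047 CUIs
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='binary_crossentropy')
```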
Table 5. Concept detection performance of all graded runs submitted for the ImageCLEF 2020 Concept Detection Task.

Group Name | Submission Run | F1-Score
AUEB NLP Group | InterceptCheXNetCheckpoints.csv | 0.3940
AUEB NLP Group | BestOf.csv | 0.3933
PwC Healthcare | folderwise KNN resnet101 test pred.csv | 0.3924
PwC Healthcare | combined test pred v1.csv | 0.3889
PwC Healthcare | folder wise test pred v1.csv | 0.3889
AUEB NLP Group | UnionCheXNetCheckpoints.csv | 0.3870
Essex | submit run3.csv | 0.3808
Essex | submit run5.csv | 0.3805
Essex | submit run1.csv | 0.3797
Essex | cp99 all modified.txt | 0.3785
Essex | c99 all man.txt | 0.3777
IML DFKI | imageclefmed2020-test-vgg16-f1-bce-nomissing-iml.txt | 0.3745
IML DFKI | imageclefmed2020-test-vgg16-f1-bce-iml.txt | 0.3744
PwC Healthcare | combined test pred new.csv | 0.3681
PwC Healthcare | NLP clusters test pred.csv | 0.3668
PwC Healthcare | knn t117 test pred.csv | 0.3666
IML DFKI | imageclefmed2020-test-resnet50-iml.txt | 0.3652
IML DFKI | imageclefmed2020-test-vgg16-iml.txt | 0.3631
IML DFKI | imageclefmed2020-test-densenet169-iml.txt | 0.3602
TUC MC | model thr0 18.csv | 0.3512
TUC MC | streamlined1 thr0 25.csv | 0.3486
TUC MC | streamlined1 thr0 20.csv | 0.3486
TUC MC | 2streamlined1.csv | 0.3486
TUC MC | basemodel thr0 20.csv | 0.3474
TUC MC | model low lr thr0 20.csv | 0.3455
Essex | submit run2.csv | 0.3449
TUC MC | streamlined1 nomax.csv | 0.3448
TUC MC | basemodel.csv | 0.3435
TUC MC | streamlined1 thr0 12.csv | 0.3423
PwC Healthcare | f1 band test t025 pred.csv | 0.3379
Essex | cp98 all.txt | 0.3370
TUC MC | model weighting.csv | 0.3325
PwC Healthcare | NLP test pred fixed.csv | 0.3163
Essex | canberra all modified.txt | 0.2804
PwC Healthcare | combined wo folder test.csv | 0.2655
Essex | cp95 all.txt | 0.2459
Morgan CS | MSU dense fcn.txt | 0.1673
Morgan CS | MSU dense fcn 4.txt | 0.1591
Morgan CS | MSU dense resnet fcn 1.txt | 0.1534
Morgan CS | MSU dense resnet fcn 1.txt | 0.1447
Morgan CS | MSU dense feat.txt | 0.1395
CSE SSN | captions output.txt | 0.1347
Morgan CS | MSU dense feat.txt | 0.1284
Morgan CS | MSU dense fcn 2.txt | 0.0943
Morgan CS | MSU dense fcn 3.txt | 0.0894
Morgan CS | MSU autoenc fcn.txt | 0.0634
Morgan CS | MSU lstm dense fcn.txt | 0.0625

TUC MC, a media computing group from the Chemnitz University of Technology, ranked as the 5th best participating team. The highest F1-score among its ten submitted runs was 0.3512. The adopted deep learning model was based on the Xception architecture [1] with weights pre-trained on ImageNet. The submitted runs use the same base model structure, but vary the hyper-parameters with regard to the last-layer threshold and max-pooling in the highest layers [25].

Ten runs were submitted by Morgan CS, a group from the computer science department at Morgan State University. The best achieved F1-score was 0.1673, obtained by approaching the concept detection task as a multi-label classification problem [14]. Classifiers were trained on deep features extracted with DenseNet169 and ResNet50 models pre-trained on ImageNet. Other explored methods include a recurrent concept sequence generator, modelled with a multimodal technique that fuses text and image features for recurrent sequence prediction.

CSE SSN, from the department of computer science of the SSN College of Engineering, Chennai, submitted one run for official evaluation and achieved an average F1-score of 0.1347 across all images in the test set. Similar to several participating teams, the concept detection task was addressed as a convolutional neural network multi-label classification problem [2]. The distributed imaging modality information was applied for pre-processing and model fine-tuning steps.
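Turning the sigmoid outputs of such models into a submission requires a decoding step; the TUC MC run names, for instance, suggest last-layer thresholds between 0.12 and 0.25. The helper below is an illustrative sketch, with a hypothetical `cui_vocab` list mapping output indices to CUIs:

```python
import numpy as np

def decode_concepts(probs, cui_vocab, threshold=0.20, max_concepts=150):
    """Turn one image's sigmoid output vector into a list of CUIs.
    `threshold` mirrors the last-layer cut-offs varied by TUC MC;
    150 is the task's per-image submission limit."""
    keep = np.where(probs >= threshold)[0]
    # rank the retained concepts by confidence and respect the limit
    keep = keep[np.argsort(probs[keep])[::-1]][:max_concepts]
    return [cui_vocab[i] for i in keep]
```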
An ex-post evaluation was computed on all submitted runs. The aim was to measure the performance on correctly predicting the imaging modality. All images in the development and test sets were assigned concepts that denote the acquisition technique, as shown in Figure 2. The images belonging to the imaging modality 'DRCO: Combined modalities in one image' were not considered for this evaluation. For all images in the test set, the presence of these modality-denoting concepts in the submission runs was computed (a sketch of this check is given after Table 7). The best performance per team is listed in Table 6 and the complete evaluation in Table 7.

Table 6. Performance of the participating teams in the ImageCLEF 2020 concept detection task on correctly predicting the imaging modality of the images in the test set. The best run per team is selected. Teams with previous participation in 2019 are marked with an asterisk.

Team | Institution | Accuracy (%)
PwC Healthcare [24] | PricewaterhouseCoopers US Advisory, Mumbai, India | 62.08
AUEB NLP Group* [12] | Department of Informatics, Athens University of Economics and Business, Athens, Greece | 59.73
Essex [6] | School of Computer Science and Electronic Engineering, University of Essex, Essex, United Kingdom | 56.34
TUC MC [25] | Technische Universität Chemnitz, Chemnitz, Germany | 50.08
IML DFKI [10] | Interactive Machine Learning Group, German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany | 47.06
Morgan CS [14] | Computer Science Department, Morgan State University, Baltimore, Maryland, United States of America | 2.06
CSE SSN [2] | Department of Computer Science and Engineering, SSN College of Engineering, Chennai, India | 1.39

Table 7. Modality classification performance of all graded runs submitted for the ImageCLEF 2020 Concept Detection Task.

Group Name | Submission Run | Acc (%)
PwC Healthcare | NLP clusters test pred.csv | 62.08
AUEB NLP Group | InterceptCheXNetCheckpoints.csv | 59.73
AUEB NLP Group | BestOf.csv | 59.48
essexgp2020 | cp99 all modified.txt | 56.34
essexgp2020 | c99 all man.txt | 55.69
AUEB NLP Group | UnionCheXNetCheckpoints.csv | 55.23
PwC Healthcare | folderwise KNN resnet101 test pred.csv | 54.70
PwC Healthcare | folder wise test pred v1.csv | 52.43
PwC Healthcare | combined test pred v1.csv | 52.43
essexgp2020 | submit run3.csv | 50.93
TUC MC | streamlined1 thr0 25.csv | 50.08
essexgp2020 | submit run1.csv | 49.29
essexgp2020 | submit run5.csv | 48.84
TUC MC | model low lr thr0 20.csv | 48.22
iml | imageclefmed2020-test-densenet169-iml.txt | 47.06
iml | imageclefmed2020-test-vgg16-f1-bce-iml.txt | 46.94
iml | imageclefmed2020-test-vgg16-f1-bce-nomissing-iml.txt | 46.94
iml | imageclefmed2020-test-resnet50-iml.txt | 46.83
iml | imageclefmed2020-test-vgg16-iml.txt | 45.47
TUC MC | model thr0 18.csv | 44.88
TUC MC | basemodel thr0 20.csv | 44.74
PwC Healthcare | combined test pred new.csv | 42.05
PwC Healthcare | knn t117 test pred.csv | 41.34
TUC MC | streamlined1.csv | 41.23
TUC MC | streamlined1 thr0 20.csv | 41.23
TUC MC | basemodel.csv | 39.30
essexgp2020 | submit run2.csv | 38.88
TUC MC | model weighting.csv | 38.88
TUC MC | streamlined1 nomax.csv | 37.35
TUC MC | streamlined1 thr0 12.csv | 35.94
PwC Healthcare | f1 band test t025 pred.csv | 34.27
essexgp2020 | cp98 all.txt | 19.78
PwC Healthcare | combined wo folder test.csv | 14.60
essexgp2020 | canberra all modified.txt | 11.83
PwC Healthcare | NLP test pred fixed.csv | 10.67
essexgp2020 | cp95 all.txt | 2.86
Morgan CS | MSU dense fcn.txt | 2.07
Morgan CS | MSU dense fcn 4.txt | 1.75
Morgan CS | MSU dense resnet fcn 1.txt | 1.75
Morgan CS | MSU dense feat.txt | 1.75
Morgan CS | MSU autoenc fcn.txt | 1.58
Morgan CS | MSU dense resnet fcn 1.txt | 1.50
Morgan CS | MSU lstm dense fcn.txt | 1.44
Morgan CS | MSU dense fcn 2.txt | 1.41
saradadevi | captions output.txt | 1.39
Morgan CS | MSU dense feat.txt | 1.39
Morgan CS | MSU dense fcn 3.txt | 1.39
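Read this way, the ex-post check asks, per test image, whether a run's predicted CUIs contain the CUI denoting the image's true acquisition technique. A minimal sketch, with hypothetical `modality_cui` (image ID to modality CUI, DRCO images already excluded) and `predictions` (image ID to predicted CUI set) dictionaries:

```python
def modality_accuracy(modality_cui, predictions):
    """Percentage of test images whose predicted CUIs contain the CUI
    of the image's true acquisition technique."""
    hits = sum(cui in predictions.get(img, set())
               for img, cui in modality_cui.items())
    return 100.0 * hits / len(modality_cui)
```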
6 Conclusion

This paper presents an overview of the applied approaches and their performance, as well as the task description, participation and distributed data set for the ImageCLEF 2020 concept detection task. As in the 2019 edition, this year's results show an improvement in the achieved F1-scores (best score 0.3940). In this edition, the data set not only contains an increased number of images; the number of concepts was also reduced to be more precise, and additional modality information was distributed. In the previous editions, the overall best F1-scores were 0.2823 in ImageCLEFmed Caption 2019, 0.1108 in ImageCLEFmed Caption 2018 and 0.1583 in ImageCLEFmed Caption 2017.

Almost all participating groups were new to the task, with only one team having participated in ImageCLEF caption 2019. The seven participating teams are affiliated with institutions from 5 countries, which shows continuing research interest in this challenging task.

Most of the submitted runs are based on deep learning architectures. Models pre-trained on ImageNet, such as DenseNet-121, ResNet50 and VGG16, as well as CheXNet, were used to extract relevant visual representations of the images. Multiple pre-processing steps, such as concept filtering, data augmentation and image enhancement, were applied to optimize the input for the predicting systems. Long short-term memory (LSTM) recurrent neural networks (RNNs), adversarial auto-encoders, CNN image encoders and transfer learning-based multi-label classification models were the most frequently used approaches.

As the focus of the caption task in 2019 was reduced from biomedical images to solely radiology images, a reduction of the extracted concepts from 111,155 to 5,528 was observed. This year, an additional label denoting the imaging modality of the images was added. This extra information was used by several teams for pre-filtering steps prior to training the models, for concept selection and for ensemble algorithms.

The class imbalance in the distributed data set proved to be challenging for several teams. However, medical data and diseases are also usually unbalanced, with a few conditions occurring very frequently and most being very rare. In future work, an extensive review of the clinical relevance of the concepts in the development data should be explored. As the concepts originate from natural language captions, not all of them have high clinical utility. Medical journals also have very different policies in terms of checking figure captions. We believe this will assist in creating more efficient systems for automated medical data analysis.

References

1. Chollet, F.: Xception: Deep Learning with Depthwise Separable Convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, USA, July 22-25, 2017. pp. 1800-1807 (2017). https://doi.org/10.1109/CVPR.2017.195
2. Devi, S., S, K.: ImageCLEF 2020: Image Caption Prediction using Multilabel Convolutional Neural Network. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
3. Eickhoff, C., Schwall, I., de Herrera, A.G.S., Müller, H.: Overview of ImageCLEFcaption 2017 - Image Caption Prediction and Concept Detection for Biomedical Images. In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017 (2017), http://ceur-ws.org/Vol-1866/invited_paper_7.pdf
4. Ferro, N., Peters, C.: Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF. Springer (2019)
5. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, USA, June 26 - July 1, 2016. pp. 770-778 (2016). https://doi.org/10.1109/CVPR.2016.90
6. de Herrera, A.G.S., Andrade, F.P., Bentley, L., Compean, A.A.: Essex at ImageCLEFcaption 2020 task. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
7. de Herrera, A.G.S., Eickhoff, C., Andrearczyk, V., Müller, H.: Overview of the ImageCLEF 2018 Caption Prediction Tasks. In: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018 (2018), http://ceur-ws.org/Vol-2125/invited_paper_4.pdf
8. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, USA, July 22-25, 2017. pp. 2261-2269 (2017). https://doi.org/10.1109/CVPR.2017.243
9. Ionescu, B., Müller, H., Péteri, R., Abacha, A.B., Datla, V., Hasan, S.A., Demner-Fushman, D., Kozlovski, S., Liauchuk, V., Cid, Y.D., Kovalev, V., Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Ninh, V.T., Le, T.K., Zhou, L., Piras, L., Riegler, M., Halvorsen, P., Tran, M.T., Lux, M., Gurrin, C., Dang-Nguyen, D.T., Chamberlain, J., Clark, A., Campello, A., Fichou, D., Berari, R., Brie, P., Dogariu, M., Ştefan, L.D., Constantin, M.G.: Overview of the ImageCLEF 2020: Multimedia Retrieval in Medical, Lifelogging, Nature, and Internet Applications. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), LNCS Lecture Notes in Computer Science, vol. 12260. Springer, Thessaloniki, Greece (September 22-25 2020)
10. Kalimuthu, M., Nunnari, F., Sonntag, D.: A Competitive Deep Neural Network Approach for the ImageCLEFmed Caption 2020 Task. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
11. Kalpathy-Cramer, J., de Herrera, A.G.S., Demner-Fushman, D., Antani, S.K., Bedrick, S., Müller, H.: Evaluating performance of biomedical image retrieval systems - An overview of the medical image retrieval task at ImageCLEF 2004-2013. Computerized Medical Imaging and Graphics 39, 55-61 (2015). https://doi.org/10.1016/j.compmedimag.2014.03.004
12. Karatzas, B., Pavlopoulos, J., Kougia, V., Androutsopoulos, I.: AUEB NLP Group at ImageCLEFmed Caption 2020. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
13. Koitka, S., Friedrich, C.M.: Optimized Convolutional Neural Network Ensembles for Medical Subfigure Classification. In: Jones, G.J., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction at the 8th International Conference of the CLEF Association, Dublin, Ireland, September 11-14, 2017, Lecture Notes in Computer Science (LNCS) 10456. pp. 57-68. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-65813-1_5
14. Lyode, O., Rahman, M.: Concept Detection in Biomedical Images with Deep Learning Based Multilabel Classification. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
15. Müller, H., Clough, P.D., Deselaers, T., Caputo, B. (eds.): ImageCLEF, Experimental Evaluation in Visual Information Retrieval. Springer (2010). https://doi.org/10.1007/978-3-642-15181-1
16. Pelka, O., Friedrich, C.M., de Herrera, A.G.S., Müller, H.: Overview of the ImageCLEFmed 2019 Concept Detection Task. In: Cappellato, L., Ferro, N., Losada, D.E., Müller, H. (eds.) Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019. CEUR Workshop Proceedings, vol. 2380. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2380/paper_245.pdf
17. Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In: Intravascular Imaging and Computer Assisted Stenting - and - Large-Scale Annotation of Biomedical Data and Expert Label Synthesis - 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings. pp. 180-189 (2018). https://doi.org/10.1007/978-3-030-01364-6_20
18. Pelka, O., Nensa, F., Friedrich, C.M.: Adopting Semantic Information of Grayscale Radiographs for Image Classification and Retrieval. In: Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2018) - Volume 2: BIOIMAGING, Funchal, Madeira, Portugal, January 19-21, 2018. pp. 179-187 (2018). https://doi.org/10.5220/0006732301790187
19. Pelka, O., Nensa, F., Friedrich, C.M.: Variations on Branding with Text Occurrence for Optimized Body Parts Classification. In: Proceedings of the 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2019, Berlin, Germany, July 23-27, 2019. pp. 890-894 (2019). https://doi.org/10.1109/EMBC.2019.8857478
20. Roberts, R.J.: PubMed Central: The GenBank of the published literature. Proceedings of the National Academy of Sciences of the United States of America 98(2), 381-382 (Jan 2001). https://doi.org/10.1073/pnas.98.2.381
21. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (2015). https://doi.org/10.1007/s11263-015-0816-y
22. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 (2014)
23. Soldaini, L., Goharian, N.: QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop, SIGIR (2016)
24. Sonker, R., Mishra, A., Bansal, P., Pattnaik, A.: Techniques for Medical Concept Detection from Multi-Modal Images. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
25. Udas, N., Beuth, F., Kowerko, D.: TUC MC group at ImageCLEFmed 2020 concept detection task using Xception models. In: CLEF2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece (September 22-25 2020)
26. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, USA, July 22-25, 2017. pp. 3462-3471 (2017)