<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Concept Detection in Medical Images using Xception Models - TUC MC at ImageCLEFmed 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nisnab Udas</string-name>
          <email>nisnab.udas@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frederik Beuth</string-name>
          <email>beuth@cs.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danny Kowerko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Chemnitz University of Technology</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes the approach and the results of the submission of the Media Computing group of the Chemnitz University of Technology (TUC MC) to the ImageCLEFmed Caption task, launched by ImageCLEF 2020. In this task, the contents of medical images have to be detected, with the goal of supporting the diagnosis of medical diseases and conditions. In the context of a master thesis by Nisnab Udas, the Xception model, which slightly outperformed Inception V3 on the ImageNet dataset in 2017, was adapted to this caption task. Out of the box, this approach achieved an F1 score of 35.1% compared to the best contribution with 39.4%, which places our team in the top 5. Part of the strategy was to optimize the confidence threshold and to introduce a max pooling in the last layer, which reduced the number of parameters and made the model less prone to overfitting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Computer science challenges have been established over the last decades to advance
diverse problems in text, audio and video processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this tradition, challenges have been organized within the established ImageCLEF and LifeCLEF labs since
2003 and 2014, respectively. Since 2003, medical (retrieval) tasks have been part
of the challenge and have been continuously developed into three subtasks, one of which
has been called medical concept detection since 2017 [
        <xref ref-type="bibr" rid="ref10 ref2 ref6">2,13,10,6</xref>
        ]. It comprises automatic
image captioning and scene understanding to identify the presence and location
of relevant concepts in a large corpus of medical images. The images stem from
the PubMed Open Access subset containing 1,828,575 archives, from which a total of
6,031,814 image-caption pairs were extracted. A combination of automatic
filtering with deep learning systems and manual revisions was applied to focus
merely on radiology images and non-compound figures. The biomedical images
distributed in this challenge originate from a subset of the extended ROCO
(Radiology Objects in COntext) dataset [11]. In ImageCLEF 2020, additional
information regarding the modalities of all 80,747 images was distributed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Evaluation is conducted in terms of set coverage metrics such as precision,
recall, and combinations thereof. Leaderboards utilize the F1 metric, summarized
in Table 1. The results show that the task remains challenging, even though
continuous improvement from year to year can be noted. This year's results
bring the top-3 groups closer together for the first time. In the caption
task of 2019 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Kougia et al. won the competition by combining their CNN
(Convolutional Neural Network) image encoders with an image retrieval method
or a feed-forward neural network and achieved an F1 score of 28.2% [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Xu et al.
applied a multi-label classification model based on ResNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and achieved 26.6%
[14]. Guo et al. achieved a 22.4% F1 score with a two-stage concept comprising a
medical image pre-classification based on body parts with AlexNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and a
transfer learning-based multi-label classification model based on Inception V3
and ResNet152 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
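      <p>For reference, the F1 score used for the leaderboard is the harmonic mean of precision and recall (the standard definition, stated here for completeness):
        <disp-formula>
          <tex-math>F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}</tex-math>
        </disp-formula>
      </p>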
    </sec>
    <sec id="sec-2">
      <title>Data Analysis</title>
      <p>The number of images has increased from 2019 to 2020. The concept detection
task this year contains training and validation images in 7 separate folders.
In total, there are 64,753 training images, 14,970 validation images and 3,534
test images. The number of concepts was reduced from 5,528 last year to
3,047 in 2020, as rarely occurring concepts were removed by the organizers. The top-20
concepts in our training images are shown in Fig. 1. The concepts 'C0040405' and
'C0040398' both occur in 25,022 training images. The figure clearly shows
how imbalanced the concepts are in the dataset.</p>
      <p>In Fig. 2, we show the distribution of the number of concepts per image in the training dataset.
The largest group of images, 5,248 to be specific, has only 2 concepts. The second
and third largest groups of images have 3 and 4 concepts per image, respectively.
The highest number of concepts occurring in an image is 140, which occurs only once.</p>
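      <p>A minimal sketch of how these statistics can be reproduced, assuming the training concepts are provided as a tab-separated file with one image ID and a semicolon-separated concept list per line (the file name and separators are assumptions, not necessarily the official distribution format):</p>
      <preformat>
import pandas as pd
from collections import Counter

# Assumed format per line: image_id \t CUI1;CUI2;...
df = pd.read_csv("train_concepts.csv", sep="\t", names=["image_id", "concepts"])
df["concept_list"] = df["concepts"].fillna("").str.split(";")

# Concept frequency (basis for a Fig. 1-style top-20 bar chart)
freq = Counter(c for concepts in df["concept_list"] for c in concepts if c)
print(freq.most_common(20))

# Number of concepts per image (basis for the Fig. 2 histogram)
per_image = df["concept_list"].apply(len)
print(per_image.value_counts().sort_index())
      </preformat>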
      <p>
        Fig. 2. Frequency distribution of the number of concepts per image.
Our deep learning based architecture is based on the Xception architecture [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
is shown in Fig. 3. The Xception model slightly outperformed Inception V3 on
the ImageNet dataset in 2017, and was chosen due to this performance and because
our preliminary tests found it to work well on our medical concept detection task. For
fine-tuning the model, we utilize transfer learning and use weights
pretrained on the ImageNet dataset [12]. We then eliminated the top classifier layer, as
is required in transfer learning. We froze the entire Xception model and made
only the last six layers trainable.
      </p>
      <p>Generally, as in transfer learning, before adding a classifier to a pre-trained
model, the feature maps are flattened. Flattening transforms a 2D matrix of features into
a vector, which can be provided to a fully-connected layer (FC layer). In our
case, we used a max pooling layer with window size (2,2), followed by a dropout
layer, to reduce the number of free parameters and to facilitate object-size dependent
pooling as a special trick. Afterwards, the usual flattening layer is added.</p>
      <p>Subsequent to adding the flattening layer, we used the ReLU activation
function, followed by a dropout layer. The data contains 3,047 concepts in
total, thus our final FC layer contains 3,047 units with sigmoid as the activation
function, because we are dealing with a multi-label problem. The white rings
in the FC layer represent neurons (Fig. 3). The top lambda layer extracts the
top-100 highest probabilities. These probabilities are compared against a threshold
value, e.g. t = 0.12, which generates boolean values for these 100 probabilities. The
lower lambda layer gives the indices of these individual neurons/concepts. In data
processing, these indices are used to locate only the neurons with a 'True' boolean
value. In the data processing part, results are reformatted into the competition
format. Table 2 shows how the feature maps change shape after passing through
each layer.</p>
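      <p>The following sketch illustrates how such a model head can be assembled in Keras. The number of concepts, the top-100 selection, the example threshold t and the freezing of all but the last six layers follow the text; the width of the intermediate FC layer and other details are illustrative assumptions rather than our exact training code:</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CONCEPTS = 3047   # number of concepts in the 2020 data
THRESHOLD = 0.12      # example confidence threshold t
TOP_K = 100           # competition limit of concepts per image

# Xception base pre-trained on ImageNet, top classifier removed
base = tf.keras.applications.Xception(weights="imagenet", include_top=False,
                                      input_shape=(299, 299, 3))
for layer in base.layers[:-6]:   # freeze everything except the last six layers
    layer.trainable = False

x = layers.MaxPooling2D(pool_size=(2, 2))(base.output)   # additional max pooling
x = layers.Dropout(0.2)(x)
x = layers.Flatten()(x)
x = layers.Dense(1024, activation="relu")(x)              # width 1024 is an assumption
x = layers.Dropout(0.5)(x)
probs = layers.Dense(NUM_CONCEPTS, activation="sigmoid")(x)  # multi-label output

# Lambda layers: boolean flags for the top-100 probabilities and their concept indices
top_flags = layers.Lambda(lambda p: tf.math.top_k(p, k=TOP_K).values > THRESHOLD)(probs)
top_idx = layers.Lambda(lambda p: tf.math.top_k(p, k=TOP_K).indices)(probs)

model = models.Model(inputs=base.input, outputs=[top_flags, top_idx])
model.summary()
      </preformat>
      <p>For training, the sigmoid output itself would be optimized with a multi-label loss such as binary cross-entropy; the lambda outputs are only used when producing the submission.</p>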
      <p>For training our network, a proper optimization is necessary, and we conducted
the following optimization methods. The major contributions to a satisfying F1
score came from the optimization of the confidence threshold, along with the
max pooling, as shown in the next sections. Besides these two methods, we deploy
other minor approaches to raise the performance. These include the tuning of
the drop-out level and the data augmentation level. Drop-out is a well-suited
technique to avoid overfitting, and the values were optimized for both drop-out
layers (Table 2) by conducting a cross-test of 25 different combinations. The
best configuration has a drop-out value of 0.2 for the first layer
and 0.5 for the second layer. Additionally, data augmentation was tuned, also
increasing the F1 score by 0.01 to 0.02 depending on the configuration. Each
of these methods raises the F1 score by only 0.01 to 0.02, but in total, these effects
add up to an increase of the F1 score of 0.03 to 0.05.</p>
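      <p>The cross-test of the two drop-out rates is a plain grid search; the sketch below only illustrates the loop (the tested candidate rates are assumptions, and build_model and evaluate_f1 are hypothetical helpers standing in for our training and validation code):</p>
      <preformat>
import itertools

# 5 x 5 = 25 combinations of drop-out rates for the two drop-out layers
rates = [0.1, 0.2, 0.3, 0.4, 0.5]   # assumed candidate values

best = None
for d1, d2 in itertools.product(rates, rates):
    model = build_model(dropout1=d1, dropout2=d2)   # hypothetical model builder
    f1 = evaluate_f1(model, validation_set)         # hypothetical F1 evaluation on validation data
    if best is None or f1 > best[0]:
        best = (f1, d1, d2)

print("best F1 %.3f with drop-out %.1f / %.1f" % best)
      </preformat>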
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        One of the ideas for improving the original Xception model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was the
introduction of an additional max-pooling operation before the highest layer. It is
shown in Table 2 in the second entry. This max-pooling operation
reduces the spatial resolution, inducing a reduction of
the free parameters in the next layer. In our setup, the layer before the
max-pooling operation has a 5 × 5 resolution, which is reduced by a 2 × 2 pooling to a
2 × 2 layer resolution. This operation reduces the free trainable parameters from
160,758,247 to 29,712,871 in total, which yields a
more robust and stable model. As a second argument, the operation allows the
recognition of concepts in the image more independently of their position. In
the original ImageNet dataset, objects are larger on average than in our medical
image dataset. To compensate for this difference in size, we increase the pooling,
as the objects in the original dataset cover large portions of the image, while
our concepts typically appear in a smaller region. The pooling operation
allows both a recognition independent of the concept's location and smaller
object-sensitive filters facilitating the recognition of smaller objects. The difference in
F1 score and performance, i.e. free trainable parameters, is shown in Table 3.
      </p>
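      <p>The effect on the parameter count can be checked directly in Keras by building the classifier head with and without the extra pooling and comparing count_params(). The exact totals reported above also include the trainable base layers and the intermediate FC layer, so the sketch below is illustrative only:</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers

def head_params(use_pooling):
    """Parameter count of a 3,047-unit sigmoid head on a 5x5x2048 feature map."""
    inp = layers.Input(shape=(5, 5, 2048))   # assumed Xception output resolution in our setup
    x = layers.MaxPooling2D((2, 2))(inp) if use_pooling else inp
    x = layers.Flatten()(x)
    out = layers.Dense(3047, activation="sigmoid")(x)
    return tf.keras.Model(inp, out).count_params()

print("with pooling:   ", head_params(True))    # 2*2*2048 inputs per unit, plus biases
print("without pooling:", head_params(False))   # 5*5*2048 inputs per unit, plus biases
      </preformat>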
      <sec id="sec-3-1">
        <title>Confidence threshold optimization</title>
        <p>Fig. 4 shows the threshold variation against accuracy and F1. A classical
accuracy metric is not optimal for training a model in this challenge as we have a
large class imbalance. Therefore, the F1 score metric is used.</p>
        <p>Confidence threshold selection plays a crucial role in the multi-label problem.
The threshold determines above which predicted probability a concept is mapped
to an image. When a class is predicted, the network outputs a probability,
and only probabilities exceeding a certain threshold are counted as indicating that the
concept is present in that image. Even if our model is well trained, an unoptimized
threshold may still have a substantial effect on our result, and determining the
optimum threshold can often be tricky.</p>
        <p>Therefore, we varied the threshold systematically and tuned its value on the
validation set, as shown in Fig. 4. The maximum performance
with respect to the confidence threshold is identified to lie in the range of about 0.1 to
0.25. Hence, we submitted several runs with different threshold values between
θ = 0.12 and θ = 0.25 (see Table 4). As expected from Fig. 4, the resulting
improvements in F1 are substantial.</p>
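        <p>The sweep itself is straightforward once per-concept probabilities for the validation images are available; a minimal sketch with scikit-learn, assuming the binary ground-truth matrix y_true and the probability matrix y_prob have already been computed:</p>
        <preformat>
import numpy as np
from sklearn.metrics import f1_score

# y_true, y_prob: arrays of shape (num_images, 3047) with binary labels and predicted probabilities
def sweep_thresholds(y_true, y_prob, thresholds=np.arange(0.05, 0.51, 0.01)):
    scores = []
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        # "samples" averaging: F1 per image, averaged over the validation set
        scores.append(f1_score(y_true, y_pred, average="samples", zero_division=0))
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]
        </preformat>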
      </sec>
      <sec id="sec-3-2">
        <title>Optimization techniques</title>
        <p>
          There are plenty of ways of managing limited data volume and imbalanced
datasets, such as eliminating outliers, expanding the data set, augmentation,
etc. In the medical image domain, some types of diseases or conditions occur less
frequently in humans, resulting in smaller sample numbers. Thus, to tackle these
problems, we decided to use image data augmentation. Keras offers, for example, the
following methods, which we employ as parameterized below (a configuration sketch follows the list):
- Rotation is performed by randomly rotating an image around its center by
up to 5 degrees.
- Vertical and horizontal flip. Flipping images is one of the most widely
implemented techniques, popularized by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
- Height and width shift range: The images are randomly shifted horizontally
or vertically by up to 5% of the total width and height, respectively.
- Zoom: Objects in images are randomly zoomed within a range of 5%.
- Brightness shift: The image is randomly darkened or brightened in a range
of 80 to 120% of the initial brightness.
- Samplewise center: To mitigate the problem of vanishing gradients or
saturating values, the data are normalized in such a way that the mean value of
each data sample becomes 0.
- Samplewise standard normalization: This pre-processing method follows
the same concept as sample-wise centering, but instead fixes the standard
deviation to 1.
        </p>
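        <p>In Keras, these settings map onto ImageDataGenerator arguments roughly as follows (a sketch of the augmentation configuration described above, not a verbatim copy of our training script):</p>
        <preformat>
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=5,                    # random rotation of up to 5 degrees
    horizontal_flip=True,                # horizontal flip
    vertical_flip=True,                  # vertical flip
    width_shift_range=0.05,              # horizontal shift of up to 5% of the width
    height_shift_range=0.05,             # vertical shift of up to 5% of the height
    zoom_range=0.05,                     # zoom within a range of 5%
    brightness_range=(0.8, 1.2),         # 80 to 120% of the initial brightness
    samplewise_center=True,              # per-sample mean set to 0
    samplewise_std_normalization=True,   # per-sample standard deviation fixed to 1
)
        </preformat>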
        <p>Enabling data augmentation increases the F1 score and contributes to
a more robust behavior of the system.</p>
        <p>The competition allows at most 100 concepts per image. Therefore, to ensure
this, the probabilities were sorted in descending order and the top-100 probabilities
were selected, as sketched below.</p>
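        <p>Selecting and formatting the top-100 concepts per image needs only a few NumPy operations. The sketch below assumes a list concept_ids mapping output indices back to UMLS CUIs and a tab/semicolon submission layout; both are assumptions for illustration:</p>
        <preformat>
import numpy as np

def format_predictions(image_ids, y_prob, concept_ids, threshold=0.15, top_k=100):
    """Keep at most top_k concepts above the threshold for each image."""
    lines = []
    for img, probs in zip(image_ids, y_prob):
        order = np.argsort(probs)[::-1][:top_k]           # top-100 indices by probability
        kept = [concept_ids[i] for i in order if probs[i] > threshold]
        lines.append(img + "\t" + ";".join(kept))          # assumed submission layout
    return "\n".join(lines)
        </preformat>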
      </sec>
      <sec id="sec-3-3">
        <title>Run description</title>
        <p>We submitted ten runs (Table 4). The runs often utilize the same base structure,
an Xception model, and all use transfer learning from ImageNet. The runs vary in
meta-parameters as we tested different ones. We vary primarily: (i) the threshold
in the last layer, (ii) slightly different base models, and (iii) with and without
max-pooling in the highest layers.
Run ID 1/68024 We deploy an Xception model and utilize transfer learning
from ImageNet. We ran our model for N=100 epochs and set the learning
rate to 1e-3. The model uses the confidence threshold θ in the last layer
to map concept probabilities to true/false for the concepts. We tuned the
threshold to 0.15 on the validation set, selected the top 100 concepts and
submitted our results (an illustrative sketch of this training configuration is given after the run list).</p>
        <p>Run ID 2/68029 This run again uses the Xception model and generally the
configuration of Run ID 1. It optimizes the threshold further, setting it to
0.20.</p>
        <p>Run ID 3/68034 We again deploy an Xception model and utilize transfer
learning from ImageNet. This submission has a more streamlined source
code structure and explores different meta-parameters: We ran our model
for N=30 epochs and set the learning rate to 1e-2. We tuned the threshold
to 0.15 on the validation set.</p>
        <p>Run ID 4/68045 We again deploy an Xception model and utilize the
configuration of Run ID 1. This submission explores different meta-parameters: We ran
our model for N=50 epochs and set the learning rate to 1e-4. We tuned the
threshold to 0.20 on the validation set.</p>
        <p>Run ID 5/68067 This run again uses the Xception model and generally the
configuration of Run ID 3, while exploring the effect of the max-pooling
layer before the highest layer. The max-pooling was removed here to show
its effect.</p>
        <p>Run ID 6/68073 This run uses the more streamlined source code structure
and again explores different meta-parameters: We ran our model for N=30
epochs and set the learning rate to 1e-2. We tuned the threshold to 0.12
on the validation set.</p>
        <p>Run ID 7/68074 This run again uses the Xception model and generally the
configuration of Run ID 6, while tuning the threshold to 0.25.</p>
        <p>Run ID 8/68076 We again deploy the standard configuration of Run ID 1.
This submission focuses on an experimental normalization of the dataset, but
was not very successful.</p>
        <p>Run ID 9/68077 We again deploy an Xception model and utilize transfer
learning from ImageNet. This submission explores an early stopping strategy:
the best model, i.e. the one with the lowest loss, was used over a run period of N=30 epochs. The
learning rate was 1e-3 and the threshold was tuned to 0.18.</p>
        <p>Run ID 10/68078 This run deploys the more streamlined source code
structure and explores a different threshold: 0.20.</p>
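        <p>As an illustration, the training configuration of Run ID 1 (learning rate 1e-3, N=100 epochs) would look roughly as follows, assuming the architecture sketch above and compiling on the sigmoid output; the optimizer, loss and data generators are assumptions, not details stated in this paper:</p>
        <preformat>
from tensorflow.keras.optimizers import Adam

# "model" is assumed to end in the 3,047-unit sigmoid layer (the lambda layers
# are only applied when producing the submission file).
model.compile(optimizer=Adam(learning_rate=1e-3),   # learning rate of Run ID 1
              loss="binary_crossentropy",           # assumed multi-label objective
              metrics=["binary_accuracy"])
model.fit(train_generator,                          # hypothetical training data generator
          validation_data=val_generator,            # hypothetical validation data generator
          epochs=100)                               # N=100 epochs as in Run ID 1
        </preformat>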
        <p>In Table 5, we list the top teams with their best F1 score in percent. Our
team, TUC MC, occupied the 5th position in terms of team ranking with an F1 score
of 0.3512.
Our approach of adapting an Xception model for the medical caption task 2020
achieves an F1 score of 35.1%, which is better than the 2019 results and close
to the best contribution of 2020, which achieved 39.4%. Our strategy to rely
on a modern Xception neural network proved to be successful. It also shows
that transfer learning, with weights pre-learned on ImageNet, is very usable on
image material as different as medical images. The introduction of
a max pooling in the last layer and the optimization of the confidence threshold have
boosted the performance of our Xception model. Further investigation could lead
in the direction of optimizing the learning through an entropy-based analysis of the concepts
of neural networks. Moreover, a more in-depth analysis of certain concept classes
might be carried out in order to better understand the errors in the present
classification task.
11. Pelka, O., Koitka, S., Ruckert, J., Nensa, F., Friedrich, C.M.: Radiology
Objects in COntext (ROCO): A Multimodal Image Dataset. In: Stoyanov, D.,
Taylor, Z., Balocco, S., Sznitman, R., Martel, A., Maier-Hein, L., Duong, L.,
Zahnd, G., Demirci, S., Albarqouni, S., Lee, S.L., Moriconi, S., Cheplygina,
V., Mateus, D., Trucco, E., Granger, E., Jannin, P. (eds.) Intravascular
Imaging and Computer Assisted Stenting and Large-Scale Annotation of
Biomedical Data and Expert Label Synthesis, vol. 11043, pp. 180–189. Springer
International Publishing, Cham (2018). https://doi.org/10.1007/978-3-030-01364-6_20,
http://link.springer.com/10.1007/978-3-030-01364-6_20, series title:
Lecture Notes in Computer Science
12. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang,
Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet
Large Scale Visual Recognition Challenge. Int J Comput Vis 115(3), 211–252
(Dec 2015). https://doi.org/10.1007/s11263-015-0816-y, http://link.springer.com/10.1007/s11263-015-0816-y
13. Seco De Herrera, A.G., Eickhoff, C., Andrearczyk, V., Muller, H.: Overview of
the ImageCLEF 2018 Caption Prediction Tasks. In: Working Notes of CLEF 2018
- Conference and Labs of the Evaluation Forum. vol. 2125, pp. 1–12. Avignon,
France (Sep 2018), http://ceur-ws.org/Vol-2125/invited_paper_4.pdf
14. Xu, J., Liu, W., Liu, C., Wang, Y., Chi, Y., Xie, X., Hua, X.: Concept detection
based on multi-label classification and image captioning approach - DAMO at
ImageCLEF 2019. In: Working Notes of CLEF 2019 - Conference and Labs of the
Evaluation Forum. vol. 2380, pp. 1–10. Lugano, Switzerland (Sep 2019), http://ceur-ws.org/Vol-2380/paper_141.pdf</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Xception: Deep Learning with Depthwise Separable Convolutions</article-title>
          .
          <source>In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . pp.
          <year>1800</year>
          –
          <year>1807</year>
          . IEEE, Honolulu,
          <string-name>
            <surname>HI</surname>
          </string-name>
          (Jul
          <year>2017</year>
          ). https://doi.org/10.1109/CVPR.
          <year>2017</year>
          .
          <volume>195</volume>
          , http://ieeexplore.ieee.org/ document/8099678/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muller</surname>
          </string-name>
          , H.:
          <article-title>Overview of ImageCLEFcaption 2017 { Image Caption Prediction and Concept Detection for Biomedical Images</article-title>
          .
          <source>Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          <year>1866</year>
          ,
          <volume>1</volume>
          –
          <fpage>10</fpage>
          (Sep
          <year>2017</year>
          ), http://ceur-ws.
          <source>org/</source>
          Vol-1866/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>ImageSem at ImageCLEFmed Caption 2019 Task: a Two-stage Medical Concept Detection Strategy</article-title>
          .
          <source>Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum</source>
          <volume>2380</volume>
          ,
          <issue>1</issue>
          –8 (Sep
          <year>2019</year>
          ), http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2380</volume>
          /paper_80.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodt</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cormack</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eggel</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hopfgartner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalpathy-Cramer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kando</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krithara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mercer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Evaluation-as-a-Service: Overview and Outlook</article-title>
          . arXiv:
          <volume>1512</volume>
          .07454 [cs] pp.
          <volume>1</volume>
          –
          <issue>28</issue>
          (Dec
          <year>2015</year>
          ), http://arxiv.org/abs/1512.07454
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          . pp.
          <volume>770</volume>
          –
          <issue>778</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Peteri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DemnerFushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlovski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , l Halvorsen,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.T.</given-names>
            ,
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.T.</given-names>
            ,
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Fichou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Berari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Brie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Stefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.D.</given-names>
            ,
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.G.</surname>
          </string-name>
          :
          <article-title>Imageclef 2020: Multimedia retrieval in medical, lifelogging, nature, and internet applications</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ), vol.
          <volume>12260</volume>
          .
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Thessaloniki,
          <source>Greece (September</source>
          <volume>22</volume>
          -25
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kougia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavlopoulos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Androutsopoulos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>AUEB NLP Group at ImageCLEFmed Caption 2019</article-title>
          . In: Working Notes of CLEF 2019 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          . vol.
          <volume>2380</volume>
          , pp.
          <volume>1</volume>
          –
          <issue>8</issue>
          .
          <string-name>
            <surname>Lugano</surname>
          </string-name>
          ,
          <string-name>
            <surname>Switzerland</surname>
          </string-name>
          (Sep
          <year>2019</year>
          ), http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2380</volume>
          /paper_136.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>ImageNet Classification with Deep Convolutional Neural Networks</article-title>
          . In: Pereira,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.J.C.</given-names>
            ,
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.Q</surname>
          </string-name>
          . (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>25</volume>
          , pp.
          <volume>1097</volume>
          –
          <fpage>1105</fpage>
          . Curran Associates, Inc. (
          <year>2012</year>
          ), http://papers.nips.cc/paper/ 4824-imagenet
          <article-title>-classification-with-deep-convolutional-neural-networks</article-title>
          . pdf
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Overview of the ImageCLEFmed 2020 concept prediction task: Medical image understanding</article-title>
          .
          <source>In: CLEF2020 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Thessaloniki,
          <source>Greece (September</source>
          <volume>22</volume>
          -25
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seco De Herrera</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Overview of the ImageCLEFmed 2019 Concept Detection Task</article-title>
          . In: Working Notes of CLEF 2019 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          . vol.
          <volume>2380</volume>
          , pp.
          <volume>1</volume>
          –
          <fpage>13</fpage>
          .
          <string-name>
            <surname>Lugano</surname>
          </string-name>
          ,
          <string-name>
            <surname>Switzerland</surname>
          </string-name>
          (Sep
          <year>2019</year>
          ), http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2380</volume>
          /paper_245.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>