Leveraging Diverse CNN Architectures for Medical Image Captioning: DenseNet-121, MobileNetV2, and ResNet-50 in ImageCLEF 2024

Notebook for the VIT_Conceptz Lab at CLEF 2024

Sriram Ram1,*, Shashaank Vinoth1,*, Rahul Natesh Gopalakrishnan1,*, Aastick Amirteswar Balakumar1,*, Lekshmi Kalinathan1,* and Thomas Abraham Joseph Velankanni1,*

1 Vellore Institute of Technology, Chennai Campus, Vandalur-Kelambakkam Road, Chennai, Tamil Nadu 600127, India

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
sriram.r1003@gmail.com (S. Ram); shashaankvinoth@gmail.com (S. Vinoth); rahulnateshgr8@gmail.com (R. N. Gopalakrishnan); aastick.amirteswar.b@gmail.com (A. A. Balakumar); lekshmi.k@vit.ac.in (L. Kalinathan); thomasabraham.jv@vit.ac.in (T. A. J. Velankanni)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this study, we employed three deep learning models, ResNet50, MobileNetV2, and DenseNet-121, to perform concept detection, which involves identifying and locating relevant concepts in medical images. Concept detection provides the foundation for generating coherent captions in the subsequent caption prediction task. In medical imaging, concept detection plays a pivotal role: it enables accurate disease diagnosis and monitoring by identifying specific features such as tumors, fractures, or anomalies, and the detected concepts guide treatment planning, ensuring timely interventions. Among the models, ResNet50 achieved the highest performance, followed by MobileNetV2 and DenseNet-121. These results indicate that ResNet50 is the most effective of the three models for identifying relevant concepts within medical images. This study provides insights into the applicability of different convolutional neural networks for medical image analysis, contributing to advancements in automated medical image captioning. Our team secured 8th place on the overall leaderboard of the Concept Detection Task in the 8th edition of the Caption Challenge in ImageCLEFmedical 2024.

Keywords
ImageCLEF 2024, Image Captioning, DenseNet-121, ResNet50, MobileNetV2

1. Introduction

The 8th edition of the Caption Challenge in ImageCLEFmedical 2024 [1] focuses on two tasks: Concept Detection and Caption Prediction [2]. This study examines the results obtained with pre-trained convolutional models—ResNet50, MobileNetV2, and DenseNet-121—applied specifically to the Concept Detection task. Concept Detection is a multilabel classification problem in which each radiology image may carry one or more labels. These labels, represented as Concept Unique Identifiers (CUIs), are mapped to specific medical concepts. Identifying these concepts helps in isolating individual components within the image and can be further applied to information retrieval. The motivation for this task stems from the growing availability of images without accompanying metadata; acquiring such metadata is crucial for making the content usable and accessible for further analysis and application [3].

Previous studies have underscored the challenges and potential of various approaches in medical image analysis. Rahman [4] demonstrated commendable precision in lesion detection using a bespoke CNN architecture, highlighting the effectiveness of tailored models for specific tasks. Dimitris and Ergina [5] explored the efficacy of transfer learning in medical imaging, showing that pretrained models can be effectively adapted to diverse imaging modalities. Rossetto et al. [6] extended visual concept detection to video retrieval, showcasing the versatility of these techniques across different formats. Ohri and Kumar [7] presented a comprehensive framework for medical image classification, suggesting that integrating diverse methodologies can enhance the accuracy and applicability of concept detection.
2. Methodology

The dataset for the Concept Detection task is derived from the Radiology Objects in COntext Version 2 (ROCOv2) dataset [8], an enhanced version of the original Radiology Objects in COntext (ROCO) dataset [9]. This dataset is specifically curated for radiology images and is sourced from biomedical articles in the PMC Open Access subset. The training set comprises 70,108 radiology images, the validation set includes 9,972 images, and the test set contains 17,237 images.

2.1. Dataset Processing and Environmental Setup

We began by loading the train_concepts.csv file to extract UMLS (Unified Medical Language System) [10] concept IDs from the CUIs column. Images were resized to 224x224 pixels, normalized, and converted to the appropriate format. Generators were created for the training and validation datasets to handle multilabel classification, yielding batches of images and their corresponding CUI labels.

The implementation was configured with Python 3.6 or higher, and the necessary libraries, including TensorFlow (v2.15.0), Keras (v2.15.0), and NumPy (v1.24.3), were installed as shown in Listing 1. We also ensured compatibility with CUDA and cuDNN to leverage GPU acceleration for model training, significantly improving computational efficiency. The default learning rate scheduler in our experiments is the ReduceLROnPlateau callback from TensorFlow/Keras, which reduces the learning rate when a monitored metric, such as validation loss, stops improving. We also used the EarlyStopping callback to halt training when the validation loss shows no improvement for a specified number of epochs. Together, these callbacks help optimize the training process and prevent overfitting.

# Python version
Python 3.6 or higher

# Required libraries and versions
TensorFlow v2.15.0
Keras v2.15.0
NumPy v1.24.3

Listing 1: Environment Setup
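The exact preprocessing code is available in the repository referenced in Section 3; Listing 2 is a minimal sketch of the label encoding and batching described above, written with a tf.data pipeline in place of Keras generators. The ID and CUIs column names, the semicolon separator, the image file layout, and the use of pandas and scikit-learn (not listed in Listing 1) are assumptions made only for illustration.

# Illustrative sketch (see assumptions in the text): read train_concepts.csv,
# turn each image's CUI list into a multi-hot vector, and yield batches of
# 224x224 images scaled to [0, 1] together with their label vectors.
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("train_concepts.csv")              # assumed columns: ID, CUIs
df["CUIs"] = df["CUIs"].str.split(";")              # one image -> list of UMLS CUIs

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(df["CUIs"]).astype("float32")   # (n_images, n_concepts)

def load_image(path):
    """Decode a JPEG, resize it to 224x224, and scale pixel values to [0, 1]."""
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (224, 224))
    return img / 255.0

paths = ("train/" + df["ID"].astype(str) + ".jpg").tolist()  # file layout is an assumption

train_ds = (tf.data.Dataset.from_tensor_slices((paths, labels))
            .map(lambda p, y: (load_image(p), y),
                 num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(1024)
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))

Listing 2: Illustrative sketch of CUI label encoding and image batching.

A Keras Sequence-based generator would serve the same purpose; tf.data is used here only to keep the sketch compact.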
2.2. DenseNet-121

In this study, we employ the DenseNet-121 architecture for image classification. DenseNet-121 is a densely connected convolutional network that enhances information flow between layers by connecting each layer to every other layer in a feed-forward fashion, as shown in Fig. 1 [11]. The model begins with an input layer for images of size 224 × 224 × 3, followed by an initial convolutional layer with 64 filters of size 7 × 7 and a stride of 2. This is followed by a batch normalization layer, a ReLU activation function, and a max pooling layer with a filter size of 3 × 3 and a stride of 2. The network consists of four dense blocks of increasing depth. The first dense block contains 6 layers, each comprising a pair of convolutions, repeated 6 times; it is followed by a transition layer consisting of a 1 × 1 convolution and a 2 × 2 average pooling layer. The second dense block contains 12 such layers, followed by another transition layer. The third dense block contains 24 such layers, followed by a transition layer. The final dense block has 16 such layers. After the dense blocks, the model includes a final batch normalization layer, a ReLU activation function, a 7 × 7 average pooling layer, and a fully connected layer with 1945 output units. This design facilitates efficient feature reuse, leading to improved performance with a reduced parameter count compared to traditional convolutional networks.

Figure 1: DenseNet-121 Architecture Diagram [12].

2.3. MobileNetV2

In this study, we employ the MobileNetV2 architecture for image classification. MobileNetV2 is a lightweight and efficient convolutional neural network designed for mobile and embedded vision applications, as shown in Fig. 2. The model begins with an input layer for images of size 224 × 224 × 3 pixels, followed by a 3 × 3 convolutional layer with 32 filters and a stride of 2, complemented by batch normalization and ReLU activation. The core of MobileNetV2 consists of inverted residual blocks with linear bottlenecks. Each block includes an expansion layer with 1 × 1 convolutions, depthwise convolutions, and pointwise convolutions that adjust the channel dimensions, accompanied by batch normalization and ReLU activation. These blocks are repeated to extract features efficiently [13]. After the inverted residual blocks, the network uses a global average pooling layer to consolidate spatial information, followed by a flattening layer. The final layers are a fully connected layer and a softmax activation function, which output class probabilities. MobileNetV2's design ensures effective feature extraction with fewer parameters, making it suitable for resource-limited environments.

Figure 2: MobileNetV2 Architecture Diagram [13].

2.4. ResNet50

ResNet50 is a deep convolutional neural network consisting of 50 layers, designed to address the vanishing gradient problem through the use of residual blocks, as shown in Fig. 3. Each block includes skip connections that allow gradients to propagate effectively, enabling the training of very deep networks [14]. ResNet50 features a bottleneck design with layers organized into five stages, combining convolutional operations and identity mappings, followed by a global average pooling layer and a fully connected layer for classification. This architecture achieves high performance on image recognition tasks, making it a cornerstone of modern deep learning.

Figure 3: ResNet50 Architecture Diagram [14].
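The figures above describe the backbone architectures; the task-specific heads added for concept detection are summarized in Section 2.5 below. Listing 3 shows one plausible way to attach a multi-label head with a sigmoid output to any of the three pre-trained backbones in Keras; the 512-unit intermediate layer and the concept-vocabulary size are illustrative assumptions rather than the exact configuration used in our experiments.

# Sketch of building one of the three backbones with a multi-label head.
# NUM_CONCEPTS and the dense head layout are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121, MobileNetV2, ResNet50

NUM_CONCEPTS = 1945  # size of the CUI vocabulary used in this illustration

def build_model(backbone_name="resnet50"):
    backbones = {
        "densenet121": DenseNet121,
        "mobilenetv2": MobileNetV2,
        "resnet50": ResNet50,
    }
    # ImageNet weights, no top classification layer, 224x224x3 input
    base = backbones[backbone_name](weights="imagenet", include_top=False,
                                    input_shape=(224, 224, 3))
    base.trainable = False  # freeze the backbone for the initial training phase

    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(512, activation="relu")(x)                     # hypothetical extra dense layer
    outputs = layers.Dense(NUM_CONCEPTS, activation="sigmoid")(x)   # multi-label output
    return models.Model(base.input, outputs)

model = build_model("resnet50")

Listing 3: Illustrative construction of a backbone with a multi-label classification head.

A sigmoid output is used because each image may carry several CUIs, so concept probabilities must be predicted independently rather than through a softmax over mutually exclusive classes.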
2.5. Training and Evaluation

In this study, we explore three distinct convolutional neural network architectures—DenseNet-121, MobileNetV2, and ResNet50—for image classification (Figures 1, 2, and 3). DenseNet-121 was initialized with pre-trained ImageNet weights and customized with additional dense and classification layers. We conducted initial training with the base layers frozen for 20 epochs, using a batch size of 32 and data augmentation techniques such as rescaling, horizontal flipping, and zooming. Subsequently, we unfroze the base layers for fine-tuning to adapt the model to our dataset. The model was evaluated on a separate test dataset to assess performance metrics including loss and accuracy, ensuring robustness and reproducibility of the results.

MobileNetV2, optimized for efficiency in mobile and embedded applications, was likewise initialized with pre-trained weights and customized with additional dense layers for multi-label classification, with the output dimension matching the number of UMLS concepts. The model was compiled with binary cross-entropy loss and the Adam optimizer, with accuracy as the primary evaluation metric. Training began with the dataset split into training and validation sets. Over 20 epochs with a batch size of 32, the model underwent initial training with frozen base layers, with rescaling, horizontal flipping, and zooming applied solely to the training data for improved generalization. Subsequent fine-tuning unfroze the base layers to adapt the model to the dataset. Evaluation on independent validation and test sets confirmed MobileNetV2's robust performance in classifying images, affirming its suitability for resource-efficient environments and its effectiveness in medical image analysis.

ResNet50 was initialized with pre-trained ImageNet weights without the top classification layer to maintain the integrity of the learned features, and the base model's layers were frozen to preserve these weights during initial training. Binary cross-entropy served as the loss function, given its effectiveness in multilabel classification, and the Adam optimizer was chosen to drive the training process, with accuracy as the evaluation metric. Training spanned 15 epochs with a batch size of 32, with data augmentation techniques—including rescaling, horizontal flipping, and zooming—applied to enhance generalization. The model demonstrated robustness and reliability when evaluated on separate validation and test datasets, underscoring its efficacy in complex image recognition scenarios.
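The two-phase procedure described above, together with the ReduceLROnPlateau and EarlyStopping callbacks from Section 2.1, can be summarized as in Listing 4, which continues from Listings 2 and 3 (train_ds and model). The patience values, learning rates, and augmentation settings are illustrative assumptions rather than the exact values used in our runs.

# Sketch of the two-phase training described above; patience values, learning
# rates, and the augmentation configuration are assumptions.
import tensorflow as tf
from tensorflow.keras import callbacks, layers, optimizers

# val_ds is assumed to be built from the validation split exactly like train_ds.
# Horizontal flip and zoom are applied to training batches only; rescaling to
# [0, 1] is already handled in the input pipeline of Listing 2.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomZoom(0.1),
])
aug_train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))

cbs = [
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3),
    callbacks.EarlyStopping(monitor="val_loss", patience=5,
                            restore_best_weights=True),
]

# Phase 1: train the new head with the backbone frozen.
model.compile(optimizer=optimizers.Adam(1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(aug_train_ds, validation_data=val_ds, epochs=20, callbacks=cbs)

# Phase 2: unfreeze the backbone and fine-tune with a lower learning rate.
model.trainable = True
model.compile(optimizer=optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(aug_train_ds, validation_data=val_ds, epochs=20, callbacks=cbs)

Listing 4: Illustrative two-phase training loop with learning-rate scheduling and early stopping.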
3. Experimental Results and Analysis

The results of our analysis are summarized in Tables 1 and 2.

Table 1
Comparison of F1 scores and secondary F1 scores for the three deep learning models.

Model           F1 Score    Secondary F1 Score
ResNet50        0.181       0.264
MobileNetV2     0.178       0.253
DenseNet-121    0.114       0.23

Table 2
Precision, recall, and F1 score of the three deep learning models.

Model           Precision   Recall    F1 Score
ResNet50        0.41        0.36      0.38
MobileNetV2     0.61        0.37      0.46
DenseNet-121    0.35        0.42      0.38

The higher F1 score of ResNet50 compared to MobileNetV2 and DenseNet-121 in Table 1 can be attributed to its superior performance in reducing the loss over the training epochs. Figures 4, 5, and 6 display the training and validation accuracy and loss over the training epochs for ResNet-50 (Fig. 4), MobileNetV2 (Fig. 5), and DenseNet-121 (Fig. 6). These visualizations highlight distinct performance characteristics and potential issues such as overfitting.

Figure 4: ResNet50 accuracy and loss graph.
Figure 5: MobileNetV2 accuracy and loss graph.
Figure 6: DenseNet121 accuracy and loss graph.

DenseNet-121 (Fig. 6) shows a steady increase in training accuracy, but its validation accuracy fluctuates significantly, indicating instability. The training loss decreases consistently, reflecting effective learning on the training data; however, the validation loss initially decreases and then starts to increase, a clear sign of overfitting. This suggests the model is too complex, capturing noise and nuances that do not generalize well. In contrast, MobileNetV2 (Fig. 5) shows training and validation accuracies that increase and converge closely, indicating consistent improvement without significant divergence. Its training and validation loss curves also decrease and stabilize, suggesting that the model generalizes well to the validation data. This implies that MobileNetV2, designed for efficiency, maintains a good balance between complexity and generalization, effectively preventing overfitting. ResNet-50 (Fig. 4) exhibits mild overfitting, with small gaps between training and validation accuracy and slight fluctuations in validation loss, indicating that it captures some noise but still generalizes relatively well; with some hyperparameter tuning and regularization, its performance could improve further.

In addition, Table 2 presents the precision, recall, and F1 scores of the three models. ResNet50 has a precision of 0.41, a recall of 0.36, and an F1 score of 0.38, indicating moderate performance with a tendency to miss actual positives. MobileNetV2 shows the highest precision at 0.61 and an F1 score of 0.46, suggesting that it balances precision and recall better than the other models despite a recall of 0.37. DenseNet-121 has the highest recall at 0.42 but the lowest precision at 0.35, resulting in an F1 score of 0.38, similar to ResNet50; it therefore identifies more actual positives but also produces more false positives. By these metrics, MobileNetV2 demonstrates the most balanced performance, making it potentially the most reliable model when both precision and recall matter.

The results achieved by our models are notably inferior to those obtained with EfficientNet-B0 and EfficientNet-v2-s models and by other challenge participants, due to several factors. DenseNet-121 exhibits unstable validation accuracy and increasing validation loss, indicating overfitting and an inability to generalize effectively. MobileNetV2, in contrast, shows consistent improvement in both training and validation metrics, indicating better generalization. ResNet-50 shows mild overfitting, with small accuracy gaps and fluctuating validation loss. Additionally, while MobileNetV2 achieves the highest precision and the most balanced F1 score (0.46), DenseNet-121's high recall comes at the cost of lower precision and an overall F1 score (0.38) similar to ResNet-50's, indicating difficulty in correctly identifying positives while minimizing false positives (Table 2). These performance differences may also stem from the regularization techniques used, the quality and quantity of the training data, and the tuning of hyperparameters such as learning rates and batch sizes. Stronger regularization, further hyperparameter tuning, or additional data augmentation could help reduce overfitting and improve the generalization of DenseNet-121 and ResNet-50.

To ensure reproducibility, the code and model weights for our experiments are available on GitHub at https://github.com/Sriram0703/ImageCLEFmedical-2024-Concept-Detection.
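For reference, Listing 5 sketches how example-based precision, recall, and F1 scores of the kind reported in Table 2 can be computed from thresholded sigmoid outputs. The 0.5 decision threshold and the use of scikit-learn's samples-averaged metrics are assumptions and may differ from the official challenge evaluation.

# Sketch of computing example-based precision/recall/F1 from model predictions;
# the 0.5 decision threshold and 'samples' averaging are assumptions.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(model, dataset, threshold=0.5):
    """Return samples-averaged precision, recall, and F1 over a batched tf.data dataset."""
    y_true, y_pred = [], []
    for images, labels in dataset:
        probs = model(images, training=False).numpy()
        y_true.append(labels.numpy())
        y_pred.append((probs >= threshold).astype(int))
    y_true = np.concatenate(y_true)
    y_pred = np.concatenate(y_pred)
    kwargs = dict(average="samples", zero_division=0)
    return (precision_score(y_true, y_pred, **kwargs),
            recall_score(y_true, y_pred, **kwargs),
            f1_score(y_true, y_pred, **kwargs))

Listing 5: Illustrative computation of example-based precision, recall, and F1.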
4. Conclusion

In this study, we addressed the Concept Detection Task of the ImageCLEFmedical Caption 2024 challenge, which aims to enhance automatic captioning and scene understanding of radiology images. We evaluated three deep learning models—ResNet50, MobileNetV2, and DenseNet-121—on their ability to identify and locate relevant concepts within a large corpus of medical images. Among these models, ResNet50 demonstrated superior performance with an F1 score of 0.181, followed by MobileNetV2 with an F1 score of 0.178 and DenseNet-121 with an F1 score of 0.114. These results indicate that ResNet50 is the most effective of the three models for concept detection in this context, providing the most accurate identification of the individual components that form the basis for generating coherent captions. This work underscores the potential of advanced convolutional neural networks in medical image analysis, contributing to the development of more efficient and reliable automated medical image captioning systems.

5. Acknowledgments

We would like to express our gratitude to Vellore Institute of Technology, Chennai for providing access to their advanced computational facilities. The models for this study were run on a Lenovo ThinkStation P348 equipped with an Intel Core i7-11700 processor at 2.5 GHz (8 cores), 64 GB of RAM, a 2 TB hard disk, and a 12 GB NVIDIA graphics card. This robust hardware and its high computational capability contributed significantly to the successful completion of this study.

References

[1] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science (LNCS), Grenoble, France, 2024.

[2] J. Rückert, A. Ben Abacha, A. G. Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, B. Bracke, H. Damm, T. M. G. Pakull, C. S. Schmidt, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2024 – Caption Prediction and Concept Detection, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.

[3] O. Pelka, C. M. Friedrich, A. García Seco de Herrera, H. Müller, Overview of the ImageCLEFmed 2019 concept detection task, in: Proceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes, 9–12 September 2019, 2019. URL: https://repository.essex.ac.uk/26557/.

[4] M. Rahman, A Cross Modal Deep Learning Based Approach for Caption Prediction and Concept Detection by CS Morgan State, in: CLEF (Working Notes), 2018, p. 8. URL: https://ceur-ws.org/Vol-2125/paper_138.pdf.

[5] K. Dimitris, K. Ergina, Concept detection on medical images using Deep Residual Learning Network, Working Notes CLEF (2017). URL: https://ceur-ws.org/Vol-1866/paper_122.pdf.

[6] L. Rossetto, M. Amiri Parian, R. Gasser, I. Giangreco, S. Heller, H. Schuldt, Deep learning-based concept detection in vitrivr, in: MultiMedia Modeling: 25th International Conference, MMM 2019, Thessaloniki, Greece, January 8–11, 2019, Proceedings, Part II 25, Springer, 2019, pp. 616–621. doi:10.1007/978-3-030-05716-9_55.

[7] K. Ohri, M. Kumar, Review on self-supervised image recognition using deep neural networks, Knowledge-Based Systems 224 (2021) 107090. URL: https://www.sciencedirect.com/science/article/pii/S0950705121003531. doi:10.1016/j.knosys.2021.107090.

[8] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. B. Abacha, A. G. S. de Herrera, H. Müller, P. A. Horn, F. Nensa, C. M. Friedrich, ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset, Scientific Data (2024). URL: https://arxiv.org/abs/2405.10004v1. doi:10.1038/s41597-024-03496-6.
[9] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology Objects in COntext (ROCO): a multimodal image dataset, in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3, Springer, 2018, pp. 180–189. doi:10.1007/978-3-030-01364-6_20.

[10] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270. doi:10.1093/nar/gkh061.

[11] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.243.

[12] Q. Ji, J. Huang, W. He, Y. Sun, Optimized deep convolutional neural networks for identification of macular diseases from optical coherence tomography images, Algorithms 12 (2019) 51. doi:10.3390/a12030051.

[13] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520. doi:10.1109/CVPR.2018.00474.

[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.