<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Evaluating Deep CNNs for Multi-Label Concept Detection in ROCOv2 Radiology Image Dataset by Team LekshmiscopeVIT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aryan Sahni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachit Gupta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raamigaani Venugopal Reddy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lekshmi Kalinathan</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vellore Institute of Technology</institution>
          ,
          <addr-line>Chennai-600127</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The "Lekshmiscopevit" team presents a ResNet50-based approach for the Concept Detection Task of the ImageCLEF Medical 2025 challenge, using the Radiology Objects in Context version 2 (ROCOv2) dataset. Our experiments explored multiple deep learning architectures, including InceptionV3, DenseNet, and custom convolutional models, with and without pretrained ImageNet weights. Among these, the ResNet50 model consistently outperformed the others, achieving the highest accuracy in both the validation and the test sets. Training was carried out using 80,091 radiology images, 17,277 images used for validation, and 19,267 for testing. To assess the efect of label space complexity, we also experimented with reducing the number of predicted labels to the top most frequently occurring UMLS CUIs. This label reduction improved model performance by alleviating class imbalance and increasing generalization.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer Learning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Concept detection</kwd>
        <kwd>ResNet50</kwd>
        <kwd>Multi-label Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Task Performed</title>
      <p>
        In the context of the ImageCLEFmedical Caption 2025 challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we have contributed to the
Concept Detection Task, which is the task of detecting clinically relevant UMLS concepts directly from
radiological images. The task represents a building block toward automatic image captioning and scene
understanding in the medical field. We created and trained Multi-Label Classification models to make
predictions for UMLS concepts related to each image in the dataset. The concepts were chosen from
a filtered portion of the UMLS 2022AB release, including those with greater frequency and specific
semantic types to maintain relevance and feasibility. Our method used the ROCOv2 dataset, which
contained a training set of 80,091 radiology images, a validation set of 17,277 images, and a test set of
19,267 images. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] The predicted concept sets were assessed with set-coverage measures, namely precision,
recall, and F1-score, reflecting the correctness and completeness of the concept sets produced by the
models. All experiments were run on only the official training data, as per the task guidelines, to
keep results comparable with other participating systems. The code and trained models can be found in the
following GitHub repository: https://github.com/C0okiegranny221/CONCEPT-DETECTION.
      </p>
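      <p>
        For clarity, the sketch below (not the official evaluation script) illustrates how such set-coverage measures can be computed for a single image from its predicted and reference concept sets; the CUIs shown are purely illustrative.
      </p>
      <preformat>
# Sketch: per-image set precision, recall and F1 between predicted and
# ground-truth UMLS concept sets.
def set_metrics(predicted, reference):
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0, 1.0, 1.0                       # both empty: perfect agreement
    overlap = len(predicted &amp; reference)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative CUIs for one image:
print(set_metrics({"C0040405", "C1306645"}, {"C0040405", "C0817096"}))
      </preformat>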
    </sec>
    <sec id="sec-2">
      <title>2. Main Objectives of the Experiments</title>
      <p>
        The main goal of our experiments was to create a successful deep learning pipeline for multi-label
concept detection from radiology images within the ImageCLEFmedical Caption 2025 challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Of particular interest was identifying clinically significant UMLS concepts solely from visual features,
ultimately facilitating downstream applications such as automatic image captioning and
semantic retrieval. To this end, we tested various convolutional neural network
architectures, including ResNet50, DenseNet121, InceptionV3, and custom-designed models, under both
pretrained and randomly initialised weight configurations. The custom models were lightweight CNNs
with dense connections, intended as baselines against which the more established architectures could
be compared in the test phase. The experiments were designed to identify the architecture with the best
generalisation performance on unseen medical images. A major experimental aim was to examine the
effect of label distribution on model performance by changing the number of concepts employed during
training and inference. In particular, we explored how limiting predictions to the most common UMLS concepts
affected performance metrics such as precision, recall, and F1-score. Our best results were
obtained with a ResNet50-based model, which showed higher accuracy on both the validation and test sets
than the other architectures. These results highlight architecture choice and
label-space optimisation as key factors in improving visual concept recognition in medical imaging.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approaches Used and Progress Beyond State-of-the-Art</title>
      <p>
        Our strategy towards the concept detection task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] involved utilising deep convolutional neural
networks with effective preprocessing and label encoding techniques suited for multi-label classification.
The MultiLabelBinarizer (MLB) was used to encode the UMLS concepts in a binary matrix format,
which would allow the model to make multiple concept predictions per image. Input images were
preprocessed by Keras ImageDataGenerator, where real-time data augmentation and normalisation
were possible, improving the model’s generalisation on unseen medical images.
      </p>
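      <p>A minimal sketch of this preprocessing step is shown below; the file name train_concepts.csv, the column names ID and CUIs, the ";" separator, and the image directory are illustrative assumptions rather than the exact task files.</p>
      <preformat>
# Sketch: encode UMLS concepts with MultiLabelBinarizer and stream images
# through a Keras ImageDataGenerator (assumed file/column names).
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from tensorflow.keras.preprocessing.image import ImageDataGenerator

df = pd.read_csv("train_concepts.csv")            # assumed: columns "ID", "CUIs"
df["CUIs"] = df["CUIs"].str.split(";")            # assumed ";"-separated concept lists
df["filename"] = df["ID"] + ".jpg"                # assumed image file naming

mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(df["CUIs"])            # binary matrix: images x concepts
df[list(mlb.classes_)] = labels                   # one 0/1 column per CUI

datagen = ImageDataGenerator(rescale=1.0 / 255)   # real-time normalisation
train_gen = datagen.flow_from_dataframe(
    df,
    directory="train_images/",                    # assumed image folder
    x_col="filename",
    y_col=list(mlb.classes_),
    class_mode="raw",                             # multi-hot targets per image
    target_size=(224, 224),
    batch_size=128,
    shuffle=False,
)
      </preformat>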
      <p>
        We evaluated different CNN architectures, such as ResNet50, DenseNet121, and InceptionV3, and found
that ResNet50 models with pretrained ImageNet weights performed best overall. Initialising with
pretrained weights helped the models transfer low-level feature representations from natural images
to medical image data, leading to faster convergence and better accuracy. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] This transfer learning
approach gave a robust initialisation, enabling the network to concentrate on learning domain-specific
patterns applicable to radiology. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
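      <p>A minimal sketch of this transfer-learning setup is given below; the 256-unit intermediate layer and the value of num_concepts are illustrative choices, not necessarily the exact configuration used.</p>
      <preformat>
# Sketch: ResNet50 backbone with ImageNet weights and a sigmoid multi-label head.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

num_concepts = 10                                  # e.g. the top-10 most frequent CUIs

base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(224, 224, 3), pooling="avg")

model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),          # assumed intermediate layer
    layers.Dense(num_concepts, activation="sigmoid"),  # independent probability per concept
])
      </preformat>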
      <p>
        Compared to conventional concept detection techniques based on handcrafted features or shallow
classifiers, our solution achieved a significant improvement by combining deep visual feature learning
with multi-label semantic prediction. The use of label frequency analysis and concept space reduction
additionally provided performance boosts by concentrating on the most informative clinical concepts.
In total, our pipeline demonstrated strong gains on the ROCOv2 validation and test sets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
demonstrating the power of combining CNN architectures with medical data-specific preprocessing and label
optimisation methodologies.
      </p>
      <p>Input images were resized to 224 x 224 x 3 before being fed into the neural network; the image set was
not shuffled and was fed in batches of 128. Our team used the Adam optimizer with the learning
rate set to 0.01. We used binary cross-entropy as the loss function and accuracy as
the metric. Early stopping was used to avoid overfitting on the training
set, monitoring the validation set with a minimum delta of 0.001 and a patience of 3, and the
best weights were restored at the end of training.</p>
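      <p>Continuing the sketches above, this training configuration could be expressed as follows; the monitored quantity, the epoch budget, and the valid_gen generator (built like train_gen) are assumptions.</p>
      <preformat>
# Sketch: Adam (lr=0.01), binary cross-entropy, accuracy metric, early stopping
# with min_delta=0.001, patience=3, and restoration of the best weights.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer=Adam(learning_rate=0.01),
              loss="binary_crossentropy",
              metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss",     # assumed monitored quantity
                           min_delta=0.001,
                           patience=3,
                           restore_best_weights=True)

model.fit(train_gen,
          validation_data=valid_gen,               # assumed validation generator
          epochs=20,                               # assumed epoch budget
          callbacks=[early_stop])
      </preformat>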
    </sec>
    <sec id="sec-4">
      <title>4. Resources Used</title>
      <p>
        The experiments were carried out by leveraging a mix of on-campus GPU facilities and cloud-based
setups. Much of the training and testing of models was done on GPUs hosted by the high-performance
computing facilities of the college, which enabled the necessary computational power for processing
the massive ROCOv2 dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We also leveraged online environments such as Google Colab and Kaggle
Notebooks, which enabled rapid prototyping and convenient use of pretrained models.
The models were trained on Tesla T4 GPUs with a 30 GB RAM ceiling.
      </p>
      <p>
        In order to speed up training and increase performance, we utilised pre-trained weights from ImageNet
for all deep networks, such as ResNet50, DenseNet121, and InceptionV3 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This application of
transfer learning enabled the models to take advantage of previously learnt features, thereby
reducing training time as well as the likelihood of overfitting, particularly considering the complexity
of radiology image content and the multi-label nature of the task. The combination of heterogeneous
computational environments and pre-trained models allowed us to effectively iterate on experiments,
tune hyperparameters, and test multiple architectures at scale.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Results Obtained</title>
      <p>When the ResNet50 model was trained with the output space restricted to the top 10 most commonly occurring UMLS concepts, it recorded a Jaccard
index of 0.3929 with a corresponding exact-match score of 0.1018. This outcome illustrates the benefit of exploiting
high-frequency concepts and pretrained feature extractors in a multi-label classification
setting for medical image understanding.</p>
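      <p>For reference, a minimal way of computing these two quantities (not the official challenge scorer) over binary multi-label prediction matrices is sketched below with toy data.</p>
      <preformat>
# Sketch: sample-wise Jaccard index and exact-match ratio for multi-label outputs.
import numpy as np
from sklearn.metrics import jaccard_score, accuracy_score

y_true = np.array([[1, 0, 1], [0, 1, 0]])          # toy ground-truth matrix
y_pred = np.array([[1, 0, 0], [0, 1, 0]])          # toy thresholded predictions

print(jaccard_score(y_true, y_pred, average="samples"))  # Jaccard index
print(accuracy_score(y_true, y_pred))                    # exact-match ratio
      </preformat>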
      <p>Although other models such as DenseNet121 and InceptionV3 were competitive,
they could not match the accuracy that ResNet50 attained under equivalent training conditions.
These results indicate that ResNet50 is especially well suited to the visual representation requirements
of the concept detection task, particularly when used in tandem with label frequency filtering to
reduce the complexity of the output space.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Result Analysis</title>
      <p>
        Even though they are well known for excellent feature extraction strengths, InceptionV3 and
DenseNet121 performed poorly compared to ResNet50 under the multi-label concept detection task
based on ROCOv2 radiology images [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Various architectural and optimisation-related aspects
probably contributed to this result [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. ResNet50 proved to be the best model for this task based on its
fast convergence, strong residual learning mechanism, and ability to generalise under a low-label
and domain-specific environment. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] The inferior performance of alternative architectures and the
test-validation gap are explained by architectural mismatch with task requirements, distributional shift
sensitivity, and transfer learning constraints from non-medical domains. Future research might explore
longer training schedules, medical-domain pretraining, and curriculum-based label additions
for further optimisation.
      </p>
      <sec id="sec-6-1">
        <title>6.1. Architectural Complexity vs. Task Requirements</title>
        <p>
          InceptionV3, with its deeply modular structure consisting of multiple convolutional filters of different
sizes in parallel (e.g., 1×1, 3×3, 5×5), is optimised to learn multi-scale features. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] While useful in
general natural image classification applications, this multiplicity of feature scales could have introduced
redundancy in learning features for medical images, where fine-grained domain-specific patterns
predominate and might not correspond well to general-purpose, multiresolution filters. In addition,
numerous parallel branches add computational and memory requirements, which may slow down
convergence in a short number of training epochs.
        </p>
        <p>
          DenseNet is based on dense connectivity, where every layer takes inputs from all earlier layers.
This design promotes feature reuse and prevents vanishing gradients. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] But practically, this dense
connectivity can create overfitting on the unnecessary fine-grained details in high-resolution radiology
images [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], especially when the training is conducted for a few epochs and without domain-specific
pretraining. The nature of DenseNet to retain many low-level features can also result in information
dilution in subsequent layers, which could be undesirable in an application where semantic abstraction
and concept-level recognition are more important.
        </p>
        <p>
          By contrast, ResNet50’s residual connections allow stable gradient flow and speed up convergence
through the ability to learn identity mappings when deeper transformations are not required. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
This aspect is especially beneficial for transfer learning with pretrained weights, allowing for efficient
adaptation to the target domain without excessive degradation. ResNet50 thereby balances depth,
simplicity, and transferability [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and is more suitable for medical concept detection with sparse label
space and high intra-class similarity.
        </p>
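        <p>For illustration, the sketch below shows a simplified residual block in Keras (omitting the bottleneck and batch-normalisation layers of the actual ResNet50 blocks) to make the identity shortcut discussed here concrete.</p>
        <preformat>
# Sketch: a simplified residual block with an identity shortcut.
from tensorflow.keras import layers

def residual_block(x, filters):
    # assumes the input tensor x already has `filters` channels
    shortcut = x                                   # identity mapping
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])                # skip connection stabilises gradient flow
    return layers.Activation("relu")(y)
        </preformat>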
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Impact of Label Space Restriction</title>
        <p>Another important factor in the observed performance differences is the decision to restrict the output
label space to the top 10 most frequent UMLS concepts. This substantially reduced label sparsity and
class imbalance, which typically plague multi-label classification tasks. Models with higher capacity (e.g.,
DenseNet) may require larger label diversity to showcase their full representational power. Conversely,
ResNet50 took advantage of the decreased complexity of labels to allow it to converge better in the
lower-dimensional label space.</p>
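        <p>A sketch of this label-space restriction, reusing the df frame from the earlier preprocessing snippet, could look as follows; discarding images left with no remaining concept is an illustrative assumption.</p>
        <preformat>
# Sketch: keep only the 10 most frequent UMLS CUIs and filter the label lists.
from collections import Counter

counts = Counter(cui for cuis in df["CUIs"] for cui in cuis)
top_cuis = {cui for cui, _ in counts.most_common(10)}

df["CUIs"] = df["CUIs"].apply(lambda cuis: [c for c in cuis if c in top_cuis])
df = df[df["CUIs"].map(len) > 0]                   # assumed: drop images with no remaining concept
        </preformat>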
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Gap between Validation and Test Set Performance</title>
        <p>
          Although validation performance reached 0.3929, performance on the unseen test set was significantly
lower, achieving an F1 score of 0.1494 and a secondary F1 score of 0.2298. Several factors could have
contributed to this gap. First, dataset distribution shift: the ROCOv2 dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] has been
reported to cover a wide range of imaging modalities and diagnostic scenarios. The validation set was
drawn from the same distribution as the training data, while the test set includes entirely unseen images,
potentially covering underrepresented modalities, resolutions, or clinical conditions. Such domain
shift can reduce generalisation. Second, the model may have overfitted to the most frequent patterns:
since it was trained and tested on the 10 most frequent concepts, it could have overfitted to those
dominant patterns. If the test set has a slightly different frequency distribution or a greater level of
label noise, the model cannot adapt and therefore achieves lower accuracy. Third, the batch
normalisation layers present in all the models used are sensitive to batch statistics. At training time, they
learn to match the training/validation batch distribution; at inference time on the test set (particularly
when conducted in small batches), a statistics mismatch causes suboptimal scaling of activations and
degrades performance.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Perspectives for Future Work</title>
      <p>
        While the existing method concentrated on utilising ResNet50 with ImageNet-pretrained weights
and restricting the label space to common concepts, there are several promising avenues for
increasing performance and extending applicability. Further research could explore
self-supervised pretraining on medical image datasets, e.g., MIMIC-CXR or CheXpert, to obtain
feature representations better aligned with the domain-specific semantics of radiology. In addition, multi-modal
learning that blends image features with metadata or caption text could enable more
context-sensitive predictions. Another novel avenue is the use of graph neural networks (GNNs)
to capture co-occurring and hierarchical semantic relationships between UMLS concepts, enabling
the model to draw on semantic dependencies in classification. Transformer-based approaches have also been
shown to perform better in multi-label classification [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In addition,
uncertainty-aware learning with Bayesian deep learning may alleviate label noise and dataset bias, particularly
in the long tail of infrequent concepts. Finally, adding visual grounding or attention maps to identify
concept-related regions within the image would make the system more interpretable for clinical users,
paving the way for hybrid AI-human diagnostic pipelines.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research was supported by the Department of Science and Technology (DST), India, under the
Fund for Improvement of S&amp;T Infrastructure in Universities and Higher Educational Institutions
(FIST) Program [Grant No. SR/FST/ET-I/2022/1079], along with a matching grant from VIT University.
The authors express their sincere gratitude to DST-FIST and the VIT management for their financial
assistance and the infrastructural support provided for this work.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 Turbo, QuillBot, and Grammarly for
grammar and spelling checking. Further, the authors used GPT-4 Turbo for rephrasing sentences
or paragraphs to improve clarity, conciseness, or style. After using these tools/services, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEFmedical 2025 - Medical Concept Detection and Interpretable Caption Generation</article-title>
          , in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology Objects in Context Version 2, an Updated Multimodal Image Dataset</article-title>
          ,
          <source>Scientific Data</source>
          <volume>11</volume>
          (
          <year>2024</year>
          ). doi:10.1038/s41597-024-03496-6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cosa-Linan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jannesari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Maros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ganslandt</surname>
          </string-name>
          ,
          <article-title>Transfer learning for medical image classification: a literature review</article-title>
          ,
          <source>BMC Medical Imaging</source>
          <volume>22</volume>
          (
          <year>2022</year>
          ) 69. doi:10.1186/s12880-022-00793-7.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Torrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shavlik</surname>
          </string-name>
          ,
          <article-title>Transfer learning</article-title>
          , in:
          <source>Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques</source>
          , IGI Global,
          <year>2010</year>
          , pp.
          <fpage>242</fpage>
          -
          <lpage>264</lpage>
          . doi:10.4018/978-1-60566-766-9.ch011.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qadri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M. W.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Sharif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Marinello</surname>
          </string-name>
          ,
          <article-title>Comparing Inception V3, VGG 16, VGG 19, CNN, and ResNet 50: A case study on early detection of a rice disease</article-title>
          ,
          <source>Agronomy</source>
          <volume>13</volume>
          (
          <year>2023</year>
          ) 1633. doi:10.3390/agronomy13061633.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <article-title>A survey of image classification methods and techniques for improving classification performance</article-title>
          ,
          <source>International Journal of Remote Sensing</source>
          <volume>28</volume>
          (
          <year>2007</year>
          )
          <fpage>823</fpage>
          -
          <lpage>870</lpage>
          . doi:10.1080/01431160600746456.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <article-title>An analysis of convolutional neural networks for image classification</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>132</volume>
          (
          <year>2018</year>
          )
          <fpage>377</fpage>
          -
          <lpage>384</lpage>
          . doi:10.1016/j.procs.2018.05.198.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>HCP: A flexible CNN framework for multi-label image classification</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>38</volume>
          (
          <year>2015</year>
          )
          <fpage>1901</fpage>
          -
          <lpage>1907</lpage>
          . doi:10.1109/tpami.2015.2491929.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mascarenhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for Image Classification</article-title>
          , in:
          <source>2021 International Conference on Disruptive Technologies for Multi-Disciplinary Research and Applications (CENTCON)</source>
          , volume
          <volume>1</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>99</lpage>
          . doi:10.1109/CENTCON52345.2021.9687944.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <article-title>General multi-label image classification with transformers</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>16478</fpage>
          -
          <lpage>16488</lpage>
          . doi:10.1109/cvpr46437.2021.01621.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>