<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of ImageCLEFmedical 2025 - Medical Concept Detection and Interpretable Caption Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hendrik Damm</string-name>
          <email>hendrik.damm@fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tabea M. G. Pakull</string-name>
          <email>tabea.pakull@uk-essen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helmut Becker</string-name>
          <email>helmut.becker@uk-essen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Bracke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bahadır Eryılmaz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Louise Bloch</string-name>
          <email>louise.bloch@fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Brüngel</string-name>
          <email>raphael.bruengel@fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cynthia S. Schmidt</string-name>
          <email>cynthia.schmidt@uk-essen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Rückert</string-name>
          <email>johannes.rueckert@fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Obioma Pelka</string-name>
          <email>obioma.pelka@uk-essen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henning Schäfer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ahmad Idrissi-Yaghir</string-name>
          <email>ahmad.idrissi-yaghir@uk-essen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asma Ben Abacha</string-name>
          <email>abenabacha@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alba G. Seco de Herrera</string-name>
          <email>alba.garcia@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henning Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff8">8</xref>
          <xref ref-type="aff" rid="aff9">9</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph M. Friedrich</string-name>
          <email>christoph.friedrich@fh-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Athens University of Economics and Business, Greece Iran University of Science and Technology</institution>
          ,
          <addr-line>Tehran</addr-line>
          ,
          <institution>Iran - University of Information Technology, Ho Chi Minh City, Vietnam Hunan City University, China Rajalakshmi Engineering College, Chennai, India Universidad Europea de Valencia, Spain University of Murcia, Spain Vellore Institute of Technology, Chennai, India Morgan State University</institution>
          ,
          <addr-line>Baltimore</addr-line>
          ,
          <country>USA Chung</country>
          <institution>-Ang University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Integration Center, Central IT Department, University Hospital Essen</institution>
          ,
          <addr-line>Essen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer Science, University of Applied Sciences and Arts Dortmund</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Institute for Transfusion Medicine, University Hospital Essen</institution>
          ,
          <addr-line>Essen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Microsoft</institution>
          ,
          <addr-line>Redmond, Washington</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>School of Computer Science, National University of Distance Education (UNED)</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff8">
          <label>8</label>
          <institution>University of Applied Sciences Western Switzerland (HES-SO)</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff9">
          <label>9</label>
          <institution>University of Geneva</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>36</volume>
      <fpage>248</fpage>
      <lpage>255</lpage>
      <abstract>
        <p>The ImageCLEFmedical 2025 Caption task follows challenges held from 2017-2024 and comprises three subtasks: concept detection, caption prediction, and a newly introduced explainability task. The goal is to extract Unified Medical Language System (UMLS) concepts, generate fluent captions from medical images, and provide human-interpretable justifications for the outputs. This year's edition used an enlarged version of the Radiology Objects in COntext version 2 (ROCOv2) dataset, which was expanded with new articles and the inclusion of the optical coherence tomography (OCT) imaging modality. For concept detection, the F1-score was used to evaluate predictions against UMLS terms. For caption prediction, evaluation was updated to a composite score averaging six metrics to assess both relevance and factuality. The new explainability submissions were manually judged by a radiologist. The 2025 task attracted 80 registered research groups, with 11 teams submitting a total of 149 graded runs across the three subtasks. Top-performing systems for concept detection were predominantly based on ensembles of Convolutional Neural Networks (CNNs). For caption prediction, a general shift towards fine-tuning Vision-Language Models (VLMs) was observed, with adapted architectures like BLIP leading to strong results across the new composite metrics. Finally, the inaugural explainability task saw initial submissions of post-hoc visualizations, establishing a baseline and clarifying the need for model-intrinsic explanations in future editions.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Multi-Label Classification</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>Image Understanding</kwd>
        <kwd>Radiology</kwd>
        <kwd>Explainable AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        ImageCLEF (https://www.imageclef.org/, last accessed: 2025-06-01) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is the image retrieval and classification lab of the Conference and Labs of the
Evaluation Forum (CLEF) conference [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. ImageCLEF 2025 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] consists of the ImageCLEFmedical,
ImageCLEFrecommending, Image Retrieval for Arguments (Touché) and ImageCLEFToPicto labs, with
the ImageCLEFmedical lab being divided into the subtasks Caption (image captioning), VQA
(text-to-image generation), MEDIQA-MAGIC (Multimodal And Generative TelemedICine) and GANs (generation
of medical images).
      </p>
      <p>
        The Caption task was first proposed as part of the ImageCLEFmedical [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in 2016. In 2017 and
2018 [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] it comprised two subtasks: concept detection and caption prediction. From 2019 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to
2020 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the focus shifted to concept detection, extracting Unified Medical Language System ® (UMLS) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
Concept Unique Identifiers (CUIs) from radiology images. Since 2021 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] both subtasks have run in
parallel again, with gradually higher-quality, manually annotated data and—in 2023—a switch from
BLEU [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to BERTScore [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] as the primary caption-prediction metric [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The 2024 edition introduced
a small-scale explainability trial and an enlarged metric set.
      </p>
      <p>2025 marks the 9th edition of the ImageCLEFmedical Caption task. Building on the lessons of previous
years, the task now comprises three components:
1. Concept Detection – identification of UMLS concepts in radiology images;
2. Caption Prediction – generation of coherent captions for full images;
3. Explainability – newly promoted to an official subtask: participants must provide
human-interpretable explanations for a designated subset of images, which are manually judged by
a radiologist for interpretability, relevance and creativity.</p>
      <p>For caption prediction, the overall ranking is now based on the average across six metrics (see
Section 4), reflecting both relevance and factuality aspects of the generated captions.</p>
      <p>Manual creation of structured knowledge from medical images is slow and error-prone. By
benchmarking automatic systems that detect clinical concepts, compose fluent radiology captions and justify
their outputs, ImageCLEFmedical 2025 continues to stimulate research toward scalable, trustworthy
radiology-image understanding.</p>
      <p>
        As in 2024, the development data are drawn from an extended version of the Radiology Objects in
COntext Version 2 (ROCOv2) dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For 2025, this release has been enlarged with additional,
newly released PubMed Central® Open-Access articles whose images and captions were again manually
annotated with modalities. A novelty of this year’s dataset is the inclusion of the imaging modality
optical coherence tomography (OCT), which has been retrospectively annotated for every existing
ROCOv2 image and prospectively annotated for all new articles. The final split now comprises 80 091
training, 17 277 validation, and 19 267 test radiology images, all with updated licensing curation and
UMLS (2022 AB) concept filtering.
      </p>
      <p>
        This paper presents an overview of the ImageCLEFmedical 2025 Caption task: the task design and
participation (Section 2), data creation (Section 3), evaluation methodology (Section 4), results (Section
5) and conclusions (Section 6). Further information on the other ImageCLEF 2025 tasks can be found in
Ionescu et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Task and Participation</title>
      <p>
        For the 9th edition, the ImageCLEFmedical Caption task builds on two familiar subtasks:
• T1 Concept Detection. Systems predict Unified Medical Language System ® (UMLS) Concept
Unique Identifiers (CUIs) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] directly from radiology images, following the format introduced in
2017 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
• T2 Caption Prediction. Systems generate full-sentence captions for each image, a subtask that
returned in 2021 after a pause in 2019–2020.
and introduces a third, officially graded component:
• Exp Explainability. For a small radiologist-selected subset, each team provides one
human-interpretable explanation (for example a heat-map, bounding boxes or a textual rationale) that
relates the image to the generated caption. This explanation is intended to clarify the model’s
decision-making process and thereby support clinicians in building trust in the model.
Explanations are judged manually by a radiologist for interpretability, clinical relevance and creativity.
      </p>
      <p>The 2025 edition also adds six evaluation metrics for caption prediction (see Section 4) and
retrospectively annotates the complete ROCOv2 corpus with the new optical coherence tomography
(OCT) modality. To compensate for the greater computational effort and occasional Docker-induced
submission problems, the limit for graded runs per team was raised to 30 for T1 and T2; previously, it
had been set at 10 runs. The Explainability task (Exp) allowed only one submission, due to the manual
evaluation effort.</p>
      <sec id="sec-2-1">
        <title>2.1. Participation Statistics</title>
        <p>Eighty research groups signed the End-User Agreement and downloaded the development data. Eleven
of them submitted runs and ten provided accompanying working-note papers. The submissions were
distributed across the tasks as follows:
• Concept Detection (T1): 9 teams, 51 graded runs.
• Caption Prediction (T2): 8 teams, 98 graded runs.
• Explainability (Exp): 2 teams, 2 graded runs.</p>
        <p>• Total: 149 graded runs.</p>
        <p>Six groups took part in both T1 and T2. Three teams (DeepLens, mapan and LekshmiscopeVIT)
focused on concept detection only, and two (CS_Morgan and AI Stat Lab) entered just the
caption-prediction track. Five teams (AUEB NLP Group, UIT-Oggy, CS_Morgan, sakthiii and LekshmiscopeVIT)
had already participated in 2024 and are marked with an asterisk in Table 1.</p>
        <p>The 2025 task therefore attracted a participant pool similar in size to earlier editions but generated
more graded submissions, while also promoting explainability to a fully assessed subtask.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data Creation</title>
      <sec id="sec-3-1">
        <title>3.1. Source and Split</title>
        <p>
          All data originate from articles in the PubMed Central® (PMC) Open-Access subset (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, last accessed: 2025-06-01) [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The
development data correspond to an extended release of ROCOv2 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], enlarged with all papers published
between October 2022 and December 2024. Captions were only stripped of URLs and non-English
captions were dropped.
        </p>
        <p>The final dataset is split into 80 091 training, 17 277 validation and 19 267 test images (116 635 in
total).</p>
        <p>Example image caption from the dataset: "Computed tomography images after treatment. Thoracic
SMARCA4‐deficient undifferentiated tumor showing osteolytic changes in the ribs (asterisk) is noted.
However, pleural thickening (yellow arrow) disappears and pleural effusion (yellow arrowhead) decreases
in the mediastinal window setting."</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Concept Extraction</title>
        <p>
          Concepts were extracted with MedCAT [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] trained on MIMIC-III [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] and mapped to UMLS 2022AB
CUIs. Only concepts occurring at least ten times and belonging to semantically “visible” TUI groups
were kept; ambiguous or spurious concepts were merged or removed through manual curation.
        </p>
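        <p>As a rough illustration of the frequency filter described above (not the actual curation pipeline), the
following sketch keeps only CUIs that occur at least ten times across the corpus; the input mapping and
the toy CUIs are assumptions for the example.</p>
        <preformat>
from collections import Counter

def filter_rare_concepts(image_to_cuis, min_count=10):
    """Keep only CUIs that occur at least `min_count` times across the corpus.

    `image_to_cuis` maps an image identifier to the list of UMLS CUIs
    extracted from its caption (e.g., by MedCAT)."""
    counts = Counter(cui for cuis in image_to_cuis.values() for cui in cuis)
    frequent = {cui for cui, n in counts.items() if n >= min_count}
    return {
        img: [cui for cui in cuis if cui in frequent]
        for img, cuis in image_to_cuis.items()
    }

# Toy example (identifiers and CUIs are illustrative only):
annotations = {"img1": ["C0040405", "C0024485"], "img2": ["C0040405"]}
print(filter_rare_concepts(annotations, min_count=2))
        </preformat>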
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Modality and Region Concepts</title>
        <p>Each image is manually labelled with an imaging-modality concept. In addition to the five modalities
used in previous editions (X-ray, CT, MRI, ultrasound, PET/PET-CT) the 2025 corpus introduces optical
coherence tomography (OCT, CUI C0920367). OCT was annotated retrospectively for the entire
archive and prospectively for new articles.</p>
        <p>Table 2 lists the modality distribution, while Table 3 details the image retrieval in medical applications
(IRMA) region counts.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Concept Statistics</title>
        <p>Concept statistics for the released splits are summarised in Table 4 and listed below.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Released Sets</title>
        <p>• Training set: 80 091 images, 252 772 concept occurrences, 1 949 unique concepts.
• Validation set: 17 277 images, 48 761 concept occurrences, 716 unique concepts.
• Test set: 19 267 images, 24 242 concept occurrences, 702 unique concepts.
• Explainability set: 16 images (two from each modality, including two OCT cases) were selected by
a radiologist based on the clinical relevance of both the images and their corresponding captions
for manual assessment. In addition, examples of what such explanations might look like are
provided in Figure 2.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Methodology</title>
      <p>This year, the evaluation procedure was revised to reflect improved methodology and the incorporation
of new tools and metrics. As in previous editions, the subtasks were evaluated independently.</p>
      <p>In 2025, the AI4MediaBench platform (https://ai4media-bench.aimultimedialab.ro/, last accessed: 2025-06-02) by AIMultimediaLab (https://www.aimultimedialab.ro/, last accessed: 2025-06-02) was used as the challenge platform.</p>
      <p>For the concept detection subtask, the balanced precision and recall trade-off was measured in terms
of F1-scores. As in last year's edition, a secondary F1-score is computed on a manually curated subset of
concepts. On the one hand, this subset covers the different image modalities (X-ray, Angiography,
Ultrasound, CT, MRI, PET, OCT, and combined modalities such as PET/CT). On the other hand, where
applicable, it additionally includes the IRMA anatomical code for the body region examined in X-ray
images (cranium, chest, upper extremity, spine, abdomen, pelvis, and lower extremity).</p>
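      <p>A minimal sketch of how such an F1-score can be computed per image between predicted and
ground-truth CUI sets and then averaged over all images is shown below; it approximates, but is not,
the official scorer.</p>
      <preformat>
def f1_per_image(pred, gold):
    """F1 between one image's predicted and ground-truth CUI sets."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    tp = len(pred.intersection(gold))
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, references):
    """Average the per-image F1 over all test images (dict keys are image ids)."""
    return sum(
        f1_per_image(predictions.get(img, []), cuis)
        for img, cuis in references.items()
    ) / len(references)
      </preformat>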
      <p>For caption prediction, system outputs were assessed using a composite score that averages six
complementary metrics to jointly capture aspects of relevance and factuality. For each caption, the six
individual metric scores are averaged, and these per-caption scores are then averaged over all captions
to obtain the final score.</p>
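      <p>A minimal sketch of this aggregation is given below; the metric names used as dictionary keys are
illustrative, and any normalisation details of the official scorer are not reproduced.</p>
      <preformat>
import numpy as np

# Illustrative keys for the six per-caption metric values described in this section.
METRICS = ["bertscore_recall", "rouge1_f", "bleurt", "similarity",
           "umls_concept_f1", "alignscore"]

def composite_score(per_caption_scores):
    """Average the six metric values for each caption, then average over captions.

    `per_caption_scores` is a list of dicts, one per test caption."""
    per_caption = [np.mean([s[m] for m in METRICS]) for s in per_caption_scores]
    return float(np.mean(per_caption))
      </preformat>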
      <p>
        Relevance was evaluated using four different methods. The first of these is BERTScore [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a metric
that computes a similarity score for each token in the generated text with each token in the
reference text. It uses the pre-trained contextual embeddings from Bidirectional Encoder Representations
from Transformers (BERT) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]-based models and matches words by cosine similarity. In this work, the
pre-trained model microsoft/deberta-xlarge-mnli (https://huggingface.co/microsoft/deberta-xlarge-mnli,
last accessed: 2025-06-05) was used because it is the model that correlates best with human scoring
according to the authors (https://github.com/Tiiiger/bert_score, last accessed: 2025-06-05). Following
best practices for caption evaluation reported by [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we computed recall-based BERTScore with inverse document frequency (idf) weighting, using
idf scores derived from the test set to emphasize informative terms. The second metric, the ROUGE
(Recall-Oriented Understudy for Gisting Evaluation [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]) score, counts the number of overlapping units such
as n-grams, word sequences, and word pairs between the generated text and the reference. Specifically,
the ROUGE-1 (F-measure) score was calculated, which measures the number of matching unigrams
between the model-generated text and a reference. The third relevance metric, BLEURT (BiLingual
Evaluation Understudy with Representations from Transformers) [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], is designed to assess the quality
of natural language generation in English by leveraging a pre-trained model that has been fine-tuned to
emulate human judgments about the quality of the generated text. The strength of BLEURT lies in its
end-to-end training, which enables it to model human judgments effectively and makes it robust to
domain and quality variations. For this evaluation, the BLEURT-20 model was used.
      </p>
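      <p>The text-based relevance metrics can be reproduced approximately with the bert-score and rouge-score
packages, as in the hedged sketch below (BLEURT-20 additionally requires the separate bleurt package and
its checkpoint and is therefore omitted); the example captions are invented.</p>
      <preformat>
from bert_score import score as bertscore
from rouge_score import rouge_scorer

candidates = ["ct scan of the chest showing a nodule in the right upper lobe"]
references = ["computed tomography of the chest with a right upper lobe nodule"]

# Recall-oriented BERTScore with idf weights computed from the reference (test-set)
# captions, using the model named in the text.
_, recall, _ = bertscore(
    candidates, references,
    model_type="microsoft/deberta-xlarge-mnli",
    idf=True,
)
print("BERTScore recall:", recall.mean().item())

# ROUGE-1 F-measure (unigram overlap between candidate and reference).
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)
rouge1_f = scorer.score(references[0], candidates[0])["rouge1"].fmeasure
print("ROUGE-1 F:", rouge1_f)
      </preformat>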
      <p>All of the above-mentioned metrics were computed using preprocessed captions that were lowercased
and had punctuation stripped. Numeric values were replaced with the token "number." The captions
were treated as single sentences, regardless of actual sentence boundaries. This step ensures uniformity
and focuses the evaluation on linguistic content.</p>
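      <p>A minimal sketch of this normalisation step is shown below; the exact rules of the official
evaluation scripts may differ in detail.</p>
      <preformat>
import re
import string

def preprocess_caption(text):
    """Lowercase, replace numeric values with "number", strip punctuation,
    and collapse the caption into a single whitespace-normalised string."""
    text = text.lower()
    text = re.sub(r"\d+(\.\d+)?", "number", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(preprocess_caption("CT scan, 3.5 cm lesion in segment 7."))
# -> "ct scan number cm lesion in segment number"
      </preformat>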
      <p>In addition to the text-based metrics, a reference-free metric was implemented. The methodology
is based on CLIPScore [31], an innovative metric that diverges from the traditional reference-based
evaluations of image captions. Instead, it aligns with the human approach of evaluating caption quality
without references by evaluating the alignment between text and image content. The original metric
employs Contrastive Language-Image Pretraining (CLIP) [32], a cross-modal model that has been
pre-trained on a massive dataset of image-caption pairs sourced from the web. For this year’s evaluation
the MedImageInsight [33] model was used instead. It is trained using medical images with associated
text and labels from a variety of domains, including X-ray, CT, MRI, OCT, and ultrasound. The model is
used to compute similarity scores between images and text.</p>
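      <p>The reference-free scoring idea can be illustrated with any CLIP-style dual encoder; the sketch below
uses the original openai/clip-vit-base-patch32 model from Hugging Face purely as a stand-in, whereas the
official evaluation relied on MedImageInsight embeddings.</p>
      <preformat>
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_text_similarity(image_path, caption):
    """Cosine similarity between the image and caption embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((img_emb @ txt_emb.T).item())
      </preformat>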
      <p>To assess the factuality of the generated captions, two complementary metrics were employed. The
UMLS Concept F1-score evaluates the overlap of medical entities between the generated and reference
captions. Specifically, medical concepts were extracted using MedCAT [34], with a focus on semantic
types relevant to clinical accuracy as also defined for the MEDCON [35] metric; MEDCON, however,
relies on QuickUMLS [36] for concept extraction from both texts. This is followed by calculation
of the F1-score to quantify concept-level agreement. The other factuality metric, AlignScore [37],
employs a deep learning approach based on RoBERTa [38] to measure factual consistency. It involves
the decomposition of extensive texts into more manageable segments and aligning the claims in the
generated caption with the supporting evidence in the reference caption, thereby producing an average
alignment score across all claims.</p>
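      <p>The concept-extraction side of the UMLS Concept F1-score could look roughly like the sketch below,
which uses the MedCAT API with a placeholder model-pack path and an illustrative semantic-type filter;
the actual filter used for the evaluation may differ. The F1-score between the concepts of the generated
and reference captions is then computed with the same set-based formula as for concept detection above.</p>
      <preformat>
from medcat.cat import CAT

cat = CAT.load_model_pack("path/to/umls_model_pack.zip")  # placeholder path
CLINICAL_TYPE_IDS = {"T047", "T191", "T023", "T060"}      # illustrative TUIs

def extract_cuis(caption):
    """Return the CUIs of entities whose semantic types count as clinical."""
    entities = cat.get_entities(caption)["entities"].values()
    return {
        ent["cui"] for ent in entities
        if CLINICAL_TYPE_IDS.intersection(ent.get("type_ids", []))
    }
      </preformat>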
      <p>For the explainability extension, a radiologist was asked to rate both the caption and the visualisation
of each image in the explainability subset on a 1-5 Likert scale, with 5 being the best score.</p>
      <p>The captions were ranked in terms of readability, clinical appropriateness, level of detail, and focus.
The readability scale rates whether the predicted captions are readable and coherently formulated.
The clinical appropriateness evaluates whether the predicted captions match ground-truth captions
or are clinically plausible. The level of detail is used to assess whether the captions merely describe
visual findings or also interpret underlying clinical concepts. The focus validates the appropriateness of
the scope of the caption and thus penalizes short captions that lack essential observations as well as
excessively long captions that are not focused on the essentials.</p>
      <p>The visualisation was assessed based on visual-text coherence, completeness, and focus. The
visual-text coherence measures whether the visualisation is comprehensible in relation to the predicted caption. The
completeness scale assesses whether the visualisations address all relevant concepts. The focus validates
the appropriateness of the visualisation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        For the concept detection and caption prediction subtasks, Tables 5 and 6 show the best results from
each of the participating teams. The results are discussed in this section. The full lists of results are
shown in Appendix A in Tables 12, 13 and 15. Finally, Table 9 presents the results for the explainability
subtask.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Results for the Concept Detection Subtask</title>
        <p>In 2025, 9 teams participated in the concept detection subtask, submitting 51 graded runs. Table 5
presents the best result each team achieved across its submissions.</p>
        <p>
          AUEB NLP Group [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] The AUEB NLP Group based their approach on their past work, which
won the competition in several previous years but reached second place last year. The approach
combined CNNs (EfficientNet-B0 [39], DenseNet-121 [40], and ConvNeXt-Tiny [41]) with
per-label threshold optimization and ensembling strategies, including dual-threshold aggregation
and partial-intersection aggregation. The team took first place with a primary F1-score of
0.5888 and a secondary F1-score of 0.9484.
        </p>
      <p>
        DeepLens [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] The DeepLens team tackled the concept detection task with an ensemble model
pipeline which combined EfficientNet-B0 [39] and DenseNet-121 [40] under a simple union
ensemble. Both networks were optimized with the ADAM optimizer using the Binary Cross
Entropy with Logits loss function. The output layers of the models were replaced either with
a three-layer feed-forward head or a single linear classifier to fine-tune the models for
multi-label prediction. The ensemble with the best micro-F1 validation score was frozen for test
inference. This method delivered the team’s best submission, securing a primary F1-score of 0.5766
and a secondary F1-score of 0.9299, which placed second overall in the competition. Furthermore,
the DeepLens team experimented with a K-Nearest Concept-Language-Image Pre-training approach to
improve image-concept alignment in their ensemble strategy. Although it did not yield the best
quantitative results, it might hold interesting directions for future research.
      </p>
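      <p>The union-ensemble idea can be sketched as follows (this is an illustration of the principle, not the
team’s code): a concept is predicted whenever either model’s sigmoid output exceeds the threshold.</p>
      <preformat>
import numpy as np

def union_ensemble(probs_a, probs_b, threshold=0.5):
    """Union of two multi-label classifiers' thresholded sigmoid outputs.

    probs_a, probs_b: arrays of shape (num_images, num_concepts)."""
    return np.logical_or(probs_a >= threshold, probs_b >= threshold)
      </preformat>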
      <p>
        UIT-Oggy [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] For the concept detection task, the team designed MedCSRA, a novel architecture
featuring a dual-branch design that combines global semantic understanding through global
average pooling with localized class-specific residual attention (CSRA) mechanisms. Four CNN
backbones were evaluated: ResNet-101, DenseNet-121, EfficientNet-B4 and EfficientNet-B5. All
were pre-trained on ImageNet and fine-tuned for medical multi-label classification using Binary
Cross Entropy Loss. The final prediction uses a weighted combination of the outputs from the
global and CSRA branches. ResNet-101 achieved the highest F1-score of 0.5613, demonstrating
that specialized attention mechanisms can effectively identify multiple medical concepts in
biomedical images.
      </p>
      <p>
        DS4DH [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] reformulated concept detection as an image-to-sequence task to leverage transformer-based
models capable of capturing the inherent order of UMLS codes (e.g., modality before anatomy
or pathology). They proposed a compact architecture combining a convolutional neural network
to extract low-dimensional image embeddings (as small as 16 dimensions) with a lightweight
transformer decoder (1 head, 2 layers) that autoregressively generates UMLS code sequences via
cross-attention. Beam search (width = 3) was used during decoding and improved performance.
This approach achieved an F1-score of 0.5225 and a secondary F1-score of 0.8672, ranking the
team fifth and sixth, respectively. To address class imbalance, the team experimented with
focal loss, label smoothing, and pre-trained embeddings (MedCPT [42], CUI2Vec [43]), but none
outperformed their baseline model.
      </p>
      <p>
        They observed that their model tended to produce short sequences (average length 1.3 CUIs) with
low diversity (15 unique predicted CUIs), which they attributed to dataset bias toward short and
imbalanced annotations. Applying loss masking strategies during training increased the average
sequence length to 3.0 CUIs and raised diversity to 103 unique CUIs. However, this revised model
underperformed in terms of F1-score compared to their baseline submission. The team suggested
this discrepancy may result from the challenge’s F1-score evaluation design, which potentially
favors shorter CUI sequences and penalizes longer, yet possibly correct predictions not aligned
with the ground-truth test data.
      </p>
      <p>
        sakthiii [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] For the concept detection task, team sakthiii employed a MedCLIP-based transformer
model, which was pre-trained on medical image-caption pairs. In the first stage of their dual-stage
training pipeline, they fine-tuned this MedCLIP model specifically for concept detection. This
process involved training for 11 epochs with a batch size of 32, using the Adam optimizer and
a learning rate of 1e-5. The dataset for this stage consisted of radiology images paired with
UMLS concepts, allowing the model to learn the mappings between visual features and structured
medical terms. Their best model for concept detection achieved an F1-score of 0.4003 and a
secondary F1-score of 0.9082, placing them eighth in this subtask.
      </p>
      <p>
        JJ-VMed [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] The JJ-VMed team employed a fine-tuned LLaVA-LLaMA 3 8B model, processing inputs
through a CLIP ViT-Large encoder. Training used prompt-based instruction tuning, and two
output formats were explored: one generating concepts independent from the caption, while the
second embedded them within full-text captions. They achieved a primary F1-score of 0.3982 and
a secondary F1-score of 0.8329, ranking them seventh in this subtask.
      </p>
      <p>
        UMUTeam [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] Based on the captions generated by a fine-tuned BLIP model, the UMUTeam employed
named entity recognition (SciSpacy), concept retrieval (SapBERT), followed by a BERT-based
reranking classifier, to extract the medical concepts for the concept detection subtask. They
achieved an F1-score of 0.2398 with a secondary F1-score of 0.5377, putting them in eighth place,
showing that this caption-based approach is inferior to multi-label classification systems.
      </p>
      <p>
        LekshmiscopeVIT [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] Team LekshmiscopeVIT focused on a broader evaluation of different deep
learning architectures to approach the concept detection subtask. The team employed the standard
architectures InceptionV3, DenseNet, and ResNet as well as a custom approach. Randomly
initialized and ImageNet [44] pre-trained models of each of the standard architectures were
fine-tuned on the ROCOv2 dataset for 10 epochs and then compared. Part of each training
pipeline was a uniform pre-processing step during which a multi-label binarizer was applied to
create a binary label matrix for training. The team further experimented with reduction of label
space complexity by limiting predictions to the most frequent concepts. The pre-trained ResNet
approach achieved the team’s best results of 0.1494 in the primary, and 0.2298 in the secondary
F1-score.
      </p>
      <p>The Concept Detection task this year revealed several methodological trends among the participating
teams. The top-performing approaches relied on convolutional neural network (CNN) ensembles,
combining multiple pre-trained architectures, such as EfficientNet, DenseNet, and ResNet. These
ensembles used fine-tuned classification heads and per-label threshold optimization to improve
multi-label prediction accuracy. Both simple and complex ensembling techniques proved effective, suggesting
that leveraging the complementary strengths of different models remains a strong strategy.</p>
      <p>Although CNNs dominated the leaderboard, several teams explored transformer-based and generative
approaches. These included image-to-sequence formulations and vision-language models, such as
MedCLIP and LLaVA. Though these methods were less competitive in terms of F1-scores, they indicate
a growing interest in multimodal models.</p>
      <p>Lower-ranking submissions often relied on caption-based pipelines and traditional CNNs without
extensive optimization or innovative architectures. These underperformed compared to more tailored
solutions.</p>
      <p>A comparison of the 2024 and 2025 ImageCLEFmedical Concept Detection subtasks reveals a decline
in primary F1-scores across the leaderboard, suggesting that this year’s task may have been more
challenging or less suited to the models deployed.</p>
      <p>Despite this overall decline in primary performance, secondary F1-scores based on manual annotations
remained high and in some cases even improved. For example, the AUEB NLP Group, which participated
in both years, saw a drop in primary F1-score, but an increase in secondary F1-score from 0.9393 to
0.9484, reclaiming the top spot.</p>
      <p>By training and evaluating our own baseline model on this year’s data, we could determine
that about 0.1 of the difference in primary F1-score is purely due to the new test dataset, which contains
a much smaller number of unique concepts (see Table 4).</p>
      <p>The observed decline in primary F1-scores can likely be attributed to several interrelated factors
stemming from changes in the dataset. First, the slight increase in average concepts per image introduced
greater multi-label complexity, making it more difficult to make fully correct predictions under the strict
F1-score metric. Second, the broader inclusion of imaging modalities, particularly the addition of optical
coherence tomography (OCT) and expanded angiography cases, may have introduced domain shifts
that negatively affected models that were not trained or tuned on such data. Lastly, although concept
filtering improved label quality, it may also have limited the label space, penalizing over-predictive or
less conservative systems.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results for the Caption Prediction Subtask</title>
        <p>In this edition, the caption prediction subtask attracted 8 teams, which submitted 98 graded runs.
Tables 6, 7 and 8 present the results of the submissions.</p>
      <p>
        UMUTeam [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] The UMUTeam employed the BLIP [45] architecture, which consists of a ViT encoder
and a language model decoder, to generate captions for medical images. They fine-tuned a model
which performs well in general image captioning benchmarks, selecting the best model based on
the relevance metric. With a score of 0.9271 for Similarity, 0.5977 for BERTScore Recall, 0.2594
for ROUGE-1, 0.3230 for BLEURT and an overall score of 0.3432, they won the caption prediction
subtask, scoring highest in all but the BERTScore Recall and AlignScore metrics.
      </p>
      <p>
        DS4DH [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] developed multiple strategies for automatic medical image captioning. First, they
fine-tuned a Vision-Language Model (InstructBLIP-Flan-T5-XL [46]) using selective parameter freezing,
focusing training on cross-modal alignment while keeping most of the vision and language
encoders fixed. Second, they implemented a Retrieval-Augmented Generation [47] (RAG) approach
that retrieves visually similar training images and incorporates their captions into the prompt
to guide caption generation. Third, they introduced a Cluster-based RAG strategy that groups
training data by the semantic similarity of CUI codes using MedCPT [42] embeddings, enabling
hierarchical retrieval within medically relevant clusters. Finally, they trained an alignment model
(BioBart-v2-large [48]) using pairs of InstructBLIP-generated and ground-truth captions to refine
caption quality.
      </p>
      <p>
        Among all approaches, the fine-tuned InstructBLIP model achieved the highest overall score
(0.3708) and ranked first in the recall-based BERTScore metric (0.6067) among all challenge
participants. In contrast, both the alignment model and standard RAG approach underperformed,
likely due to the introduction of noisy or irrelevant information, which reflects the visual
similarity but semantic variability of radiology images. The Cluster-based RAG showed moderate
improvements over standard RAG (e.g., overall score improved from 0.3478 to 0.3620). However,
due to possible noise in predicted CUIs (F1-score = 0.5225) from the concept detection subtask, it
still fell short of InstructBLIP. On the validation dataset, Cluster RAG outperformed InstructBLIP
on several metrics when ground-truth CUIs were used. This highlights the critical importance of
accurate concept detection for precise RAG retrieval cues, because even minor inaccuracies in
CUI prediction can introduce semantic noise and significantly degrade caption quality.
      </p>
      <p>
        AI Stat Lab [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] The team developed a modular framework for medical image captioning that begins
with a two-stage preprocessing pipeline. This includes 2× super-resolution and inpainting to
eliminate bright border artifacts. A dual-encoder setup (SigLIP2 [49] + BioMedCLIP [50]) feeds
into a Q-Former [51], which generates concept-aware tokens used for both captioning and medical
concept classification. A LoRA-tuned [52] Bio-Medical LLaMA-3-8B [53] serves as the decoder.
Six model variants produce captions that are either summarized using GPT-4 [54] or reranked
using custom-designed metrics: BioMedCLIP image-text alignment, BLEURT self-consensus, and
BioBERT [55] centroid proximity. Their best submission used BioMedCLIP alignment, achieving
an overall score of 0.3229 and ranking third overall.
      </p>
      <p>
        UIT-Oggy [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] For this task, the UIT-Oggy team fine-tuned the BLIP model by using Vision
Transformer (ViT) to encode images and BERT-based text decoding to generate medical captions.
Images were preprocessed to a uniform resolution of 224×224 and captions were tokenised to a
maximum length of 200 tokens, ensuring compatibility with the vision-language model’s input
requirements. The BLIP model achieved an overall score of 0.3211 for captioning, demonstrating
the effectiveness of vision-language pre-training in adapting to the terminology and context of
the medical domain.
      </p>
      <p>
        AUEB NLP Group [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] The AUEB NLP Group’s approach to caption prediction involved seven
primary systems: a fine-tuned InstructBLIP [46] model was extended by a synthesizer and
multi-synthesizer approach, an LM-Fuser, and a Distance from Median Maximum Concept
Similarity (DMMCS) mechanism. In addition, a test-time reranker based on MedCLIP [56] and a
reinforcement learning-based Mixer were implemented. The team’s best result was reached by
the fine-tuned InstructBLIP model, which achieved an overall rating of 0.3068 and fifth rank in
the challenge.
      </p>
      <p>
        JJ-VMed [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] In the caption prediction task, JJ-VMed reused their LLaVA-LLaMA 3 model for initial
generation, followed by post-processing with LLaMA 3.1. With a score of 0.8251 for Similarity,
0.5953 for BERTScore Recall, 0.2389 for ROUGE-1, 0.3094 for BLEURT and an overall score of
0.3043, they ranked sixth in the caption prediction subtask.
      </p>
      <p>
        sakthiii [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] Following the concept detection training, the team transitioned to the caption prediction
task by reusing the same MedCLIP model weights. This second stage aimed to leverage the
semantic understanding gained during concept identification to help generate contextually relevant
textual descriptions for the images. For this task, each image was preprocessed, converted to
RGB format, and then paired with its corresponding caption from the dataset. The MedCLIP
processor and tokenization pipeline from the Transformers library were utilized to prepare these
multimodal inputs for the model. In the caption prediction task, their approach yielded scores of
0.7957 for Similarity, 0.5553 for BERTScore Recall, 0.1607 for ROUGE-1, and 0.2806 for BLEURT,
placing them eighth in this subtask as well.
      </p>
      <p>
        CS_Morgan [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] The CS_Morgan team investigated six distinct captioning pipelines by fine-tuning
three vision-language backbones (Qwen-2B, Qwen2.5-3B, and SmolVLM-500M) on the ROCOv2
dataset. They evaluated a vanilla LoRA-based adaptation (Submissions 1–3) and a
modality-conditioned variant (Submissions 4–6) in which a ResNet-50 classifier (trained from scratch on
four modalities: CT, MRI, Ultrasound, Radiograph) first predicts the image modality. During
inference, the predicted modality label is concatenated to the prompt (e.g., “CT image: [image].
Describe the medical image.”) to guide the caption generator toward modality-specific terminology.
Across these six runs, Qwen-2B achieved the highest Overall score (0.2537) when fine-tuned
without classification, while both Qwen2.5-3B and SmolVLM demonstrated improved BLEURT and
MedCAT scores under modality-conditioned prompting. This two-stage pipeline highlights that
even smaller models like SmolVLM-500M can approach mid-scale performance when provided
with structured modality cues.
      </p>
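      <p>The modality-conditioned prompting step reduces to prepending the classifier’s label to the
instruction, as in this small sketch that reuses the wording quoted above.</p>
      <preformat>
def modality_conditioned_prompt(predicted_modality):
    """Prepend the predicted modality (e.g., "CT", "MRI", "Ultrasound",
    "Radiograph") to the captioning prompt."""
    return f"{predicted_modality} image: [image]. Describe the medical image."

print(modality_conditioned_prompt("CT"))
# CT image: [image]. Describe the medical image.
      </preformat>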
      <p>Baseline For this year’s baseline models in the caption prediction subtask, we utilized off-the-shelf
vision-language models to generate appropriate captions based on the challenge images.
Specifically, we evaluated the performance of the following instruction-tuned models: Meta’s LLaMA
4 Scout (17Bx16E) Instruct [57], Google DeepMind’s Gemma 3 27B Instruct [58], and Alibaba
Cloud’s Qwen2.5-VL 32B Instruct [59]. Each model was prompted individually with the challenge
images and the following standardized instruction prompt in-context:
"You are a medical expert contributing to a peer-reviewed scientific journal. Your task
is to write a caption for a medical image, exactly as it would appear beneath a figure
in a PubMed-indexed article. Concisely describe the clinical content of the image,
identifying the imaging modality, key medical concepts, anatomical structures, visible
markings, and any relevant abnormalities or pathologies. Where appropriate, include
standard abbreviations in addition to full terms for modality, medical concepts, and
pathologies (e.g., ’magnetic resonance imaging (MRI)’). Do not include any
explanations, introductions, titles, figure numbers (e.g., ’Figure 1:’ / ’Fig 1:’), references, or
bullet points. Text only the caption."
To ensure reproducibility, we employed a deterministic decoding strategy by setting the top-k
sampling parameter to k = 1, thereby always selecting the most likely predicted token at each
step. Among the three baseline models evaluated, Meta’s LLaMA 4 Scout (17Bx16E) Instruct
model performed best, obtaining an overall challenge score of 0.3101. This result positioned it
approximately in the middle range of the submitted participant approaches. Notably, LLaMA 4
Scout achieved the highest scores in the Similarity metric (0.9369) and BLEURT metric (0.3258).</p>
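      <p>A hedged sketch of this baseline setup is shown below; it uses the smaller llava-hf/llava-1.5-7b-hf
checkpoint as a stand-in for the larger instruction-tuned models named above, abridges the instruction
prompt, and relies on greedy decoding (do_sample=False), which always takes the most likely token and is
therefore equivalent to a top-k of 1.</p>
      <preformat>
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in model, not one of the actual baselines
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

INSTRUCTION = ("You are a medical expert contributing to a peer-reviewed scientific "
               "journal. Your task is to write a caption for a medical image. "
               "Text only the caption.")  # abridged version of the prompt above

messages = [{"role": "user",
             "content": [{"type": "image"}, {"type": "text", "text": INSTRUCTION}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example_radiology_image.png").convert("RGB")  # placeholder file
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)  # greedy decoding

caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption.strip())
      </preformat>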
      <p>In the 2025 ImageCLEFmedical Caption Prediction subtask, all participating teams used
vision-language models (VLMs) as the basis for their methods, showing a clear trend of using recent advances
in multimodal architectures. Most submissions used or fine-tuned Transformer-based models, such as
BLIP, InstructBLIP, and LLaMA variants. This indicates a reliance on pretrained models with strong
image-text alignment capabilities. Several teams incorporated retrieval-augmented generation (RAG),
multi-stage pipelines, or modular architectures to improve alignment with medical content. However,
performance gains from these methods varied depending on the accuracy of supporting components,
such as concept detection systems. Additionally, some teams used post-processing strategies, such
as reranking or summarization. Despite the variety of approaches, models with direct fine-tuning on
medical data and minimal architectural complexity often outperformed more elaborate pipelines. This
result highlights the continued relevance of focused adaptation.</p>
      <p>The results of the ImageCLEFmedical 2025 Caption Prediction subtask indicate a notable shift in
evaluation priorities from general linguistic similarity toward a more balanced assessment of relevance
and clinical factuality. Teams such as UMUTeam and DS4DH exhibited strong performance across both
the relevance and factuality dimensions, outperforming several returning participants.</p>
      <p>The analysis indicates that linguistic similarity metrics, such as BERTScore and ROUGE, demonstrate
a high degree of consistency with those observed in the previous year, suggesting stable performance in
terms of surface-level textual alignment. Embedding-based similarity scores are notably elevated among
the top-performing submissions, suggesting that the generated captions may encompass semantically
relevant content that extends beyond the scope of the original reference captions. This finding suggests
a potential discrepancy between lexical overlap and underlying semantic alignment. Factuality-oriented
metrics such as UMLS Concept F1-score and AlignScore remain relatively low, underscoring the inherent
difficulty of ensuring clinical accuracy in generated captions. However, reliance on the original captions
as the sole reference may limit the effectiveness of these scores in evaluating the full range of medically
plausible outputs.</p>
      <sec id="sec-5-1">
        <title>5.3. Results for the Explainability Subtask</title>
        <p>
          This year, two teams participated in the explainability subtask. Table 9 presents the summarised results
for both teams.
        </p>
        <p>
          AUEB NLP Group [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] The AUEB NLP Group extracted UMLS concepts from the captions generated
by their fine-tuned InstructBLIP [46] model using a biomedical NER model of the ScispaCy library.
GPT-4o was used to identify bounding boxes for these concepts. The group reached the best
overall rating of 3.2 by the radiologist. However, it should be noted that the explainability
approach focuses solely on the generated captions and does not involve the black-box model
itself, which means it does not enhance the radiologist’s trust in the model’s predictions.
        </p>
        <p>
          JJ-VMed [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] For the explainability task, JJ-VMed implemented a three-phase approach: Spatial
mapping using GPT-4 and GPT-4V to link concepts and textual descriptions with image regions,
segmentation and object detection using SAM [
          <xref ref-type="bibr" rid="ref31">60</xref>
          ] (Segment Anything Model) and YOLOv8 [
          <xref ref-type="bibr" rid="ref32">61</xref>
          ],
as well as visualisation heuristics, such as arrow-following and keypoint-detection. The outputs
included bounding boxes, segmentation masks, and heatmaps. The team achieved an overall
rating of 2.6. Similar to the winning approach, this method does not incorporate the black-box
model itself, and therefore the explanations do not contribute to increasing trust in the model’s
predictions.
        </p>
        <p>
          In summary, both approaches used bounding boxes to visualise the connection between the images and
specific concepts of the captions. The JJ-VMed team also provided heatmaps. Both visualisation methods
are clinically valid. Although similar visualisation methods were used, the underlying generation techniques
differed strongly. While the AUEB NLP Group combined NER with GPT-4o to generate
bounding boxes, the JJ-VMed team combined GPT-4V models with YOLO object detection and the Segment
Anything Model (SAM) for segmentation. Both of the methods used to generate the explainability
visualisations are based on external models. These models have no direct integration with the black-box
model responsible for generating the captions. In conclusion, the visualizations do not contribute to
increasing the clinicians’ trust in the presented captioning model. More appropriate approaches for this
task would be to use attention maps [
          <xref ref-type="bibr" rid="ref33">62</xref>
          ], GradCAM [
          <xref ref-type="bibr" rid="ref34">63</xref>
          ], or Layer-wise Relevance Propagation (LRP)
[
          <xref ref-type="bibr" rid="ref35">64</xref>
          ], to generate model-intrinsic explanations that highlight the regions or features within the image
that actually influenced the captioning output, thereby providing more meaningful insights into the
model’s decision-making process.
        </p>
        <p>During the manual validation, it was found that both participating teams were generally able to
identify the imaging modality and the approximate anatomical region depicted in the images. However,
substantial limitations were observed in the accurate identification and spatial localization of anatomical
structures and pathological findings. A recurring issue across both submissions involved the inaccurate
placement, scale, and labeling of bounding boxes. Frequently, the annotations only partially covered the
target anatomical entities or failed to capture them entirely. Both teams generated syntactically coherent
and clinically plausible captions, though with notable differences in level of detail and accuracy. The
AUEB NLP Group demonstrated greater accuracy in the identification and localisation of anatomical
entities, resulting in more precise but less informative annotations. In contrast, JJ-VMed produced more
detailed and descriptive captions, albeit often based on incorrect concept detection.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The 9th edition of the ImageCLEFmedical Caption task continued its evolution with three components:
the established Concept Detection and Caption Prediction subtasks, and the promotion of Explainability
to a fully graded subtask. This year’s challenge introduced an enlarged dataset featuring the new Optical
Coherence Tomography (OCT) modality and a revised evaluation framework for captioning. The task
attracted 11 teams who submitted a total of 149 graded runs, a substantial increase in submissions
fostered by a higher run quota. Participation was balanced, with six teams entering both core subtasks,
three focusing solely on concept detection, and two on caption prediction. Two teams took on the new
explainability challenge.</p>
      <p>For the concept detection subtask, the top-performing methods continued to rely on powerful
ensembles of Convolutional Neural Networks (CNNs). However, a notable trend was the exploration of
transformer-based and generative approaches by several teams, signalling a potential shift in
methodology for future challenges.</p>
      <p>In the caption prediction subtask, a clear consensus emerged around vision-language models (VLMs),
with all teams leveraging architectures like BLIP, LLaMA, and their variants. Interestingly, direct
fine-tuning on medical data often outperformed more elaborate pipelines, such as Retrieval-Augmented
Generation (RAG), which proved sensitive to the quality of their retrieval components, highlighting the
challenge of system interdependencies.</p>
      <p>In a reversal from 2024, primary F1-scores for concept detection saw a general decline across the
leaderboard. This is attributed to the increased difficulty of the 2025 dataset, which featured new
modalities like OCT and greater multi-label complexity. Despite this, secondary F1-scores on curated
concepts remained high, indicating that models still perform robustly on core clinical findings.</p>
      <p>The introduction of a composite score for caption prediction, averaging six metrics for relevance and
factuality, successfully shifted the focus toward a more holistic evaluation. While relevance scores were
strong, factuality metrics like UMLS F1-score and AlignScore remain modest across all submissions,
underscoring that generating clinically accurate text is still the primary hurdle for the field. Notably,
an off-the-shelf LLaMA 4 Scout baseline proved competitive, establishing a strong benchmark and
demonstrating that while large foundation models are powerful, specialised fine-tuning still provides a
winning edge.</p>
      <p>Looking ahead, a primary focus for the 2026 challenge will be on advancing the maturity of the
explainability task. This year’s initial submissions relied on post-hoc visualisations generated by
external models. While a valid first step, these methods do not offer insights into the captioning model’s
internal decision-making process. Future iterations will therefore strongly encourage the development
of model-intrinsic explanations, such as attention maps or GradCAM, to foster genuine trust in the
underlying VLM. Furthermore, the 2026 edition will broaden the task’s scope and realism. The dataset
will be extended again with recent PubMed Central publications, and to address the multilinguality of
scientific literature, non-English captions will be translated and incorporated into the dataset, whereas
previously they were omitted. For images that lack a direct caption, a baseline description will be
generated from the context of the source article. The introduction of multilingual
data and a continued focus on model transparency are intended to stimulate further research toward
capable and reliable medical image understanding systems.</p>
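      <p>As a pointer toward the model-intrinsic explanations encouraged for 2026, the sketch below outlines
Grad-CAM over the last convolutional block of an image classifier: gradients of a target score are pooled
into channel weights and combined with the feature maps to localise the supporting evidence. The
torchvision backbone, random input, and target class index are placeholders; participants would adapt the
idea to their own concept-detection or captioning models.</p>
      <preformat>
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()          # placeholder backbone with untrained weights
feats, grads = [], []
layer = model.layer4                           # last convolutional block
layer.register_forward_hook(lambda m, i, o: feats.append(o))
layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

image = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed radiology image
logits = model(image)
logits[0, 281].backward()                      # placeholder target class index

weights = grads[0].mean(dim=(2, 3), keepdim=True)   # global-average-pool the gradients
cam = (weights * feats[0]).sum(dim=1)               # weighted sum over feature channels
cam = torch.clamp(cam, min=0)                       # keep only positive evidence (ReLU)
cam = cam / (cam.max() + 1e-8)                      # normalise to [0, 1] for overlaying
print(cam.shape)                                    # a 1x7x7 map to upsample onto the image
      </preformat>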
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work of Louise Bloch, Benjamin Bracke and Raphael Brüngel was partially funded by a PhD grant
from the University of Applied Sciences and Arts Dortmund (FH Dortmund), Germany. The work of
Ahmad Idrissi-Yaghir, Henning Schäfer, Tabea M. G. Pakull, Hendrik Damm, Helmut Becker, and Bahadır
Eryılmaz was funded by a PhD grant from the DFG Research Training Group 2535 Knowledge- and
data-based personalisation of medicine at the point of care (WisPerMed). This work was partly supported by
the project GRESEL-UNED PID2023-151280OB-C22 funded by MICIU/AEI/10.13039/501100011033.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling checking. After
using this service, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
    </sec>
  </body>
  <back>
  </back>
</article>