<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI Stat Lab: A Modular Framework for Clinically Accurate Medical Image Captioning Using Vision-Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yunseo Lee</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hyun Jun Kim</string-name>
          <email>hyunjun0615@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heeseung Shin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changwon Lim</string-name>
          <email>clim@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Applied Statistics, Chung-Ang University</institution>
          ,
          <addr-line>84 Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Smart Cities, Chung-Ang University</institution>
          ,
          <addr-line>84 Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Statistics and Data Science, Chung-Ang University</institution>
          ,
          <addr-line>84 Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We propose a modular framework for medical image captioning that integrates domain-adapted visual encoders, token-efficient representation via query-based compression, and post-hoc refinement. The architecture employs an ensemble of general-purpose and domain-specific vision encoders (SigLIP2 and BioMedCLIP), a Q-Former for dense concept-aware tokenization, and a LoRA-tuned Bio-Medical LLaMA-3 decoder. Auxiliary objectives guide the model to jointly predict UMLS concepts and semantic types, improving semantic grounding. At inference, captions from six independently trained variants are reranked using three complementary strategies: BioMedCLIP similarity, BLEURT scoring, and BioBERT-based centroid alignment. Evaluations on the ImageCLEF2025 Caption Prediction Task demonstrate consistent gains in semantic relevance and clinical factuality over single-encoder and non-multitask baselines. Our approach (team: AI Stat Lab, ID #1900) achieved third place with an overall score of 0.3229, corresponding to relevance and factuality scores of 0.5089 and 0.1369, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical image captioning</kwd>
        <kwd>Vision-language model</kwd>
        <kwd>Dual Encoder</kwd>
        <kwd>UMLS concepts</kwd>
        <kwd>Caption reranking</kwd>
        <kwd>GPT summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Medical image captioning, automatically generating radiologist-style descriptions from imaging studies,
has the potential to accelerate report drafting, improve content-based image retrieval, and increase
the interpretability of diagnostic AI models. Compared with natural-image captioning, the task is
complicated by grayscale modalities, subtle anatomical cues, and a highly specialized vocabulary, all of
which demand fine-grained visual reasoning and domain knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        While prior efforts have made notable progress by employing encoder-decoder frameworks trained
on paired image-text datasets, the performance of these systems is often hindered by limitations in
data quality, domain adaptability, and output reliability [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. For instance, low-resolution images
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and annotation-induced artifacts are prevalent in public medical datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], degrading model
perception. Moreover, generic vision encoders may lack the capacity to extract subtle domain-specific
features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and caption decoders often produce inconsistent or incomplete descriptions due to limited
grounding in clinical semantics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To address these limitations, we construct a modular medical
captioning framework by assembling and adapting proven techniques across the visual and language
modeling pipeline. In particular, the pre-processing stage includes resolution enhancement and visual
consistency adjustments [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. Specifically, we integrate:
1. A dual-encoder configuration using SigLIP2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and BioMedCLIP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for both general and
domain-specific feature extraction [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
2. A Query Transformer (Q-Former) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to reduce redundancy and enable concept-aware
representations,
3. A biomedical LLaMA-3 decoder [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] fine-tuned via Low-Rank Adaptation (LoRA) for efficient
adaptation, and
4. A post-hoc refinement stage that consolidates outputs from six independently trained captioning
models.
      </p>
      <p>
        This module employs GPT-4-based summarization [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and multiple reranking strategies—including
BioMedCLIP similarity, BLEURT [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] scoring, and centroid-based selection [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] using BioBERT [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]—to
generate a single, clinically coherent caption.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Medical image captioning has evolved alongside advances in vision-language modeling, primarily
following the encoder-decoder paradigm widely used in natural image captioning. Early works employed
convolutional neural networks (CNNs) as visual encoders paired with recurrent neural networks (RNNs)
or Transformer-based decoders to generate captions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, these approaches often lacked
clinical specificity, as they relied on general-purpose image features and were trained on limited or
noisy medical datasets.
      </p>
      <p>
        More recently, the integration of large-scale vision-language models (VLMs), such as BioMedCLIP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
has enabled more transferable and semantically rich representations across diverse medical imaging
modalities [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22">19, 20, 21, 22</xref>
        ]. These models, pretrained on multimodal datasets, facilitate improved
generalization to unseen clinical data with minimal supervision.
      </p>
      <p>
        Furthermore, the advent of large language models (LLMs), including GPT [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and LLaMA [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
has further advanced captioning performance by providing enhanced language fluency, contextual
reasoning, and factual alignment. Some recent systems incorporate LLMs as decoders conditioned on
image-derived embeddings or prompts, allowing for richer and more coherent textual outputs.
      </p>
      <p>
        In parallel, post-hoc refinement strategies have emerged as a practical solution for improving caption
consistency. Ensemble-based generation followed by reranking using clinical relevance metrics—such
as BERTScore [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], BLEURT [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and visual-semantic similarity—has shown promise in reducing
redundancy and hallucination. GPT-based summarization has also been explored to consolidate conflicting
candidate captions into a single coherent report.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Pre-processing</title>
        <p>Training images in ROCOv2 exhibit two systematic defects, low spatial resolution and bright border
artifacts, that degrade visual embeddings and, by extension, caption quality. We therefore apply a
two-stage pre-processing pipeline comprising super-resolution and structure-aware inpainting.</p>
        <p>
          First, we observed that 3,485 training images exhibited spatial resolutions smaller than 300 × 300
pixels. Considering the non-negligible proportion of such images and the risk of losing fine-grained
visual cues crucial for captioning, we applied 2× super-resolution to these samples. For this purpose,
we utilized the Feedback Adaptive Weighted Dense Network (FAWDN) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a recurrent convolutional
architecture equipped with a feedback mechanism and adaptive dense blocks. FAWDN progressively
refines image quality over multiple time steps by combining current inputs with hidden states from
previous iterations. The network is composed of shared input, hidden, and output units across all time
steps, and integrates an Adaptive Weighted Dense Block (AWDB) that captures multi-scale features
through a combination of 1×1 convolutional layers and dense connections. This network was selected
not only for its proven performance on diverse image datasets but also due to the availability of
pretrained models specific to the medical domain, allowing us to avoid resource-intensive training of
super-resolution models from scratch.
        </p>
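        <p>As an illustration of this screening step, the following Python sketch (not part of the released pipeline) upsamples every image whose shorter side falls below 300 pixels; the pretrained FAWDN model is abstracted behind a hypothetical fawdn_x2 callable.</p>
        <preformat>
from pathlib import Path
from PIL import Image

MIN_SIZE = 300  # images below 300x300 pixels are upsampled before encoding

def upscale_small_images(image_dir, fawdn_x2):
    """Apply 2x super-resolution to every image below the size threshold."""
    n_upscaled = 0
    for path in sorted(Path(image_dir).glob("*.jpg")):
        img = Image.open(path).convert("RGB")
        if min(img.size) &lt; MIN_SIZE:
            sr_img = fawdn_x2(img)   # hypothetical callable wrapping the pretrained FAWDN model
            sr_img.save(path)        # overwrite in place so the encoders see the enhanced image
            n_upscaled += 1
    return n_upscaled
        </preformat>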
        <p>
          Second, to address the frequent presence of white or overly bright borders in the dataset images—often
resulting from scanning artifacts or annotation overlays—we implemented a structure-aware inpainting
strategy instead of simple cropping. Specifically, we identified border regions with brightness levels
exceeding 245 within a fixed 8% margin around the image edges and applied the inpainting algorithm
introduced by Telea [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to fill these regions using nearby pixel information. Representative results
are shown in Figure 1. Unlike hard cropping, which risks discarding medically relevant content near
the periphery, this inpainting method preserves the overall anatomical integrity of each image while
eliminating non-informative border artifacts. This procedure enhances the visual consistency of inputs
and prevents the model from learning spurious cues unrelated to the actual medical content.
        </p>
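        <p>The border clean-up can be sketched as follows with OpenCV, using the 8% margin and 245 brightness threshold stated above; the inpainting radius of 3 pixels is an illustrative assumption.</p>
        <preformat>
import cv2
import numpy as np

def inpaint_bright_border(img_bgr, margin_ratio=0.08, brightness_thr=245):
    """Inpaint overly bright pixels found inside the fixed border margin (Telea's method)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    mh, mw = max(1, int(h * margin_ratio)), max(1, int(w * margin_ratio))

    # Restrict the candidate region to the outer 8% frame of the image.
    border = np.zeros((h, w), dtype=np.uint8)
    border[:mh, :] = border[-mh:, :] = 1
    border[:, :mw] = border[:, -mw:] = 1

    # Mark bright border pixels (grayscale value above the 245 threshold) for inpainting.
    mask = ((gray &gt; brightness_thr) &amp; (border == 1)).astype(np.uint8) * 255
    if mask.sum() == 0:
        return img_bgr  # nothing to clean up
    return cv2.inpaint(img_bgr, mask, 3, cv2.INPAINT_TELEA)
        </preformat>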
        <p>Together, these pre-processing steps improve the signal-to-noise ratio in the image encoder input
and help stabilize caption generation by standardizing input quality across the dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Architecture</title>
        <p>The overall architecture of our proposed medical image captioning model is illustrated in Figure 2. The
model consists of dual vision encoders, a Query Transformer (Q-Former), and a domain-adapted LLaMA
decoder, which are described in detail in the following subsections.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Dual Encoder</title>
          <p>
            To derive robust and semantically rich visual representations from medical images, we adopt an ensemble
of two vision encoders. Specifically, we utilize SigLIP2 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], a general-purpose image encoder pretrained
on large-scale natural image-text pairs, and BioMedCLIP [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], a medical-domain-specific encoder
trained on 15 million image-caption pairs mined from PubMed Central.
          </p>
          <p>
            To address the lack of medical image knowledge in the original SigLIP2 model, we perform
domain-specific pre-adaptation by fine-tuning it on the dataset provided by the ImageCLEF2025 Caption
Prediction Task [
            <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
            ]. This enhances the encoder’s ability to capture domain-relevant visual features
while preserving generalization capacity. Following standard practice, we remove the classification heads
from both encoders and extract intermediate features from their penultimate transformer layers. Let the
feature outputs from BioMedCLIP and SigLIP2 be denoted as f_bioclip ∈ ℝ^768 and f_siglip ∈ ℝ^1536,
respectively. These representations are concatenated to form a unified embedding f = [f_bioclip; f_siglip] ∈
ℝ^2304, which preserves domain-specific detail from BioMedCLIP and high-level semantics from
SigLIP2.
          </p>
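          <p>The fusion itself is a simple channel-wise concatenation, sketched below in PyTorch under the assumption that both encoders already return pooled penultimate-layer features of the stated dimensions.</p>
          <preformat>
import torch

def fuse_visual_features(biomedclip_feat, siglip2_feat):
    """Concatenate domain-specific (768-d) and general-purpose (1536-d) embeddings."""
    assert biomedclip_feat.shape[-1] == 768 and siglip2_feat.shape[-1] == 1536
    return torch.cat([biomedclip_feat, siglip2_feat], dim=-1)  # unified 2304-d embedding

# Example with a batch of 4 images
f_bio = torch.randn(4, 768)     # BioMedCLIP penultimate-layer features
f_sig = torch.randn(4, 1536)    # SigLIP2 penultimate-layer features
f = fuse_visual_features(f_bio, f_sig)  # shape (4, 2304)
          </preformat>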
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Query Transformer (Q-Former)</title>
          <p>
            To reduce redundancy and computational burden, we apply a Q-Former [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] that projects the
high-dimensional visual embedding f ∈ ℝ^2304 into a fixed number of informative latent tokens. The
visual feature f is broadcast across a learnable set of query tokens, resulting in a sequence input
X ∈ ℝ^(32×2304). The Q-Former consists of six transformer layers with cross-attention modules that
allow each query token to selectively attend to parts of the visual input. The output of the Q-Former is
denoted as Z ∈ ℝ^(32×4096). This output is used for both caption generation and auxiliary concept
classification.
          </p>
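          <p>A highly simplified, illustrative Q-Former-style module is sketched below; the actual component follows the BLIP-2 Q-Former design, and the head count, initialization, and feed-forward width shown here are assumptions.</p>
          <preformat>
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """Sketch: 32 learnable queries, six cross-attention layers, 2304-d in, 4096-d out."""
    def __init__(self, vis_dim=2304, out_dim=4096, n_queries=32, n_layers=6, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, out_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, out_dim)   # project visual features to query width
        self.layers = nn.ModuleList()
        for _ in range(n_layers):
            self.layers.append(nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(out_dim, n_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(out_dim, 4 * out_dim), nn.GELU(),
                                     nn.Linear(4 * out_dim, out_dim)),
                "norm1": nn.LayerNorm(out_dim),
                "norm2": nn.LayerNorm(out_dim),
            }))

    def forward(self, vis_feat):                      # vis_feat: (B, 2304) fused encoder output
        kv = self.vis_proj(vis_feat).unsqueeze(1)     # (B, 1, 4096) visual key/value
        z = self.queries.unsqueeze(0).expand(vis_feat.size(0), -1, -1)  # (B, 32, 4096)
        for layer in self.layers:
            attn_out, _ = layer["cross_attn"](z, kv, kv)  # queries attend to the visual input
            z = layer["norm1"](z + attn_out)
            z = layer["norm2"](z + layer["ffn"](z))
        return z                                      # (B, 32, 4096) prefix tokens for the LLM
          </preformat>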
          <p>
            To enhance medical grounding, we incorporate a multitask classification objective [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ]. The output
Z ∈ ℝ^(32×4096) is mean-pooled across the query dimension to produce a global representation
z̄ ∈ ℝ^4096. This representation is passed through two linear classifiers: one to predict concept
presence among 2,478 Concept Unique Identifiers (CUIs), and another to predict 21 coarse concept
types. The overall loss function is a weighted combination of the captioning loss and classification loss:
ℒ_total = ℒ_caption + λ · ℒ_cls,
where ℒ_caption is the cross-entropy loss over caption tokens, and the auxiliary term ℒ_cls uses the
multilabel margin loss. This multi-task setup improves alignment between the generated captions and clinical
concepts visually present in the input image.
          </p>
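          <p>The loss combination can be sketched as follows; a multilabel soft-margin loss is used in the sketch as a compact stand-in for the multilabel margin loss, and the multi-hot target construction is assumed to be handled elsewhere.</p>
          <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CUIS, NUM_TYPES, LAMBDA = 2478, 21, 0.1

cui_head = nn.Linear(4096, NUM_CUIS)    # concept-presence classifier
type_head = nn.Linear(4096, NUM_TYPES)  # coarse semantic-type classifier

def multitask_loss(caption_logits, caption_labels, z, cui_targets, type_targets):
    """caption_logits: (B, T, V); z: (B, 32, 4096) Q-Former output; targets are multi-hot."""
    z_bar = z.mean(dim=1)  # mean-pool over the 32 query tokens
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_labels.reshape(-1), ignore_index=-100)
    # Soft-margin stand-in for the multilabel margin loss used in the paper.
    cls_loss = (F.multilabel_soft_margin_loss(cui_head(z_bar), cui_targets)
                + F.multilabel_soft_margin_loss(type_head(z_bar), type_targets))
    return caption_loss + LAMBDA * cls_loss
          </preformat>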
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Caption Decoder</title>
          <p>
            For the caption generation task, we adopt Bio-Medical LLaMA-3-8B [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], a domain-specialized variant of
Meta-Llama-3-8B-Instruct [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ], as the language decoder. The model has been fine-tuned on BioMedData,
a high-quality biomedical dataset containing over 500,000 entries. The dataset comprises a blend
of synthetic and manually curated samples, enabling robust generalization across a wide range of
biomedical contexts. During training, the 32 Q-Former tokens are inserted as prefix embeddings that
condition every decoding step on visual evidence. To enable efficient fine-tuning, we incorporate LoRA
[
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] modules into the decoder. This allows the model to adapt to medical image captioning tasks with
minimal parameter updates while preserving the core language modeling capabilities of LLaMA.
          </p>
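          <p>A hedged sketch of the LoRA insertion using the PEFT library is shown below; the rank, scaling factor, and target modules are illustrative rather than the exact values used in our experiments.</p>
          <preformat>
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Domain-specialized decoder referenced in the text (ContactDoctor Bio-Medical LLaMA-3-8B).
decoder = AutoModelForCausalLM.from_pretrained("ContactDoctor/Bio-Medical-Llama-3-8B")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                    # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
    task_type="CAUSAL_LM",
)
decoder = get_peft_model(decoder, lora_cfg)
decoder.print_trainable_parameters()  # only the low-rank adapters are updated
          </preformat>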
        </sec>
        <sec id="sec-3-2-4">
          <title>3.2.4. Model Variants for Ensemble</title>
          <p>
            To improve caption diversity and stabilize final output quality, we trained six independently
parameterized captioning models under varying training configurations. All models share the same core
architecture consisting of a Q-Former module and a LLaMA-based language decoder, but differ in their
visual encoder types and auxiliary training settings. Specifically, we constructed two models each for
three encoder configurations: (1) using BioMedCLIP alone, (2) using SigLIP2 alone, and (3) using a
dual-encoder setup that concatenates both BioMedCLIP and SigLIP2. Within each encoder group, one
model was trained with auxiliary concept classification (predicting UMLS concepts and types [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ])
and one without it. These six models generate diverse caption candidates for each image, forming the
foundation for our post-processing pipeline described in the next subsection.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Post-processing</title>
        <p>To further refine the raw captions generated by our six independently trained captioning models, we
applied post-processing strategies aimed at improving both clinical coherence and factual relevance.
This section presents two major post-processing components: (1) summarization-based refinement
using GPT APIs and (2) candidate caption reranking based on semantic and domain-specific metrics.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Summarization-based Refinement</title>
          <p>
            We employed two GPT-4-based summarization [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] strategies to consolidate the six candidate
captions—each produced by a different model—into a single, medically accurate sentence. Both approaches
aimed to improve readability, reduce redundancy, and ensure consistency with structured medical
knowledge. The exact prompts used for each summarization method are provided in Table 1 below.
          </p>
          <p>
            The Chain-of-Thought prompt reads: "You are a board-certified radiologist. TASK: 1. Parse EACH caption and list by line: &lt;MODALITY&gt;, &lt;ANATOMIC_SITE&gt;, &lt;PATHOLOGIES&gt;, etc. 2. Build a CONSENSUS table of token frequency. 3. Resolve conflicts by majority vote or keep the longer/more specific one. 4. Compose ONE radiology-style sentence (35–45 words) that retains exact terms from the table; concatenates modality → site → key findings → clinical context; uses 'shows' or 'demonstrates' and avoids headings; omits absent content. OUTPUT: FINAL_CAPTION: &lt;your summary&gt; CAPTIONS: caption 1: {caption1} … caption 6: {caption6}"
          </p>
          <p>
            The Prompt-guided prompt reads: "You are a radiologist summarizing multiple captions of a medical image into ONE detailed sentence. Integrate the imaging modality, anatomical location, pathological findings, and specific clinical details. Use medically correct, extractive phrasing that maximizes token overlap; avoid paraphrasing unless synonymous medical terminology improves clarity. Use present continuous tense with a subject-predicate-object structure. Keep the summary natural, clinically accurate, and around 40 words, allowing slight variation if shorter or longer improves clarity. If captions contain inconsistencies, prioritize findings with the highest diagnostic or therapeutic relevance. HERE are the captions: caption 1: {caption1} … caption 6: {caption6}"
          </p>
          <p>
            Prompt-guided Summarization The six caption candidates, one from each captioning model, were
aggregated and fed into a standardized GPT-4 prompt. The prompt requested a concise and clinically
coherent summary under the assumption that these captions describe the same medical image. This
helped filter out redundant or inconsistent information and unify expression styles across captions.
Chain-of-Thought Summarization In this variant, the prompt instructed the model to generate
step-by-step reasoning [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ] before concluding the final summary. The intent was to increase factual
consistency by encouraging the model to align each summary point with the underlying clinical evidence
extracted from input captions. Empirically, this strategy improved alignment with structured medical
entities.
          </p>
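          <p>A minimal sketch of such a summarization call (OpenAI Python SDK v1 style) is shown below; the model identifier and sampling temperature are assumptions, and the full prompts of Table 1 would be passed as the system message.</p>
          <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_captions(captions, system_prompt):
    """Consolidate six candidate captions into one caption via a GPT-4-class model."""
    caption_block = "\n".join(f"caption {i + 1}: {c}" for i, c in enumerate(captions))
    response = client.chat.completions.create(
        model="gpt-4o",        # assumption: any GPT-4-class chat model
        temperature=0.2,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "HERE are the captions:\n" + caption_block},
        ],
    )
    return response.choices[0].message.content.strip()
          </preformat>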
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Caption Reranking</title>
          <p>
            To select the most appropriate caption among the generated candidates, we implemented a reranking
module based on three different metrics: BioMedCLIP image-text alignment, BLEURT self-consensus, and
BioBERT centroid proximity. The overall framework of these reranking strategies is illustrated in
Figure 3.
          </p>
          <p>
            BioMedCLIP image-text alignment Each caption c_i is embedded into a text vector t_i with the BioMedCLIP
[
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] text encoder, and the corresponding image I is embedded into w using the image encoder. Each candidate
is scored by the cosine similarity cos(t_i, w), and the caption whose text embedding is most similar to the
image embedding is selected.
          </p>
          <p>
            BLEURT self-consensus BLEURT [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] estimates sentence quality via a regression head over
BERT-style embeddings. For caption c_i among n candidates, we compute the leave-one-out average
score(c_i) = (1/(n − 1)) Σ_{j≠i} BLEURT(c_i, c_j),
which rewards captions that are semantically central to the hypothesis set and thus robust to outliers.
The final caption is selected by maximizing this self-consistency score: ĉ = arg max_i score(c_i).
          </p>
          <p>
            BioBERT centroid proximity All captions are embedded via BioBERT [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] as vectors v_1, . . . , v_n. The centroid
v̄ = (1/n) Σ_i v_i
represents the consensus semantic position. Each caption is then ranked by its Euclidean distance to the
centroid [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], and the final caption is ĉ = arg min_i ‖v_i − v̄‖.
          </p>
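          <p>The BLEURT self-consensus and centroid-proximity selections reduce to a few lines of NumPy once the pairwise BLEURT scores and BioBERT sentence embeddings have been computed, as the following sketch illustrates.</p>
          <preformat>
import numpy as np

def bleurt_self_consensus(pairwise_scores):
    """pairwise_scores[i, j] = BLEURT(candidate i as prediction, candidate j as reference)."""
    n = pairwise_scores.shape[0]
    mask = ~np.eye(n, dtype=bool)
    loo_avg = (pairwise_scores * mask).sum(axis=1) / (n - 1)  # leave-one-out average per caption
    return int(np.argmax(loo_avg))                            # most self-consistent caption

def centroid_proximity(embeddings):
    """embeddings: (n, d) BioBERT sentence vectors for the n candidate captions."""
    centroid = embeddings.mean(axis=0)                        # consensus semantic position
    dists = np.linalg.norm(embeddings - centroid, axis=1)     # Euclidean distance to the centroid
    return int(np.argmin(dists))                              # caption closest to the consensus
          </preformat>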
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setups</title>
        <p>
          Dataset We conduct experiments on the extended version of the ROCOv2 dataset[
          <xref ref-type="bibr" rid="ref28 ref35">28, 35</xref>
          ], specifically
curated for the ImageCLEFmedical 2025 Caption Prediction Task [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. Unlike the original ROCOv2 [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ],
this updated release includes additional manual annotations as well as a newly introduced test set for
the 2025 challenge. The dataset configuration differs from prior versions: the previous test set from
ROCOv2 has been reassigned as the validation set, and the prior validation set has been merged into
the training set. The newly collected 2025 test set contains unseen images to evaluate generalization
performance under updated task conditions. The resulting splits comprise 80,091 images for training,
17,277 for validation, and 19,267 for testing. Each image is associated with a manually curated caption
and UMLS concepts, making it suitable for both generation and concept detection tasks.
Evaluation Metrics Model performance is evaluated according to the official challenge protocol
using six metrics that assess both relevance and factuality. Relevance is assessed using BERTScore (Recall
with IDF), ROUGE-1 (F1), BLEURT, and Image-text Similarity; BERTScore is computed with the
microsoft/deberta-xlarge-mnli model using IDF scores derived from the test set, and BLEURT uses the
recommended BLEURT-20 checkpoint. Image-text Similarity is evaluated by independently extracting
embedding vectors for the image and its corresponding caption using the MedImageInsight [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] model,
followed by computing their cosine similarity. All relevance metrics are calculated on lowercase,
punctuation-free captions with numbers replaced by the token “number.” For factuality, UMLS Concept
F1 is computed using MedCAT and semantic type filtering via QuickUMLS, and AlignScore is used to
measure information consistency between predicted and reference captions based on RoBERTa-base
alignment. All scores are averaged over the entire test corpus.
        </p>
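        <p>For reference, the caption normalization applied before the relevance metrics (lowercasing, punctuation removal, and number masking) can be sketched as follows; the organizers' exact implementation may differ in detail.</p>
        <preformat>
import re
import string

def normalize_caption(text):
    """Lowercase, replace numbers with the token 'number', and strip punctuation."""
    text = text.lower()
    text = re.sub(r"\d+(\.\d+)?", "number", text)  # digits (incl. decimals) become "number"
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize_caption("CT scan shows a 3.5 cm lesion in the left lobe."))
# prints: ct scan shows a number cm lesion in the left lobe
        </preformat>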
        <p>Model Settings Our system utilizes either BioMedCLIP or SigLIP2 as standalone vision encoders, or
their ensemble via channel-wise feature concatenation. The language decoder is Bio-Medical
LLaMA-3-8B, a domain-specific large language model. Visual features are processed by a 6-layer Q-Former with 32
learnable query tokens. The Q-Former maps from a 2304-dim input (concatenated encoder outputs) to
4096-dim embeddings compatible with the LLM. The model optionally includes auxiliary classification
heads that predict 2,478 UMLS concept labels and 21 coarse types. The total loss is computed as a
weighted sum of the captioning loss and the concept classification loss, where the weighting factor λ is
empirically set to 0.1.</p>
        <p>
          The model was trained using the AdamW optimizer, with the learning rate linearly increased to 1e-4
during the first epoch and annealed to 1e-6 over a total of 10 epochs. Training was conducted on an
NVIDIA H100 GPU with a batch size of 16 and a gradient accumulation step of 2. During inference, we
employed beam search decoding with a beam width of 3, a repetition penalty of 2.5, a length penalty of
2.0, and a minimum and maximum output length of 8 and 64 tokens, respectively.
Image and Text Pre-processing To mitigate quality degradation caused by low-resolution inputs
(&lt;300×300) and overly bright borders, we implemented a two-stage pre-processing pipeline comprising
FAWDN-based 2× super-resolution and structure-aware inpainting described in Section 3.1. In addition, we
experimented with applying a Gaussian filter during image pre-processing and GPT-based back-translation
[
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] (English → Korean → English) for text augmentation; however, neither approach yielded notable
performance improvements.
        </p>
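        <p>The optimizer schedule and decoding settings described above can be summarized in the following sketch; steps_per_epoch is a placeholder, the linear schedule decays toward zero rather than exactly 1e-6, and the minimum and maximum lengths are assumed to refer to newly generated tokens.</p>
        <preformat>
import torch
from transformers import GenerationConfig, get_linear_schedule_with_warmup

# Beam-search decoding settings used at inference.
gen_cfg = GenerationConfig(num_beams=3, repetition_penalty=2.5, length_penalty=2.0,
                           min_new_tokens=8, max_new_tokens=64)

# AdamW with linear warm-up over the first epoch, then linear decay over 10 epochs.
# A dummy parameter is used here so the snippet runs standalone.
steps_per_epoch, epochs = 1000, 10       # placeholder step count
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-4)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=steps_per_epoch,
                                            num_training_steps=epochs * steps_per_epoch)
        </preformat>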
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Base Model Results</title>
        <p>We conducted ablation experiments to evaluate the impact of the dual encoder architecture and auxiliary
classification tasks on medical image captioning performance. The detailed performance comparison
across model variants is presented in Table 2. Compared to single-encoder baselines using either
BioMedCLIP (#1405) or SigLIP2 (#1407), the dual encoder model (#1673), which concatenates both
encoders along the channel dimension, consistently outperformed in terms of both relevance and factual
accuracy. On the test set, this model improved ROUGE-1 and BLEURT scores by up to +0.0107 and
+0.0081, respectively, while UMLS F1 increased by as much as +0.0119, demonstrating the effectiveness
of combining domain-specific and general-purpose visual representations.</p>
        <p>Building upon this, we introduced auxiliary classification heads for predicting UMLS concepts and
semantic types. Compared to the base dual encoder (#1673), the model with concept prediction (#1695)
achieved further gains across all major metrics, including an additional +0.0062 in ROUGE-1 and
+0.0048 in UMLS F1. These improvements underscore the value of explicitly modeling medical concepts,
which enhances the factual grounding of generated captions without sacrificing fluency. Among all
configurations, the dual encoder with auxiliary classification (#1695) achieved the strongest overall
performance, ranking first in four out of six evaluation metrics. These findings validate our architectural
choices: integrating heterogeneous visual features through a dual encoder and reinforcing clinical
relevance through concept-aware auxiliary tasks. Together, these components contribute to generating
more accurate, informative, and clinically coherent medical image captions.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Post-Processing Results</title>
        <p>We evaluate three caption reranking strategies, namely BioMedCLIP-base, BLEURT-base, and
BioBERT-base, by selecting the best caption among candidates generated from six base models. All three reranking
methods improve upon the base model outputs, confirming that post-processing plays a critical role in
enhancing caption quality. Among the methods, BLEURT-based reranking (#1965) achieved the highest
cosine similarity (0.9008) and BLEURT score (0.3186), while maintaining strong results across ROUGE
(0.2397) and UMLS F1 (0.1486). This suggests that selecting internally consistent captions—those that
align with the majority of candidate hypotheses—enhances both fluency and factual alignment.</p>
        <p>BioMedCLIP-based reranking (#1900) prioritized visual-semantic grounding, yielding the highest
ROUGE (0.2440) and competitive scores in ALIGN (0.1231) and UMLS F1 (0.1524). In contrast,
BioBERT-based reranking (#1944) produced the highest BERTScore (0.5854), along with balanced performance
across all metrics and the strongest UMLS F1 (0.1536). These results demonstrate that reranking not only
improves overall caption quality but also enables fine-grained control depending on whether fluency,
alignment, or clinical factuality is prioritized.</p>
        <p>In addition to reranking methods, GPT-4-guided summarization was assessed as an alternative
post-processing strategy. Despite the intuitive appeal of aggregating multiple candidate captions
into a single concise output, these summarization strategies underperformed relative to reranking in
our quantitative assessments. Specifically, both the prompt-guided and chain-of-thought prompting
approaches frequently exhibited reduced precision and occasionally introduced clinically irrelevant
or hallucinated content. These shortcomings were particularly evident in factual grounding metrics
such as ALIGN and UMLS F1, indicating that generative summarization may abstract away or omit key
clinical entities during compression. A detailed comparison of these limitations is provided in Table 4.</p>
        <p>In summary, each reranking strategy exhibits distinct advantages: BLEURT-base excels in fluency
and self-consistency, BioMedCLIP-base in vision-language alignment, and BioBERT-base in semantic
grounding and metric balance. These results highlight that the choice of reranking method should
depend on the specific priorities of the medical captioning application. Given that both relevance and
factuality were key evaluation metrics in the ImageCLEF2025 challenge, the post-processing approach
based on image-text alignment using BioMedCLIP achieved the highest performance. This submission,
made under the team name AI Stat Lab with submission ID #1900, achieved an overall score of 0.3229
and ranked third on the official leaderboard.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We introduced a modular medical-image captioning pipeline that unifies three components: (i) dual
vision encoders (BioMedCLIP + SigLIP2) to fuse domain-specific and general visual knowledge, (ii)
a multitask loss that aligns captions with 2,478 UMLS concepts and 21 semantic types, and (iii) a
metric-aware reranker that selects the most faithful hypothesis among six candidates. Specifically, our
best submission (#1900) was constructed by applying BioMedCLIP-based reranking to the pool of six
candidates generated from six base models (submissions #1405, #1407, #1673, #1693, #1694, and #1695,
in Table 2).</p>
      <p>On the ROCOv2 benchmark our system surpasses single-encoder and concept-agnostic baselines
on every shared-task metric, BERTScore, ROUGE-1, BLEURT, ALIGN, and UMLS-F1, demonstrating
simultaneous gains in linguistic relevance and clinical factuality. Among the post-processing strategies,
BioBERT-centric reranking achieves the best harmonic mean of relevance and factuality, whereas
BioMedCLIP-based scoring offers the highest image-text alignment, highlighting a trade-off that can be
tuned to downstream needs.</p>
      <p>These results validate two key insights: (1) heterogeneous visual encoders supply complementary
features that improve descriptive richness, and (2) explicit concept supervision curbs hallucinations
and improves diagnostic grounding. The proposed pipeline establishes a strong baseline for upcoming
CLEF-Cap tasks and paves the way for future work on longitudinal captioning, device detection, and
lightweight on-device deployment in clinical settings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the National Research Foundation of Korea (NRF) grant funded by the
Korea government (MSIT) (RS-2024-00360176).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used OpenAI GPT-4o in order to: refine writing
style, reorganize paragraph structure, and assist with technical language formulation. After using this
tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>Appendix A. Summarization-based Refinement Results</title>
      <p>While summarization-based refinement using GPT models provides a promising approach for
aggregating multiple candidate captions into a single concise output, as summarized in Table 4, our experiments
reveal that this strategy underperforms compared to reranking-based methods in terms of factual
alignment and clinical adequacy.</p>
      <p>The best summarization method — CoT-based refinement (#1938) — achieves a BERTScore of 0.5705,
BLEURT of 0.3197, ALIGN of 0.0843, and UMLS F1 of 0.1236. In comparison, the BioMedCLIP-based
reranking method (Table 3, #1900) outperforms it across all key metrics, including a higher BERTScore.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Beddiar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oussalah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seppänen</surname>
          </string-name>
          ,
          <article-title>Automatic captioning for medical imaging (mic): a rapid review of literature</article-title>
          ,
          <source>Artificial intelligence review 56</source>
          (
          <year>2023</year>
          )
          <fpage>4019</fpage>
          -
          <lpage>4076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alzubaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>242</volume>
          (
          <year>2024</year>
          )
          <fpage>122807</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Limbu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          , Medblip:
          <article-title>Fine-tuning blip for medical image captioning</article-title>
          ,
          <source>arXiv preprint arXiv:2505.14726</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guan</surname>
          </string-name>
          , M. Liu,
          <article-title>Domain adaptation for medical image analysis: a survey</article-title>
          ,
          <source>IEEE Transactions on Biomedical Engineering</source>
          <volume>69</volume>
          (
          <year>2021</year>
          )
          <fpage>1173</fpage>
          -
          <lpage>1185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Umirzakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. U.</given-names>
            <surname>Khan</surname>
          </string-name>
          , T. Whangbo,
          <article-title>Medical image super-resolution for smart healthcare applications: A comprehensive survey</article-title>
          ,
          <source>Information Fusion</source>
          <volume>103</volume>
          (
          <year>2024</year>
          )
          <fpage>102075</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pérez-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bond-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bouzid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Salvatelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ilse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bannur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaighofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lungren</surname>
          </string-name>
          , et al.,
          <article-title>Exploring scalable medical image encoders beyond text supervision</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Dillman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Radclip: Enhancing radiologic image analysis through contrastive language-image pretraining</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaliosis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Charalampakos</surname>
          </string-name>
          , G. Moschovis,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>A data-driven guided decoding mechanism for diagnostic captioning</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7450</fpage>
          -
          <lpage>7466</lpage>
          . URL: https://aclanthology.org/2024.findings-acl.444/. doi:10.18653/v1/2024.findings-acl.444.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anisetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A trusted medical image super-resolution method based on feedback adaptive weighted dense network</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>106</volume>
          (
          <year>2020</year>
          )
          <fpage>101857</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Telea</surname>
          </string-name>
          ,
          <article-title>An image inpainting technique based on the fast marching method</article-title>
          ,
          <source>Journal of graphics tools 9</source>
          (
          <year>2004</year>
          )
          <fpage>23</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschannen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gritsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Naeem</surname>
          </string-name>
          , I. Alabdulmohsin,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          , et al.,
          <article-title>Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features</article-title>
          ,
          <source>arXiv preprint arXiv:2502.14786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bagga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Valluri</surname>
          </string-name>
          , et al.,
          <article-title>Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs</article-title>
          ,
          <source>arXiv preprint arXiv:2303.00915</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Sam-guided enhanced fine-grained encoding with mixed semantic learning for medical image captioning</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1731</fpage>
          -
          <lpage>1735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>ContactDoctor, ContactDoctor-Bio-Medical: A High-Performance Biomedical Language Model</article-title>
          , https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-8B,
          <year>2024</year>
          . Accessed: 2025-06-16.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Canny,</surname>
          </string-name>
          <article-title>Ic3: Image captioning by committee consensus</article-title>
          ,
          <source>arXiv preprint arXiv:2302.01328</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Das</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          <string-name>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>Bleurt: Learning robust metrics for text generation</article-title>
          ,
          <source>arXiv preprint arXiv:2004.04696</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lamsiyah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mahdaouy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Espinasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E. A.</given-names>
            <surname>Ouatik</surname>
          </string-name>
          ,
          <article-title>An unsupervised method for extractive multi-document summarization based on centroid approach and sentence embeddings</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>167</volume>
          (
          <year>2021</year>
          )
          <fpage>114152</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Boag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-H.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jindi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McDermott</surname>
          </string-name>
          ,
          <article-title>Publicly available clinical BERT embeddings</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2nd Clinical Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota, USA,
          <year>2019</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>78</lpage>
          . URL: https://aclanthology.org/W19-1909/. doi:10.18653/v1/W19-1909.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A pubmedbert-based classifier with data augmentation strategy for detecting medication mentions in tweets</article-title>
          ,
          <source>arXiv preprint arXiv:2112.02998</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Biogpt: generative pre-trained transformer for biomedical text generation and mining</article-title>
          ,
          <source>Briefings in bioinformatics</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>bbac409</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>Gpt-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          , in: ICLR,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Seyhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Sokucu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gunluoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Veske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altin</surname>
          </string-name>
          ,
          <article-title>Primary pulmonary synovial sarcoma: a very rare presentation</article-title>
          ,
          <source>Case Reports in Pulmonology</source>
          <year>2014</year>
          (
          <year>2014</year>
          )
          <fpage>537618</fpage>
          -
          <lpage>537618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>O. N.</given-names>
            <surname>Al Mulhim</surname>
          </string-name>
          ,
          <article-title>Huge thoracic aortic aneurysm presenting with jaundice: A case report</article-title>
          ,
          <source>Vascular Health and Risk Management</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEFmedical 2025 - medical concept detection and interpretable caption generation</article-title>
          , in:
          <source>CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-D.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025)</source>
          , Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hirsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dawidowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tal</surname>
          </string-name>
          ,
          <article-title>Medrat: Unpaired medical report generation via auxiliary tasks</article-title>
          , in:
          <source>European Conference on Computer Vision</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <article-title>The llama 3 herd of models</article-title>
          ,
          <source>arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Lora: Low-rank adaptation of large language models</article-title>
          ,
          <source>ICLR</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          ,
          <article-title>The unified medical language system (umls): integrating biomedical terminology</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>32</volume>
          (
          <year>2004</year>
          )
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology objects in context version 2, an updated multimodal image dataset</article-title>
          ,
          <source>Scientific Data</source>
          <volume>11</volume>
          (
          <year>2024</year>
          ).
          doi:10.1038/s41597-024-03496-6.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santamaria-Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guyman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sangani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Medimageinsight: An open-source embedding model for general domain medical imaging</article-title>
          ,
          <source>arXiv preprint arXiv:2410.06542</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Improving neural machine translation models with monolingual data</article-title>
          ,
          <source>arXiv preprint arXiv:1511.06709</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>