<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computer Vision</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1023/B3AVISI.0000029664.99615.94</article-id>
      <title-group>
        <article-title>JJ-VMed: A Framework for Automated Concepts, Captions and Explainability of Medical Image</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johanna Angulo</string-name>
          <email>johanna.angulo@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jenny Aguilar</string-name>
          <email>contact@jaguilarweb.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Science, Engineering and Design, Universidad Europea de Valencia</institution>
          ,
          <addr-line>Paseo de la Alameda, 7, 46010 Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>60</volume>
      <issue>2004</issue>
      <fpage>91</fpage>
      <lpage>110</lpage>
      <abstract>
        <p>This paper presents JJ-VMed, an experimental multimodal computational framework developed by Jaimage addressing three distinct tasks within the ImageCLEFmedical Caption challenge: concept detection, caption generation, and explainability analysis for medical imaging. The system implements separate but interconnected methodological approaches tailored to each task's specific requirements. For concept detection and caption generation, the framework employs a systematic four-phase approach utilizing a fine-tuned LLaVA-LLaMA 3 8B model. The methodology incorporates Spanish prompting to enhance multilingual capabilities and cross-linguistic robustness, followed by comprehensive preprocessing, LoRA-based fine-tuning, and systematic post-processing validation. An additional LLaVA-Mistral 7B model was developed with English prompts to address identified limitations, though temporal constraints prevented its full deployment. The explainability task implements a distinct multi-stage pipeline designed to provide visual grounding for AI-generated medical content. This experimental approach utilizes the concepts and captions from the LLaVA-LLaMA 3 8B model as foundational input, with LLaMA 3.1 merging these outputs to generate separate textual explanations that provide additional context. The pipeline subsequently employs GPT-4o and GPT-4.1 APIs for spatial coordinate mapping, attempting to establish connections between textual explanations and visual features. The final stage implements the Segment Anything Model (SAM) for generating segmentation masks, supplemented by heatmap-based confidence scoring and computer vision techniques including keypoint detection. The framework generates medical image analyses featuring visual evidence intended to support the generated explanations, constituting a post-hoc approach to medical AI interpretability. 
This exploratory methodology represents an attempt to contribute to the ongoing research addressing explainable artificial intelligence (XAI) requirements in the medical domain. Performance evaluation revealed moderate results across tasks: concept detection achieved an F1 score of 0.3982, caption generation obtained an overall score of 0.3043, while the explainability system demonstrated technical feasibility despite not fully meeting challenge objectives. The methodology illustrates both the potential and limitations of current approaches to medical AI interpretability, highlighting areas requiring continued research and development for clinical implementation.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Caption</kwd>
        <kwd>Multimodal AI</kwd>
        <kwd>AI Explainability</kwd>
        <kwd>GenAI</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>ImageCLEFmedical</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        JJ-VMed, developed by Jaimage, represents an experimental multimodal system for the ImageCLEF 2025
challenge, specifically addressing the ImageCLEFmedical Caption task and its associated evaluation
components [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This challenge advances multimodal understanding in the medical domain by requiring
participants to detect key medical concepts within images, generate clinically accurate captions, and
provide transparent explanations that illuminate the model’s diagnostic reasoning processes. Our
approach centers on Large Language and Vision Assistant (LLaVA)-LLaMA 3 8B, a vision-language
model that serves as the foundational architecture for both caption generation and concept detection
tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We complement this core system with a post-hoc explainability framework that integrates
GPT4’s multimodal capabilities alongside segmentation models, computer vision techniques and specialized
medical natural language processing pipelines. This integrated approach enables the system to generate
captions and concept labels for radiology images while attempting to provide visual grounding through
image region highlighting that corresponds to generated textual content, thereby offering visual evidence
supporting the model’s outputs. It is important to clarify that our explainability approach does not
employ established XAI methods such as SHAP, LIME, GradCAM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or attention maps, which directly
analyze model internal states. Instead, we implement an experimental post-hoc verification system
that independently attempts to validate and visualize the spatial correspondence between generated
text and image content. This methodology represents an exploratory contribution toward enhancing
trust in medical AI systems through external verification rather than intrinsic model explanation. This
paper presents a systematic examination of our experimental system design and empirical results,
encompassing the underlying techniques, fine-tuning methodologies, post-hoc verification strategies,
and performance evaluation across each challenge task. We provide critical analysis of these results
while acknowledging the limitations of our approach and exploring potential directions for future
development in medical image analysis systems.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview of ImageCLEFmedical Caption 2025</title>
      <p>
        The ImageCLEFmedical Caption 2025 challenge is composed of three interconnected tasks: Concept
Detection, Caption Prediction, and Explainability [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. In the Concept Detection Task, systems identify
the presence of relevant medical concepts in an image, effectively predicting a set of UMLS (Unified
Medical Language System) concept IDs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or terms that describe the image’s content. This serves as a
foundation for captioning by providing the “building blocks” of the scene. In the Caption Prediction Task,
systems generate a coherent textual description of the entire image, ideally incorporating the detected
concepts and describing their interplay. The Explainability Task, newly introduced in 2025, requires
participants to produce an explanation for the caption on a small subset of images – for example, by
highlighting image regions and providing additional textual justification. The explainability component
is meant to improve interpretability and trust, allowing medical experts to verify why a caption or
concept was predicted.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The primary objective of our research is to develop a multimodal model that generates concept detection
and caption prediction. The secondary objective is developing an explainability pipeline combining
Natural Language Processing (NLP) and computer vision techniques as an experimental approach.</p>
      <p>
        The JJ-VMed medical imaging pipeline represents a three-phase framework that combines fine-tuned
multimodal models with explainability mechanisms to produce concept detection, caption prediction
and clinically relevant, spatially-grounded medical image analyses.
      </p>
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>
          All tasks utilized an extended version of the Radiology Objects in Context (ROCO) Version 2 dataset.
As in previous editions, the dataset originates from biomedical articles of the PMC Open Access subset.
We also used the ROCOv2 dataset from the previous year for fine-tuning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The challenge training set
(development data) consists of tens of thousands of radiology images (primarily X-rays, CT scans, MRIs,
etc.) collected from the biomedical literature, each paired with a figure caption and a set of manually
curated UMLS concept labels. Please note that the challenge datasets will be described in more detail in
the overview paper [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Evaluation Metrics</title>
        <p>
          Each task of the ImageCLEF medical caption challenge uses distinct evaluation metrics [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ]. The
concept detection task uses F1 scoring methodology [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ]. Caption prediction ranking combines six
metrics across two evaluation aspects: Relevance Metrics (four metrics) and Factuality Metrics (two
metrics). In the explainability task, a human expert radiologist evaluated the quality of each system’s
generated explanations using a 5-point Likert scale, where 5 represented the highest score. Please refer
to the overview paper for more details [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Experimental Setup</title>
        <p>
          For the computational experiments, a combination of two distinct hardware configurations was utilized:
• A desktop workstation equipped with an Intel Core i9 processor and an NVIDIA GeForce RTX
5090 graphics processing unit (GPU) featuring 32 GB of dedicated GDDR7 memory. This machine
was employed for all computational tasks, with the exception of operations involving the Segment
Anything Model (SAM) due to a technical incompatibility that prevented its use on this specific
setup [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ].
• A laptop computer, also powered by an Intel Core i9 processor, which included an NVIDIA
GeForce RTX 4070 laptop GPU with 8 GB of GDDR6X memory. This laptop was specifically used
for Phase III: Computer Vision for Segmentation and Analysis.
        </p>
        <p>The allocation of these resources was based on computational demands and system compatibility.
The desktop workstation, with its higher VRAM capacity, was primarily allocated for computationally
intensive training and evaluation phases for all tasks not involving SAM. The laptop, despite its comparatively
lower VRAM, specifically handled the computational requirements of the SAM-dependent computer
vision tasks. The precise extent of utilization for each GPU varied according to the specific demands of
individual experiments.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Methodology for Concept and Captioning Task</title>
        <p>This investigation employs a systematic four-phase approach for medical image concept detection
and caption prediction, utilizing fine-tuned multimodal large language models with LoRA (Low-Rank
Adaptation). The methodology addresses both technical and clinical requirements through rigorous
pre-processing, model adaptation, inference optimization, and validation procedures.</p>
        <sec id="sec-3-3-1">
          <title>3.4.1. Phase 1: Image Preprocessing</title>
          <p>
            The preprocessing pipeline ensures dataset consistency and model compatibility through systematic
image standardization. The process initiates with metadata acquisition from training and validation splits
(train_captions.csv, train_concepts.csv, valid_captions.csv, valid_concepts.csv),
followed by comprehensive image validation and format conversion. Critical preprocessing steps
include: (1) RGB format conversion for grayscale and RGBA images, eliminating alpha channels to
ensure input consistency; (2) optional resizing to 336 × 336 pixels using LANCZOS resampling to maintain
image quality while standardizing dimensions; (3) JPEG compression [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] at 90% quality for storage
optimization. The pipeline implements robust error handling for missing or corrupted files, providing
detailed reporting of processing success rates and error classifications. This preprocessing approach
was selected to address the heterogeneous nature of medical imaging datasets while maintaining visual
fidelity essential for clinical interpretation. The 336 × 336 resolution represents an optimal balance
between computational efficiency and preservation of diagnostic details, as demonstrated in previous
vision-language model implementations.
          </p>
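          <p>The three preprocessing steps above can be sketched as follows. This is an illustrative Pillow implementation; the function name and exact error handling are ours, not the pipeline's.</p>

```python
from io import BytesIO
from PIL import Image

def preprocess_image(img: Image.Image, size: int = 336) -> bytes:
    """Illustrative sketch of the preprocessing described in the text:
    (1) convert grayscale/RGBA inputs to RGB, dropping any alpha channel;
    (2) resize to 336 x 336 with LANCZOS resampling;
    (3) re-encode as JPEG at 90% quality for storage optimization.
    """
    if img.mode != "RGB":  # covers "L", "LA", "RGBA", "P", ...
        img = img.convert("RGB")
    img = img.resize((size, size), Image.LANCZOS)
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=90)
    return buf.getvalue()
```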
        </sec>
        <sec id="sec-3-3-2">
          <title>3.4.2. Phase 2: LLaVA-Llama Fine-tuning</title>
          <p>The fine-tuning procedure adapts the llava-llama-3-8b-v1_1-transformers model for medical
imaging tasks using Low-Rank Adaptation (LoRA). This parameter-efficient approach enables domain
adaptation while preserving the model’s foundational capabilities.</p>
          <p>
            LLaVA (Large Language and Vision Assistant), introduced by Liu et al. in 2023, was designed through
visual instruction tuning, enabling it to follow prompts about images in a conversational manner [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
The LLaVA-LLaMA 3 version represents an updated iteration using a newer LLaMA 3.1 backbone
(8B parameters) to improve capability [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. This multimodal model serves as our base system for
automatically generated captions and concepts detection, due to its capability to interpret visual inputs
and produce coherent text. The choice of an 8B backbone keeps the model lightweight enough for
fine-tuning while still benefiting from LLaMA 3’s improvements in language understanding. The fine-tuned
model used in this challenge by JJ-VMed is based on a LLaVA model fine-tuned from
meta-llama/MetaLlama-3-8B-Instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by
XTuner [
            <xref ref-type="bibr" rid="ref11 ref2">11, 2</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-3-3">
          <title>Model Configuration and Architecture</title>
          <p>The base model employs AutoModelForVision2Seq with device_map="auto" for optimal GPU
utilization and bfloat16 precision when supported. Critical architectural modifications include: (1)
integration of medical vocabulary through token embedding resizing; (2) processor patch_size
specification at 14 pixels for vision encoder compatibility; (3) eos_token assignment as pad_token
for consistent sequence handling.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>LoRA Implementation</title>
          <p>LoRA targets specific transformer components: attention mechanisms ( q_proj, k_proj, v_proj,
o_proj), feed-forward networks (gate_proj, up_proj, down_proj), and multimodal projection
layers (linear_1, linear_2). The configuration employs r = 16, α = 32, and dropout = 0.05,
balancing adaptation capacity with overfitting prevention. These parameters were selected based on
empirical evidence suggesting optimal performance for vision-language tasks with similar model scales.</p>
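          <p>The low-rank update itself is simple to state: the frozen weight W is augmented by a scaled rank-r product, (α/r)·BA. A minimal NumPy sketch with the r = 16, α = 32 configuration above (illustrative; not the actual PEFT/XTuner implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Minimal LoRA-adapted linear layer (illustrative sketch).

    The frozen weight W is augmented with a low-rank update (alpha/r) * B @ A,
    where A is (r, in_dim) and B is (out_dim, r). B is zero-initialized, so
    the adapted layer initially matches the base layer exactly.
    """
    def __init__(self, W: np.ndarray, r: int = 16, alpha: int = 32):
        out_dim, in_dim = W.shape
        self.W = W                                 # frozen base weight
        self.A = rng.normal(0, 0.01, (r, in_dim))  # trainable factor
        self.B = np.zeros((out_dim, r))            # trainable, zero-init
        self.scale = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ (self.W + self.scale * self.B @ self.A).T

W = rng.normal(size=(8, 32))
layer = LoRALinear(W)
x = rng.normal(size=(4, 32))
# With B zero-initialized, the LoRA output equals the base layer output.
assert np.allclose(layer.forward(x), x @ W.T)
```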
        </sec>
        <sec id="sec-3-3-5">
          <title>Training Configuration</title>
          <p>Hyperparameter selection follows established practices for medical domain adaptation:
learning_rate = 5 × 10<sup>−5</sup>, batch_size=4 with gradient_accumulation_steps=8
(effective batch_size=32), single epoch training to prevent overfitting on limited medical data. The
training employs gradient checkpointing for memory efficiency and evaluation every 250 steps with
early stopping based on validation loss.</p>
          <p>The optimization strategy utilizes AdamW optimizer with fused implementation for enhanced
computational performance. Learning rate scheduling employs cosine annealing with a 3% warmup
ratio to ensure stable convergence. Mixed precision training is implemented using bfloat16 precision
when supported by hardware, with float16 fallback for compatibility. The maximum sequence length
is configured to 1024 tokens to accommodate comprehensive medical descriptions while maintaining
computational efficiency. Training monitoring includes performance metrics logged every 50 steps
with comprehensive TensorBoard integration for training visualization. The loss computation strategy
implements selective masking where input tokens receive label value -100, ensuring only assistant
responses contribute to the loss calculation, which is critical for effective instruction tuning. Bias
parameters across all targeted modules remain frozen during LoRA adaptation to maintain model
stability.</p>
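          <p>The warmup-plus-cosine schedule described above can be written down directly. The following sketch assumes a linear warmup over the first 3% of steps and a cosine decay to zero; the actual scheduler implementation may differ in endpoint handling.</p>

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 5e-5,
          warmup_ratio: float = 0.03) -> float:
    """Cosine annealing with linear warmup (illustrative sketch)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup ramp
    # cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```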
        </sec>
        <sec id="sec-3-3-6">
          <title>Prompt Engineering Strategy</title>
          <p>The training incorporates diverse prompt templates to enhance model generalization: (1) direct image
caption requests; (2) medical concept enumeration tasks; (3) combined caption-concept generation; (4)
conditional prompts for specific concept queries. This multi-task training approach promotes robust
understanding of medical imagery across various clinical scenarios.</p>
          <p>
            The use of Spanish prompts to fine-tune a LLaVA-LLaMA 8B model for concept detection and
caption prediction tasks, despite the challenge being in English, is a strategic choice aligned with
a broader project to develop a Spanish Multimodal Q&amp;A System. This approach serves as a crucial
experimental pipeline within the challenge to cultivate the model’s multilingual capabilities. While
LLaMA models are primarily English-centric [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], instruction tuning is vital for enhancing their
proficiency in other languages [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. Research indicates that multilingual tuning can be on par with or
even surpass monolingual approaches, offering benefits for cross-lingual transfer and robustness [14].
Specifically, by exposing the model to Spanish, we aim to mitigate Image-induced Fidelity Loss (IFL),
where LLaVA models often bias responses towards English, particularly after visual input [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. This bias
stems from the language model component and switching to bilingual language model backbones has
been shown to reduce IFL [
            <xref ref-type="bibr" rid="ref12">15, 12</xref>
            ]. Thus, Spanish prompts improve the model’s overall multilinguality
[16, 15], making it more appropriate for the Spanish system and potentially enhancing its English
performance by fostering a more robust, language-aware internal representation.
          </p>
          <p>Our fine-tuning approach treated the task as a form of instruction following. We constructed a prompt
template format for the model that would allow it to learn both concept detection and captioning. During
training, each sample was presented to the model randomly and the target output would combine the
caption and/or concepts. Here is an example translated into English since we used Spanish prompts.
[System: You are a medical vision-language assistant. Answer with medical terminology
when appropriate.]</p>
          <p>User: Describe the following radiology image in detail and list the key findings.
[Image]</p>
          <p>Assistant:</p>
          <p>Regarding outputs, for the challenge we experimented with two formats:</p>
          <p>Concepts only format: The model outputs a list of “Concepts” with their Concept Unique Identifiers
(CUIs). This format explicitly trains the model to identify and name the important concepts (after seeing
the image) before generating the fluent sentence.</p>
          <p>Caption only format: The model outputs just the caption sentence(s), implicitly learning to mention
the important concepts within it. This is closer to how radiologists write captions (embedding key
terms in the description).</p>
          <p>The outputs were two .csv files with two columns:
• Concepts: column ID with the image name and column CUIs with the identified concepts.
• Captions: column ID with the image name and column Caption with the predicted captions.</p>
        </sec>
        <sec id="sec-3-3-7">
          <title>3.4.3. Phase 3: Model Inference</title>
          <p>Inference procedures address dual objectives: concept detection and caption generation, each employing
task-specific optimization strategies.</p>
        </sec>
        <sec id="sec-3-3-8">
          <title>Concept Detection Protocol</title>
          <p>Concept inference utilizes structured prompts in Spanish: "Enumera los conceptos médicos clave (CUIs)
observados o inferidos en esta imagen." The model generates natural language medical terms
subsequently mapped to Concept Unique Identifiers (CUIs) through a comprehensive dictionary matching
system. The CUI mapping process implements fuzzy matching algorithms to handle terminology
variations and synonyms, ensuring robust concept identification. The convert_natural_to_cui
function processes comma-separated natural language outputs, yielding semicolon-separated CUI codes
while maintaining concept uniqueness and ordering.</p>
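          <p>A minimal sketch of this mapping, using Python's stdlib difflib for fuzzy matching. The dictionary entries and similarity cutoff here are illustrative; the actual system uses a comprehensive term dictionary.</p>

```python
import difflib

# Illustrative term-to-CUI dictionary; the real pipeline uses a comprehensive one.
TERM_TO_CUI = {
    "computed tomography": "C0040405",
    "magnetic resonance imaging": "C0024485",
    "pneumothorax": "C0032326",
}

def convert_natural_to_cui(text: str, cutoff: float = 0.8) -> str:
    """Map comma-separated natural-language terms to unique, ordered CUIs.

    Fuzzy matching absorbs small spelling variations and synonyms; unknown
    terms are dropped. Output is a semicolon-separated CUI string, as
    described in the text.
    """
    cuis = []
    for term in (t.strip().lower() for t in text.split(",") if t.strip()):
        match = difflib.get_close_matches(term, TERM_TO_CUI, n=1, cutoff=cutoff)
        if match:
            cui = TERM_TO_CUI[match[0]]
            if cui not in cuis:  # keep concepts unique, preserve order
                cuis.append(cui)
    return ";".join(cuis)
```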
        </sec>
        <sec id="sec-3-3-9">
          <title>Caption Generation Methodology</title>
          <p>Two caption generation approaches were implemented: (1) direct description prompting; (2)
concept-conditioned generation. The optimized approach incorporates previously identified natural language
concepts into prompts: "Describe esta imagen médica enfocándose en los siguientes conceptos clave:
concepts." This conditioning strategy enhances caption relevance and clinical accuracy by leveraging
concept detection outputs. Generation parameters include the max_new_tokens configuration and
optional sampling strategies (do_sample, temperature) to balance creativity and factual accuracy. Batch
processing with checkpointing mechanisms ensures computational efficiency and recovery capabilities.</p>
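          <p>The conditioning step amounts to interpolating the detected natural-language concepts into the Spanish prompt. A sketch; the concept-conditioned wording is quoted from the text, while the wording of the unconditioned fallback prompt is our assumption.</p>

```python
def build_caption_prompt(concepts=None):
    """Build the Spanish inference prompt, optionally conditioned on the
    natural-language concepts detected in the previous stage (sketch)."""
    if concepts:
        return ("Describe esta imagen médica enfocándose en los siguientes "
                "conceptos clave: " + ", ".join(concepts))
    # Fallback wording for direct description prompting (assumed).
    return "Describe la imagen médica en detalle."
```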
        </sec>
        <sec id="sec-3-3-10">
          <title>Technical Implementation Details</title>
          <p>Model loading employs optional 4-bit quantization using BitsAndBytesConfig for
memory-constrained environments. PEFT adapter integration allows seamless fine-tuned model deployment
while maintaining base model integrity. The inference pipeline implements comprehensive error
handling and progress tracking for large-scale dataset processing.</p>
        </sec>
        <sec id="sec-3-3-11">
          <title>3.4.4. Phase 4: Post-processing and Validation</title>
          <p>Post-processing ensures output format compliance and content consistency through systematic
validation procedures.</p>
        </sec>
        <sec id="sec-3-3-12">
          <title>Concept Validation Protocol</title>
          <p>The clean_and_process_cui_string function addresses common output formatting issues: (1)
duplicate CUI removal while preserving first occurrence order; (2) whitespace normalization and empty
string elimination; (3) semicolon-separated format standardization. This validation ensures submission
compliance with evaluation framework requirements.</p>
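          <p>A sketch consistent with the validation behaviour described above; the actual implementation is not reproduced in this paper.</p>

```python
def clean_and_process_cui_string(raw: str) -> str:
    """Sketch of the concept validation step: deduplicate CUIs while
    preserving first-occurrence order, normalize whitespace, drop empty
    entries, and emit a semicolon-separated string."""
    seen, cleaned = set(), []
    for cui in raw.split(";"):
        cui = cui.strip()  # whitespace normalization
        if cui and cui not in seen:
            seen.add(cui)
            cleaned.append(cui)
    return ";".join(cleaned)
```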
        </sec>
        <sec id="sec-3-3-13">
          <title>Caption Refinement Procedures</title>
          <p>Caption cleaning targets generation artifacts through quote character normalization. The
clean_quotes function removes erroneous triple quotes, double quote sequences, and inconsistent
quotation marks. Subsequent CSV formatting ensures proper encapsulation of caption content while
maintaining identifier integrity.</p>
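          <p>A sketch of the quote normalization; the surrounding-quote heuristic in the last step is our assumption, not a documented detail of the pipeline.</p>

```python
def clean_quotes(caption: str) -> str:
    """Sketch of caption cleaning: collapse erroneous triple and doubled
    quotes, then strip a stray quotation mark wrapping the whole caption."""
    caption = caption.replace('"""', '"').replace('""', '"')
    # Assumed heuristic: drop a single quote pair wrapping the full caption.
    if caption.startswith('"') and caption.endswith('"'):
        caption = caption[1:-1]
    return caption.strip()
```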
        </sec>
        <sec id="sec-3-3-14">
          <title>3.4.5. Methodological Justification</title>
          <p>This four-phase approach addresses specific challenges in medical image analysis: (1) dataset
heterogeneity through systematic preprocessing; (2) domain adaptation requirements via LoRA fine-tuning;
(3) clinical relevance through concept-conditioned generation; (4) evaluation compliance through
rigorous post-processing. The LoRA adaptation strategy was selected over full fine-tuning to prevent
catastrophic forgetting while enabling domain-specific learning. The concept-conditioned caption
generation approach represents a novel contribution, leveraging detected medical concepts to guide
description generation toward clinically relevant content.</p>
        </sec>
        <sec id="sec-3-3-15">
          <title>3.4.6. Implementation Constraints and Methodological Implications</title>
          <p>To augment the conceptual richness and detection granularity of the foundational text generation
process, we initiated the development of a secondary fine-tuned LLaVa-Mistral 7B model specifically
optimized for enhanced medical concept identification. This model was designed to complement the
primary LLaVa Llama 8B model by providing more comprehensive and nuanced detection of medical
entities, anatomical structures, and pathological findings within medical images. Despite successful
model fine-tuning completion, temporal constraints imposed by the challenge deadline prevented
the execution of full inference processes on the target dataset. Consequently, the enhanced concept
detections from this secondary model could not be integrated into the final pipeline implementation.
This limitation represents a significant methodological constraint that potentially affected the
comprehensiveness of concept coverage in the final explainability outputs. The absence of this enhanced
model’s contributions may have resulted in reduced recall for subtle or complex medical entities that
would have benefited from the more sophisticated detection capabilities.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Methodology for Explainability task</title>
        <p>The approach outlines a multi-stage pipeline designed to provide visual explanations for AI-generated
medical image captions. Although this system did not fully meet the challenge objective (as discussed in
the Explainability Discussion subsection in the Analysis section), this experimental endeavour represents
an attempt to contribute to the ongoing research addressing explainable artificial intelligence (XAI)
requirements in the medical domain, where the inherent opacity of deep learning models can hinder
trust and adoption in clinical settings [17, 18].</p>
        <sec id="sec-3-4-1">
          <title>3.5.1. Introduction and Foundational Data</title>
          <p>The foundational data for this project originates from textual content—specifically, initial concepts
and captions generated by a fine-tuned LLaVa Llama 8B model. LLaVA (Large Language-and-Vision
Assistant) models, such as LLaVA-Med, are trained via visual instruction tuning on vast multimodal
datasets, making them capable of interpreting and generating text from visual inputs.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.5.2. Phase I: Natural Language Processing for Coordinate Generation</title>
          <p>
            Natural Language Processing begins with the Llama 3.1 model merging concepts and captions into
refined textual outputs [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. Large language models (LLMs) and transformer-based architectures, such as
those discussed in the GPT-4 Technical Report [19], have demonstrated good performance in
understanding and generating nuanced natural language, proving crucial for synthesizing refined explanations [19].
The emphasis on high-quality, natural language forms the basis for human-interpretable explanations
[17]. Subsequently, the initial computational phase is dedicated to translating abstract textual concepts
into approximate spatial coordinates, specifically bounding box coordinates ( x, y, width, height) or
arrow-tip locations. This transformation is executed via a pipeline that interfaces with OpenAI API
models [19]. For vision-language tasks, the gpt-4o model is employed, leveraging its multimodal
capabilities to interpret medical images and their associated captions [19]. GPT-4 exhibits human-level
performance on professional and academic benchmarks [20, 21] and processes image and text inputs to
produce text outputs [19]. Subsequently, for refinement and validation, the gpt-4.1 model is utilized
[19]. The system relies on robust NLP libraries, including openai (v1.x), pandas, json, and re, for
eficient data handling.
          </p>
          <p>To achieve high precision in term normalization, several NLP techniques are rigorously applied.
Rule-based filtering is implemented using regular expressions to exclude generic modality terms like
"MRI" or "X-ray". This ensures analysis focuses on anatomically significant entities [ 18]. A preprocessing
step called group compound terms uses a dictionary of compound terms (e.g., small intestine, left axillary
region). This maintains token integrity and prevents fragmentation of terms like right lower lobe, which
is crucial in medical text processing [17].</p>
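          <p>Both normalization steps can be sketched with stdlib regular expressions; the modality and compound-term lists below are illustrative excerpts, not the full dictionaries used by the system.</p>

```python
import re

MODALITY_TERMS = {"mri", "x-ray", "ct", "ultrasound"}  # illustrative excerpt
COMPOUND_TERMS = ["right lower lobe", "small intestine", "left axillary region"]

def group_compound_terms(text: str) -> str:
    """Join known multi-word anatomical terms with underscores so later
    tokenization cannot fragment them (sketch of the step in the text)."""
    for term in COMPOUND_TERMS:
        text = re.sub(re.escape(term), term.replace(" ", "_"), text, flags=re.I)
    return text

def filter_modalities(terms):
    """Rule-based filtering: drop generic modality terms such as 'MRI'."""
    return [t for t in terms if t.lower() not in MODALITY_TERMS]
```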
          <p>
            A two-model prompting strategy underpins the coordinate generation process. First, gpt-4o is
prompted to analyse the image, caption, and normalized terms, estimating bounding box coordinates or
arrow-tip locations. Then, gpt-4.1 validates and refines this output. If a bounding box is absent but
an arrow exists, a 40 × 40 pixel box is inferred. This yields a curated dataset in sam_coord.csv and
JSON, ready for segmentation [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
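          <p>The fallback from an arrow tip to a 40 × 40 box can be sketched as follows; centring the box on the tip and clamping to the image bounds are our assumptions.</p>

```python
def box_from_arrow(tip_x, tip_y, img_w, img_h, size=40):
    """Infer a fixed-size bounding box centred on an arrow tip when no
    bounding box was returned, clamped so it stays inside the image."""
    x = min(max(tip_x - size // 2, 0), img_w - size)
    y = min(max(tip_y - size // 2, 0), img_h - size)
    return (x, y, size, size)  # (x, y, width, height)
```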
        </sec>
        <sec id="sec-3-4-3">
          <title>3.5.3. Phase II: NLP-driven Term Enhancement via Named Entity Recognition (NER)</title>
          <p>To augment medical concepts, a dedicated NLP stage extracts terms directly from captions. This
employs a hybrid Named Entity Recognition (NER) approach: a biomedical model plus rule-based
techniques. The biomedical NER model d4data/biomedical-ner-all, accessed via the transformers
library (v4.x), is specifically designed for clinical entity recognition [22]. Complementing this, regular
expressions in extract_medical_terms_enhanced capture linguistic patterns (e.g., 1.5 cm
hypoechoic mass, hematoma, stenosis).</p>
          <p>This dual approach ensures comprehensive terminology extraction. Newly identified terms are
compared against sam_coord.csv. Novel terms re-enter Phase I, generating coordinates that enrich
sam_coord.csv. This refinement enhances localization of clinical information.</p>
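          <p>The re-entry check reduces to a case-insensitive set difference against the terms already present in sam_coord.csv. A sketch:</p>

```python
def novel_terms(ner_terms, known_terms):
    """Return NER-extracted terms not yet present in the coordinate file;
    these re-enter Phase I for coordinate generation (illustrative sketch)."""
    known = {t.lower() for t in known_terms}
    out = []
    for t in ner_terms:
        if t.lower() not in known and t.lower() not in {o.lower() for o in out}:
            out.append(t)
    return out
```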
        </sec>
        <sec id="sec-3-4-4">
          <title>3.5.4. Phase III: Computer Vision for Segmentation and Analysis</title>
          <p>This final phase uses the coordinates to segment images and generate analytical outputs. It integrates
advanced CV models and libraries tuned for medical imaging [23].</p>
          <p>
            For segmentation, the Segment Anything Model (SAM) is used [
            <xref ref-type="bibr" rid="ref8">8, 24</xref>
            ]. We used the
sam_vit_h_4b8939.pth checkpoint with the segment-anything library on PyTorch.
          </p>
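<p>The SAM invocation can be sketched with the segment-anything API as follows; the checkpoint path follows the text, while the helper names are our own illustration:</p>

```python
import numpy as np

def box_prompt(x, y, w, h):
    """Convert an (x, y, w, h) box from sam_coord.csv into the XYXY array
    that SamPredictor.predict expects."""
    return np.array([x, y, x + w, y + h])

def segment_with_sam(image_rgb, boxes, checkpoint="sam_vit_h_4b8939.pth"):
    # Local import so the pure helper above works without the package.
    from segment_anything import sam_model_registry, SamPredictor
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # HxWx3 uint8 RGB array
    masks = []
    for b in boxes:
        m, scores, _ = predictor.predict(box=box_prompt(*b),
                                         multimask_output=False)
        masks.append((m[0], float(scores[0])))  # binary mask + confidence
    return masks
```
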
          <p>For detection, the YOLOv8n model from ultralytics is employed [25]. YOLO is known for
real-time object detection [26], with YOLOv7 and YOLOv8 being recent innovations [25, 26]. Here, YOLOv8n
enhances recall by detecting overlooked entities [25]. Detections with an Intersection over Union (IoU) &lt;
0.5 relative to existing boxes are added for segmentation. Time constraints prevented a custom anatomical
model, so the generic yolov8n-seg.pt model was used, which introduced noise due to non-medical tags.</p>
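<p>The IoU-based gap-filling rule can be sketched as follows (our reading of the text: a YOLO detection is added only when no existing box already covers it at IoU ≥ 0.5):</p>

```python
def iou(a, b):
    """Intersection over Union of two XYXY boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def novel_detections(yolo_boxes, existing, thr=0.5):
    """Keep YOLO detections that overlap no existing box at IoU >= thr,
    mirroring the recall-enhancing rule described in the text."""
    return [b for b in yolo_boxes if all(iou(b, e) < thr for e in existing)]
```
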
        </sec>
        <sec id="sec-3-4-5">
          <title>Computer Vision Techniques for Masking:</title>
          <p>• Direct Bounding Box Prompt: The primary strategy was to prompt SAM with the generated
bounding box.
• Keypoint-Based Prompting: If the confidence of the initial mask was low, the system escalated to
a more nuanced prompting strategy using local feature descriptors [27]. These were not used
for matching, but as a set of salient points to guide SAM’s attention within the bounding box.
The keypoint detectors used were SIFT (Scale-Invariant Feature Transform), FAST (Features from
Accelerated Segment Test) and LoG (Laplacian of Gaussian) [28].</p>
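<p>The keypoint-based escalation can be sketched as follows; the detector choices follow the text (SIFT and FAST, with LoG available via skimage.feature.blob_log), while the helper names are our own:</p>

```python
import numpy as np

def points_in_box(points, box):
    """Keep (x, y) points inside an XYXY box; survivors become positive
    point prompts that guide SAM's attention within the bounding box."""
    x1, y1, x2, y2 = box
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    keep = ((pts[:, 0] >= x1) & (pts[:, 0] <= x2)
            & (pts[:, 1] >= y1) & (pts[:, 1] <= y2))
    return pts[keep]

def salient_points(gray, box):
    # Local import: opencv-python is only needed when the detectors run.
    import cv2
    kps = (cv2.SIFT_create().detect(gray, None)
           + cv2.FastFeatureDetector_create().detect(gray, None))
    return points_in_box([kp.pt for kp in kps], box)
```
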
          <p>Feature detection uses opencv-python (v4.x) and scikit-image (v0.x). Outputs include diagnostic
plots and annotated images with legends, aligning with the heatmap concept for explainability.</p>
          <p>In the Explainability task, three types of outputs were generated:
• Bounding Boxes: Colored rectangular regions that mark areas of interest within the medical
images. Each image is accompanied by a legend indicating the specific medical entities identified,
with multiple bounding boxes possible per image to highlight different anatomical structures or
pathological findings.
• Heatmaps: Comprehensive visualization sets consisting of four representations per identified
object or label: the original image with bounding boxes, a conceptual probability heatmap showing
confidence distributions, a three-dimensional heatmap surface providing depth visualization, and
a heatmap overlay that combines visual attention with the original image content.
• Internal Metrics: Quantitative measurements related to the computer vision techniques employed,
providing technical insights into the model’s performance and decision-making processes during
the explainability generation phase.</p>
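<p>Of these outputs, the heatmap overlay admits a minimal sketch: blending a confidence-weighted map over the original image. The red-channel encoding and blending weight below are our assumptions:</p>

```python
import numpy as np

def heatmap_overlay(image, mask, score, alpha=0.4):
    """Blend a confidence-weighted heatmap (red channel) over the image,
    a minimal stand-in for the overlay output described above."""
    heat = np.zeros_like(image, dtype=float)
    heat[..., 0] = mask.astype(float) * 255.0 * score
    return ((1 - alpha) * image.astype(float) + alpha * heat).astype(np.uint8)
```
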
          <p>CSV exports detail confidence scores and metrics for each strategy. This supports AI explainability
and clinical decision-making. The approach is exploratory, targeting transparency and faithfulness—the
degree to which explanations reflect model reasoning—and plausibility—how well they align with human
understanding [18]. Since XAI methods often fail to meet full clinical requirements, this structured
approach aims to bridge that gap and offer interpretable outputs to clinicians [18].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        This section presents JJ-VMed results under the username Jaimage. The three tasks were evaluated
on the official ImageCLEFmedical test sets [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ]. We report performance on all tasks in which our
systems participated: Concept Detection, Caption Prediction and Explainability. The results include
different metrics for each task as explained earlier. For more details refer to the Overview paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For
Concept Detection results see Table 1, for Caption Prediction results see Table 2 and for Explainability
task results see Table 3. These are the official results on the independent test set for ImageCLEFmedical
Caption 2025 Tasks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Analysis and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Concept Detection Performance Analysis</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Primary Performance Metrics and Limitations</title>
          <p>Our system achieved an F1 score of 0.3982 in the concept detection task, representing a substantial
performance deficit compared to the leading approach (F1 = 0.5888). This 32.4% performance gap
indicates significant limitations in our methodology’s ability to accurately identify and classify medical
concepts within the target dataset. The moderate F1 score suggests fundamental weaknesses in either
our feature extraction processes, concept classification mechanisms, or both.</p>
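<p>For context, the set-based F1 underlying this metric is commonly computed per image over the predicted and ground-truth concept sets and then averaged; the sketch below is our illustration, not the official evaluator:</p>

```python
def f1_set(predicted, gold):
    """F1 between two concept sets (e.g., UMLS CUIs) for a single image."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        # Both empty counts as a perfect match; one-sided empty scores zero.
        return 1.0 if predicted == gold else 0.0
    tp = len(predicted & gold)
    precision, recall = tp / len(predicted), tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
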
          <p>Critical Analysis of Underperformance: Several methodological factors likely contributed to
this suboptimal performance. First, the reliance on a single fine-tuned LLaVA Llama 8B model may
have introduced inherent limitations in concept recognition capabilities, particularly for specialized
medical terminology. The fine-tuning parameters may not have been optimally configured for the
specific medical imaging domain represented in the challenge dataset.</p>
          <p>Language-Specific Performance Degradation: A significant contributing factor to the observed
performance deficit can be attributed to the use of Spanish prompting in our primary model
implementation. Medical terminology translation and cross-linguistic concept mapping introduce additional
complexity layers that can adversely affect concept detection accuracy. Spanish medical terminology
may not align precisely with the English-based training datasets commonly used in large vision-language
models, potentially causing systematic misclassifications or missed detections.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Secondary Metric Performance and Contextual Analysis</title>
          <p>While our approach demonstrated improved performance on the secondary F1 metric (0.8329), this
result requires critical contextualization. Although this score appears substantially higher than our
primary F1, the 12.2% performance gap compared to the leading team’s secondary F1 (0.9484) remains
significant and should not be underestimated. This approximately 12% differential in a curated subset
of key concepts indicates that our methodology still fails to achieve optimal performance even when
evaluated on the most clinically relevant features.</p>
          <p>Methodological Implications: The improved secondary F1 score suggests that our approach
demonstrates reasonable proficiency in detecting prominent or well-defined medical concepts while
struggling with more nuanced, rare, or complex findings. This pattern indicates potentially insufficient
training exposure to the full spectrum of medical concept diversity present in clinical practice.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. Methodological Response to Identified Limitations</title>
          <p>During the challenge development phase, we observed suboptimal concept detection performance
in our preliminary evaluations, though without access to comparative benchmarks or final official
results. Recognizing these potential deficiencies and anticipating the need for enhanced performance,
we initiated the fine-tuning of a secondary LLaVA-Mistral 7B model utilizing English prompts to
address both the language-specific limitations and enhance overall concept detection granularity. The
English prompting strategy was specifically implemented to mitigate the cross-linguistic performance
degradation observed in our primary Spanish-prompted model. This secondary model demonstrated
enhanced descriptive output quality in preliminary evaluations, providing more detailed concept
identification compared to our baseline approach.</p>
          <p>The successful development of the LLaVA-Mistral 7B model confirms the technical viability of
multimodel ensemble approaches for enhanced concept detection. However, the incomplete deployment
due to temporal constraints represents a critical methodological limitation that prevented empirical
validation of this enhancement strategy’s effectiveness. Future implementations must prioritize earlier
integration of secondary models to ensure comprehensive evaluation within project timelines.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Caption Prediction Performance Analysis</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Overall Performance Assessment and Competitive Context</title>
          <p>Our caption generation model achieved an overall score of 0.3043, representing approximately 88.7% of
the leading team’s performance (0.3432). While this result demonstrates moderate competitiveness, the
11.3% performance gap indicates substantial room for improvement in caption generation quality and
accuracy.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Strengths in Textual Similarity and Content Overlap</title>
          <p>Our model demonstrated notable strengths in specific evaluation dimensions. The high textual similarity
score (0.8251) and competitive BERTScore Recall (0.5953 vs. 0.5977 for the leading team) indicate
effective content overlap and phrasing alignment with ground-truth reports. These metrics suggest that
our approach successfully captures general descriptive patterns and maintains reasonable linguistic
coherence in generated captions.</p>
          <p>The moderate ROUGE-1 (0.2389) and BLEURT (0.3094) scores reflect acceptable textual overlap
and overall caption quality as measured by established metrics. The average relevance score of 0.4922
indicates that our generated captions incorporate a substantial portion of key information from reference
reports, demonstrating reasonable content coverage.</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Critical Analysis of Domain-Specific Performance Deficits</title>
          <p>Medical Terminology and Concept Accuracy: Our system exhibits significant weaknesses in
domain-specific medical accuracy, as evidenced by the UMLS concept F1 score of 0.1366, representing a 24.8%
deficit compared to the leading approach (0.1816). This substantial gap indicates systematic difficulties in
accurately identifying, incorporating, and maintaining consistency with specialized medical terminology
and specific clinical findings.</p>
          <p>Visual-Textual Alignment Deficiencies: The AlignScore of 0.0964 (versus 0.1375 for optimal
performance) reveals a critical limitation in aligning image content with generated descriptions. This
30.1% performance deficit suggests that our model frequently generates descriptions containing details
insufficiently supported by visual evidence, indicating potential hallucination tendencies or inadequate
visual feature extraction capabilities.</p>
          <p>Factual Consistency and Reliability Issues: The factuality average of 0.1165, compared to the
leading result of 0.1596, represents a 27.0% performance gap that raises serious concerns about clinical
applicability. This deficit indicates recurring factual inconsistencies and possible hallucinations in
generated text, which constitute fundamental reliability issues for medical applications where accuracy
is paramount.</p>
        </sec>
        <sec id="sec-5-2-4">
          <title>5.2.4. Systematic Analysis of Performance Limitations</title>
          <p>Root Cause Analysis: The observed deficiencies likely stem from several interconnected
methodological limitations:
1. Fine-Tuning Language: Our fine-tuning process used Spanish prompts; although the model
understands Spanish, it may have struggled to produce specialized medical terminology in medical
imaging contexts, resulting in generic rather than clinically precise descriptions.
2. Visual Feature Extraction Limitations: The visual encoding components may lack sufficient
granularity to capture subtle medical imaging features necessary for accurate clinical description.
3. Cross-Modal Integration Deficiencies: The alignment between visual and textual
representations appears suboptimal, leading to descriptions that fail to accurately reflect image content.
4. Limited Medical Knowledge Integration: The absence of explicit medical knowledge bases or
fact-checking mechanisms likely contributes to factual inaccuracies and terminology misuse.</p>
        </sec>
        <sec id="sec-5-2-5">
          <title>5.2.5. Methodological Reflection</title>
          <p>This performance analysis reveals that our experimental approach, while demonstrating technical
feasibility, requires substantial refinement to achieve clinical relevance. The systematic underperformance
across multiple evaluation metrics indicates fundamental limitations in our current methodology that
extend beyond simple parameter optimization. The language-specific performance degradation and
incomplete deployment of enhancement strategies highlight the importance of comprehensive planning
and early implementation of methodological improvements in future research endeavors.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Explainability Discussion</title>
        <p>The presented framework is classified as a post-hoc, concept-grounding explainability system. Its
fundamental objective is to provide clinical interpretability by independently verifying the model’s
textual outputs against the visual evidence present in the image.</p>
        <sec id="sec-5-3-1">
          <title>5.3.1. Acknowledging the Deviation from Direct Black-Box Explanation</title>
          <p>It is important to acknowledge that the primary objective of this framework is not to elucidate the
internal computational pathways of the “black-box” captioning model. This approach stands in stark
contrast to intrinsic or ante-hoc methods, which are designed to be transparent by nature, and other
post-hoc methods, such as Grad-CAM [17], which aim to explain a model’s decision-making process by
inspecting its internal state (e.g., gradients, activations) [29]. While explainable AI (XAI) broadly aims
to make AI models more transparent, interpretable, and understandable, critics of post-hoc explanations
correctly assert that they merely approximate, rather than replicate, the actual reasoning processes of
black-box systems. A truly “complete” post-hoc explanation would, in essence, equate to the original
model itself, thus negating the need for the original model [18].</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. Justification of the Post-Hoc Approach for Clinical Interpretability</title>
          <p>Despite this acknowledged deviation from direct black-box explanation, the adoption of a post-hoc
explainability framework was necessitated by pragmatic constraints encountered during the challenge,
including the unavailability of a fully developed intrinsic explainability module and demanding challenge
timelines [29, 30]. The resultant strategy was to prioritize a method that could effectively enhance a
clinician’s trust in the final output, even if the model’s internal logic remained opaque [30, 18, 29].</p>
          <p>This approach is firmly rooted in the concept central to human-centric XAI, positing that for clinical
adoption, trust can be fostered by demonstrating that a model’s conclusions are factually correct and
visually verifiable [ 30, 18]. The system attempts to achieve this by employing an independent pipeline
to answer a crucial question for clinicians: “Given the model’s claim, is there corroborating evidence in
the image?”.</p>
          <p>
            Though indirect, this external audit approach offers several distinct benefits:
• Modularity: It can be applied to any captioning model, underscoring its versatility.
• Integration of Specialized Models: It allows for the integration of task-specific models, such
as SAM (Segment Anything Model), optimized for segmentation verification [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
• Alternative Visualization Approach: It enables the generation of segmentation masks through
SAM [
            <xref ref-type="bibr" rid="ref8">8</xref>
             ], offering an alternative to heatmap-based explanations. This approach provides spatially
defined regions that may assist in clinical interpretation, though empirical validation of radiologist
preferences between visualization methods remains to be established.
          </p>
          <p>Furthermore, defenses of post-hoc explanations highlight their utility even without replicating
internal model reasoning. They can improve users’ functional understanding of black-box systems,
increase the accuracy of clinician-AI teams (e.g., radiologist-AI teams have shown improved accuracy
with saliency masks), and assist clinicians in justifying their AI-informed decisions. Such explanations
empower users to better discriminate between correct and incorrect outputs [31].</p>
        </sec>
        <sec id="sec-5-3-3">
          <title>5.3.3. Limitations and Complementary Nature</title>
          <p>Nevertheless, it is critical to acknowledge the fundamental limitation of this external audit approach:
its inability to diagnose why a model fails. It can effectively detect a hallucinated finding but cannot
explain its origin within the source model’s architecture. Therefore, this work is best viewed as a
pragmatic exploration into building trust via external, multimodal verification. It complements, rather
than replaces, the critical role of intrinsic explainability methods. For instance, techniques like
Grad-CAM are invaluable for model debugging and understanding feature attribution by localizing where
in the image the model “looked” [17]. This differs from the presented system’s output of grounded
segmentation masks. To illustrate this distinction, Appendix B provides a comparative Grad-CAM
visualization (obtained post-challenge from an updated version of the system). This visualization
showcases how an intrinsic method highlights the captioning model’s regions of interest, an approach
that differs from our challenge system’s output of grounded segmentation masks.</p>
          <p>Thus, while this post-hoc verification serves a vital role in fostering clinical trust and providing clear,
verifiable evidence, it is part of a broader explainable AI ecosystem necessary for a holistic understanding
of AI system behavior and for the comprehensive concept of trustworthy AI, which extends beyond mere
explainability to include dimensions such as fairness, safety, privacy, accountability, and robustness.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Based on our comprehensive evaluation and analysis of the JJ-VMed system’s performance across
the ImageCLEF medical captioning tasks, we have identified several critical areas for improvement
and promising research directions. Our current results reveal specific limitations that provide clear
pathways for future enhancement.</p>
      <p>Several research directions warrant exploration for improving our system in future editions; some
are already under development.</p>
      <sec id="sec-6-1">
        <title>6.1. Methodological Improvements and Future Directions</title>
        <p>Our results indicate that while our approach achieves reasonable general language quality and content
relevance, significant improvements are required in medical domain specificity and factual accuracy.
Future developments should prioritize:
1. Enhanced Medical Knowledge Integration: Incorporating structured medical knowledge
bases and terminology validation systems
2. Improved Visual Feature Extraction: Implementing medical imaging-specific feature
extraction methods
3. Robust Fact-Checking Mechanisms: Developing explicit verification systems for generated
medical content
4. Cross-Modal Alignment Optimization: Enhancing the integration between visual and textual
model components</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Capturing Clinical Content</title>
        <p>Integrate more multimodal vision-language foundation models to better capture clinical content [32].
Ensembling complementary image encoders can produce more fluent and contextually accurate
descriptions [33]. By adopting such architectures – powerful vision backbones and generative transformers
– captioning systems can better describe complex scenes and rare pathologies in natural, expert-like
language. This enhancement directly addresses our current limitations in UMLS concept detection (F1:
0.1366) and factual alignment, potentially improving both clinical accuracy and terminology precision
in generated captions.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Hybrid Transformers + CNN Approach</title>
        <p>The concept detection component of our system presents opportunities for significant improvement
through architectural innovation. While Transformer-based models have demonstrated competitive
performance in medical image analysis, they have yet to fully surpass Convolutional Neural Networks
(CNN)-based approaches in all medical imaging contexts. However, these models have shown good
results at capturing global contextual relationships, which are crucial for comprehensive medical
image interpretation [34]. We propose developing a hybrid architecture that strategically combines the
strengths of both paradigms: leveraging Transformers’ superior global context modeling capabilities
alongside CNNs’ proven effectiveness in local feature extraction and spatial relationship detection [23].
This integrated approach could potentially overcome the individual limitations of each architecture
while maximizing their complementary strengths.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Evaluation and Validation Framework</title>
        <p>Future research will also focus on developing more comprehensive evaluation methodologies that better
capture the nuanced requirements of medical image captioning systems. This includes establishing
stronger correlations between automated metrics and clinical utility assessments, potentially through
expanded human expert evaluation protocols and task-specific evaluation criteria.</p>
        <p>These improvements, grounded in our current performance analysis, ongoing research efforts, and
emerging research trends, provide a clear roadmap for advancing the JJ-VMed system toward a more
robust, clinically accurate, and explainable medical image captioning system.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 and Claude for grammar and
spelling checks. After using these tools/services, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Online Resources</title>
      <p>The result of this work is available via
• GitHub Repo: https://github.com/Jangulo7/med_explain_ja
• LLaVA-LLaMA 3 8B Fine-Tuned Model: https://huggingface.co/JoVal26/ja-med-clef-model
• LLaVA-Mistral 7B Fine-Tuned Model: https://huggingface.co/JoVal26/ja-clefmed-model</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF medical 2025 - medical concept detection and interpretable caption generation</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <year>2024</year>
          , pp.
          <fpage>26286</fpage>
          -
          <lpage>26296</lpage>
          . doi:10.1109/cvpr52733.2024.02484.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Grad-CAM: Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          . doi:10.1109/ICCV.2017.74.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name><given-names>L.-D.</given-names> <surname>Ştefan</surname></string-name>,
          <string-name><given-names>M.-G.</given-names> <surname>Constantin</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Dogariu</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Kovalev</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Damm</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Rückert</surname></string-name>,
          <string-name><given-names>A. Ben</given-names> <surname>Abacha</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>García Seco de Herrera</surname></string-name>,
          <string-name><given-names>C. M.</given-names> <surname>Friedrich</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bloch</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Brüngel</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Idrissi-Yaghir</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Schäfer</surname></string-name>,
          <string-name><given-names>C. S.</given-names> <surname>Schmidt</surname></string-name>,
          <string-name><given-names>T. M. G.</given-names> <surname>Pakull</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Bracke</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Pelka</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Eryilmaz</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Becker</surname></string-name>,
          <string-name><given-names>W.-W.</given-names> <surname>Yim</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Codella</surname></string-name>,
          <string-name><given-names>R. A.</given-names> <surname>Novoa</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Malvehy</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Dimitrov</surname></string-name>,
          <string-name><given-names>R. J.</given-names> <surname>Das</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Xie</surname></string-name>,
          <string-name><given-names>H. M.</given-names> <surname>Shan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Nakov</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Koychev</surname></string-name>,
          <string-name><given-names>S. A.</given-names> <surname>Hicks</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gautam</surname></string-name>,
          <string-name><given-names>M. A.</given-names> <surname>Riegler</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Thambawita</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Halvorsen</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fabre</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Macaire</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Lecouteux</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Schwab</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Potthast</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Heinrich</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kiesel</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Wolter</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Stein</surname></string-name>,
          <article-title>Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science (LNCS)</source>
          , Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Amos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ripple</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Humphreys</surname>
          </string-name>
          ,
          <article-title>UMLS users and uses: a current overview</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>27</volume>
          (
          <year>2020</year>
          )
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          . URL: https://doi.org/10.1093/jamia/ocaa084. doi:10.1093/jamia/ocaa084.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology objects in context version 2, an updated multimodal image dataset</article-title>
          ,
          <source>Scientific Data</source>
          <volume>11</volume>
          (
          <year>2024</year>
          ).
          doi:10.1038/s41597-024-03496-6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>Medical SAM adapter: Adapting segment anything model for medical image segmentation</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>102</volume>
          (
          <year>2025</year>
          )
          <fpage>103547</fpage>
          . doi:10.1016/j.media.2025.103547.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Mazurowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Konz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Segment anything model for medical image analysis: An experimental study</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>89</volume>
          (
          <year>2023</year>
          )
          <fpage>102918</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1361841523001780. doi:10.1016/j.media.2023.102918.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>An overview of the JPEG 2000 still image compression standard</article-title>
          ,
          <source>Signal Processing: Image Communication</source>
          <volume>17</volume>
          (
          <year>2002</year>
          )
          <fpage>3</fpage>
          -
          <lpage>48</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0923596501000248. doi:10.1016/S0923-5965(01)00024-8.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Meta</surname>
          </string-name>
          ,
          <source>Meta Llama 3.1</source>
          , https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>XTuner Contributors</surname>
          </string-name>
          ,
          <article-title>XTuner: A toolkit for efficiently fine-tuning LLM</article-title>
          , https://github.com/InternLM/xtuner,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hinck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Holtermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Olson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhiwandiwalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lauscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lal</surname>
          </string-name>
          ,
          <article-title>Why do LLaVA vision-language models reply to images in English?</article-title>
          (
          <year>2024</year>
          )
          <fpage>13402</fpage>
          -
          <lpage>13421</lpage>
          . doi:10.18653/v1/2024.findings-emnlp.783.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Aftab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Davy</surname>
          </string-name>
          ,
          <article-title>Tailored-LLaMA: Optimizing few-shot learning in pruned LLaMA models with task-specific prompts</article-title>
          , in:
          <string-name>
            <given-names>U.</given-names>
            <surname>Endriss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bugarín-Diz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Alonso-Moral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Heintz</surname>
          </string-name>
          (Eds.),
          <source>ECAI 2024 - 27th European Conference on Artificial Intelligence, Including 13th Conference on Prestigious Applications of Intelligent Systems, PAIS</source>
          <year>2024</year>
          ,
          Proceedings
          , IOS Press BV, Santiago de Compostela, Spain,
          <year>2024</year>
          , pp.
          <fpage>3844</fpage>
          -
          <lpage>3850</lpage>
          . URL: https://researchprofiles.tudublin.ie/en/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>