<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UIT-Oggy at ImageCLEFmedical 2025 Caption: CSRA-Enhanced Concept Detection and BLIP-Driven Vision-Language Captioning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gia-Phuc Bui-Hoang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>My-Huyen Dinh-Doan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Van-Minh Luong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien B. Nguyen-Tat</string-name>
          <email>thienntb@uit.edu.vn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents the development and evaluation of two deep learning models for the ImageCLEFmedical 2025 challenge, targeting automated medical concept detection and diagnostic captioning. The primary goal was to create practical models that enhance the accuracy and clinical relevance of automated radiological reporting. For the concept detection task, a novel dual-branch architecture named MedCSRA (Medical Class-Specific Residual Attention) was proposed. It integrates a Class-Specific Residual Attention branch with a Global branch, using a ResNet-101 backbone, to effectively balance localized and global visual features. For the caption prediction task, a pre-trained BLIP (Bootstrapping Language-Image Pre-training) model was fine-tuned on the competition's dataset to adapt its powerful vision-language capabilities to the medical domain. Both submissions achieved strong results in the official competition. The MedCSRA model secured a fourth-place ranking in the concept detection task, with a primary F1-score of 0.5613. In the captioning task, the fine-tuned BLIP model also achieved a fourth-place ranking, with a competitive overall score of 0.3554 and strong semantic coherence, as measured by a BERTScore (Recall) of 0.5951. Our results demonstrate two key findings. First, a custom dual-branch architecture that explicitly balances local and global context is a highly effective strategy for multi-label concept detection. Second, standard fine-tuning of a large pre-trained vision-language model like BLIP is sufficient to achieve a top-tier ranking in semantic captioning metrics, though challenges in clinical factuality, measured by metrics like UMLS (Unified Medical Language System) Concept F1, remain. These findings validate our approaches as competitive solutions for complex biomedical image analysis tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF 2025</kwd>
        <kwd>Medical Image Processing</kwd>
        <kwd>Concept Detection</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>Radiology Images</kwd>
        <kwd>CSRA</kwd>
        <kwd>BLIP</kwd>
        <kwd>Vision-Language Models</kwd>
        <kwd>ResNet-101</kwd>
        <kwd>DenseNet-121</kwd>
        <kwd>EfficientNetB4</kwd>
        <kwd>EfficientNetB5</kwd>
        <kwd>Transformer Models</kwd>
        <kwd>Multimodal Learning</kwd>
        <kwd>Diagnostic Captioning</kwd>
        <kwd>Multilabel Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, machine learning—particularly deep learning—has been a driving force behind
substantial advancements in biomedicine. As the volume of medical imaging data continues to grow rapidly,
there is a rising need for intelligent systems that can extract meaningful information, assist with clinical
decision-making, and streamline workflows in healthcare environments.</p>
      <p>One key area in this domain is diagnostic captioning, which involves generating descriptive,
diagnostic-level text based on medical images. This task holds great promise in assisting clinicians by
improving reporting efficiency and reducing the risk of human error—especially for less experienced
practitioners. Rather than replacing human expertise, these systems are designed to augment and
support the diagnostic process by providing preliminary insights that guide medical professionals
toward faster and more accurate decisions.</p>
      <p>
        ImageCLEF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an annual evaluation campaign, provides a structured platform for advancing research
in multimodal machine learning. Among its major tracks, ImageCLEFmedical addresses biomedical
image analysis through a series of challenges, including diagnostic captioning [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Our team, UIT-Oggy, participated in the ImageCLEFmedical 2025 Caption task, which consists of two
complementary subtasks: medical concept detection and caption prediction. In the concept detection
task, the goal is to identify and extract medically relevant terms from an image, which can enhance
indexing, retrieval, and diagnostic support. The caption prediction task—also referred to as diagnostic
captioning—focuses on generating coherent, informative, and medically accurate descriptions of patient
conditions and anatomical structures visible in the image.</p>
      <p>While diagnostic captioning remains a challenging problem due to the complexity of medical
language and visual interpretation, it represents a transformative tool in modern clinical workflows. By
providing initial report drafts and highlighting important image features, these models can reduce
report turnaround times and help clinicians manage increasing workloads. At the same time, concept
detection ensures that critical medical terms are not overlooked and serves as a foundation for structured
reporting and semantic image understanding.</p>
      <p>
        In this paper, we detail our methods and results from the ImageCLEFmedical 2025 challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We
explore Transformer-based architectures for captioning and introduce a novel attention-based model
for concept detection, aiming to contribute practical solutions that can enhance diagnostic support
systems.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>This section reviews prior work in two key areas relevant to our participation in the ImageCLEFmedical
2025 challenge. We begin with a broad overview of deep learning’s impact on medical imaging before
delving into task-specific advancements.</p>
      <p>
        A comprehensive evaluation by Nguyen-Tat et al.[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] highlights the critical role of robust
preprocessing pipelines and the effectiveness of deep learning methods across multiple modalities,
establishing a baseline for developing high-performance models. Underpinning many advanced applications
is the fundamental task of medical image segmentation, where recent innovations include hybrid
architectures combining the U-Net architecture (so-named for its U-shape), attention mechanisms, and
Transformer models for precise brain tumor segmentation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as well as weakly-supervised approaches
like Qmaxvit-unet+ for scribble-based segmentation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The task of medical image captioning is distinguished from its general-domain counterpart by its
stringent requirements for clinical accuracy and domain-specific knowledge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. While early systems
relied on template- or retrieval-based methods [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], the field was revolutionized by deep learning.
Foundational approaches employed a Convolutional Neural Network-Recurrent Neural Network
(CNNRNN) encoder-decoder architecture, using networks like DenseNet for image encoding and hierarchical
Long Short-Term Memory (LSTM) networks for text generation, as demonstrated in influential works
like TieNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        A pivotal advancement was the integration of attention mechanisms, enabling models to focus on
salient image regions. In an influential study, Jing et al.[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed a hierarchical attention model
that significantly improved caption coherence and accuracy. To address challenges with rare clinical
findings, some studies have explored hybrid models, such as the reinforcement learning-based agent by
Li et al.[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which intelligently alternates between retrieving templates and generating novel sentences.
      </p>
      <p>
        The current state-of-the-art is increasingly driven by Large Language Models (LLMs) and
pretrained vision-language models. Notably, models like Med-PaLM (Medical Pathways Language Model)
have demonstrated impressive capabilities by achieving a passing score on the US Medical Licensing
Examination [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Concurrently, models like BLIP (Bootstrapping Language-Image Pre-training) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
leverage bootstrapping mechanisms to learn from noisy web-scale data, providing a strong foundation
that can be fine-tuned for medical tasks.
      </p>
      <p>
        Parallel to captioning, the automatic annotation of medical images with multiple concepts is a critical
multi-label classification task. As documented in the survey by Litjens et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], this field has shifted
from traditional methods using hand-crafted features to deep learning-based approaches.
      </p>
      <p>
        A key development in this area was the release of the large-scale ChestX-ray14 dataset, which
enabled the creation of high-performance models for thoracic disease detection [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The influential
CheXNet model, a DenseNet-based architecture trained on this dataset, demonstrated the potential to
achieve radiologist-level performance. However, this approach is tailored to a specific modality. More
directly related to our work, several studies have explored advanced attention mechanisms. For instance,
the Residual Attention Network introduced by Wang et al.[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] demonstrated how stacking attention
modules within residual blocks can significantly improve classification performance. In parallel, Li et
al.[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] proposed a framework with a class-specific attention module. While these methods show robust
performance, they often do not explicitly balance global image context with localized attention—a key
gap that our proposed MedCSRA model aims to address.
      </p>
      <p>
        The ImageCLEF 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] challenge, part of the broader CLEF 2025 (Conference and Labs of the
Evaluation Forum) initiative, serves as a crucial benchmark for advancing both tasks, providing standardized
datasets and a rigorous evaluation framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our work builds directly on this foundation,
contributing a novel dual-branch architecture for concept detection that addresses the nuanced challenge
of integrating global and local features, a limitation observed even in top-performing systems from
previous years.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        Our work utilizes the official dataset provided for the ImageCLEFmedical 2025 Caption task. This
dataset is a curated version of the Radiology Objects in Context (ROCO) v2 dataset[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], which serves as
its foundation. The original ROCOv2 is a large-scale, "in-the-wild" benchmark comprising over 81,000
image-caption pairs sourced from biomedical publications in the PubMed Central Open Access corpus.
For the 2025 competition, this dataset was specifically updated according to the established ImageCLEF
procedure: a new, unseen test set was introduced, while the test set from the previous year became the
current validation set, and the former validation set was integrated into the training data. This process
and the final dataset composition are detailed in the official task overview paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The resulting 2025
training set consists of 70,108 medical images, primarily from radiology. A key feature of this dataset is
its dual-annotation structure: each image is paired with both a set of UMLS (Unified Medical Language
System) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]-based medical concept labels and a free-text diagnostic caption. This design facilitates
the joint training of models for concept detection and caption generation. The dataset encompasses
various imaging modalities, such as X-rays and Computed Tomography (CT) scans, challenging models
to generalize across different diagnostic scenarios. Examples of images from the dataset are presented
in Figure 1.
      </p>
      <p>As shown in Figure 2, the distribution of caption lengths in the training set is highly right-skewed.
Most captions are concise, typically fewer than 30 words, while a small number are extremely long,
with the longest exceeding 800 words. This long-tailed behavior suggests that, while many image
descriptions are brief, models must also be robust enough to handle verbose and complex medical text.</p>
      <p>In Figure 3, we visualize the number of concepts annotated per image. The majority of images contain
1 to 4 concepts, with a sharp decline in frequency beyond that. Only a few cases involve 10 or more
concepts. This confirms the multilabel nature of the task and indicates that models must be capable of
identifying both sparse and dense sets of medical concepts depending on the image complexity.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Image Pre-processing</title>
      <p>Image preprocessing is a pivotal step in preparing medical images for captioning tasks, ensuring that
input data is standardized and optimized for model performance. The approach focuses on transforming
raw images into a format suitable for a vision-language model, balancing computational efficiency with
the preservation of essential visual information.</p>
      <sec id="sec-4-1">
        <title>4.1. Image Loading and Resizing</title>
        <p>The images are loaded from the dataset directory, specifically handling JPEG files, using a computer
vision library. Each image is resized to a uniform resolution of 224x224 pixels. This fixed size aligns
with the input requirements of the vision-language model, ensuring compatibility with its pre-trained
vision component. Standardizing image dimensions reduces computational complexity and facilitates
consistent feature extraction across diverse medical images.</p>
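        <p>A minimal sketch of this step is shown below; the use of OpenCV and the area-based interpolation mode are assumptions for illustration, as the specific library and settings are not mandated above.</p>
        <preformat>
# Minimal sketch of the loading/resizing step; the use of OpenCV and the
# interpolation mode are assumptions, not part of the described pipeline.
import cv2

def load_and_resize(image_path: str, size: int = 224):
    """Load a JPEG image and resize it to the fixed 224x224 input resolution."""
    image = cv2.imread(image_path)                  # image as a BGR uint8 array
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # convert to RGB for the vision encoder
    return cv2.resize(image, (size, size), interpolation=cv2.INTER_AREA)
</preformat>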
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Encoding with Processor</title>
        <p>A specialized processor, designed for the vision-language model, is used to prepare both images and
their corresponding captions. This processor handles the following tasks:
• Image Normalization: Pixel values are scaled and normalized to match the expected input range
of the pre-trained model, ensuring consistent processing across varied image sources.
• Text Tokenization: Captions are tokenized using the model’s tokenizer, with padding applied to a
maximum length of 200 tokens and truncation applied to sequences that exceed this limit. This ensures
that variable-length captions are uniformly formatted.
• Tensor Conversion: Both images and tokenized captions are converted into tensors, with
unnecessary dimensions removed to match the model’s input requirements. The resulting data includes
image tensors, tokenized caption IDs, and attention masks indicating valid tokens, which are
critical for both training and inference.</p>
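        <p>A hedged sketch of this encoding step is given below, assuming the Hugging Face BlipProcessor for the checkpoint used in Section 5.1.2; the wrapper function and exact arguments are illustrative rather than the exact implementation.</p>
        <preformat>
from PIL import Image
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")

def encode_example(image_path: str, caption: str):
    """Encode one image-caption pair into model-ready tensors."""
    image = Image.open(image_path).convert("RGB")
    enc = processor(
        images=image,
        text=caption,
        padding="max_length",   # pad captions to a fixed length
        truncation=True,        # truncate overly long captions
        max_length=200,         # maximum of 200 tokens, as described above
        return_tensors="pt",    # return PyTorch tensors
    )
    # Drop the singleton batch dimension so samples can be collated later.
    return {k: v.squeeze(0) for k, v in enc.items()}
</preformat>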
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Proposed Method</title>
      <sec id="sec-5-1">
        <title>5.1. Caption Prediction</title>
        <p>
          For the caption prediction task, the BLIP model was employed as the foundational architecture [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
BLIP is a state-of-the-art vision-language framework that unifies understanding and generation tasks.
Its robust performance, derived from pre-training on large-scale, noisy web data, makes it a strong
candidate for adaptation to specialized domains like medical imaging.
        </p>
        <sec id="sec-5-1-1">
          <title>5.1.1. Model Architecture</title>
          <p>
            The BLIP architecture integrates a Vision Transformer (ViT) [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] as its image encoder and a BERT-based
transformer [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] as its text encoder/decoder. The ViT processes an input image by dividing it into a
sequence of patches and encoding them into rich visual representations, capturing both global and
local features. This is particularly crucial for medical images where diagnostic clues can be subtle and
localized. The text decoder, conditioned on these visual features, generates the caption autoregressively.
A key strength of BLIP is its joint training strategy, which optimizes both the encoder and decoder to
learn aligned multimodal representations. This pre-trained foundation is then leveraged during the
fine-tuning stage, where the model adapts to the specific terminology and visual characteristics of the
medical domain. The architecture is illustrated in Figure 4.
          </p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Fine-tuning and Implementation Details</title>
          <p>
            The captioning model was implemented using the PyTorch framework and the Hugging
Face Transformers library [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]. The starting point for the experiments was the pre-trained
Salesforce/blip-image-captioning-large checkpoint, publicly available on the Hugging Face
Hub.
          </p>
          <p>
            This model was subsequently fine-tuned on the official ImageCLEFmedical 2025 Caption [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] training
set for 2 epochs. The AdamW optimizer [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] was employed with a learning rate of 1 × 10⁻⁵. Due
to hardware limitations, a batch size of 4 was used. All training was conducted on a single NVIDIA
T4 GPU. During the fine-tuning process, all parameters of the pre-trained model were unfrozen and
optimized to maximally adapt the model’s representations to the medical domain.
          </p>
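          <p>The following condensed sketch reproduces the reported settings (AdamW, learning rate 1 × 10⁻⁵, batch size 4, 2 epochs); the train_dataset object is assumed to come from the preprocessing in Section 4, and the loop omits logging and checkpointing.</p>
          <preformat>
import torch
from torch.utils.data import DataLoader
from transformers import BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)  # train_dataset: assumed dataset object

model.train()
for epoch in range(2):  # fine-tuned for 2 epochs
    for batch in loader:
        outputs = model(
            pixel_values=batch["pixel_values"].to(device),
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            labels=batch["input_ids"].to(device),  # caption tokens double as targets
        )
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
</preformat>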
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Concept Detection</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Overview of the MedCSRA Architecture</title>
          <p>
            For the medical concept detection task, a novel dual-branch architecture named MedCSRA is proposed
for multi-label image classification. The core innovation of MedCSRA lies in its hybrid approach, which
integrates both global and local features to create a more comprehensive image representation. As
illustrated in Figure 5, the model consists of a shared visual backbone and two parallel processing
branches: a Global Branch and a Class-Specific Residual Attention (CSRA) Branch. The CSRA branch is
inspired by the attention mechanism introduced by Fang et al. [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], which learns class-specific feature
maps to focus on discriminative regions for each concept. This mechanism was adapted for the medical
domain. Concurrently, a newly designed Global Branch processes the entire feature map to capture
broader contextual information. The outputs (logits) from both branches are then fused, and a sigmoid
function is applied to produce the final multi-label predictions.
          </p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Visual Feature Extraction</title>
          <p>
            The MedCSRA model employs a ResNet-101 architecture as its visual feature extraction backbone [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ].
ResNet-101, a 101-layer Residual Network, was chosen for its proven effectiveness in various computer
vision tasks, including medical image analysis. Its core innovation lies in the use of "residual
connections," which reformulate layers as learning residual functions with reference to the layer inputs. This
design effectively mitigates the vanishing gradient problem in very deep networks and addresses the
degradation issue where deeper models can perform worse than shallower ones. To leverage transfer
learning, the ResNet-101 backbone was initialized with weights pre-trained on the large-scale ImageNet
dataset [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ]. For domain adaptation to our specific task, a fine-tuning strategy was applied: while the
majority of the network’s early layers were kept frozen, the final convolutional block (layer4) was
unfrozen and trained on the ImageCLEFmedical 2025 dataset. This approach allows the model to retain
robust, low-level features from ImageNet while specializing its high-level feature representations for
the unique patterns found in radiology images.
          </p>
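          <p>A sketch of this backbone setup with torchvision (the concrete implementation details are assumptions) is shown below: all parameters are frozen except those of layer4, and the pooling and classification head are dropped so that spatial feature maps are passed on to the two branches.</p>
          <preformat>
import torch.nn as nn
from torchvision import models

def build_backbone() -> nn.Sequential:
    """ImageNet-pretrained ResNet-101 with only layer4 left trainable."""
    resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
    for name, param in resnet.named_parameters():
        param.requires_grad = name.startswith("layer4")  # freeze everything except the last block
    # Drop the average-pooling and classification head so the spatial feature
    # maps of the last convolutional block are exposed to the two branches.
    return nn.Sequential(*list(resnet.children())[:-2])
</preformat>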
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Dual-Branch Design</title>
          <p>
            MedCSRA processes spatial features through two branches: the Class-Specific Residual Attention
Branch, adapted from reference [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], and our novel Global Branch.
          </p>
        </sec>
        <sec id="sec-5-2-4">
          <title>5.2.3.1. Global Branch</title>
          <p>The Global Branch, our original contribution, enhances the model’s ability to capture global
semantic patterns across the entire image. Spatial features extracted from the ResNet-101 backbone
are subjected to Global Average Pooling, reducing the spatial dimensions to a compact vector of size
[batch_size, in_features]. This vector encapsulates overarching contextual information, which is then
passed through a fully connected layer to generate global logits of size [batch_size, num_classes]. The
use of Global Average Pooling ensures computational efficiency while preserving critical global context,
making it effective for detecting medical concepts with diffuse patterns. The design leverages the
backbone’s hierarchical features, complementing the localized focus of the CSRA Branch.</p>
          <p>Let $\mathcal{H} \in \mathbb{R}^{1024 \times H \times W}$ denote the spatial feature map extracted by the backbone for each sample, where $H$ and $W$ are the height and width and 1024 reflects the output channels of ResNet-101's last convolutional block. Global Average Pooling is applied to compute a global feature vector:
$$z_{\text{global}} = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathcal{H}[:, i, j],$$
where $z_{\text{global}} \in \mathbb{R}^{1024}$ is the pooled vector for each sample in the batch. This vector is then passed through a fully connected layer to produce global logits:
$$\ell_{\text{global}} = W_{\text{fc}} \, z_{\text{global}} + b_{\text{fc}},$$
where $W_{\text{fc}} \in \mathbb{R}^{C \times 1024}$ and $b_{\text{fc}} \in \mathbb{R}^{C}$ are the weights and bias of the linear layer, and $C$ is the number of classes.</p>
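          <p>A compact PyTorch sketch of the Global Branch defined by these equations is given below; the module and variable names are illustrative.</p>
          <preformat>
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Global Average Pooling followed by a fully connected layer."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # averages over the H x W grid
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: [batch_size, in_features, H, W]
        z_global = self.pool(feature_map).flatten(1)  # [batch_size, in_features]
        return self.fc(z_global)                      # global logits [batch_size, num_classes]
</preformat>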
        </sec>
        <sec id="sec-5-2-5">
          <title>5.2.3.2. Class-Specific Residual Attention Branch</title>
          <p>
            Inspired by prior work [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], the Class-Specific Residual Attention Branch (CSRA) in MedCSRA
focuses on localized pathological regions by applying a class-specific attention mechanism. Within the
model, the spatial feature maps (with 1024 channels from the ResNet-101 backbone) are flattened into a
tensor of shape [batch_size, 1024, H × W]. A linear projection, implemented as a fully connected layer
with 1024 output features, generates an attention map, which is then normalized using the sigmoid
function.
          </p>
          <p>This attention map is used to compute attention-weighted features through a tensor operation,
implemented via torch.einsum, resulting in a vector of shape [batch_size, 1024]. These weighted
features are then mapped to logits of size [batch_size, num_classes] using another fully connected layer.
This process allows CSRA to emphasize class-specific regions, enhancing the model’s ability to detect
subtle pathological patterns. The resulting logits are subsequently combined with those from the Global
Branch to produce the final predictions.</p>
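          <p>The sketch below illustrates one possible PyTorch reading of this branch; the exact projection shapes and the scaling of the attention-weighted sum are interpretations of the description above, not the released code.</p>
          <preformat>
import torch
import torch.nn as nn

class CSRABranch(nn.Module):
    """Class-specific residual attention branch (one possible reading of the text)."""

    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.attn = nn.Linear(in_features, in_features)  # projection producing the attention map
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feature_map.shape
        flat = feature_map.view(b, c, h * w)                   # [batch_size, C, H*W]
        attn = torch.sigmoid(self.attn(flat.transpose(1, 2)))  # [batch_size, H*W, C]
        # Attention-weighted aggregation of the spatial features (via einsum).
        weighted = torch.einsum("bsc,bcs->bc", attn, flat) / (h * w)  # [batch_size, C]
        return self.fc(weighted)                               # [batch_size, num_classes]
</preformat>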
        </sec>
        <sec id="sec-5-2-6">
          <title>5.2.4. Fusion and Prediction</title>
          <p>The final logits are computed by fusing the outputs of the two branches using a weighted combination:
$$\ell_{\text{combined}} = (1 - \lambda) \cdot \ell_{\text{global}} + \lambda \cdot \ell_{\text{csra}},$$
where $\lambda = 0.1$, a value set based on the model implementation. This weighted fusion balances the
global context from the Global Branch and the localized attention from the CSRA Branch.</p>
          <p>A threshold is applied to the sigmoid probabilities of the fused logits to determine the predicted labels. This threshold is
automatically selected by evaluating performance on the validation set across a range of values from 0.05 to
0.55. The optimal threshold determined during training was 0.35, corresponding to the highest achieved
F1-score.</p>
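          <p>An illustrative implementation of the fusion and threshold search is given below; evaluate_f1 is a hypothetical helper that computes the F1-score on the validation set.</p>
          <preformat>
import numpy as np
import torch

LAMBDA = 0.1  # weight of the CSRA branch in the fusion

def fuse_logits(global_logits: torch.Tensor, csra_logits: torch.Tensor) -> torch.Tensor:
    """Weighted combination of the two branches' logits."""
    return (1 - LAMBDA) * global_logits + LAMBDA * csra_logits

def select_threshold(val_probs: np.ndarray, val_labels: np.ndarray) -> float:
    """Sweep thresholds from 0.05 to 0.55 and keep the one with the best validation F1."""
    best_t, best_f1 = 0.05, 0.0
    for t in np.arange(0.05, 0.551, 0.05):
        f1 = evaluate_f1(val_probs >= t, val_labels)  # evaluate_f1: hypothetical metric helper
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t
</preformat>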
        </sec>
        <sec id="sec-5-2-7">
          <title>5.2.5. Loss Function</title>
          <p>Given that this is a multilabel classification task, we use the Binary Cross-Entropy (BCE) loss, computed
independently for each class:
$$\mathcal{L} = -\sum_{c=1}^{C} \left[ y_c \cdot \log(\hat{y}_c) + (1 - y_c) \cdot \log(1 - \hat{y}_c) \right] \qquad (1)$$
where $C$ is the number of medical concepts, $y_c \in \{0, 1\}$ is the ground-truth label for class $c$, and $\hat{y}_c$ is the predicted probability. This
loss function ensures the model learns to predict multiple labels accurately, aligning with the task’s
requirements.</p>
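          <p>In PyTorch, this objective corresponds to the standard BCEWithLogitsLoss applied to the fused logits and the multi-hot ground-truth vector, as in the toy example below (batch and class sizes are illustrative).</p>
          <preformat>
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()              # applies the sigmoid internally

logits = torch.randn(16, 8)                     # fused logits for a toy batch (16 samples, 8 concepts)
targets = torch.randint(0, 2, (16, 8)).float()  # multi-hot ground-truth labels
loss = criterion(logits, targets)               # per-class binary cross entropy, averaged
</preformat>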
        </sec>
        <sec id="sec-5-2-8">
          <title>5.2.6. Implementation and Training Details</title>
          <p>The MedCSRA model was implemented in PyTorch. For optimization, the Adam optimizer was used
with an initial learning rate of 1 × 10⁻⁴, adjusted via a CosineAnnealing scheduler over a maximum
of 100 epochs. A weight decay of 1 × 10⁻⁵ was applied to mitigate overfitting. To ensure reproducibility,
all experiments were conducted with a fixed random seed of 42.</p>
          <p>The model was trained with a batch size of 16 on a single NVIDIA A100 GPU with 40 GB of VRAM.
For the ResNet-101 backbone, only the final convolutional block was unfrozen for fine-tuning, while
earlier layers retained their pre-trained ImageNet weights. An early stopping mechanism was employed
with a patience of 5 epochs, monitoring the F1-score on the validation set. Based on this criterion, the
training process for the ResNet-101-based model concluded at epoch 34.</p>
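          <p>The optimization setup can be summarized by the following sketch, where model is an instance of MedCSRA and train_one_epoch and validate_f1 are hypothetical helpers standing in for the full training and validation loops.</p>
          <preformat>
import torch

torch.manual_seed(42)  # fixed seed for reproducibility

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)  # model: MedCSRA instance (assumed)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

best_f1, patience, bad_epochs = 0.0, 5, 0
for epoch in range(100):
    train_one_epoch(model, optimizer)   # hypothetical helper: one pass over the training set
    scheduler.step()
    val_f1 = validate_f1(model)         # hypothetical helper: F1-score on the validation set
    if val_f1 > best_f1:
        best_f1, bad_epochs = val_f1, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # early stopping with a patience of 5 epochs
            break
</preformat>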
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiment Results</title>
      <p>This section details the performance of our proposed models on the official test sets of the
ImageCLEFmedical 2025 challenge. We present the results for the concept detection and caption prediction
subtasks separately.</p>
      <sec id="sec-6-1">
        <title>6.1. Caption Prediction Results</title>
        <p>For the caption prediction subtask, a fine-tuned BLIP model was submitted (Team: UIT-Oggy, Run ID:
1914). The comprehensive evaluation results, provided by the challenge organizers across multiple
metrics, are summarized in Table 1.
The model achieved an overall score of 32.11%, indicating a moderate level of performance. While the
high Similarity (87.98%) and BERTScore (59.51%) suggest a good semantic alignment with the reference
captions, other metrics revealed significant challenges. The low UMLS Concept F1 (16.72%) and Factuality
Average (13.46%) scores highlight that the model struggled to maintain clinical accuracy and factual
correctness, which is a critical limitation. This suggests that while pre-trained vision–language models
like BLIP provide a strong foundation, extensive domain-specific adaptation is required to overcome
the nuances of medical reporting.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Concept Detection Results</title>
        <p>For the concept detection subtask, our primary submission utilized the MedCSRA model with a
ResNet-101 backbone. The official performance of this submission (Team: UIT-Oggy, Run ID: 1892) on the final
test set, as evaluated by the challenge organizers, is presented in Table 2.</p>
        <p>The model achieved a primary F1-score of 0.5613, securing a fourth-place ranking among all
international teams participating in the challenge. This strong competitive result serves as a direct validation of
our proposed dual-branch architecture. The high performance suggests that the model’s ability to fuse
global contextual features (from the Global Branch) with localized, class-specific details (from the CSRA
Branch) is a highly effective strategy for this complex multi-label classification task. It demonstrates
that explicitly balancing these two types of information allows the model to successfully identify a
wide range of medical concepts, from diffuse abnormalities to subtle, localized findings, within a single
unified framework.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <sec id="sec-7-1">
        <title>7.1. Conclusions</title>
        <p>In this paper, we presented the methods and results of the UIT-Oggy team’s participation in the
ImageCLEFmedical 2025 challenge.</p>
        <p>For the medical concept detection task, we introduced MedCSRA, a novel dual-branch architecture
designed to integrate both global and localized visual features. Our experiments demonstrated that this
approach is highly effective, with the MedCSRA model using a ResNet-101 backbone achieving a
fourth-place ranking in the official competition (F1-score: 0.5613). This result validates that balancing
class-specific attention with global context is a robust strategy for multi-label medical image classification.</p>
        <p>For the caption prediction task, we adapted and fine-tuned a pre-trained BLIP model. While the model
showed a strong capability for generating semantically coherent text (Similarity: 87.98%), our analysis
revealed significant limitations in its clinical and factual accuracy (UMLS Concept F1: 16.72%). This
finding underscores a key challenge: despite the power of large-scale pre-training, achieving clinical
reliability in generative models requires more advanced domain-specific adaptation techniques.</p>
        <p>Overall, our work contributes an effective architecture for concept detection and provides a critical
analysis of the current state of vision-language models in the context of diagnostic captioning, paving
the way for future research directions.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Future Work</title>
        <p>
          Based on the findings of this study, several promising research directions are identified. For our
medical concept detection model, MedCSRA, future work will focus on addressing the challenge of class
imbalance, a common issue in medical datasets where rare but critical concepts are underrepresented.
We plan to incorporate Focal Loss [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] into the training process. This will compel the model to pay
more attention to hard-to-classify, infrequent concepts, aiming to enhance its diagnostic utility while
building upon the successful dual-branch architecture.
        </p>
        <p>
          For medical image captioning, our analysis revealed that while the fine-tuned BLIP model achieves
strong semantic coherence, its clinical and factual accuracy remains a significant limitation. To bridge
this gap, a primary direction is to infuse external medical knowledge into the model. Furthermore,
to improve the low factuality score, we propose introducing a fact-checking module, inspired by
claim verification systems [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], which would validate generated statements against a trusted medical
knowledge base.
        </p>
        <p>
          Finally, to enhance the model’s ability to ground its descriptions in visual evidence and improve
the AlignScore, we intend to explore multi-scale attention mechanisms like the Convolutional Block
Attention Module (CBAM) [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. This would allow the model to simultaneously focus on both broad
anatomical structures and fine-grained pathological details. By pursuing these targeted enhancements,
we aim to develop more robust and clinically reliable models for automated medical image analysis,
addressing the specific challenges identified in each task.
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used Gemini 2.5 Pro and ChatGPT-4o in order to check grammar
and sentence structure. After using these tools, we reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This research is funded by the University of Information Technology, Vietnam National University
Ho Chi Minh City, under grant number D4-2025-04.
To ensure full reproducibility of our results, the source code for both models developed in this study is
publicly available on GitHub. The repositories are organized as follows:
• BLIP Fine-tuning,
• MedCSRA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>H. M.</given-names>
          </string-name>
          <string-name>
            <surname>Shan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Nakov</surname>
            , I. Koychev,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fabre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Lecouteux</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Heinrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wolter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
          </string-name>
          , Overview of imageclef 2025:
          <article-title>Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and
          <string-name>
            <surname>Interaction</surname>
          </string-name>
          ,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <source>Springer Lecture Notes in Computer Science (LNCS)</source>
          , Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>M. Friedrich, Overview of imageclefmedical 2025 - medical concept detection and interpretable caption generation</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Nguyen-Tat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <article-title>Evaluating pre-processing and deep learning methods in medical imaging: Combined efectiveness across multiple modalities</article-title>
          ,
          <source>Alexandria Engineering Journal</source>
          <volume>119</volume>
          (
          <year>2025</year>
          )
          <fpage>558</fpage>
          -
          <lpage>586</lpage>
          . doi:10.1016/j.aej.2025.01.090.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Nguyen-Tat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. N.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <article-title>Enhancing brain tumor segmentation in mri images: A hybrid approach using unet, attention mechanisms, and transformers</article-title>
          ,
          <source>Egyptian Informatics Journal</source>
          <volume>27</volume>
          (
          <year>2024</year>
          ). doi:10.1016/j.eij.2024.100528.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Nguyen-Tat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-A.</given-names>
            <surname>Vo</surname>
          </string-name>
          , P.-S. Dang, Qmaxvit-UNet+
          <article-title>: A query-based maxvit-unet with edge enhancement for scribble-supervised segmentation of medical images</article-title>
          ,
          <source>Computers in Biology and Medicine</source>
          <volume>187</volume>
          (
          <year>2025</year>
          )
          <article-title>109762</article-title>
          . doi:10.1016/j.compbiomed.2025.109762.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A survey on deep learning for medical image captioning</article-title>
          ,
          <source>arXiv preprint arXiv:2303.01151</source>
          (
          <year>2023</year>
          ). doi:10.48550/ARXIV.2303.01151.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <article-title>Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2497</fpage>
          -
          <lpage>2506</lpage>
          . doi:10.1109/CVPR.2016.273.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Sierra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Delisle</surname>
          </string-name>
          , H. Liu,
          <article-title>Intelligent word embeddings of free-text radiology reports</article-title>
          ,
          <source>in: Proceedings of the 2014 IEEE International Conference on Healthcare Informatics</source>
          , IEEE,
          <year>2014</year>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>119</lpage>
          . doi:10.1109/ICHI.2014.20.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          , Tienet:
          <article-title>Text-image embedding network for common thorax disease classification and reporting in chest x-rays</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>9049</fpage>
          -
          <lpage>9058</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>On the automatic generation of medical imaging reports</article-title>
          ,
          <source>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          (
          <year>2018</year>
          )
          <fpage>2577</fpage>
          -
          <lpage>2586</lpage>
          . doi:10.18653/v1/P18-1240.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Hybrid retrieval-generation reinforced agent for medical image report generation</article-title>
          ,
          <source>in: Advances in neural information processing systems (NeurIPS)</source>
          , volume
          <volume>31</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          , J. Wei, ...,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <article-title>Large language models encode clinical knowledge</article-title>
          ,
          <source>Nature</source>
          <volume>620</volume>
          (
          <year>2023</year>
          )
          <fpage>172</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          , BLIP:
          <article-title>Bootstrapping language-image pre-training for unified vision-language understanding and generation</article-title>
          ,
          <source>in: International Conference on Machine Learning (ICML)</source>
          , PMLR,
          <year>2022</year>
          , pp.
          <fpage>12763</fpage>
          -
          <lpage>12779</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Litjens</surname>
          </string-name>
          , et al.,
          <article-title>A survey on deep learning in medical image analysis</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>42</volume>
          (
          <year>2017</year>
          )
          <fpage>60</fpage>
          -
          <lpage>88</lpage>
          . URL: https://www.sciencedirect.com/science/article/abs/pii/S1361841517301135.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bagheri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          , ChestX-ray8:
          <article-title>Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2097</fpage>
          -
          <lpage>2106</lpage>
          . doi:10.1109/CVPR.2017.369.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Residual attention network for image classification</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          . doi:10.1109/CVPR.2017.337.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Learning to see through the haze: A novel framework for multi-label classification of radiology images</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1162</fpage>
          -
          <lpage>1167</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology objects in context version 2, an updated multimodal image dataset</article-title>
          ,
          <source>Scientific Data</source>
          <volume>11</volume>
          (
          <year>2024</year>
          ).
          doi:10.1038/s41597-024-03496-6.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          ,
          <article-title>The unified medical language system (UMLS): integrating biomedical terminology</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>32</volume>
          (
          <year>2004</year>
          )
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>von Platen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>International Conference on Learning Representations</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Residual attention: A simple but effective method for multi-label recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <year>2021</year>
          , pp.
          <fpage>15962</fpage>
          -
          <lpage>15971</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE conference on computer vision and pattern recognition</source>
          , IEEE,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>in: Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Kweon</surname>
          </string-name>
          , CBAM:
          <article-title>Convolutional block attention module</article-title>
          ,
          <source>ECCV</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>