<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2025 Working Notes, CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Detecting Concepts for Medical Images: Contributions of the DeepLens Team at IUST to ImageCLEFmedical Caption 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amir Hossein Salimi Rudsari</string-name>
          <email>salimi_amirhosein@comp.iust.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bahareh Kavousi Nejad</string-name>
          <email>bahareh_kavousi@comp.iust.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malihe Hajihosseini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sauleh Eetemadi</string-name>
          <email>sauleh@iust.ac.ir</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Iran University of Science and Technology</institution>
          ,
          <addr-line>Tehran</addr-line>
          ,
          <country country="IR">Iran</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3740</volume>
      <fpage>247</fpage>
      <lpage>258</lpage>
      <abstract>
        <p>The ImageCLEFmedical Caption Task 2025 challenge includes two subtasks: Concept Detection and Caption Prediction. This paper addresses the Concept Detection subtask, which involves automatically identifying relevant medical concepts from radiological images to support semantic image retrieval and clinical decision-making. We developed multiple deep learning models, including Convolutional Neural Networks (CNNs) combined with Feed-Forward Neural Networks (FFNNs), as well as a novel architecture named KClipMed, based on the K-Nearest Concept-Language-Image Pre-training (KCLIP) framework. KClipMed incorporates Top-k Concept Retrieval, a trainable Logit Temperature, and a Cross-Attention mechanism to enhance image-concept alignment. We also explored ensemble strategies; the best-performing ensemble, combining EfficientNet and DenseNet, achieved an F1 score of 57.66% on the test set, placing second in the competition with a margin of 1.2% behind the top-ranked team.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept Detection</kwd>
        <kwd>Medical Image Analysis</kwd>
        <kwd>CNN</kwd>
        <kwd>FFNN</kwd>
        <kwd>KClipMed</kwd>
        <kwd>Top-k Retrieval</kwd>
        <kwd>Cross-Attention</kwd>
        <kwd>Ensemble</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The ImageCLEF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] medical challenge is an integral part of ImageCLEF, an ongoing evaluation platform
initiated in 2003 under the Conference and Labs of the Evaluation Forum (CLEF). The primary goal of
ImageCLEF is to facilitate progress and innovation in methods used for indexing, retrieval, classification,
and annotation of multimodal content, with a particular emphasis on biomedical applications. Over the
years, ImageCLEFmedical has seen growing international engagement, considerably contributing to
the advancements in medical image processing and automatic captioning techniques [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>For the 2025 ImageCLEFmedical Caption Task, two distinct subtasks were introduced: Concept
Detection and Caption Prediction [4]. Concept Detection involves automatically assigning relevant medical
concepts to biomedical images, which supports semantic searches and assists medical practitioners by
offering preliminary structured insights [5]. The evaluation of system performance in this task was
based on binary F1 scores, measuring the overlap between predicted and ground-truth concept sets
[6]. Conversely, the Caption Prediction task focuses on generating detailed and accurate captions for
medical images, thereby providing diagnostic information and aiding clinicians in their workflow.</p>
      <p>Our group, DeepLens from Iran University of Science and Technology, participated specifically in
the Concept Detection subtask for the 2025 edition. Our approach involved exploring three different
modeling strategies. Initially, we utilized several Convolutional Neural Networks (CNNs) paired with
Feed-Forward Neural Networks (FFNNs) for classifying medical images into corresponding concepts.</p>
      <p>In addition, we developed a novel model named KClipMed based on K-Nearest
Concept-Language-Image Pre-training (KCLIP). The KClipMed architecture incorporates a Top-k Concept Retrieval
mechanism, Cross-Attention, and a trainable Logit Temperature module, aiming to improve the alignment
between visual features and medical concepts. This model was explored using two different image
encoders: Swin Transformer and Vision Transformer (ViT).</p>
      <p>Finally, we applied extensive ensemble approaches by aggregating predictions from all possible
combinations of our proposed models using union-based aggregation. Through these comprehensive
experimental strategies, we successfully secured second place in the Concept Detection subtask, closely
following the leading team with only a 1.2% gap.</p>
      <p>This paper outlines our methodological approaches, elaborates on the models we designed, discusses
the experimental outcomes, and considers the implications of our findings for future developments in
medical image analysis and concept-driven methodologies.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>A number of comprehensive reviews have summarized the role of deep learning in medical image
analysis [7], laying the groundwork for tasks like concept detection. Concept detection in medical
images aims to map each image to a set of relevant UMLS Concept Unique Identifiers (CUIs) for retrieval,
annotation, or clinical support. Prior work in this area can be grouped into three main technical
approaches that informed our system design.</p>
      <sec id="sec-2-1">
        <title>2.1. Convolutional Neural Networks</title>
        <p>Convolutional neural networks (CNNs) have long been the foundation for medical image classification
and tagging [8]. DenseNet architectures [9], in particular, became popular in radiology after their
success in thoracic disease detection, where their dense connectivity helped model subtle patterns in
chest X-rays [10]. EfficientNet models [11] were introduced as a more parameter-efficient alternative
that scales depth, width, and resolution in a balanced way, making them attractive for multi-label
medical image analysis where computational efficiency matters [12]. Prior studies have shown that both
DenseNet and EfficientNet backbones, fine-tuned on medical datasets [13, 14], provide strong baselines
for multi-label tagging tasks such as concept detection in radiographs.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Transformer-Based Architectures and Concept-Aware Models</title>
        <p>Transformer-based models have gained ground in medical image analysis because of their ability to
model global relationships across image regions. Vision Transformers (ViTs) [15] partition images
into patches and apply self-attention to learn contextual dependencies, while hierarchical variants
like Swin Transformers [16] introduce shifted windows to efficiently capture both local and global
structure. These architectures [17] have been successfully applied to radiographic image tagging, often
outperforming conventional CNNs in settings where long-range dependencies between regions of
interest are important.</p>
        <p>Building on this foundation, concept-aware models [18] extend vision architectures by aligning image
features with semantic representations of medical concepts. In our KClipMed models, we combine visual
encoders (ViT or Swin Transformer backbones) with learnable embeddings representing individual
CUIs. A top-K nearest retrieval mechanism and attention-based fusion allow the model to focus on
semantically relevant concepts, enhancing its ability to discriminate between visually similar but
semantically distinct classes.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Ensemble Strategies for Multi-Label Medical Image Tagging</title>
        <p>Ensemble learning [19] has proven to be a simple and effective method for improving performance on
multi-label tasks with class imbalance. By combining the outputs of models with different architectures,
such as CNNs and Transformers, ensembles benefit from the complementary strengths of their
components. Union-based aggregation strategies, in particular, are well suited to multi-label concept
detection, as they improve recall by preserving any positive prediction from the constituent models
without requiring complex calibration [20]. This approach helps stabilize performance on rare concepts
and is computationally efficient.</p>
        <p>Together, these prior works shaped the design of our system, which combines CNN baselines,
Transformer encoders, concept-aware alignment, and union-based ensembles for robust medical concept
detection.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The data used in the 2025 ImageCLEFmedical Caption Task consists of curated medical images extracted
from biomedical literature, accompanied by corresponding captions and manually controlled Unified
Medical Language System (UMLS) terms as metadata. The dataset provided for the challenge aimed to
facilitate advanced and diverse methodological approaches by offering a comprehensive collection of
radiological images.</p>
      <p>Specifically, for both training and development purposes, the Radiology Objects in COntext Version
2 (ROCOv2) dataset [21] was utilized. This dataset is an expanded and enhanced version of the original
ROCO dataset, sourced primarily from biomedical articles within the PMC OpenAccess subset. The
provided data was divided into three subsets: the training set comprising 80,091 images, the validation
set including 17,277 images, and the test set consisting of 19,267 previously unseen images. The dataset
statistics are summarized in Table 1.</p>
      <p>For the Concept Detection subtask, the concepts were derived from a carefully filtered subset of the
UMLS 2022 AB release, where filtering was applied based on semantic types and concept frequency to
enhance the feasibility of recognizing relevant medical concepts from the provided images.</p>
      <p>During data preprocessing, we observed that the original radiological images contained unnecessary
white borders. Recognizing that these borders did not contribute valuable information and increased
storage requirements, we systematically removed all white borders from the images. This preprocessing
step effectively reduced the total size of the training dataset by approximately two gigabytes, thus
facilitating more efficient data handling and model training [22].</p>
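      <p>A minimal sketch of this border-removal step is given below. The intensity threshold and function names are illustrative assumptions rather than the exact values used in our pipeline.</p>
      <preformat>
# Sketch of the white-border removal preprocessing step (illustrative).
import numpy as np
from PIL import Image

def crop_white_border(path, threshold=245):
    """Drop outer rows/columns whose pixels are all (near-)white."""
    img = np.array(Image.open(path).convert("RGB"))
    # A pixel counts as "content" unless every channel is at least `threshold`.
    content = np.logical_not(img.min(axis=2) >= threshold)
    rows = np.where(content.any(axis=1))[0]
    cols = np.where(content.any(axis=0))[0]
    if rows.size == 0 or cols.size == 0:
        return Image.fromarray(img)          # blank image: keep as-is
    cropped = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    return Image.fromarray(np.ascontiguousarray(cropped))

# Example (hypothetical file name):
# crop_white_border("roco_train_000001.jpg").save("roco_train_000001_cropped.jpg")
      </preformat>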
      <p>Additionally, our analysis revealed that the training set includes 2,479 unique CUIs, while the
validation set includes 2,283 unique CUIs. The most frequent concepts in the training data are listed
in Table 2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology and Model Architectures</title>
      <p>In this section, we present our approach to the multi-label concept detection challenge in ImageCLEF
2025. We developed and evaluated a wide range of deep learning architectures with the goal of
identifying multiple medical concepts in each radiological image, even in the presence of challenges
such as class imbalance, visual ambiguity, and overlapping semantics. Figure 1 provides an overview of
our system design.</p>
      <p>We grouped our models into three main categories: CNN + FFNN, KClipMed, and Ensemble. Each
group focused on different strengths: CNN + FFNN models used powerful convolutional backbones
to extract visual features; KClipMed models combined image and concept understanding through
contrastive embeddings and attention mechanisms; and Ensemble models merged predictions from
various architectures to improve reliability and overall performance. This strategy helped us make the
most of each model type and build a balanced system that performs well across diverse and complex
data. Since our models are based on pretrained architectures, we rely on transfer learning, a widely
adopted strategy in medical imaging that has demonstrated significant benefits across domains [23].</p>
      <sec id="sec-4-1">
        <title>4.1. CNN + FFNN Models</title>
        <p>The CNN + FFNN group included EfficientNet and DenseNet architectures. All models were
initialized with pretrained weights and fine-tuned with custom classification heads adapted for multi-label
prediction. The original output layers were replaced with feedforward neural networks, using either
GeM pooling or global average pooling for feature aggregation, followed by sigmoid activations to
generate independent probability scores for each concept. Training was performed using the Binary
Cross Entropy with Logits (BCEWithLogitsLoss) loss function and optimized with the Adam optimizer.
Performance was evaluated based on micro-averaged F1 scores, and checkpoints were saved when
improvements were observed.</p>
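        <p>The following Python sketch illustrates this recipe with a DenseNet-121 backbone. The head sizes, dropout rate, and learning rate are assumptions for illustration; the head emits raw logits because BCEWithLogitsLoss applies the sigmoid internally, with probabilities recovered at inference time.</p>
        <preformat>
# Illustrative sketch of the CNN + FFNN recipe (pretrained backbone,
# multi-label head, BCEWithLogitsLoss, Adam); not our exact configuration.
import torch
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 2479  # unique CUIs in the training set

backbone = models.densenet121(weights="IMAGENET1K_V1")
in_features = backbone.classifier.in_features
backbone.classifier = nn.Sequential(     # replace the original output layer
    nn.Linear(in_features, 1024),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(1024, NUM_CONCEPTS),       # raw logits, one per concept
)

criterion = nn.BCEWithLogitsLoss()       # applies the sigmoid internally
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-3)

def train_step(images, targets):
    """One multi-label training step; `targets` is a {0,1} matrix over CUIs."""
    optimizer.zero_grad()
    logits = backbone(images)
    loss = criterion(logits, targets.float())
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, probabilities are obtained with torch.sigmoid(logits)
# and thresholded to produce the predicted concept set.
        </preformat>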
        <sec id="sec-4-1-1">
          <title>4.1.1. DenseNet Models</title>
          <p>We used DenseNet-121 [9] with an input resolution of 128 × 128 to reduce computational overhead
while maintaining effective feature extraction. Its densely connected layers helped preserve gradient
flow and capture fine-grained features. A simple linear classification head replaced the original output
layer to support multi-label output, making it a solid and efficient baseline in our CNN lineup.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. EfficientNet Models</title>
          <p>We employed EfficientNet variants B0 to B3, selecting input resolutions from 224 × 224 to 300 × 300
based on model depth and hardware constraints. Each model used GeM pooling to aggregate spatial
features, followed by a three-layer feedforward neural network with ReLU activations, dropout, and
sigmoid outputs for multi-label classification. The output dimensionality of the feature extractor varied
by model, with B3 producing 1536 features, B2 yielding 1408, and B1 outputting 1280. Training was
conducted using the Adam optimizer with an initial learning rate of 1 × 10−3. For B2 and B3, a StepLR
scheduler reduced the learning rate every 5 epochs to stabilize convergence. All models were trained
with BCEWithLogitsLoss, and in B3, label smoothing was also explored [24].</p>
          <p>[Figure 1: System overview. After data preprocessing, images are passed to model inference; the
CNN + FFNN, KClipMed, and Ensemble model groups each produce a set of predicted concepts.]</p>
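          <p>A minimal sketch of generalized-mean (GeM) pooling as used here for spatial feature aggregation is shown below. The initial exponent and epsilon are common defaults assumed for illustration.</p>
          <preformat>
# Sketch of GeM pooling over a backbone's spatial feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x):                    # x: (batch, channels, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1)      # mean over the spatial grid
        return x.pow(1.0 / self.p).flatten(1)  # (batch, channels)

# p close to 1 recovers average pooling; large p approaches max pooling.
          </preformat>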
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. KClipMed Models</title>
        <p>We incorporated two KClipMed models into our system to integrate semantic understanding into the
image-based concept detection pipeline—one using a Swin Transformer backbone and another using a
standard Vision Transformer (ViT). Both models follow a vision-language architecture that aligns image
features with concept embeddings in a shared representation space, guided by contrastive learning
principles.</p>
        <p>In the Swin Transformer variant, image features are extracted using a Swin backbone. These features
are projected into a fixed embedding space and compared to a table of learnable embeddings, each
representing a distinct medical concept (CUI). For each image, the model selects the top-K most similar
concept embeddings and combines them with the image representation using a multi-head attention
mechanism. The final prediction logits are computed using a scaled dot-product between the attended
image representation and the full set of concept embeddings. This mechanism allows the model to focus
on semantically relevant concepts and enhance discriminative power across visually similar classes.</p>
        <p>In the standard ViT-based KClipMed model, we used a ViT-Small backbone as the visual encoder.
Instead of relying on precomputed class tokens, the model extracts token embeddings directly from
image patches and uses a positional encoding scheme to maintain spatial coherence. The classification
token from the ViT encoder is projected into the shared embedding space and processed similarly to
the Swin-based model, including top-K concept selection and attention-based fusion. The use of a pure
ViT backbone offers an alternative representation pathway and helps assess how transformer depth
and attention granularity affect performance in multi-label settings.</p>
        <p>Both models were trained on images resized to 224 × 224 and optimized using BCEWithLogitsLoss
and the AdamW optimizer with a learning rate of 2 × 10−4 and weight decay of 1 × 10−2. The evaluation
was conducted using both micro and macro F1 scores. The models were checkpointed using the best
micro F1 score on the validation set. These KClipMed variants enriched the model ensemble by bringing
semantic alignment capabilities and context-aware prediction into the concept detection pipeline, as
illustrated in Figure 2 [25].</p>
        <p>[Figure 2: KClipMed architecture. An input image (224×224 RGB) is encoded by the visual encoder
(ViT or Swin Transformer) and linearly projected into the shared embedding space. Dot-product similarity
with the CUI embeddings selects the top-K concepts, which are fused with the visual embedding via
multi-head attention followed by a residual connection and LayerNorm. Logits are computed via dot product
against all CUI embeddings and passed through a sigmoid activation to yield multi-label probabilities.]</p>
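        <p>The sketch below illustrates the prediction head described above, assuming a 512-dimensional shared space and K = 32; all dimensions, module names, and the temperature parameterization are illustrative assumptions rather than our exact implementation.</p>
        <preformat>
# Minimal sketch of the KClipMed head: top-K concept retrieval, cross-attention
# between the image embedding and retrieved concept embeddings, and
# dot-product logits against all CUIs.
import torch
import torch.nn as nn

class KClipMedHead(nn.Module):
    def __init__(self, num_concepts=2479, dim=512, top_k=32, heads=8):
        super().__init__()
        self.concept_emb = nn.Embedding(num_concepts, dim)  # one embedding per CUI
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.logit_temp = nn.Parameter(torch.tensor(1.0))   # trainable logit temperature
        self.top_k = top_k

    def forward(self, img_emb):                 # img_emb: (batch, dim), already projected
        concepts = self.concept_emb.weight      # (num_concepts, dim)
        sims = img_emb @ concepts.t()           # (batch, num_concepts)
        topk = sims.topk(self.top_k, dim=-1).indices
        topk_emb = self.concept_emb(topk)       # (batch, top_k, dim)
        # Cross-attention: the image embedding attends to its top-K concepts.
        attended, _ = self.attn(img_emb.unsqueeze(1), topk_emb, topk_emb)
        fused = self.norm(img_emb + attended.squeeze(1))     # residual + LayerNorm
        return self.logit_temp * (fused @ concepts.t())      # multi-label logits
        </preformat>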
        <sec id="sec-4-2-1">
          <title>4.2.1. Vision Transformer (ViT) Models</title>
          <p>In addition to our KClipMed experiments, we implemented standalone Vision Transformer (ViT) models
based on the ViT-B/16 architecture from the torchvision library. These models were designed to
directly learn visual representations from image patches without convolutional priors. The default
classification head was removed and replaced with a custom feedforward neural network (FFNN) that
produced sigmoid-activated outputs for multi-label concept prediction.</p>
          <p>Input images were resized to 224 × 224 and normalized using ImageNet statistics. The extracted
class token embeddings were passed through a three-layer FFNN consisting of fully connected layers,
ReLU activations, and dropout regularization. The models were trained using BCEWithLogitsLoss and
optimized using Adam with a learning rate of 1 × 10− 3. Training and validation were monitored using
micro and macro F1 scores.</p>
          <p>We implemented two variants: a standard ViT model and a mixed-precision version. The
mixed-precision model used PyTorch’s autocast and GradScaler to reduce memory usage and accelerate
training, especially beneficial for large batch sizes (e.g., 128). Training of the mixed-precision variant
followed the same loop structure but wrapped forward and loss computations within a mixed-precision
context. Both models performed well and contributed considerably to our ensemble pipeline, with the
mixed-precision version offering a favorable trade-off between training speed and numerical stability.</p>
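          <p>A minimal sketch of such a mixed-precision loop is given below, assuming the model, data loader, and optimizer are defined as in the preceding sections.</p>
          <preformat>
# Sketch of a mixed-precision training epoch using autocast and GradScaler.
import torch

scaler = torch.amp.GradScaler("cuda")
criterion = torch.nn.BCEWithLogitsLoss()

def train_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device).float()
        optimizer.zero_grad()
        with torch.amp.autocast("cuda"):         # forward + loss in reduced precision
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()            # scaled to avoid gradient underflow
        scaler.step(optimizer)                   # unscales, then optimizer.step()
        scaler.update()
          </preformat>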
          <p>These ViT-based models [26, 16] demonstrated the capacity of transformer-based architectures to
extract meaningful features directly from pixel data in the absence of convolutional structure and served
as a valuable addition to the diversity of our model ensemble.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ensemble Models</title>
        <p>In the final stage of model development, we explored ensemble strategies to combine predictions from
multiple architectures. Ensembles included combinations of EficientNet, DenseNet, and KClipMed
models. We employed a logical OR (union) approach, where a concept was accepted if predicted by
any model in the ensemble. This straightforward and effective strategy helped improve the system’s
robustness by leveraging the strengths of diverse architectures. Extensive experimentation with
different model combinations confirmed that union-based ensembles considerably enhanced the overall
consistency and performance of our concept detection system.</p>
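        <p>A minimal sketch of this union-based aggregation is shown below, assuming each member model's output is available as a set of CUIs per image; the example CUIs are hypothetical.</p>
        <preformat>
# Union-based (logical OR) ensemble: keep a concept if any member predicts it.
def union_ensemble(predictions_per_model):
    """predictions_per_model: list of dicts mapping image_id to a set of CUIs."""
    ensemble = {}
    for model_preds in predictions_per_model:
        for image_id, cuis in model_preds.items():
            ensemble.setdefault(image_id, set()).update(cuis)
    return ensemble

# Example with two hypothetical members:
# union_ensemble([{"img1": {"C0040405"}}, {"img1": {"C0040405", "C1306645"}}])
# returns {"img1": {"C0040405", "C1306645"}}
        </preformat>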
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>This section details the technical environment in which the experiments were conducted, the metrics
used for evaluation, and the results achieved. We also analyze model performance across various
architectures and ensemble combinations to highlight trade-offs between complexity and accuracy.</p>
      <sec id="sec-5-1">
        <title>5.1. Hardware and Software Setup</title>
        <p>All experiments were executed on Google Colab using a single NVIDIA Tesla T4 GPU (15 GB VRAM).
The main software stack is:
• Python 3.11.12
• PyTorch 2.6.0+cu124
• torchvision 0.21.0+cu124
• pandas 2.2.2
• scikit-learn 1.6.1</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation Metrics</title>
        <p>
          The official evaluation script of the ImageCLEFmedical Caption 2025 task [27] was employed. For each
test image, the script computes a binary F1-score between the predicted and ground-truth concept
sets using the sklearn.metrics implementation (default binary averaging) [19]. Two scores are
reported:
1. Primary F1: computed on all concepts.
2. Secondary F1 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: computed only on the subset of manually annotated concepts, as required by
the task.
        </p>
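        <p>The snippet below is an illustrative re-implementation of this per-image binary F1, not the official script; it binarizes each pair of concept sets over their union before calling scikit-learn's f1_score, then averages over images.</p>
        <preformat>
# Illustrative per-image binary F1 between predicted and ground-truth CUI sets.
from sklearn.metrics import f1_score

def image_f1(predicted, ground_truth):
    all_cuis = sorted(predicted | ground_truth)
    if not all_cuis:                      # both sets empty: perfect agreement
        return 1.0
    y_pred = [1 if c in predicted else 0 for c in all_cuis]
    y_true = [1 if c in ground_truth else 0 for c in all_cuis]
    return f1_score(y_true, y_pred, zero_division=0)

def mean_f1(pred_sets, gt_sets):
    """Average the per-image F1 over all test images (same keys assumed)."""
    return sum(image_f1(pred_sets[i], gt_sets[i]) for i in gt_sets) / len(gt_sets)
        </preformat>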
        <p>Threshold tuning. Since the models produce sigmoid-activated probability scores, a global decision
threshold had to be selected for each run. We first tested a few values manually (e.g., 0.3, 0.4, 0.5) on the
validation set and then performed a grid search over threshold values from 0.1 to 0.9 in 0.1 increments.
For each model configuration, the threshold yielding the highest micro-F1 (primary) score on the
validation set was selected independently and applied to the final test predictions. This tuning ensured
fair comparisons between models and ensembles.</p>
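        <p>A sketch of this grid search is shown below, assuming validation probabilities and binary labels are available as arrays of shape (n_images, n_concepts).</p>
        <preformat>
# Validation-set threshold search: try 0.1 to 0.9 in 0.1 steps, keep the best micro-F1.
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(probs, labels):
    best_t, best_f1 = 0.5, 0.0
    for t in np.arange(0.1, 0.95, 0.1):
        preds = (probs >= t).astype(int)
        score = f1_score(labels, preds, average="micro", zero_division=0)
        if score > best_f1:
            best_t, best_f1 = float(t), score
    return best_t, best_f1
        </preformat>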
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results</title>
        <p>In this section, we report both internal validation results and final test set performance.</p>
        <sec id="sec-5-3-1">
          <title>5.3.1. Validation Set</title>
          <p>To select the most effective model configurations, we evaluated a diverse set of architectures and
ensembles on the validation set. Table 3 presents the resulting micro-F1 scores, sorted by performance.
Simpler convolutional backbones (e.g., EfficientNet1) achieved solid baseline performance, while two-model
ensembles such as EfNet + DenseNet and EfNet1 + DenseNet consistently outperformed single models.
Ensembles incorporating vision–language encoders (KCLIP-ViT and KCLIP-Swin) often enhanced semantic
understanding, but did not always yield gains in micro-F1, highlighting a trade-off between expressiveness
and precision.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. Test Set</title>
          <p>Based on validation performance, we selected the top-performing models and submitted them to the
official evaluation server. Table 4 shows the results on the hidden test set, reporting both the official
primary F1 score (across all concepts) and secondary F1 score (restricted to manually annotated
concepts) [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
          <p>The best primary F1 score of 0.5766 was obtained by the EfNet + DenseNet ensemble (Submission ID:
1725). This configuration also showed good generalization with a secondary F1 of 0.9299. Interestingly,
another submission using KCLIP-ViT achieved the highest secondary F1 score of 0.9306, though with
slightly lower primary performance.</p>
          <p>
            Overall, ensembles of two or three well-tuned CNNs performed most consistently, whereas adding
more diverse or pre-trained VLM modules did not always improve results. This suggests that while
hybrid architectures offer potential, careful selection and thresholding remain critical for maximizing
multi-label classification performance. The top submissions from various teams are shown in Table 5
[
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
          <p>Interestingly, the best results were achieved using a vanilla ensemble of EfficientNet and DenseNet,
outperforming more complex transformer-based architectures. We attribute this to their complementary
feature extraction styles: DenseNet’s dense connectivity preserves low-level features, while
EfficientNet’s scaling strategy enhances semantic abstraction. Their combination likely provides a balanced
representation that generalizes well across the medical domain, especially under class imbalance
conditions.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this paper, we presented our system for the ImageCLEFmedical 2025 concept detection task, focusing
on multi-label classification of medical images using a diverse set of deep learning models. Our
approach involved fine-tuning a variety of convolutional and transformer-based architectures, including
EfficientNet, DenseNet, Vision Transformers (ViT), and KClipMed. To improve robustness and overall
performance, we employed ensemble strategies that combined model outputs using a logical OR (union)
operation. Our system was developed with careful attention to label encoding, image preprocessing,
model diversity, and performance evaluation using micro and macro F1 scores.</p>
      <p>Through extensive experimentation and fine-tuning, we found that combining models with different
inductive biases and input resolutions considerably enhanced performance, especially in terms of
stability across validation subsets. KClipMed models contributed semantically aware predictions, while
ViTs offered competitive performance through patch-based representation learning.</p>
      <p>For future work, we plan to extend our system by integrating interpretable AI techniques such as
SHAP [28] to better understand model predictions and improve trustworthiness in clinical contexts.
In addition, visual explanation methods like Grad-CAM [29] may help highlight image regions most
responsible for specific predictions, increasing model transparency for clinicians. We are also exploring
caption generation pipelines that incorporate longitudinal comparison and anatomy-wise controllability,
inspired by recent advances in radiology reporting [30]. Furthermore, leveraging large-scale multilingual
generative models [31] and vision-language alignment approaches [32] may enhance generalization to
unseen medical concepts in zero-shot or few-shot scenarios.</p>
      <p>Beyond the competition setting, we believe that automated concept detection systems like ours hold
promising potential for practical use in medical education and research. For instance, they could assist
radiology students in understanding complex imaging concepts through automatic annotation and
feedback, or help researchers curate and label large-scale medical image datasets with minimal manual
effort. In the long term, such systems could assist clinical decision-making workflows by surfacing
relevant concepts during report generation or image review.</p>
      <p>Ultimately, our goal is to build a concept detection system that is not only accurate but also transparent,
scalable, and applicable across diverse medical imaging settings.</p>
    </sec>
    <sec id="sec-7">
      <title>Code Availability</title>
      <p>To support reproducibility and encourage further research, we have open-sourced our implementation,
including all model definitions, training routines, evaluation scripts, and ensemble strategies. The
codebase is available at: https://github.com/DeepLensIUST/ImageCLEF2025-Concept-Detection-DeepLens</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to check
grammar and spelling and to paraphrase and reword text. After using these tools, the authors reviewed
and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          , Overview of ImageCLEFmedical 2025 -
          <article-title>Medical Concept Detection and Interpretable Caption Generation</article-title>
          , in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Péteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kozlovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liauchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. D.</given-names>
            <surname>Cid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          , V.-T. Ninh,
          <string-name>
            <surname>T.-K. Le</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Halvorsen</surname>
            , M.-
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.-T.</surname>
            Dang-Nguyen,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fichou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Berari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>L. D.</given-names>
          </string-name>
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          <article-title>Constantin, Overview of the imageclef 2020: Multimedia retrieval in medical, lifelogging, nature, and internet applications</article-title>
          , in: A.
          <string-name>
            <surname>Arampatzis</surname>
            , E. Kanoulas,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Vrochidis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Névéol</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cappellato</surname>
          </string-name>
          , N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Michoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Geissbuhler</surname>
          </string-name>
          ,
          <article-title>A review of content-based image retrieval systems in medical applications-clinical benefits and future directions</article-title>
          ,
          <source>International Journal of Medical Informatics</source>
          <volume>73</volume>
          (
          <year>2004</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1386505603002119. doi:10.1016/j.ijmedinf.2003.11.024.
        </mixed-citation>
      </ref>
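      <ref id="ref4">
        <mixed-citation>[4] B. Ionescu, H. Müller, D.-C. Stanciu, A.-G. Andrei, A. Radzhabov, Y. Prokopchuk, L.-D. Ştefan, M.-G. Constantin, M. Dogariu, V. Kovalev, H. Damm, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, C. M. Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, B. Bracke, O. Pelka, B. Eryilmaz, H. Becker, W.-W. Yim, N. Codella, R. A. Novoa, J. Malvehy, D. Dimitrov, R. J. Das, Z. Xie, H. M. Shan, P. Nakov, I. Koychev, S. A. Hicks, S. Gautam, M. A. Riegler, V. Thambawita, P. Halvorsen, D. Fabre, C. Macaire, B. Lecouteux, D. Schwab, M. Potthast, M. Heinrich, J. Kiesel, M. Wolter, B. Stein, Overview of ImageCLEF 2025: Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science (LNCS), Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Hu, Y. Gao, Z. Niu, Y. Jiang, L. Li, X. Xiao, M. Wang, E. F. Fang, W. Menpes-Smith, J. Xia, et al., Weakly supervised deep learning for COVID-19 infection detection and classification from CT images, IEEE Access 8 (2020) 118869–118883. URL: https://doi.org/10.1109/ACCESS.2020.3005852. doi:10.1109/ACCESS.2020.3005852.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] D. Powers, Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness &amp; Correlation, Machine Learning: Science and Technology 2 (2008).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Wang, H. Zhu, S.-H. Wang, Y.-D. Zhang, A review of deep learning on medical image analysis, Mobile Networks and Applications 26 (2021) 351–380. URL: https://doi.org/10.1007/s11036-020-01672-7. doi:10.1007/s11036-020-01672-7.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, C. I. Sánchez, A survey on deep learning in medical image analysis, Medical Image Analysis 42 (2017) 60–88. URL: https://doi.org/10.1016/j.media.2017.07.005. doi:10.1016/j.media.2017.07.005.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al., CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning, in: NeurIPS Workshop on Machine Learning for Health, 2017.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Tan, Q. V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 6105–6114. URL: http://proceedings.mlr.press/v97/tan19a.html.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Shvetsova, D. Romanov, D. Dylov, EfficientNet-based convolutional neural network for automatic COVID-19 detection from chest X-ray images, Frontiers in Medicine 8 (2021) 798055.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, C. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, J. Seekins, D. A. Mong, S. Halabi, J. K. Sandberg, R. Jones, D. B. Larson, C. P. Langlotz, B. N. Patel, M. P. Lungren, A. Y. Ng, CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, Nature Scientific Data 6 (2019) 1–8.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Gündel, S. Grbic, B. Georgescu, S. Liu, A. Maier, D. Comaniciu, Learning to recognize abnormalities in chest X-rays with location-aware dense networks, in: R. Vera-Rodriguez, J. Fierrez, A. Morales (Eds.), Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Springer International Publishing, Cham, 2019, pp. 757–765. URL: https://doi.org/10.1007/978-3-030-13469-3_88. doi:10.1007/978-3-030-13469-3_88.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16×16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations (ICLR), 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002. doi:10.1109/ICCV48922.2021.00986.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] United Arab Emirates, 2022, pp. 9019–9052. URL: https://aclanthology.org/2022.emnlp-main.616/. doi:10.18653/v1/2022.emnlp-main.616.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] T. Chen, Z.-G. Xu, X.-Y. Zheng, J. Wen, X.-Y. Liu, Z.-M. Li, H.-Y. Zhang, A survey on vision-language models for zero-shot image classification, Journal of Computer Science and Technology 38 (2023) 670–695. URL: https://doi.org/10.1007/s11390-023-4043-4. doi:10.1007/s11390-023-4043-4.</mixed-citation>
      </ref>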
    </ref-list>
  </back>
</article>