Overview of ImageCLEFmedical 2024 – Caption Prediction and Concept Detection Johannes Rückert1,* , Asma Ben Abacha2 , Alba G. Seco de Herrera3,4 , Louise Bloch1,5,6,† , Raphael Brüngel1,5,6,† , Ahmad Idrissi-Yaghir1,5,† , Henning Schäfer7,1,† , Benjamin Bracke1,5,† , Hendrik Damm1,5,† , Tabea M. G. Pakull7,1,† , Cynthia Sabrina Schmidt6 , Henning Müller8,9 and Christoph M. Friedrich1,5 1 Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany 2 Microsoft, Redmond, Washington, USA 3 University of Essex, UK 4 UNED, Spain 5 Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany 6 Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen, Germany 7 Institute for Transfusion Medicine, University Hospital Essen, Essen, Germany 8 University of Applied Sciences Western Switzerland (HES-SO), Switzerland 8 University of Geneva, Switzerland Abstract The ImageCLEFmedical 2024 Caption task on caption prediction and concept detection follows similar challenges held from 2017–2023. The goal is to extract Unified Medical Language System (UMLS) concept annotations and/or define captions from image data. Predictions are compared to original image captions. Images for both tasks are part of the Radiology Objects in COntext version 2 (ROCOv2) dataset. For concept detection, multi-label predictions are compared against UMLS terms extracted from the original captions with additional manually curated concepts via the F1-score. For caption prediction, the semantic similarity of the predictions to the original captions is evaluated using the BERTScore. The task attracted strong participation with 50 registered teams, 14 teams submitted 82 graded runs for the two subtasks. Participants mainly used multi-label classification systems for the concept detection subtask, the winning team DBS-HHU utilized an ensemble of four different Convolutional Neural Networks (CNNs). For the caption prediction subtask, most teams used encoder-decoder frameworks with various backbones, including transformer-based decoders and Long Short-Term Memories (LSTMs), with the winning team PCLmed using medical vision-language foundation models (Med-VLFMs) by combining general and specialist vision models. Keywords ImageCLEF, Computer Vision, Multi-Label Classification, Image Captioning, Image Understanding, Radiology CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France * Corresponding author. † These authors contributed equally. $ johannes.rueckert@fh-dortmund.de (J. Rückert); abenabacha@microsoft.com (A. Ben Abacha); alba.garcia@essex.ac.uk (A. G. Seco de Herrera); louise.bloch@fh-dortmund.de (L. Bloch); raphael.bruengel@fh-dortmund.de (R. Brüngel); ahmad.idrissi-yaghir@fh-dortmund.de (A. Idrissi-Yaghir); henning.schaefer@uk-essen.de (H. Schäfer); benjamin.bracke@fh-dortmund.de (B. Bracke); hendrik.damm@fh-dortmund.de (H. Damm); tabeamargaretagrace.pakull@uk-essen.de (T. M. G. Pakull); cynthia.schmidt@uk-essen.de (C. S. Schmidt); henning.mueller@hevs.ch (H. Müller); christoph.friedrich@fh-dortmund.de (C. M. Friedrich)  0000-0002-5038-5899 (J. Rückert); 0000-0001-6312-9387 (A. Ben Abacha); 0000-0002-6509-5325 (A. G. Seco de Herrera); 0000-0001-7540-4980 (L. Bloch); 0000-0002-6046-4048 (R. Brüngel); 0000-0003-1507-9690 (A. Idrissi-Yaghir); 0000-0002-4123-0406 (H. Schäfer); 0000-0003-4986-7142 (B. Bracke); 0000-0002-7464-4293 (H. Damm); 0009-0009-9802-7167 (T. M. G. 
Pakull); 0000-0003-1994-0687 (C. S. Schmidt); 0000-0001-6800-9878 (H. Müller); 0000-0001-7906-0038 (C. M. Friedrich)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

ImageCLEF1 is the image retrieval and classification lab of the Conference and Labs of the Evaluation Forum (CLEF). ImageCLEF 2024 consists of the ImageCLEFmedical, ImageCLEFrecommending, Image Retrieval for Arguments (Touché), and ImageCLEFToPicto labs, with the ImageCLEFmedical lab being divided into the subtasks Caption (image captioning), VQA (text-to-image generation), MEDIQA-MAGIC (Multimodal And Generative TelemedICine), and GANs (generation of medical images).

The Caption task was first proposed as part of ImageCLEFmedical [1] in 2016. In 2017 and 2018 [2, 3], the ImageCLEFmedical caption task comprised two subtasks: concept detection and caption prediction. In 2019 [4] and 2020 [5], the task concentrated on the concept detection subtask, extracting Unified Medical Language System® (UMLS) Concept Unique Identifiers (CUIs) [6] from radiology images. In 2021 [7], both subtasks, concept detection and caption prediction, ran again due to participants' demands. The focus in 2021 was on making the task more realistic by using fewer images, all of which were manually annotated by medical doctors. As additional data of similar quality is hard to acquire, the 2022 ImageCLEFmedical caption task [8] continued with both subtasks, albeit with an extended version of the Radiology Objects in COntext (ROCO) [9] dataset, which had already been used in 2019 and 2020. The 2023 edition of ImageCLEFmedical caption [10] continued in the same vein, once again using a ROCO-based dataset for both subtasks but switching from BiLingual Evaluation Understudy (BLEU) [11] to BERTScore [12] as the primary evaluation metric for caption prediction. For the 8th edition in 2024, additional metrics as well as an optional explainability extension are introduced for caption prediction.

This paper presents the approaches submitted to the caption task, which comprises the automated generation of coherent captions for medical images and, as a separate subtask, the detection of UMLS concepts in radiology images. This task is part of the ImageCLEF benchmarking campaign, which has proposed medical image understanding tasks since 2003, with a new suite of tasks generated each year. Further information on the other tasks proposed at ImageCLEF 2024 can be found in Ionescu et al. [13]. This is the 8th edition of the ImageCLEFmedical caption task. Just like in 2016 [1], 2017 [2], 2018 [3], 2021 [7], 2022 [8], and 2023 [10], both subtasks, concept detection and caption prediction, are included in ImageCLEFmedical 2024 Caption. Manually annotating medical images with structured knowledge is a time-consuming process prone to human error. Since this process supports faster and more reliable diagnosis of diseases amenable to radiology screening, it is important to better understand and refine automatic systems that aid in the broad task of radiology-image metadata generation. The purpose of the ImageCLEFmedical 2024 caption prediction and concept detection tasks is the continued evaluation of such systems.
Concept detection and caption prediction are applicable to unlabelled and unstructured datasets and to medical datasets that lack textual metadata. The ImageCLEFmedical caption task focuses on medical image understanding in the biomedical literature, and specifically on concept extraction and caption prediction based on the visual content of the medical images and on the medical text data, such as the caption or the UMLS CUIs, paired with each image (see Figure 1). In 2024, the newly released ROCOv2 [14] dataset, a new iteration of the ROCO [9] dataset, was used as the development data, with new images from the PubMed Central® (PMC) [15] Open Access subset added for the test set, while images from articles with licenses other than CC BY and CC BY-NC were removed.

This paper presents an overview of the ImageCLEFmedical 2024 Caption task, including the task and participation in Section 2, the data creation in Section 3, and the evaluation methodology in Section 4. The results are described in Section 5, followed by the conclusion in Section 6.

1 https://www.imageclef.org/ [last accessed: 2024-07-01]

2. Task and Participation

In 2024, the ImageCLEFmedical Caption task consisted of two subtasks: concept detection and caption prediction. The concept detection subtask follows the same format proposed since the start of the task in 2017 [2]. Participants are asked to predict a set of concepts, defined by UMLS CUIs [6], based on the visual information provided by the radiology images. The caption prediction subtask follows the original format of the subtask used in 2017 and 2018 [2, 3]. This subtask was paused and has been running again since 2021 because of participant demand. It aims to automatically generate captions for the radiology images provided. This year, an optional new experimental explainability extension has been introduced for the caption prediction task. This extension aims to improve understanding of the models by asking participants to provide explanations, such as heat maps or Shapley values [16, 17], for a selected number of images. These explanations are manually reviewed to assess their effectiveness and clarity.

In 2024, 50 teams registered and signed the End-User-Agreement that is needed to download the development data. 14 teams submitted 82 graded runs for evaluation (13 teams submitted working notes), attracting a similar number of teams as in 2023 [10], with an overall lower number of graded runs. Each group was allowed a maximum of 10 graded runs per subtask. Table 1 shows all the teams that participated in the task and their submitted runs. This year, 9 teams participated in the concept detection subtask, 3 of which also participated in 2023 [10]. Of the 11 teams that submitted runs to the caption prediction subtask, 5 also participated in 2023, and 3 also participated in 2022. Overall, 6 teams participated in both subtasks and 5 teams participated only in the caption prediction subtask. Unlike in 2023, 3 teams participated only in the concept detection subtask.

3. Data Creation

Figure 1 shows an example from the dataset provided for the task.
Figure 1: Example of a radiology image with the corresponding UMLS® CUIs and caption from the ImageCLEFmedical 2024 caption task. CC BY [Ali et al. (2020)]

UMLS CUI | UMLS Meaning
C1306645 | Plain x-ray
C0030797 | Pelvis
C1999039 | Anterior-Posterior
C0011900 | Diagnosis
C1305773 | Entire symphysis pubis
C0036036 | Sacroiliac joint structure
C0555898 | Sacroiliac
C0301559 | Screw

Caption: Anteroposterior pelvic radiograph of a 30-year-old female diagnosed with Ehlers-Danlos Syndrome demonstrating fusion of pubic symphysis and both sacroiliac joints (anterior plating, bone grafting and sacroiliac screw insertion)

Like last year, a dataset originating from biomedical articles of the PMC Open Access Subset2 [15] was used and was extended with new images added since the last time the dataset was updated in October 2022. An advantage of using new images for the test set is that contamination is not an issue for models trained on PMC data, since most models in use today were trained on data collected before 2023 and have therefore not seen the newly added images. The development dataset for this year consists of the images from the newly released ROCOv2 [14] dataset.

2 https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ [last accessed: 2024-07-01]

Table 1: Participating groups in the ImageCLEFmedical 2024 Caption task and their graded runs submitted to both subtasks: T1-Concept Detection and T2-Caption Prediction. Teams with previous participation in 2023 are marked with an asterisk (*).

Team | Institution | Runs T1 | Runs T2
AUEB-NLP-Group* [18] | Department of Informatics, Athens University of Economics and Business, Athens, Greece | 10 | 9
DBS-HHU [19] | Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany | 8 | 2
DS@BioMed [20] | University of Information Technology, Ho Chi Minh City, Vietnam | 5 | 7
SSNMLRGKSR* [21] | Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Chennai, India | 3 | –
CS_Morgan* [22] | Computer Science Department, Morgan State University, Baltimore, Maryland | 1 | 9
UACH-VisionLab [23] | Facultad de Ingeniería, Universidad Autónoma de Chihuahua, Chihuahua, Mexico | 2 | –
MICLab [24] | School of Electrical and Computer Engineering, Universidade Estadual de Campinas, Campinas, Brazil | 4 | 4
Kaprov [25] | Department of CSE, SSN College of Engineering, Chennai, India | 1 | 1
PCLmed* [26] | Peng Cheng Laboratory, Shenzhen, China and ADSPLAB, School of Electronic and Computer Engineering, Peking University, Shenzhen, China | – | 3
VIT_Conceptz [27] | Vellore Institute of Technology (VIT), Chennai, India | 4 | –
KDE-medical-caption* [28] | KDE Laboratory, Department of Computer Science and Engineering, Toyohashi University of Technology, Aichi, Japan | – | 5
2Q2T [29] | University of Information Technology, Ho Chi Minh City, Vietnam | – | 7
DarkCow [30] | Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam | – | 3

Once again, no extensive caption pre-processing beyond the removal of links was performed, to keep the captions as realistic as possible. Captions in languages other than English were removed. From the resulting captions, concepts were extracted using the Medical Concept Annotation Toolkit (MedCAT) [31]. MedCAT, which is capable of extracting biomedical concepts from unstructured text, was trained on the Medical Information Mart for Intensive Care (MIMIC)-III dataset [32] and links entities to Systematized Nomenclature of Medicine and Clinical Terms (SNOMED CT) IDs, which were later mapped to CUIs and Type Unique Identifiers (TUIs) of the UMLS2022AB release3.
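To make this extraction step more concrete, the following is a minimal sketch of how captions could be annotated with MedCAT and mapped to UMLS CUIs. It assumes a MedCAT v1-style API; the model pack file name and the SNOMED-to-UMLS mapping are hypothetical placeholders and do not reproduce the released pipeline.

```python
from medcat.cat import CAT

# Load a MedCAT model pack (hypothetical file name; the organizers' model was
# trained on MIMIC-III and links entities to SNOMED CT identifiers).
cat = CAT.load_model_pack("medcat_snomed_model_pack.zip")

# Hypothetical SNOMED CT -> UMLS CUI mapping, derived offline from the
# UMLS 2022AB release (only a placeholder entry is shown here).
snomed_to_umls: dict[str, str] = {"<snomed_ct_id>": "<umls_cui>"}

def extract_cuis(caption: str) -> set[str]:
    """Annotate a caption and map the detected concepts to UMLS CUIs."""
    entities = cat.get_entities(caption)["entities"].values()
    # With a SNOMED-trained model pack, the 'cui' field holds the source
    # ontology identifier, which is then mapped to a UMLS CUI.
    return {snomed_to_umls[e["cui"]] for e in entities if e["cui"] in snomed_to_umls}

print(extract_cuis("Anteroposterior pelvic radiograph demonstrating fusion of the pubic symphysis."))
```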
During concept extraction, concepts were retained only if they exceeded a frequency threshold of 10 occurrences, and semantic filters were applied to focus on visually observable and interpretable concepts. For example, concepts of semantic type T029 (Body Location or Region) or T060 (Diagnostic Procedure) are relevant, while a concept of semantic type T054 (Social Behavior) cannot be derived from the image even if it appears in the caption. In addition, manual filtering was performed to exclude UMLS concepts that were either incorrectly detected by the pipeline or, after semantic filtering, were still not related to the image content in any way. Blacklisted concepts often include qualifiers of, for example, an anatomical localization or a pathological process, which divert attention from the actual concept of interest and would also introduce bias, since qualifiers are used in a highly individual and variable manner. Entity linking systems tend to link concepts with ambiguous synonyms incorrectly, e.g., C0994894 (Patch Dosage Form) may be linked if the caption refers to a region that is patchy. If such concepts occurred frequently, they were mapped to the correct concept.

3 https://www.nlm.nih.gov/pubs/techbull/nd22/nd22_umls_2022ab_release_available.html [last accessed: 2024-07-01]

Additional concepts addressing the image modality were assigned to all images. Six modality concepts were covered: X-ray, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), ultrasound, and Positron Emission Tomography (PET), as well as modality combinations (e.g., PET/CT) as a standalone concept. For images of the X-ray modality, further concepts on the represented anatomy were assigned, covering specific anatomical body regions of the Image Retrieval in Medical Application (IRMA) [33] classification: cranium, spine, upper extremity/arm, chest, breast/mamma, abdomen, pelvis, and lower extremity/leg. New in last year's dataset was the addition of manually validated directionality concepts for X-ray images. Directionality refers to the X-ray imaging orientation according to IRMA: coronal posteroanterior (PA), coronal anteroposterior (AP), sagittal, or transversal. These concepts were not included in this year's dataset because the medical expertise and time needed both to ensure the quality of the directionality concepts in the development dataset and to validate new directionality concepts for the test set were not available.

Table 2 shows statistics about the number of concepts for the datasets of the last three years.

Table 2: Number of unique concepts and average number of concepts per image by split for the ImageCLEFmedical Caption datasets of 2022, 2023, and 2024.

Year | Split | Unique concepts | Concepts per image
2022 | train | 17,210 | 4.90
2022 | valid | 5126 | 4.85
2022 | test | 4403 | 4.97
2023 | train | 2126 | 3.73
2023 | valid | 1946 | 3.84
2023 | test | 1936 | 3.86
2024 | train | 1946 | 3.15
2024 | valid | 1752 | 3.21
2024 | test | 700 | 2.82

The following subsets were distributed to the participants, where each image has one caption and one or more concepts (UMLS CUIs):
• Training set including 70,108 radiology images and associated captions and concepts, with a total of 220,859 concept occurrences and 1945 unique concepts.
• Validation set including 9972 radiology images and associated captions and concepts, with a total of 32,060 concept occurrences and 1751 unique concepts.
• Test set including 17,237 radiology images, with a total of 48,563 concept occurrences and 700 unique concepts.
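Before turning to the evaluation, the frequency and semantic-type filtering described above can be illustrated with a minimal sketch; the TUI whitelist shown is a small illustrative subset and not the full list used for the dataset.

```python
from collections import Counter

# Per-image annotations: image_id -> list of (CUI, TUI) pairs, e.g. produced by MedCAT.
annotations = {
    "img_001": [("C1306645", "T060"), ("C0030797", "T029")],
    "img_002": [("C1306645", "T060")],
}

# Semantic types assumed to be visually observable (illustrative subset only).
ALLOWED_TUIS = {"T029", "T060"}   # Body Location or Region, Diagnostic Procedure
MIN_FREQUENCY = 10                # concepts below this collection-wide count are dropped

# Count how often each CUI occurs across the whole collection.
cui_counts = Counter(cui for pairs in annotations.values() for cui, _ in pairs)

filtered = {
    image_id: sorted({
        cui for cui, tui in pairs
        if tui in ALLOWED_TUIS and cui_counts[cui] >= MIN_FREQUENCY
    })
    for image_id, pairs in annotations.items()
}
print(filtered)  # with this toy data, everything is removed by the frequency threshold
```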
4. Evaluation Methodology

In this year's edition, the performance evaluation for the concept detection subtask is carried out in the same way as last year. Both subtasks are evaluated separately. AI4MediaBench4 by AIMultimediaLab5 was used as the challenge platform. Like last year, participants were unaware of their own scores on the test set until after the submission deadline. This was done to avoid teams optimizing their approaches based on test set results, which would amount to information leakage.

4 https://ai4media-bench.aimultimedialab.ro/ [last accessed: 2024-07-01]
5 https://www.aimultimedialab.ro/ [last accessed: 2024-07-01]

For the concept detection subtask, the balanced precision and recall trade-off was measured in terms of the F1-score. Like last year, a secondary F1-score is computed using a manually curated subset of concepts. On the one hand, this involves the different image modalities (X-ray, Angiography, Ultrasound, CT, MRI, PET, and combinations such as PET/CT). On the other hand, for X-ray images the most prominently depicted body region (cranium, chest, upper extremity, spine, abdomen, pelvis, and lower extremity) is also included, if applicable.

As a pre-processing step for evaluating the second subtask, all captions were lowercased, punctuation was removed, and numbers were replaced by the token "number". This step ensures uniformity and focuses the evaluation on the linguistic content. The performance of caption prediction is evaluated based on BERTScore [12], a metric that computes a similarity score between each token in the generated text and each token in the reference text. It uses the pre-trained contextual embeddings from Bidirectional Encoder Representations from Transformers (BERT) [34]-based models and matches words by cosine similarity. In this work, the pre-trained model microsoft/deberta-xlarge-mnli6 was used because it is the model that correlates best with human scoring according to the authors7.

6 https://huggingface.co/microsoft/deberta-xlarge-mnli [last accessed: 2023-07-01]
7 https://github.com/Tiiiger/bert_score [last accessed: 2023-07-01]

Since evaluating generated text and image captions is very challenging and should not be based on a single metric, additional evaluation metrics were explored in this year's edition in order to find the metrics that correlate well with human judgments for this task. First, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [35] score was adopted as a secondary metric; it counts the number of overlapping units such as n-grams, word sequences, and word pairs between the generated text and the reference. Specifically, the ROUGE-1 (F-measure) score was calculated, which measures the number of matching unigrams between the model-generated text and a reference. The individual scores for all captions are then averaged to obtain the final score. In addition to ROUGE, the Metric for Evaluation of Translation with Explicit ORdering (METEOR) [36] was explored, which evaluates the generated text by aligning it to the reference and calculating a sentence-level similarity score. Furthermore, the Consensus-based Image Description Evaluation (CIDEr) [37] metric was also adopted. CIDEr calculates the weights of n-grams in the generated text and the reference text based on Term Frequency and Inverse Document Frequency (TF-IDF) and then compares them using cosine similarity. Another metric used is the BiLingual Evaluation Understudy (BLEU) score [11], which is a geometric mean of n-gram precisions from 1 to 4. For this task, the focus was on the BLEU-1 score, which takes into account unigram precision.
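As an illustration of the caption pre-processing and the primary metric, the following minimal sketch lowercases captions, strips punctuation, replaces numbers with the token "number", and computes BERTScore with the bert-score package; the exact regular expressions are assumptions and may differ from the organizers' implementation.

```python
import re
from bert_score import score  # pip install bert-score

def preprocess(caption: str) -> str:
    """Lowercase, replace digit runs with 'number', and remove punctuation."""
    caption = caption.lower()
    caption = re.sub(r"\d+", "number", caption)
    caption = re.sub(r"[^\w\s]", " ", caption)
    return " ".join(caption.split())

candidates = [preprocess("CT scan showing a 3 cm lesion in the liver.")]
references = [preprocess("Computed tomography demonstrating a liver lesion of 3 cm.")]

# BERTScore with the model reported to correlate best with human judgments.
P, R, F1 = score(candidates, references,
                 model_type="microsoft/deberta-xlarge-mnli",
                 lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```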
BiLingual Evaluation Understudy with Representations from Transformers (BLEURT) [38] is specifically designed to evaluate natural language generation in English. It uses a pre-trained model that has been fine-tuned to emulate human judgments about the quality of the generated text. The strength of BLEURT lies in its end-to-end training, which enables it to model human judgments effectively and makes it robust to domain and quality variations. For this evaluation, the BLEURT-20 model was used. CLIPScore [39] is a metric that diverges from traditional reference-based evaluations of image captions. Instead, it mirrors the human approach of judging caption quality without references by evaluating the alignment between text and image content. The metric employs Contrastive Language-Image Pretraining (CLIP) [40], a cross-modal model pre-trained on a massive dataset of 400 million image-caption pairs sourced from the web, to compute similarity scores between images and text. In addition to the reference-free CLIPScore, this evaluation also considers RefCLIPScore [39], an extension that incorporates reference captions. This year, two new domain-specific metrics, MedBERTScore and ClinicalBLEURT [41], have been added to the evaluation. These metrics are tailored for evaluating text in medical contexts and aim to better assess the relevance and accuracy of the generated medical content. MedBERTScore enhances the traditional BERTScore by assigning higher weights to medically relevant terms identified in the text. ClinicalBLEURT is a version of BLEURT fine-tuned on large collections of family medicine and orthopedic notes to better capture the characteristics of medical language.

5. Results

For the concept detection and caption prediction subtasks, Tables 3 and 4 show the best results from each of the participating teams. The results are discussed in this section. The full list of results is shown in Appendix A in Tables 7, 8 and 9.

5.1. Results for the Concept Detection Subtask

In 2024, 9 teams participated in the concept detection subtask, submitting 38 graded runs. Table 3 presents the best result achieved by each team.

Table 3: Performance of the participating teams in the ImageCLEFmedical 2024 Caption concept detection subtask. Only the best run based on the achieved F1-score is listed for each team, together with the corresponding secondary F1-score based on manual annotations, as well as the team rankings based on the primary and secondary F1-score. The full results are shown in Table 7 in Appendix A.
Group Name | Best Run | F1 | Secondary F1 | Rank (secondary)
DBS-HHU | 601 | 0.6375 | 0.9534 | 1 (1)
auebnlpgroup | 644 | 0.6319 | 0.9393 | 2 (2)
DS@BioMed | 653 | 0.6200 | 0.9312 | 3 (4)
SSNMLRGKSR | 425 | 0.6001 | 0.9056 | 4 (5)
UACH-VisionLab | 235 | 0.5988 | 0.9363 | 5 (3)
MICLabNM | 681 | 0.5795 | 0.8835 | 6 (6)
Kaprov | 558 | 0.4609 | 0.7301 | 7 (7)
VIT_ConceptZ | 233 | 0.1812 | 0.2647 | 8 (8)
CS_Morgan | 530 | 0.1076 | 0.2105 | 9 (9)

DBS-HHU [19] Dethroning the winners of the last several years, the DBS-HHU team achieved the best F1-scores of 0.6375 (primary) and 0.9534 (secondary) by using an ensemble of four different Convolutional Neural Networks (CNNs): ResNet-152 [42], EfficientNet-B0 [43], DenseNet-201 [44], and Wide ResNet-101-2 [45], all pre-trained on ImageNet [46] and followed by different Feed-Forward Neural Networks (FFNNs). Additionally, they experimented with building a hierarchical system of several models, specifically oriented towards the AUEB-NLP-Group's approach of prior years. However, these experiments did not beat the best results of their first strategy.

AUEB-NLP-Group [18] The AUEB-NLP-Group based their approach on their past work, which won the competition in the last several years, combining a CNN (DenseNet [44]) with an FFNN classification head. This achieved a close second place with a primary F1-score of 0.6319 and a secondary F1-score of 0.9393. They also experimented with CNNs followed by 𝑘-Nearest Neighbor (k-NN) models and with ensembles, which performed slightly worse.

DS@BioMed [20] The DS@BioMed team employed a Shifted Window Transformer v2 (Swin-v2) [47] to achieve an F1-score of 0.6200 and a secondary F1-score of 0.9312. They also experimented with other transformer-based architectures, as well as CNNs and ensembles.

SSNMLRGKSR [21] The SSNMLRGKSR team used a DenseNet-121 [44] CNN for their best approach, which achieved a primary F1-score of 0.6001 and a secondary F1-score of 0.9056.

UACH-VisionLab [23] The UACH-VisionLab team used several EfficientNet-B0 [43] models trained for different sub-groups of concepts to achieve a primary F1-score of 0.5988 and a secondary F1-score of 0.9363.

MICLabNM [24] The MICLabNM team employed a VisualT5 image-to-text encoder-decoder architecture coupling a Vision Transformer (ViT) [48] with an encoder-decoder T5 [49] text transformer, achieving F1-scores of 0.5795 and 0.8835.

Kaprov [25] The Kaprov team utilized a CNN-LSTM model, achieving a primary F1-score of 0.4609 and a secondary F1-score of 0.7301.

VIT_Conceptz [27] The VIT_Conceptz team used a ResNet50 [42] CNN to achieve F1-scores of 0.1812 and 0.2647.

CS_Morgan [22] The CS_Morgan team experimented with a ConvMixer [50] model, which combines elements of CNN and Transformer architectures, achieving F1-scores of 0.1076 and 0.2105.

To summarize, in the concept detection subtask, the groups primarily used multi-label classification systems, with one team integrating image retrieval systems in some of their approaches. Most teams used CNNs to extract image features. Some teams explored Transformer-based [51] models, such as ViTs [48], while one team used a ConvMixer [50] architecture, blending convolutional networks and ViTs. The winning team this year utilized an ensemble of four different CNNs.
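Most of these systems share the same basic pattern: a pre-trained CNN backbone whose classifier is replaced by a multi-label head trained with binary cross-entropy. The following minimal sketch illustrates this pattern and is not the code of any particular team; the hidden size, dropout, and decision threshold are placeholder choices.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CONCEPTS = 1945  # unique concepts in the 2024 training set

# Pre-trained backbone with its classifier replaced by a multi-label head.
backbone = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone.fc = nn.Sequential(
    nn.Linear(backbone.fc.in_features, 1024),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(1024, NUM_CONCEPTS),
)

criterion = nn.BCEWithLogitsLoss()  # one independent binary decision per concept
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4)

def training_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    """images: (B, 3, 224, 224); targets: (B, NUM_CONCEPTS) multi-hot labels."""
    logits = backbone(images)
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def predict(images: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Return multi-hot predictions by thresholding the sigmoid outputs."""
    with torch.no_grad():
        return (torch.sigmoid(backbone(images)) > threshold).int()
```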
Comparing this year's concept detection results to those of last year's ImageCLEFmedical Caption task, a remarkable increase in the achieved F1-scores can be observed. For a direct comparison, last year's winner and now second-best AUEB-NLP-Group increased their F1-score from 0.5223 to 0.6319, close to team DBS-HHU's winning F1-score of 0.6375. The increase is much smaller for the secondary F1-score, where the AUEB-NLP-Group improved from 0.9258 to 0.9393 and DBS-HHU achieved a new all-time high of 0.9534. By training and evaluating our own baseline model on this year's data, we could determine that about 0.1 of the difference in primary F1-score is purely due to the new test dataset, which contains a much smaller number of unique concepts (see Table 2). One difference in this year's dataset compared to last year's is that the newly added images were fully used for the test dataset and not split into validation and test sets, resulting in a larger test dataset. On the other hand, the number of unique concepts in the test dataset is much lower than last year, indicating a difference in the newly added data. The practice of updating the test set with the latest images from the PMC Open Access subset can lead to such complications. Further improvements in the primary and secondary F1-scores can be attributed to continuous changes and improvements of the challenge dataset, e.g., the correction of previous errors and further refinement of quality assurance measures, as well as to improvements and scaling of the teams' approaches.

5.2. Results for the Caption Prediction Subtask

In this 8th edition, the caption prediction subtask attracted 11 teams, which submitted 53 graded runs. Tables 4, 5 and 6 present the results of the submissions.

PCLmed [26] The winning team introduced Medical Vision-Language Foundation Models (Med-VLFMs) with Vision Encoder Ensembling (VEE) for better representing the content of medical images and Modality-Aware Adaptation (MAA) to take the inference between vision and text modalities into account. An ensemble of an EVA (Exploring the limits of masked Visual representation learning At scale)-ViT-g [52] model, pre-trained on natural images, and a BiomedCLIP [53] model, pre-trained on medical images, was implemented for image encoding. PanGu-𝛼 [54] was used as the Large Language Model (LLM) for text generation. The model reached a BERTScore of 0.6299 and a ROUGE score of 0.2726 and won the caption prediction task.

CS_Morgan [22] The CS_Morgan team experimented with different Large Multimodal Models (LMMs) like Large Language and Vision Assistant (LLaVA) [55], IDEFICS [56], and MoonDream2⁸. The results of these models were compared to conventional encoder-decoder models like VisionGPT2 and CNN-Transformer architectures. The best-performing model of the team was a fine-tuned LLaVA 1.6 Mistral 7B, which achieved a BERTScore of 0.6281 and a ROUGE score of 0.2508.

8 https://huggingface.co/vikhyatk/moondream2 [last accessed: 2024-07-01]

Table 4: Performance of the participating teams in the ImageCLEFmedical 2024 Caption caption prediction subtask. Only the best run based on the achieved BERTScore is listed for each team, together with the corresponding secondary ROUGE score, as well as the team rankings based on the primary BERTScore and secondary ROUGE score. Additional scores are shown in Tables 5 and 6. The full results are shown in Tables 8 and 9 in Appendix A.
Group Name | Best Run | BERTScore | ROUGE | Rank (secondary)
pclmed | 634 | 0.6299 | 0.2726 | 1 (1)
CS_Morgan | 429 | 0.6281 | 0.2508 | 2 (2)
DarkCow | 220 | 0.6267 | 0.2452 | 3 (4)
auebnlpgroup | 630 | 0.6211 | 0.2049 | 4 (7)
2Q2T | 643 | 0.6178 | 0.2478 | 5 (3)
MICLab | 678 | 0.6128 | 0.2135 | 6 (6)
DLNU_CCSE | 674 | 0.6066 | 0.2179 | 7 (5)
Kaprov | 559 | 0.5964 | 0.1905 | 8 (8)
DS@BioMed | 571 | 0.5794 | 0.1031 | 9 (11)
DBS-HHU | 637 | 0.5769 | 0.1531 | 10 (9)
KDE-medical-caption | 557 | 0.5673 | 0.1325 | 11 (10)

Table 5: Performance of the participating teams in the ImageCLEFmedical 2024 Caption caption prediction subtask for the additional metrics BLEU-1, BLEURT, ClinicalBLEURT and METEOR. These correspond to the best BERTScore-based runs of each team, listed in Table 4. The full results are shown in Tables 8 and 9 in Appendix A.

Group Name | Best Run | BLEU-1 | BLEURT | ClinicalBLEURT | METEOR
pclmed | 634 | 0.2690 | 0.3376 | 0.4666 | 0.1133
CS_Morgan | 429 | 0.2093 | 0.3174 | 0.4559 | 0.0927
DarkCow | 220 | 0.1950 | 0.3060 | 0.4562 | 0.0889
auebnlpgroup | 630 | 0.1110 | 0.2899 | 0.4866 | 0.0680
2Q2T | 643 | 0.2213 | 0.3139 | 0.4759 | 0.0986
MICLab | 678 | 0.1853 | 0.3067 | 0.4453 | 0.0772
DLNU_CCSE | 674 | 0.1512 | 0.2831 | 0.4756 | 0.0704
Kaprov | 559 | 0.1697 | 0.2951 | 0.4400 | 0.0609
DS@BioMed | 571 | 0.0121 | 0.2202 | 0.5295 | 0.0353
DBS-HHU | 637 | 0.1493 | 0.2710 | 0.4766 | 0.0559
KDE-medical-caption | 557 | 0.1060 | 0.2566 | 0.5022 | 0.0386

Table 6: Performance of the participating teams in the ImageCLEFmedical 2024 Caption caption prediction subtask for the additional metrics CIDEr, CLIPScore, RefCLIPScore and MedBERTScore. These correspond to the best BERTScore-based runs of each team, listed in Table 4. The full results are shown in Tables 8 and 9 in Appendix A.

Group Name | Best Run | CIDEr | CLIPScore | RefCLIPScore | MedBERTScore
pclmed | 634 | 0.2681 | 0.8236 | 0.8176 | 0.6323
CS_Morgan | 429 | 0.2450 | 0.8213 | 0.8155 | 0.6327
DarkCow | 220 | 0.2243 | 0.8184 | 0.8117 | 0.6292
auebnlpgroup | 630 | 0.1769 | 0.8041 | 0.7987 | 0.6261
2Q2T | 643 | 0.2200 | 0.8271 | 0.8138 | 0.6224
MICLab | 678 | 0.1582 | 0.8159 | 0.8049 | 0.6172
DLNU_CCSE | 674 | 0.1688 | 0.7967 | 0.7904 | 0.6130
Kaprov | 559 | 0.1070 | 0.7922 | 0.7872 | 0.6089
DS@BioMed | 571 | 0.0715 | 0.7756 | 0.7748 | 0.5804
DBS-HHU | 637 | 0.0644 | 0.7842 | 0.7750 | 0.5827
KDE-medical-caption | 557 | 0.0384 | 0.7651 | 0.7610 | 0.5697

DarkCow [30] The DarkCow team obtained a BERTScore of 0.6267 and a ROUGE score of 0.2452. A VinVL [57] model was used to extract object features from the images. These features were combined with more general visual features extracted using a ViT [48] model. ClinicalT5 [58]- and Biomedical Bidirectional and Auto-Regressive Transformers (BioBART) [59]-based models were used for the caption generation. The best results were achieved with the BioBART model.

AUEB-NLP-Group [18] The AUEB-NLP-Group's approach to caption prediction involved four primary systems: the first one employing an InstructBLIP [60] model, and the others building upon it, applying a synthesizer, a rephraser, and an innovative Distance from Median Maximum Concept Similarity (DMMCS) mechanism. One combination of InstructBLIP with DMMCS achieved the team's best BERTScore of 0.6211 and a ROUGE score of 0.2049.

2Q2T [29] The 2Q2T team used the Bootstrapping Language-Image Pre-training (BLIP) [61] architecture as their main approach, which combines a ViT [48] as the encoder with BERT [34] for text generation. They achieved a BERTScore of 0.6178 and a ROUGE score of 0.2478 for caption prediction.
MICLabNM [24] The MICLabNM team used a model that combines a ViT [48] with ClinicalT5 [58], called VisualT5. The approach also features a modified spatial attention module for interpretability, highlighting the image areas that are important for the model's decisions. The approach achieved a BERTScore of 0.6128 and a ROUGE score of 0.2135 for caption prediction.

DLNU_CCSE The team's approach achieved a BERTScore of 0.6066 and a ROUGE score of 0.2179; no working notes were submitted by the team.

Kaprov [25] The Kaprov team implemented a combination of a Visual Geometry Group (VGG)-16 [62]-based CNN and a Long Short-Term Memory (LSTM) [63] model for the caption prediction task. The team achieved a BERTScore of 0.5964 and a ROUGE score of 0.1905 on the private test set.

DS@BioMed [20] The best-performing model submitted by the DS@BioMed team combined a BERT [34] Pre-Training of Image Transformers (BEiT) [64] model with a BioBART [59] model. This model combined the information extracted from the medical images with the concepts extracted in the concept detection task. The team achieved a BERTScore of 0.5794 and a ROUGE score of 0.1031 on the private test set.

DBS-HHU [19] The DBS-HHU team based their caption prediction approach on simple pre-processing (lowercasing, punctuation removal, replacement of numbers with a number token) to focus on the linguistic content. Two models, a fine-tuned Generative Image-to-text Transformer (GIT) [65] base model and GIT-large, were then employed for caption generation. Both models achieved nearly equal scores, with the large model achieving the higher BERTScore of 0.5769 and a ROUGE score of 0.1531.

KDE-MED-CAPTION [28] The KDE-MED-CAPTION team implemented a caption retrieval approach. First, a priority-based partitioning was implemented. Afterwards, EfficientNet [43], ResNeXt [66], and ViT [48] models were trained for concept detection and used for feature extraction. Similarity measures were used to compare the extracted features of the test samples with those of the training samples, and the caption of the most similar training sample was predicted for each test sample. The best model submitted by the KDE-MED-CAPTION team reached a BERTScore of 0.5673 and a ROUGE score of 0.1325.

To summarize, in the caption prediction subtask, teams primarily utilized encoder-decoder frameworks with various backbones, including transformer-based decoders and LSTMs [63]. ViTs [48] were commonly employed for feature extraction. Some approaches integrated concept detection into the caption generation process by providing predicted concepts as input to the encoder along with the images. This year saw a notable increase in the use of LLMs such as BioBART [59] and ClinicalT5 [58] and of Vision Language Models (VLMs), including LLaVA [55] and IDEFICS [56], with some teams experimenting with visual instruction tuning. Only one team used a retrieval-based approach for this subtask. The winning team introduced medical vision-language foundation models (Med-VLFMs) by combining general and specialist vision models to achieve top rankings in the challenge.
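To make the encoder-decoder pattern concrete, the following minimal inference sketch uses an off-the-shelf BLIP checkpoint from Hugging Face; it illustrates the generic vision-encoder/text-decoder workflow rather than any team's fine-tuned system, and the image path is a placeholder.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Generic BLIP captioning checkpoint; participating teams fine-tuned comparable
# encoder-decoder models on the ROCOv2 training captions.
model_name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

image = Image.open("example_radiograph.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

# Beam-search decoding of the caption from the visual features.
output_ids = model.generate(**inputs, max_new_tokens=60, num_beams=3)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```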
This is the second iteration of the caption prediction subtask to use BERTScore and ROUGE as the primary and secondary evaluation metrics, after BLEU-1 had been used as the primary evaluation metric in all previous iterations. While some teams were still mainly optimizing for the BLEU-1 score last year, resulting in a wide spread of scores across the different metrics, with some teams scoring very strongly on some metrics and very weakly on others, the scores were much more even this year, with the winning approach scoring strongly across all metrics. Even though last year's winning team CSIRO achieved an all-time high BERTScore of 0.6425, a notable overall increase is visible in the returning teams' scores. For example, this year's winning team PCLmed increased their prior score from 0.6152 to 0.6299. The same applies to other teams: CS_Morgan (0.5819 vs. 0.6281), the AUEB-NLP-Group (0.6170 vs. 0.6211), and team DLNU_CCSE (0.6005 vs. 0.6066). Such notable increases are also observable for the other scores ROUGE, BLEURT, CIDEr, METEOR, and CLIPScore. The main reasons for the improvements are likely the continuous refinement of the teams' approaches, while experimentation with new approaches did not yield breakthrough improvements. The newly introduced metrics ClinicalBLEURT and MedBERTScore grant additional insight. The new optional explainability extension was not adopted by the teams; only the MICLabNM team [24] submitted explainability results, after the end of the submission phase.

6. Conclusion

This year's caption task of ImageCLEFmedical once again ran with both subtasks, concept detection and caption prediction. It used the newly released ROCOv2 [14] as the development dataset. It attracted 14 teams, who submitted 82 graded runs, using the AI4MediaBench platform for the first time. For the concept detection task, the F1-score and a secondary F1-score, considering only the manually curated concepts, were used. After the primary evaluation metric for the caption prediction subtask was changed from BLEU to BERTScore last year, additional, more domain-specific metrics were added this year, one of which may be used as the primary metric next year.

The caption prediction subtask was again more popular than the concept detection subtask this year, with 6 teams participating in both subtasks, 5 teams participating only in the caption prediction subtask, and 3 teams participating only in the concept detection subtask. As before, the teams generally approached the two tasks completely separately, with only the DS@BioMed team using the generated concepts for the predicted captions. Like in the 2023 challenge [10], teams generally used multi-label classification systems for the concept detection subtask, with the winning team using an ensemble of four CNNs. Only one team integrated image retrieval systems in some of their approaches. For the caption prediction subtask, encoder-decoder frameworks were used by most teams, with ViTs being used to extract features. LLMs were increasingly used for caption generation and fine-tuning. The winning approach used Med-VLFMs, combining general and specialist vision models.

For the concept detection subtask, the overall primary F1-scores increased strongly compared to last year despite very similar approaches being employed by the teams. In addition to continuously improved and scaled-up approaches by the teams, a large part of the improvement can be explained by the lower number of unique concepts in the test set compared to last year. A similar general picture emerges from this year's caption prediction results. The top BERTScore was slightly lower than last year's, but last year's winners CSIRO [67] did not participate this year.
Returning teams improved their scores across the board showing that the dataset for this year is comparable to last year for the caption prediction and that while teams have experimented with many different approaches including LLMs for caption generation, no breakthrough improvement has been achieved with these new techniques. For next year’s ImageCLEFmedical Caption challenge, some possible improvements include an improved caption prediction evaluation metric which is specific to medical texts, as well as additional metrics for readability and factuality. A comprehensive analysis of different metrics is planned to determine whether they should be used as primary indicators or whether a combination of different metrics would be more appropriate for this task, given the complex nature of evaluating generated captions. An additional focus will be explainability. The optional extension to the caption prediction subtask where participants were asked to provide explainability results for a small subset of images was not adopted by the participants, with only a single team submitting explainability results after the end of the submission phase. For next year, examples will be provided for how these explainability results could look and it might be extracted into its own subtask. Acknowledgments This work was partially supported by the University of Essex GCRF QR Engagement Fund provided by Research England (grant number G026). The work of Louise Bloch, Benjamin Bracke and Raphael Brüngel was partially funded by a PhD grant from the University of Applied Sciences and Arts Dortmund (FH Dortmund), Germany. The work of Ahmad Idrissi-Yaghir, Henning Schäfer, Tabea M. G. Pakull and Hendrik Damm was funded by a PhD grant from the DFG Research Training Group 2535 Knowledge- and data-based personalisation of medicine at the point of care (WisPerMed). References [1] A. García Seco de Herrera, R. Schaer, S. Bromuri, H. Müller, Overview of the ImageCLEF 2016 medical task, in: Working Notes of CLEF 2016 (Cross Language Evaluation Forum), 2016, pp. 219–232. [2] C. Eickhoff, I. Schwall, A. G. S. de Herrera, H. Müller, Overview of ImageCLEFcaption 2017 - image caption prediction and concept detection for biomedical images, in: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017., 2017. URL: http://ceur-ws.org/Vol-1866/invited_paper_7.pdf. [3] A. G. S. de Herrera, C. Eickhoff, V. Andrearczyk, H. Müller, Overview of the ImageCLEF 2018 caption prediction tasks, in: Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018., 2018. URL: http://ceur-ws.org/Vol-2125/invited_ paper_4.pdf. [4] O. Pelka, C. M. Friedrich, A. G. S. de Herrera, H. Müller, Overview of the ImageCLEFmed 2019 concept detection task, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: http://ceur-ws. org/Vol-2380/paper_245.pdf. [5] O. Pelka, C. M. Friedrich, A. García Seco de Herrera, H. Müller, Overview of the ImageCLEFmed 2020 concept prediction task: Medical image understanding, in: CLEF2020 Working Notes, volume 1166 of CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2020. [6] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) 267–270. 
doi:10.1093/nar/gkh061. [7] O. Pelka, A. Ben Abacha, A. García Seco de Herrera, J. Jacutprakart, C. M. Friedrich, H. Müller, Overview of the ImageCLEFmed 2021 concept & caption prediction task, in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021, pp. 1101–1112. [8] J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2022 – caption prediction and concept detection, in: CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022. [9] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology Objects in COntext (ROCO): a multimodal image dataset, in: Intravascular Imaging and Computer Assisted Stenting - and - Large- Scale Annotation of Biomedical Data and Expert Label Synthesis - 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings, 2018, pp. 180–189. doi:10.1007/ 978-3-030-01364-6\_20. [10] J. Rückert, A. Ben Abacha, A. G. Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, H. Müller, C. M. Friedrich, Overview of ImageCLEFmedical 2023 – caption prediction and concept detection, in: CLEF2023 Working Notes, volume 3497 of CEUR Workshop Proceedings, CEUR- WS.org, Thessaloniki, Greece, 2023, pp. 1328 – 1346. [11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318. [12] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr. [13] B. Ionescu, H. Müller, A. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, T. M. G. Pakull, H. Damm, B. Bracke, C. M. Friedrich, A. Andrei, Y. Prokopchuk, D. Karpenka, A. Radzhabov, V. Kovalev, C. Macaire, D. Schwab, B. Lecouteux, E. Esperança-Rodier, W. Yim, Y. Fu, Z. Sun, M. Yetisgen, F. Xia, S. A. Hicks, M. A. Riegler, V. Thambawita, A. Storås, P. Halvorsen, M. Heinrich, J. Kiesel, M. Potthast, B. Stein, Overview of ImageCLEF 2024: Multimedia retrieval in medical applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science LNCS, Grenoble, France, 2024. [14] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. B. Abacha, A. G. S. de Herrera, H. Müller, P. A. Horn, F. Nensa, C. M. Friedrich, ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset, Scientific Data (2024). URL: https://arxiv.org/abs/2405.10004v1. doi:10.1038/s41597-024-03496-6. [15] R. J. Roberts, PubMed Central: The GenBank of the published literature, Proceedings of the National Academy of Sciences of the United States of America 98 (2001) 381–382. doi:10.1073/ pnas.98.2.381. [16] L. S. Shapley, et al., A value for n-person games (1953). [17] S. M. Lundberg, S.-I. 
Lee, A unified approach to interpreting model predictions, in: Neural Information Processing Systems, volume 30, 2017, pp. 4768 – 4777. [18] M. Samprovalaki, A. Chatzipapadopoulou, G. Moschovis, F. Charalampakos, P. Kaliosis, J. Pavlopou- los, I. Androutsopoulos, AUEB NLP group at ImageCLEFmedical 2024, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [19] H. Kauschke, K. Bogomasov, S. Conrad, Predicting captions and detecting concepts for medical images: Contributions of the DBS-HHU team to ImageCLEFmedical caption 2024, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [20] N. N. Nguyen, H. L.Tu, P. D.Nguyen, T. N.Do, T. M.Thai, T. B. Nguyen-Tat, DS@BioMed at ImageCLEFmedical caption 2024: Enhanced attention mechanisms in medical caption generation through concept detection integration, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [21] R. Dhinagaran, S. S. N. Mohamed, K. Srinivasan, SSNMLRGKSR at ImageCLEFmedical caption 2024: Medical concept detection using DenseNet-121 with MultiLabelBinarizer, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [22] M. Hoque, M. R. Hasan, M. I. S. Emon, F. Khalifa, M. M. Rahman, Medical image interpretation with large multimodal models, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [23] A. Moncloa-Muro, G. Ramirez-Alonso, F. Martinez-Reyes, Automatic medical concept detection on images: dividing the task into smaller ones, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [24] D. Carmo, L. Rittner, R. Lotufo, VisualT5: Multitasking caption and concept prediction with pre-trained ViT, T5 and customized spatial attention in radiological images, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [25] P. Balasundaram, K. Swaminathan, O. Sampath, P. KM, Concept detection and caption prediction of radiology images using convolutional neural networks, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [26] B. Yang, Y. Yu, Y. Zou, T. Zhang, PCLmed: Champion solution for ImageCLEFmedical 2024 caption prediction challenge via medical vision-language foundation models, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [27] S. Ram, S. Vinoth, R. N. Gopalakrishnan, A. A. Balakumar, L. Kalinathan, T. A. J. Velankanni, Leveraging diverse CNN architectures for medical image captioning: DenseNet-121, MobileNetV2, and ResNet-50 in ImageCLEF 2024, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [28] M. Aono, T. Asakawa, K. Shimizu, K. Nomura, Medical image captioning using CUI-based classification and feature similarity, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [29] T. V. Phan, T. K. Nguyen, Q. A. Hoang, Q. T. Phan, T. B. Nguyen-Tat, MedBLIP: Multimodal medical image captioning using BLIP, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. [30] Q. V. Nguyen, Q. H. Pham, D. Q. Tran, T. K.-B. Nguyen, N.-H. Nguyen-Dang, B.-T. Nguyen-Tat, UIT- DarkCow team at ImageCLEFmedical caption 2024: Diagnostic captioning for radiology images efficiency with transformer models, in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024. 
[31] Z. Kraljevic, T. Searle, A. Shek, L. Roguski, K. Noor, D. Bean, A. Mascio, L. Zhu, A. A. Folarin, A. Roberts, R. Bendayan, M. P. Richardson, R. Stewart, A. D. Shah, W. K. Wong, Z. Ibrahim, J. T. Teo, R. J. Dobson, Multi-domain clinical natural language processing with MedCAT: The medical concept annotation toolkit, Artificial Intelligence in Medicine 117 (2021) 102083. URL: https://www.sciencedirect.com/science/article/pii/S0933365721000762. doi:10.1016/j.artmed.2021.102083.
[32] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, R. G. Mark, MIMIC-III, a freely accessible critical care database, Scientific Data 3 (2016). URL: https://doi.org/10.1038/sdata.2016.35. doi:10.1038/sdata.2016.35.
[33] T. M. Lehmann, H. Schubert, D. Keysers, M. Kohnen, B. B. Wein, The IRMA code for unique classification of medical images, in: H. K. Huang, O. M. Ratib (Eds.), Medical Imaging 2003: PACS and Integrated Medical Information Systems: Design and Evaluation, SPIE, 2003. doi:10.1117/12.480677.
[34] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[35] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[36] M. Denkowski, A. Lavie, Meteor universal: Language specific translation evaluation for any target language, in: Proceedings of the Ninth Workshop on Statistical Machine Translation, Association for Computational Linguistics, 2014, pp. 376–380. URL: http://aclweb.org/anthology/W14-3348. doi:10.3115/v1/W14-3348.
[37] R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based image description evaluation, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015, pp. 4566–4575. URL: http://ieeexplore.ieee.org/document/7299087/. doi:10.1109/CVPR.2015.7299087.
[38] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust metrics for text generation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–7892. URL: https://aclanthology.org/2020.acl-main.704. doi:10.18653/v1/2020.acl-main.704.
[39] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, Y. Choi, CLIPScore: A reference-free evaluation metric for image captioning, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7514–7528. URL: https://aclanthology.org/2021.emnlp-main.595. doi:10.18653/v1/2021.emnlp-main.595.
[40] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 8748–8763. URL: http://proceedings.mlr.press/v139/radford21a.html.
[41] A. Ben Abacha, W.-w. Yim, G. Michalopoulos, T. Lin, An investigation of evaluation methods in automatic medical note generation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2575–2588. URL: https://aclanthology.org/2023.findings-acl.161. doi:10.18653/v1/2023.findings-acl.161.
[42] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
[43] M. Tan, Q. V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: Proceedings of the International Conference on Machine Learning (ICML 2019), 2019, pp. 6105–6114.
[44] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017, pp. 2261–2269. doi:10.1109/CVPR.2017.243.
[45] S. Zagoruyko, N. Komodakis, Wide residual networks, in: Proceedings of the British Machine Vision Conference (BMVC 2016), 2016. doi:10.5244/c.30.87.
[46] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[47] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, F. Wei, B. Guo, Swin Transformer V2: Scaling up capacity and resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), 2022, pp. 11999–12009. doi:10.1109/CVPR52688.2022.01170.
[48] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: Proceedings of the International Conference on Learning Representations (ICLR 2021), 2021.
[49] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[50] A. Trockman, J. Z. Kolter, Patches are all you need?, Transactions on Machine Learning Research (2023). URL: https://openreview.net/forum?id=rAnB7JSMXL.
[51] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[52] Y. Fang, W. Wang, B. Xie, Q. Sun, L. Wu, X. Wang, T. Huang, X. Wang, Y. Cao, EVA: Exploring the limits of masked visual representation learning at scale, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023), 2023, pp. 19358–19369. doi:10.1109/CVPR52729.2023.01855.
[53] S. Zhang, Y. Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, C. Wong, A. Tupini, Y. Wang, M. Mazzola, S. Shukla, L. Liden, J. Gao, M. P. Lungren, T. Naumann, S. Wang, H. Poon, BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs, 2024. arXiv:2303.00915v2.
[54] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li, Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo, Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi, F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang, Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan, Y. Wang, X. Jin, Q. Liu, Y. Tian, PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation, 2021. arXiv:2104.12369v1.
[55] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL: https://openreview.net/forum?id=w0H2xGHlkw.
[56] H. Laurençon, L. Saulnier, L. Tronchon, S. Bekman, A. Singh, A. Lozhkov, T. Wang, S. Karamcheti, A. Rush, D. Kiela, M. Cord, V. Sanh, OBELICS: An open web-scale filtered dataset of interleaved image-text documents, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 71683–71702. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/e2cfb719f58585f779d0a4f9f07bd618-Paper-Datasets_and_Benchmarks.pdf.
[57] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, VinVL: Revisiting visual representations in vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 2021, pp. 5575–5584. doi:10.1109/CVPR46437.2021.00553.
[58] Q. Lu, D. Dou, T. Nguyen, ClinicalT5: A generative language model for clinical text, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 5436–5443. doi:10.18653/v1/2022.findings-emnlp.398.
[59] H. Yuan, Z. Yuan, R. Gan, J. Zhang, Y. Xie, S. Yu, BioBART: Pretraining and evaluation of a biomedical generative language model, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 21st Workshop on Biomedical Language Processing (BioNLP 2022), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 97–109. doi:10.18653/v1/2022.bionlp-1.9.
[60] W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, S. Hoi, InstructBLIP: Towards general-purpose vision-language models with instruction tuning, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 49250–49267.
[61] J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: International Conference on Machine Learning, 2022, pp. 12888–12900.
[62] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the International Conference on Learning Representations (ICLR 2015), 2015.
[63] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780. URL: https://doi.org/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.
[64] H. Bao, L. Dong, S. Piao, F. Wei, BEiT: BERT pre-training of image transformers, in: Proceedings of the International Conference on Learning Representations (ICLR 2022), 2022. URL: https://openreview.net/forum?id=p-BhZSz59o4.
[65] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, L. Wang, GIT: A generative image-to-text transformer for vision and language, Transactions on Machine Learning Research 2022 (2022).
[66] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017, pp. 5987–5995. doi:10.1109/CVPR.2017.634.
[67] A. Nicolson, J. Dowling, B. Koopman, A concise model for medical image captioning, in: CLEF2023 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2023.

A. Full Results

Table 7
Performance of the participating teams in the ImageCLEFmedical 2024 Concept Detection subtask.

Group Name Run F1 Secondary F1 Rank (secondary)
DBS-HHU 601 0.6375 0.9534 1 (1)
DBS-HHU 602 0.6375 0.9534 2 (2)
DBS-HHU 603 0.6375 0.9534 3 (3)
auebnlpgroup 644 0.6319 0.9393 4 (8)
DBS-HHU 625 0.6309 0.9488 5 (4)
auebnlpgroup 648 0.6308 0.9321 6 (13)
auebnlpgroup 642 0.6304 0.9333 7 (12)
auebnlpgroup 624 0.6274 0.9376 8 (9)
auebnlpgroup 640 0.6273 0.9416 9 (7)
DBS-HHU 600 0.6269 0.9461 10 (5)
DBS-HHU 604 0.6269 0.9461 11 (6)
auebnlpgroup 619 0.6241 0.9339 12 (11)
auebnlpgroup 654 0.6207 0.9243 13 (15)
DS@BioMed 653 0.6200 0.9312 14 (14)
auebnlpgroup 656 0.6162 0.9218 15 (18)
auebnlpgroup 655 0.6156 0.9234 16 (17)
auebnlpgroup 651 0.6136 0.9239 17 (16)
DS@BioMed 652 0.6108 0.9193 18 (19)
DS@BioMed 365 0.6090 0.9177 19 (20)
DS@BioMed 364 0.6090 0.9177 20 (21)
SSNMLRGKSR 425 0.6001 0.9056 21 (22)
SSNMLRGKSR 422 0.6001 0.9056 22 (23)
UACH-VisionLab 235 0.5988 0.9363 23 (10)
MICLabNM 681 0.5795 0.8835 24 (24)
MICLabNM 680 0.5594 0.8568 25 (25)
SSNMLRGKSR 421 0.5463 0.7969 26 (29)
MICLabNM 275 0.5343 0.8133 27 (28)
UACH-VisionLab 290 0.5292 0.8422 28 (26)
MICLabNM 679 0.5282 0.8325 29 (27)
Kaprov 558 0.4609 0.7301 30 (30)
DBS-HHU 610 0.3417 0.4477 31 (31)
DBS-HHU 616 0.3413 0.4340 32 (32)
VIT_ConceptZ 233 0.1812 0.2647 33 (33)
VIT_ConceptZ 471 0.1812 0.2647 34 (34)
VIT_ConceptZ 487 0.1785 0.2536 35 (35)
VIT_ConceptZ 488 0.1143 0.2308 36 (36)
CS_Morgan 530 0.1076 0.2105 37 (37)
DS@BioMed 242 0.0019 0.0032 38 (38)

Table 8
Performance of the participating teams in the ImageCLEFmedical 2024 Caption Prediction subtask for the metrics BERTScore, ROUGE, BLEU-1, BLEURT, ClinicalBLEURT, and METEOR.
Group Name Run BERTScore ROUGE BLEU-1 BLEURT ClinicalBLEURT METEOR
pclmed 634 0.6299 0.2726 0.2690 0.3376 0.4666 0.1133
CS_Morgan 429 0.6281 0.2508 0.2093 0.3174 0.4559 0.0927
DarkCow 220 0.6267 0.2452 0.1950 0.3060 0.4562 0.0889
CS_Morgan 527 0.6254 0.2454 0.2076 0.3165 0.4435 0.0892
CS_Morgan 526 0.6250 0.2440 0.2049 0.3153 0.4438 0.0898
pclmed 633 0.6235 0.2717 0.2680 0.3386 0.4671 0.1121
CS_Morgan 525 0.6230 0.2380 0.1951 0.3096 0.4358 0.0854
pclmed 632 0.6227 0.2690 0.2650 0.3365 0.4654 0.1110
auebnlpgroup 630 0.6211 0.2049 0.1110 0.2899 0.4866 0.0680
auebnlpgroup 635 0.6210 0.2047 0.1108 0.2895 0.4870 0.0680
auebnlpgroup 646 0.6210 0.2044 0.1107 0.2900 0.4872 0.0678
auebnlpgroup 647 0.6210 0.1807 0.0860 0.2846 0.5021 0.0580
DarkCow 243 0.6200 0.2139 0.1685 0.2913 0.4597 0.0751
2Q2T 643 0.6178 0.2478 0.2213 0.3139 0.4759 0.0986
2Q2T 682 0.6178 0.2478 0.2213 0.3139 0.4759 0.0986
CS_Morgan 613 0.6173 0.2178 0.1559 0.2976 0.4487 0.0730
CS_Morgan 529 0.6166 0.2160 0.1827 0.3058 0.4534 0.0760
2Q2T 683 0.6165 0.2501 0.2353 0.3153 0.4748 0.1018
auebnlpgroup 650 0.6159 0.1936 0.1050 0.2859 0.4874 0.0638
CS_Morgan 528 0.6157 0.2237 0.1741 0.3005 0.4339 0.0770
auebnlpgroup 564 0.6153 0.2052 0.1274 0.2920 0.4844 0.0698
MICLab 678 0.6128 0.2135 0.1853 0.3067 0.4453 0.0772
auebnlpgroup 605 0.6114 0.1889 0.1147 0.2796 0.4834 0.0616
auebnlpgroup 639 0.6111 0.1827 0.0744 0.2717 0.5212 0.0515
auebnlpgroup 577 0.6107 0.1838 0.0751 0.2706 0.5158 0.0513
2Q2T 512 0.6106 0.2353 0.2069 0.3209 0.4459 0.0884
2Q2T 684 0.6092 0.2342 0.2148 0.3243 0.4467 0.0893
2Q2T 592 0.6091 0.2341 0.2148 0.3243 0.4468 0.0892
2Q2T 595 0.6091 0.2341 0.2148 0.3243 0.4468 0.0892
MICLab 676 0.6072 0.1922 0.1480 0.2905 0.4608 0.0642
DLNU_CCSE 674 0.6066 0.2179 0.1512 0.2831 0.4756 0.0704
DarkCow 221 0.5994 0.2363 0.2323 0.2954 0.4597 0.0989
Kaprov 559 0.5964 0.1905 0.1697 0.2951 0.4400 0.0609
MICLab 274 0.5888 0.1933 0.1626 0.2864 0.4443 0.0617
DLNU_CCSE 675 0.5839 0.1844 0.1579 0.2756 0.4524 0.0594
DS@BioMed 571 0.5794 0.1031 0.0121 0.2202 0.5295 0.0353
DS@BioMed 563 0.5794 0.1031 0.0121 0.2202 0.5295 0.0353
DBS-HHU 637 0.5769 0.1531 0.1493 0.2710 0.4766 0.0559
DBS-HHU 645 0.5769 0.1531 0.1493 0.2710 0.4766 0.0559
KDE-medical-caption 557 0.5673 0.1325 0.1060 0.2566 0.5022 0.0386
KDE-medical-caption 544 0.5665 0.1273 0.1151 0.2513 0.5220 0.0438
KDE-medical-caption 424 0.5646 0.1223 0.1030 0.2439 0.5082 0.0413
KDE-medical-caption 423 0.5646 0.1223 0.1030 0.2439 0.5082 0.0413
KDE-medical-caption 460 0.5630 0.1199 0.1035 0.2410 0.5240 0.0406
DS@BioMed 555 0.5580 0.1355 0.0600 0.2606 0.5239 0.0548
DS@BioMed 556 0.5580 0.1355 0.0600 0.2606 0.5239 0.0548
DLNU_CCSE 673 0.5462 0.0924 0.0982 0.2279 0.5167 0.0306
CS_Morgan 614 0.5458 0.1184 0.1024 0.2447 0.4501 0.0351
DS@BioMed 313 0.4454 0.0950 0.0899 0.3122 0.6271 0.0504
DS@BioMed 465 0.4454 0.0950 0.0899 0.3122 0.6271 0.0504
DS@BioMed 314 0.4433 0.0952 0.0893 0.3351 0.6231 0.0508
CS_Morgan 615 0.4143 0.0442 0.0289 0.2614 0.6769 0.0199
MICLab 677 0.3739 0.0823 0.0510 0.1601 0.4985 0.0181

Table 9
Performance of the participating teams in the ImageCLEFmedical 2024 Caption Prediction subtask for the metrics CIDEr, CLIPScore, RefCLIPScore, and MedBERTScore.
Group Name Run CIDEr CLIPScore RefCLIPScore MedBERTScore
pclmed 634 0.2681 0.8236 0.8176 0.6323
CS_Morgan 429 0.2450 0.8213 0.8155 0.6327
DarkCow 220 0.2243 0.8184 0.8117 0.6292
CS_Morgan 527 0.2241 0.8208 0.8143 0.6315
CS_Morgan 526 0.2199 0.8242 0.8147 0.6300
pclmed 633 0.2597 0.8231 0.8169 0.6254
CS_Morgan 525 0.2034 0.8227 0.8121 0.6298
pclmed 632 0.2521 0.8217 0.8162 0.6242
auebnlpgroup 630 0.1769 0.8041 0.7987 0.6261
auebnlpgroup 635 0.1762 0.8040 0.7986 0.6260
auebnlpgroup 646 0.1758 0.8041 0.7988 0.6261
auebnlpgroup 647 0.1459 0.7936 0.7912 0.6291
DarkCow 243 0.1585 0.8132 0.8014 0.6233
2Q2T 643 0.2200 0.8271 0.8138 0.6224
2Q2T 682 0.2200 0.8271 0.8138 0.6224
CS_Morgan 613 0.1708 0.8166 0.8067 0.6262
CS_Morgan 529 0.1619 0.8151 0.8071 0.6243
2Q2T 683 0.2204 0.8284 0.8137 0.6212
auebnlpgroup 650 0.1597 0.7990 0.7948 0.6212
CS_Morgan 528 0.1730 0.8193 0.8075 0.6246
auebnlpgroup 564 0.1728 0.8045 0.7968 0.6197
MICLab 678 0.1582 0.8159 0.8049 0.6172
auebnlpgroup 605 0.1305 0.8037 0.7962 0.6174
auebnlpgroup 639 0.1293 0.7858 0.7845 0.6141
auebnlpgroup 577 0.1292 0.7832 0.7826 0.6134
2Q2T 512 0.1923 0.8215 0.8147 0.6169
2Q2T 684 0.1948 0.8226 0.8141 0.6162
2Q2T 592 0.1950 0.8226 0.8141 0.6161
2Q2T 595 0.1950 0.8226 0.8141 0.6161
MICLab 676 0.1229 0.7989 0.7915 0.6142
DLNU_CCSE 674 0.1688 0.7967 0.7904 0.6130
DarkCow 221 0.1442 0.8244 0.8100 0.6016
Kaprov 559 0.1070 0.7922 0.7872 0.6089
MICLab 274 0.1082 0.7688 0.7694 0.5963
DLNU_CCSE 675 0.0859 0.7562 0.7506 0.5921
DS@BioMed 571 0.0715 0.7756 0.7748 0.5804
DS@BioMed 563 0.0715 0.7756 0.7748 0.5804
DBS-HHU 637 0.0644 0.7842 0.7750 0.5827
DBS-HHU 645 0.0644 0.7842 0.7750 0.5827
KDE-medical-caption 557 0.0384 0.7651 0.7610 0.5697
KDE-medical-caption 544 0.0499 0.7615 0.7577 0.5700
KDE-medical-caption 424 0.0449 0.7608 0.7580 0.5683
KDE-medical-caption 423 0.0449 0.7608 0.7580 0.5683
KDE-medical-caption 460 0.0425 0.7592 0.7551 0.5674
DS@BioMed 555 0.1043 0.7999 0.7948 0.5487
DS@BioMed 556 0.1043 0.7999 0.7948 0.5487
DLNU_CCSE 673 0.0145 0.6913 0.6989 0.5517
CS_Morgan 614 0.0288 0.6853 0.6924 0.5563
DS@BioMed 313 0.0425 0.7757 0.7675 0.4282
DS@BioMed 465 0.0425 0.7757 0.7675 0.4282
DS@BioMed 314 0.0449 0.7850 0.7736 0.4308
CS_Morgan 615 0.0034 0.6665 0.6698 0.4062
MICLab 677 0.0092 0.6366 0.6614 0.3714
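For orientation, the concept detection scores in Table 7 are F1 values computed between predicted and reference UMLS concepts. A common way to obtain such a score, sketched below, is a per-image F1 averaged over the test set; the input format and function names are illustrative assumptions, and this is not the official evaluation script, which may differ in details such as how the secondary F1 over manually curated concepts is derived.

```python
# Minimal sketch (not the official evaluation script): per-image F1 between
# predicted and reference UMLS CUI sets, averaged over all test images.
# The dictionary-based input format is an assumption for illustration only.
from typing import Dict, Set


def f1(pred: Set[str], gold: Set[str]) -> float:
    """Set-based F1 between predicted and reference CUIs for one image."""
    if not pred and not gold:
        return 1.0  # assumption: two empty sets count as perfect agreement
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: Dict[str, Set[str]], references: Dict[str, Set[str]]) -> float:
    """Average the per-image F1 over every image in the reference set."""
    scores = [f1(predictions.get(image_id, set()), gold)
              for image_id, gold in references.items()]
    return sum(scores) / len(scores)


# Example with two hypothetical images and CUIs:
refs = {"img1": {"C0040405", "C0817096"}, "img2": {"C0024485"}}
preds = {"img1": {"C0040405"}, "img2": {"C0024485", "C0817096"}}
print(round(mean_f1(preds, refs), 4))  # 0.6667
```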