Overview of ImageCLEFmedical 2022 – Caption
Prediction and Concept Detection
Johannes Rückert1 , Asma Ben Abacha2 , Alba G. Seco de Herrera3 , Louise Bloch1,4 ,
Raphael Brüngel1,4 , Ahmad Idrissi-Yaghir1,4 , Henning Schäfer5 , Henning Müller6,7 and
Christoph M. Friedrich1,4
1 Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany
2 Microsoft, Redmond, Washington, USA
3 University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK
4 Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany
5 Institute for Transfusion Medicine, University Hospital Essen, Essen, Germany
6 University of Applied Sciences Western Switzerland (HES-SO), Switzerland
7 University of Geneva, Switzerland


                                         Abstract
                                         The 2022 ImageCLEFmedical caption prediction and concept detection tasks follow similar challenges that
                                         were already run from 2017–2021. The objective is to extract Unified Medical Language System (UMLS)
                                         concept annotations and/or captions from the image data that are then compared against the original text
                                         captions of the images. The images used for both tasks are a subset of the extended Radiology Objects in
                                         COntext (ROCO) data set which was used in ImageCLEFmedical 2020. In the caption prediction task,
                                         lexical similarity with the original image captions is evaluated with the BiLingual Evaluation Understudy
                                         (BLEU) score. In the concept detection task, UMLS terms are extracted from the original text captions,
                                         combined with manually curated concepts for image modality and anatomy, and compared against the
                                         predicted concepts in a multi-label way. The F1-score was used to assess the performance. The task
                                         attracted strong participation, with 20 registered teams. In the end, 12 teams submitted 157 graded runs
                                         for the two subtasks. The results show that a variety of techniques can lead to good prediction
                                         results for the two tasks. Participants used image retrieval systems for both tasks, while multi-label
                                         classification systems were used mainly for the concept detection, and Transformer-based architectures
                                         primarily for the caption prediction subtask.

                                         Keywords
                                         Concept Detection, Computer Vision, ImageCLEF 2022, Image Understanding, Image Modality, Radiology,
                                         Caption Prediction




CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
$ johannes.rueckert@fh-dortmund.de (J. Rückert); abenabacha@microsoft.com (A. Ben Abacha);
alba.garcia@essex.ac.uk (A. G. Seco de Herrera); louise.bloch@fh-dortmund.de (L. Bloch);
raphael.bruengel@fh-dortmund.de (R. Brüngel); ahmad.idrissi-yaghir@fh-dortmund.de (A. Idrissi-Yaghir);
henning.schaefer@uk-essen.de (H. Schäfer); henning.mueller@hevs.ch (H. Müller);
christoph.friedrich@fh-dortmund.de (C. M. Friedrich)
 0000-0002-5038-5899 (J. Rückert); 0000-0001-6312-9387 (A. Ben Abacha); 0000-0002-6509-5325 (A. G. Seco de
Herrera); 0000-0001-7540-4980 (L. Bloch); 0000-0002-6046-4048 (R. Brüngel); 0000-0003-1507-9690 (A. Idrissi-Yaghir);
0000-0002-4123-0406 (H. Schäfer); 0000-0001-6800-9878 (H. Müller); 0000-0001-7906-0038 (C. M. Friedrich)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction
The caption task was first proposed as part of ImageCLEFmedical [1] in 2016. In 2017 and
2018 [2, 3] the ImageCLEFmedical caption task comprised two subtasks: concept detection and
caption prediction. In 2019 [4] and 2020 [5], the task concentrated on extracting Unified Medical
Language System® (UMLS) Concept Unique Identifiers (CUIs) [6] from radiology images.
   In 2021 [7], both subtasks, concept detection and caption prediction, were run again due
to participant demand. The focus in 2021 was on making the task more realistic by using
fewer images, all of which were manually annotated by medical doctors. As additional data of
similar quality is hard to acquire, the 2022 ImageCLEFmedical caption task continues with both
subtasks, albeit with an extended version of the Radiology Objects in COntext (ROCO) [8] data
set, which was already used in 2019 and 2020.
   This paper describes the approaches submitted to the caption task, which asks participants
to automatically generate coherent captions for radiology images, with UMLS concept detection
serving as a first step towards this goal. The task is part of the ImageCLEF benchmarking campaign,
which has proposed medical image understanding tasks since 2003 and introduces a new suite of
tasks each year. Further information on the other tasks proposed at ImageCLEF 2022 can be found
in Ionescu et al. [9].
   This is the 6th edition of the ImageCLEFmedical caption task. Just like in 2016 [1], 2017 [2],
2018 [3], and 2021 [7], both subtasks of concept detection and caption prediction are included
in ImageCLEFmedical Caption 2022. Like in 2020, an extended subset of the ROCO [8] data set
is used to provide a much larger data set compared to 2021.
   Manually annotating medical images is a time-consuming process that is prone to human error.
Since such annotations support easier and more reliable diagnosis of diseases detectable through
radiology screening, it is important to better understand and refine automatic systems that aid
in the broad task of radiology-image metadata generation. The purpose of the ImageCLEFmedical
2022 caption prediction and concept detection tasks is the continued evaluation of such systems.
Concept detection and caption prediction are applicable to unlabelled and unstructured data sets
as well as medical data sets that lack textual metadata. The ImageCLEFmedical caption task focuses
on medical image understanding in the biomedical literature, specifically on concept extraction
and caption prediction based on the visual content of the medical images and the associated text
data, such as the caption or UMLS CUIs paired with each image (see Figure 1).
   For the development data, an extended subset of the ROCO [8] data set from 2020 was used,
with new images from the same source added for the validation and test sets.
   This paper presents an overview of the ImageCLEFmedical caption task 2022 including the
task and participation in Section 2, the data creation in Section 3, and the evaluation methodology
in Section 4. The results are described in Section 5, followed by the conclusion in Section 6.


2. Task and Participation
In 2022, the ImageCLEFmedical caption task consisted of two subtasks: concept detection and
caption prediction.
   The concept detection subtask follows the same format proposed since the start of the task
in 2017. Participants are asked to predict a set of concepts defined by the UMLS CUIs [6] based
on the visual information provided by the radiology images.
   The caption prediction subtask follows the original format used in 2017 and 2018. It has been
running again since 2021 because of participant demand. This subtask aims to automatically
generate captions for the provided radiology images.
   In 2022, 20 teams registered and signed the End-User-Agreement that is needed to download
the development data. 12 teams submitted 157 runs for evaluation (all 12 teams submitted
working notes), which represents stronger participation than in 2021. Each group was allowed a
maximum of 10 graded runs per subtask.
   Table 1 shows all teams that participated in the task and their submitted runs. 11 teams
participated in the concept detection subtask this year; 3 of these teams also participated in 2021.
10 teams submitted runs to the caption prediction subtask; 4 of these teams also participated in
2021. Overall, 9 teams participated in both subtasks, two teams participated only in the concept
detection subtask, and one team participated only in the caption prediction subtask.


3. Data Creation
Figure 1 shows an example from the data set provided by the task.


 UMLS CUI                  UMLS Meaning

  C1306645                 Plain x-ray
  C0030797                 Pelvis
  C0332466                 Fused structure
  C0034014                 Bone structure of pubis
  C0205094                 Anterior
  C0005976                 Bone Transplantation
  C0021102                 Implants


 Caption: Anteroposterior pelvic radiograph of a 30-year-old female diagnosed with Ehlers-Danlos Syndrome
 demonstrating fusion of pubic symphysis and both sacroiliac joints (anterior plating, bone grafting and
 sacroiliac screw insertion)

Figure 1: Example of a radiology image with the corresponding UMLS® CUIs and caption extracted
from the 2022 ImageCLEFmedical caption task. CC BY [Ali et al. (2020)] [23]


   In the previous edition, in an attempt to make the task more realistic, the data set contained
a smaller number of real radiology images annotated by medical doctors, which resulted in
high-quality concepts.
   As additional data of similar quality is hard to acquire, it was decided to return to the data
set already used in 2019 and 2020, which originates from biomedical articles of the PMC
Table 1
Participating groups in the ImageCLEFmedical 2022 caption task and their graded runs submitted to
both subtasks: T1-Concept Detection and T2-Caption Prediction. Teams with previous participation in
2021 are marked with an asterisk (*).
 Team                        Institution                                                Runs T1   Runs T2
 AUEB-NLP-                   Department of Informatics, Athens                                6         9
 Group* [10]                 University of Economics and Business,
                             Athens, Greece
 CSIRO* [11]                 Australian e-Health Research Centre,                            10         9
                             Commonwealth Scientific and Industrial
                             Research Organisation, Herston,
                             Queensland, Australia and CSIRO Data61,
                             Imaging and Computer Vision Group,
                             Pullenvale, Queensland, Australia and
                             Queensland University of Technology,
                             Brisbane, Queensland, Australia
 eecs-kth [12]               KTH Royal Institute of Technology,                              10        10
                             Stockholm, Sweden
 CMRE-UoG                    Canon Medical Research Europe,                                   5         6
 (fdallaserra) [13]          Edinburgh, UK and University of Glasgow,
                             Glasgow, UK
 IUST_NLPLAB [14]            School of Computer Engineering, Iran                            10        10
                             University of Science and Technology,
                             Tehran, Islamic Republic Of Iran
 kdelab* [15] [16]           KDE Laboratory, Department of                                   10         8
                             Computer Science and Engineering,
                             Toyohashi University of Technology, Aichi,
                             Japan
 MAI_ImageSem* [17]          Institute of Medical Information and                            –          2
                             Library, Chinese Academy of Medical
                             Sciences and Peking Union Medical
                             College, Beijing, China
 Morgan_CS [18]              Morgan State University, Baltimore, MD,                          8         4
                             USA
 PoliMi-                     Politecnico di Milano, Milan, Italy                             10        –
 ImageClef [19]
 SDVA-UCSD [20]              San Diego VA HCS, San Diego, CA, USA                             1        –
 SSNSheerinKavitha           Department of CSE, Sri Sivasubramaniya                           6        7
 [21]                        Nadar College of Engineering, India
 vcmi [22]                   University of Porto, Porto, Portugal and                         9         7
                             INESC TEC, Porto, Portugal


Open Access Subset1 [24] and was extended with new images added since the last time the data
set was updated.
All captions were pre-processed by removing punctuation, numbers, and words containing
numbers. Additionally, lemmatization was applied using spaCy2 and the pre-trained model
en_core_web_lg. Finally, all captions were converted to lower-case.

   1
        https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ [last accessed: 28.06.2022]
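For illustration, the following sketch shows how such a pre-processing step could be implemented
with spaCy; it is a simplified, hypothetical re-implementation, and the exact rules used to build
the official data set may differ.

```python
import re

import spacy

# Large English model used for lemmatization (as stated above).
nlp = spacy.load("en_core_web_lg")


def preprocess_caption(caption: str) -> str:
    """Remove punctuation, numbers, and words containing numbers,
    lemmatize with spaCy, and lower-case the result."""
    # Drop every word that contains a digit (this also removes plain numbers).
    words = [w for w in caption.split() if not re.search(r"\d", w)]
    # Strip punctuation characters.
    text = re.sub(r"[^\w\s]", " ", " ".join(words))
    # Lemmatize and lower-case.
    doc = nlp(text)
    return " ".join(token.lemma_.lower() for token in doc if not token.is_space)


print(preprocess_caption("Axial CT image (slice 12) showing 2 nodules."))
# e.g. "axial ct image slice show nodule"
```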
   From the resulting captions, UMLS concepts were generated using a reduced subset of the
UMLS 2020 AB release3 , which includes the sections (restriction levels) 0, 1, 2, and 9. To improve
the feasibility of recognizing concepts from the images, concepts were filtered based on their
semantic type. Concepts with very low frequency were also removed, based on suggestions
from previous years.
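The frequency-based filtering can be sketched as follows; the helper and the threshold of 10 are
illustrative assumptions, since the actual cut-off used for the data set is not specified here.

```python
from collections import Counter


def filter_rare_concepts(image_to_cuis: dict[str, set[str]],
                         min_count: int = 10) -> dict[str, set[str]]:
    """Keep only CUIs that occur in at least `min_count` training images."""
    counts = Counter(cui for cuis in image_to_cuis.values() for cui in cuis)
    frequent = {cui for cui, n in counts.items() if n >= min_count}
    return {image: cuis & frequent for image, cuis in image_to_cuis.items()}


# Toy example: C0999999 occurs only once and is removed with min_count=2.
annotations = {
    "img1.jpg": {"C1306645", "C0030797"},
    "img2.jpg": {"C1306645", "C0999999"},
}
print(filter_rare_concepts(annotations, min_count=2))
```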
   Additional concepts addressing the image modality were assigned to all images. Six modality
concepts were covered: x-ray, computed tomography (CT), magnetic resonance imaging (MRI),
ultrasound, and positron emission tomography (PET), as well as modality combinations (e.g.,
PET/CT) as a standalone concept. For images of the x-ray modality, further concepts describing
the represented anatomy were assigned, covering specific anatomical body regions of the Image
Retrieval in Medical Application (IRMA) [25] classification: cranium, spine, upper extremity/arm,
chest, breast/mamma, abdomen, pelvis, and lower extremity/leg. Both of the described concept
extensions were created using a two-stage process. In the first stage, predictions from
classification models were assigned as initial annotations: for modality prediction on all images,
a model trained on the ROCO data set [8] was used, and for anatomy prediction on x-ray images,
a model trained on an existing IRMA-annotated image data set [26]. In the second stage, these
annotations underwent manual quality control, involving the correction of faulty predictions
and the filtering of images that did not represent one of the intended modality or anatomy
concepts. Three annotators were involved. Each individual modality concept was processed by a
single annotator due to the low complexity of this part of the task. Anatomy concepts of x-ray
images were likewise processed by a single annotator per concept. However, due to the complexity
and ambiguity of this part, the annotator most experienced in anatomy classification re-evaluated
the assessments of the other two. This re-evaluation resulted in very few adjustments, indicating
high agreement between annotators.
   The following subsets were distributed to the participants where each image has one caption
and multiple concepts (UMLS-CUI):

    • Training set including 83,275 radiology images and associated captions and concepts.
    • Validation set including 7,645 radiology images and associated captions and concepts.
    • Test set including 7,645 radiology images.


4. Evaluation Methodology
In this year’s edition, the performance evaluation is carried out in the same way as last year,
with both subtasks being evaluated separately.
   For the concept detection subtask, the balanced precision and recall trade-off was measured
in terms of the F1-score. In addition, a secondary F1-score was introduced in this edition, which
is computed on a manually curated subset of concepts that contains only x-ray anatomy and
image modality concepts.
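A minimal sketch of such an evaluation is shown below, assuming (as in previous editions) that
the F1-score is computed per image and then averaged over all test images; the official evaluation
script may differ in details such as the handling of empty predictions.

```python
from sklearn.metrics import f1_score


def mean_image_f1(gt: dict[str, set[str]], pred: dict[str, set[str]]) -> float:
    """Average the per-image F1-score over all test images.
    `gt` and `pred` map an image ID to its set of UMLS CUIs."""
    scores = []
    for image_id, gt_cuis in gt.items():
        pred_cuis = pred.get(image_id, set())
        # Binarize over the union of ground-truth and predicted CUIs for this image.
        vocab = sorted(gt_cuis | pred_cuis)
        if not vocab:  # no concepts at all: count as perfect agreement
            scores.append(1.0)
            continue
        y_true = [1 if c in gt_cuis else 0 for c in vocab]
        y_pred = [1 if c in pred_cuis else 0 for c in vocab]
        scores.append(f1_score(y_true, y_pred, zero_division=0))
    return sum(scores) / len(scores)
```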
   2
       https://spacy.io/api/lemmatizer/ [last accessed: 28.06.2022]
   3
       https://www.nlm.nih.gov/pubs/techbull/nd20/nd20_umls_release.html [last accessed: 28.06.2022]
   Caption prediction performance is evaluated based on the BiLingual Evaluation Understudy
(BLEU) score [27], which is computed as the geometric mean of n-gram scores from 1 to 4. As a
preprocessing step for the evaluation, all captions were lowercased and stripped of all punctuation
and English stop words. Additionally, to increase coverage, lemmatization was applied using spaCy
and the pre-trained model en_core_web_lg. BLEU values are then computed for each test image,
treating the entire caption as one sentence, even though it may contain multiple sentences. The
average of the BLEU values over all images is reported as the primary ranking score. Since
evaluating generated text and image captions is very challenging and should not be based on a
single metric alone, additional evaluation metrics were explored in this year’s edition in order
to find metrics that correlate well with human judgements for this task. First, the Recall-Oriented Understudy
for Gisting Evaluation (ROUGE) [28] score was adopted as a secondary metric that counts the
number of overlapping units such as n-grams, word sequences, and word pairs between the
generated text and the reference. Specifically, the ROUGE-1 (F-measure) score was calculated,
which measures the number of matching unigrams between the model-generated text and
a reference. All individual scores for each caption are then summed and averaged over the
number of captions, resulting in the final score. In addition to ROUGE, the Metric for Evaluation
of Translation with Explicit ORdering (METEOR) [29] was explored, which is a metric that
evaluates the generated text by aligning it to the reference and calculating a sentence-level similarity
score. Furthermore, the Consensus-based Image Description Evaluation (CIDEr) [30] metric was
also adopted. CIDEr is an automatic evaluation metric that calculates the weights of n-grams
in the generated text and the reference text based on term frequency and inverse document
frequency (TF-IDF), and then compares them based on cosine similarity. Another used metric is
the Semantic Propositional Image Caption Evaluation (SPICE) [31], which maps the reference
and generated captions to semantic scene graphs through dependency parse trees and measures
the similarity between the scene graphs for the evaluation. Finally, BERTScore [32] was used,
which is a metric that computes a similarity score for each token in the generated text with
each token in the reference text. It leverages the pre-trained contextual embeddings from
BERT-based models and matches words by cosine similarity. In this work, the pre-trained model
microsoft/deberta-xlarge-mnli4 was utilized, since it is the model that correlates best with human
evaluation according to the authors5 .
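   To make the primary metric concrete, the following sketch computes per-image BLEU scores
after a pre-processing step similar to the one described above; it is a simplified, hypothetical
re-implementation and not the official evaluation script (e.g., the stop-word list and smoothing
settings may differ).

```python
import string

import spacy
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

nlp = spacy.load("en_core_web_lg")
smoother = SmoothingFunction().method1  # avoids zero scores for very short captions


def tokens(caption: str) -> list[str]:
    """Lower-case, remove punctuation and English stop words, and lemmatize."""
    doc = nlp(caption.lower().translate(str.maketrans("", "", string.punctuation)))
    return [t.lemma_ for t in doc if not t.is_stop and not t.is_space]


def mean_bleu(references: dict[str, str], candidates: dict[str, str]) -> float:
    """Average BLEU (n-grams 1 to 4, equal weights) over all test images,
    treating each caption as a single sentence."""
    scores = [
        sentence_bleu([tokens(references[i])], tokens(candidates.get(i, "")),
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smoother)
        for i in references
    ]
    return sum(scores) / len(scores)
```

The secondary metrics have publicly available reference implementations; for instance, the
bert_score package accepts the microsoft/deberta-xlarge-mnli model mentioned above via its
model_type argument.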


5. Results
For the concept detection and caption prediction subtasks, Tables 2 and 3 show the best results
from each of the participating teams. The results will be discussed in this section.

5.1. Results for the Concept Detection subtask
In 2022, 11 teams participated in the concept detection subtask, submitting 85 runs. Table 2
presents the results achieved in the submissions.


    4
        https://huggingface.co/microsoft/deberta-xlarge-mnli [last accessed: 28.06.2022]
    5
        https://github.com/Tiiiger/bert_score [last accessed: 28.06.2022]
Table 2
Performance of the participating teams in the ImageCLEFmedical 2022 Concept Detection subtask. Only
the best run based on the achieved F1-score is listed for each team, together with the corresponding
secondary F1-score based on manual annotations as well as the team rankings based on the primary
and secondary F1-score
           Group Name            Best Run        F1    Secondary F1    Rank (secondary)
           AUEB-NLP-Group           182358   0.4511           0.7907               1 (6)
           fdallaserra              182324    0.4505          0.8222               2 (4)
           CSIRO                    182343    0.4471          0.7936               3 (5)
           eecs-kth                 181750    0.4360          0.8546               4 (2)
           vcmi                     182097    0.4329         0.8634                5 (1)
           PoliMi-ImageClef         182296    0.4320          0.8512               6 (3)
           SSNSheerinKavitha        181995    0.4184          0.6544               7 (8)
           IUST_NLPLAB              182307    0.3981          0.6732               8 (7)
           Morgan_CS                182150    0.3520          0.6281               9 (9)
           kdelab                   182346    0.3104          0.4120             10 (11)
           SDVA-UCSD                181691    0.3079          0.5524             11 (10)


AUEB-NLP-Group Like in previous years, the AUEB-NLP-Group submitted the best performing
     result with a primary F1-score of 0.4511 [10] and a secondary F1-score of 0.7907.
    The winning approach was an ensemble of two EfficientNetV2-B0 backbones followed by
    a single classification layer where the union of predicted concepts was used to form the
    ensemble. This solution outperformed their retrieval-based system which won last year’s
    concept detection subtask [33].

fdallaserra The second best system, with an only slightly lower primary F1-score of 0.4505
      and a better secondary F1-score of 0.8222 [13], was proposed by CMRE-UoG (fdallaserra).
      Their best approach is an image retrieval system using an ensemble of five DenseNet-201
      models, each of which retrieves 100 similar images. CUIs appearing in at least 30% of the
      images retrieved by a model are kept, and finally the union of the models' predicted CUIs
      is assigned to the query image.
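      The retrieval-based assignment described above can be sketched as follows; the feature
      extraction and nearest-neighbour search are abstracted away, and the interface is an
      illustrative assumption.

```python
def ensemble_retrieval_cuis(retrievers, query_image, k=100, min_fraction=0.3):
    """For each retrieval model, fetch the k most similar training images and keep
    the CUIs appearing in at least `min_fraction` of them; return the union over models.

    `retrievers` is a list of callables mapping an image to a list of
    (image_id, cui_set) pairs sorted by similarity (hypothetical interface)."""
    assigned = set()
    for retrieve in retrievers:
        neighbours = retrieve(query_image)[:k]
        counts = {}
        for _, cuis in neighbours:
            for cui in cuis:
                counts[cui] = counts.get(cui, 0) + 1
        assigned |= {cui for cui, n in counts.items()
                     if n >= min_fraction * len(neighbours)}
    return assigned
```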

CSIRO The CSIRO group reached a primary F1-score of 0.4471 [11] and a secondary F1-
     score of 0.7936. They experimented with a range of different backbones for their multi-label
     classification system, and their best approach is an ensemble of 43 DenseNet-161 models
     with top-1% threshold optimisation.

eecs-kth The eecs-kth team reached a primary F1 score of 0.4360 [12] and a secondary F1
     score of 0.8546. Their best approach utilized a multi-label classification system based on
     DenseNet161 with a single classification layer.

vcmi The VCMI (vcmi) team reached a primary F1-score of 0.4329 [22] and the best overall
     secondary F1-score of 0.8634. They combined a multi-label classification system based
     on DenseNet-121 with an information retrieval approach for their best approach, where
     the retrieval system is used if the classification did not assign any labels.
PoliMi The PoliMi team reached a primary F1-score of 0.4320 [19] and a secondary F1-score
     of 0.8512. They used a ResNext50-based multi-label classification system.

SSNSheerinKavitha The SSN MLRG (SSNSheerinKavitha) team reached a primary F1-score of
    0.4184 [21] and a secondary F1-score of 0.6544. They employed DenseNet for multi-label
    classification and an information retrieval system.

IUST_NLPLAB The IUST_NLPLAB team reached a primary F1-score of 0.3981 [14] and a
     secondary F1-score of 0.6732. They used a multi-label classification model based on
     ResNet for their best results.

Morgan_CS The CS_Morgan (Morgan_CS) team from Morgan State University (USA) reached
    a primary F1-score of 0.3520 [18] and a secondary F1-score of 0.6280. They used a fusion
    of Vision Transformers for their best approach, which outperformed their multi-label
    classification systems.

Kdelab The Kdelab team reached a primary F1-score of 0.3104 [15] and a secondary F1-score
     of 0.4120. They exclusively experimented with image retrieval systems and their best
     approach consisted of an ensemble of different backbone networks (DenseNet, EfficientNet,
     ResNet) using simple majority voting.

SDVA-UCSD The SDVA-UCSD team reached a primary F1-score of 0.3079 [20] and a sec-
    ondary F1-score of 0.5524. They used a multi-label classification system with ResNet and
    DenseNet backbones.

   To summarize, in the concept detection subtask the groups primarily used multi-label
classification systems and image retrieval systems, much like in the 2021 challenge. Multi-label
classification systems outperformed retrieval-based systems for most of the teams that
experimented with both. While the winning run was a multi-label classification approach, the
second-placing team, with an F1-score only 0.0006 lower than the winner's, used a retrieval-based
system: they took last year's winning approach and tuned it to include more CUIs by reducing the
threshold for the percentage of retrieved images in which a CUI had to appear from 50% to
30% [13].
   This year’s models for concept detection do not show an increased F1-score compared to
last year; however, this is not surprising given the much larger data set and number of concepts
used in this year's challenge. Compared to the 2020 results, where a data set of similar size was
used, the F1-scores show a clear improvement. There were no radically new approaches in this
year's concept detection subtask, but the teams experimented with, optimised, and re-combined
many different existing techniques and created competitive solutions using both multi-label
classification systems and image retrieval systems.

5.2. Results for the Caption Prediction subtask
In this sixth edition, the caption prediction subtask attracted 10 teams which submitted 72 runs.
Table 3 presents the results of the submissions.
Table 3
Performance of the participating teams in the ImageCLEF 2022 Caption Prediction subtask. Only the
best run based on the achieved BLEU score is listed for each team, together with the corresponding
secondary ROUGE score as well as the team rankings based on the primary BLEU and secondary
ROUGE score. The best results are highlighted.
         Group Name            Best Run       BLEU   Secondary ROUGE       Rank (secondary)
         IUST_NLPLAB             182275    0.4828                 0.1422                1 (8)
         AUEB-NLP-Group          181853     0.3222                0.1665                2 (5)
         CSIRO                   182268     0.3114                0.1974                3 (2)
         vcmi                    182325     0.3058                0.1738                4 (4)
         eecs-kth                182337     0.2917                0.1157                5 (9)
         fdallaserra             182342     0.2913               0.2012                 6 (1)
         kdelab                  182351     0.2783                0.1584                7 (6)
         Morgan_CS               182238     0.2549                0.1441                8 (7)
         MAI_ImageSem            182105     0.2211                0.1847                9 (3)
         SSNSheerinKavitha       182248     0.1595                0.0425              10 (10)


Table 4
Performance of the participating teams in the ImageCLEF 2022 Caption Prediction subtask for additional
metrics METEOR, CIDEr, SPICE, and BERTScore. These correspond to the best BLEU score-based runs of
each team, listed in Table 3. The best results are highlighted.
            Group Name            Best Run     METEOR      CIDEr     SPICE     BERTScore
            IUST_NLPLAB              182275      0.0928     0.0304    0.0072       0.5612
            AUEB-NLP-Group           181853       0.0737    0.1902    0.0313       0.5989
            CSIRO                    182268       0.0841    0.2693    0.0462      0.6234
            vcmi                     182325       0.0746    0.2047    0.0358       0.6044
            eecs-kth                 182337       0.0624    0.1317    0.0218       0.5728
            fdallaserra              182342       0.0819    0.2564    0.0464       0.6101
            kdelab                   182351       0.0735   0.4114    0.0512        0.6003
            Morgan_CS                182238       0.0559    0.1481    0.0232       0.5835
            MAI_ImageSem             182105       0.0675    0.2513    0.0393       0.6059
            SSNSheerinKavitha        182248       0.0226    0.0169    0.0072       0.5451


IUST_NLPLAB The IUST_NLPLAB team presented the best model for the caption prediction
     subtask. They reached a BLEU score of 0.4828, outperforming the competition by a large
     margin, and a ROUGE score of 0.1422 [14]. Additionally, they reached the overall best
     METEOR score of 0.0928. For their best run, they employed a multi-label classification
     system based on ResNet50 which treats every word as a label and assigns 26 words in the
     order of their probability to each image.
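      The decoding step of such a word-level multi-label classifier can be sketched as follows,
      assuming a model that outputs one probability per vocabulary word; the interface and the
      toy vocabulary are illustrative assumptions.

```python
import torch


def decode_caption(word_probs: torch.Tensor, vocabulary: list[str], k: int = 26) -> str:
    """Turn per-word probabilities (shape: [vocab_size]) into a caption by taking
    the k most probable words in descending order of probability."""
    top = torch.topk(word_probs, k=min(k, len(vocabulary)))
    return " ".join(vocabulary[i] for i in top.indices.tolist())


# Toy usage with a random "model output".
vocab = ["radiograph", "chest", "fracture", "ct", "axial", "lesion"]
probs = torch.softmax(torch.randn(len(vocab)), dim=0)
print(decode_caption(probs, vocab, k=3))
```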

AUEB-NLP-Group The AUEB-NLP-Group submitted the second best performing result with a
    BLEU score of 0.3222 [10] and a ROUGE score of 0.1664. Their best approach utilizes the
    Show & Tell model [34] consisting of a CNN-RNN encoder-decoder with an EfficientNetB0
     backbone. While clearly behind the winners in terms of the BLEU score, they outscored
     them on most of the other metrics.

CSIRO The CSIRO group reached a BLEU score of 0.3114 [11] and a ROUGE score of 0.1974.
    Additionally, they reached the overall best BERTScore of 0.6234. They experimented
    with different encoder-to-decoder models and achieved their best scores with CvT-21 as
     the encoder and DistilGPT2 as the decoder, warm-started with a MIMIC-CXR checkpoint,
     combined with a penalty for repeated n-grams of size 3.

vcmi The VCMI (vcmi) team reached a BLEU score of 0.3058 [22] and a ROUGE score of 0.1738.
     They used a vision encoder-to-decoder system for the best results.

eecs-kth The eecs-kth team reached a BLEU score of 0.2917 [12] and a ROUGE score of 0.1157.
     They employed an information retrieval system based on AlexNet which summarizes the
     captions of a number of similar images using Pegasus.
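      A sketch of how retrieved captions could be condensed into a single prediction with a
      Pegasus summarizer from the transformers library is given below; the checkpoint
      google/pegasus-xsum and the retrieval interface are illustrative assumptions and may
      differ from the team's actual setup.

```python
from transformers import pipeline

# Summarization pipeline with a publicly available Pegasus checkpoint (assumed here).
summarizer = pipeline("summarization", model="google/pegasus-xsum")


def caption_from_neighbours(neighbour_captions: list[str], max_length: int = 60) -> str:
    """Concatenate the captions of retrieved similar images and summarize them
    into a single predicted caption."""
    joined = " ".join(neighbour_captions)
    summary = summarizer(joined, max_length=max_length, min_length=5, truncation=True)
    return summary[0]["summary_text"]
```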

fdallaserra The CMRE-UoG (fdallaserra) group reached a BLEU score of 0.2913 [13] and the
      overall best ROUGE score of 0.2012. They used a CNN Transformer approach with
      multi-modal (image + CUIs) input for their best results.

Kdelab The Kdelab team reached a BLEU score of 0.2782 [16] and a ROUGE score of 0.1584.
     Additionally, they reached the overall best CIDEr score of 0.4114 and overall best SPICE
     score of 0.0512. They used an image retrieval approach with an ensemble of different
     backbone networks for their best submission results.

Morgan_CS The CS_Morgan (Morgan_CS) team reached a BLEU score of 0.2549 [18] and a
    ROUGE score of 0.1441. They used a very similar approach as for the concept detection,
    namely a fusion of Vision Transformers.

MAI_ImageSem The MAI_ImageSem team reached a BLEU score of 0.2211 [17] and a ROUGE
     score of 0.1847. For their best results, they used pre-trained BLIP (Bootstrapping Language-
     Image Pre-training), a vision-language pre-training framework consisting of a multi-modal
     encoder-decoder and a captioning and filtering module.

SSNSheerinKavitha The SSN MLRG (SSNSheerinKavitha) team reached a BLEU score of
    0.1595 [21] and a ROUGE score of 0.0425. For their best run, they employed a Sparse
    Auto Encoder (SAE) with a Multi-Layer Perceptron (MLP) and a Gated Recurrent Unit
    (GRU).

   To summarize, in the caption prediction subtask most teams experimented with Transformer-
based architectures and image retrieval systems. Only one team used a multi-label classification
approach, and it achieved by far the best BLEU score. However, it did not score as well on most
of the other employed metrics, with the second-placing team outscoring the winners in all but
the BLEU and METEOR metrics, which highlights the difficulty of evaluating caption similarity.
One metric worth highlighting is SPICE, which is specifically designed for the evaluation of
image captions. The winners scored 0.0072 on this metric, while the rest of the field (except the
last-placing team) scored between 0.0218 and 0.0512.
   Transfer learning from a variety of different data sets was frequently used for pre-training.
As in previous years, simpler architectures often yielded better results than more complex ones.
   As in the concept detection subtask, the BLEU scores in the caption prediction subtask are
overall lower than last year, which can be explained by the larger and more complex data set
and more varied captions. Since no caption prediction subtask was run in 2020, no comparable
scores for a similar data set exist.


6. Conclusion
This year’s caption task of ImageCLEFmedical once again ran with both subtasks, concept
detection and caption prediction. It returned to a larger, ROCO-based data set for both challenges
after a smaller, manually annotated data set was used last year. It attracted 12 teams who
submitted 157 runs overall, representing stronger participation than last year. For the concept
detection subtask, a secondary F1-score was introduced to distinguish manually curated concepts
from automatically generated ones. For the caption prediction, a number of additional scores
were added to better illustrate the difficulty of evaluating the quality of predicted captions. All
but one team participated in the concept detection subtask, and all but two participated in the
caption prediction subtask. Only one team used the generated concepts as input for their caption
prediction model; most teams approached the subtasks with separate systems. For the concept
detection challenge, most teams employed multi-label
classification systems or image retrieval systems, while the caption prediction challenge was
predominantly approached using Transformer-based architectures and image retrieval systems,
with only the winning team using a multi-label classification system.
   The scores for both subtasks have not improved compared to the 2021 edition. However, the
larger and more complex ROCO-based data set, with more concepts and more varied captions,
makes the scores difficult to compare. Looking at the 2020 edition, which used a similar data set,
the concept detection scores have clearly increased (there was no caption prediction subtask).
   For next year’s ImageCLEFmedical Caption challenge, possible improvements include adding
more manually validated concepts (e.g., increased anatomical coverage and directionality
information), reducing recurring captions, applying more fine-grained CUI filters, improving the
caption pre-processing, and using a different primary score for the caption prediction challenge,
since the BLEU score has some disadvantages that were highlighted by this year's caption
prediction results.
   Another issue that should be addressed is how to deal with models that were pre-trained on
PMC data: strictly speaking, they have already seen the original captions and can have an
advantage when some of these images appear in the test data.


Acknowledgments
This work was partially supported by the University of Essex GCRF QR Engagement Fund
provided by Research England (grant number G026). The work of Louise Bloch and Raphael
Brüngel was partially funded by a PhD grant from the University of Applied Sciences and
Arts Dortmund (FH Dortmund), Germany. The work of Ahmad Idrissi-Yaghir and Henning
Schäfer was funded by a PhD grant from the DFG Research Training Group 2535 Knowledge-
and data-based personalisation of medicine at the point of care (WisPerMed).


References
 [1] A. García Seco de Herrera, R. Schaer, S. Bromuri, H. Müller, Overview of the ImageCLEF
     2016 medical task, in: Working Notes of CLEF 2016 (Cross Language Evaluation Forum),
     2016, pp. 219–232.
 [2] C. Eickhoff, I. Schwall, A. G. S. de Herrera, H. Müller, Overview of ImageCLEFcaption 2017
     - Image Caption Prediction and Concept Detection for Biomedical Images, in: Working
     Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland,
     September 11-14, 2017., 2017. URL: http://ceur-ws.org/Vol-1866/invited_paper_7.pdf.
 [3] A. G. S. de Herrera, C. Eickhoff, V. Andrearczyk, H. Müller, Overview of the ImageCLEF
     2018 Caption Prediction Tasks, in: Working Notes of CLEF 2018 - Conference and Labs of
     the Evaluation Forum, Avignon, France, September 10-14, 2018., 2018. URL: http://ceur-ws.
     org/Vol-2125/invited_paper_4.pdf.
 [4] O. Pelka, C. M. Friedrich, A. G. S. de Herrera, H. Müller, Overview of the ImageCLEFmed
     2019 Concept Detection Task, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.),
     Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano,
     Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-
     WS.org, 2019. URL: http://ceur-ws.org/Vol-2380/paper_245.pdf.
 [5] O. Pelka, C. M. Friedrich, A. García Seco de Herrera, H. Müller, Overview of the Image-
     CLEFmed 2020 concept prediction task: Medical image understanding, in: CLEF2020
     Working Notes, volume 1166 of CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki,
     Greece, 2020.
 [6] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical
     terminology, Nucleic Acids Research 32 (2004) 267–270. doi:10.1093/nar/gkh061.
 [7] O. Pelka, A. Ben Abacha, A. García Seco de Herrera, J. Jacutprakart, C. M. Friedrich,
     H. Müller, Overview of the ImageCLEFmed 2021 concept & caption prediction task,
     in: CLEF2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest,
     Romania, 2021, pp. 1101–1112.
 [8] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. M. Friedrich, Radiology Objects in COntext
     (ROCO): A Multimodal Image Dataset, in: Intravascular Imaging and Computer Assisted
     Stenting - and - Large-Scale Annotation of Biomedical Data and Expert Label Synthesis -
     7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop,
     LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16,
     2018, Proceedings, 2018, pp. 180–189. doi:10.1007/978-3-030-01364-6\_20.
 [9] B. Ionescu, H. Müller, R. Péteri, J. Rückert, A. Ben Abacha, A. G. S. de Herrera, C. M.
     Friedrich, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, S. Kozlovski, Y. D. Cid, V. Ko-
     valev, L.-D. Ştefan, M. G. Constantin, M. Dogariu, A. Popescu, J. Deshayes-Chossart,
     H. Schindler, J. Chamberlain, A. Campello, A. Clark, Overview of the ImageCLEF 2022:
     Multimedia retrieval in medical, social media and nature applications, in: Experimental IR
     Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 13th Interna-
     tional Conference of the CLEF Association (CLEF 2022), LNCS Lecture Notes in Computer
     Science, Springer, Bologna, Italy, 2022.
[10] F. Charalampakos, G. Zachariadis, J. Pavlopoulos, V. Karatzas, C. Trakas, I. Androutsopou-
     los, AUEB NLP group at ImageCLEFmed caption 2022, in: CLEF2022 Working Notes,
     CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[11] L. Lebrat, A. Nicolson, R. S. Cruz, G. Belous, B. Koopman, J. Dowling, CSIRO at Image-
     CLEFmed caption 2022, in: CLEF2022 Working Notes, CEUR Workshop Proceedings,
     CEUR-WS.org, Bologna, Italy, 2022.
[12] G. M. Moschovis, E. Fransén, Neuraldynamicslab at ImageCLEF medical 2022, in: CLEF2022
     Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[13] F. D. Serra, F. Deligianni, J. Dalton, A. Q. O’Neil, CMRE-UoG team at ImageCLEFmed
     caption 2022 task: Concept detection and image captioning, in: CLEF2022 Working Notes,
     CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[14] M. Hajihosseini, Y. Lotfollahi, M. Nobakhtian, M. M. Javid, F. Omidi, S. Eetemadi,
     IUST_NLPLAB at ImageCLEFmed caption tasks 2022, in: CLEF2022 Working Notes,
     CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[15] R. Tsuneda, T. Asakawa, K. Shimizu, T. Komoda, M. Aono, Kdelab at ImageCLEF 2022:
     Medical concept detection with image retrieval and code ensemble, in: CLEF2022 Working
     Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[16] R. Tsuneda, T. Asakawa, K. Shimizu, T. Komoda, M. Aono, Kdelab at ImageCLEF2022 med-
     ical caption prediction task, in: CLEF2022 Working Notes, CEUR Workshop Proceedings,
     CEUR-WS.org, Bologna, Italy, 2022.
[17] X. Wang, J. Li, ImageSem Group at ImageCLEFmed Caption 2022 Task: Generating Medical
     Image Descriptions based on Visual-Language Pre-training, in: CLEF2022 Working Notes,
     CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[18] M. M. Rahman, O. Layode, CS_Morgan at ImageCLEFmed caption 2022: Deep learning
     based multilabel classification and transformers for concept detection & caption prediction,
     in: CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy,
     2022.
[19] S. A. M. Ghayyomnia, K. de Gast, M. J. Carmana, Polimi-imageclef group at ImageCLEFmed
     caption 2022, in: CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org,
     Bologna, Italy, 2022.
[20] A. Gentili, ImageCLEFmed concept detection, finding duplicates, in: CLEF2022 Working
     Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[21] N. M. S. Sitara, S. Kavitha, SSN MLRG at ImageCLEF 2022: Medical concept detection and
     caption prediction using transfer learning and transformer based learning approaches, in:
     CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy,
     2022.
[22] I. Rio-Torto, C. Patrício, H. Montenegro, T. Gonçalves, Detecting Concepts and Generating
     Captions from Medical Images: Contributions of the VCMI Team to ImageCLEFmed
     Caption 2022, in: CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org,
     Bologna, Italy, 2022.
[23] A. Ali, P. Andrzejowski, N. K. Kanakaris, P. V. Giannoudis, Pelvic Girdle Pain, Hypermo-
     bility Spectrum Disorder and Hypermobility-Type Ehlers-Danlos Syndrome: A Narrative
     Literature Review, Journal of Clinical Medicine 9 (2020) 3992. doi:10.3390/jcm9123992.
[24] R. J. Roberts, PubMed Central: The GenBank of the published literature, Proceedings
     of the National Academy of Sciences of the United States of America 98 (2001) 381–382.
     doi:10.1073/pnas.98.2.381.
[25] T. M. Lehmann, H. Schubert, D. Keysers, M. Kohnen, B. B. Wein, The IRMA code for unique
     classification of medical images, in: H. K. Huang, O. M. Ratib (Eds.), Medical Imaging 2003:
     PACS and Integrated Medical Information Systems: Design and Evaluation, SPIE, 2003.
     doi:10.1117/12.480677.
[26] T. Deserno, B. Ott, 15.363 IRMA Bilder in 193 Kategorien für ImageCLEFmed
     2009, 2009. URL: https://publications.rwth-aachen.de/record/667225. doi:10.18154/
     RWTH-2016-06143.
[27] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of
     machine translation, in: Proceedings of the 40th annual meeting of the Association for
     Computational Linguistics, 2002, pp. 311–318.
[28] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of Summaries, in: Text Summa-
     rization Branches Out, Association for Computational Linguistics, 2004, pp. 74–81. URL:
     https://aclanthology.org/W04-1013.
[29] M. Denkowski, A. Lavie, Meteor Universal: Language Specific Translation Evaluation
     for Any Target Language, in: Proceedings of the Ninth Workshop on Statistical Machine
     Translation, Association for Computational Linguistics, 2014, pp. 376–380. URL: http:
     //aclweb.org/anthology/W14-3348. doi:10.3115/v1/W14-3348.
[30] R. Vedantam, C. L. Zitnick, D. Parikh, CIDEr: Consensus-based image description
     evaluation, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition
     (CVPR), IEEE, 2015, pp. 4566–4575. URL: http://ieeexplore.ieee.org/document/7299087/.
     doi:10.1109/CVPR.2015.7299087.
[31] P. Anderson, B. Fernando, M. Johnson, S. Gould, SPICE: Semantic Propositional Image
     Caption Evaluation, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision –
     ECCV 2016, Lecture Notes in Computer Science, Springer International Publishing, 2016,
     pp. 382–398. doi:10.1007/978-3-319-46454-1_24.
[32] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text
     generation with BERT, in: 8th International Conference on Learning Representations,
     ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020. URL: https://openreview.net/
     forum?id=SkeHuCVFDr.
[33] F. Charalampakos, V. Karatzas, V. Kougia, J. Pavlopoulos, I. Androutsopoulos, AUEB
     NLP Group at ImageCLEFmed Caption Tasks 2021, in: CLEF2021 Working Notes, CEUR
     Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021, pp. 1184–1200.
[34] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show,
     attend and tell: Neural image caption generation with visual attention, in: F. R. Bach, D. M.
     Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML
     2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings,
     JMLR.org, 2015, pp. 2048–2057. URL: http://proceedings.mlr.press/v37/xuc15.html.