<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AUEB NLP Group at ImageCLEFmedical Caption 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marina Samprovalaki</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Chatzipapadopoulou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Moschovis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Foivos Charalampakos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panagiotis Kaliosis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Pavlopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ion Androutsopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Archimedes Unit, Athena Research Center</institution>
          ,
          <addr-line>1, Artemidos Street, GR-151 25 Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics, Athens University of Economics and Business</institution>
          ,
          <addr-line>76, Patission Street, GR-104 34 Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>This article describes the approaches that the AUEB NLP Group experimented with during its participation in the 8th edition of the ImageCLEFmedical Caption evaluation campaign, including both Concept Detection and Caption Prediction tasks. The objective of Concept Detection is to automatically categorize biomedical images into a set of one or more concepts. In contrast, the Caption Prediction task focuses on generating a precise and meaningful diagnostic caption that describes the medical conditions depicted in the image. Building on our prior research for the Concept Detection task, we utilized a diverse set of Convolutional Neural Network (CNN) encoders, followed by a Feed-Forward Neural Network. Additionally, we implemented two versions of the retrieval-based k-NN algorithm: a version that assigned concepts based on statistical frequency and a weighted version that took into account the order of the retrieved neighbors. Both models used the CNN image encoders to improve their retrieval capabilities. Regarding the Caption Prediction task, we fine-tuned the InstructBLIP model to generate initial captions and then enhanced it by employing rephrasing techniques with further pre-trained models. We also used synthesizing techniques that incorporated information from similar neighboring images in the training set to refine these captions. Additionally, we employed “Distance from Median Maximum Concept Similarity” (DMMCS), a novel guided-decoding approach that drives the model's behavior throughout the decoding process, aiming to integrate information from the predicted concepts of Concept Detection. We explored the application of DMMCS to all of our developed systems. Our group ranked 2nd in Concept Detection and 4th in Caption Prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Biomedical Images</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Multi-Label Classification</kwd>
        <kwd>Caption Generation</kwd>
        <kwd>Generative Models</kwd>
        <kwd>Transformers</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        ImageCLEF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is an ongoing evaluation initiative, first run in 2003 as part of the Cross Language
Evaluation Forum (CLEF)1, that promotes the evaluation of technologies for annotation, indexing,
classification, and retrieval of multi-modal data. ImageCLEFmedical is one of the four main tasks in
this year’s ImageCLEF campaign. We participated in the ImageCLEFmedical Caption task, which was
organized for the eighth time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As in previous years, the task comprised two sub-tasks: Concept
Detection and Caption Prediction.
      </p>
      <p>
        The objective of Concept Detection is to accurately associate a biomedical image with one or more
relevant medical concepts (tags), while in Caption Prediction, the goal is to automatically generate a
preliminary diagnostic report that accurately describes the medical findings, as well as the anatomy
of the body structures and organs shown in the image. Diagnostic Captioning remains a challenging
research problem aimed at assisting the diagnostic process for patients by providing a preliminary
report, rather than replacing medical professionals involved in the procedure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It can thus be seen as
an assistive tool, capable of producing an initial draft diagnosis regarding the patient’s condition. Such a
document would ideally allow doctors to focus on critical areas of the image [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and help them produce
more precise medical diagnoses at an increased speed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Experienced clinicians could enhance their
throughput by analyzing the large volume of daily medical examinations more quickly and efficiently.
Less experienced clinicians could consider the automatically generated captions to reduce the likelihood
of clinical errors [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Concept Detection can further improve Diagnostic Captioning by identifying
key concepts that should be included in the draft report. We demonstrate the connection between the
two sub-tasks by using “Distance from Median Maximum Concept Similarity” (DMMCS)2 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
employs information derived from our Concept Detection systems in order to improve the performance
of our Caption Prediction systems.
      </p>
      <sec id="sec-1-1">
        <title>1.1. AUEB NLP Group contributions</title>
        <p>
          In this work, we present the experiments conducted and the systems submitted as part of the AUEB
NLP Group’s participation in this year’s Concept Detection and Caption Prediction tasks. We used a
number of new approaches influenced by the remarkable progress in the field of NLP and based on
instruction-tuned Large Language Models (LLMs) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>Our submissions to the Concept Detection sub-task are based on two distinct approaches. We used
a Convolutional Neural Network (CNN) encoder to extract visual features from the medical images.
In the first approach, these features were fed into a Feed-Forward Neural Network (FFNN) to classify
the images into various medical concepts. In the second approach, we implemented a separate method
using a k-nearest neighbors (k-NN) algorithm. In this approach, k neighbors are first retrieved, and the
most frequently occurring concepts among these neighbors are selected.</p>
        <p>
          Regarding the Caption Prediction sub-task, we tried five main approaches. First, we employed an
InstructBLIP model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] that was fine-tuned on the specified dataset [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to generate an initial set of
captions, which were then also used in the other four approaches. In the second approach, we enhanced
the initial captions by drawing insights from captions of similar images and training a FLAN-T5 model
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to refine them [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. The third approach was similar, but instead of FLAN-T5, we employed
ClinicalT5 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which is pre-trained on numerous medical datasets, in order to rephrase and correct
the initial captions produced by InstructBLIP. The fourth approach involved integrating the DMMCS
algorithm [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] in the language model’s decoding process in order to promote the inclusion of a given set
of keywords, which in this case were predicted by one of our Concept Detection systems. Lastly, we
also applied DMMCS decoding to ClinicalT5 in order to maximize its efficacy and improve the overall
caption quality. In all our models we used CNN encoders, since there are signs that vision transformers
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] still have inferior performance in visual tasks, such as classification and semantic segmentation
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], especially in medical image tagging [
          <xref ref-type="bibr" rid="ref17 ref5">5, 17</xref>
          ].
        </p>
        <p>
          Extending our history of successful entries [
          <xref ref-type="bibr" rid="ref18">18, 19, 20, 21, 22</xref>
          ] in the ImageCLEFmedical campaign,
our submissions ranked 2nd among 9 participating groups in the Concept Detection sub-task and 4th
among 11 participating groups in the Caption Prediction sub-task. In Section 2, we provide insight into
this year’s dataset, followed by a discussion of our approaches in Section 3. In Section 4, we present our
experimental results for each sub-task. Finally, in Section 5, we summarize our findings and suggest
directions for future research.
        </p>
        <p>All code used for our experiments is available on GitHub.3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>
        In this year’s edition of the ImageCLEFmedical Caption task, the dataset is an updated and extended
version of the Radiology Objects in Context (ROCO) dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which originates from biomedical
articles of the PubMed Open Access (PMC OA) subset.4
      </p>
      <p>
        This dataset, which is common for both sub-tasks, consists of 80,080 biomedical images along with
their respective medical concepts, in the form of UMLS [23] terms5, and diagnostic captions. The
dataset was originally split by the organizers into training and validation subsets, with 70,108 radiology
images in the former and 9,972 in the latter. After merging the provided data, we split them again,
this time into three subsets, in order to also obtain a development (private test) subset for evaluation
purposes. We used a 75%-10%-15% training-validation-development split, keeping relatively equal
concept distributions in all three subsets. Consequently, we obtained 64,928 images as our training
data, 7,179 images as our validation set, while the remaining 7,973 images constituted our held-out
development set. All of our submissions were also evaluated on the hidden official test set, which uses
Radiology Objects in COntext Version 2 (ROCOv2) [24], an updated and extended version of the ROCO dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This set includes 17,237 previously unseen images.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Concept Detection</title>
        <p>Concept Detection is a multi-label classification problem covering a broad range of 1,945 distinct
biomedical concepts, originating from the Unified Medical Language System (UMLS) [ 23]. In this
sub-task, the goal is to identify (assign) the distinct medical concepts (tags) depicted in each image
(e.g., particular medical conditions). Among the available concepts (tag set), four are specific imaging
modalities: X-Ray Computed Tomography, Ultrasonography, Magnetic Resonance Imaging (MRI),
PET/CT scans. All concepts are represented by Concept Unique Identifiers (CUIs) following the UMLS
standard. Some examples of images and their ground truth concepts can be found in Figure 1.</p>
        <p>[Figure 1: example images with their ground-truth concepts, e.g., CUI C0041618 (Ultrasonography), C0018827 (Heart Ventricle), C1510420 (Cavitation); image CC BY, Magdás et al. (2021).]</p>
        <p>The distribution of concepts is highly skewed. Some concepts are present in more than 25,000 images, whereas others are associated with only 1 image. Figure 2(a) depicts the long-tail distribution of the entire (training + validation + development) dataset, where the frequencies of the concepts (the number of images each concept is associated with) are plotted in descending order against their respective class indices. After conducting a comprehensive exploratory analysis of this year’s dataset, we found that certain concepts were more prevalent (Table 1); these mostly correspond to kinds of medical examinations, such as X-Ray Computed Tomography or plain X-ray. Most images are associated (in the ground truth) with at least one of these overarching concepts, alongside more specialized ones. The maximum and minimum numbers of concepts assigned to a single image are 27 and 1, occurring in 1 and 8,567 images respectively. The average number of assigned concepts per image is 3.1583. The aforementioned observations are outlined in the histogram in Figure 2(b).</p>
        <p>4PMC Open Access: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, Last accessed: 2024-06-20</p>
        <p>5UMLS: https://www.nlm.nih.gov/research/umls/index.html, Last accessed: 2024-06-20</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Caption Prediction</title>
        <p>In the Caption Prediction data, each image is accompanied by a gold diagnostic caption that describes the
medical conditions present in the image. There are 80,080 gold captions across the whole dataset, one
for each provided image. Similar to last year’s campaign, the vast majority of the captions, specifically
99.47% (79,658 out of 80,080 captions), are unique. The maximum number of words in a single caption
is 848 (occurred once), while the minimum is 1 (encountered 73 times). The average caption length
is 21.01 words. These statistics apply to the dataset as a whole, but we have carefully checked that
they remain consistent in all three subsets (training, validation, development) we formed. The five
most common captions, as well as the ten most popular words, excluding the stopwords, can be found
in Tables 2 and 3, respectively. In Figure 3, we provide a histogram alongside a box plot, utilizing a
logarithmic scale in our visualizations. This helps make smaller counts more visible and reduces the
dominance of larger values, giving a more balanced view of how the data is distributed.</p>
        <p>[Figure 3: (a) number of images vs. number of words in captions (log scale); (b) box plot of caption lengths (log scale).]</p>
        <p>According to the organizers, each caption is pre-processed before being evaluated in the following manner:
• The caption is converted to lower-case.
• Numbers are replaced by words, e.g., the number 10 becomes “ten”.
• Punctuation is removed.</p>
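        <p>The organizers' pre-processing steps above can be sketched as follows. This is our illustration only: the exact number-to-word conversion used in the official evaluation is not specified here, so a minimal lookup for the numbers 0-10 stands in for it.</p>

```python
# Sketch (ours) of the organizers' caption pre-processing: lower-casing,
# replacing numbers by words, and removing punctuation. The tiny 0-10
# lookup is a stand-in for the organizers' actual number-to-word rules.
import re
import string

SMALL = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
         "10": "ten"}

def preprocess(caption):
    text = caption.lower()
    # replace standalone numbers by words (only 0-10 handled in this sketch)
    text = re.sub(r"\b\d+\b", lambda m: SMALL.get(m.group(), m.group()), text)
    # remove punctuation
    return text.translate(str.maketrans("", "", string.punctuation))
```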
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this section, we present the methods we used in our submissions for both the Concept Detection and
the Caption Prediction sub-tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Concept Detection</title>
        <p>
          Our submissions for this year’s Concept Detection sub-task are built upon two frameworks. Initially, we extensively explored a CNN+FFNN framework, building upon our prior research [
          <xref ref-type="bibr" rid="ref18">18, 19, 20, 21</xref>
          ], experimenting with various image encoders. Additionally, we used a neural image retrieval approach by integrating a k-nearest neighbors (k-NN) algorithm, which selects k neighbors and aggregates tags based on their frequency among the neighbors. Furthermore, we submitted several ensembles of the aforementioned systems. The ensembles employed strategies such as union-based and intersection-based aggregation.
        </p>
        <p>3.1.1. CNN + FFNN</p>
        <p>
          This system employs a CNN encoder as its backbone, followed by an FFNN classification head. We extract image features from the last convolutional layer of the image encoder and we condense these feature maps into a feature vector (an image embedding) using global pooling. More specifically, we used the Generalized-Mean (GeM) pooling [25] mechanism.
        </p>
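        <p>As a quick illustration of how GeM condenses each channel of a feature map into a single number (our sketch, not the system's actual implementation): per channel, GeM computes (mean of the activations raised to p)^(1/p), which reduces to average pooling for p = 1 and approaches max pooling as p grows.</p>

```python
# Generalized-Mean (GeM) pooling sketch (ours): per channel,
# gem(x, p) = (mean(x_i^p))^(1/p). p = 1 gives average pooling;
# large p approaches max pooling over the channel's activations.

def gem_pool(channel_values, p=3.0):
    """Pool a list of non-negative activations into one value."""
    n = len(channel_values)
    return (sum(v ** p for v in channel_values) / n) ** (1.0 / p)

def gem_embed(feature_maps, p=3.0):
    """feature_maps: list of channels, each a list of activations.
    Returns one pooled value per channel (the image embedding)."""
    return [gem_pool(ch, p) for ch in feature_maps]
```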
        <p>
          The FFNN component classifies the image into one or more concepts. Its output layer has |C| neurons, where C represents the set of unique concepts in the dataset. Each neuron uses a sigmoid activation function to transform its value into a probability in [0, 1]. This results in one probability per label, and if this probability exceeds a specific threshold value t, the corresponding concept is assigned to the image. The threshold, which is the same for all concepts, was chosen through a grid search procedure that optimized the primary metric of the competition on our validation set. The model was trained by minimizing binary cross-entropy, treating each concept as a separate binary target and summing the individual losses. We used the Adam optimizer [26], along with a decreasing learning rate strategy and early stopping based on the validation set loss with a patience of 3 epochs. We used an initial learning rate of 10^−3 and a decreasing factor of 10.
        </p>
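        <p>The thresholding and grid-search step can be sketched as below. This is our simplified illustration: per-image F1 averaged over the validation set stands in for the competition's primary metric, and the probabilities are given rather than produced by the FFNN.</p>

```python
# Sketch (ours): assign a concept when its sigmoid probability exceeds a
# shared threshold t, and pick t by grid search on a validation set,
# maximizing mean per-image F1 (a stand-in for the competition metric).

def predict(probs, t):
    """probs: {concept: probability}; return the set of concepts above t."""
    return {c for c, p in probs.items() if p > t}

def f1(pred, gold):
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

def grid_search_threshold(val_probs, val_gold, grid):
    """val_probs: list of {concept: prob}; val_gold: list of gold sets."""
    def mean_f1(t):
        scores = [f1(predict(p, t), g) for p, g in zip(val_probs, val_gold)]
        return sum(scores) / len(scores)
    return max(grid, key=mean_f1)
```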
        <p>In order to form the ensembles, we trained several instances of the model, using different random initializations, and combined them using the union and the intersection of their predicted concept sets. More details about our submitted ensemble systems can be found in Section 4.1.</p>
        <p>3.1.2. CNN + k-NN</p>
        <p>For our k-nearest neighbors (k-NN) approach, we leveraged the image embeddings obtained from the encoder of the trained CNN+FFNN system (Section 3.1.1). We discarded the dense classification head and used the last GeM pooling layer to extract embeddings (feature vectors) for all the training images. These embeddings served as the basis for the retrieval process in the k-NN algorithm. Given a test image, the goal of the system is to retrieve similar images from the training set and select concepts from the retrieved neighbors. For each test image, we used the same encoder to obtain its embedding and we retrieved the k closest neighbors from the training set, based on cosine similarity computed on the image embeddings. We tuned the value of k in the range from 1 to 100 using our validation set, which led to k = 33.</p>
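        <p>The retrieval step can be sketched as follows (our illustration, with precomputed embeddings standing in for the CNN encoder's output): rank all training images by cosine similarity to the test embedding and keep the top k.</p>

```python
# Sketch (ours) of the k-NN retrieval step: given image embeddings, return
# the ids of the k training images closest to a test embedding under
# cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def k_nearest(test_emb, train_embs, k):
    """train_embs: {image_id: embedding}; returns the k closest ids."""
    ranked = sorted(train_embs,
                    key=lambda i: cosine(test_emb, train_embs[i]),
                    reverse=True)
    return ranked[:k]
```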
        <p>For each test image, having obtained its k neighbors from the training set, we formed the set of concepts associated with the neighbors. We then ranked the concepts of the set based on the number of retrieved neighbors associated with each concept, ordering them from highest to lowest frequency. The concept with the highest frequency was always included in the predictions of the k-NN method for the test image. We then used two thresholds, t1 and t2, which we tuned using grid search on our validation set, to select which other concepts of the neighborhood to include in the predictions of k-NN. We calculated the difference in frequency (ΔFr) between the first and second most frequent concepts, divided by the frequency of the first concept, and if the result exceeded t1, we included the second concept in the prediction:</p>
        <p>(Fr(concept1) − Fr(concept2)) / Fr(concept1) ≥ t1 . (1)</p>
        <p>Similarly, we determined whether or not to include the third most frequent concept in the prediction, based on a comparison involving the first and third most frequent concepts. We calculated the difference between the frequencies of the first and third concepts, divided it by the frequency of the first concept, and compared this ratio against t2:</p>
        <p>(Fr(concept1) − Fr(concept3)) / Fr(concept1) ≥ t2 . (2)</p>
        <p>The same approach was applied to the difference between the first and fourth most frequent concepts, checking again against t2, to decide if the fourth most frequent concept should be predicted:</p>
        <p>(Fr(concept1) − Fr(concept4)) / Fr(concept1) ≥ t2 . (3)</p>
        <p>We opted to predict at most four concepts due to the fact that the average number of concepts in the training split was 3.08. The rationale was to select concepts that have frequencies close to that of the highest-frequency concept, while excluding concepts that show a significant drop in frequency compared to the preceding ones. We experimented with t1, t2 values ranging from 0.3 to 0.9. Validation results indicated that the best parameters were t1 = 0.58 and t2 = 0.65.</p>
        <p>3.1.3. CNN + weighted k-NN</p>
        <p>We also developed a weighted version of the k-NN algorithm, using the voting scheme that was described in [27]. More specifically, given a test image x, we calculate for each concept c ∈ C a score s(c; w1, . . . , wk) from the k neighbors retrieved for x:</p>
        <p>s(c; w1, . . . , wk) = (∑_{i=1}^{k} wi · m(c, x, i)) / (∑_{i=1}^{k} wi) , (4)</p>
        <p>where m(c, x, i) = 1 if concept c is present in the ground truth of the i-th neighbor of x, otherwise m(c, x, i) = 0, and wi is the weight assigned to the i-th nearest neighbor position; we explain below how the weights wi are learned. Concept c is predicted for the test image x if and only if s(c; w1, . . . , wk) ≥ τ, yielding the predicted label set L(x; w1, . . . , wk) = {c | s(c; w1, . . . , wk) ≥ τ}. The classification threshold τ ∈ [0, 1] and the number of neighbors k ∈ [1, 100] were tuned on our validation set, resulting in τ = 0.35 and k = 50. The weights w1, . . . , wk are the same for all the concepts c and test images x. They are learned using a genetic algorithm (GA) [28] by maximizing the following objective, where V denotes the validation set, G(x) is the ground truth set of concepts of image x, and F1 is the official evaluation measure of the Concept Detection task:</p>
        <p>max_{w1, . . . , wk} ∑_{x ∈ V} F1(G(x), L(x; w1, . . . , wk)) s.t. 1 ≥ w1 ≥ . . . ≥ wk ≥ 0 . (5)</p>
        <p>In detail, we created a population of 500 randomly initialized weight vectors, initial chromosomes in GA terminology. Each chromosome had the form ⟨w1, . . . , wk⟩, with all weights wi ∈ [0, 1]; we ensured that the monotonicity constraint 1 ≥ w1 ≥ . . . ≥ wk ≥ 0 was satisfied by all chromosomes. We then used a crossover mechanism where two chromosomes were combined to form two new ones. At each application of the crossover mechanism, we selected pairs of chromosomes (parents) out of the population and combined their values to form two new chromosomes from each pair of parents. The crossover operator splits the two parent chromosomes at a random point and creates two children chromosomes by combining the values before the crossover point (or after) from one parent, and after (or before) the crossover point from the other parent. Furthermore, we used a mutation mechanism that perturbed the values of the resulting children chromosomes by adding a random value in [−0.1, 0.1] to every gene, with a 0.1 mutation probability per gene (weight wi). Both the crossover and the mutation operators respected the range and monotonicity constraints; we added a clipping and a sorting operation that were applied if any of the constraints were violated in the resulting chromosomes. We used F1(G(x), L(x)) as the fitness function. The fitness function is used to select the chromosomes to be used as parents in the crossover mechanism at each iteration of the algorithm (fitter chromosomes are selected with higher probability as parents). At each generation (new population), we performed the crossover mechanism as many times as necessary to obtain a new generation with as many members as the previous one (and as many as the initial population, i.e., 500 chromosomes). We ran the optimization process for 30 iterations (generations).</p>
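        <p>The weighted vote of the equation above can be sketched as follows (our illustration). The GA that learns the monotone weights is omitted; the weights and threshold below are example values, not the tuned ones.</p>

```python
# Sketch (ours) of the weighted k-NN vote of Section 3.1.3: concept c gets
# score s(c) = sum_i w_i * m_i / sum_i w_i, where m_i = 1 iff the i-th
# nearest neighbor carries c; concepts with s(c) >= tau are predicted.
# The genetic algorithm that learns w_1 >= ... >= w_k is omitted here.

def weighted_knn_predict(neighbor_concepts, weights, tau):
    """neighbor_concepts: list of concept sets, nearest neighbor first;
    weights: non-increasing floats of the same length."""
    total = sum(weights)
    all_concepts = set().union(*neighbor_concepts)
    scores = {
        c: sum(w for w, s in zip(weights, neighbor_concepts) if c in s) / total
        for c in all_concepts
    }
    return {c for c, s in scores.items() if s >= tau}
```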
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Caption Prediction</title>
        <p>
          Our submissions for the Caption Prediction sub-task focused on four primary systems. The first system
employs an InstructBLIP model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (Section 3.2.1), while the remaining submissions build on this
model using techniques such as rephrasing [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ] (Section 3.2.3) and synthesizing [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (Section 3.2.2).
Finally, we implemented an innovative guided-decoding mechanism, DMMCS [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (Section 3.2.4), which
leverages information from the tags predicted by our CNN + k-NN classifier (Section 3.1.2) in the Concept
Detection task to improve the generated caption.
        </p>
        <sec id="sec-3-2-1">
          <title>3.2.1. InstructBLIP</title>
          <p>
            The InstructBLIP model [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] is a sophisticated neural network designed to generate descriptive text
for scientific images. It employs a technique known as instruction-tuning [29], which refines its
behavior and responses based on user-provided instructions. This approach aims to enhance the
model’s controllability and its adaptability across different domains. The InstructBLIP model comprises
three key components: an image encoder, a Q-Former [30], and an LLM. The frozen image encoder
converts the image into a low-dimensional vector and generates image embeddings. The Q-Former
then extracts instruction-aware visual features from these embeddings and can process the text prompt
(instruction) to enhance this extraction. Through extensive training, the LLM learns to correlate
textual prompts with relevant image features, thereby generating coherent and contextually appropriate
descriptions. The InstructBLIP model played a crucial role in creating the initial captions, which were
subsequently utilized in our other caption prediction methods.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Synthesizer</title>
          <p>
            Our goal was to improve the captions obtained from the InstructBLIP model (Section 3.2.1) by leveraging information from similar training images, based on the intuition that similar images may have similar captions [31, 32]. To achieve this, we computed embeddings for all images in the dataset using the CNN + FFNN model, which was developed for Concept Detection (Section 3.1.1). A cosine similarity threshold was then applied to decide if an image qualified as a neighbor of the test image. Images exceeding this threshold were considered neighbors [33]. For each image in the test set [24], we identified the k most similar images from the entire dataset [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], which includes training, validation, and development images, to retrieve their corresponding captions. We experimented with k ∈ {1, 3, 5}; the best results on our validation set were obtained for k = 5, so we used that value. The Synthesizer, a FLAN-T5 model
[
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], was trained to refine the captions generated by InstructBLIP by considering also the captions of
the neighbors, which are concatenated to the caption of InstructBLIP, similarly in spirit to [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. We also experimented with different beam sizes b for the beam search decoding of the Synthesizer during inference; setting b = 5 yielded the best validation scores, so we used that value. Figure 4 illustrates the process (for k = 3), starting with the caption generated by InstructBLIP, merging it with the captions of the neighbors, and using FLAN-T5 to obtain a refined caption.
          </p>
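        <p>Assembling the Synthesizer's input can be sketched as below (our illustration): the InstructBLIP draft is concatenated with the captions of the k retrieved neighbors. The separator string is an assumption, not the exact format used.</p>

```python
# Sketch (ours): build the text fed to the FLAN-T5 Synthesizer by joining
# the InstructBLIP draft caption with the captions of the k nearest
# neighbors. The separator is an assumption, not the exact training format.

def build_synthesizer_input(draft_caption, neighbor_captions, k=5,
                            sep=" </s> "):
    parts = [draft_caption] + neighbor_captions[:k]
    return sep.join(parts)
```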
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Rephraser</title>
          <p>
            Furthermore, we experimented with a domain-specific variation of T5, namely ClinicalT5. This is an encoder-decoder transformer, which is pre-trained on a series of both supervised and unsupervised tasks [34], including denoising tasks, and then further pre-trained on the union of MIMIC-III and MIMIC-IV clinical notes, to which we were granted access through PhysioNet6. Following our previous work [35], we created a corrective text-to-text training set, consisting of pairs of noisy and ground-truth captions, with the former having been generated by our captioning systems. We thus treated our original system as a noise-insertion function and then further fine-tuned ClinicalT5 to rephrase the noisy captions so as to approximate the gold ones, hoping it would acquire knowledge of the medical domain, use medical terms more accurately, and therefore generate more medically fluent captions.
Specifically, we fine-tuned ClinicalT5 to rephrase the captions of InstructBLIP (Section 3.2.1), of InstructBLIP with the FLAN-T5 Synthesizer (Section 3.2.2) on top, and of InstructBLIP with DMMCS (Section 3.2.4), with its weighting hyperparameter set to 0.10. Performance in terms of the primary metric on our development set improved, but test-time performance (in the official evaluation) deteriorated.</p>
          <p>3.2.4. DMMCS</p>
          <p>In this section, we present “Distance from Median Maximum Concept Similarity” (DMMCS) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], a novel data-driven guided decoding mechanism designed to incorporate domain-specific information (in the form of keywords) into the text generation process. The intuition behind this guided decoding algorithm lies in the observation that an accurate diagnostic caption should mention the key medical conditions depicted in the given image. For example, if a radiology image is assigned the tag “Pneumonia”, but the generated caption does not refer to this medical condition either explicitly or implicitly, then the caption is potentially inaccurate. Such conditions are typically represented by the medical tags provided in the ImageCLEF2024 dataset, which the Concept Detection task is also trying to predict. Therefore, we use tags predicted by one of our Concept Detection systems (Section 3.1) in order to guide our Caption Prediction models towards captions that express the tags appropriately. We achieve this by imposing a new penalty at each decoding step, aiming to prioritize the generation of words semantically similar to the (predicted) medical tags. This penalty also considers the frequency with which each tag is explicitly or implicitly expressed in the dataset’s gold captions. (6: https://www.physionet.org/content/clinical-t5/1.0.0/, Last accessed: 2024-06-20)
          </p>
          <p>
            In more detail, recent work examining DC datasets [
            <xref ref-type="bibr" rid="ref7">22, 7</xref>
            ] has shown that some tags are more
prominently expressed than others in the corresponding diagnostic captions. More specifically, Kaliosis
et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] performed an exploratory analysis on the ImageCLEF2023 and MIMIC-CXR datasets, where
they investigated the relationship between each tag and the gold captions of the images that are
associated with the tag in the ground truth. This was achieved by calculating the cosine similarity
between the word embeddings of each caption’s tokens and each tag. The results showed that some tags
are always explicitly expressed in the gold captions of the images the tags are associated with, while
other tags are mentioned more implicitly or even not at all. More concretely, the similarity between a
tag t and a caption c is defined as the maximum cosine similarity (MCS) between the centroid h(t) of
the word embeddings of t and the embedding h(w_i) of each token w_i in c, i.e.,
          </p>
          <p>MCS(t, c) = max_{1 ≤ i ≤ |c|} sim(h(t), h(w_i)). (7)</p>
          <p>A high MCS score between a tag t and a caption c implies that t is strongly expressed in the caption,
while a low MCS score indicates that it was rather implicitly (or not at all) mentioned. The MCS similarity
is also calculated for all the gold captions of the images a tag t is associated with in the training data.
Specifically, for each tag t and the set C_t containing its associated captions, the distribution D(t, C_t) is
calculated as:</p>
          <p>D(t, C_t) = {MCS(t, c) | c ∈ C_t}. (8)</p>
          <p>MMCS(t, C_t) = median(D(t, C_t)). (9)
The median value of the distribution D(t, C_t), hereafter called Median Maximum Cosine Similarity
(MMCS), indicates how strongly t is expressed on average in the training captions it is associated with.</p>
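          <p>The MCS and MMCS definitions above can be sketched with toy 2-d embeddings (pure Python; a real system would use pretrained medical word embeddings, so the vectors and values here are purely illustrative):</p>
```python
# Sketch of Eqs. (7)-(9): MCS takes the maximum cosine similarity between the
# tag embedding and each caption token embedding; MMCS is the median MCS over
# the tag's gold training captions.
from math import sqrt
from statistics import median

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def mcs(tag_vec, caption_vecs):
    """Eq. (7): max cosine similarity between the tag and any caption token."""
    return max(cos(tag_vec, w) for w in caption_vecs)

def mmcs(tag_vec, training_captions):
    """Eq. (9): median of the MCS distribution (Eq. 8) over gold captions."""
    return median(mcs(tag_vec, c) for c in training_captions)

# Toy embeddings: a tag and two gold captions' token vectors.
tag = [1.0, 0.0]
captions = [
    [[0.9, 0.1], [0.0, 1.0]],   # caption expressing the tag almost explicitly
    [[0.2, 0.8], [0.5, 0.5]],   # caption expressing it only implicitly
]
print(round(mmcs(tag, captions), 3))   # prints 0.85
```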
          <p>During inference, when generating the caption for an image with a single tag t, the MCS(t, c) of the
tag t and each candidate (possibly still incomplete) caption c of the beam search is calculated (Eq. 7).
The penalty, imposed at each decoding step, is then defined as the squared difference between MCS(t, c)
and MMCS(t, C_t). The former shows how strongly the tag is mentioned in the candidate caption, while
the latter indicates how strongly the tag is expressed on average in the gold training captions associated
with the tag. When more than one tag is assigned to an image, a distinct penalty is calculated for
each tag, and the overall penalty is the average of the individual penalties. Thus, given a candidate
caption c, the sets C of training captions associated with its tags, and a set of tags T, the penalty is calculated as:
DMMCSpen(c, C, T) = (1 / |T|) Σ_{t ∈ T} (MCS(t, c) − MMCS(t, C_t))². (10)</p>
          <p>Intuitively, the objective of the DMMCS algorithm is to guide the model to generate captions that
express each associated tag as explicitly (or implicitly) as it is expressed in the training corpus. Overall,
at each decoding step, each candidate caption c generated through the beam search process is scored by
the following formula:</p>
          <p>DMMCS(c) = α · DMMCSpen(c, C, T) + (1 − α) · (1 − Dscore),
where T is a given set of predicted tags, α is a tunable weighting factor, while Dscore is the score that
the decoder assigns to the candidate caption c.</p>
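          <p>Putting the penalty of Eq. (10) and the final scoring rule together, the rescoring step can be sketched as follows. All names, the toy MCS/MMCS values, and the decoder scores are illustrative; under our reading of the formula, candidates with the lowest combined score are preferred:</p>
```python
# Hedged sketch of DMMCS rescoring during beam search.

def dmmcs_penalty(mcs_per_tag, mmcs_per_tag):
    """Eq. (10): mean squared gap between each tag's MCS in the candidate
    caption and that tag's MMCS over its gold training captions."""
    gaps = [(mcs_per_tag[t] - mmcs_per_tag[t]) ** 2 for t in mcs_per_tag]
    return sum(gaps) / len(gaps)

def dmmcs_score(mcs_per_tag, mmcs_per_tag, decoder_score, alpha=0.1):
    """Combined score: alpha * penalty + (1 - alpha) * (1 - Dscore)."""
    penalty = dmmcs_penalty(mcs_per_tag, mmcs_per_tag)
    return alpha * penalty + (1 - alpha) * (1 - decoder_score)

# Two candidate captions for an image tagged {"pneumonia", "edema"}:
mmcs_ref = {"pneumonia": 0.85, "edema": 0.60}   # from the training data (Eq. 9)
scores = {
    "A": dmmcs_score({"pneumonia": 0.84, "edema": 0.62}, mmcs_ref, decoder_score=0.75),
    "B": dmmcs_score({"pneumonia": 0.30, "edema": 0.10}, mmcs_ref, decoder_score=0.76),
}
best = min(scores, key=scores.get)   # "A": it expresses the tags as they appear in training
```
Candidate A wins despite a marginally lower decoder score, because its tags are expressed about as strongly as in the gold training captions.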
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments, Submissions and Results</title>
      <p>
        In this section, we provide details about our experiments regarding this year’s evaluation campaign [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Moreover, we share details about our submissions and the scores achieved on our held-out development
set, as well as on the official test set of the competition [24], for both sub-tasks.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Concept Detection</title>
        <p>In the Concept Detection sub-task we submitted our ten best performing models, after evaluating
them on our held-out development set. We submitted two instances with different image encoders of
our CNN + FFNN model (Section 3.1.1), one instance of our CNN + k-NN model (Section 3.1.2), and a
single instance of our CNN + weighted k-NN model (Section 3.1.3). In our subsequent submissions, we
employed ensemble systems. These involved exploring the integration of predictions from multiple
instances by computing either the union or the intersection of their predicted concept sets. Our
submitted ensemble systems consisted of various combinations of CNN-based architectures paired with
different classifiers, specifically CNN + FFNN, CNN + k-NN (KNN), and CNN + weighted k-NN (wKNN).
To enhance the diversity and robustness of our ensembles, we incorporated different architectures for
the CNN component.</p>
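          <p>The union / intersection ensembling described above can be sketched as follows (the UMLS concept IDs and per-model outputs are illustrative):</p>
```python
# Sketch of ensembling predicted concept sets for one image.

def ensemble(predictions, mode="union"):
    """Combine per-model predicted concept sets.

    predictions: list of sets of concept IDs, one set per model.
    mode: "union" favours recall; "intersection" favours precision.
    """
    combined = set(predictions[0])
    for p in predictions[1:]:
        combined = combined.union(p) if mode == "union" else combined.intersection(p)
    return combined

cnn_ffnn = {"C0032285", "C0817096"}   # e.g. pneumonia, chest
cnn_knn = {"C0032285", "C0040405"}    # e.g. pneumonia, computed tomography
union = ensemble([cnn_ffnn, cnn_knn], "union")          # all three concepts
inter = ensemble([cnn_ffnn, cnn_knn], "intersection")   # only the shared concept
```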
        <p>The primary evaluation metric for this year’s Concept Detection sub-task was the F1-score, calculated
between the predicted and ground truth concepts. It is calculated as the sum of the F1-scores for each
test image, divided by the total number of test images. Each partial score is derived from the binary
multi-hot candidate vector compared to the corresponding ground truth vector. Specifically, let F1
represent the overall F1-score, and F̂1 denote the individual F1-score for each test image. Additionally,
let P_i and G_i be the predicted and ground truth concepts for an image i, respectively. Finally, let T be
the test set [24].</p>
        <p>F1 = (1 / |T|) Σ_{i ∈ T} F̂1(P_i, G_i). (6)</p>
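          <p>A compact sketch of this example-based (per-image) F1 of Eq. (6), with illustrative predictions:</p>
```python
# Per-image F1 between predicted and gold concept sets, averaged over the test set.

def f1(pred, gold):
    if not pred and not gold:
        return 1.0   # convention: empty prediction matching empty gold is perfect
    tp = len(pred.intersection(gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, golds):
    """Eq. (6): average of per-image F1 scores over the test set."""
    scores = [f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)

preds = [{"a", "b"}, {"c"}]
golds = [{"a"}, {"c", "d"}]
# image 1: P=0.5, R=1.0 -> F1=2/3; image 2: P=1.0, R=0.5 -> F1=2/3
print(round(mean_f1(preds, golds), 3))   # prints 0.667
```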
        <p>Moreover, a secondary evaluation metric (again an F1-score) was calculated, which only considered
manually selected concepts, such as anatomy, topography, and modality.</p>
        <p>For our first system (CNN+FFNN), we experimented with a variety of CNN encoders as its backbone
component. Specifically, we trained the networks using state-of-the-art CNN architectures, including
EfficientNet and DenseNet. Furthermore, we extended our experiments by incorporating these CNN
encoders into our k-NN models.</p>
        <p>During testing on our held-out development set, we observed a slightly higher F1 score in models
utilizing the EfficientNet image encoder.</p>
        <p>Our ensembling approaches did not show significant improvement over our individual models, with
minimal differences observed in both the development and test set [24].</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Caption Prediction</title>
        <p>For the Caption Prediction sub-task, we submitted nine systems based on their performance on our
development set. Our submissions included InstructBLIP (Section 3.2.1), a synthesizer variant
combining InstructBLIP with FLAN-T5 (Section 3.2.2), and a rephrasing variant that employs ClinicalT5
(Section 3.2.3). Additionally, we explored combinations of all three approaches, aiming to refine the
captions generated by InstructBLIP and FLAN-T5 (Section 3.2.2) using our ClinicalT5 rephraser on
top. Furthermore, we submitted three variations of InstructBLIP and DMMCS, each with a different α
value (Section 3.2.4). Finally, we provided two instances where we employed ClinicalT5 to rephrase the
results generated by the combination of InstructBLIP and DMMCS, in this case using α = 0.10.</p>
        <p>In this year’s campaign, BERTScore [36] was the primary evaluation metric in the Caption Prediction
task, while ROUGE [37] was the secondary metric. Other metrics utilized include, for example, BLEU-1
[38], BLEURT [39], and METEOR [40]. Table 6 shows captions produced by each of our submissions for
the test image CC BY [Muacevic et al. (2024)], extracted from the test dataset [24].</p>
        <p>Finally, Table 7 provides an overview of our models, detailing their performance across fundamental
campaign metrics in both our development set and the provided test set [24], along with our attained
rankings. Additionally, Table 8 presents a summary of all the metrics utilized in this year’s campaign,
offering a comprehensive view of the experiments.</p>
        <p>Table 6 (excerpt): InstructBLIP + DMMCS (alpha 0.1): “Chest X-ray showing bilateral pulmonary edema.”
InstructBLIP + DMMCS (alpha 0.1) + Rephraser: “Computed tomography scan of the head and neck showing a mass in the right parotid gland.”
InstructBLIP + DMMCS (alpha 0.1) + Rephraser (random restart): “Anteroposterior radiograph of the pelvis showing a large right-sided pleural effusion.”</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Our participation in the ImageCLEFmedical Caption task provided an opportunity to explore innovative
NLP approaches for medical image captioning. Utilizing state-of-the-art models, we demonstrated
competitive performance in both the Concept Detection and Caption Prediction sub-tasks.</p>
      <p>In the Concept Detection sub-task, we achieved a 2nd place ranking among the participating groups.
Our top-performing system was a CNN+FFNN pipeline (Section 3.1.1), while our remaining submissions
included a CNN+KNN (Section 3.1.2) and a CNN+wKNN (Section 3.1.3), which also produced competitive
results. We also employed ensembles that combined these systems via the union or intersection of their
predicted tags.</p>
      <p>
        In the Caption Prediction sub-task, we were ranked 4th among all participating groups, by both
extending our previous work [
        <xref ref-type="bibr" rid="ref17">22, 21, 17</xref>
        ] and exploiting the state-of-the-art in NLP, such as
instruction-tuned Large Language Models. Our approach involved the initial generation of captions using the
InstructBLIP model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], followed by their enrichment through the synthesis of information from the
captions of similar images [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] and the utilization of a model further pre-trained in the medical
domain [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to improve the originally generated captions.
      </p>
      <p>In future work, we plan to further investigate and improve biomedical LLMs and further explore
their reasoning capabilities through instruction tuning and, more generally, alignment with medical
professionals’ needs [41]. We also plan to utilize a model capable of processing both image and text inputs
in our Synthesizer approach (Section 3.2.2) to combine information not only from the captions of the
neighbors, but also from the images themselves. Furthermore, we plan to exploit Retrieval-Augmented
Generation [42] algorithms to combine prior knowledge with new medical cases. Finally, the generated
captions need to be evaluated in collaboration with medical experts, to assess their medical accuracy
and usefulness.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience
Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.
[19] B. Karatzas, J. Pavlopoulos, V. Kougia, I. Androutsopoulos, AUEB NLP Group at ImageCLEFmed
Caption 2020, in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum,
Thessaloniki, Greece, September 22-25, volume 2696 of CEUR Workshop Proceedings, 2020.
[20] F. Charalampakos, V. Karatzas, V. Kougia, J. Pavlopoulos, I. Androutsopoulos, AUEB NLP Group
at ImageCLEFmed Caption Tasks 2021, in: Proceedings of the Working Notes of CLEF 2021
Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21-24, volume 2936
of CEUR Workshop Proceedings, 2021, pp. 1184–1200.
[21] F. Charalampakos, G. Zachariadis, J. Pavlopoulos, V. Karatzas, C. Trakas, I. Androutsopoulos,
AUEB NLP Group at ImageCLEFmedical Caption 2022, in: CLEF2022 Working Notes, CEUR
Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022, pp. 1355–1373.
[22] P. Kaliosis, G. Moschovis, F. Charalampakos, J. Pavlopoulos, I. Androutsopoulos, AUEB NLP Group
at ImageCLEFmedical Caption 2023, in: CLEF2023 Working Notes, CEUR Workshop Proceedings,
CEUR-WS.org, Thessaloniki, Greece, 2023.
[23] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology,</p>
      <p>Nucleic acids research 32 (2004) D267–D270. doi:10.1093/nar/gkh061.
[24] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. B.</p>
      <p>Abacha, A. G. S. de Herrera, H. Müller, P. A. Horn, F. Nensa, C. M. Friedrich, ROCOv2: Radiology
Objects in COntext Version 2, an Updated Multimodal Image Dataset, Scientific Data (2024). URL:
https://arxiv.org/abs/2405.10004v1. doi:10.1038/s41597-024-03496-6.
[25] F. Radenović, G. Tolias, O. Chum, Fine-Tuning CNN Image Retrieval with No Human Annotation,
IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 1655–1668. doi:10.
1109/TPAMI.2018.2846566.
[26] D. P. Kingma, J. L. Ba, Adam: A Method for Stochastic Optimization, in: 3rd International
Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference
Track Proceedings, 2015.
[27] T.-H. Chiang, H.-Y. Lo, S.-D. Lin, A Ranking-based KNN Approach for Multi-Label Classification,
in: Proceedings of the Asian Conference on Machine Learning, volume 25, Singapore Management
University, Singapore, 2012, pp. 81–96.
[28] A. Eiben, J. E. Smith, Introduction to Evolutionary Computing, 2nd ed., Springer Publishing</p>
      <p>Company, Incorporated, 2015. doi:10.1007/978-3-662-44874-8.
[29] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned
Language Models Are Zero-Shot Learners, International Conference on Learning Representations
abs/2109.01652 (2021). doi:10.48550/arXiv.2109.01652.
[30] J. Li, D. Li, S. Savarese, S. C. H. Hoi, BLIP-2: Bootstrapping Language-Image Pre-training with
Frozen Image Encoders and Large Language Models, in: International Conference on Machine
Learning, 2023. URL: https://api.semanticscholar.org/CorpusID:256390509. doi:10.48550/arXiv.
2301.12597, Last accessed: 2024-06-20.
[31] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-Augmented
Generation for Large Language Models: A Survey, 2024. doi:10.48550/arXiv.2312.10997.
arXiv:2312.10997.
[32] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Kuttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks, Neural Information Processing Systems abs/2005.11401 (2020).
[33] Y. Huang, J. Huang, A Survey on Retrieval-Augmented Text Generation for Large Language Models,
2024. doi:10.48550/arXiv.2404.10981. arXiv:2404.10981.
[34] C. Raffel, N. M. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of machine
learning research 21 (2019) 140:1 – 140:67. doi:10.48550/arXiv.1910.10683.
[35] P. Kaliosis, Exploring Uni-modal, Multi-modal and Few-Shot Deep Learning Methods for Diagnostic
Captioning, 2023. M.Sc. thesis, Department of Informatics, Athens University of Economics and
Business.
[36] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation
with BERT, International Conference on Learning Representations abs/1904.09675 (2019).
[37] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of Summaries, in: Text Summarization
Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL:
https://aclanthology.org/W04-1013, Last accessed: 2024-06-20.
[38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a Method for Automatic Evaluation of Machine
Translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics, Association for Computational Linguistics,
Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040.
doi:10.3115/1073083.1073135, Last accessed: 2024-06-20.
[39] T. Sellam, D. Das, A. Parikh, BLEURT: Learning Robust Metrics for Text Generation, in: D. Jurafsky,
J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–
7892. URL: https://aclanthology.org/2020.acl-main.704. doi:10.18653/v1/2020.acl-main.704,
Last accessed: 2024-06-20.
[40] S. Banerjee, A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation
with Human Judgments, in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.), Proceedings of the
ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or
Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65–
72. URL: https://aclanthology.org/W05-0909. doi:10.3115/1626355.1626389, Last accessed:
2024-06-20.
[41] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
A. Ray, J. Schulman, J. Hilton, F. Kelton, L. E. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano,
J. Leike, R. J. Lowe, Training language models to follow instructions with human feedback, Neural
Information Processing Systems abs/2203.02155 (2022). doi:10.48550/arXiv.2203.02155.
[42] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates,
Inc., 2020, pp. 9459–9474. doi:10.48550/arXiv.2005.11401.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcıa Seco de Herrera</surname>
          </string-name>
          , L. Bloch,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karpenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , Overview of ImageCLEF 2024:
          <article-title>Multimedia Retrieval in Medical Applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 15th International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Springer Lecture Notes in Computer Science LNCS, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Seco de Herrera</surname>
          </string-name>
          , L. Bloch,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          , Overview of ImageCLEFmedical 2024 -
          <article-title>Caption Prediction and Concept Detection</article-title>
          , in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papamichail</surname>
          </string-name>
          , Diagnostic Captioning: A Survey,
          <source>Knowledge and Information Systems</source>
          <volume>64</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          . doi:10.48550/arXiv.2101.07299.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <article-title>Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2497</fpage>
          -
          <lpage>2506</lpage>
          . doi:10.48550/arXiv.1603.08486.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <article-title>Medical image captioning based on Deep Architectures, Master's thesis</article-title>
          , KTH Royal Institute of Technology, Stockholm, Sweden,
          <year>2022</year>
          . URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-323528, Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>A Survey on Biomedical Image Captioning</article-title>
          , in: R. Bernardi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kafle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nabi</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Second Workshop on Shortcomings in Vision and Language</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>36</lpage>
          . URL: https://aclanthology.org/W19-1803. doi:10.18653/v1/W19-1803, Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaliosis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Charalampakos</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Moschovis</surname>
</string-name>
,
<string-name>
  <given-names>I.</given-names>
  <surname>Androutsopoulos</surname>
</string-name>
,
<article-title>A data-driven guided decoding mechanism for diagnostic captioning</article-title>
, in:
<source>Findings of the Association for Computational Linguistics: ACL 2024</source>
,
<year>2024</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
,
<string-name>
  <given-names>P.</given-names>
  <surname>Liu</surname>
</string-name>
,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
<article-title>A Survey of Large Language Models</article-title>
,
<year>2023</year>
. doi:10.48550/arXiv.2303.18223. arXiv:2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. H.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Fung</surname>
          </string-name>
,
<string-name>
  <given-names>S.</given-names>
  <surname>Hoi</surname>
</string-name>
,
<article-title>InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
). doi:10.48550/arXiv.2305.06500.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
<article-title>Radiology Objects in COntext (ROCO): A Multimodal Image Dataset</article-title>
, in:
<source>7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings</source>
,
<year>2018</year>
, pp.
<fpage>180</fpage>
-
<lpage>189</lpage>
. doi:10.1007/978-3-030-01364-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
, et al.,
<article-title>Scaling Instruction-Finetuned Language Models</article-title>
,
<source>Journal of Machine Learning Research</source>
<volume>25</volume>
(
<year>2024</year>
)
<fpage>1</fpage>
-
<lpage>53</lpage>
. doi:10.48550/arXiv.2210.11416.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Xing</surname>
</string-name>
,
<article-title>Knowledge-Driven Encode, Retrieve, Paraphrase for Medical Image Report Generation</article-title>
, in:
<source>AAAI Conference on Artificial Intelligence</source>
, volume abs/1903.10122,
<year>2019</year>
. doi:10.1609/aaai.v33i01.33016666.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Vernikos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brazinskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adamek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mallinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Malmi</surname>
</string-name>
,
<article-title>Small Language Models Improve Giants by Rewriting Their Outputs</article-title>
, in: Y. Graham, M. Purver (Eds.),
<source>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
, Association for Computational Linguistics, St. Julians, Malta,
<year>2024</year>
, pp.
<fpage>2703</fpage>
-
<lpage>2718</lpage>
. URL: https://aclanthology.org/2024.eacl-long.165. doi:10.48550/arXiv.2305.13514. Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Nguyen</surname>
</string-name>
,
<article-title>ClinicalT5: A Generative Language Model for Clinical Text</article-title>
, in:
<source>Findings of the Association for Computational Linguistics: EMNLP 2022</source>
,
<year>2022</year>
, pp.
<fpage>5436</fpage>
-
<lpage>5443</lpage>
. doi:10.18653/v1/2022.findings-emnlp.398.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Heigold</surname>
</string-name>
,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
, in:
<source>International Conference on Learning Representations</source>
,
<year>2021</year>
. URL: https://openreview.net/forum?id=YicbFdNTTy. doi:10.48550/arXiv.2010.11929. Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Athanasiadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuoma</surname>
          </string-name>
          ,
          <article-title>Weakly-Supervised Semantic Segmentation via Transformer Explainability</article-title>
          ,
in:
<source>ML Reproducibility Challenge 2021 (Fall Edition)</source>
,
<year>2022</year>
. doi:10.5281/zenodo.6574631.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Fransén</surname>
</string-name>
,
<article-title>NeuralDynamicsLab at ImageCLEF Medical 2022</article-title>
, in:
<source>CLEF2022 Working Notes</source>
, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy,
<year>2022</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
<string-name>
  <given-names>I.</given-names>
  <surname>Androutsopoulos</surname>
</string-name>
,
<article-title>AUEB NLP Group at ImageCLEFmed Caption 2019</article-title>
, in:
<source>Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum</source>
, Lugano,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>