<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">AUEB NLP Group at ImageCLEFmedical Caption 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marina</forename><surname>Samprovalaki</surname></persName>
							<email>samprovalaki@aueb.gr</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">Athens University of Economics and Business</orgName>
								<address>
									<addrLine>76, Patission Street</addrLine>
									<postCode>GR-104 34</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anna</forename><surname>Chatzipapadopoulou</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">Athens University of Economics and Business</orgName>
								<address>
									<addrLine>76, Patission Street</addrLine>
									<postCode>GR-104 34</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Georgios</forename><surname>Moschovis</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">Athens University of Economics and Business</orgName>
								<address>
									<addrLine>76, Patission Street</addrLine>
									<postCode>GR-104 34</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Archimedes Unit</orgName>
								<orgName type="institution">Athena Research Center</orgName>
								<address>
									<addrLine>1, Artemidos Street</addrLine>
									<postCode>GR-151 25</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Foivos</forename><surname>Charalampakos</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">Athens University of Economics and Business</orgName>
								<address>
									<addrLine>76, Patission Street</addrLine>
									<postCode>GR-104 34</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Panagiotis</forename><surname>Kaliosis</surname></persName>
							<email>pkaliosis@aueb.gr</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">Athens University of Economics and Business</orgName>
								<address>
									<addrLine>76, Patission Street</addrLine>
									<postCode>GR-104 34</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">John</forename><surname>Pavlopoulos</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">Athens University of Economics and Business</orgName>
								<address>
									<addrLine>76, Patission Street</addrLine>
									<postCode>GR-104 34</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Archimedes Unit</orgName>
								<orgName type="institution">Athena Research Center</orgName>
								<address>
									<addrLine>1, Artemidos Street</addrLine>
									<postCode>GR-151 25</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ion</forename><surname>Androutsopoulos</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">Athens University of Economics and Business</orgName>
								<address>
									<addrLine>76, Patission Street</addrLine>
									<postCode>GR-104 34</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Archimedes Unit</orgName>
								<orgName type="institution">Athena Research Center</orgName>
								<address>
									<addrLine>1, Artemidos Street</addrLine>
									<postCode>GR-151 25</postCode>
									<settlement>Athens</settlement>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">AUEB NLP Group at ImageCLEFmedical Caption 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0C71B9A24230E159FE03D231073B8B48</idno>
					<idno type="arXiv">arXiv:2312.10997</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Natural Language Processing</term>
					<term>Computer Vision</term>
					<term>Biomedical Images</term>
					<term>Convolutional Neural Networks</term>
					<term>Multi-Label Classification</term>
					<term>Caption Generation</term>
					<term>Generative Models</term>
					<term>Transformers</term>
					<term>Deep Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This article describes the approaches that the AUEB NLP Group experimented with during its participation in the 8th edition of the ImageCLEFmedical Caption evaluation campaign, including both Concept Detection and Caption Prediction tasks. The objective of Concept Detection is to automatically categorize biomedical images into a set of one or more concepts. In contrast, the Caption Prediction task focuses on generating a precise and meaningful diagnostic caption that describes the medical conditions depicted in the image. Building on our prior research for the Concept Detection task, we utilized a diverse set of Convolutional Neural Network (CNN) encoders, followed by a Feed-Forward Neural Network. Additionally, we implemented two versions of the retrieval-based 𝑘-NN algorithm: a version that assigned concepts based on statistical frequency and a weighted version that took into account the order of the retrieved neighbors. Both models used the CNN image encoders to improve their retrieval capabilities. Regarding the Caption Prediction task, we fine-tuned the InstructBLIP model to generate initial captions and then enhanced it by employing rephrasing techniques with further pre-trained models. We also used synthesizing techniques that incorporated information from similar neighboring images in the training set to refine these captions. Additionally, we employed "Distance from Median Maximum Concept Similarity" (DMMCS), a novel guided-decoding approach that drives the model's behaviour throughout the decoding process, aiming to integrate information from the predicted concepts of Concept Detection. We explored the application of DMMCS to all of our developed systems. Our group ranked 2nd in Concept Detection and 4th in Caption Prediction.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>ImageCLEF <ref type="bibr" target="#b0">[1]</ref> is an ongoing evaluation initiative, first run in 2003 as part of the Cross Language Evaluation Forum (CLEF) 1 , that promotes the evaluation of technologies for annotation, indexing, classification, and retrieval of multi-modal data. ImageCLEFmedical is one of the four main tasks in this year's ImageCLEF campaign. We participated in the ImageCLEFmedical Caption task, which was organized for the eighth time <ref type="bibr" target="#b1">[2]</ref>. As in previous years, the task comprised two sub-tasks: Concept Detection and Caption Prediction.</p><p>The objective of Concept Detection is to accurately associate a biomedical image with one or more relevant medical concepts (tags), while in Caption Prediction, the goal is to automatically generate a preliminary diagnostic report that accurately describes the medical findings, as well as the anatomy of the body structures and organs shown in the image. Diagnostic Captioning remains a challenging research problem aimed at assisting the diagnostic process for patients by providing a preliminary report, rather than replacing medical professionals involved in the procedure <ref type="bibr" target="#b2">[3]</ref>. It can thus be seen as an assistive tool, capable of producing an initial draft diagnosis regarding the patient's condition. Such a document would ideally allow doctors to focus on critical areas of the image <ref type="bibr" target="#b3">[4]</ref> and help them produce more precise medical diagnoses at an increased speed <ref type="bibr" target="#b4">[5]</ref>. Experienced clinicians could enhance their throughput by analyzing the large volume of daily medical examinations more quickly and efficiently. Less experienced clinicians could consider the automatically generated captions to reduce the likelihood of clinical errors <ref type="bibr" target="#b5">[6]</ref>. 
Concept Detection can further improve Diagnostic Captioning by identifying key concepts that should be included in the draft report. We demonstrate the connection between the two sub-tasks by using "Distance from Median Maximum Concept Similarity" (DMMCS)<ref type="foot" target="#foot_0">2</ref>  <ref type="bibr" target="#b6">[7]</ref>, which employs information derived from our Concept Detection systems in order to improve the performance of our Caption Prediction systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">AUEB NLP Group contributions</head><p>In this work, we present the experiments conducted and the systems submitted as part of the AUEB NLP Group's participation in this year's Concept Detection and Caption Prediction tasks. We used a number of new approaches influenced by the remarkable progress in the field of NLP and based on instruction-tuned Large Language Models (LLMs) <ref type="bibr" target="#b7">[8]</ref>.</p><p>Our submissions to the Concept Detection sub-task are based on two distinct approaches. We used a Convolutional Neural Network (CNN) encoder to extract visual features from the medical images. In the first approach, these features were fed into a Feed-Forward Neural Network (FFNN) to classify the images into various medical concepts. In the second approach, we implemented a separate method using a 𝑘-nearest neighbors (𝑘-NN) algorithm. In this approach, 𝑘 neighbors are first retrieved, and the most frequently occurring concepts among these neighbors are selected.</p><p>Regarding the Caption Prediction sub-task, we tried five main approaches. First, we employed an InstructBLIP model <ref type="bibr" target="#b8">[9]</ref> that was fine-tuned on the specified dataset <ref type="bibr" target="#b9">[10]</ref> to generate an initial set of captions, which were then also used in the other four approaches. In the second approach, we enhanced the initial captions by drawing insights from captions of similar images and training a FLAN-T5 model <ref type="bibr" target="#b10">[11]</ref> to refine them <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>. The third approach was similar, but instead of FLAN-T5, we employed ClinicalT5 <ref type="bibr" target="#b13">[14]</ref>, which is pre-trained on numerous medical datasets, in order to rephrase and correct the initial captions produced by InstructBLIP. 
The fourth approach involved integrating the DMMCS algorithm <ref type="bibr" target="#b6">[7]</ref> in the language model's decoding process in order to promote the inclusion of a given set of keywords, which in this case were predicted by one of our Concept Detection systems. Lastly, we also applied DMMCS decoding to ClinicalT5 in order to maximize their efficacy and improve the overall caption quality. In all our models we used CNN encoders, since there are signs that vision transformers <ref type="bibr" target="#b14">[15]</ref> still have inferior performance in visual tasks, such as classification and semantic segmentation <ref type="bibr" target="#b15">[16]</ref>, especially in medical image tagging <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b16">17]</ref>.</p><p>Extending our history of successful entries <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22]</ref> in the ImageCLEFmedical campaign, our submissions ranked 2nd among 9 participating groups in the Concept Detection sub-task and 4th among 11 participating groups in the Caption Prediction sub-task. In Section 2, we provide insight into this year's dataset, followed by a discussion of our approaches in Section 3. In Section 4, we present our experimental results for each sub-task. Finally, in Section 5, we summarize our findings and suggest directions for future research.</p><p>All code used for our experiments is available on GitHub.<ref type="foot" target="#foot_1">3</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data</head><p>In this year's edition of the ImageCLEFmedical Caption task, the dataset is an updated and extended version of the Radiology Objects in Context (ROCO) dataset <ref type="bibr" target="#b9">[10]</ref>, which originates from biomedical articles of the PubMed Open Access (PMC OA) subset.<ref type="foot" target="#foot_2">4</ref> This dataset, which is common for both sub-tasks, consists of 80,080 biomedical images along with their respective medical concepts, in the form of UMLS <ref type="bibr" target="#b22">[23]</ref> terms<ref type="foot" target="#foot_3">5</ref>, and diagnostic captions. The dataset was originally split by the organizers into training and validation subsets, with 70,108 radiology images in the first set and 9,972 in the latter. After merging the provided data, we split them again, this time into three subsets, in order to also obtain a development (private test) subset for evaluation purposes. We used a 75%-10%-15% training-validation-development split, keeping relatively equal concept distributions in all three subsets. Consequently, we obtained 64,928 images as our training data, 7,179 images as our validation set, while the remaining 7,973 images constituted our held-out development set. All of our submissions were also evaluated on the hidden official test set (ROCOv2) <ref type="bibr" target="#b23">[24]</ref>. The test dataset utilizes Radiology Objects in COntext Version 2 (ROCOv2) <ref type="bibr" target="#b23">[24]</ref>, an updated and extended version of the ROCO dataset <ref type="bibr" target="#b9">[10]</ref>. This set includes 17,237 previously unseen images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Concept Detection</head><p>Concept Detection is a multi-label classification problem covering a broad range of 1,945 distinct biomedical concepts, originating from the Unified Medical Language System (UMLS) <ref type="bibr" target="#b22">[23]</ref>. In this sub-task, the goal is to identify (assign) the distinct medical concepts (tags) depicted in each image (e.g., particular medical conditions). Among the available concepts (tag set), four are specific imaging modalities: X-Ray Computed Tomography, Ultrasonography, Magnetic Resonance Imaging (MRI), PET/CT scans. All concepts are represented by Concept Unique Identifiers (CUIs) following the UMLS standard. Some examples of images and their ground truth concepts can be found in Figure <ref type="figure" target="#fig_1">1</ref>.  The distribution of concepts is highly skewed. Some concepts are present in more than 25,000 images, whereas others are associated with only 1 image. Figure <ref type="figure" target="#fig_2">2</ref>(a) depicts the long-tail distribution of the entire (development + validation + train) dataset, as shown in the left plot, where the frequencies of the concepts (number of images each concept is associated with) are plotted in descending order against their respective class indices. After conducting a comprehensive exploratory analysis of this year's dataset, we found that certain concepts were more prevalent (Table <ref type="table">1</ref>); these mostly correspond to kinds of medical examinations, such as X-Ray Computed Tomography or Plain x-ray. Most images are associated (in the ground truth) with at least one of these overarching concepts, alongside more specialized ones. The maximum and minimum number of concepts assigned to a single image are 27 and 1, occurring in 1 and 8,567 images respectively. The average number of assigned concepts per image is 3.1583. 
The aforementioned observations are outlined in the histogram in Figure <ref type="figure" target="#fig_2">2(b)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>The ten most frequent concepts (CUIs) of the ImageCLEFmedical2024 dataset, along with their corresponding UMLS terms, and the number of images they are associated with.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Most Common Concepts</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Caption Prediction</head><p>In the Caption Prediction data, each image is accompanied by a gold diagnostic caption that describes the medical conditions present in the image. There are 80,080 gold captions across the whole dataset, one for each provided image. Similar to last year's campaign, the vast majority of the captions, specifically 99.47% (79,658 out of 80,080 captions), are unique. The maximum number of words in a single caption is 848 (occurred once), while the minimum is 1 (encountered 73 times). The average caption length is 21.01 words. These statistics apply to the dataset as a whole, but we have carefully checked that they remain consistent in all three subsets (training, validation, development) we formed. The five most common captions, as well as the ten most popular words, excluding the stopwords, can be found in Tables <ref type="table">2 and 3</ref>, respectively. In Figure <ref type="figure" target="#fig_3">3</ref>, we provide a histogram alongside a box plot, utilizing a logarithmic scale in our visualizations. This helps make smaller counts more visible and reduces the dominance of larger values, giving a more balanced view of how the data is distributed. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The five most common gold captions found in the ImageCLEFmedical2024 dataset <ref type="bibr" target="#b9">[10]</ref> alongside the number of images they are associated with.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Most common captions</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Rank Caption</head><p>Occurrences</p><formula xml:id="formula_0">1 Initial panoramic radiograph. 40 2</formula><p>Final panoramic radiograph. 37 3</p><p>Chest X-ray. 20 4</p><p>Chest radiograph. 17 5</p><p>Preoperative CT scan. 9</p><p>According to the organizers, each caption is pre-processed before evaluated in the following manner:</p><p>• The caption is converted to lower-case.</p><p>• Numbers are replaced by words, e.g., number 10 becomes "ten".</p><p>• Punctuation is removed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this section, we present the methods we used in our submissions for both the Concept Detection and the Caption Prediction sub-tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Concept Detection</head><p>Our submissions for this year's Concept Detection sub-task are built upon two frameworks. Initially, we extensively explored a CNN+FFNN framework, building upon our prior research <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21]</ref>, experimenting with various image encoders. Additionally, we used a neural image retrieval approach by integrating a 𝑘-nearest neighbors (𝑘-NN) algorithm, which selects 𝑘 neighbors and aggregates tags based on their frequency among the neighbors. Furthermore, we submitted several ensembles of the aforementioned systems. The ensembles employed strategies such as union-based and intersectionbased aggregation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>The ten most common words (of gold captions) and their frequencies in the ImageCLEFmedical2024 dataset <ref type="bibr" target="#b9">[10]</ref>, after removing stop-words. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Most common words (excluding stop-words)</head><note type="other">Word</note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">CNN + FFNN</head><p>This system employs a CNN encoder as its backbone, followed by an FFNN classification head. We extract image features from the last convolutional layer of the image encoder and we condense these feature maps into a feature vector (an image embedding) using global pooling. More specifically, we used the Generalized-Mean (GeM) pooling <ref type="bibr" target="#b24">[25]</ref> mechanism.</p><p>The FFNN component classifies the image into one or more concepts. Its output layer has |𝐶| neurons, where 𝐶 represents the set of unique concepts in the dataset. Each neuron uses a sigmoid activation function to transform its value into a probability value in [0, 1]. This results in one probability per label, and if this probability exceeds a specific threshold value 𝑡, the corresponding concept is assigned to the image. The threshold, which is the same for all concepts, was chosen through a grid search procedure that optimized the primary metric of the competition, on our validation set. The model was trained by minimizing binary cross-entropy, treating each concept as a separate binary target and summing up the individual losses. We used the Adam optimizer <ref type="bibr" target="#b25">[26]</ref>, along with a decreasing learning rate strategy and early stopping based on the validation set loss with a patience value of 3 epochs. We used an initial learning rate of 𝜂 = 10 −3 and decreasing factor of 10.</p><p>In order to form the ensembles, we trained several instances of the model, using different random initializations, and combined them using the union and the intersection of their predicted concept sets. More details about our submitted ensemble systems can be found in Section 4.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">CNN + 𝑘-NN</head><p>For our 𝑘-nearest neighbors (𝑘-NN) approach, we leveraged the image embeddings obtained from the encoder of the trained CNN+FFNN system (Section. 3.1.1). We discarded the dense classification head and used the last GeM pooling layer to extract embeddings (feature vectors) for all the training images. These embeddings served as the basis for the retrieval process in the 𝑘-NN algorithm. Given a test image, the goal of the system is to retrieve similar images from the training set and select concepts from the retrieved neighbors. For each test image, we used the same encoder to obtain its embedding and we retrieved the 𝑘 closest neighbors from the training set, based on cosine similarity computed on the image embeddings. We tuned the value of 𝑘 in the range from 1 to 100 using our validation set, which led to 𝑘 = 33.</p><p>For each test image, having obtained its 𝑘 neighbors from the training set, we formed the set of concepts associated with the neighbors. We then ranked the concepts of the set based on the number of retrieved neighbors associated with each concept, ordering them from highest to lowest frequency. The concept with the highest frequency was always included in the predictions of the 𝑘-NN method for the test image. We then used two thresholds, 𝑡 1 and 𝑡 2 , which we tuned using grid search on our validation set, to select which other concepts of the neighborhood to include in the predictions of 𝑘-NN. 
We calculated the difference in frequency (Fr) between the first and second most frequent concepts, divided by the frequency of the first concept, and if the result exceeded 𝑡 1 , we included the second concept in the prediction:</p><formula xml:id="formula_1">Fr(concept 1 ) − Fr(concept 2 ) Fr(concept 1 ) ≥ 𝑡 1 .<label>(1)</label></formula><p>Similarly, we determined whether to include in the prediction the third most frequent concept or not, based on a comparison involving the first and third most frequent concepts. We calculated the difference between the frequencies of the first and third concepts, dividing it by the frequency of the first concept, and if this ratio exceeded 𝑡 2 , we included the third concept:</p><formula xml:id="formula_2">Fr(concept 1 ) − Fr(concept 3 ) Fr(concept 1 ) ≥ 𝑡 2 .<label>(2)</label></formula><p>The same approach was applied to the difference between the first and fourth most frequent concepts, checking again against 𝑡 2 , to decide if the fourth most frequent concept should be predicted:</p><formula xml:id="formula_3">Fr(concept 1 ) − Fr(concept 4 ) Fr(concept 1 ) ≥ 𝑡 2 .<label>(3)</label></formula><p>We opted to predict at most four concepts due to the fact that the average number of concepts in the training split was 3.08. The rationale was to select concepts that have frequencies close to that of the highest frequency concept, while excluding concepts that show a significant drop in frequency compared to the preceding ones. We experimented with 𝑡 1 , 𝑡 2 values ranging from 0.3 to 0.9. Validation results indicated that the best parameters were 𝑡 1 = 0.58 and 𝑡 2 = 0.65.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.3.">CNN + weighted 𝑘-NN</head><p>We also developed a weighted version of the 𝑘-NN algorithm, using the voting scheme that was described in <ref type="bibr" target="#b26">[27]</ref> </p><formula xml:id="formula_4">s.t. 1 ≥ 𝑤 1 ≥ . . . ≥ 𝑤 𝑘 ≥ 0 .<label>(5)</label></formula><p>In detail, we created a population of 500 randomly initialized weight vectors, initial chromosomes in GA terminology. Each chromosome had the form ⟨𝑤 1 , . . . , 𝑤 𝑘 ⟩, with all weights 𝑤 𝑖 ∈ [0, 1]; we ensured that the monotonicity constraint 1 ≥ 𝑤 1 ≥ . . . ≥ 𝑤 𝑘 ≥ 0 was satisfied by all chromosomes. We then used a crossover mechanism where two chromosomes were combined to form two new ones. At each application of the crossover mechanism, we selected pairs of chromosomes (parents) out of the population and combined their values to form two new ones from each pair of parents. The crossover operator splits the two parent chromosomes at a random point and creates two children chromosomes by combining the values before the crossover point (or after) for one parent, and after (or before) the crossover point for the other parent. Furthermore, we used a mutation mechanism that perturbed the values of the resulting children chromosomes by adding a random value in [−0.1, 0.1] to every gene, with a 0.1 mutation probability per gene (𝑤 𝑖 ). Both the crossover and the mutation operators paid respect to the range and monotonicity constraints; we added a clipping and a sorting operation that were applied if any of the constraints were violated in the resulting chromosomes. We used 𝐹 1 (𝑌 (𝑥), 𝐻(𝑥)) as the fitness function. The fitness function is used to select the chromosomes to be used as parents in the crossover mechanism at each iteration of the algorithm (fitter chromosomes are selected with higher probability as parents). 
At each generation (new population), we performed the crossover mechanism as many times as necessary to have a new generation with as many members as the previous one (and as many as the initial population, i.e., 500 chromosomes). We ran the optimization process for 30 iterations (generations).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Caption Prediction</head><p>Our submissions for the Caption Prediction sub-task focused on four primary systems. The first system employs an InstructBLIP model <ref type="bibr" target="#b8">[9]</ref> (Section 3.2.1), while the remaining submissions build on this model using techniques such as rephrasing <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref> (Section 3.2.3) and synthesizing <ref type="bibr" target="#b11">[12]</ref> (Section 3.2.2). Finally, we implemented an innovative guided-decoding mechanism, DMMCS <ref type="bibr" target="#b6">[7]</ref> (Section 3.2.4), which leverages information from the tags predicted by our CNN+𝑘-NN classifier (Section 3.1.2) in the Concept Detection task to improve the generated caption.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">InstructBLIP</head><p>The InstructBLIP model <ref type="bibr" target="#b8">[9]</ref> is a sophisticated neural network designed to generate descriptive text for scientific images. It employs a technique known as instruction-tuning <ref type="bibr" target="#b28">[29]</ref>, which refines its behavior and responses based on user-provided instructions. This approach aims to enhance the model's controllability and its adaptability across different domains. The InstructBLIP model comprises three key components: an image encoder, a Q-Former <ref type="bibr" target="#b29">[30]</ref>, and an LLM. The frozen image encoder converts the image into a low-dimensional vector and generates image embeddings. The Q-Former then extracts instruction-aware visual features from these embeddings and can process the text prompt (instruction) to enhance this extraction. Through extensive training, the LLM learns to correlate textual prompts with relevant image features, thereby generating coherent and contextually appropriate descriptions. The InstructBLIP model played a crucial role in creating the initial captions, which were subsequently utilized in our other caption prediction methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Synthesizer</head><p>Our goal was to improve the captions obtained from the InstructBLIP model (Section 3.2.1) by leveraging information from similar training images, based on the intuition that similar images may have similar captions <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32]</ref>. To achieve this, we computed embeddings for all images in the dataset using the CNN + FFNN model, which was developed for Concept Detection (Section 3.1.1). A cosine similarity threshold was then applied to decide if an image qualified as a neighbor of the test image. Images exceeding this threshold were considered neighbors <ref type="bibr" target="#b32">[33]</ref>. For each image in the test set <ref type="bibr" target="#b23">[24]</ref>, we identified the 𝑘 most similar images from the entire dataset <ref type="bibr" target="#b9">[10]</ref>, which includes training, validation, and development images, to retrieve their corresponding captions. We experimented with 𝑘 ∈ {1, 3, 5}; the best results in our validation set were obtained for 𝑘 = 5, so we used that value. The Synthesizer, a FLAN-T5 model <ref type="bibr" target="#b10">[11]</ref>, was trained to refine the captions generated by InstructBLIP by considering also the captions of the neighbors, which are concatenated to the caption of InstructBLIP, similarly in spirit to <ref type="bibr" target="#b12">[13]</ref>. We also experimented with different beam sizes 𝑚, for the beam search decoding of the Synthesizer during inference; setting 𝑚 = 5 yielded the best validation scores, so we used that value. Figure <ref type="figure" target="#fig_4">4</ref> illustrates the process (for 𝑚 = 3), starting with the caption generated by InstructBLIP, merging it with the captions of the neighbors, and using FLAN-T5 to obtain a refined caption. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3.">Rephraser</head><p>Furthermore, we experimented with a domain-specific variation of T5, namely ClinicalT5. This is an encoder-decoder transformer, which is pre-trained in a series of both supervised and unsupervised tasks <ref type="bibr" target="#b33">[34]</ref>, including denoising tasks, and then further pre-trained on the union of MIMIC-III and IV clinical notes, to which we were granted access through PhysioNet<ref type="foot" target="#foot_4">6</ref> . Following our previous work <ref type="bibr" target="#b34">[35]</ref>, we created a corrective text-to-text training set, consisting of noisy and ground truth caption pairs, with the former having been generated by our captioning systems. Therefore, we treated our original system as a noise-insertion function, then we further fine-tuned ClinicalT5, in order to rephrase the noisy captions to approximate the gold ones, hoping it would acquire knowledge of the medical domain, use medical terms more accurately and therefore generate more medically fluent text captions. Specifically, we fine-tuned ClinicalT5 to rephrase the captions of InstructBLIP (Section 3.2.1), InstructBLIP with FLAN-T5 Synthesizer (Section 3.2.2) on top and InstructBLIP with DMMCS (Section 3.2.4) using 𝛼 = 0.10. Performance in terms of the primary metric in our development set improved, but test-time performance (in the official evaluation) deteriorated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.4.">DMMCS</head><p>In this section, we present "Distance from Median Maximum Concept Similarity" (DMMCS) <ref type="bibr" target="#b6">[7]</ref>, a novel data-driven guided decoding mechanism designed to incorporate domain-specific information (in the form of keywords) into the text generation process. The intuition behind this guided decoding algorithm lies in the observation that an accurate diagnostic caption should mention the key medical conditions depicted in the given image. For example, if a radiology image is assigned the tag "Pneumonia", but the generated caption does not refer to this medical condition either explicitly or implicitly, then the caption is potentially inaccurate. Such conditions are typically represented by the medical tags provided in the ImageCLEF2024 dataset, which the Concept Detection task is also trying to predict. Therefore we use tags predicted by one of our Concept Detection systems (Section 3.1), in order to guide our Caption Prediction models towards captions that express the tags appropriately. We achieve this by imposing a new penalty at each decoding step, aiming to prioritize the generation of words semantically similar to the (predicted) medical tags. This penalty also considers the frequency with which each tag is explicitly or implicitly expressed in the dataset's gold captions.</p><p>In more detail, recent work examining DC datasets <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b6">7]</ref> has shown that some tags are more prominently expressed than others in the corresponding diagnostic captions. More specifically, Kaliosis et al. <ref type="bibr" target="#b6">[7]</ref> performed an exploratory analysis on the ImageCLEF2023 and MIMIC-CXR datasets, where they investigated the relationship between each tag and the gold captions of the images that are associated with the tag in the ground truth. 
This was achieved by calculating the cosine similarity between the word embeddings of each caption's tokens and each tag. The results showed that some tags are always explicitly expressed in the gold captions of the images the tags are associated with, while other tags are mentioned more implicitly or even not at all. More concretely, the similarity between a tag 𝑡 and a caption 𝑐 is defined as the maximum cosine similarity (MCS) between the centroid ℎ(𝑡) of the word embeddings of 𝑡 and the embedding ℎ(𝑐 𝑖 ) of each token in 𝑐, i.e.,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MCS(𝑡</head><formula xml:id="formula_5">, 𝑐) = max 1≤𝑖≤|𝑐| sim(ℎ(𝑡), ℎ(𝑐 𝑖 )).<label>(7)</label></formula><p>A high MCS score between a tag 𝑡 and a caption 𝑐 implies that 𝑡 is strongly expressed in the caption, while a low MCS score indicates that it was rather implicitly (or not at all) mentioned. The MCS similarity is also calculated for all the gold captions of the images a tag 𝑡 is associated with in the training data. Specifically, for each tag 𝑡 and the set 𝐶 containing its associated captions, the distribution 𝑅(𝑡, 𝐶) is calculated as:</p><formula xml:id="formula_6">𝑅(𝑡, 𝐶) = {MCS(𝑡, 𝑐)|𝑐 ∈ 𝐶}.<label>(8)</label></formula><p>The median value of the distribution 𝑅(𝑡, 𝐶), hereafter called Median Maximum Cosine Similarity (MMCS), indicates how strongly 𝑡 is expressed on average in the training captions it is associated with.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MMCS(𝑡</head><formula xml:id="formula_7">, 𝐶) = median(𝑅(𝑡, 𝐶)).<label>(9)</label></formula><p>During inference, when generating the caption for an image with a single tag 𝑡, the MCS(𝑡, 𝑐) of the tag 𝑡 and each candidate (possibly still incomplete) caption 𝑐 of the beam search is calculated (Eq. 7). The penalty, imposed at each decoding step, is then defined as the squared difference between MCS(𝑡, 𝑐) and MMCS(𝑡, 𝐶). The former shows how strongly the tag is mentioned in the candidate caption, while the latter indicates how strongly the tag is expressed on average in the gold training captions associated with the tag. When more than one tags are assigned to an image, a distinct penalty is calculated for each tag, and the overall penalty is the average of the individual penalties. Thus, given a candidate caption 𝑐, the set of its associated training captions 𝐶, and a set of tags 𝑇 , the penalty is calculated as:</p><formula xml:id="formula_8">DMMCS pen (𝑇, 𝐶, 𝑐) = 1 |𝑇 | ∑︁ 𝑡∈𝑇 (MCS(𝑡, 𝑐) − MMCS(𝑡, 𝐶)) 2 .<label>(10)</label></formula><p>Intuitively, the objective of the DMMCS algorithm is to guide the model to generate captions that express each associated tag as explicitly (or implicitly) as it is expressed in the training corpus. Overall, at each decoding step, each candidate caption 𝑐 generated through the beam search process is scored by the following formula:</p><formula xml:id="formula_9">DMMCS(𝑐) = 𝛼 • DMMCS pen (𝑇, 𝐶, 𝑐) + (1 − 𝛼) • (1 − D score ),<label>(11)</label></formula><p>where 𝑇 is a given set of predicted tags, 𝛼 is a tunable weighting factor, while D score is the score that the decoder assigns to the candidate caption 𝑐.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments, Submissions and Results</head><p>In this section, we provide details about our experiments regarding this year's evaluation campaign <ref type="bibr" target="#b0">[1]</ref>. Moreover, we share details about our submissions and the scores achieved in our held-out development set, as well as the official test set of the competition <ref type="bibr" target="#b23">[24]</ref> for both sub-tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Concept Detection</head><p>In the Concept Detection sub-task we submitted our ten best performing models, after evaluating them on our held-out development set. We submitted two instances with different image encoders of our CNN + FFNN model (Section 3.1.1), one instance of our CNN + 𝑘-NN model (Section 3.1.2), and a single instance of our CNN + weighted 𝑘-NN model (Section 3.1.3). In our subsequent submissions, we employed ensemble systems. These involved exploring the integration of predictions from multiple instances by computing either the union or the intersection of their predicted concept sets. Our submitted ensemble systems consisted of various combinations of CNN-based architectures paired with different classifiers, specifically CNN + FFNN, CNN + 𝑘-NN (KNN), and CNN + weighted 𝑘-NN (wKNN). To enhance the diversity and robustness of our ensembles, we incorporated different architectures for the CNN component.</p><p>The primary evaluation metric for this year's Concept Detection sub-task was the 𝐹 1 -score, calculated between the predicted and ground truth captions. It is calculated as the sum of the 𝐹 1 -scores for each test image, divided by the total number of test images. Each partial score is derived from the binary multi-hot candidate vector compared to the corresponding ground truth vector. Specifically, let 𝐹 1 represent the overall 𝐹 1 -score, and 𝑓 1 ^denote the individual 𝐹 1 -score for each test image. Additionally, let 𝑝 𝑡 and 𝑔 𝑡 be the predicted and ground truth concepts for an image 𝑡, respectively. 
Finally, let 𝑇 be the test set <ref type="bibr" target="#b23">[24]</ref>.</p><formula xml:id="formula_10">𝐹 1 = 1 |𝑇 | ∑︁ 𝑡∈𝑇 𝑓 1 ^(𝑝 𝑡 , 𝑔 𝑡 )<label>(6)</label></formula><p>Moreover, a secondary evaluation metric (again an 𝐹 1 score) was calculated, which only considered manually selected concepts, such as anatomy, topography, and modality.</p><p>For our first system (CNN+FFNN), we experimented with a variety of CNN encoders as their backbone components. Specifically, we trained the networks using state-of-the-art CNN architectures, including EfficientNet and DenseNet. Furthermore, we extended our experiments by incorporating these CNN encoders into our 𝑘-NN models.</p><p>During testing on our held-out development set, we observed a slightly higher F1 score in models utilizing the EfficientNet image encoder.</p><p>Our ensembling approaches did not show significant improvement over our individual models, with minimal differences observed in both the development and test set <ref type="bibr" target="#b23">[24]</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Caption Prediction</head><p>For the Caption Prediction sub-task, we submitted nine systems based on their performance on our development set. Our submissions included InstructBLIP (Section 3.2.1), a synthesizer variant combining InstructBLIP with FLAN-T5 (Section 3.2.2), and a rephrasing variant that employs ClinicalT5 (Section 3.2.3). Additionally, we explored combinations of all three approaches, aiming to refine the captions generated by InstructBLIP and FLAN-T5 (Section 3.2.2) using our ClinicalT5 rephraser on top. Furthermore, we submitted three variations of InstructBLIP and DMMCS, each with a different 𝛼 value (Section 3.2.4). Finally, we provided two instances where we employed ClinicalT5 to rephrase the results generated by the combination of InstructBLIP and DMMCS, in this case using 𝛼 = 0.10.</p><p>In this year's campaign, BERTScore <ref type="bibr" target="#b35">[36]</ref> was the primary evaluation metric in the Caption Prediction task, while ROUGE <ref type="bibr" target="#b36">[37]</ref> was the secondary metric. Other metrics utilized include, for example, BLEU-1 <ref type="bibr" target="#b37">[38]</ref>, BLEURT <ref type="bibr" target="#b38">[39]</ref>, and METEOR <ref type="bibr" target="#b39">[40]</ref>. Table <ref type="table" target="#tab_5">6</ref> shows captions produced by each of our submissions for the test image CC BY <ref type="bibr">[Muacevic et al. (2024)</ref>], extracted from the test dataset <ref type="bibr" target="#b23">[24]</ref>.</p><p>Finally, Table <ref type="table">7</ref> provides an overview of our models, detailing their performance across fundamental campaign metrics in both our development set and the provided test set <ref type="bibr" target="#b23">[24]</ref>, along with our attained rankings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 7</head><p>Summary of the scores of our submissions to the ImageCLEFmedical2024 Caption Prediction sub-task. rankings. Additionally, Table <ref type="table" target="#tab_7">8</ref> presents a summary of all the metrics utilized in this year's campaign, offering a comprehensive view of the experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AUEB NLP Group -Submission</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>Our participation in the ImageCLEFmedical Caption task provided an opportunity to explore innovative NLP approaches for medical image captioning. Utilizing state-of-the-art models, we demonstrated competitive performance in both the Concept Detection and Caption Prediction sub-tasks.</p><p>In the Concept Detection sub-task, we achieved a 2 nd place ranking among the participating groups. Our top-performing system was a CNN+FFNN pipeline (Section 3.1.1), while our remaining submissions included a CNN+KNN (Section 3.1.2) and a CNN+wKNN (Section 3.1.3), which also produced competitive results. We also employed ensembles that combined these approaches using union and intersection (of predicted tags) approaches.</p><p>In the Caption Prediction sub-task, we were ranked 4 th among all participating groups, by both extending our previous work <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b16">17]</ref> and exploiting the state-of-the-art in NLP, such as instruction-tuned Large Language Models. Our approach involved the initial generation of captions using the InstructBLIP model <ref type="bibr" target="#b8">[9]</ref>, followed by their enrichment through the synthesis of information from the captions of similar images <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref> and the utilization of a model further pre-trained in the medical domain <ref type="bibr" target="#b13">[14]</ref> to improve the originally generated captions.</p><p>In future work, we plan to further investigate and improve biomedical LLMs and further explore their reasoning capabilities through instruction tuning and, more generally, alignment with medical professionals' needs <ref type="bibr" target="#b40">[41]</ref>. 
We also plan to utilize a model capable of processing both image and text inputs in our Synthesizer approach (Section 3.2.2) to combine information not only from the captions of the neighbors, but also from the images themselves. Furthermore, we plan to exploit Retrieval-Augmented Generation <ref type="bibr" target="#b41">[42]</ref> algorithms to combine prior knowledge with new medical cases. Finally, the generated captions need to be evaluated in collaboration with medical experts, to assess their medical accuracy and usefulness.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Magdás et al. (2021)]</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: CC BY [Magdás et al. (2021)] from the ImageCLEFmedical2024 dataset, along with the corresponding CUIs and UMLS terms.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: (a) Visualization of the dataset's long-tail distribution. The y-axis shows the number of occurrences of each concept, and the x-axis the concept's class index. (b) Histogram with 25 fixed-size bins (horizontal axis) depicting the number of gold concepts per image. Note that 13 concepts do not have corresponding UMLS terms.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: (a) Histogram visualizing the distribution of caption lengths. The 𝑦-axis, displayed on a logarithmic scale, represents the number of images falling into each bin, while the 𝑥-axis shows the number of words in the captions. (b) Box-plot illustrating the same distribution, with the 𝑦-axis displayed on a logarithmic scale, highlighting outliers in the range of 100 to 200 words.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Illustration of a radiology image (CC BY [Muacevic et al., 2024]), accompanied by similar neighbor images (CC BY-NC [Popa et al., 2014], CC BY-NC [Popa et al., 2014], CC BY-NC [Bang et al., 2015]) and their corresponding captions from the 2024 ImageCLEFmedical caption task<ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b23">24]</ref>. The initial caption, generated by InstructBLIP, is concatenated with the captions of the neighbors and is then fed to a FLAN-T5 Synthesizer, which generates a refined caption.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>. More specifically, given a test image 𝑥, we calculate for each concept 𝑐 𝑖 ∈ 𝐶 a score 𝑓 𝑖 (𝑥; 𝑤 1 , . . . , 𝑤 𝑘 ) from the 𝑘 neighbors retrieved for 𝑥: 𝑓 𝑖 (𝑥; 𝑤 1 , . . . , 𝑤 𝑘 ) = 𝑦 𝑖,𝑗,𝑥 = 1 if concept 𝑐 𝑖 is present in the ground truth of the 𝑗-th neighbor of 𝑥, otherwise 𝑦 𝑖,𝑗,𝑥 = 0, and 𝑤 𝑗 is the weight assigned to the 𝑗-th nearest neighbor position; we explain below how the weights 𝑤 𝑗 are learned. Concept 𝑐 𝑖 is predicted for the test image 𝑥 if and only if 𝑓 𝑖 (𝑥; 𝑤 1 , . . . , 𝑤 𝑘 ) ≥ 𝑡, yielding the predicted label set 𝐻(𝑥; 𝑤 1 , . . . , 𝑤 𝑘 ) = {𝑐 𝑖 |𝑓 𝑖 (𝑥; 𝑤 1 , . . . , 𝑤 𝑘 ) ≥ 𝑡}. The classification threshold 𝑡 ∈ [0, 1] and the number of neighbors 𝑘 ∈ [1, 100] were tuned on our validation set, resulting in 𝑡 = 0.35 and 𝑘 = 50. The weights 𝑤 1 , . . . , 𝑤 𝑘 are the same for all the concepts 𝑐 𝑖 and test images 𝑥. They are learned using a genetic algorithm (GA)<ref type="bibr" target="#b27">[28]</ref> by maximizing the following objective, where 𝑉 denotes the validation set, 𝑌 (𝑥) is the ground truth set of concepts of image 𝑥, and 𝐹 1 is the official evaluation measure of the Concept Detection task:</figDesc><table><row><cell></cell><cell>∑︀ 𝑘 𝑗=1 𝑤 𝑗 • 𝑦 𝑖,𝑗,𝑥 𝑗=1 𝑤 𝑗 ∑︀ 𝑘</cell><cell>(4)</cell></row><row><cell>𝑤 1 ,...,𝑤 𝑘 where max</cell><cell>∑︁</cell></row></table><note>𝑥∈𝑉𝐹 1 (𝑌 (𝑥), 𝐻(𝑥; 𝑤 1 , . . . , 𝑤 𝑘 ))</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 Summary of the scores of our individual experiments (ensembles included) in the Image- CLEFmedical2024 Concept Detection sub-task.</head><label>4</label><figDesc>This table presents the highest scores of our systems on our held-out development set for each method.</figDesc><table><row><cell></cell><cell cols="2">Individual Concept Detection Experiments</cell></row><row><cell>Run ID</cell><cell>Method</cell><cell>Development</cell></row><row><cell>619</cell><cell>CNN+FFNN (DenseNet)</cell><cell>0.6007</cell></row><row><cell>624</cell><cell>CNN+KNN</cell><cell>0.6007</cell></row><row><cell>640</cell><cell>INTERSECTION(UNION(3xCNN+FFNN),624)</cell><cell>0.6022</cell></row><row><cell>642</cell><cell>UNION(2xCNN+FFNN)</cell><cell>0.6047</cell></row><row><cell>644</cell><cell>CNN+FFNN (EfficientNet)</cell><cell>0.6042</cell></row><row><cell>648</cell><cell>UNION(644,624)</cell><cell>0.6045</cell></row><row><cell>651</cell><cell>CNN+wKNN</cell><cell>0.5961</cell></row><row><cell>654</cell><cell>UNION(651,644)</cell><cell>0.6008</cell></row><row><cell>655</cell><cell>UNION(651,624)</cell><cell>0.5970</cell></row><row><cell>656</cell><cell>UNION(651,619)</cell><cell>0.5981</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 Summary of our submissions to the ImageCLEFmedical2024 Concept Detection sub-task.</head><label>5</label><figDesc>The table presents the scores of our systems on both our held-out development set and the official test set<ref type="bibr" target="#b23">[24]</ref>. It also includes the rankings of these systems among all submissions from the 9 participating teams.</figDesc><table><row><cell></cell><cell cols="3">Individual Concept Detection Experiments</cell><cell></cell></row><row><cell>Run ID</cell><cell>Method</cell><cell cols="2">Primary F1</cell><cell cols="2">Secondary F1 Rank</cell></row><row><cell></cell><cell></cell><cell>Dev</cell><cell>Test</cell><cell></cell></row><row><cell>619</cell><cell>CNN+FFNN (DenseNet)</cell><cell cols="2">0.6007 0.6240</cell><cell>0.9339</cell><cell>12</cell></row><row><cell>624</cell><cell>CNN+KNN</cell><cell cols="2">0.6007 0.6274</cell><cell>0.9375</cell><cell>8</cell></row><row><cell>640</cell><cell>INTERSECTION(UNION(3xCNN+FFNN),624)</cell><cell cols="2">0.6022 0.6272</cell><cell>0.9415</cell><cell>10</cell></row><row><cell>642</cell><cell>UNION(2xCNN+FFNN)</cell><cell cols="2">0.6047 0.6304</cell><cell>0.9332</cell><cell>7</cell></row><row><cell>644</cell><cell>CNN+FFNN (EfficientNet)</cell><cell cols="2">0.6042 0.6319</cell><cell>0.9392</cell><cell>4</cell></row><row><cell>648</cell><cell>UNION(644,624)</cell><cell cols="2">0.6045 0.6308</cell><cell>0.9321</cell><cell>6</cell></row><row><cell>651</cell><cell>CNN+wKNN</cell><cell cols="2">0.5961 0.6135</cell><cell>0.9238</cell><cell>17</cell></row><row><cell>654</cell><cell>UNION(651,644)</cell><cell cols="2">0.6008 0.6207</cell><cell>0.9243</cell><cell>13</cell></row><row><cell>655</cell><cell>UNION(651,624)</cell><cell cols="2">0.5970 0.6155</cell><cell>0.9233</cell><cell>16</cell></row><row><cell>656</cell><cell>UNION(651,619)</cell><cell cols="2">0.5981 
0.6162</cell><cell>0.9217</cell><cell>15</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Captions generated by our submitted models for the test image<ref type="bibr" target="#b23">[24]</ref> CC BY[Muacevic et al. (2024)]    </figDesc><table><row><cell></cell><cell>Generated captions</cell></row><row><cell>InstructBLIP</cell><cell>Diffusion-weighted magnetic resonance imaging of the brain</cell></row><row><cell></cell><cell>showing a hyperintense lesion in the right temporal lobe.</cell></row><row><cell>InstructBLIP + Synthesizer</cell><cell>magnetic resonance imaging of the head and neck showing a</cell></row><row><cell></cell><cell>hyperintense lesion in the right internal carotid.</cell></row><row><cell>InstructBLIP + Rephraser</cell><cell>Axial computed tomography scan of the head showing a mass</cell></row><row><cell></cell><cell>in the left maxillary sinus (arrow).</cell></row><row><cell>InstructBLIP + Synthesizer +</cell><cell>Computed tomography scan of the head and neck showing a</cell></row><row><cell>Rephraser</cell><cell>mass in the right parotid gland.</cell></row><row><cell cols="2">InstructBLIP + DMMCS (alpha 0.1) Chest X-ray showing bilateral pulmonary edema.</cell></row><row><cell>InstructBLIP + DMMCS (alpha 0.1)</cell><cell>Computed tomography scan of the head and neck showing a</cell></row><row><cell>+ Rephraser</cell><cell>mass in the right parotid gland.</cell></row><row><cell>InstructBLIP + DMMCS (alpha 0.1)</cell><cell></cell></row><row><cell>+ Rephraser (random restart)</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table Run</head><label>Run</label><figDesc></figDesc><table><row><cell>ID</cell><cell>Approach</cell><cell cols="2">BERTScore</cell><cell cols="2">ROUGE-1</cell><cell>Rank</cell></row><row><cell></cell><cell></cell><cell>Dev</cell><cell>Test</cell><cell>Dev</cell><cell>Test</cell></row><row><cell>564</cell><cell>InstructBLIP</cell><cell cols="4">0.6164 0.6152 0.1931 0.2052</cell><cell>22</cell></row><row><cell>577</cell><cell>InstructBLIP + Rephraser</cell><cell cols="4">0.7651 0.6106 0.1840 0.1837</cell><cell>26</cell></row><row><cell>605</cell><cell>InstructBLIP + Synthesizer</cell><cell cols="4">0.6194 0.6113 0.1898 0.1889</cell><cell>24</cell></row><row><cell>630</cell><cell>InstructBLIP + DMMCS</cell><cell cols="4">0.6564 0.6211 0.2027 0.2048</cell><cell>10</cell></row><row><cell></cell><cell>(𝛼 = 0.1)</cell><cell></cell><cell></cell><cell></cell></row><row><cell>635</cell><cell>InstructBLIP + DMMCS</cell><cell cols="4">0.6534 0.6210 0.2025 0.2047</cell><cell>11</cell></row><row><cell></cell><cell>(𝛼 = 0.05)</cell><cell></cell><cell></cell><cell></cell></row><row><cell>639</cell><cell>InstructBLIP + Synthesizer +</cell><cell cols="4">0.7603 0.6111 0.1840 0.1827</cell><cell>25</cell></row><row><cell></cell><cell>Rephraser</cell><cell></cell><cell></cell><cell></cell></row><row><cell>647</cell><cell>InstructBLIP + DMMCS (𝛼 = 0.1)</cell><cell cols="4">0.7981 0.6209 0.1928 0.1807</cell><cell>13</cell></row><row><cell></cell><cell>+ ClinicalT5</cell><cell></cell><cell></cell><cell></cell></row><row><cell>650</cell><cell>InstructBLIP + DMMCS (𝛼 = 0.1)</cell><cell cols="4">0.8012 0.6159 0.1932 0.1936</cell><cell>20</cell></row><row><cell></cell><cell>+ ClinicalT5 (random restart)</cell><cell></cell><cell></cell><cell></cell></row><row><cell>646</cell><cell>InstructBLIP + DMMCS</cell><cell cols="4">0.6530 0.6209 0.2024 0.2044</cell><cell>12</cell></row><row><cell></cell><cell>(𝛼 = 
0.15)</cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 8 Summary of our submissions regarding the Caption Prediction sub-task.</head><label>8</label><figDesc>The table contains each system's performance on all officially reported measures.</figDesc><table /><note>AUEB NLP</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Group Submissions -Evaluation on All Metrics Run ID BERTScore ROUGE BLEU-1 BLEURT METEOR CIDEr CLIPscore RefCLIPscore ClinicalBLEURT MedBERTScore Rank</head><label></label><figDesc></figDesc><table><row><cell>630</cell><cell>0.6211</cell><cell>0.2049</cell><cell>0.1110</cell><cell>0.2899</cell><cell>0.0680</cell><cell>0.1769</cell><cell>0.8041</cell><cell>0.7987</cell><cell>0.4866</cell><cell>0.6261</cell><cell>10</cell></row><row><cell>635</cell><cell>0.6210</cell><cell>0.2047</cell><cell>0.1108</cell><cell>0.2895</cell><cell>0.0680</cell><cell>0.1762</cell><cell>0.8040</cell><cell>0.7986</cell><cell>0.4870</cell><cell>0.6260</cell><cell>11</cell></row><row><cell>646</cell><cell>0.6210</cell><cell>0.2044</cell><cell>0.1107</cell><cell>0.2900</cell><cell>0.0678</cell><cell>0.1758</cell><cell>0.8041</cell><cell>0.7988</cell><cell>0.4872</cell><cell>0.6261</cell><cell>12</cell></row><row><cell>647</cell><cell>0.6210</cell><cell>0.1807</cell><cell>0.0860</cell><cell>0.2846</cell><cell>0.0580</cell><cell>0.1459</cell><cell>0.7936</cell><cell>0.7912</cell><cell>0.5021</cell><cell>0.6291</cell><cell>13</cell></row><row><cell>650</cell><cell>0.6160</cell><cell>0.1936</cell><cell>0.1050</cell><cell>0.2859</cell><cell>0.0638</cell><cell>0.1597</cell><cell>0.7980</cell><cell>0.7948</cell><cell>0.4874</cell><cell>0.6212</cell><cell>20</cell></row><row><cell>564</cell><cell>0.6153</cell><cell>0.2052</cell><cell>0.1274</cell><cell>0.2920</cell><cell>0.0698</cell><cell>0.1728</cell><cell>0.8045</cell><cell>0.7968</cell><cell>0.4844</cell><cell>0.6197</cell><cell>22</cell></row><row><cell>605</cell><cell>0.6114</cell><cell>0.1889</cell><cell>0.1147</cell><cell>0.2796</cell><cell>0.0616</cell><cell>0.1305</cell><cell>0.8037</cell><cell>0.7962</cell><cell>0.4834</cell><cell>0.6174</cell><cell>24</cell></row><row><cell>639</cell><cell>0.6111</cell><cell>0.1827</cell><cell>0.0744</cell><cell>0.2717</cell><ce
ll>0.0515</cell><cell>0.1293</cell><cell>0.7858</cell><cell>0.7845</cell><cell>0.5212</cell><cell>0.6141</cell><cell>25</cell></row><row><cell>577</cell><cell>0.6107</cell><cell>0.1838</cell><cell>0.0751</cell><cell>0.2706</cell><cell>0.0513</cell><cell>0.1292</cell><cell>0.7832</cell><cell>0.7826</cell><cell>0.5158</cell><cell>0.6134</cell><cell>26</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://github.com/nlpaueb/dmmcs, Last accessed: 2024-06-20.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://github.com/nlpaueb/imageclef2024, Last accessed: 2024-06-20.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">PMC Open Access: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, Last accessed: 2024-06-20</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">UMLS: https://www.nlm.nih.gov/research/umls/index.html, Last accessed: 2024-06-20</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://www.physionet.org/content/clinical-t5/1.0.0/, Last accessed: 2024-06-20</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>(I. Androutsopoulos) https://www.linkedin.com/in/marina-samprovalaki/ (M. Samprovalaki); https://www.linkedin.com/in/anna-chatzipapadopoulou/ (A. Chatzipapadopoulou); https://geomos.sites.aueb.gr/ (G. Moschovis); https://pkaliosis.github.io (P. Kaliosis); https://ipavlopoulos.github.io/ (J. Pavlopoulos); https://www.aueb.gr/users/ion/ (I. Androutsopoulos) 0000-0003-0547-0581 (G. Moschovis); 0000-0001-9188-742 (J. Pavlopoulos)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEF 2024: Multimedia retrieval in medical applications</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drăgulinescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garcıa Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M G</forename><surname>Pakull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Damm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Andrei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Prokopchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karpenka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radzhabov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Macaire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lecouteux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Esperança-Rodier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yetisgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Thambawita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Storås</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Halvorsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heinrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024)</title>
		<title level="s">Springer Lecture Notes in Computer Science LNCS</title>
		<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEFmedical 2024 - Caption Prediction and Concept Detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Damm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M G</forename><surname>Pakull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2024 Working Notes, CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Diagnostic Captioning: A Survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kougia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Papamichail</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2101.07299</idno>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="1" to="32" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation</title>
		<author>
			<persName><forename type="first">H.-C</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Demner-Fushman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Summers</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1603.08486</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2497" to="2506" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Medical image captioning based on Deep Architectures</title>
		<author>
			<persName><forename type="first">G</forename><surname>Moschovis</surname></persName>
		</author>
		<ptr target="http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-323528" />
		<imprint>
			<date type="published" when="2022">2022. 2024-06-20</date>
			<pubPlace>Stockholm, Sweden</pubPlace>
		</imprint>
		<respStmt>
			<orgName>KTH Royal Institute of Technology</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A Survey on Biomedical Image Captioning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kougia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W19-1803</idno>
		<ptr target="https://aclanthology.org/W19-1803" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Shortcomings in Vision and Language, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Bernardi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fernandez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Gella</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Kafle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Kanan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Nabi</surname></persName>
		</editor>
		<meeting>the Second Workshop on Shortcomings in Vision and Language, Association for Computational Linguistics<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019. 2024-06-20</date>
			<biblScope unit="page" from="26" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A data-driven guided decoding mechanism for diagnostic captioning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Kaliosis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Charalampakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Moschovis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2024</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.18223</idno>
		<idno type="arXiv">arXiv:2303.18223</idno>
		<title level="m">A Survey of Large Language Models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M H</forename><surname>Tiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.06500</idno>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Radiology Objects in COntext (ROCO): A Multimodal Image Dataset</title>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nensa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Friedrich</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-01364-6_20</idno>
	</analytic>
	<monogr>
		<title level="m">7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop</title>
				<meeting><address><addrLine>LABELS; Granada, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-09-16">2018. September 16, 2018. 2018</date>
			<biblScope unit="page" from="180" to="189" />
		</imprint>
	</monogr>
	<note>Proceedings</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Scaling Instruction-Finetuned Language Models</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Longpre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brahma</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2210.11416</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="1" to="53" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Knowledge-Driven Encode, Retrieve, Paraphrase for Medical Image Report Generation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v33i01.33016666</idno>
	</analytic>
	<monogr>
		<title level="m">AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Small Language Models Improve Giants by Rewriting Their Outputs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Vernikos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Brazinskas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Adamek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mallinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Severyn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Malmi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13514</idno>
		<ptr target="https://aclanthology.org/2024.eacl-long.165" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Graham</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Purver</surname></persName>
		</editor>
		<meeting>the 18th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>St. Julians, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-06-20">2024. 2024-06-20</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2703" to="2718" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">ClinicalT5: A Generative Language Model for Clinical Text</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.findings-emnlp.398</idno>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2022</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="5436" to="5443" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2010.11929</idno>
		<ptr target="https://openreview.net/forum?id=YicbFdNTTy" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021. 2024-06-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Weakly-Supervised Semantic Segmentation via Transformer Explainability</title>
		<author>
			<persName><forename type="first">I</forename><surname>Athanasiadis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Moschovis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tuoma</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.6574631</idno>
	</analytic>
	<monogr>
		<title level="m">ML Reproducibility Challenge</title>
				<imprint>
			<date type="published" when="2021">2021. 2022</date>
		</imprint>
	</monogr>
	<note>Fall Edition</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">NeuralDynamicsLab at ImageCLEF Medical</title>
		<author>
			<persName><forename type="first">G</forename><surname>Moschovis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fransén</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2022 Working Notes, CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Bologna, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022. 2022</date>
		</imprint>
		<respStmt>
			<orgName>CEUR-WS.org</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">AUEB NLP Group at ImageCLEFmed Caption</title>
		<author>
			<persName><forename type="first">V</forename><surname>Kougia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum</title>
				<meeting><address><addrLine>Lugano, Switzerland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-09-09">2019. September 9-12. 2380. 2019</date>
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">AUEB NLP Group at ImageCLEFmed Caption</title>
		<author>
			<persName><forename type="first">B</forename><surname>Karatzas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kougia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting><address><addrLine>Thessaloniki, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020-09-22">2020. September 22-25. 2696. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">AUEB NLP Group at ImageCLEFmed Caption Tasks</title>
		<author>
			<persName><forename type="first">F</forename><surname>Charalampakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karatzas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kougia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum<address><addrLine>Bucharest, Romania</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021-09-21">2021. September 21-24. 2936. 2021</date>
			<biblScope unit="page" from="1184" to="1200" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">AUEB NLP Group at ImageCLEFmedical Caption</title>
		<author>
			<persName><forename type="first">F</forename><surname>Charalampakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zachariadis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karatzas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Trakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org</title>
				<meeting><address><addrLine>Bologna, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022. 2022</date>
			<biblScope unit="page" from="1355" to="1373" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">AUEB NLP Group at ImageCLEFmedical Caption</title>
		<author>
			<persName><forename type="first">P</forename><surname>Kaliosis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Moschovis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Charalampakos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org" />
	</analytic>
	<monogr>
		<title level="m">CLEF2023 Working Notes, CEUR Workshop Proceedings, CEUR-WS</title>
				<meeting><address><addrLine>Thessaloniki, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023. 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">The Unified Medical Language System (UMLS): integrating biomedical terminology</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bodenreider</surname></persName>
		</author>
		<idno type="DOI">10.1093/nar/gkh061</idno>
	</analytic>
	<monogr>
		<title level="j">Nucleic acids research</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="D267" to="D270" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Horn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nensa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41597-024-03496-6</idno>
		<ptr target="https://arxiv.org/abs/2405.10004v1" />
		<title level="m">ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset, Scientific Data</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Fine-Tuning CNN Image Retrieval with No Human Annotation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Radenović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tolias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Chum</surname></persName>
		</author>
		<idno type="DOI">10.1109/TPAMI.2018.2846566</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="1655" to="1668" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Adam: A Method for Stochastic Optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Ba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">3rd International Conference on Learning Representations, ICLR 2015</title>
				<meeting><address><addrLine>San Diego, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">May 7-9, 2015. 2015</date>
		</imprint>
	</monogr>
	<note>Conference Track Proceedings</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">A Ranking-based KNN Approach for Multi-Label Classification</title>
		<author>
			<persName><forename type="first">T.-H</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-Y</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-D</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Asian Conference on Machine Learning</title>
				<meeting>the Asian Conference on Machine Learning<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="81" to="96" />
		</imprint>
		<respStmt>
			<orgName>Singapore Management University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Eiben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Smith</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-662-44874-8</idno>
		<title level="m">Introduction to Evolutionary Computing</title>
				<imprint>
			<publisher>Springer Publishing Company, Incorporated</publisher>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>2nd ed</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Finetuned Language Models Are Zero-Shot Learners</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Guu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2109.01652</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C H</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2301.12597</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:256390509" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2023">2023. 2024-06-20</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2312.10997</idno>
		<title level="m">Retrieval-Augmented Generation for Large Language Models: A Survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kuttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<idno>abs/2005.11401</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Information Processing Systems</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">A Survey on Retrieval-Augmented Text Generation for Large Language Models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2404.10981</idno>
		<idno type="arXiv">arXiv:2404.10981</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.1910.10683</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of machine learning research</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">67</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">Exploring Uni-modal, Multi-modal and Few-Shot Deep Learning Methods for Diagnostic Captioning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Kaliosis</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
		<respStmt>
			<orgName>Department of Informatics, Athens University of Economics and Business</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">M.Sc. thesis</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">BERTScore: Evaluating text generation with BERT</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<idno>abs/1904.09675</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">ROUGE: A Package for Automatic Evaluation of Summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004. 2024-06-20</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">BLEU: a Method for Automatic Evaluation of Machine Translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Isabelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</editor>
		<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002. 2024-06-20</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">BLEURT: Learning Robust Metrics for Text Generation</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Parikh</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.704</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.704" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020. 2024-06-20</date>
			<biblScope unit="page" from="7881" to="7892" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<idno type="DOI">10.3115/1626355.1626389</idno>
		<ptr target="https://aclanthology.org/W05-0909" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</editor>
		<meeting>the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics<address><addrLine>Ann Arbor, Michigan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005">2005. 2024-06-20</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Training language models to follow instructions with human feedback</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Wainwright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Slama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Kelton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">E</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Welinder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Christiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leike</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Lowe</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2203.02155</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Information Processing Systems</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2005.11401</idno>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="9459" to="9474" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
