<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Medical Image Interpretation with Large Multimodal Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahmudul Hoque</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Rakibul Hasan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md. Ismail Siddiqi Emon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fahmi Khalifa</string-name>
          <email>fahmi.khalifa@morgan.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Mahmudur Rahman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Morgan State University</institution>
          ,
          <addr-line>1700 East Cold Spring Lane, Baltimore, Maryland 21251</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical and Computer Engineering Department, School of Engineering, Morgan State University</institution>
          ,
          <addr-line>Baltimore MD 21251</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>This working note documents the participation of CS_Morgan in the ImageCLEFmedical 2024 Caption subtasks, focusing on the Caption Prediction and Concept Detection challenges. The primary objectives included training, validating, and testing multimodal Artificial Intelligence (AI) models intended to automate the process of generating captions and identifying multi-concepts of radiology images. The dataset used is a subset of the Radiology Objects in COntext version 2 (ROCOv2) dataset and contains image-caption pairs and corresponding Unified Medical Language System (UMLS) concepts. To address the caption prediction challenge, different variants of the Large Language and Vision Assistant (LLaVA) models were experimented with, tailoring them for the medical domain. Additionally, a lightweight Large Multimodal Model (LMM) and MoonDream2, a small Vision Language Model (VLM), were explored. The former is the instruct variant of the Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS (IDEFICS) 9B obtained through quantization. Besides LMMs, conventional encoder-decoder models like Vision Generative Pre-trained Transformer 2 (visionGPT2) and Convolutional Neural Network-Transformer (CNN-Transformer) architectures were considered. Consequently, this enabled 10 submissions for the caption prediction task, with the first submission of LLaVA 1.6 on the Mistral 7B weights securing the 2nd position among the participants. This model was adapted using 40.1M parameters and achieved the best performance on the test data across the performance metrics of BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682), CIDEr (0.245029), and RefCLIPScore (0.815534). For the concept detection task, our single submission based on the ConvMixer architecture, a hybrid approach leveraging the advantages of CNNs and Transformers, ranked 9th with an F1-score of 0.107645. 
Overall, the evaluations on the test data for the caption prediction task submissions suggest that LMMs, quantized LMMs, and small VLMs, when adapted and selectively fine-tuned using fewer parameters, have ample potential for understanding medical concepts present in images.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Multimodal Models</kwd>
        <kwd>Vision Language Models</kwd>
        <kwd>Transformer</kwd>
        <kwd>Large Language and Vision Assistant</kwd>
        <kwd>Caption Prediction</kwd>
        <kwd>Concept Detection</kwd>
        <kwd>Medical Images</kwd>
        <kwd>Low-Rank Adaptation</kwd>
        <kwd>Quantization</kwd>
        <kwd>Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS</kwd>
        <kwd>Vision Generative Pre-trained Transformer 2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The tasks of automatic caption generation and multi-label prediction from medical images have become
crucial for improving healthcare due to the growing availability of medical images from different
modalities like X-radiation (X-ray), Computed Tomography (CT), Positron Emission Tomography (PET),
Magnetic Resonance Imaging (MRI), and Ultrasound (US), as well as the significant advancements in
the computing power of modern graphics processing units [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. The increasing need for diagnostic
radiology services and the lack of report writing expertise in many medical facilities highlight the need
for automating the mentioned tasks. As a result, extensive applications of recently developed AI models
have been found in these domains. As an active research area of AI, combining large language models
(LLMs) with vision capabilities allows users to explore emergent abilities using multimodal data, which
is being popularized as LMMs or VLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example, LLaVA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Flamingo [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and Contrastive
Language-Image Pretraining (CLIP) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have shown remarkable performance in various vision-text
tasks. Consequently, there is also potential for applying LLMs in the biomedical imaging field [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These
models are trained on extensive databases of human knowledge, demonstrating remarkable capabilities
in offering valuable insights to physicians and healthcare professionals [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Utilizing knowledge from
millions to billions of training examples, VLMs can help detect minor abnormalities in low-resolution
radiology images that are difficult to spot with the naked eye [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Moreover, pre-trained LLMs like
ChatGPT-4 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] exhibit emergent abilities on tasks they were not specifically trained for (i.e.,
the vision-language domain) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Models like BiomedCLIP [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], ChatDoctor [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and GatorTron [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which
are pretrained on high-quality medical datasets, offer more useful applications for medical domain users.
In this working note, various multimodal models, initially pretrained on multimodal image-instruction
pairs from diverse sources, are demonstrated. This approach allowed for attaining competitive
results in this competition on analyzing medical images such as brain MRI, chest X-ray, and PET.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Objectives</title>
      <p>
        For the ImageCLEFmedical Caption 2024 [
        <xref ref-type="bibr" rid="ref17">17</xref>
] challenge, CS_Morgan, a participant in the competition,
was tasked with developing solutions to automatically predict captions and identify multi-label concepts
of radiology images from the ROCOv2 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] dataset. Considering the tasks, the objectives include the
following:
• Concept Detection [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: This task involved identifying and locating relevant concepts in the
specified dataset. This formed the foundation for scene understanding and was essential for
context-based image and information retrieval. The evaluation process was conducted using
metrics like F1-score.
• Caption Prediction [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: This task focused on predicting coherent captions for the entire image
test dataset using the detected concepts and their interactions within the image. This task provided
insights into the interplay of visual elements. Evaluation metrics used for this task consisted
of BERTScore (as a primary approach), ROUGE (as a secondary approach), BLEU-1, BLEURT,
METEOR, CIDEr, CLIPScore, RefCLIPScore, ClinicalBLEURT, and MedBERTScore.
      </p>
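The F1-score used as the primary concept detection metric can be computed from per-image true positives, false positives, and false negatives. A minimal sketch with hypothetical counts (not challenge data):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, as used for multi-label concept detection."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 2 concepts correctly predicted, 1 spurious, 3 missed
print(f1_score(tp=2, fp=1, fn=3))  # 0.5
```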
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        The dataset for both tasks included curated images from ROCOv2 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], an updated version of the original
ROCO [
        <xref ref-type="bibr" rid="ref20">20</xref>
] dataset. The medical images were collected from biomedical articles in the PMC Open Access subset
and were accompanied by corresponding captions and concepts. The latter were also expressed using
UMLS [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] terms. The training, validation, and test sets contained 70,108, 9,972, and 17,237 radiology
images, respectively, with the average dimensions of the images being 600 × 600. As a result, for the deep
learning models implemented here, the images were resized to that average dimension, and the smaller
images were padded to have a uniform distribution of image dimensions. Furthermore, the length
of captions in words (excluding punctuation) or tokens for each image was 100 or fewer on average.
Moreover, by analyzing both training and validation image-caption pairs, 42,121 unique words (excluding
punctuation) were found and used as the set of vocabulary in the models implemented. Additionally,
there were 1,944 unique CUIs found in the concept list of the train and validation images, among which
1,934 were enlisted in the CUI mapping file.
      </p>
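The padding of smaller images up to the 600 × 600 target can be sketched as follows. This is an illustrative NumPy version; centered zero-padding is an assumption, since the exact padding scheme is not specified in the text:

```python
import numpy as np

def pad_to_target(img: np.ndarray, target: int = 600) -> np.ndarray:
    """Zero-pad an H x W x C image so both spatial dimensions reach `target`.
    Larger images would be resized to the target upstream."""
    h, w, c = img.shape
    canvas = np.zeros((target, target, c), dtype=img.dtype)
    top, left = (target - h) // 2, (target - w) // 2   # centered placement (assumed)
    canvas[top:top + h, left:left + w] = img
    return canvas

small = np.ones((512, 400, 3), dtype=np.uint8)
padded = pad_to_target(small)
print(padded.shape)  # (600, 600, 3)
```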
    </sec>
    <sec id="sec-4">
      <title>4. Large Multimodal Models (LMMs)</title>
      <p>
        LMMs, as an extended variation of LLMs, mark a major leap forward in AI by handling and comprehending
various data types, including text, images, audio, and video [
        <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
        ]. By integrating and interpreting
information from these diverse sources, LMMs achieve a holistic understanding of complex data [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ].
This capability allows them to perform sophisticated tasks, such as image captioning, visual question
answering, and content recommendation, by leveraging the relationships between diferent data types
[
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
]. Figure 1 demonstrates the theoretical architecture of LMMs.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Pre-training and Fine-tuning of LMMs</title>
        <p>
          During pre-training, the model is initially trained on vast and diverse datasets, enabling it to learn
general representations before being fine-tuned for specific tasks. This involves utilizing large-scale
datasets that include various modalities [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
          ]. For instance, models like ViLBERT [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] have been
pre-trained on extensive image-text pairs to increase their performance in downstream tasks like image
captioning and visual question answering [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
          ].
        </p>
        <p>
          Fine-tuning LMMs involves adjusting all pre-trained model parameters to enhance performance
on specific tasks, such as image captioning. This process is computationally intensive and
resource-demanding, especially for models with billions of parameters. Despite these challenges, the full
fine-tuning technique remains popular due to its potential for achieving high accuracy. For instance, models
like BLIP-2 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and InstructBLIP [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] have demonstrated enhancements in image captioning tasks
through full fine-tuning, utilizing their extensive pre-training on large datasets to adapt to specific tasks.
However, the substantial computational and memory requirements make full fine-tuning impractical
for many applications, leading to the exploration of more efficient fine-tuning methods.
        </p>
        <p>
          As a result, Parameter-Efficient Fine-Tuning (PEFT) [
          <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
          ] presents a more efficient approach
compared to full fine-tuning by modifying only a small portion of the model’s parameters while leaving
the majority unchanged. This strategy substantially decreases computational and memory demands,
making it suitable for a variety of applications. In the domain of image captioning, PEFT techniques have
proven effective with models such as mPLUG [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] and LLaVA [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Notably, approaches like Low-Rank
Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] have been particularly successful in fine-tuning. LoRA optimizes a matrix of
updates to the pre-trained model weights rather than directly modifying them. This update matrix
is decomposed into two smaller, lower-rank matrices, reducing the number of parameters that need
updating while preserving the original weights [
          <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
          ]. This allows different task-specific LoRAs to
be easily swapped, effectively tailoring the pre-trained model for various applications. LoRA matches
the performance of the full fine-tuning technique by updating a small number of additional weights,
preventing catastrophic forgetting, and enabling better generalization with limited data [
          <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
          ]. Figure
2 compares the approaches of LoRA and linear projection techniques.
        </p>
        <p>
          Figure 2 indicates that the LoRA approach involves two matrices, A and B. The matrix A is the first
step in the adaptation process, projecting high-dimensional input features into a lower-dimensional
latent space. Typically, its shape includes two values: the rank and the original dimension (e.g., 32 and 4096).
The matrix B is the second component, mapping the lower-dimensional features back to the original
high-dimensional space, effectively reversing the reduction performed by the matrix A, so its shape
becomes [4096, 32]. Both the A and B matrices are trainable and updated during fine-tuning. LoRA
focuses on specific weight matrices within the model, for example, the query, key, and value matrices
in Transformer [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] architectures. However, traditional Transformers are hindered by their slow
performance and high memory consumption, particularly with long sequences, due to the quadratic
time and memory complexity of self-attention. Flash Attention [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] addresses these issues with an
IO-aware exact attention algorithm that utilizes tiling to reduce the number of memory reads and writes
between the GPU’s high-bandwidth memory (HBM) and on-chip SRAM.
        </p>
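The update described above, a frozen weight W plus a trainable low-rank product BA, can be illustrated with a small NumPy sketch. The shapes follow the [32, 4096] and [4096, 32] example in the text; the initialization conventions (random A, zero B) are the usual LoRA defaults and are assumptions here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 32                          # original dimension and LoRA rank from the text

W = rng.standard_normal((d, d))          # frozen pre-trained weight matrix
A = rng.standard_normal((r, d)) * 0.01   # down-projection, shape [32, 4096]
B = np.zeros((d, r))                     # up-projection, shape [4096, 32]

def lora_forward(x):
    # W stays frozen; only A and B receive gradient updates during fine-tuning
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
# With B initialized to zeros, the adapted layer starts identical to the original
assert np.allclose(lora_forward(x), x @ W.T)
```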
        <p>
          Visual instruction tuning [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] enhances LMMs by fine-tuning them with instructions that combine
visual and textual data. This technique uses machine-generated instruction-following data to improve
the model’s zero-shot and few-shot performance on new tasks. For example, the LLaVA [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] model
integrates a vision encoder with an LLM for general-purpose visual and language understanding. The
process involves generating detailed, context-aware language-image instructions using a language-only
model like GPT-4 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This data is then used to train the LMM, enabling it to perform tasks such as
image captioning, visual question answering, and detailed image descriptions.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Large Language and Vision Assistant (LLaVA)</title>
        <p>
          LLaVA [
          <xref ref-type="bibr" rid="ref37 ref38">37, 38</xref>
          ] stands as a comprehensive, end-to-end trained multimodal model that seamlessly merges
a vision encoder and an LLM to facilitate broad-ranging visual and language comprehension (see Figure
3). The vision encoder is tasked with processing input images (X_v) and transforming them into a series
of feature representations (Z_v). Situated above the vision encoder is the projection (W), functioning as
a vital conduit between the vision encoder and the language model. The projection matrix facilitates the
conversion of feature representations (Z_v) from the vision encoder into a compatible format (H_v) for
the language model. On the right side of the diagram, the Language Instruction input (X_q) represents
the textual component that the model must comprehend and respond to in conjunction with the visual
input. This input undergoes processing by the language model, generating its own set of feature
representations (H_q). The Language Model (f_φ) (e.g., Vicuna 7B [
          <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
          ] or Mistral 7B [
          <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
          ] in
this working note) ingests both the projected vision features (H_v) and the language features (H_q),
seamlessly integrating them to produce a Language Response (X_a). The resulting output constitutes a
coherent response incorporating elements from both visual and textual inputs. Figure 3 shows the basic
architecture of LLaVA and demonstrates its working principles.
        </p>
        <sec id="sec-4-2-1">
          <title>4.2.1. LLaVA-v.1.6-Vicuna-7B</title>
          <p>
            The Vicuna 7B [
            <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
            ] language model components include (see Figure 4): (a) Embedding Layer - converts input tokens into dense vectors with an embedding dimension of 4,096; (b) Decoder Layers - consist of 32 LLaMA-based decoder layer instances, where each layer includes a self-attention mechanism, a Multi-layer Perceptron (MLP) using Sigmoid Linear Unit (SiLU) activation, and Root Mean Square (RMS) normalization layers applied before and after the attention mechanism; and (c) Final Normalization Layer - an RMS normalization layer applied to the final output of the decoder layers. The model supports input image resolutions of 672 × 672, 336 × 1344, and 1344 × 336, enhancing visual detail comprehension.
          </p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. LLaVA-v.1.6-Mistral-7B</title>
          <p>
            The LLaVA v.1.6 Mistral 7B [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ] model integrates several key components for its functionality (see Figure 5). At its core is the vision encoder, utilizing a pre-trained CLIP ViT-L/14 [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ] to extract visual embeddings from high-resolution images. This encoder processes visual input, converting it into a format compatible with the language model. The language model itself is based on the Mistral-7B architecture, which inherently incorporates advanced features like Sliding Window Attention and Grouped-Query Attention, enhancing its capability to manage long sequences and improving inference efficiency [
            <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
            ]. Additionally, a two-layer MLP projection matrix is employed to map the visual embeddings from the vision encoder into the same embedding space as the language model, ensuring seamless integration of visual and textual information. The CLIP ViT-L/14 [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ], a Vision Transformer (ViT) with a 14 × 14 pixel patch size, is renowned for its ability to handle complex visual tasks, contributing to the model's overall performance.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Caption Prediction Task</title>
      <p>
        To address the caption prediction task, the CS_Morgan team fine-tuned several LMMs that were
pretrained using extensive standard datasets from the field of computer vision. These models were derived
from well-known LLMs commonly utilized in Natural Language Processing (NLP). Ten submissions
were made, and the technical details, methods, and approaches of these submissions are detailed in the
following sections. Moreover, the reproducible code relevant to the following submissions can be found
here [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ].
      </p>
      <p>Before any tasks are performed, the dataset is pre-processed to ensure that it is clean and correctly
formatted. Beyond the initial image-text pre-processing described earlier, the training, validation,
and testing datasets were structured for generating captions to meet the input requirements of the
corresponding vision-language models. Furthermore, the dataset was managed using the Hugging Face
Hub. Computational details can be found in Appendix A.</p>
      <sec id="sec-5-1">
        <title>5.1. Submission 1: Selective fine-tuning of LLaVA-v.1.6-Mistral-7B</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Model Description</title>
          <p>For this submission, the pre-trained LLaVA 1.6 on Mistral 7B weights was loaded using
Mistral-7B-Instruct-v.0.2 as the base LLM, and Flash Attention was used to optimize attention mechanism
computations. To enhance training stability, all float16 instances of the Vision Tower model were replaced with
bfloat16. Additionally, prompts were set up by combining images and texts using the "mistral_instruct"
conversation mode.</p>
          <p>For eficient fine-tuning, LoRA was applied to specific layers, configuring it with a rank r = 16, an
alpha (lora_alpha) of 32, and a dropout rate of 0.05. The query, key, and value projection layers
in the self-attention mechanisms of the Mistral Decoder Layer, as well as the projection layers in the
MLP, were specifically targeted. In the vision model, LoRA was applied to the linear projection layers
within the self-attention mechanism (CLIP attention) of the encoder layers in the CLIP encoder. This
resulted in 40,108,032 trainable parameters, about 0.527% of the model’s total parameters. The LoRA
components included lora_A, lora_B, and lora_dropout representing the low-rank projection
to a smaller dimension, projection back to the original dimension, and a parameter to prevent overfitting,
respectively.</p>
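One way to sanity-check the reported figure is to re-derive it from the layer dimensions of the adapted architectures. The hidden sizes and layer counts below are the standard published Mistral-7B and CLIP ViT-L values and are assumptions on our part; each adapted linear layer of shape d_in × d_out contributes r × (d_in + d_out) LoRA parameters:

```python
r = 16

def lora_params(d_in, d_out, rank=r):
    # LoRA adds A: [rank, d_in] and B: [d_out, rank] per adapted linear layer
    return rank * (d_in + d_out)

# Mistral-7B decoder: 32 layers, hidden 4096, grouped-query k/v dim 1024, MLP dim 14336
per_lm_layer = (lora_params(4096, 4096)       # q_proj
                + lora_params(4096, 1024)     # k_proj
                + lora_params(4096, 1024)     # v_proj
                + lora_params(4096, 14336)    # gate_proj
                + lora_params(4096, 14336)    # up_proj
                + lora_params(14336, 4096))   # down_proj

# CLIP ViT-L vision tower: 24 layers, hidden 1024; q/k/v projections adapted
per_vision_layer = 3 * lora_params(1024, 1024)

total = 32 * per_lm_layer + 24 * per_vision_layer
print(total)  # 40108032, matching the reported 40,108,032
```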
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Training Process</title>
          <p>
            The training process involved setting up a Data Loader for the dataset, ensuring images and text inputs
were properly loaded. Custom callbacks were defined for printing the best checkpoint and implementing
early stopping. Key training parameters included a learning rate of 1e-4, bfloat16 precision, and the
AdamW [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] optimizer. Each device processed batches of 4, with gradient accumulation steps of 8.
Evaluations and saves were performed every 1,095 steps, with the training capped at 21,900 steps (10
epochs). Early stopping was set with a patience of 5 steps and a threshold of 0.01, monitoring evaluation
loss (where lower values are better). Training was halted at 9,855 steps, and the best model, saved
at 4,380 steps, was reloaded at the end. For evaluation, caption generation was configured with a
temperature of 1.0, a beam width of 1, and a maximum of 512 new tokens. Figure 6 depicts the training
and validation loss over the steps.
          </p>
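The step counts above are consistent with simple arithmetic on the batch settings, assuming a single device and drop-last batching:

```python
train_size = 70_108                                # training images (Section 3)
per_device_batch, grad_accum = 4, 8

effective_batch = per_device_batch * grad_accum    # 32 samples per optimizer step
steps_per_epoch = train_size // effective_batch    # 2,190 steps per epoch
print(effective_batch, steps_per_epoch, 10 * steps_per_epoch)  # 32 2190 21900
```

This also explains the 1,095-step evaluation interval: exactly half an epoch.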
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Submission 2: Additional fine-tuning of LLaVA-v.1.6-Mistral-7B Model</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Model Description</title>
          <p>The second submission was built upon the first one by fine-tuning a larger portion of the model
using the same pattern. This included an expanded application of LoRA to improve utilization of
the model’s capacity for more accurate and robust predictions. The fine-tuning involved additional
layers to enhance the learning and improve visual-textual alignment. Specifically, output projection
layers such as o_proj in the Mistral Decoder Layer’s self-attention mechanism and out_proj in
the vision model were included to better capture complex relationships within the data, which is
essential for tasks like image captioning. Targeting multimodal projector layers (mm_projector.0
and mm_projector.2) enhanced the alignment of visual and textual representations, which is crucial
for multimodal tasks. Despite the increased number of trainable parameters (98,467,840 compared to
40,108,032), this expansion represented only a small fraction (1.285%) of the total model parameters,
maintaining parameter efficiency while improving learning capabilities. LoRA was configured with a
rank r = 32, lora_alpha of 32, and a dropout rate of 0.05. Various layers were targeted in the Mistral
Decoder Layers, including query projection (q_proj), key projection (k_proj), value projection
(v_proj), and output projection (o_proj) in the self-attention mechanism, as well as gate projection
(gate_proj), up projection (up_proj), and down projection (down_proj) in the MLP components.
In the CLIP Vision Model, LoRA was applied to similar projection layers in the attention
mechanism and fully connected layers (fc1 and fc2) in the MLP. Additionally, the multimodal projector
layers (mm_projector.0 and mm_projector.2) were included to further enhance the model’s
capabilities. These modifications were applied to the LLaVA-v.1.6 model and its pre-trained checkpoints
on the Mistral-7B.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Training Process</title>
          <p>
            The training configuration included a learning rate of 1e-5, using the AdamW [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] optimizer, bfloat16
precision, and Flash Attention enabled. Each device handled a batch size of 4, with gradient accumulation
steps set to 8. The model underwent training for a maximum of 8,760 steps (4 epochs), with checkpoints
and evaluations performed every 548 steps. Early stopping parameters were defined with a patience of 4
and a threshold of 0.01, monitoring the evaluation loss to select the best model, with lower values being
preferable. Training was halted early at 3,836 steps, and the model saved at this point was considered
the best and subsequently loaded. For evaluation, specifically for generating captions, parameters were
set with a temperature of 1.0, beam width of 1, and a maximum of 100 new tokens.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Submission 3: Hybrid fine-tuning of LLaVA-v.1.6-Mistral-7B</title>
        <p>This submission was built on the previous one, maintaining the same general pattern but altering which
layers were fine-tuned and the fine-tuning strategy itself. The fine-tuning strategy was multifaceted,
employing LoRA to adapt key components, such as attention mechanism projections, MLP
components, and multimodal projector layers. Additionally, the language model’s head (lm_head) and the
embedding tokens (embed_tokens) were explicitly set as trainable parameters to further enable
these parts of the model to learn and adapt to the task. This hybrid approach leveraged the strengths
of both LoRA adapters and traditional fine-tuning. Fine-tuning the lm_head allowed the model to
better tailor its output generation to specific tasks or datasets, which was particularly important for
generating appropriate language or captions from medical images. On the other hand, fine-tuning
the embed_tokens layer helped the model learn better representations of input tokens, improving
overall performance, especially when the input data distribution differs from the pre-training data.</p>
        <p>In this configuration, LoRA was set with a rank r = 32, and the lora_alpha was calculated as
32 × √32 to stabilize training and enhance low-rank adaptation performance. This scaling factor
normalized the learning rate for LoRA parameters based on rank, ensuring effective updates without
causing gradient explosion or vanishing gradients. A dropout rate of 0.05 was applied to prevent
overfitting and maintain generalization ability.</p>
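The stated alpha of 32 × √32 makes the standard LoRA scaling factor alpha/r equal to 32/√r, i.e., the rank-stabilized scaling convention (our interpretation of the described normalization). A quick check:

```python
import math

r = 32
lora_alpha = 32 * math.sqrt(32)      # as described in the text, about 181.02
scaling = lora_alpha / r             # standard LoRA scaling: alpha / r

# Equivalent to rank-stabilized scaling alpha' / sqrt(r) with alpha' = 32
assert math.isclose(scaling, 32 / math.sqrt(r))
print(round(scaling, 3))  # 5.657
```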
        <p>For layers explicitly set as trainable, the lm_head was a linear layer that mapped hidden states
to the vocabulary space, generating the final output logits for each token. This layer was crucial for
the model’s text generation capability. The embed_tokens layer converted input token indices into
dense vectors, providing initial representations of the input tokens essential for the model to process
the input text. Both the lm_head and embed_tokens layers had their full weights fine-tuned, in
addition to the LoRA adapters.</p>
        <p>Overall, this hybrid fine-tuning approach combined LoRA fine-tuning for attention, MLP, and
multimodal projection layers with full weight fine-tuning of the lm_head and embed_tokens layers.
The total number of trainable parameters was 350,650,368 out of 7,654,729,728 total parameters, making
up 4.581% of the parameters.</p>
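The quoted trainable fraction follows directly from the two parameter counts:

```python
trainable, total = 350_650_368, 7_654_729_728
pct = 100 * trainable / total
print(f"{pct:.3f}%")  # 4.581%, matching the reported figure
```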
        <sec id="sec-5-3-1">
          <title>5.3.1. Training Process</title>
          <p>
            The training arguments included a learning rate of 1e-5, the AdamW [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] optimizer, bfloat16 precision,
Flash Attention, per-device batch sizes of 4, and gradient accumulation steps of 8. The model was
trained for a maximum of 6,570 steps (3 epochs), with checkpoints and evaluations performed every
548 steps. Gradient checkpointing was enabled using a re-entrant approach to reduce memory usage.
Early stopping was configured with a patience of 3 and a threshold of 0.01, monitoring evaluation loss
(with lower values being better). Early stopping was triggered at 3,836 steps, at which point the best
model was saved and later loaded. For evaluation and caption generation, the parameters were set to a
temperature of 1.0, num_beams of 1, and max_new_tokens of 100.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-4a">
        <title>5.4. Submission 4: Fine-tuning of LLaVA-v.1.6-Vicuna-7B</title>
        <sec id="sec-5-4-1">
          <title>5.4.1. Model Description</title>
          <p>For this submission, the pre-trained multimodal language model on checkpoints of LLaVA v.1.6 Vicuna
7B was loaded, which used lmsys/vicuna-7b-v1.5 as its base LLM. The model preparation involved
configuring LoRA with a rank (r) of 16, a lora_alpha of 32, and a dropout rate of 0.05. The
target modules for LoRA were expanded to include the query (q_proj), key (k_proj), and value
(v_proj) projections within the self-attention mechanism of the LLaMA Decoder Layer, as well as
the gate (gate_proj), up (up_proj), and down (down_proj) projections in the MLP components
of the same layer. Additionally, in the CLIP Vision Model’s CLIP Encoder layers, the key (k_proj),
value (v_proj), and query (q_proj) projections, along with the first (fc1) and second (fc2) fully
connected layers of the CLIP MLP, were targeted. Furthermore, the multimodal projector layers
(mm_projector.0 and mm_projector.2) were included. This expanded application of LoRA
resulted in 34,422,784 trainable parameters out of a total of 7,097,329,664 parameters, constituting
approximately 0.485% of the model’s parameters.</p>
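<p>In PEFT terms, the target-module selection described above might look like the following sketch. The module names follow the text; this is not the authors' exact code, and name-based matching in peft applies each entry wherever it occurs (here, both the LLaMA decoder and the CLIP encoder).</p>
<preformat>
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        # Self-attention projections (LLaMA decoder and CLIP encoder)
        "q_proj", "k_proj", "v_proj",
        # LLaMA MLP projections
        "gate_proj", "up_proj", "down_proj",
        # CLIP MLP fully connected layers
        "fc1", "fc2",
        # Multimodal projector layers
        "mm_projector.0", "mm_projector.2",
    ],
)
```
</preformat>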
        </sec>
        <sec id="sec-5-3-3">
          <title>5.4.2. Training Process</title>
          <p>The training process involved setting up a Data Loader for the dataset and inspecting batches to
ensure correct loading of images and text inputs. Custom callbacks were created for printing the best
checkpoint and enabling early stopping. The training used a learning rate of 1e-4, bfloat16 precision,
Flash Attention, the AdamW optimizer, batch sizes of 4 per device, and gradient accumulation steps of 8,
with evaluation and save steps every 548 steps. The model was trained for a maximum of 10,950 steps (5
epochs), with early stopping configured with a patience of 3 and a threshold of 0.01. The evaluation loss
was monitored to select the best model, with lower values being preferable. Early stopping occurred at
4,932 steps, and the best model, saved at 4,384 steps, was loaded at the end. For generating captions
during evaluation, parameters included a temperature of 1.0, num_beams set to 1, and a maximum of
512 new tokens.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>5.5. Submission 5: Hybrid Fine-tuning of LLaVA-v.1.6-Vicuna-7B</title>
        <p>For this submission, the same approach as in the third submission was followed; the only difference was
the use of the Vicuna LLM. The total number of trainable parameters was 346,718,208 out of
7,147,481,088 total parameters (4.851% of the parameters). The training process was similar to
that of the previous submission, except that the maximum token limit was set to 150 for
this submission. The model was trained for a maximum of 10,950 steps (5 epochs), with early stopping
configured with a patience of 5 and a threshold of 0.01. Early stopping occurred at 6,576 steps, and the
best model, saved at 4,384 steps, was loaded for evaluation.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.6. Submission 6: Selective Fine-tuning of LLaVA-v.1.5-7B</title>
        <p>The LLaVA 1.5 7B shares a similar architecture with that of LLaVA-v.1.6 Vicuna-7B. LLaVA 1.5
checkpoints on 7B parameters were loaded, and the expanded use of LoRA resulted in 84,574,208 trainable
parameters out of a total of 7,147,476,992, constituting approximately 1.183% of the model’s
parameters. Precision was adjusted from float16 to bfloat16 to enhance computational efficiency, and Flash
attention was not enabled in this submission. Instead, LLaMA Scaled Dot-Product Attention (SDPA)
was utilized in the 32 layers of the LLaMA Decoder Layer. LoRA was configured with a rank of 32,
lora_alpha of 32, and a dropout rate of 0.05. Target modules for LoRA included various projections
in LLaMA Decoder Layer, MLP components, and attention mechanisms within CLIP Vision Model.
The training process involved creating a Data Loader, defining custom callbacks for early stopping
and checkpoint printing, and setting training arguments such as a learning rate of 1e-5, and AdamW
optimizer. Training was conducted for a maximum of 8760 steps with early stopping triggered at 4,672
steps, saving the best model. For evaluation, parameters included temperature = 1.0, num_beams = 1,
and max_new_tokens = 100.</p>
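<p>SDPA here refers to PyTorch's fused attention primitive, which computes softmax(QK^T/sqrt(d))V in a single call; a minimal illustration (shapes are arbitrary):</p>
<preformat>
```python
import torch
import torch.nn.functional as F

# SDPA fuses the attention computation; this is the primitive that
# replaces Flash Attention in this submission.
q = torch.randn(1, 8, 16, 64)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```
</preformat>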
      </sec>
      <sec id="sec-5-6">
        <title>5.7. Submission 7: Adaptation of MoonDream2</title>
        <sec id="sec-5-6-1">
          <title>5.7.1. Model Description</title>
          <p>
Moondream2, a small vision language model designed for efficient operation on edge devices, was
evaluated on the ImageCLEF 2024 dataset using pre-trained weights from Huggingface [
            <xref ref-type="bibr" rid="ref47 ref48">47, 48</xref>
            ]. These
weights were initialized from Sigmoid Loss for Language-Image Pre-Training (SigLIP) and Phi-1.5
models. Phi-1.5 [
            <xref ref-type="bibr" rid="ref49">49</xref>
            ], developed by Microsoft Research, is a compact Transformer-based language model
with 24 layers, 32 heads (each with a dimension of 64), rotary embeddings, a rotary dimension of 32, a
context length of 2,048, and flash-attention. SigLIP [
            <xref ref-type="bibr" rid="ref50">50</xref>
            ], an enhancement of the CLIP model, replaces the
softmax loss with a pairwise sigmoid loss, operating on image-text pairs without global normalization.
SigLIP’s architecture includes a ViT [
            <xref ref-type="bibr" rid="ref51">51</xref>
            ] backbone that processes image patches through a transformer
encoder and a classification head with an MLP using Gaussian Error Linear Unit (GELU) activation for
final predictions. Moreover, the pre-processing included resizing, type conversion, and normalization.
This architecture effectively combined visual and textual processing for caption generation.
          </p>
          <p>LoRA was configured with an alpha (lora_alpha) of 32, which scales the updates from the
low-rank matrices, and a rank (lora_rank) of 64 for the adaptation process. It was applied to specific
linear layers in both the vision encoder and the text model. In the vision encoder, LoRA targeted the
projection layers (proj), and fully connected layers (fc1 and fc2) within the 27 ViTBlock components.
Additionally, LoRA was applied to the fc1 and fc2 layers in the multimodal projection layer, a custom
module integrated to adapt the projection layer for the purpose of the caption prediction challenge.
In the language model, LoRA targeted the Wqkv, out_proj, fc1, and fc2 layers within the 24 Phi
Decoder Layer components. Wqkv in the Phi Decoder Layer represents the combined weights for the
self-attention mechanism’s linear projections (query, key, and value). With LoRA applied, the model
had 74,422,272 trainable parameters, which was about 3.850% of the total parameters (1,931,904,880).</p>
        </sec>
        <sec id="sec-5-6-2">
          <title>5.7.2. Training Process</title>
          <p>The training process employed various key parameters and strategies to optimize the model’s
performance. The number of image tokens was set to 729, aligning with text tokens. Training spanned 10
epochs over 40,000 steps, using a batch size of 8 and gradient accumulation steps of 4, with evaluation
after each epoch. An early stopping mechanism with a patience of 6 epochs and a minimum delta of
0.0001 monitored validation loss to prevent overfitting. Data loading and batching utilized PyTorch’s
DataLoader with custom collation for images and text tokens, pre-processed and padded for uniform
sequence lengths. Gradient accumulation steps set to 4 simulated a larger batch size for better GPU
memory management. The Adam8bit optimizer from the bitsandbytes library, with a dynamic learning
rate adjusted via a cosine schedule, was used. Loss computation combined image and text embeddings,
processed by the Phi language model. The training loop iterated over epochs and batches, updating
parameters post-gradient accumulation and checking validation loss for early stopping. LoRA parameters
were optimized with an initial learning rate of 3e-6, scaled by a factor of 4, balancing exploration and
convergence. This approach, along with gradient accumulation, enhanced resource use and fine-tuning
efficiency.</p>
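<p>The cosine learning-rate schedule mentioned above can be sketched in plain Python. The minimum learning rate of 0 is an assumption; the text specifies only an initial rate of 3e-6 scaled by a factor of 4.</p>
<preformat>
```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float,
              lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate, decaying from lr_max to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

base_lr = 3e-6 * 4  # initial LoRA rate scaled by a factor of 4, per the text
print(cosine_lr(0, 40_000, base_lr))       # full rate at the start
print(cosine_lr(40_000, 40_000, base_lr))  # decays to ~0 at the end
```
</preformat>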
        </sec>
        <sec id="sec-5-6-3">
          <title>5.8.1. Model Description</title>
          <p>
            IDEFICS 9B Instruct [
            <xref ref-type="bibr" rid="ref52">52</xref>
            ] is an advanced multimodal model developed by Hugging Face for integrated
image and text processing tasks. The model combines the vision model CLIP ViT-H/14 [
            <xref ref-type="bibr" rid="ref53">53</xref>
            ] and the
language model LLaMA 7B [
            <xref ref-type="bibr" rid="ref54">54</xref>
            ], incorporating novel transformer blocks to connect these modalities.
Trained on extensive datasets, including OBELICS, Wikipedia, LAION, and PMD, the IDEFICS 9B
Instruct variant is fine-tuned on supervised and instruction datasets.
          </p>
          <p>The lightweight IDEFICS 9B Instruct variant was explored using 4-bit quantization to reduce model
size and computational requirements while maintaining performance. BitsAndBytes (BnB) quantization
assigns 4-bit precision to the model using double quantization with the normalized floating-point
format (NF4) and bfloat16 precision for computations, crucial for running large language models
on smaller devices. For fine-tuning IDEFICS 9B Instruct on the ImageCLEF dataset, the checkpoint
HuggingFaceM4/idefics-9b-instruct was specified to load the pre-trained model with 4-bit quantization
using the BitsAndBytesConfig class.</p>
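<p>The quantized loading described above corresponds to the standard BitsAndBytesConfig pattern; this sketch shows only the settings named in the text.</p>
<preformat>
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights
    bnb_4bit_quant_type="nf4",             # normalized floating-point format
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # bfloat16 for computations
)
# Passed as quantization_config to from_pretrained for
# HuggingFaceM4/idefics-9b-instruct.
```
</preformat>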
          <p>LoRA was applied to the query projection (q_proj), key projection (k_proj), and value projection
(v_proj) in both the ViT and decoder layers, as well as the perceiver attention and gated cross-attention
layers. However, the output projection (o_proj and out_proj) in the decoder, gated cross-attention,
and perceiver attention layers did not use LoRA but remained as standard Linear4bit layers. This
selective application of LoRA allowed for efficient fine-tuning by reducing the number of trainable
parameters specifically within the attention mechanisms while leaving other projections, like the output
projection layers, unmodified.</p>
        </sec>
        <sec id="sec-5-6-4">
          <title>5.8.2. Training Process</title>
          <p>Custom callbacks were defined for printing the best checkpoint and early stopping. The training
arguments included a learning rate of 1e-4, the AdamW optimizer, batch sizes of 2 per device for
training and evaluation, gradient accumulation steps of 8, and evaluation and save steps every 500 steps.
The model was trained for a maximum of 8762 steps (2 epochs). Early stopping parameters were set
with a patience of 6 and a threshold of 0.001. Evaluation loss was monitored to select the best model,
with lower values being better. Early stopping was triggered at 8,000 steps, and the best model, saved at
8,000 steps, was loaded at the end of training.</p>
        </sec>
      </sec>
      <sec id="sec-5-7">
        <title>5.9. Submission 9: VisionGPT2</title>
        <sec id="sec-5-7-1">
          <title>5.9.1. Model Description</title>
          <p>
            The Encoder-Decoder model was designed to take an image as input and generate a descriptive caption
as output. In this model, the Encoder was a ViT [
            <xref ref-type="bibr" rid="ref51 ref55">55, 51</xref>
            ] that processed the input image and extracted
meaningful features. These features were then fed into the Decoder, which is based on GPT-2 [
            <xref ref-type="bibr" rid="ref56">56</xref>
            ],
a powerful language model that generates the corresponding textual caption. For fine-tuning the
model, the Hugging Face Seq2SeqTrainer [
            <xref ref-type="bibr" rid="ref57">57</xref>
            ] was employed. This trainer, part of the Hugging Face
transformers library, is specifically designed to handle sequence-to-sequence tasks, making it well-suited
for this image captioning model. The fine-tuning process leverages the transformers library to adapt
the pre-trained ViT and GPT-2 models.
          </p>
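<p>The ViT + GPT-2 pairing described above is commonly assembled with the transformers library's VisionEncoderDecoderModel; a sketch under assumed checkpoint names (the authors' exact checkpoints are not stated):</p>
<preformat>
```python
from transformers import VisionEncoderDecoderModel

# Assumed checkpoint ids, for illustration only.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT encoder
    "gpt2",                               # GPT-2 decoder (cross-attention added)
)
# Fine-tuning then proceeds with Seq2SeqTrainer from the same library.
```
</preformat>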
        </sec>
        <sec id="sec-5-7-2">
          <title>5.9.2. Training Process</title>
          <p>Initially, the pre-trained layers were frozen to focus on training the cross-attention layers. In subsequent
epochs, GPT-2 was unfrozen and trained, and in the final few epochs, the ViT was also unfrozen. The
Adam optimizer and the One Cycle Learning Rate (OneCycleLR) scheduler are used for optimization.
Mixed-precision fp16 training was employed with autocast and GradScaler in PyTorch. The training
metrics are cross-entropy loss and perplexity, with both metrics aimed to be minimized. The best model
was saved based on validation perplexity and was loaded during caption generation.</p>
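<p>The staged unfreezing can be illustrated with toy stand-in modules (hypothetical, for illustration only; freezing simply toggles requires_grad):</p>
<preformat>
```python
import torch.nn as nn

# Toy stand-ins for the ViT encoder, GPT-2 decoder, and cross-attention.
model = nn.ModuleDict({
    "encoder": nn.Linear(8, 8),
    "decoder": nn.Linear(8, 8),
    "cross_attention": nn.Linear(8, 8),
})

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: train only the cross-attention layers.
set_trainable(model["encoder"], False)
set_trainable(model["decoder"], False)
set_trainable(model["cross_attention"], True)
stage1 = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stage 2: unfreeze the decoder (GPT-2); a final stage unfreezes the encoder (ViT).
set_trainable(model["decoder"], True)
stage2 = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(stage1, stage2)  # 72 144
```
</preformat>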
        </sec>
      </sec>
      <sec id="sec-5-8">
        <title>5.10. Submission 10: CNN-Transformer Fusion Model</title>
        <p>
          The CNN-Transformer fusion model for this submission was built around three core models. First, the
pre-trained ChexNet [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ] (a DenseNet121 backbone based CNN model) was used to extract features
from the input images. These features captured essential visual information and were then passed to
the second component, a Transformer Encoder [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ]. The Transformer-based encoder processed the
extracted image features to generate a new, more informative representation of the inputs. Finally, the
third component, a Transformer Decoder [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ], took the output from the encoder along with the text data
(sequences). The decoder used these inputs to learn and generate the corresponding image captions,
completing the image-to-text translation process. The hyper-parameters for the model included an
embedding dimension set to 512 and an initial learning rate of 0.0001. The encoder used a single attention
head, while the decoder utilized two attention heads to process the information. For early stopping,
the patience level was set to 5, meaning the training process halted if there was no improvement in
validation loss after five epochs.
        </p>
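<p>The fusion pipeline above (CNN features, then a Transformer encoder, then a Transformer decoder) can be sketched as follows. The ChexNet feature extractor is replaced by random dummy features, and the stated hyper-parameters (embedding dimension 512, one encoder head, two decoder heads) are used; everything else is illustrative.</p>
<preformat>
```python
import torch
import torch.nn as nn

embed_dim = 512
# Stand-in for DenseNet121 features: 49 spatial positions x 1024 channels.
features = torch.randn(2, 49, 1024)
proj = nn.Linear(1024, embed_dim)          # map CNN channels to embed_dim

encoder_layer = nn.TransformerEncoderLayer(embed_dim, nhead=1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

decoder_layer = nn.TransformerDecoderLayer(embed_dim, nhead=2, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)

memory = encoder(proj(features))           # encoded image representation
captions = torch.randn(2, 20, embed_dim)   # embedded caption tokens
out = decoder(captions, memory)            # (batch, seq, embed_dim)
print(out.shape)  # torch.Size([2, 20, 512])
```
</preformat>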
      </sec>
      <sec id="sec-5-9">
        <title>5.11. Performance Measurement Metrics for the Caption Prediction Task</title>
        <p>The performance of all the submissions on the caption generation task was evaluated using the
following metrics.</p>
        <p>
          • BERTScore [
          <xref ref-type="bibr" rid="ref60">60</xref>
          ] evaluates text generation by computing the similarity between BERT embeddings
of the candidate and reference sentences, capturing semantic meaning better than traditional
metrics.
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [
          <xref ref-type="bibr" rid="ref61">61</xref>
          ] is a set of metrics for evaluating
automatic summarization and machine translation by comparing overlap in n-grams, word
sequences, and word pairs between the candidate and reference texts.
• BLEU (Bilingual Evaluation Understudy) [
          <xref ref-type="bibr" rid="ref62">62</xref>
          ] is a precision-based metric for evaluating machine
translation quality by comparing n-grams of the candidate translation to those of the reference
translation. BLEU-1 specifically considers unigram matches.
• BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) [
          <xref ref-type="bibr" rid="ref63">63</xref>
          ] is a
learned evaluation metric for natural language generation that uses pre-trained transformers
fine-tuned on a variety of supervised and unsupervised signals to predict human judgment scores.
• METEOR (Metric for Evaluation of Translation with Explicit ORdering) [
          <xref ref-type="bibr" rid="ref64">64</xref>
          ] evaluates machine
translation by considering precision, recall, stemming, synonymy, and alignment, aiming to
improve correlation with human judgment.
• CIDEr (Consensus-based Image Description Evaluation) [
          <xref ref-type="bibr" rid="ref65">65</xref>
          ] is a metric for evaluating image
captioning by comparing candidate captions to reference captions using TF-IDF weighting and
n-gram similarity, ensuring relevance and importance of the words are considered.
• CLIPScore [
          <xref ref-type="bibr" rid="ref66">66</xref>
          ] is an evaluation metric that uses the CLIP model to compare image and text
similarity. It measures the alignment between visual content and textual descriptions, providing
a score based on their embedding similarity.
• RefCLIPScore [
          <xref ref-type="bibr" rid="ref66">66</xref>
          ] is an extension of CLIPScore that includes a reference-based evaluation,
incorporating both the similarity of the generated text to a reference text and the similarity
between the image and the generated text.
• ClinicalBLEURT [
          <xref ref-type="bibr" rid="ref67">67</xref>
          ] adapts BLEURT for clinical text generation, fine-tuning it on clinical datasets
to better evaluate the quality and relevance of generated clinical text against reference clinical
text.
• MedBERTScore [
          <xref ref-type="bibr" rid="ref67">67</xref>
          ] adapts BERTScore for the medical domain, using BERT embeddings
specifically fine-tuned on medical texts to provide a more accurate evaluation of medical text generation
tasks.
        </p>
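<p>To make one of these metrics concrete, BLEU-1's clipped unigram precision can be computed by hand. The function below is a simplified single-reference illustration, not the official evaluation code.</p>
<preformat>
```python
import math
from collections import Counter

def bleu1(candidate: list, reference: list) -> float:
    """Simplified single-reference BLEU-1: clipped unigram precision x brevity penalty."""
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    # Each candidate word counts at most as often as it appears in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty discourages captions shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / len(candidate))
    return bp * precision

score = bleu1("an x ray of the pelvis".split(),
              "an x ray of the pelvis region".split())
print(round(score, 4))
```
</preformat>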
      </sec>
      <sec id="sec-5-10">
        <title>5.12. Results and Discussion on Caption Prediction Submissions</title>
        <p>
          In this year’s evaluation for the ImageCLEF task, BERTScore [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] was the primary metric used to assess
the quality of the generated captions, with ROUGE [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] as the secondary metric.
        </p>
        <p>Table 1 shows the results of the submissions in terms of the primary performance metrics. In addition to
BERTScore and ROUGE, some other performance metrics were also adopted to assess submission results.
These metrics are BLEU-1, BLEURT, METEOR, CIDEr, CLIPScore, RefCLIPScore, ClinicalBLEURT, and
MedBERTScore. Table 2 shows the results of the additional performance metrics other than the
BERTScore and ROUGE used for the caption prediction task. In both tables, the submissions are listed
according to the BERTScore (highest to lowest).</p>
        <p>Our results indicate that LMMs, when selectively fine-tuned with fewer parameters, can achieve
high performance. Additionally, LMMs obtained through quantization and smaller VLMs can maintain
competitive performance in medical image understanding and caption generation. From Tables 1 and
2, it is evident that four different submissions outperformed the others in terms of the pre-specified
performance measurement metrics. Submission 1 using the LLaVA-v1.6-Mistral-7B model with 40.1M
fine-tuned parameters using the LoRA technique achieved the highest scores across several key metrics:
BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682),
CIDEr (0.245029), and RefCLIPScore (0.815534). Submission 3, also using the LLaVA-v.1.6-Mistral-7B
model with hybrid LoRA fine-tuning approach (350.6M parameters) attained the highest CLIPScore of
0.824171, indicating an improved semantic match between the generated captions and the visual content
of the medical images. Submission 10, the CNN-Transformer fusion approach (Pre-trained CheXNet as
the encoder and Transformer as the decoder) performed better than other submissions in terms of the
ClinicalBLEURT score of 0.676905. Finally, Submission 8, which was IDEFICS-9B-Instruct quantized to
4-bit, excelled in capturing relevant biomedical concepts compared to other submissions, achieving the
highest MedBERTScore of 0.657460034. Overall, the first submission can be claimed as the top performer
because of the highest scores in the primary and secondary metrics. Figure 7 shows the comparison of
the submissions in terms of the primary and secondary metrics. The significance of these submissions
lies in their demonstration of advanced fine-tuning techniques and model performance optimization in
the context of generative models. These findings highlight the evolving landscape of model fine-tuning
strategies, advocating for resource-efficient methods that maintain or enhance performance. This is
crucial for practical and scalable AI deployments across diverse medical applications.</p>
        <p>In addition to the above-mentioned submissions, Submission 4, utilizing the LLaVA v.1.6 Vicuna 7B
with selective fine-tuning using LoRA (34.4M parameters), demonstrated well-balanced performance
and closely followed Submission 1. Moreover, submissions 3 and 2, both based on the LLaVA v.1.6
Mistral 7B model but with different optimization approaches, closely followed Submission 4 in terms
of BERTScore and ROUGE. However, the sixth submission, LLaVA v.1.5 7B, based on another variant of
LLaVA, could not outperform the LLaVA v.1.6 variants except for LLaVA v.1.6 Vicuna 7B with hybrid
fine-tuning using the LoRA technique (Submission 5). The experiment with MoonDream2, with 74.4M
fine-tuned parameters, in Submission 7 showed competitive performance on the test data relative to the larger
models across multiple metrics. Submissions 9 and 10 were based on the pre-trained Transformer-based
encoder-decoder models other than the LMMs. VisionGPT2 outperformed the conventional pre-trained
CheXNet-Transformer or CNN-Transformer based model in every metric except ClinicalBLEURT. Table
3 shows the generated captions for a test image (ID: ImageCLEFmedical_Caption_2024_test_000016)
corresponding to the submissions made for the caption prediction task.</p>
        <p>Anteroposterior radiograph of the pelvis showing a sacral fracture (yellow
arrows) and a pubic fracture (yellow arrowhead).</p>
        <p>X-ray of the pelvis showing bilateral sacroiliitis (yellow arrows) and
bilateral pubic symphysis (yellow arrowheads).</p>
        <p>X-ray of the pelvis showing a large pelvic mass (arrows).</p>
        <p>Plain radiograph of the pelvis showing a large pelvic mass (yellow arrows)
with a large right-sided pelvic hematoma.</p>
        <p>X-ray of the pelvis showing the presence of a foreign body in the bladder
(yellow arrow) and the presence of a foreign body in the rectum.</p>
        <p>X-ray of the pelvis showing the fracture of the right pubis.</p>
        <p>Anteroposterior radiograph of the pelvis showing a right-sided sacroiliitis.</p>
        <p>X-ray of the pelvis showing the fracture of the right ilium (yellow arrows).</p>
        <p>CT scan of the chest. The CT scan showed a nodule in the right upper
lobe.</p>
        <p>Bone defect detected in the axillary region.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Concept Detection Task</title>
      <p>
        This year, the CS_Morgan team submitted a single entry for the concept detection task. The submission
involved the implementation of the ConvMixer [
        <xref ref-type="bibr" rid="ref68 ref69">68, 69</xref>
        ] model, which combines the CNN and Transformer
architectures.
      </p>
      <sec id="sec-6-1">
        <title>6.1. Model Description</title>
        <p>
          ConvMixer [
          <xref ref-type="bibr" rid="ref68 ref69">68, 69</xref>
          ] closely resembles the MLP-Mixer [
          <xref ref-type="bibr" rid="ref70">70</xref>
          ] model, with key differences in its architecture.
Instead of fully-connected layers, ConvMixer employs standard convolution layers. It uses batch
normalization rather than the layer normalization technique typically used in ViT [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] and
MLP-Mixers [
          <xref ref-type="bibr" rid="ref70">70</xref>
          ]. ConvMixer utilizes two types of convolution layers: depth-wise convolutions for mixing
spatial locations of the images and point-wise convolutions, following the depth-wise convolutions, for
mixing channel-wise information across the patches. Additionally, ConvMixer uses larger kernel sizes
to achieve a larger receptive field. Figure 8 shows the corresponding architecture of the model.
        </p>
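<p>A single ConvMixer block as described (a depth-wise convolution with a large kernel for spatial mixing, then a point-wise convolution for channel mixing, each with GELU and batch normalization) can be sketched as follows. The text's implementation used TensorFlow/Keras; this PyTorch equivalent with illustrative dimensions conveys the same structure.</p>
<preformat>
```python
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 9):
        super().__init__()
        # Depth-wise convolution (groups=dim) mixes spatial locations per channel;
        # a residual connection wraps it, as in the ConvMixer paper.
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        # Point-wise (1x1) convolution mixes information across channels.
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(x + self.depthwise(x))

block = ConvMixerBlock(dim=64)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```
</preformat>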
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Training and Result</title>
        <p>The training process involved developing a ConvMixer model designed for classification or concept
detection task with 1,944 unique CUIs. The model was built using TensorFlow and Keras, with key
components including an initial rescaling layer, a patch extraction stem, and a series of ConvMixer
blocks. The model utilized GELU activations and batch normalization for better performance. The
architecture included a global average pooling layer followed by a dense output layer with a sigmoid
activation function. Training was conducted over 200 epochs with a batch size of 8, a learning rate
of 0.001, and a weight decay of 0.0001. The Adam optimizer was used for training, and the binary
cross-entropy loss function was chosen for the multi-label classification task. Performance metrics such
as accuracy, precision, recall, and Area Under the Curve (AUC) were tracked during training. However,
only the F1-score was reported for the submission. A model checkpoint callback was implemented to
save the best model based on validation accuracy. After training, the model was evaluated using the
best checkpointed weights.</p>
        <p>By implementing this model, an F1-score of 0.107645 was attained on the test data, placing it in
ninth position for the concept detection task among the participants. This score indicates that the
model’s performance in terms of precision and recall is relatively low, as it represents the harmonic
mean of precision and recall, providing a single metric that balances both. The score suggests that the
model is struggling to correctly identify and classify the relevant instances among the 1,944 classes,
leading to either a high number of false positives, false negatives, or both. This low score reflects
room for improvement in the model’s ability to accurately predict the target labels. For a test image
(ID: ImageCLEFmedical_Caption_2024_test_000016), the predicted concepts or CUIs based on this
ConvMixer model were C0030797, C0000726, and C1306645, whereas the ground truth concepts were
C1306645, C0030797, and C0034014 (See Figure 9).</p>
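<p>For the test-image example above, the overlap between predicted and ground-truth concept sets makes the precision/recall trade-off behind the F1-score concrete (an illustrative calculation, not the official evaluation script):</p>
<preformat>
```python
predicted = {"C0030797", "C0000726", "C1306645"}
ground_truth = {"C1306645", "C0030797", "C0034014"}

tp = len(predicted & ground_truth)   # 2 concepts correctly predicted
precision = tp / len(predicted)      # 2/3
recall = tp / len(ground_truth)      # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.667
```
</preformat>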
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>For the Caption Prediction task, submitted models included LLaVA v.1.6 with Mistral 7B and Vicuna 7B
checkpoints, as well as the LLaVA v.1.5 7B model. Additionally, a 4-bit quantized instruct variant of
the IDEFICS 9B model and MoonDream2, a compact VLM, were explored. Two fine-tuning strategies,
selective and hybrid fine-tuning, were utilized. Furthermore, traditional encoder-decoder models like
VisionGPT2 and CNN-Transformer architectures were also experimented with. Among these, the
top-performing submission was the selective training of the LoRA projectors (40.1M parameters) on the
LLaVA 1.6 model with Mistral 7B weights. For the Concept Detection subtask, a single model based on
the ConvMixer architecture was submitted, which combines the strengths of CNNs and Transformers.</p>
      <p>In future research, the primary aim will be to incorporate Explainable AI and reinforcement learning.
Explainable AI will enhance model safety and reliability by identifying potential failures and undesirable
actions in LMMs. Reinforcement learning, using context-aware reward modeling, will integrate detailed
medical image concepts to improve content understanding and performance in multimodal tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>This work was supported by the National Science Foundation (NSF) grant (ID. 2131307) “CISE-MSI: DP:
IIS: III: Deep Learning-Based Automated Concept and Caption Generation of Medical Images Towards
Developing an Effective Decision Support."</p>
    </sec>
    <sec id="sec-9">
      <title>A. Specifications of the Computational Environment</title>
      <p>The specifications of the utilized computational resources and environment included two machines.
The details are as follows.</p>
      <p>• Machine 1
• Machine 2
– Machine Type: a2-highgpu-2g (Accelerator Optimized: 2 NVIDIA Tesla A100 GPUs, 24
vCPUs, 170GB RAM)
– GPU: NVIDIA A100-40GB x 2
– Booting Disk: 1000 GB SSD
– Data Disk: 1000 GB SSD
– Language: Python 3.12.x
– Machine Type: n1-highmem-16 (16 vCPUs, 104 GB RAM)
– GPU: NVIDIA V100 x 2
– Boot disk: 150 GB SSD
– Data disk: 1000 GB SSD
– Language: Python 3.12.x
– Frameworks: PyTorch 2.x and Tensorflow 2.16.x</p>
    </sec>
    <sec id="sec-10">
      <title>B. GitHub Repository</title>
      <p>
        [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ] provides the link to the GitHub repository, which is publicly available for accessing the reproducible
code relevant to the submissions made for this competition.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Allaouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Ben</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Benamrou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouardouz</surname>
          </string-name>
          ,
          <article-title>Automatic caption generation for medical images</article-title>
          ,
          <source>in: Proceedings of the 3rd International Conference on Smart City Applications</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>A survey on automatic generation of medical imaging reports based on deep learning</article-title>
          ,
          <source>BioMedical Engineering OnLine</source>
          <volume>22</volume>
          (
          <year>2023</year>
          )
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. Zhang,</surname>
          </string-name>
          <article-title>Image caption and medical report generation based on deep learning: a review and algorithm analysis</article-title>
          ,
          <source>in: 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>373</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Van</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study</article-title>
          , arXiv e-prints (
          <year>2024</year>
          ) arXiv-
          <fpage>2402</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hasson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lenc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , et al.,
          <article-title>Flamingo: a visual language model for few-shot learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>23716</fpage>
          -
          <lpage>23736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>The role of large language models in medical image processing: a narrative review</article-title>
          ,
          <source>Quantitative Imaging in Medicine and Surgery</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>1108</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT</article-title>
          , arXiv e-prints (
          <year>2023</year>
          ) arXiv-
          <fpage>2304</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Hartsock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rasool</surname>
          </string-name>
          ,
          <article-title>Vision-language models for medical report generation and visual question answering: A review</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2403</volume>
          .
          <fpage>02469</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <article-title>MiniGPT-4: Enhancing vision-language understanding with advanced large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2304</volume>
          .
          <fpage>10592</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Usuyama</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Xu</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Bagga</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Tinn</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Preston</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Rao</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Wei</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Valluri</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Wong</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Tupini</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Mazzola</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Shukla</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Liden</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>
          ,
          <string-name><given-names>M. P.</given-names> <surname>Lungren</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Naumann</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Poon</surname></string-name>
          ,
          <article-title>BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>00915</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bagga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Valluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wong</surname>
          </string-name>
          , et al.,
          <article-title>Large-scale domain-specific pretraining for biomedical vision-language processing</article-title>
          , arXiv e-prints (
          <year>2023</year>
          ) arXiv-
          <fpage>2303</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Chatdoctor: A medical chat model fine-tuned on a Large Language Model Meta-AI (LLAMA) using medical domain knowledge</article-title>
          ,
          <source>Cureus</source>
          <volume>15</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>PourNejatian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Parisien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Compas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Flores</surname>
          </string-name>
          , et al.,
          <article-title>A large language model for electronic health records</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>5</volume>
          (
          <year>2022</year>
          )
          <fpage>194</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcıa Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karpenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2024: Multimedia retrieval in medical applications</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024)</source>
          , Springer Lecture Notes in Computer Science (LNCS), Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.10004v1. arXiv:
          <volume>2405</volume>
          .
          <fpage>10004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEFmedical 2024 - Caption Prediction and Concept Detection</article-title>
          , in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Radiology Objects in Context (ROCO): a multimodal image dataset</article-title>
          ,
          <source>in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          ,
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>32</volume>
          (
          <year>2004</year>
          )
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>18223</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>A survey on evaluation of large language models</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Manzoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albarri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Multimodality representation learning: A survey on evolution, pretraining and its applications</article-title>
          ,
          <source>ACM Transactions on Multimedia Computing, Communications and Applications</source>
          <volume>20</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Efficient large language models: A survey</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2024</year>
          ). URL: https://openreview.net/forum?id=bsCCJHbO8A. Survey Certification.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Instruction tuning for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . arXiv:2308.10792.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. H.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>InstructBLIP: Towards general-purpose vision-language models with instruction tuning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muqeeth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>1950</fpage>
          -
          <lpage>1965</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.-C.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <article-title>On the effectiveness of parameter-efficient fine-tuning</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>37</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>12799</fpage>
          -
          <lpage>12807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <article-title>mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections</article-title>
          ,
          <year>2022</year>
          . arXiv:2205.12005.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-Rank Adaptation of Large Language Models</article-title>
          ,
          <year>2021</year>
          . arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>The Expressive Power of Low-Rank Adaptation</article-title>
          ,
          <year>2024</year>
          . arXiv:2310.17513.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in:
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          NVIDIA,
          <source>Flash Attention</source>
          , https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/flash_attention.html,
          <year>2024</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved Baselines with Visual Instruction Tuning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>26296</fpage>
          -
          <lpage>26306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          Hugging Face
          ,
          <source>Hugging Face Transformers Documentation: LLaVA</source>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/docs/transformers/model_doc/llava. Hugging Face documentation.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39] lmsys,
          <source>Vicuna-7B-v1.3</source>
          , https://huggingface.co/lmsys/vicuna-7b-v1.3,
          <year>2023</year>
          . Hugging Face model hub.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          , et al.,
          <article-title>Judging llm-as-a-judge with mt-bench and chatbot arena</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          , arXiv preprint arXiv:2310.06825
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          Mistral AI,
          <source>Mistral-7B-v0.1</source>
          , https://huggingface.co/mistralai/Mistral-7B-v0.1,
          <year>2024</year>
          . Hugging Face model hub.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          liuhaotian,
          <source>llava-v1.6-mistral-7b</source>
          , https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          OpenAI,
          <source>CLIP ViT-L/14 model</source>
          , https://huggingface.co/openai/clip-vit-large-patch14,
          <year>2021</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <article-title>Medical image interpretation with large multimodal models</article-title>
          , https://github.com/HoqueMahmudul/Medical-Image-Interpretation-with-Large-Multimodal-Models,
          <year>2023</year>
          . Accessed: 2024-06-19.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <year>2019</year>
          . arXiv:1711.05101.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47] vikhyatk,
          <source>Moondream2</source>
          , https://huggingface.co/vikhyatk/moondream2,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          , et al.,
          <article-title>Transformers: State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Giorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Textbooks Are All You Need II: phi-1.5 technical report</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.05463.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <article-title>Sigmoid loss for language image pre-training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>11975</fpage>
          -
          <lpage>11986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
          ,
          <year>2020</year>
          . arXiv:2010.11929, https://huggingface.co/google/vit-base-patch16-224.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>H.</given-names>
            <surname>Laurençon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tronchon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bekman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozhkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karamcheti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <article-title>OBELICS: An open web-scale filtered dataset of interleaved image-text documents</article-title>
          ,
          <year>2023</year>
          . arXiv:2306.16527.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <surname>LAION</surname>
          </string-name>
          ,
          <source>CLIP-ViT-H-14-laion2B-s32B-b79K</source>
          , https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K,
          <year>2023</year>
          . Accessed: 2024-06-19.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tomizuka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vajda</surname>
          </string-name>
          ,
          <article-title>Visual transformers: Token-based image representation and processing for computer vision</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>2006</year>
          .03677.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <surname>Hugging Face</surname>
          </string-name>
          ,
          <source>transformers.Seq2SeqTrainer</source>
          , https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainer,
          <year>2023</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Irvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bagul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shpanskaya</surname>
          </string-name>
          , et al.,
          <article-title>CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning</article-title>
          ,
          <source>arXiv preprint arXiv:1711.05225</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          , Image captioning, https://keras.io/examples/vision/image_captioning/,
          <year>2023</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          ,
          <year>2020</year>
          . arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          ,
          <source>in: Text Summarization Branches Out, Association for Computational Linguistics</source>
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <source>BLEURT: Learning Robust Metrics for Text Generation</source>
          ,
          <year>2020</year>
          . arXiv:2004.04696.
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          [65]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <source>CIDEr: Consensus-based Image Description Evaluation</source>
          ,
          <year>2015</year>
          . arXiv:1411.5726.
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>CLIPScore: A Reference-free Evaluation Metric for Image Captioning</article-title>
          ,
          <year>2022</year>
          . arXiv:2104.08718.
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          [67]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-w.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Michalopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <source>An Investigation of Evaluation Metrics for Automated Medical Note Generation</source>
          ,
          <year>2023</year>
          . arXiv:2305.17364.
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          [68]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <source>Patches are all you need?</source>
          ,
          <year>2022</year>
          . arXiv:2201.09792.
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          [69]
          <string-name>
            <surname>Keras</surname>
          </string-name>
          , ConvMixer example, https://keras.io/examples/vision/convmixer/,
          <year>2023</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          [70]
          <string-name>
            <given-names>I. O.</given-names>
            <surname>Tolstikhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keysers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          , et al.,
          <article-title>Mlp-mixer: An all-mlp architecture for vision</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>24261</fpage>
          -
          <lpage>24272</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>