<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Medical Image Interpretation with Large Multimodal Models Notebook for the CS_Morgan Lab at CLEF 2024</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mahmudul</forename><surname>Hoque</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<addrLine>1700 East Cold Spring Lane</addrLine>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><roleName>Md</roleName><forename type="first">Rakibul</forename><surname>Hasan</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<addrLine>1700 East Cold Spring Lane</addrLine>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><roleName>Md. Ismail</roleName><forename type="first">Siddiqi</forename><surname>Emon</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<addrLine>1700 East Cold Spring Lane</addrLine>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fahmi</forename><surname>Khalifa</surname></persName>
							<email>fahmi.khalifa@morgan.edu</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Electrical and Computer Engineering Department</orgName>
								<orgName type="department" key="dep2">School of Engineering</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>MD</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Md</forename><forename type="middle">Mahmudur</forename><surname>Rahman</surname></persName>
							<email>md.rahman@morgan.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">Morgan State University</orgName>
								<address>
									<addrLine>1700 East Cold Spring Lane</addrLine>
									<postCode>21251</postCode>
									<settlement>Baltimore</settlement>
									<region>Maryland</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Medical Image Interpretation with Large Multimodal Models Notebook for the CS_Morgan Lab at CLEF 2024</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AF3B7F39A523C17AE78B7B423610F44E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Multimodal Models</term>
					<term>Vision Language Models</term>
					<term>Transformer</term>
					<term>Large Language and Vision Assistant</term>
					<term>Caption Prediction</term>
					<term>Concept Detection</term>
					<term>Medical Images</term>
					<term>Low-Rank Adaptation</term>
					<term>Quantization</term>
					<term>Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS</term>
					<term>Vision Generative Pre-trained Transformer 2</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This working note documents the participation of CS_Morgan in the ImageCLEFmedical 2024 Caption subtasks, focusing on Caption Prediction and Concept Detection challenges. The primary objectives included training, validating, and testing multimodal Artificial Intelligence (AI) models intended to automate the process of generating captions and identifying multi-concepts of radiology images. The dataset used is a subset of the Radiology Objects in COntext version 2 (ROCOv2) dataset and contains image-caption pairs and corresponding Unified Medical Language System (UMLS) concepts. To address the caption prediction challenge, different variants of the Large Language and Vision Assistant (LLaVA) models were experimented with, tailoring them for the medical domain. Additionally, a lightweight Large Multimodal Model (LMM) and MoonDream2, a small Vision Language Model (VLM), were explored. The former is the instruct variant of the Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS (IDEFICS) 9B obtained through quantization. Besides LMMs, conventional encoder-decoder models like Vision Generative Pre-trained Transformer 2 (visionGPT2) and Convolutional Neural Network-Transformer (CNN-Transformer) architectures were considered. Consequently, this enabled 10 submissions for the caption prediction task, with the first submission of LLaVA 1.6 on the Mistral 7B weights securing the 2nd position among the participants. This model was adapted using 40.1M parameters and achieved the best performance on the test data across the performance metrics of BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682), CIDEr (0.245029), and RefCLIPScore (0.815534). For the concept detection task, our single submission based on the ConvMixer architecture, a hybrid approach leveraging CNN and Transformer advantages, ranked 9th with an F1-score of 0.107645. 
Overall, the evaluations on the test data for the caption prediction task submissions suggest that LMMs, quantized LMMs, and small VLMs, when adapted and selectively fine-tuned using fewer parameters, have ample potential for understanding medical concepts present in images.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The tasks of automatic caption generation and multi-label prediction from medical images have become crucial for improving healthcare due to the growing availability of medical images from different modalities like X-radiation (X-ray), Computed Tomography (CT), Positron Emission Tomography (PET), Magnetic Resonance Imaging (MRI), and Ultrasound (US), as well as the significant advancements in the computing power of modern graphics processing units <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. The increasing need for diagnostic radiology services and the lack of report writing expertise in many medical facilities highlight the need for automating the mentioned tasks. As a result, extensive applications of recently developed AI models have been found in these domains. As an active research area of AI, combining large language models (LLMs) with vision capabilities allows users to explore emergent abilities using multimodal data, which is being popularized as LMMs or VLMs <ref type="bibr" target="#b3">[4]</ref>. For example, LLaVA <ref type="bibr" target="#b4">[5]</ref>, Flamingo <ref type="bibr" target="#b5">[6]</ref>, and Contrastive Language-Image Pretraining (CLIP) <ref type="bibr" target="#b6">[7]</ref> have shown remarkable performance in various vision-text tasks. Consequently, there is also potential for applying LLMs in the biomedical imaging field <ref type="bibr" target="#b7">[8]</ref>. These models are trained on extensive databases of human knowledge, demonstrating remarkable capabilities in offering valuable insights to physicians and healthcare professionals <ref type="bibr" target="#b8">[9]</ref>. 
Utilizing knowledge from millions to billions of training examples, VLMs can help detect minor abnormalities in low-resolution radiology images that are difficult to spot with the naked eye <ref type="bibr" target="#b9">[10]</ref>. Moreover, pre-trained LLMs like ChatGPT-4 <ref type="bibr" target="#b10">[11]</ref> exhibit emergent abilities on tasks they were not specifically trained for (i.e., the vision-language domain) <ref type="bibr" target="#b11">[12]</ref>. Models like BiomedCLIP <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, ChatDoctor <ref type="bibr" target="#b14">[15]</ref>, and GatorTron <ref type="bibr" target="#b15">[16]</ref>, which are pretrained on high-quality medical datasets, offer more useful applications for medical domain users. This working note demonstrates various multimodal models that were initially pretrained on multimodal image-instruction pairs from diverse sources. This approach enabled competitive results in this competition, which involved analyzing medical images such as brain MRI, chest X-ray, and PET scans.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Objectives</head><p>For the ImageCLEFmedical Caption 2024 <ref type="bibr" target="#b16">[17]</ref> challenge, CS_Morgan, a participant in the competition, was tasked with developing solutions to automatically predict captions and identify multi-label concepts of radiology images from the ROCOv2 <ref type="bibr" target="#b17">[18]</ref> dataset. Considering the tasks, the objectives include the following:</p><p>• Concept Detection <ref type="bibr" target="#b18">[19]</ref>: This task involved identifying and locating relevant concepts in the specified dataset. It formed the foundation for scene understanding and was essential for context-based image and information retrieval. The evaluation was conducted using metrics like the F1-score. • Caption Prediction <ref type="bibr" target="#b18">[19]</ref>: This task focused on predicting coherent captions for the entire image test dataset using the detected concepts and their interactions within the image, providing insights into the interplay of visual elements. Evaluation metrics for this task consisted of BERTScore (as a primary approach), ROUGE (as a secondary approach), BLEU-1, BLEURT, METEOR, CIDEr, CLIPScore, RefCLIPScore, ClinicalBLEURT, and MedBERTScore.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dataset</head><p>The dataset for both tasks comprised curated images from ROCOv2 <ref type="bibr" target="#b17">[18]</ref>, an updated version of the original ROCO <ref type="bibr" target="#b19">[20]</ref> dataset. The medical images were collected from biomedical articles in the PMC Open Access subset and were accompanied by corresponding captions and concepts, the latter expressed using UMLS <ref type="bibr" target="#b20">[21]</ref> terms. The training, validation, and test sets contained 70,108, 9,972, and 17,237 radiology images, respectively, with the average image dimensions being 600×600. Accordingly, for the deep learning models implemented here, larger images were resized to that average dimension, and smaller images were padded to achieve a uniform distribution of image dimensions. Furthermore, the caption length for each image averaged 100 or fewer words (without punctuation) or tokens. By analyzing both training and validation image-caption pairs, 42,121 unique words (excluding punctuation) were found and used as the vocabulary of the implemented models. Additionally, 1,944 unique CUIs were found in the concept lists of the training and validation images, of which 1,934 were listed in the CUI mapping file.</p></div>
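The exact preprocessing pipeline is not spelled out above; as a minimal sketch of the resize-and-pad step using Pillow (the function name, the black border, and centering the smaller image are illustrative assumptions):

```python
from PIL import Image

TARGET = 600  # average ROCOv2 image dimension reported above

def resize_or_pad(img: Image.Image, target: int = TARGET) -> Image.Image:
    """Downscale images larger than target x target, then pad smaller
    ones with a black border so every image ends up target x target."""
    if img.width > target or img.height > target:
        img = img.copy()
        img.thumbnail((target, target))  # preserves aspect ratio
    canvas = Image.new(img.mode, (target, target))  # black background
    canvas.paste(img, ((target - img.width) // 2, (target - img.height) // 2))
    return canvas
```

Either branch yields a uniform 600×600 tensor shape for the downstream vision encoders.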
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Large Multimodal Models (LMMs)</head><p>LMMs, as an extended variation of LLMs, mark a major leap forward in AI by handling and comprehending various data types, including text, images, audio, and video <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24]</ref>. By integrating and interpreting information from these diverse sources, LMMs achieve a holistic understanding of complex data <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>. This capability allows them to perform sophisticated tasks, such as image captioning, visual question answering, and content recommendation, by leveraging the relationships between different data types <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>. Figure <ref type="figure" target="#fig_0">1</ref> demonstrates the theoretical architecture of LMMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Pre-training and Fine-tuning of LMMs</head><p>During pre-training, the model is initially trained on vast and diverse datasets, enabling it to learn general representations before being fine-tuned for specific tasks. This involves utilizing large-scale datasets that include various modalities <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26]</ref>. For instance, models like ViLBERT <ref type="bibr" target="#b26">[27]</ref> have been pre-trained on extensive image-text pairs to improve their performance in downstream tasks like image captioning and visual question answering <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26]</ref>. Fine-tuning LMMs involves adjusting all pre-trained model parameters to enhance performance on specific tasks, such as image captioning. This process is computationally intensive and resource-demanding, especially for models with billions of parameters. Despite these challenges, the full fine-tuning technique remains popular due to its potential for achieving high accuracy. For instance, models like BLIP-2 <ref type="bibr" target="#b27">[28]</ref> and InstructBLIP <ref type="bibr" target="#b28">[29]</ref> have demonstrated enhancements in image captioning tasks through full fine-tuning, utilizing their extensive pre-training on large datasets to adapt to specific tasks. However, the substantial computational and memory requirements make full fine-tuning impractical for many applications, leading to the exploration of more efficient fine-tuning methods.</p><p>As a result, Parameter-Efficient Fine-Tuning (PEFT) <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b30">31]</ref> presents a more efficient approach compared to full fine-tuning by modifying only a small portion of the model's parameters while leaving the majority unchanged. 
This strategy substantially decreases computational and memory demands, making it suitable for a variety of applications. In the domain of image captioning, PEFT techniques have proven effective with models such as mPLUG <ref type="bibr" target="#b31">[32]</ref> and LLaVA <ref type="bibr" target="#b4">[5]</ref>. Notably, approaches like Low-Rank Adaptation (LoRA) <ref type="bibr" target="#b32">[33]</ref> have been particularly successful in fine-tuning. LoRA optimizes a matrix of updates to the pre-trained model weights rather than directly modifying them. This update matrix is decomposed into two smaller, lower-rank matrices, reducing the number of parameters that need updating while preserving the original weights <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34]</ref>. This allows different task-specific LoRAs to be easily swapped, effectively tailoring the pre-trained model for various applications. LoRA matches the performance of the full fine-tuning technique by updating a small number of additional weights, preventing catastrophic forgetting, and enabling better generalization with limited data <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34]</ref>. Figure <ref type="figure" target="#fig_1">2</ref> compares the approaches of LoRA and linear projection techniques. Figure <ref type="figure" target="#fig_1">2</ref> indicates that the LoRA approach involves two matrices, 𝐴 and 𝐵. The matrix 𝐴 is the first step in the adaptation process, projecting high-dimensional input features into a lower-dimensional latent space. Typically, its shape includes two values: rank and original dimension (e.g., 32 and 4096). The matrix 𝐵 is the second component, mapping the lower-dimensional features back to the original high-dimensional space, effectively reversing the reduction performed by the matrix 𝐴, so the shape becomes [4096, 32]. 
Both 𝐴 and 𝐵 matrices are trainable and updated during fine-tuning. LoRA focuses on specific weight matrices within the model, for example, the query, key, and value matrices in Transformer <ref type="bibr" target="#b34">[35]</ref> architectures. However, traditional Transformers are hindered by their slow performance and high memory consumption, particularly with long sequences, due to the quadratic time and memory complexity of self-attention. Flash Attention <ref type="bibr" target="#b35">[36]</ref> addresses these issues with an IO-aware exact attention algorithm that utilizes tiling to reduce the number of memory reads and writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM.</p><p>Visual instruction tuning <ref type="bibr" target="#b4">[5]</ref> enhances LMMs by fine-tuning them with instructions that combine visual and textual data. This technique uses machine-generated instruction-following data to improve the model's zero-shot and few-shot performance on new tasks. For example, the LLaVA <ref type="bibr" target="#b36">[37]</ref> model integrates a vision encoder with an LLM for general-purpose visual and language understanding. The process involves generating detailed, context-aware language-image instructions using a language-only model like GPT-4. This data is then used to train the LMM, enabling it to perform tasks such as image captioning, visual question answering, and detailed image descriptions.</p></div>
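As a concrete illustration of the decomposition just described (the dimensions follow the [4096, 32] example above; initializing 𝐴 randomly and 𝐵 to zero is the scheme from the LoRA paper, and the alpha/r scaling is an assumption matching common implementations):

```python
import numpy as np

d, r, alpha = 4096, 32, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d), dtype=np.float32)         # frozen pre-trained weight
A = rng.standard_normal((r, d), dtype=np.float32) * 0.01  # down-projection, shape [32, 4096]
B = np.zeros((d, r), dtype=np.float32)                    # up-projection, shape [4096, 32]

# Effective weight after adaptation; only A and B are ever updated.
W_adapted = W + (alpha / r) * (B @ A)

full_params = W.size           # parameters touched by full fine-tuning
lora_params = A.size + B.size  # parameters touched by LoRA
```

Because 𝐵 starts at zero, the adapted weight is initially identical to the pre-trained one, and the 2 × 32 × 4096 LoRA parameters amount to well under 1% of the 4096² entries in 𝑊.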
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Large Language and Vision Assistant (LLaVA)</head><p>LLaVA <ref type="bibr" target="#b36">[37,</ref><ref type="bibr" target="#b37">38]</ref> stands as a comprehensive, end-to-end trained multimodal model that seamlessly merges a vision encoder and an LLM to facilitate broad-ranging visual and language comprehension (see Figure <ref type="figure" target="#fig_2">3</ref>). The vision encoder is tasked with processing input images (𝑋𝑣) and transforming them into a series of feature representations (𝑍𝑣). Situated above the vision encoder is the Projection (𝑊), functioning as a vital conduit between the vision encoder and the language model. The projection matrix facilitates the conversion of feature representations (𝑍𝑣) from the vision encoder into a compatible format (𝐻𝑣) for the language model. On the right side of the diagram, the Language Instruction input (𝑋𝑞) represents the textual component that the model must comprehend and respond to in conjunction with the visual input. This input undergoes processing by the language model, generating its own set of feature representations (𝐻𝑞). The Language Model (𝑓𝜑) (e.g., Vicuna 7B <ref type="bibr" target="#b38">[39,</ref><ref type="bibr" target="#b39">40]</ref> or Mistral 7B <ref type="bibr" target="#b40">[41,</ref><ref type="bibr" target="#b41">42]</ref> in this working note) ingests both the projected vision features (𝐻𝑣) and the language features (𝐻𝑞), seamlessly integrating them to produce a Language Response (𝑋𝑎). The resulting output constitutes a coherent response incorporating elements from both visual and textual inputs. Figure <ref type="figure" target="#fig_2">3</ref> shows the basic architecture of LLaVA and demonstrates its working principles.</p></div>
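The dataflow just described can be sketched with placeholder networks; all dimensions here are illustrative assumptions (e.g., 576 patch tokens of width 1024 from a CLIP-style encoder at 336 px and a 4096-wide LLM embedding space), not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_model = 1024, 4096   # vision feature width, LLM hidden size
n_patches, n_tokens = 576, 16    # image patch tokens, instruction tokens

def vision_encoder(X_v):
    """Stand-in for the real vision encoder: raw image -> patch features Z_v."""
    return rng.standard_normal((n_patches, d_vision))

W = rng.standard_normal((d_vision, d_model)) * 0.01  # projection matrix W

def f_phi(H):
    """Stand-in for the LLM: fused token sequence -> response X_a."""
    return H.mean(axis=0)  # placeholder for autoregressive decoding

X_v = np.zeros((336, 336, 3))                     # input image
Z_v = vision_encoder(X_v)                         # [576, 1024] patch features
H_v = Z_v @ W                                     # [576, 4096] LLM-compatible tokens
H_q = rng.standard_normal((n_tokens, d_model))    # embedded instruction X_q
X_a = f_phi(np.concatenate([H_v, H_q], axis=0))   # language response
```

The point is the wiring, not the weights: vision features are linearly projected into the language model's embedding space and simply concatenated with the instruction tokens.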
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">LLaVA-v.1.6-Vicuna-7B</head><p>The components of the Vicuna 7B <ref type="bibr" target="#b38">[39,</ref><ref type="bibr" target="#b39">40]</ref> language model are illustrated in Figure <ref type="figure" target="#fig_4">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">LLaVA-v.1.6-Mistral-7B</head><p>The LLaVA v.1.6 Mistral 7B <ref type="bibr" target="#b42">[43]</ref> model integrates several key components for its functionality (see Figure <ref type="figure" target="#fig_5">5</ref>). At its core is the vision encoder, utilizing a pre-trained CLIP ViT-L/14 <ref type="bibr" target="#b43">[44]</ref> to extract visual embeddings from high-resolution images. This encoder processes visual input, converting it into a format compatible with the language model. The language model itself is based on the Mistral-7B architecture, which inherently incorporates advanced features like Sliding Window Attention and Grouped-Query Attention, enhancing its capability to manage long sequences and improve inference efficiency <ref type="bibr" target="#b40">[41,</ref><ref type="bibr" target="#b41">42]</ref>. Additionally, a two-layer MLP projection matrix is employed to map the visual embeddings from the vision encoder into the same embedding space as the language model, ensuring seamless integration of visual and textual information. The CLIP ViT-L/14 <ref type="bibr" target="#b43">[44]</ref>, a large Vision Transformer (ViT) operating on 14×14 image patches, is renowned for its ability to handle complex visual tasks, contributing to the model's overall performance.</p></div>
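A minimal sketch of such a two-layer MLP projector; the widths, the GELU activation, and the random weights are assumptions based on common LLaVA configurations rather than values from the paper:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_vision, d_model = 1024, 4096  # CLIP feature width, Mistral hidden size (assumed)
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_vision, d_model)).astype(np.float32) * 0.01
b1 = np.zeros(d_model, dtype=np.float32)
W2 = rng.standard_normal((d_model, d_model)).astype(np.float32) * 0.01
b2 = np.zeros(d_model, dtype=np.float32)

def mm_projector(z):
    """Map CLIP patch embeddings into the language model's embedding space."""
    return gelu(z @ W1 + b1) @ W2 + b2

patches = rng.standard_normal((576, d_vision)).astype(np.float32)
projected = mm_projector(patches)  # one LLM-space token per image patch
```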
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Caption Prediction Task</head><p>To address the caption prediction task, the CS_Morgan team fine-tuned several LMMs that were pretrained using extensive standard datasets from the field of computer vision. These models were derived from well-known LLMs commonly utilized in Natural Language Processing (NLP). Ten submissions were made, and the technical details, methods, and approaches of these submissions are described in the following sections. Moreover, the reproducible code for the following submissions is available at <ref type="bibr" target="#b44">[45]</ref>.</p><p>Before any tasks were performed, the dataset was pre-processed to ensure that it was clean and correctly formatted. Beyond the initial image-text pre-processing described earlier, the training, validation, and testing datasets were structured for generating captions to meet the input requirements of the corresponding vision-language models. Furthermore, the dataset was managed using the Hugging Face Hub. Computational details can be found in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Submission 1: Selective fine-tuning of LLaVA-v.1.6-Mistral-7B</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1.">Model Description</head><p>For this submission, the pre-trained LLaVA 1.6 on Mistral 7B weights was loaded using Mistral-7B-Instruct-v.0.2 as the base LLM and flash attention was used to optimize attention mechanism computations. To enhance training stability, all float16 instances of the Vision Tower model were replaced with bfloat16. Additionally, prompts were set up by combining images and texts using the "mistral_instruct" conversation mode.</p><p>For efficient fine-tuning, LoRA was applied to specific layers, configuring it with a rank r = 16, an alpha (lora_alpha) of 32, and a dropout rate of 0.05. The query, key, and value projection layers in the self-attention mechanisms of the Mistral Decoder Layer, as well as the projection layers in the MLP, were specifically targeted. In the vision model, LoRA was applied to the linear projection layers within the self-attention mechanism (CLIP attention) of the encoder layers in the CLIP encoder. This resulted in 40,108,032 trainable parameters, about 0.527% of the model's total parameters. The LoRA components included lora_A, lora_B, and lora_dropout representing the low-rank projection to a smaller dimension, projection back to the original dimension, and a parameter to prevent overfitting, respectively.</p></div>
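Using the Hugging Face peft library, the adapter setup described above might look roughly as follows. This is a hypothetical reconstruction, not the authors' script: peft matches target_modules by name suffix, and the exact module names depend on the LLaVA implementation used.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,               # rank of the update matrices
    lora_alpha=32,      # scaling factor for the adapter path
    lora_dropout=0.05,  # regularization to prevent overfitting
    target_modules=[
        # Self-attention projections, matched by suffix in both the Mistral
        # decoder layers and the CLIP vision encoder layers.
        "q_proj", "k_proj", "v_proj",
        # MLP projection layers of the decoder (assumed naming).
        "gate_proj", "up_proj", "down_proj",
    ],
)
# model = get_peft_model(base_model, lora_config)
# model.print_trainable_parameters()  # reported: ~40.1M trainable, ~0.527% of total
```

Each wrapped layer then carries the lora_A, lora_B, and lora_dropout components mentioned above.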
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2.">Training Process</head><p>The training process involved setting up a Data Loader for the dataset, ensuring images and text inputs were properly loaded. Custom callbacks were defined for printing the best checkpoint and implementing early stopping. Key training parameters included a learning rate of 1e-4, bfloat16 precision, and the AdamW <ref type="bibr" target="#b45">[46]</ref> optimizer. Each device processed batches of 4, with gradient accumulation steps of 8. Evaluations and saves were performed every 1,095 steps, with the training capped at 21,900 steps (10 epochs). Early stopping was set with a patience of 5 steps and a threshold of 0.01, monitoring evaluation loss (where lower values are better). Training was halted at 9,855 steps, and the best model, saved at 4,380 steps, was reloaded at the end. For evaluation, caption generation was configured with a temperature of 1.0, a beam width of 1, and a maximum of 512 new tokens. Figure <ref type="figure" target="#fig_6">6</ref> depicts the training and validation loss over the steps. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Submission 2: Additional fine-tuning of LLaVA-v.1.6-Mistral-7B Model</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1.">Model Description</head><p>The second submission was built upon the first one by fine-tuning a larger portion of the model using the same pattern. This included an expanded application of LoRA to better utilize the model's capacity for more accurate and robust predictions. The fine-tuning involved additional layers to enhance learning and improve visual-textual alignment. Specifically, output projection layers such as o_proj in the Mistral Decoder Layer's self-attention mechanism and out_proj in the vision model were included to better capture complex relationships within the data, which is essential for tasks like image captioning. Targeting the multimodal projector layers (mm_projector.0 and mm_projector.2) enhanced the alignment of visual and textual representations, which is crucial for multimodal tasks. Despite the increased number of trainable parameters (98,467,840 compared to 40,108,032), this expansion represented only a small fraction (1.285%) of the total model parameters, maintaining parameter efficiency while improving learning capabilities. LoRA was configured with a rank r = 32, lora_alpha of 32, and a dropout rate of 0.05. Various layers were targeted in the Mistral Decoder Layers, including query projection (q_proj), key projection (k_proj), value projection (v_proj), and output projection (o_proj) in the self-attention mechanism, as well as gate projection (gate_proj), up projection (up_proj), and down projection (down_proj) in the MLP components.</p><p>In the CLIP Vision Model, LoRA was applied to similar projection layers in the attention mechanism and the fully connected layers (fc1 and fc2) in the MLP. Additionally, the multimodal projector layers (mm_projector.0 and mm_projector.2) were included to further enhance the model's capabilities. These modifications were applied to the LLaVA-v.1.6 model and its pre-trained checkpoints on Mistral-7B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2.">Training Process</head><p>The training configuration included a learning rate of 1e-5, using the AdamW <ref type="bibr" target="#b45">[46]</ref> optimizer, bfloat16 precision, and Flash Attention enabled. Each device handled a batch size of 4, with gradient accumulation steps set to 8. The model underwent training for a maximum of 8,760 steps (4 epochs), with checkpoints and evaluations performed every 548 steps. Early stopping parameters were defined with a patience of 4 and a threshold of 0.01, monitoring the evaluation loss to select the best model, with lower values being preferable. Training was halted early at 3,836 steps, and the model saved at this point was considered the best and subsequently loaded. For evaluation, specifically for generating captions, parameters were set with a temperature of 1.0, beam width of 1, and a maximum of 100 new tokens.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Submission 3: Hybrid fine-tuning of LLaVA-v.1.6-Mistral-7B</head><p>This submission was built on the previous one, maintaining the same general pattern but altering which layers were fine-tuned and the fine-tuning strategy itself. The fine-tuning strategy was multifaceted, employing LoRA to adapt key components, such as attention mechanism projections, MLP components, and multimodal projector layers. Additionally, the language model's head (lm_head) and the embedding tokens (embed_tokens) were explicitly set as trainable parameters to further enable these parts of the model to learn and adapt to the task. This hybrid approach leveraged the strengths of both LoRA adapters and traditional fine-tuning. Fine-tuning the lm_head allowed the model to better tailor its output generation to specific tasks or datasets, which was particularly important for generating appropriate language or captions from medical images. On the other hand, fine-tuning the embed_tokens layer helped the model learn better representations of input tokens, improving overall performance, especially when the input data distribution differed from the pre-training data.</p><p>In this configuration, LoRA was set with a rank r = 32, and the lora_alpha was calculated as 32 × √32 to stabilize training and enhance low-rank adaptation performance. This scaling factor normalized the learning rate for LoRA parameters based on rank, ensuring effective updates without causing gradient explosion or vanishing gradients. A dropout rate of 0.05 was applied to prevent overfitting and maintain generalization ability.</p><p>For layers explicitly set as trainable, the lm_head was a linear layer that mapped hidden states to the vocabulary space, generating the final output logits for each token. This layer was crucial for the model's text generation capability. 
The embed_tokens layer converted input token indices into dense vectors, providing initial representations of the input tokens essential for the model to process the input text. Both the lm_head and embed_tokens layers had their full weights fine-tuned, in addition to the LoRA adapters.</p><p>Overall, this hybrid fine-tuning approach combined LoRA fine-tuning for attention, MLP, and multimodal projection layers with full weight fine-tuning of the lm_head and embed_tokens layers. The total number of trainable parameters was 350,650,368 out of 7,654,729,728 total parameters, making up 4.581% of the parameters.</p></div>
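In peft terms, this hybrid setup corresponds to combining LoRA target_modules with modules_to_save, which keeps full trainable copies of the named layers alongside the adapters. This is a hypothetical sketch under that assumption; module names depend on the implementation used.

```python
import math
from peft import LoraConfig

hybrid_config = LoraConfig(
    r=32,
    lora_alpha=32 * math.sqrt(32),  # rank-scaled alpha described above (~181.02)
    lora_dropout=0.05,
    # Adapter-tuned layers: attention, MLP, and multimodal projections (assumed names).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Fully fine-tuned layers, trained in addition to the LoRA adapters.
    modules_to_save=["lm_head", "embed_tokens"],
)

# Reported figures: 350,650,368 trainable of 7,654,729,728 total parameters,
# i.e. 350_650_368 / 7_654_729_728 ~ 4.581% of the model.
```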
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1.">Training Process</head><p>The training arguments included a learning rate of 1e-5, the AdamW <ref type="bibr" target="#b45">[46]</ref> optimizer, bfloat16 precision, Flash Attention, per-device batch sizes of 4, and gradient accumulation steps of 8. The model was trained for a maximum of 6,570 steps (3 epochs), with checkpoints and evaluations performed every 548 steps. Gradient checkpointing was enabled using a re-entrant approach to reduce memory usage. Early stopping was configured with a patience of 3 and a threshold of 0.01, monitoring evaluation loss (with lower values being better). Early stopping was triggered at 3,836 steps, at which point the best model was saved and later loaded. For evaluation and caption generation, the parameters were set to a temperature of 1.0, num_beams of 1, and max_new_tokens of 100.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Submission 4: Selective Fine-tuning of LLaVA-v.1.6-Vicuna-7B</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.1.">Model Description</head><p>For this submission, the pre-trained multimodal language model was loaded from the LLaVA-v.1.6-Vicuna-7B checkpoints, which used lmsys/vicuna-7b-v1.5 as the base LLM. The model preparation involved configuring LoRA with a rank (r) of 16, a lora_alpha of 32, and a dropout rate of 0.05. The target modules for LoRA were expanded to include the query (q_proj), key (k_proj), and value (v_proj) projections within the self-attention mechanism of the LLaMA Decoder Layer, as well as the gate (gate_proj), up (up_proj), and down (down_proj) projections in the MLP components of the same layer. Additionally, in the CLIP Vision Model's CLIP Encoder layers, the key (k_proj), value (v_proj), and query (q_proj) projections, along with the first (fc1) and second (fc2) fully connected layers of the CLIP MLP, were targeted. Furthermore, the multimodal projector layers (mm_projector.0 and mm_projector.2) were included. This expanded application of LoRA resulted in 34,422,784 trainable parameters out of a total of 7,097,329,664 parameters, constituting approximately 0.485% of the model's parameters.</p></div>
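The LoRA update applied to each targeted projection (e.g. q_proj) can be written as y = Wx + (α/r)·B(Ax), where only A and B are trained. A pure-Python toy sketch with tiny made-up matrices follows; the submission above uses r = 16 and lora_alpha = 32, giving a scaling factor of 32/16 = 2.0.

```python
# Pure-Python sketch of the LoRA forward pass for one targeted linear layer.
# W is frozen; only the low-rank factors A and B are trained.

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, r, alpha):
    base = matvec(W, x)               # frozen pre-trained projection
    delta = matvec(B, matvec(A, x))   # low-rank trained correction
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# 2-d toy example with rank 1 (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity)
A = [[1.0, 1.0]]               # A: r x d_in  (1 x 2)
B = [[0.5], [0.0]]             # B: d_out x r (2 x 1)
y = lora_forward(W, A, B, [2.0, 3.0], r=1, alpha=2)
print(y)  # [2.0 + 2*0.5*5.0, 3.0] = [7.0, 3.0]
```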
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.2.">Training Process</head><p>The training process involved setting up a Data Loader for the dataset and inspecting batches to ensure correct loading of images and text inputs. Custom callbacks were created for printing the best checkpoint and enabling early stopping. The training used a learning rate of 1e-4, bfloat16 precision, Flash Attention, the AdamW optimizer, batch sizes of 4 per device, and gradient accumulation steps of 8, with evaluation and save steps every 548 steps. The model was trained for a maximum of 10,950 steps (5 epochs), with early stopping configured with a patience of 3 and a threshold of 0.01. The evaluation loss was monitored to select the best model, with lower values being preferable. Early stopping occurred at 4,932 steps, and the best model, saved at 4,384 steps, was loaded at the end. For generating captions during evaluation, parameters included a temperature of 1.0, num_beams set to 1, and a maximum of 512 new tokens.</p></div>
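The batch configuration above (per-device batch size 4 with 8 gradient accumulation steps) implies an effective batch of 32 between optimizer updates. A small sketch of that bookkeeping, under no assumptions beyond the stated numbers:

```python
# Sketch: gradient accumulation turns several micro-batches into one
# optimizer step. Per-device batch 4 x 8 accumulation steps = effective 32.

def optimizer_step_indices(num_micro_batches: int, accum_steps: int):
    """Return the micro-batch indices after which an optimizer step occurs."""
    return [i for i in range(1, num_micro_batches + 1) if i % accum_steps == 0]

per_device_batch, accum = 4, 8
effective_batch = per_device_batch * accum
print(effective_batch)                    # 32
print(optimizer_step_indices(24, accum))  # steps after micro-batches 8, 16, 24
```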
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Submission 5: Hybrid Fine-tuning of LLaVA-v.1.6-Vicuna-7B</head><p>For this submission, the same approach as in the third submission was followed; the only difference was the use of the Vicuna LLM. The total number of trainable parameters was 346,718,208 out of 7,147,481,088 total parameters (4.851% of parameters). The training process was similar to that of the previous submission, except that the maximum token limit was set to 150. The model was trained for a maximum of 10,950 steps (5 epochs), with early stopping configured with a patience of 5 and a threshold of 0.01. Early stopping occurred at 6,576 steps, and the best model, saved at 4,384 steps, was loaded for evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.">Submission 6: Selective Fine-tuning of LLaVA-v.1.5-7B</head><p>The LLaVA 1.5 7B shares a similar architecture with that of LLaVA-v.1.6 Vicuna-7B. LLaVA 1.5 checkpoints on 7B parameters were loaded, and the expanded use of LoRA resulted in 84,574,208 trainable parameters out of a total of 7,147,476,992, constituting approximately 1.183% of the model's parameters. Precision was adjusted from float16 to bfloat16 to enhance computational efficiency, and Flash attention was not enabled in this submission. Instead, LLaMA Scaled Dot-Product Attention (SDPA) was utilized in the 32 layers of the LLaMA Decoder Layer. LoRA was configured with a rank of 32, lora_alpha of 32, and a dropout rate of 0.05. Target modules for LoRA included various projections in LLaMA Decoder Layer, MLP components, and attention mechanisms within CLIP Vision Model. The training process involved creating a Data Loader, defining custom callbacks for early stopping and checkpoint printing, and setting training arguments such as a learning rate of 1e-5, and AdamW optimizer. Training was conducted for a maximum of 8760 steps with early stopping triggered at 4,672 steps, saving the best model. For evaluation, parameters included temperature = 1.0, num_beams = 1, and max_new_tokens = 100.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7.">Submission 7: Adaptation of MoonDream2</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7.1.">Model Description</head><p>Moondream2, a small vision language model designed for efficient operation on edge devices, was evaluated on the ImageCLEF 2024 dataset using pre-trained weights from Huggingface <ref type="bibr" target="#b46">[47,</ref><ref type="bibr" target="#b47">48]</ref>. These weights were initialized from Sigmoid Loss for Language-Image Pre-Training (SigLIP) and Phi-1.5 models. Phi-1.5 <ref type="bibr" target="#b48">[49]</ref>, developed by Microsoft Research, is a compact Transformer-based language model with 24 layers, 32 heads (each with a dimension of 64), rotary embeddings, a rotary dimension of 32, a context length of 2,048, and flash-attention. SigLIP <ref type="bibr" target="#b49">[50]</ref>, an enhancement of the CLIP model, replaces the softmax loss with a pairwise sigmoid loss, operating on image-text pairs without global normalization. SigLIP's architecture includes a ViT <ref type="bibr" target="#b50">[51]</ref> backbone that processes image patches through a transformer encoder and a classification head with an MLP using Gaussian Error Linear Unit (GELU) activation for final predictions. Moreover, the pre-processing included resizing, type conversion, and normalization. This architecture effectively combined visual and textual processing for caption generation.</p><p>LoRA was configured with an alpha (lora_alpha) of 32, which adjusts the learning rate for low-rank matrices, and a rank (lora_rank) of 64 for the adaptation process. It was applied to specific linear layers in both the vision encoder and the text model. In the vision encoder, LoRA targeted the projection layers (proj) and fully connected layers (fc1 and fc2) within the 27 ViTBlock components. Additionally, LoRA was applied to the fc1 and fc2 layers in the multimodal projection layer, a custom module integrated to adapt the projection layer for the purpose of the caption prediction challenge. 
In the language model, LoRA targeted the Wqkv, out_proj, fc1, and fc2 layers within the 24 Phi Decoder Layer components. Wqkv in the Phi Decoder Layer represents the combined weights for the self-attention mechanism's linear projections (query, key, and value). With LoRA applied, the model had 74,422,272 trainable parameters, which was about 3.850% of the total parameters (1,931,904,880).</p></div>
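The fused Wqkv projection mentioned above computes query, key, and value with one linear layer whose output is then split into three equal chunks. A minimal sketch, with an illustrative hidden size:

```python
# Sketch: splitting a fused QKV output, as in Phi's Wqkv layer, into its
# query, key, and value parts. Dimensions are toy values for illustration.

def split_qkv(fused, hidden_size):
    """Split a fused QKV vector of length 3 * hidden_size into q, k, v."""
    assert len(fused) == 3 * hidden_size
    q = fused[:hidden_size]
    k = fused[hidden_size:2 * hidden_size]
    v = fused[2 * hidden_size:]
    return q, k, v

q, k, v = split_qkv(list(range(12)), hidden_size=4)
print(q, k, v)  # three contiguous chunks of the fused projection output
```

Applying LoRA to the single Wqkv layer therefore adapts all three projections at once, unlike the separate q_proj/k_proj/v_proj targets used in the LLaVA submissions.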
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7.2.">Training Process</head><p>The training process employed various key parameters and strategies to optimize the model's performance. The number of image tokens was set to 729, aligning with text tokens. Training spanned 10 epochs over 40,000 steps, using a batch size of 8 and gradient accumulation steps of 4, with evaluation after each epoch. An early stopping mechanism with a patience of 6 epochs and a minimum delta of 0.0001 monitored validation loss to prevent overfitting. Data loading and batching utilized PyTorch's DataLoader with custom collation for images and text tokens, pre-processed and padded for uniform sequence lengths. Gradient accumulation steps set to 4 simulated a larger batch size for better GPU memory management. The Adam8bit optimizer from the bitsandbytes library, with a dynamic learning rate adjusted via a cosine schedule, was used. Loss computation combined image and text embeddings, processed by the Phi language model. The training loop iterated over epochs and batches, updating parameters post-gradient accumulation and checking validation loss for early stopping. LoRA parameters were optimized with an initial learning rate of 3e-6, scaled by a factor of 4, balancing exploration and convergence. This approach, along with gradient accumulation, enhanced resource use and fine-tuning efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8.">Submission 8: Selective Fine-tuning of 4-bit Quantized IDEFICS-9B-Instruct</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8.1.">Model Description</head><p>IDEFICS <ref type="bibr" target="#b51">[52]</ref> is an advanced multimodal model developed by Hugging Face for integrated image and text processing tasks. The model combines the vision model CLIP ViT-H/14 <ref type="bibr" target="#b52">[53]</ref> and the language model LLaMA 7B <ref type="bibr" target="#b53">[54]</ref>, incorporating novel transformer blocks to connect these modalities. Trained on extensive datasets, including OBELICS, Wikipedia, LAION, and PMD, the IDEFICS 9B Instruct variant is fine-tuned on supervised and instruction datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IDEFICS 9B Instruct</head><p>The lightweight IDEFICS 9B Instruct variant was explored using 4-bit quantization to reduce model size and computational requirements while maintaining performance. bitsandbytes (BnB) quantization assigns 4-bit precision to the model using double quantization with the normalized floating-point format (NF4) and bfloat16 precision for computations, which is crucial for running large language models on smaller devices. For fine-tuning IDEFICS 9B Instruct on the ImageCLEF dataset, the checkpoint HuggingFaceM4/idefics-9b-instruct was specified to load the pre-trained model with 4-bit quantization using the BitsAndBytesConfig class.</p><p>LoRA was applied to the query projection (q_proj), key projection (k_proj), and value projection (v_proj) in both the ViT and decoder layers, as well as the perceiver attention and gated cross-attention layers. However, the output projection (o_proj and out_proj) in the decoder, gated cross-attention, and perceiver attention layers did not use LoRA but remained as standard Linear4bit layers. This selective application of LoRA allowed for efficient fine-tuning by reducing the number of trainable parameters specifically within the attention mechanisms while leaving other projections, like the output projection layers, unmodified.</p></div>
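The blockwise idea behind 4-bit quantization can be illustrated with a round-trip sketch. Note this uses uniform integer levels purely for clarity; the actual NF4 format used by bitsandbytes quantizes to 16 non-uniform levels derived from a normal distribution and additionally double-quantizes the per-block scales.

```python
# Illustrative (uniform) 4-bit quantization round-trip with a per-block
# absmax scale. Real NF4 uses non-uniform levels; this only shows the
# blockwise absmax idea.

def quantize_4bit(block):
    """Map floats to 4-bit integers in [-8, 7] using a per-block scale."""
    absmax = max(abs(v) for v in block)
    scale = absmax / 7 if absmax else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return codes, scale

def dequantize_4bit(codes, scale):
    return [c * scale for c in codes]

weights = [0.7, -0.35, 0.1, 0.0]          # toy weight block
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
print(codes)     # small integers, each storable in 4 bits
print(restored)  # approximately the original weights
```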
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.8.2.">Training Process</head><p>Custom callbacks were defined for printing the best checkpoint and early stopping. The training arguments included a learning rate of 1e-4, the AdamW optimizer, batch sizes of 2 per device for training and evaluation, gradient accumulation steps of 8, and evaluation and save steps every 500 steps. The model was trained for a maximum of 8762 steps (2 epochs). Early stopping parameters were set with a patience of 6 and a threshold of 0.001. Evaluation loss was monitored to select the best model, with lower values being better. Early stopping was triggered at 8,000 steps, and the best model, saved at 8,000 steps, was loaded at the end of training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.9.">Submission 9: VisionGPT2</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.9.1.">Model Description</head><p>The Encoder-Decoder model was designed to take an image as input and generate a descriptive caption as output. In this model, the Encoder was a ViT <ref type="bibr" target="#b54">[55,</ref><ref type="bibr" target="#b50">51]</ref> that processed the input image and extracted meaningful features. These features were then fed into the Decoder, which is based on GPT-2 <ref type="bibr" target="#b55">[56]</ref>, a powerful language model that generates the corresponding textual caption. For fine-tuning the model, the Hugging Face Seq2SeqTrainer <ref type="bibr" target="#b56">[57]</ref> was employed. This trainer, part of the Hugging Face transformers library, is specifically designed to handle sequence-to-sequence tasks, making it well-suited for this image captioning model. The fine-tuning process leverages the transformers library to adapt the pre-trained ViT and GPT-2 models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.9.2.">Training Process</head><p>Initially, the pre-trained layers were frozen to focus training on the cross-attention layers. In subsequent epochs, GPT-2 was unfrozen and trained, and in the final few epochs, the ViT was also unfrozen. The Adam optimizer and the One Cycle Learning Rate (OneCycleLR) scheduler were used for optimization. Mixed-precision fp16 training was employed with autocast and GradScaler in PyTorch. The training metrics were cross-entropy loss and perplexity, both of which were minimized. The best model was saved based on validation perplexity and was loaded during caption generation.</p></div>
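The staged unfreezing schedule described above can be sketched as a simple epoch-dependent rule. The epoch boundaries below are assumptions for illustration; the paper does not state the exact epochs at which GPT-2 and the ViT were unfrozen.

```python
# Sketch: staged unfreezing -- cross-attention first, then GPT-2, then the
# ViT encoder. Epoch boundaries (3 and 6) are assumed for illustration.

def trainable_groups(epoch: int):
    groups = {"cross_attention"}   # always trained
    if epoch >= 3:                 # assumed boundary: unfreeze GPT-2
        groups.add("gpt2")
    if epoch >= 6:                 # assumed boundary: unfreeze the ViT
        groups.add("vit")
    return groups

print(trainable_groups(0))  # cross-attention only
print(trainable_groups(4))  # cross-attention + GPT-2
print(trainable_groups(7))  # everything unfrozen
```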
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.10.">Submission 10: CNN-Transformer Fusion Model</head><p>The CNN-Transformer fusion model for this submission was built around three core models. First, the pre-trained ChexNet <ref type="bibr" target="#b57">[58]</ref> (a DenseNet121 backbone based CNN model) was used to extract features from the input images. These features captured essential visual information and were then passed to the second component, a Transformer Encoder <ref type="bibr" target="#b58">[59]</ref>. The Transformer-based encoder processed the extracted image features to generate a new, more informative representation of the inputs. Finally, the third component, a Transformer Decoder <ref type="bibr" target="#b58">[59]</ref>, took the output from the encoder along with the text data (sequences). The decoder used these inputs to learn and generate the corresponding image captions, completing the image-to-text translation process. The hyper-parameters for the model included an embedding dimension set to 512 and an initial learning rate of 0.0001. The encoder used a single attention head, while the decoder utilized two attention heads to process the information. For early stopping, the patience level was set to 5, meaning the training process halted if there was no improvement in validation loss after five epochs.</p></div>
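The operation at the heart of both the Transformer encoder and decoder above is scaled dot-product attention. A single-head, pure-Python sketch follows (the encoder in this submission uses one head); the vectors are toy values, not real features.

```python
import math

# Pure-Python sketch of single-head scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Q, K, V: lists of vectors. Returns one output vector per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                   # one query
K = [[1.0, 0.0], [0.0, 1.0]]       # two keys
V = [[1.0, 2.0], [3.0, 4.0]]       # two values
out = attention(Q, K, V)
print(out)  # a weighted mix of the two value vectors, biased toward the first
```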
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.11.">Performance Measurement Metrics for the Caption Prediction Task</head><p>The performance of all the submissions regarding the caption generation task was evaluated using the following metrics.</p><p>• BERTScore <ref type="bibr" target="#b59">[60]</ref> evaluates text generation by computing the similarity between BERT embeddings of the candidate and reference sentences, capturing semantic meaning better than traditional metrics. • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) <ref type="bibr" target="#b60">[61]</ref> is a set of metrics for evaluating automatic summarization and machine translation by comparing overlap in n-grams, word sequences, and word pairs between the candidate and reference texts. • BLEU (Bilingual Evaluation Understudy) <ref type="bibr" target="#b61">[62]</ref> is a precision-based metric for evaluating machine translation quality by comparing n-grams of the candidate translation to those of the reference translation. BLEU-1 specifically considers unigram matches. • BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) <ref type="bibr" target="#b62">[63]</ref> is a learned evaluation metric for natural language generation that uses pre-trained transformers fine-tuned on a variety of supervised and unsupervised signals to predict human judgment scores. • METEOR (Metric for Evaluation of Translation with Explicit ORdering) <ref type="bibr" target="#b63">[64]</ref> evaluates machine translation by considering precision, recall, stemming, synonymy, and alignment, aiming to improve correlation with human judgment. • CIDEr (Consensus-based Image Description Evaluation) <ref type="bibr" target="#b64">[65]</ref> is a metric for evaluating image captioning by comparing candidate captions to reference captions using TF-IDF weighting and n-gram similarity, ensuring relevance and importance of the words are considered. 
• CLIPScore <ref type="bibr" target="#b65">[66]</ref> is an evaluation metric that uses the CLIP model to compare image and text similarity. It measures the alignment between visual content and textual descriptions, providing a score based on their embedding similarity. • RefCLIPScore <ref type="bibr" target="#b65">[66]</ref> is an extension of CLIPScore that includes a reference-based evaluation, incorporating both the similarity of the generated text to a reference text and the similarity between the image and the generated text. • ClinicalBLEURT <ref type="bibr" target="#b66">[67]</ref> adapts BLEURT for clinical text generation, fine-tuning it on clinical datasets to better evaluate the quality and relevance of generated clinical text against reference clinical text. • MedBERTScore <ref type="bibr" target="#b66">[67]</ref> adapts BERTScore for the medical domain, using BERT embeddings specifically fine-tuned on medical texts to provide a more accurate evaluation of medical text generation tasks.</p></div>
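The unigram ideas behind two of the metrics above can be illustrated directly: BLEU-1 is essentially a unigram precision over the candidate, while ROUGE-1 recall measures how much of the reference is recovered. Clipping, brevity penalties, and stemming are omitted, so this is an illustration rather than the official scorers; the sentences are invented.

```python
# Sketch: unigram precision (BLEU-1-like) and unigram recall (ROUGE-1-like)
# between a candidate caption and a reference. No clipping or brevity penalty.

def unigram_precision(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), set(reference.split())
    return sum(1 for w in cand if w in ref) / len(cand)

def unigram_recall(candidate: str, reference: str) -> float:
    cand, ref = set(candidate.split()), reference.split()
    return sum(1 for w in ref if w in cand) / len(ref)

ref = "x-ray of the pelvis showing multiple fractures"
cand = "x-ray of the pelvis showing a mass"
print(unigram_precision(cand, ref))  # 5 of the 7 candidate words appear in the reference
print(unigram_recall(cand, ref))     # 5 of the 7 reference words are recovered
```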
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.12.">Results and Discussion on Caption Prediction Submissions</head><p>In this year's evaluation for the ImageCLEF task, BERTScore <ref type="bibr" target="#b18">[19]</ref> was the primary metric used to assess the quality of the generated captions, with ROUGE <ref type="bibr" target="#b18">[19]</ref> as the secondary metric. Table <ref type="table" target="#tab_0">1</ref> shows the results of the submissions in terms of the primary performance metrics. In addition to BERTScore and ROUGE, some other performance metrics were also adopted to assess submission results. These metrics are BLEU-1, BLEURT, METEOR, CIDEr, CLIPScore, RefCLIPScore, ClinicalBLEURT, and MedBERTScore. Table <ref type="table" target="#tab_1">2</ref> shows the results of the additional performance metrics beyond the BERTScore and ROUGE used for the caption prediction task. In both tables, the submissions are listed according to the BERTScore (highest to lowest). Our results indicate that LMMs, when selectively fine-tuned with fewer parameters, can achieve high performance. Additionally, LMMs obtained through quantization and smaller VLMs can maintain competitive performance in medical image understanding and caption generation. From Tables <ref type="table" target="#tab_1">1 and 2</ref>, it is evident that four different submissions outperformed the others in terms of the pre-specified performance measurement metrics. Submission 1, which fine-tuned 40.1M parameters of the LLaVA-v1.6-Mistral-7B model using the LoRA technique, achieved the highest scores across several key metrics: BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682), CIDEr (0.245029), and RefCLIPScore (0.815534). 
Submission 3, also using the LLaVA-v.1.6-Mistral-7B model with a hybrid LoRA fine-tuning approach (350.6M parameters), attained the highest CLIPScore of 0.824171, indicating an improved semantic match between the generated captions and the visual content of the medical images. Submission 10, the CNN-Transformer fusion approach (pre-trained CheXNet as the encoder and a Transformer as the decoder), performed better than the other submissions in terms of the ClinicalBLEURT score of 0.676905. Finally, Submission 8, which used IDEFICS-9B-Instruct quantized to 4-bit, excelled in capturing relevant biomedical concepts compared to other submissions, achieving the highest MedBERTScore of 0.657460. Overall, the first submission can be regarded as the top performer because it achieved the highest scores in both the primary and secondary metrics. Figure <ref type="figure" target="#fig_7">7</ref> shows the comparison of the submissions in terms of the primary and secondary metrics. The significance of these submissions lies in their demonstration of advanced fine-tuning techniques and model performance optimization in the context of generative models. These findings highlight the evolving landscape of model fine-tuning strategies, advocating for resource-efficient methods that maintain or enhance performance. This is crucial for practical and scalable AI deployments across diverse medical applications. In addition to the above-mentioned submissions, Submission 4, utilizing the LLaVA v.1.6 Vicuna 7B with selective fine-tuning using LoRA (34.4M parameters), demonstrated well-balanced performance and closely followed Submission 1. Moreover, Submissions 3 and 2, both based on the LLaVA v.1.6 Mistral 7B model but with different approaches for optimization, closely followed Submission 4 in terms of BERTScore and ROUGE. 
However, the sixth submission, based on the LLaVA v.1.5 7B variant, could not outperform the LLaVA v.1.6 variants, except for LLaVA v.1.6 Vicuna 7B with hybrid fine-tuning using the LoRA technique (Submission 5). The experiment with MoonDream2 (74.4M fine-tuned parameters) in Submission 7 showed competitive performance on the test data relative to the larger models across multiple metrics. Submissions 9 and 10 were based on pre-trained Transformer-based encoder-decoder models rather than LMMs. VisionGPT2 outperformed the conventional pre-trained CheXNet-Transformer (CNN-Transformer) based model in every metric except ClinicalBLEURT. Table <ref type="table" target="#tab_2">3</ref> shows the generated captions for a test image (ID: ImageCLEFmedical_Caption_2024_test_000016) corresponding to the submissions made for the caption prediction task. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">CNN-Transformer Fusion Model</head><p>Bone defect detected in the axillary region.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Concept Detection Task</head><p>This year, the CS_Morgan team made a single submission for the concept detection task. The submission involved the implementation of the ConvMixer <ref type="bibr" target="#b67">[68,</ref><ref type="bibr" target="#b68">69]</ref> model, which combines the CNN and Transformer architectures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Model Description</head><p>ConvMixer <ref type="bibr" target="#b67">[68,</ref><ref type="bibr" target="#b68">69]</ref> closely resembles the MLP-Mixer <ref type="bibr" target="#b69">[70]</ref> model, with key differences in its architecture. Instead of fully-connected layers, ConvMixer employs standard convolution layers. It uses batch normalization rather than the layer normalization typically used in ViT <ref type="bibr" target="#b50">[51]</ref> and MLP-Mixers <ref type="bibr" target="#b69">[70]</ref>. ConvMixer utilizes two types of convolution layers: depth-wise convolutions for mixing spatial locations of the images and point-wise convolutions, following the depth-wise convolutions, for mixing channel-wise information across the patches. Additionally, ConvMixer uses larger kernel sizes to achieve a larger receptive field. Figure <ref type="figure" target="#fig_8">8</ref> shows the corresponding architecture of the model. </p></div>
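The split between depth-wise and point-wise convolutions above can be made concrete by comparing their parameter counts. The channel count and kernel size below are illustrative assumptions, not the submission's exact configuration.

```python
# Sketch: parameter counts for the two convolution types in a ConvMixer
# block (biases ignored). A depthwise conv applies one k x k filter per
# channel (spatial mixing); a pointwise 1 x 1 conv mixes across channels.

def depthwise_params(channels: int, kernel: int) -> int:
    return channels * kernel * kernel

def pointwise_params(in_channels: int, out_channels: int) -> int:
    return in_channels * out_channels

# Illustrative ConvMixer-style block: 256 channels, large 9x9 depthwise kernel.
c, k = 256, 9
print(depthwise_params(c, k))   # 256 * 81 = 20,736
print(pointwise_params(c, c))   # 256 * 256 = 65,536
```

Even with a large 9×9 kernel, the depthwise stage stays cheap because each filter touches only one channel; the channel mixing is concentrated in the pointwise stage.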
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Training and Result</head><p>By implementing this model, an F1-score of 0.107645 was attained on the test data, placing it in ninth position among the participants for the concept detection task. This score indicates that the model's performance in terms of precision and recall is relatively low, as the F1-score represents the harmonic mean of precision and recall, providing a single metric that balances both. The score suggests that the model struggled to correctly identify and classify the relevant instances among the 1,944 classes, leading to a high number of false positives, false negatives, or both. This low score reflects room for improvement in the model's ability to accurately predict the target labels. For a test image (ID: ImageCLEFmedical_Caption_2024_test_000016), the predicted concepts or CUIs based on this ConvMixer model were C0030797, C0000726, and C1306645, whereas the ground truth concepts were C1306645, C0030797, and C0034014 (See Figure <ref type="figure" target="#fig_10">9</ref>). </p></div>
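The F1 computation described above can be sketched for a single image using the predicted and ground-truth CUIs reported for the test example (note the challenge's official score aggregates over all test images, so this per-image illustration will not reproduce 0.107645):

```python
# Sketch: per-image F1 (harmonic mean of precision and recall) over
# predicted vs. ground-truth CUI sets, using the example from the text.

def f1_score(predicted: set, truth: set) -> float:
    tp = len(predicted & truth)          # concepts predicted correctly
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

predicted = {"C0030797", "C0000726", "C1306645"}
truth = {"C1306645", "C0030797", "C0034014"}
print(f1_score(predicted, truth))  # 2 of 3 concepts match on each side: F1 = 2/3
```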
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>For the Caption Prediction task, submitted models included LLaVA v.1.6 with Mistral 7B and Vicuna 7B checkpoints, as well as the LLaVA v.1.5 7B model. Additionally, a 4-bit quantized instruct variant of the IDEFICS 9B model and MoonDream2, a compact VLM, were explored. Two fine-tuning strategies, selective and hybrid fine-tuning, were utilized. Furthermore, traditional encoder-decoder models like VisionGPT2 and CNN-Transformer architectures were also experimented with. Among these, the top-performing submission was the selective training of the LoRA projectors (40.1M parameters) on the LLaVA 1.6 model with Mistral 7B weights. For the Concept Detection subtask, a single model based on the ConvMixer architecture was submitted, which combines the strengths of CNNs and Transformers.</p><p>In future research, the primary aim will be to incorporate Explainable AI and reinforcement learning. Explainable AI will enhance model safety and reliability by identifying potential failures and undesirable actions in LMMs. Reinforcement learning, using context-aware reward modeling, will integrate detailed medical image concepts to improve content understanding and performance in multimodal tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Theoretical architecture of large multimodal models.</figDesc><graphic coords="3,138.45,183.15,315.89,94.43" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Comparison of approaches relevant to LoRA and linear projection techniques.</figDesc><graphic coords="4,138.45,65.60,315.90,186.33" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Architecture and working principle of LLaVA [37].</figDesc><graphic coords="5,206.14,90.71,180.51,187.62" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>): (a) Embedding Layer -Converts input tokens into dense vectors with an embedding dimension of 4,096, (b) Decoder Layers -Consists of 32 LLaMA-based Decoder Layer instances, where each layer includes a self-attention mechanism, a Multi-layer Perceptron (MLP) using Sigmoid Linear Unit (SiLU) activation, and Root Mean Square (RMS) normalization layers applied before and after the attention mechanisms, and (c) Final Normalization Layer -A RMS normalization layer applied to the final output of the decoder layers. The model supports input image resolutions of 672×672, 336×1344, and 1344×336, enhancing visual detail comprehension.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Major components and corresponding layers of LLaVA 1.6 Vicuna 7B model.</figDesc><graphic coords="5,72.00,472.92,451.28,183.54" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Important components and respective processing layers of LLaVA 1.6 Mistral 7B.</figDesc><graphic coords="6,72.00,196.91,451.28,200.57" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Training and validation loss of submission 1-LLaVA-v.1.6-Mistral-7B with LoRA for selective fine-tuning.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Comparison of the submissions in terms of BERTScore and ROUGE values.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Key components and layers of the ConvMixer model used for the concept detection task submission.</figDesc><graphic coords="16,161.01,65.61,270.77,130.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head></head><label></label><figDesc>The training process involved developing a ConvMixer model designed for classification or concept detection task with 1,944 unique CUIs. The model was built using TensorFlow and Keras, with key components including an initial rescaling layer, a patch extraction stem, and a series of ConvMixer blocks. The model utilized GELU activations and batch normalization for better performance. The architecture included a global average pooling layer followed by a dense output layer with a sigmoid activation function. Training was conducted over 200 epochs with a batch size of 8, a learning rate of 0.001, and a weight decay of 0.0001. The Adam optimizer was used for training, and the binary cross-entropy loss function was chosen for the multi-label classification task. Performance metrics such as accuracy, precision, recall, and Area Under the Curve (AUC) were tracked during training. However, only the F1-score was reported for the submission. A model checkpoint callback was implemented to save the best model based on validation accuracy. After training, the model was evaluated using the best checkpointed weights.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Ground Truth Concept CUIs: [C1306645, C0030797, C0034014] and Predicted Concept CUIs: [C0030797, C0000726, C1306645] of the test image ID: ImageCLEFmedical_Caption_2024_test_000016 (CC BY [Munihire et al. (2023)])</figDesc><graphic coords="16,228.70,573.77,135.40,133.41" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Submission Results for the Caption Prediction Task -Primary Scores</figDesc><table><row><cell>Submission ID</cell><cell>Model</cell><cell># of Parameters Trained</cell><cell>% of Parameters Trained</cell><cell>BERTScore</cell><cell>ROUGE</cell></row><row><cell>1</cell><cell>LLaVA-v.1.6 Mistral-7B</cell><cell>40108032</cell><cell>0.527</cell><cell>0.628059</cell><cell>0.250801</cell></row><row><cell>4</cell><cell>LLaVA-v.1.6-Vicuna-7B</cell><cell>34422784</cell><cell>0.485</cell><cell>0.625402</cell><cell>0.245398</cell></row><row><cell>3</cell><cell>LLaVA-v.1.6-Mistral-7B</cell><cell>350650368</cell><cell>4.581</cell><cell>0.624988</cell><cell>0.243983</cell></row><row><cell>2</cell><cell>LLaVA-v.1.6-Mistral-7B</cell><cell>98467840</cell><cell>1.285</cell><cell>0.622964</cell><cell>0.238009</cell></row><row><cell>8</cell><cell>IDEFICS-9B-Instruct</cell><cell>21061632</cell><cell>0.235</cell><cell>0.621052</cell><cell>0.229319</cell></row><row><cell>6</cell><cell>LLaVA-v.1.5-7B</cell><cell>84574208</cell><cell>1.183</cell><cell>0.617342</cell><cell>0.217850</cell></row><row><cell>7</cell><cell>MoonDream2</cell><cell>74422272</cell><cell>3.852</cell><cell>0.616561</cell><cell>0.215981</cell></row><row><cell>5</cell><cell>LLaVA-v.1.6-Vicuna-7B</cell><cell>346718208</cell><cell>4.851</cell><cell>0.615692</cell><cell>0.223682</cell></row><row><cell>9</cell><cell>VisionGPT2</cell><cell>28366848</cell><cell>13.493</cell><cell>0.545773</cell><cell>0.118446</cell></row><row><cell>10</cell><cell>CNN-Transformer Fusion Model</cell><cell>9053056</cell><cell>99.084</cell><cell>0.414342</cell><cell>0.044218</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Submission Results for the Caption Prediction Task -Secondary Scores</figDesc><table><row><cell>Submission ID</cell><cell>BLEU-1</cell><cell>BLEURT</cell><cell>METEOR</cell><cell>CIDEr</cell><cell>CLIPScore</cell><cell>RefCLIPScore</cell><cell>ClinicalBLEURT</cell><cell>MedBERTScore</cell></row><row><cell>1</cell><cell>0.209298</cell><cell>0.317385</cell><cell>0.092682</cell><cell>0.245029</cell><cell>0.821262</cell><cell>0.815534</cell><cell>0.455942</cell><cell>0.632664</cell></row><row><cell>4</cell><cell>0.207555</cell><cell>0.316524</cell><cell>0.089231</cell><cell>0.224142</cell><cell>0.820785</cell><cell>0.814251</cell><cell>0.443495</cell><cell>0.631529</cell></row><row><cell>3</cell><cell>0.204902</cell><cell>0.315257</cell><cell>0.089844</cell><cell>0.219909</cell><cell>0.824171</cell><cell>0.814689</cell><cell>0.443766</cell><cell>0.630013</cell></row><row><cell>2</cell><cell>0.195061</cell><cell>0.309630</cell><cell>0.085367</cell><cell>0.203407</cell><cell>0.822694</cell><cell>0.812071</cell><cell>0.435760</cell><cell>0.629846</cell></row><row><cell>8</cell><cell>0.154041</cell><cell>0.296429</cell><cell>0.077370</cell><cell>0.191725</cell><cell>0.811816</cell><cell>0.807033</cell><cell>0.443966</cell><cell>0.657460</cell></row><row><cell>6</cell><cell>0.155887</cell><cell>0.297628</cell><cell>0.072998</cell><cell>0.170832</cell><cell>0.816577</cell><cell>0.806713</cell><cell>0.448721</cell><cell>0.626198</cell></row><row><cell>7</cell><cell>0.182720</cell><cell>0.305801</cell><cell>0.076002</cell><cell>0.161918</cell><cell>0.815116</cell><cell>0.807100</cell><cell>0.453388</cell><cell>0.624334</cell></row><row><cell>5</cell><cell>0.174058</cell><cell>0.300482</cell><cell>0.076997</cell><cell>0.172990</cell><cell>0.819258</cell><cell>0.807451</cell><cell>0.433863</cell><cell>0.624581</cell></row><row><cell>9</cell><cell>0.102386</cell><cell>0.244725</cell><cell>0.035134</cell><cell>0.028805</cell><cell>0.685257</cell><cell>0.692432</cell><cell>0.450116</cell><cell>0.556308</cell></row><row><cell>10</cell><cell>0.028921</cell><cell>0.261376</cell><cell>0.019912</cell><cell>0.003447</cell><cell>0.666524</cell><cell>0.669843</cell><cell>0.676905</cell><cell>0.406249</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Model predicted captions corresponding to the submissions for the test image ID: ImageCLEFmedical_Caption_2024_test_000016 (Ground Truth Caption: Complex pelvic fracture with widening and displacement of the pubis rami and widening of the S1 joints bilaterally: arrows shows multiple fractures in the pelvis.)</figDesc><table><row><cell>Submission ID</cell><cell>Model</cell><cell>Predicted Caption</cell></row><row><cell>1</cell><cell>LLaVA-v.1.6-Mistral-7B</cell><cell>Anteroposterior radiograph of the pelvis showing a sacral fracture (yellow arrows) and a pubic fracture (yellow arrowhead).</cell></row><row><cell>4</cell><cell>LLaVA-v.1.6-Vicuna-7B</cell><cell>X-ray of the pelvis showing bilateral sacroiliitis (yellow arrows) and bilateral pubic symphysis (yellow arrowheads).</cell></row><row><cell>3</cell><cell>LLaVA-v.1.6-Mistral-7B</cell><cell>X-ray of the pelvis showing a large pelvic mass (arrows).</cell></row><row><cell>2</cell><cell>LLaVA-v.1.6-Mistral-7B</cell><cell>Plain radiograph of the pelvis showing a large pelvic mass (yellow arrows) with a large right-sided pelvic hematoma.</cell></row><row><cell>8</cell><cell>IDEFICS-9B-Instruct</cell><cell>X-ray of the pelvis showing the presence of a foreign body in the bladder (yellow arrow) and the presence of a foreign body in the rectum.</cell></row><row><cell>6</cell><cell>LLaVA-v.1.5-7B</cell><cell>X-ray of the pelvis showing the fracture of the right pubis.</cell></row><row><cell>7</cell><cell>MoonDream2</cell><cell>Anteroposterior radiograph of the pelvis showing a right-sided sacroiliitis.</cell></row><row><cell>5</cell><cell>LLaVA-v.1.6-Vicuna-7B</cell><cell>X-ray of the pelvis showing the fracture of the right ilium (yellow arrows).</cell></row><row><cell>9</cell><cell>VisionGPT2</cell><cell>CT scan of the chest. The CT scan showed a nodule in the right upper lobe.</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Acknowledgments</head><p>This work was supported by the National Science Foundation (NSF) grant (ID 2131307) "CISE-MSI: DP: IIS: III: Deep Learning-Based Automated Concept and Caption Generation of Medical Images Towards Developing an Effective Decision Support."</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Specifications of the Computational Environment</head><p>Two machines were used for the experiments. Their specifications are as follows.</p><p>• Machine 1: Machine type: a2-highgpu-2g (Accelerator Optimized: 2 NVIDIA Tesla A100 GPUs, 24 vCPUs, 170 GB RAM); GPU: 2 x NVIDIA A100-40GB; Boot disk: 1000 GB SSD; Data disk: 1000 GB SSD; Language: Python 3.12.x</p><p>• Machine 2: Machine type: n1-highmem-16 (16 vCPUs, 104 GB RAM); GPU: 2 x NVIDIA V100; Boot disk: 150 GB SSD; Data disk: 1000 GB SSD; Language: Python 3.12.x; Frameworks: PyTorch 2.x and TensorFlow 2.16.x</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. GitHub Repository</head><p><ref type="bibr" target="#b44">[45]</ref> links to the publicly available GitHub repository containing the reproducible code for the submissions made to this competition.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Automatic caption generation for medical images</title>
		<author>
			<persName><forename type="first">I</forename><surname>Allaouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ben Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Benamrou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ouardouz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd International Conference on Smart City Applications</title>
				<meeting>the 3rd International Conference on Smart City Applications</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey on automatic generation of medical imaging reports based on deep learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BioMedical Engineering OnLine</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page">48</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Image caption and medical report generation based on deep learning: a review and algorithm analysis</title>
		<author>
			<persName><forename type="first">R</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI), IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="373" to="379" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">M.-H</forename><surname>Van</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Verma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<idno>arXiv-2402</idno>
		<title level="m">On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Flamingo: a visual language model for few-shot learning</title>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Alayrac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Donahue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Luc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Miech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Barr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hasson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lenc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Millican</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reynolds</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="23716" to="23736" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The role of large language models in medical image processing: a narrative review</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Quantitative Imaging in Medicine and Surgery</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page">1108</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<idno>arXiv-2304</idno>
		<title level="m">Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Hartsock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rasool</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.02469</idno>
		<title level="m">Vision-language models for medical report generation and visual question answering: A review</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">GPT-4 Technical Report</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Achiam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Adler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Akkaya</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">Gpt-4 technical report</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Elhoseiny</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.10592</idno>
		<title level="m">Minigpt-4: Enhancing vision-language understanding with advanced large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs</title>
		<author>
			<persName><forename type="first">Sheng</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yanbo</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Naoto</forename><surname>Usuyama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hanwen</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jaspreet</forename><surname>Bagga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Tinn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sam</forename><surname>Preston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajesh</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mu</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Naveen</forename><surname>Valluri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cliff</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrea</forename><surname>Tupini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><surname>Mazzola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Swadheen</forename><surname>Shukla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lars</forename><surname>Liden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianfeng</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">P</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tristan</forename><surname>Naumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sheng</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hoifung</forename><surname>Poon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.00915</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usuyama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bagga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tinn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Preston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Valluri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wong</surname></persName>
		</author>
		<idno>arXiv-2303</idno>
		<title level="m">Large-scale domain-specific pretraining for biomedical vision-language processing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Chatdoctor: A medical chat model fine-tuned on a Large Language Model Meta-AI (LLAMA) using medical domain knowledge</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cureus</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A large language model for electronic health records</title>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Pournejatian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">C</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">E</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Parisien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Compas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Costa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Flores</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NPJ digital medicine</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">194</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEF 2024: Multimedia retrieval in medical applications</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drăgulinescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M G</forename><surname>Pakull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Damm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Andrei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Prokopchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karpenka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radzhabov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kovalev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Macaire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lecouteux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Esperança-Rodier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yetisgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Hicks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Riegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Thambawita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Storås</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Halvorsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Heinrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiesel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024)</title>
		<title level="s">Springer Lecture Notes in Computer Science LNCS</title>
		<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G S</forename><surname>De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Horn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nensa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2405.10004" />
		<idno type="arXiv">arXiv:2405.10004</idno>
		<title level="m">ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Overview of ImageCLEFmedical 2024 -Caption Prediction and Concept Detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">G</forename><surname>Seco De Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bloch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Brüngel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Idrissi-Yaghir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Damm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M G</forename><surname>Pakull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF2024 Working Notes, CEUR Workshop Proceedings</title>
				<meeting><address><addrLine>Grenoble, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Radiology Objects in Context (ROCO): a multimodal image dataset</title>
		<author>
			<persName><forename type="first">O</forename><surname>Pelka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rückert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nensa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop</title>
		<title level="s">Proceedings</title>
		<meeting><address><addrLine>LABELS; Granada, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018-09-16">September 16, 2018</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="180" to="189" />
		</imprint>
	</monogr>
	<note>Held in Conjunction with MICCAI 2018</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">The Unified Medical Language System (UMLS): integrating biomedical terminology</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bodenreider</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nucleic acids research</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="D267" to="D270" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.18223</idno>
		<title level="m">A survey of large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A survey on evaluation of large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1" to="45" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Multimodality representation learning: A survey on evolution, pretraining and its applications</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Manzoor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Albarri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Multimedia Computing, Communications and Applications</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="1" to="34" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Efficient large language models: A survey</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Alam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=bsCCJHbO8A" />
	</analytic>
	<monogr>
		<title level="j">Transactions on Machine Learning Research</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.10792</idno>
		<title level="m">Instruction tuning for large language models: A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="19730" to="19742" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">InstructBLIP: Towards general-purpose vision-language models with instruction tuning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M H</forename><surname>Tiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Muqeeth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mohta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Raffel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="1950" to="1965" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">On the effectiveness of parameter-efficient fine-tuning</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M.-C</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Collier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="12799" to="12807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Si</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.12005</idno>
		<title level="m">mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wallis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Allen-Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.09685</idno>
		<title level="m">LoRA: Low-Rank Adaptation of Large Language Models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.17513</idno>
		<title level="m">The Expressive Power of Low-Rank Adaptation</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">U</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><forename type="middle">V</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">30</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<ptr target="https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/flash_attention.html" />
		<title level="m">Flash Attention</title>
				<imprint>
			<date type="published" when="2024">2024. Accessed: 2024-05-28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Improved Baselines with Visual Instruction Tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="26296" to="26306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<ptr target="https://huggingface.co/docs/transformers/model_doc/llava" />
		<title level="m">Hugging Face Transformers Documentation: LLaVA</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<title level="m">Vicuna-7B v1.3</title>
		<author>
			<persName><surname>LMSYS</surname></persName>
		</author>
		<ptr target="https://huggingface.co/lmsys/vicuna-7b-v1.3" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>Hugging Face model hub</note>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Judging llm-as-a-judge with mt-bench and chatbot arena</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sablayrolles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bamford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Chaplot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bressand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lengyel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06825</idno>
		<title level="m">Mistral 7B</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<title level="m">Mistral-7B-v0.1</title>
		<author>
			<persName><surname>Mistral AI</surname></persName>
		</author>
		<ptr target="https://huggingface.co/mistralai/Mistral-7B-v0.1" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Hugging Face model hub</note>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<title level="m">LLaVA-v1.6-Mistral-7B</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Hugging Face model hub</note>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://huggingface.co/openai/clip-vit-large-patch14" />
		<title level="m">CLIP ViT-L/14 model</title>
				<imprint>
			<date type="published" when="2021">2021. Accessed: 2024-05-28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Hoque</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Hasan</surname></persName>
		</author>
		<ptr target="https://github.com/HoqueMahmudul/Medical-Image-Interpretation-with-Large-Multimodal-Models" />
		<title level="m">Medical image interpretation with large multimodal models</title>
				<imprint>
			<date type="published" when="2023">2023. Accessed: 2024-06-19</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<monogr>
		<title level="m" type="main">Decoupled weight decay regularization</title>
		<author>
			<persName><forename type="first">I</forename><surname>Loshchilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.05101</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m">moondream2</title>
		<author>
			<persName><surname>Vikhyatk</surname></persName>
		</author>
		<ptr target="https://huggingface.co/vikhyatk/moondream2" />
		<imprint>
			<date type="published" when="2024">2024</date>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">Transformers: State-of-the-art natural language processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Delangue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cistac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Louf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Funtowicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations</title>
				<meeting>the 2020 conference on empirical methods in natural language processing: system demonstrations</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="38" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bubeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Eldan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Giorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gunasekar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">T</forename><surname>Lee</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.05463</idno>
		<title level="m">Textbooks Are All You Need II: phi-1.5 technical report</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Sigmoid loss for language image pre-training</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mustafa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="11975" to="11986" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<monogr>
		<title level="m" type="main">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<ptr target="https://huggingface.co/google/vit-base-patch16-224" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Laurençon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Saulnier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tronchon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bekman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lozhkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Karamcheti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.16527</idno>
		<title level="m">Obelics: An open web-scale filtered dataset of interleaved image-text documents</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<monogr>
		<author>
			<orgName>LAION</orgName>
		</author>
		<ptr target="https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K" />
		<title level="m">CLIP-ViT-H-14-laion2B-s32B-b79K</title>
				<imprint>
			<date type="published" when="2023">2023. 2024-06-19</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.13971</idno>
		<title level="m">LLaMA: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b54">
	<monogr>
		<title level="m" type="main">Visual transformers: Token-based image representation and processing for computer vision</title>
		<author>
			<persName><forename type="first">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tomizuka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Keutzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vajda</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.03677</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b55">
	<monogr>
		<title level="m" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<monogr>
		<author>
			<orgName>Hugging Face</orgName>
		</author>
		<ptr target="https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainer" />
		<title level="m">Transformers documentation: Seq2SeqTrainer</title>
				<imprint>
			<date type="published" when="2023">2023. 2024-05-28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Irvin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bagul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Langlotz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shpanskaya</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.05225</idno>
		<title level="m">CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b58">
	<monogr>
		<title level="m" type="main">Image captioning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Chollet</surname></persName>
		</author>
		<ptr target="https://keras.io/examples/vision/image_captioning/" />
		<imprint>
			<date type="published" when="2023">2023. 2024-05-28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.09675</idno>
		<title level="m">BERTScore: Evaluating text generation with BERT</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<analytic>
		<title level="a" type="main">BLEU: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th annual meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th annual meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b62">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Parikh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.04696</idno>
		<title level="m">BLEURT: Learning Robust Metrics for Text Generation</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b63">
	<analytic>
		<title level="a" type="main">METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</title>
				<meeting>the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b64">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Vedantam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1411.5726</idno>
		<title level="m">CIDEr: Consensus-based Image Description Evaluation</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b65">
	<monogr>
		<title level="m" type="main">CLIPScore: A Reference-free Evaluation Metric for Image Captioning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hessel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Forbes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08718</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b66">
	<monogr>
		<title level="m" type="main">An Investigation of Evaluation Metrics for Automated Medical Note Generation</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Michalopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.17364</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b67">
	<monogr>
		<title level="m" type="main">Patches are all you need?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Trockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Kolter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2201.09792</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b68">
	<monogr>
		<author>
			<orgName>Keras</orgName>
		</author>
		<ptr target="https://keras.io/examples/vision/convmixer/" />
		<title level="m">ConvMixer example</title>
				<imprint>
			<date type="published" when="2023">2023. 2024-05-28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b69">
	<analytic>
		<title level="a" type="main">MLP-Mixer: An all-MLP architecture for vision</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">O</forename><surname>Tolstikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Houlsby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Steiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Keysers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="24261" to="24272" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
