<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Novel Evaluation Framework for Image2Text Generation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jia-Hong</forename><surname>Huang</surname></persName>
							<email>j.huang@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hongyi</forename><surname>Zhu</surname></persName>
							<email>h.zhu@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yixian</forename><surname>Shen</surname></persName>
							<email>y.shen@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stevan</forename><surname>Rudinac</surname></persName>
							<email>s.rudinac@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessio</forename><forename type="middle">M</forename><surname>Pacces</surname></persName>
							<email>a.m.pacces@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Evangelos</forename><surname>Kanoulas</surname></persName>
							<email>e.kanoulas@uva.nl</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="laboratory">LLM4Eval: The First Workshop on Large Language Models for Evaluation in Information Retrieval</orgName>
								<address>
									<addrLine>18 July 2024</addrLine>
									<settlement>Washington</settlement>
									<region>DC</region>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Novel Evaluation Framework for Image2Text Generation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3CF3B6DEB1070DD646729A0B4B0834A5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Image Captioning</term>
					<term>Metrics for Automated Evaluation</term>
					<term>Large Language Models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Evaluating the quality of automatically generated image descriptions is challenging, requiring metrics that capture various aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics like BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often show weak correlations with human judgment. We address this challenge by introducing a novel evaluation framework rooted in a modern large language model (LLM), such as GPT-4 or Gemini, capable of image generation. In our proposed framework, we begin by feeding an input image into a designated image captioning model, chosen for evaluation, to generate a textual description. Using this description, an LLM then creates a new image. By extracting features from both the original and LLM-created images, we measure their similarity using a designated similarity metric. A high similarity score suggests that the image captioning model has accurately generated textual descriptions, while a low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance. Human-annotated reference captions are not required in our proposed evaluation framework, which serves as a valuable tool for evaluating the effectiveness of image captioning models. Its efficacy is confirmed through human evaluation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The evaluation of sentences generated through automated methods remains a formidable challenge in the realm of image captioning. Current metrics for evaluating image descriptions aim to gauge multiple desirable attributes, such as grammaticality, covering crucial aspects, correctness, truthfulness, and more. Human evaluation plays a pivotal role in quantifying these properties, utilizing separate Likert scales or pairwise scales <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. However, due to the expensive, challenging-to-reproduce, and time-consuming nature of human studies, there is a growing need for automated evaluation measures. For practical utility, these automated metrics should align closely with human judgment. Therefore, the challenge in designing such an automatic metric lies in integrating the aforementioned diverse evaluations attributes into a unified measure of sentence quality.</p><p>Several automated metrics, including BLEU <ref type="bibr" target="#b6">[7]</ref>, ROUGE <ref type="bibr" target="#b7">[8]</ref>, METEOR <ref type="bibr" target="#b3">[4]</ref>, CIDEr <ref type="bibr" target="#b8">[9]</ref>, and more, have been introduced to assess image descriptions generated by automated approaches. BLEU, initially designed for machine translation, relies on precision, while ROUGE, originating from the summarization community, is a recall-based metric. METEOR is tailored for assessing the overall quality of image descriptions. Nonetheless, research has indicated a weak correlation between these metrics and human judgment <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12]</ref>. In contrast, the consensus-based metric CIDEr measures the similarity between a generated sentence and a set of ground truth sentences authored by humans, demonstrating high agreement with human consensus. However, preparing a set of ground truth sentences in advance is a prerequisite for CIDEr. If the quantity of human-authored ground truth sentences is insufficient, CIDEr may struggle to effectively evaluate image descriptions <ref type="bibr" target="#b8">[9]</ref>. A similar limitation is observed in To aid comprehension, we represent RNN-based methods with blue paths and transformer-based approaches with red paths. The process involves feeding an input image through an image encoder for feature extraction, followed by a language generator to produce text-based descriptions using the extracted image features.</p><p>the CLAIR method <ref type="bibr" target="#b12">[13]</ref> and other aforementioned approaches. Some metrics involve caption ranking <ref type="bibr" target="#b11">[12]</ref> but are limited in evaluating novel image descriptions.</p><p>In addressing the above challenge, we present a novel framework for evaluating image descriptions. This framework is rooted in the utilization of a modern LLM approach, e.g., GPT-4 <ref type="bibr" target="#b13">[14]</ref> or Gemini <ref type="bibr" target="#b14">[15]</ref>, capable of generating images. 
The advancement of LLMs <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>, exemplified by models like GPT-4, empowers us to provide textual descriptions, i.e., prompts, for generating images that closely align with the semantic meaning conveyed in the given text. The underlying design philosophy of the proposed framework hinges on the idea that if an image captioning model is validated as effective, the image description generated by the model should be sufficiently accurate for an LLM to reconstruct the same or a highly similar image compared to the original input image. The ongoing evolution of LLM technology forms the bedrock of the proposed framework.</p><p>Starting with the definition of the image captioning task, as illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, our proposed framework begins by taking an image as input. Subsequently, this input undergoes processing through a given image captioning model, generating a textual description for the initial image. Following this, a given LLM, such as GPT-4, is employed to generate an image based on the textual description. Then, we extract the image features from both the original input image and the LLM-generated image, and assess their similarity using the cosine similarity metric. It is worth noting that human-annotated reference captions are not needed in our proposed evaluation framework. In the proposed evaluation framework, a high cosine similarity score is anticipated if the generated text-based description is of sufficient quality, signifying that the LLM can accurately reproduce an image highly similar to the original input. Conversely, if the generated text-based description lacks accuracy, the image produced by the LLM will deviate from the original input image and lead to a low cosine similarity score. This incongruity suggests the suboptimal performance of the image captioning model. Consequently, the proposed framework proves valuable for evaluating the efficacy of a given image captioning model. The main contributions of this work are summarized as follows:</p><p>• Innovative Framework for Image Captioning Model Evaluation: We present a novel framework that relies on the utilization of an LLM, such as GPT-4 or Gemini, to evaluate the quality of image descriptions generated by an image captioning model. The proposed evaluation framework does not necessitate human-annotated reference captions. • Human Evaluation of the Framework: To verify the effectiveness of our evaluation framework, we introduce a human-annotated dataset and conduct human evaluations.</p><p>• Comprehensive Experiments on Established Datasets: We perform extensive experiments to demonstrate the efficacy of the proposed evaluation framework using widely-used image captioning datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In this section, we begin by reviewing existing related literature, covering topics such as the existing image captioning methods, the evolution of automated metrics, and the latest advancements in LLM technology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Image Captioning Methods</head><p>The encoder-decoder network architecture has become a cornerstone in the field of image captioning, as evidenced by various studies <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b29">30,</ref><ref type="bibr" target="#b30">31,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33,</ref><ref type="bibr" target="#b33">34]</ref>. Typically, these networks employ a CNN as the encoder for extracting global image features, and an RNN as the decoder for generating word sequences. <ref type="bibr" target="#b34">[35]</ref> introduces a method for generating referring expressions, which are descriptions for specific objects or regions within an image. In <ref type="bibr" target="#b35">[36]</ref>, the bidirectional LSTM-based method for image captioning takes advantage of both past and future information to learn long-term visual-language interactions. Attention mechanisms have significantly enhanced the performance of image captioning models. <ref type="bibr" target="#b36">[37]</ref> introduces an area-based attention model that predicts the next word and the corresponding image regions at each RNN timestep. While these advancements represent significant strides, they predominantly focus on single-image based description generation. However, certain abstract concepts or descriptions might not be fully captured using only image data <ref type="bibr" target="#b37">[38,</ref><ref type="bibr" target="#b38">39]</ref>. <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b26">27]</ref> have explored the use of expert-defined keyword sequences to augment model capabilities in generating more accurate and contextually relevant descriptions. Recent advancements have also explored transformer-based architectures, such as Vision Transformers (ViT), which have shown promise in capturing finer details and global context in images for caption generation <ref type="bibr" target="#b39">[40]</ref>. Furthermore, the integration of multimodal learning approaches, where models are trained on both visual and textual data, has led to significant improvements in generating contextually richer and more nuanced image descriptions <ref type="bibr" target="#b40">[41]</ref>. The domain of medical image captioning has witnessed significant advancements, particularly through methods that meld human expertise with algorithmic prowess. <ref type="bibr" target="#b41">[42]</ref> has developed a Hybrid Retrieval-Generation Reinforced Agent, which integrates human prior knowledge with AI-based caption generation for medical images. This agent alternates between a generative module and a retrieval mechanism that utilizes a template database reflecting human expertise, thereby producing multifaceted, sequential sentences. <ref type="bibr" target="#b38">[39]</ref> has contributed to this field with a multi-task learning framework that simultaneously predicts tags and generates captions. 
Their method, which focuses on abnormal areas in chest radiology images using an attention mechanism and a hierarchical LSTM, offers detailed descriptions. These methods primarily focus on generating reports for chest radiology images, which are structurally different in terms of object size and detail compared to retinal images <ref type="bibr" target="#b37">[38,</ref><ref type="bibr" target="#b42">43,</ref><ref type="bibr" target="#b26">27]</ref>. Additionally, the color features in chest radiology and retinal images differ significantly, with the former being predominantly grey-scale and the latter being colorful <ref type="bibr" target="#b37">[38,</ref><ref type="bibr" target="#b26">27]</ref>. Most existing methods rely primarily on the image input for caption generation. Recent advancements also include the enhancement of the CNN-RNN framework with the TransFuser model <ref type="bibr" target="#b27">[28]</ref>. This model adeptly combines features from different modalities and addresses the challenge of incorporating unordered keyword sequences with visual inputs, minimizing information loss <ref type="bibr" target="#b27">[28]</ref>. This development represents a significant stride in medical image captioning, reflecting the growing complexity and capability of these methods. Further progress in deep learning, particularly the application of ViTs, has offered promising results in medical imaging <ref type="bibr" target="#b43">[44]</ref>. ViTs excel in capturing intricate details and providing a broader context for more accurate medical image analysis and caption generation.</p><p>The evaluation framework proposed in this paper is versatile and capable of assessing any existing image captioning approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Automatic Metrics for Image Captioning</head><p>The evolution of image captioning has been significantly influenced by the development and application of automatic metrics for evaluating caption quality <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b44">45,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b45">46,</ref><ref type="bibr" target="#b46">47]</ref>. These metrics guide the training of captioning models and provide a scalable means for performance assessment. The BLEU score, a pioneering metric by <ref type="bibr" target="#b6">[7]</ref>, gauges n-gram precision in generated text against a reference. ROUGE, developed by <ref type="bibr" target="#b7">[8]</ref>, emphasizes recall through the overlap of N-grams and longest common subsequences. Subsequent innovations introduced refined approaches. METEOR, by <ref type="bibr" target="#b44">[45]</ref>, aligns more closely with human judgment by incorporating synonym matching and stemming.In <ref type="bibr" target="#b8">[9]</ref>, the CIDEr metric, specifically designed for image captioning, assesses the similarity of generated captions to a set of reference captions. The SPICE metric by <ref type="bibr" target="#b45">[46]</ref> evaluates semantic content and the depiction of objects, attributes, and relationships. Additionally, the NLG-Eval toolkit by <ref type="bibr" target="#b46">[47]</ref> provides a comprehensive suite of metrics for a more holistic evaluation of natural language generation. However, these metrics have limitations. Metrics like BLEU and ROUGE often fail to capture the contextual nuances of captions <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>. The challenge of evaluating creativity and novelty in caption generation is also evident, as automated metrics may penalize deviations from standard references <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b45">46]</ref>. Recently, advancements like BERTScore <ref type="bibr" target="#b47">[48]</ref> and CLIPScore <ref type="bibr" target="#b48">[49]</ref>, which utilize contextual embeddings and visual-textual alignment, respectively, have been proposed to address these challenges.</p><p>In this study, human evaluation is employed to validate the effectiveness of the proposed evaluation framework. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature Extraction Module Image Captioning Module</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Large Language Models</head><p>The advent of LLMs has significantly reshaped the landscape of natural language processing (NLP) and Artificial Intelligence (AI). Pioneering models such as GPT, developed by <ref type="bibr" target="#b13">[14]</ref>, and BERT by <ref type="bibr" target="#b49">[50]</ref>, have marked critical milestones in this evolution. These models, characterized by their vast number of parameters and advanced deep learning architectures, have enhanced the capacity to understand and generate human language, excelling in diverse tasks like translation, summarization, and question-answering <ref type="bibr" target="#b49">[50,</ref><ref type="bibr" target="#b50">51]</ref>. The efficacy of LLMs such as GPT, which utilizes a transformerbased architecture, stems from their comprehensive training across a broad spectrum of internet text, enabling the generation of coherent and contextually pertinent language <ref type="bibr" target="#b50">[51]</ref>. BERT's introduction of bidirectional transformers has revolutionized pre-training in language understanding, showing remarkable efficiency in tasks requiring intricate contextual comprehension <ref type="bibr" target="#b49">[50]</ref>. The incorporation of attention mechanisms, as conceptualized by <ref type="bibr" target="#b16">[17]</ref>, has further refined these models' ability for nuanced understanding and text generation. In the realm of image captioning, the deployment of LLMs like GPT-3 has brought transformative changes. GPT-3's adeptness in image captioning tasks is a testament to its sophisticated transformer-based architecture and comprehensive training on a wide array of internet text. This extensive training enables GPT-3 to intricately understand and generate content that accurately aligns with both textual and visual contexts, producing coherent, contextually relevant, and detailed image descriptions <ref type="bibr" target="#b50">[51]</ref>. The fusion of LLMs with advanced computer vision techniques has been a significant leap forward, leading to the development of more sophisticated systems. These systems are now better equipped to interpret and describe complex visual data with greater accuracy and nuance <ref type="bibr" target="#b51">[52]</ref>. This integration highlights the evolving capability of AI to understand and convey the subtleties of visual information, mirroring a more human-like perception and articulation of images. This advancement in image captioning technology is pivotal in enhancing how machines process and narrate visual data, bridging the gap between visual perception and linguistic expression. Furthermore, the use of LLMs goes beyond generating captions to evaluating their quality. A notable method in this regard is CLAIR <ref type="bibr" target="#b12">[13]</ref>, which leverages zero-shot language modeling to assess caption quality. CLAIR shows a stronger correlation with human judgment compared to traditional metrics like BLEU, ROUGE, METEOR, and CIDEr. By soliciting an LLM to rate how likely a candidate caption accurately describes an image relative to a set of reference captions, CLAIR outperforms language-only measures, approaching human-level correlation. However, CLAIR requires a set of human-annotated reference captions to function, without which it cannot be applied.</p><p>In this work, the proposed approach leverages modern LLMs like GPT-4 for an innovative and comprehensive evaluation. 
We use LLMs to reverse-engineer the image captioning process, generating images from textual descriptions to assess caption accuracy. This method offers a unique advantage in evaluating the semantic richness and contextual relevance of captions. By comparing the generated images with the original, our approach provides a direct, visual assessment of caption quality, moving beyond mere textual analysis. This novel methodology not only aligns with human perception but also embraces the creativity and diversity inherent in image captioning, offering a more rounded and practical evaluation framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>The proposed evaluation framework comprises several key components: an image captioning module, an LLM-based text-to-image generator, an image feature extraction module, and a similarity calculator, as depicted in Figure <ref type="figure" target="#fig_1">2</ref>. Each of these components will be introduced in detail in the following subsections. Furthermore, to ensure the validity of the evaluation results based on our framework-specifically, their alignment with human judgment-we introduce a human-annotated image captioning dataset to validate the effectiveness of the proposed framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Image Captioning Module</head><p>The module incorporates an image captioning model, which will undergo evaluation using the proposed framework. This module takes an image as input and generates a text-based description as output.</p><p>To facilitate user comprehension of the proposed evaluation framework, we utilize the InstructBLIP model <ref type="bibr" target="#b52">[53]</ref> as an illustrative example in Section 4. This demonstration showcases the entire process of leveraging the proposed framework to evaluate a given image captioning model, making it easily understandable for users.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">LLM-based Text-to-Image Generator</head><p>Numerous studies <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref> have demonstrated the proficiency of LLM-based image generators, exemplified by models like GPT-4, in producing high-quality images that closely align with the semantic meaning of provided text-based prompts. Specifically, DALL-E, functioning as an image generation model within GPT-4, a variant of GPT-3 boasting 12 billion parameters, is engineered by OpenAI to generate images based on textual descriptions, drawing from a dataset comprising text-image pairs. Its versatile capabilities include crafting anthropomorphized versions of animals and objects, seamlessly combining unrelated concepts, rendering text, and applying transformations to existing images. In the context of the proposed framework, the LLM-based image generator utilizes the textbased image description generated by a preceding image captioning model. If the image captioning model performs well, generating a high-quality and accurate image description, the LLM-based image generator subsequently creates an image that is similar to the original input image. This connection highlights the interplay between effective image captioning and the generation of corresponding images by the LLM-based approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Image Feature Extraction Module</head><p>The image feature extraction module primarily consists of a pre-trained image encoder. This module takes an image as input and produces a feature vector representing the input image as output. To enhance user understanding of the proposed evaluation framework, we employ ViT-g/14 <ref type="bibr" target="#b53">[54]</ref> as a demonstrative example for image feature extraction in Section 4. ViT-g/14 is a vanilla ViT pre-trained for reconstructing masked-out image-text aligned vision features conditioned on visible image patches. Through this pretext task, the model efficiently scales up to one billion parameters, achieving notable performance across various vision downstream tasks, including image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, all without extensive supervised training. This demonstration in Section 4 highlights the complete process, encompassing image feature extraction for calculating similarity scores between the input and generated images. It illustrates how the proposed framework can be leveraged to assess a given image captioning model, providing users with a clear understanding. It is worth noting that the image feature extractor can be substituted with other pre-trained CNNs, such as VGG-16 <ref type="bibr" target="#b54">[55]</ref> or ResNet-52 <ref type="bibr" target="#b55">[56]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Similarity Calculator</head><p>Cosine similarity, as defined in Equation ( <ref type="formula" target="#formula_0">1</ref>), serves as a metric for quantifying the similarity between two vectors in a multi-dimensional space. It evaluates the cosine of the angle between these vectors, offering insight into their degree of similarity or dissimilarity. The advantage of cosine similarity lies in its ability to assess directional similarity rather than magnitude, rendering it robust against variations in scale and orientation. This characteristic makes it a widely adopted metric in diverse domains, including image processing and NLP. In these fields, cosine similarity is frequently employed to assess the similarity between images, documents, or sentences represented as vectors in high-dimensional spaces. The cosine similarity value CosSim(• , •) ∈ [−1, 1], where a value of 1 signifies that the vectors are identical, 0 indicates orthogonality (i.e., no similarity), and −1 indicates complete dissimilarity or "A dog wearing a leash laying next to an orange frisbee", "A dog with a collar on laying down next to a frisbee", "A dog lies in the grass next to a Frisbee.", "A frisbee on the ground next to a dog sitting in the grass", "A dog that is laying on the ground next to a Frisbee." "A skateboarder is jumping down a flight of stairs.", "A skaterboarder getting major air over some stairs during a night time shoot", "A skate boarder jumping down some stairs at night", "A skateboarder riding down a flight of stone stairs", "A young man skateboarding over a flight of steps" "All of these sheep have coats that are ready for shearing.", "some sheep standing around by a wooden wall", "Five sheep are standing and sitting in their enclosure.", "One sheep lies down as four others stand near.", "A group of five sheep wait outside a barn." "A person is on a living room couch watching TV and there is a stuffed panda bear and a purse on the table.", "A child watches television while a panda bear sits by a purse.", "A simple living room with a panda on the coffee table", "A black and white stuffed koala bear is in the room.", "A stuffed panda is on the living room table." opposition.</p><formula xml:id="formula_0">CosSim(i o , i g ) = i o • i g ‖i o ‖‖i g ‖ ,<label>(1)</label></formula><p>where i o • i g denotes the dot product (also known as the inner product) of the original input image feature vector i o and the LLM-generated image feature vector i g . ‖i o ‖ and ‖i g ‖ represent the Euclidean norms (also known as the magnitudes or lengths) of vectors i o and i g , respectively. In words, cosine similarity measures the cosine of the angle between two vectors, which represents their similarity in direction and magnitude.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Human-annotated Image Captioning Dataset</head><p>The Microsoft Common Objects in Context (MSCOCO) dataset is a comprehensive resource widely used across various image recognition tasks including object detection, segmentation, and captioning. Originally, the MSCOCO Captions dataset comprised over 330, 000 images, each meticulously annotated with 80 object categories. Notably, both the training and validation sets feature each image accompanied by five distinct human-generated captions. This dataset holds significant importance within the realm of computer vision research, serving as a cornerstone for the development and evaluation of numerous stateof-the-art object detection and segmentation models. In our study, we enhance the existing MSCOCO Caption dataset by incorporating an additional 30, 000 human-annotated image-description pairs. This augmented dataset serves as the basis for evaluating the alignment of our proposed evaluation method with human-annotated image descriptions. To aid in understanding the dataset, several examples from the dataset are provided in Figure <ref type="figure" target="#fig_2">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments and Analysis</head><p>In this section, our goal is to evaluate the effectiveness of the proposed evaluation framework designed for image captioning models. To achieve this, we will validate our framework using both the widely adopted human-annotated image captioning datasets and our newly introduced dataset, the details of which are outlined in the Section 3.5. Since all datasets have undergone human annotation, our primary objective in this assessment is to ascertain whether the evaluation results obtained through our proposed framework align with human consensus or judgment. To elaborate, a correct caption-matching the human-annotated counterpart-should yield a substantial cosine similarity score between the generated and original images, as measured by our evaluation framework. Conversely, an incorrect caption-deviating from the human-annotated version-should result in a comparatively smaller cosine similarity score. This approach allows us to empirically validate the effectiveness of our proposed evaluation framework in aligning with human judgment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experimental Settings</head><p>To illustrate the application of the proposed framework for evaluating an image captioning model, we employ the InstructBLIP <ref type="bibr" target="#b56">[57]</ref> model in our image captioning module. This model is equipped with the pre-trained language model Vicuna-7B <ref type="bibr" target="#b57">[58]</ref> to generate image descriptions. Image captions are generated using the prompt "&lt;Image&gt; A short image caption:", guiding the model to produce sentences of fewer than 100 tokens, excluding special symbols. For text-to-image generation, GPT-4 with the built-in diffusion model DALL-E-3 is employed. Notably, the diffusion model can be replaced by Stable Diffusion models <ref type="bibr" target="#b58">[59]</ref>, utilizing a fixed, pre-trained encoder (ViT-g/14) <ref type="bibr" target="#b59">[60]</ref>, and the entire diffusion model is pre-trained on the LAION-2B dataset <ref type="bibr" target="#b60">[61]</ref>. Human evaluation serves as the validation method for the proposed framework. Each image in the dataset comes with five human-annotated image captions, and performance is quantified using the average cosine similarity score, as detailed in Section 4.3. The experiments are conducted using two NVIDIA-A6000 GPUs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Datasets</head><p>MSCOCO Dataset <ref type="bibr" target="#b61">[62]</ref>. The MSCOCO dataset comprises two primary components: the images and their corresponding annotations. The images are organized into a directory hierarchy, with top-level directories for the train, validation, and test sets. Annotations are provided in JSON format, with each file corresponding to a single image. Each annotation includes details such as the image file name, dimensions (width and height), a list of objects with their respective class labels (e.g., "person, " "car"), bounding box coordinates (𝑥, 𝑦, width, height), segmentation mask (in polygon or RLE format), keypoints and their positions (if available), and five captions describing the scene. Additional information provided by the MSCOCO dataset includes image super categories, license details, and coco-stuff annotations (pixel-wise annotations for stuff classes in addition to the 80 object classes). The MSCOCO dataset provides various types of annotations, including object detection with bounding box coordinates and full segmentation masks for 80 different objects, stuff image segmentation with pixel maps displaying 91 amorphous background areas, panoptic segmentation identifying items in images based on 80 "things" and 91 "stuff" categories, dense pose annotations featuring over 39, 000 photos and mapping between pixels and a template for over 56, 000 tagged persons, 3D model annotations and natural language descriptions for each image, and keypoint annotations for over 250, 000 persons annotated with key points such as the right eye, nose, and left hip.</p><p>Flickr30k Dataset <ref type="bibr" target="#b62">[63]</ref>. The authors in <ref type="bibr" target="#b62">[63]</ref> advocate for utilizing the visual denotations of linguistic expressions, represented by the set of images they describe, to define new denotational similarity metrics. These metrics, as demonstrated in <ref type="bibr" target="#b62">[63]</ref>, prove to be at least as advantageous as distributional similarities for tasks requiring semantic inference. The computation of these denotational similarities involves the construction of a denotation graph-a subsumption hierarchy over constituents and their denotations. This graph is established using a substantial corpus comprising 30, 000 images and 150, 000 descriptive captions. The creation of this denotation graph involves the development of an image caption corpus by the authors in <ref type="bibr" target="#b62">[63]</ref>, consisting of 158, 915 crowd-sourced captions elucidating 31, 783 images. This corpus serves as an extension of their previous work on the Flickr8k Dataset. The new images and captions specifically focus on individuals engaged in everyday activities and events.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Effectiveness Analysis of the Proposed Evaluation Framework</head><p>Human Evaluation Using the Proposed Dataset. The dataset introduced in this work, consisting of pairs of images and captions, has undergone human annotation. Each image is accompanied by five distinct human-generated captions. The details of our human evaluation process are outlined below. In</p><p>Step 1, we directly utilize the human-annotated ground truth caption to generate an image through a text-to-image LLM, such as GPT-4 or Gemini. In Step 2, we extract the image features of both the ground truth caption's corresponding image and the image generated by the text-to-image LLM. In Step 3, we apply the cosine similarity formula from Section 3.4 to compute the cosine similarity scores between these two sets of image features. Given that the caption is a human-annotated ground truth description, accurately portraying the corresponding image, we expect the similarity score from Step 3 to be high. Conversely, if a caption inaccurately describes a given image, the cosine similarity score from Step 3 should be low. Consistency between the experimental result and these expectations indicates the effectiveness of the proposed evaluation framework in aligning with human consensus.</p><p>The evaluation results depicted in Figure <ref type="figure" target="#fig_3">4</ref> reveal notable insights. The blue lines in Figure <ref type="figure" target="#fig_3">4</ref> illustrate the impact of the provided captions on the cosine similarity scores. Specifically, when the provided caption matches the correct human-annotated description (upper blue line), the average cosine similarity score reaches approximately 0.67. Conversely, when the caption is incorrect (lower blue line), the average cosine similarity score drops to around 0.47. This discrepancy results in a similarity gap of approximately 0.2. These findings underscore the effectiveness of the proposed evaluation framework, as it closely aligns with human judgment. It is noteworthy that the robustness of this human evaluation method is attributed to the remarkable text-to-image generation capabilities of modern LLM models. Widely recognized models such as GPT-4 and Gemini have been extensively acclaimed in various studies and by the broader community <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. Assessment Using MSCOCO and Flickr30k Datasets. Figure <ref type="figure" target="#fig_3">4</ref> reveals consistent trends in the evaluation results across MSCOCO, Flickr30k, and our dataset. Similar patterns are observed in MSCOCO and Flickr30k, where there is a notable decrease in the average cosine similarity when the modelgenerated image caption differs from the human-annotated ground truth caption. These findings affirm the effectiveness and reliability of the proposed evaluation framework for assessing image captioning models. Qualitative Analysis. To gain deeper insights into the performance of the proposed evaluation framework, we present qualitative results in Figure <ref type="figure">5</ref> and Figure <ref type="figure">6</ref>. In Figure <ref type="figure">5</ref>, we observe that the human-annotated ground truth captions and the model-predicted captions exhibit poor alignment in these four examples. 
Given the accurate image generation capabilities of existing LLMs based on text-based prompts, the accuracy of model-generated image descriptions is crucial. However, in these instances, all predicted captions are incorrect, resulting in LLM-generated images that significantly differ from the ground truth images. Consequently, this discrepancy contributes to the low cosine-based similarity scores.</p><p>In Figure <ref type="figure">6</ref>, these two examples illustrate a strong alignment between the model-generated descriptions and the human-generated ground truth captions. Hence, this alignment results in LLM-generated images that closely resemble the ground truth images. As a result, when calculating cosine similarity scores based on the image features extracted from the LLM-generated and ground truth images, the scores are notably high. We also calculate scores based on the aforementioned text-based metrics, such as BLEU as defined in Equation ( <ref type="formula" target="#formula_1">2</ref>), to highlight the advantage of our proposed method over these text-based evaluation metrics. In Figure <ref type="figure">6</ref>, we observe that despite the model-generated image captions closely matching the ground truth captions, the scores based on text-based evaluation metrics are comparatively low. This observation underscores the superiority of our proposed evaluation framework over existing text-based evaluation metrics for image captioning models.</p><formula xml:id="formula_1">\mathrm{BP} = \begin{cases} 1 &amp; \text{if } c &gt; r \\ \exp\left(1 - \frac{r}{c}\right) &amp; \text{if } c \le r \end{cases}; \qquad \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),<label>(2)</label></formula><p>where 𝑟 represents the effective length of the ground truth text, 𝑐 signifies the length of the predicted text, and BP stands for the brevity penalty. The geometric mean of the adjusted 𝑛-gram precisions 𝑝 𝑛 is calculated using 𝑛-grams up to a length of 𝑁, with positive weights 𝑤 𝑛 that sum to 1.</p></div>
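For completeness, Equation (2) can be implemented directly as below; the modified n-gram precisions are assumed to be pre-computed, and the example numbers are illustrative only.

```python
# Sketch of Equation (2): brevity penalty and BLEU from pre-computed modified
# n-gram precisions p_n. Counting clipped n-gram matches is omitted here;
# library implementations such as NLTK handle that step.
import math

def brevity_penalty(c: int, r: int) -> float:
    # c: length of the predicted text, r: effective length of the ground truth
    return 1.0 if c > r else math.exp(1.0 - r / c)

def bleu(precisions, c: int, r: int) -> float:
    n = len(precisions)
    weights = [1.0 / n] * n                  # uniform weights w_n summing to 1
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_avg)

# Illustrative numbers: four modified precisions, candidate shorter than reference.
print(bleu([0.8, 0.6, 0.4, 0.3], c=18, r=20))
```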
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this study, we have introduced a novel framework for evaluating automatically generated image descriptions, aiming to overcome the limitations of existing evaluation metrics like BLEU, ROUGE, METEOR, and CIDEr. Our framework leverages advancements in LLMs such as GPT-4 or Gemini to utilize image descriptions generated by an image captioning model for creating corresponding images. By quantifying the cosine similarity between the representation of the original input image in the image captioning model and the representation of the LLM-generated image, we can effectively assess the model's performance without relying on human-annotated reference captions. Through extensive experiments on the established datasets like Flickr30k and MSCOCO, we have demonstrated the effectiveness of the proposed evaluation framework. Our experimental results suggest that the proposed framework's performance closely correlates with human judgment, offering a valuable method</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>(Figure 1 :</head><label>1</label><figDesc>Figure 1: Flowchart for image captioning. Existing image captioning architectures can be broadly categorized into two groups: those based on the recurrent neural network (RNN) and those based on the transformer architecture.To aid comprehension, we represent RNN-based methods with blue paths and transformer-based approaches with red paths. The process involves feeding an input image through an image encoder for feature extraction, followed by a language generator to produce text-based descriptions using the extracted image features.</figDesc><graphic coords="2,291.61,171.04,135.58,55.43" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>(Figure 2 :</head><label>2</label><figDesc>Figure 2: Flowchart of the proposed evaluation framework. The proposed framework consists of four main components: an image captioning module, an image feature extractor, a large language model (LLM), and a similarity calculator. The image captioning module employs a chosen model to process an input image and generate textual descriptions. The image feature extractor is tasked with extracting features from the input image. The LLM utilizes the text descriptions produced by the image captioning model to generate the corresponding image. Finally, the similarity calculator computes the similarity between the features of the input image and the image generated by the LLM.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Dataset examples. To provide a clearer insight into the introduced human-annotated dataset, we have randomly selected four examples for illustrative purposes. Each image in the dataset is accompanied by five human-annotated descriptions that vividly depict the content of the image.</figDesc><graphic coords="7,304.00,184.15,108.00,80.83" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure4: Human evaluation results. The outcomes are derived from three datasets: MSCOCO (highlighted in red), Flickr30k (highlighted in green), and our dataset (highlighted in blue). The top three lines represent scenarios where the provided caption aligns with the correct human-annotated description, while the bottom three lines represent scenarios where the caption is incorrect. "Gap 1", "Gap 2", and "Gap 3" signify the disparities in average cosine similarity scores. We observe that these gaps are approximately 0.2, indicating the influence of the provided captions on the cosine similarity scores. A larger gap indicates a substantial mismatch between the human-annotated image description and the provided or model-generated caption, whereas a smaller gap suggests a higher degree of alignment.</figDesc></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ground truth caption</head><p>for evaluating the effectiveness of image captioning models. Additionally, human evaluations conducted on our introduced dataset validate the framework's efficacy in capturing various aspects such as grammaticality, coverage, correctness, and truthfulness in automatically generated image descriptions. Moving forward, the proposed framework presents new opportunities for evaluating image captioning models, offering a more efficient and reliable alternative to traditional human evaluations and existing automated evaluation metrics. It is designed to complement, rather than replace, human judgment. In summary, our work contributes to the ongoing development of robust evaluation frameworks for image captioning models, bridging the gap between automated metrics and human judgment, and driving advancements in this field.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Midge: Generating descriptions of images</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hayes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="131" to="133" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Translating video content to natural language descriptions</title>
		<author>
			<persName><forename type="first">M</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Titov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thater</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pinkal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schiele</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="433" to="440" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Corpus-guided sentence generation of natural images</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Teo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Daumé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Aloimonos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2011 conference on empirical methods in natural language processing</title>
				<meeting>the 2011 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="444" to="454" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Image description using visual dependency representations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Elliott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Keller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 conference on empirical methods in natural language processing</title>
				<meeting>the 2013 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1292" to="1302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">See no evil, say no evil: Description generation from densely labeled images</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yatskar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Galley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vanderwende</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third Joint Conference on Lexical and Computational Semantics</title>
				<meeting>the Third Joint Conference on Lexical and Computational Semantics<address><addrLine>SEM</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014. 2014</date>
			<biblScope unit="page" from="110" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A technique for measuring attitude scale</title>
		<author>
			<persName><forename type="first">R</forename><surname>Linkert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychometrical</title>
		<imprint>
			<biblScope unit="volume">140</biblScope>
			<biblScope unit="page" from="40" to="55" />
			<date type="published" when="1932">1932</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th annual meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th annual meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Rouge: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text summarization branches out</title>
				<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Cider: Consensus-based image description evaluation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Vedantam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lawrence Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="4566" to="4575" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Saheel</surname></persName>
		</author>
		<title level="m">Baby talk: Understanding and generating image descriptions</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Re-evaluating the role of bleu in machine translation research</title>
		<author>
			<persName><forename type="first">C</forename><surname>Callison-Burch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Osborne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Koehn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">11th conference of the european chapter of the association for computational linguistics</title>
				<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="249" to="256" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Framing image description as a ranking task: Data, models and evaluation metrics</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hodosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hockenmaier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="853" to="899" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Petryk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Canny</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.12971</idno>
		<title level="m">Clair: Evaluating image captions with large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<orgName>Gemini Team, Google</orgName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Anil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Borgeaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Alayrac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schalkwyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hauth</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.11805</idno>
		<title level="m">Gemini: a family of highly capable multimodal models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A brief overview of chatgpt: The history, status quo and potential future development</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q.-L</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/CAA Journal of Automatica Sinica</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1122" to="1136" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.03762</idno>
		<title level="m">Attention is all you need</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Deep hierarchical encoder-decoder network for image captioning</title>
		<author>
			<persName><forename type="first">X</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Multimedia</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="2942" to="2956" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Show and tell: A neural image caption generator</title>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Toshev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="3156" to="3164" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">T-net: Nested encoder-decoder architecture for the main vessel segmentation in coronary angiography</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">J</forename><surname>Jun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kweon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural Networks</title>
		<imprint>
			<biblScope unit="volume">128</biblScope>
			<biblScope unit="page" from="216" to="233" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Robustness analysis of visual qa models by basic questions, VQA Challenge and Visual Dialog Workshop</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alfadly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ghanem</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>CVPR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Vqabq: Visual question answering by basic questions, VQA Challenge Workshop</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alfadly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ghanem</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>CVPR</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Robustness analysis of visual question answering models by basic questions</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
		<respStmt>
			<orgName>King Abdullah University of Science and Technology</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master Thesis</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A novel framework for robustness analysis of visual qa models</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Dao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alfadly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ghanem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence</title>
				<meeting>the Thirty-Third AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8449" to="8456" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alfadly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ghanem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.03147</idno>
		<title level="m">Improving visual question answering models through robustness analysis and in-context learning with a chain of basic questions</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Alfadly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ghanem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.01452</idno>
		<title level="m">Assessing the robustness of visual question answering</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Deepopht: medical report generation for retinal images via deep models and visual explanation</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-H</forename><forename type="middle">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Morikawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WACV</title>
		<imprint>
			<biblScope unit="page" from="2442" to="2452" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Non-local attention improves description generation for retinal images</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-H</forename><forename type="middle">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tegner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WACV</title>
		<imprint>
			<biblScope unit="page" from="1606" to="1615" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Contextualized keyword representations for multi-modal retinal image captioning</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICMR</title>
		<imprint>
			<biblScope unit="page" from="645" to="652" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Deep context-encoding network for retinal image captioning</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-H</forename><forename type="middle">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICIP, IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3762" to="3766" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Longer version for&quot; deep context-encoding network for retinal image captioning</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-H</forename><forename type="middle">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.14538</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Gpt2mvs: Generative pre-trained transformer-2 for multi-modal video summarization</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Murn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mrak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICMR</title>
		<imprint>
			<biblScope unit="page" from="580" to="589" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Query-controllable video summarization</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICMR</title>
		<imprint>
			<biblScope unit="page" from="242" to="250" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Expert-defined keywords improve interpretability of retinal image captioning</title>
		<author>
			<persName><forename type="first">T.-W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Worring</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</title>
				<meeting>the IEEE/CVF Winter Conference on Applications of Computer Vision</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1859" to="1868" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Generation and comprehension of unambiguous object descriptions</title>
		<author>
			<persName><forename type="first">J</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Toshev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Camburu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Yuille</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Murphy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="11" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Image captioning with deep bidirectional lstms</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Meinel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th ACM international conference on Multimedia</title>
				<meeting>the 24th ACM international conference on Multimedia</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="988" to="997" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Areas of attention for image captioning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pedersoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Verbeek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE international conference on computer vision</title>
				<meeting>the IEEE international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1242" to="1250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Textray: Mining clinical reports to gain a broad understanding of chest x-rays</title>
		<author>
			<persName><forename type="first">J</forename><surname>Laserson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Lantsman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cohen-Sfady</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Tamir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Goz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Brestel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Atar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Elnekave</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Medical Image Computing and Computer Assisted Intervention-MICCAI 2018: 21st International Conference</title>
				<meeting><address><addrLine>Granada, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">September 16-20, 2018. 2018</date>
			<biblScope unit="page" from="553" to="561" />
		</imprint>
	</monogr>
	<note>Proceedings, Part II 11</note>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Jing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Xing</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1711.08195</idno>
		<title level="m">On the automatic generation of medical imaging reports</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<title level="m">An image is worth 16x16 words: Transformers for image recognition at scale</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Hybrid retrieval-generation reinforced agent for medical image report generation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Comparative performance of pulmonary ultrasound, chest radiograph, and ct among patients with acute respiratory failure</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Tierney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Huelster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Overgaard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Plunkett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">L</forename><surname>Boland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>St Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">K</forename><surname>Agboto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">F</forename><surname>Mikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">E</forename><surname>Weise</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Critical Care Medicine</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<biblScope unit="page" from="151" to="157" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">C</forename><surname>Frey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.06468</idno>
		<title level="m">Vit-v-net: Vision transformer for unsupervised volumetric medical image registration</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">Meteor: An automatic metric for mt evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</title>
				<meeting>the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Spice: Semantic propositional image caption evaluation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fernando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gould</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Vision-ECCV 2016: 14th European Conference</title>
				<meeting><address><addrLine>Amsterdam, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">October 11-14, 2016. 2016</date>
			<biblScope unit="page" from="382" to="398" />
		</imprint>
	</monogr>
	<note>Proceedings, Part V 14</note>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">E</forename><surname>Asri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schulz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zumer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.09799</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.09675</idno>
		<title level="m">Bertscore: Evaluating text generation with bert</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b48">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Hessel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Forbes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">L</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08718</idno>
		<title level="m">Clipscore: A reference-free evaluation metric for image captioning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b49">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">OpenAI blog</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">9</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">Scaling up visual and vision-language representation learning with noisy text supervision</title>
		<author>
			<persName><forename type="first">C</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Parekh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-H</forename><surname>Sung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Duerig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4904" to="4916" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<monogr>
		<title level="m" type="main">Instructblip: Towards general-purpose vision-language models with instruction tuning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.06500</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b53">
	<analytic>
		<title level="a" type="main">Eva: Exploring the limits of masked visual representation learning at scale</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="19358" to="19369" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b54">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1409.1556</idno>
		<title level="m">Very deep convolutional networks for large-scale image recognition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b55">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="770" to="778" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<monogr>
		<title level="m" type="main">Instructblip: Towards general-purpose vision-language models with instruction tuning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M H</forename><surname>Tiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">A</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C H</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.06500</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:258615266" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.05685</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:259129398" />
		<title level="m">Judging llm-as-a-judge with mt-bench and chatbot arena</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b58">
	<analytic>
		<title level="a" type="main">High-resolution image synthesis with latent diffusion models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rombach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Blattmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lorenz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Esser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ommer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="10684" to="10695" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:231591445" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Schuhmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vencu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Beaumont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kaczmarczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mullis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Katta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Coombes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jitsev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Komatsuzaki</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.02114</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:241033103" />
		<title level="m">Laion-400m: Open dataset of clip-filtered 400 million image-text pairs</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<analytic>
		<title level="a" type="main">Microsoft coco: Common objects in context</title>
		<author>
			<persName><forename type="first">T.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Belongie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hays</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Perona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Vision-ECCV 2014: 13th European Conference</title>
				<meeting><address><addrLine>Zurich, Switzerland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">September 6-12, 2014. 2014</date>
			<biblScope unit="page" from="740" to="755" />
		</imprint>
	</monogr>
	<note>Proceedings, Part V 13</note>
</biblStruct>

<biblStruct xml:id="b62">
	<analytic>
		<title level="a" type="main">From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions</title>
		<author>
			<persName><forename type="first">P</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hodosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hockenmaier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="67" to="78" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
