<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jheng-Hong</forename><surname>Yang</surname></persName>
							<email>jheng-hong.yang@uwaterloo.ca</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Waterloo</orgName>
								<address>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jimmy</forename><surname>Lin</surname></persName>
							<email>jimmylin@uwaterloo.ca</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Waterloo</orgName>
								<address>
									<country key="CA">Canada</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="laboratory">The First Workshop on Large Language Models for Evaluation in Information Retrieval</orgName>
								<address>
									<addrLine>18 July 2024</addrLine>
									<settlement>Washington</settlement>
									<region>DC</region>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">FD3CA4BD14606F7E72F9BF33C53C478D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Relevance Assessments</term>
					<term>Image-Text Retrieval</term>
					<term>Vision-Language Model</term>
					<term>Large Language Model</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Vision-Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall's 𝜏 ∼ 0.4 when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore strongly favors CLIP-based retrieval systems, LLM-based judgments are less biased toward them. (3) GPT-4V's score distribution aligns more closely with human judgments than other models, achieving a Cohen's 𝜅 value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Cranfield-style test collections, consisting of a document corpus, a set of queries, and manually assessed relevance judgments, have long served as the foundation of information retrieval research <ref type="bibr" target="#b0">[1]</ref>. However, evaluating every document for every query in a substantial corpus often proves cost-prohibitive. To tackle this challenge, a subset of documents is selected for assessment through a pooling process. While this method is cost-effective compared to user studies, it has limitations due to its simplifying assumptions and struggles to adapt to complex search scenarios and large document collections.</p><p>In this study, we explore the adaptability of model-based relevance judgments for image-text retrieval evaluation. Leveraging model-based relevance judgments presents an appealing option. Not only does it provide valuable insights before undertaking the laborious processes of document curation, query creation, and costly annotation, but it also has the potential to extend and scale up to complex search scenarios and large document collections. To explore opportunities and meet the demands for large-scale, fine-grained, and long-form text enrichment scenarios in image-text retrieval evaluation <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>, our objective is to extend the human-machine collaborative framework proposed by Faggioli et al. 
<ref type="bibr" target="#b5">[6]</ref> to the context of image-text retrieval evaluation, alongside widely adopted model-based image-text evaluation metrics <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>.</p><p>Our primary focus is on a fully automatic evaluation paradigm, where we harness the capabilities of Vision-Language Models (VLMs), including CLIP <ref type="bibr" target="#b11">[12]</ref>, as well as visual instruction-tuned Large Language Models (LLMs) like LLaVA <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref> and GPT-4V <ref type="bibr" target="#b14">[15]</ref>. To evaluate this approach, we conducted a pilot study using the TREC-AToMiC 2023 test collection, which is designed for multimedia content creation <ref type="bibr" target="#b4">[5]</ref>, based on our instruction prompt template for VLMs (cf. Table <ref type="table" target="#tab_0">1</ref> and Section 3.2).</p><p>We observe that model-based relevance judgments generated by visual instruction-tuned LLMs outperform the widely adopted CLIPScore <ref type="bibr" target="#b6">[7]</ref> in terms of ranking correlations and agreement when compared to human annotations. While this discovery holds promise, we also uncover a potential evaluation bias when using model-based relevance judgments. Our analysis reveals a bias in favor of CLIP-based retrieval systems in the rankings when employing model-based relevance judgments, resulting in higher overall effectiveness assessments for these systems. In summary, our contributions can be distilled as follows:</p><p>• We demonstrate and explore the feasibility of incorporating VLMs for fully automatic image-text retrieval evaluation.</p><p>• We shed light on the evaluation bias when utilizing model-based relevance judgments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Evaluation Metrics for Image-Text Relevance. Nowadays, model-based evaluation metrics are widely utilized in various vision-language tasks, including image captioning <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b15">16]</ref> and text-to-image synthesis <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b16">17]</ref>. Among model-based approaches, CLIP-based methods <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>, such as CLIPScore <ref type="bibr" target="#b6">[7]</ref>, are particularly prevalent. However, while these metrics are capable of measuring coarse text-image similarity, they may fall short in capturing fine-grained image-text correspondence <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b18">19]</ref>. Recent research has highlighted the effectiveness of enhancing model-based evaluation metrics by leveraging LLMs to harness their reasoning capabilities <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21]</ref>. There exists significant potential for incorporating LLMs into model-based approaches, as LLM outputs are not limited to mere scores but can also provide free-form texts, e.g., reasons, for further analysis and many downstream tasks <ref type="bibr" target="#b21">[22]</ref>.</p><p>Model-based Relevance Judgments. Traditionally, relevance judgments in retrieval tasks have adhered to the Cranfield evaluation paradigm due to its cost-effectiveness, reproducibility, and reliability when compared to conducting user studies. However, this approach often relies on simplified assumptions and encounters scalability challenges. 
Researchers have recently explored model-based automatic relevance estimation as a promising alternative. This approach aims to optimize human-machine collaboration to obtain ideal relevance judgments. Notably, studies of Dietz and Dalton <ref type="bibr" target="#b22">[23]</ref> and Faggioli et al. <ref type="bibr" target="#b5">[6]</ref> have revealed high rank correlations between model-based and human-based judgments. Additionally, MacAvaney and Soldaini <ref type="bibr" target="#b23">[24]</ref> have delved into the task of filling gaps in relevance judgments using model-based annotations. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this study, we investigate techniques for estimating image-text relevance scores, denoted as ℱ(𝑞, 𝑑) ∈ R, where 𝑞 represents the text (query) and 𝑑 represents the image (document). Our primary focus is on utilizing VLMs to generate relevance scores, akin to empirical values annotated by human assessors denoted as ℱ̂(𝑞, 𝑑). The main objective is to assess the proximity between model-based ℱ and human-based ℱ̂ in image-text retrieval evaluation. We begin with a discussion of the setting for human-based annotations, followed by the process for generating model-based annotations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Human-based Annotations</head><p>Our primary focus revolves around a critical aspect of multimedia content creation, specifically, the image suggestion task, an ad hoc image retrieval task as part of the AToMiC track in the TREC conference 2023 (TREC-AToMiC 2023). <ref type="foot" target="#foot_0">1</ref> The image suggestion task aims to identify relevant images from a predefined collection, given a specific section of an article. Its overarching goal is to enrich textual content by selecting images that aid readers in better comprehending the material.</p><p>Relevance scores for this task are meticulously annotated by NIST assessors, adhering to the TREC-style top-𝑘 pooling relevance annotation process. A total of sixteen valid participant runs, generated by diverse image-text retrieval systems, are considered, encompassing (CLIP-based) dense retrievers, learned sparse retrievers, caption-based retrievers, hybrid systems, and multi-stage retrieval systems. The pooling depth is set to 25 for eight baseline systems and 30 for the remaining participant runs. NIST assessors classify candidate results into three graded relevance levels to capture nuances in suitability, guided by the content of the test query. The test query comprises textual elements such as the section title, section context description, page title, and page context description. Assessors base their relevance judgments on the following criteria:</p><p>• 0 (Non-relevant): Candidates deemed irrelevant.</p><p>• 1 (Related): Candidates that are related but not relevant to the section context. They contain pertinent information but do not align with the section's context.</p><p>• 2 (Relevant): These candidates are considered relevant to the section context and effectively illustrate it.</p></div>
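The graded judgments produced by this process are typically distributed in the standard TREC qrels format, one whitespace-separated `topic_id iteration doc_id grade` record per line. As a minimal illustration (the file path and identifiers below are hypothetical), such a file can be parsed as:

```python
from collections import defaultdict

def load_qrels(path):
    """Parse TREC-style qrels: one 'topic_id iteration doc_id grade' per line.
    Grades follow the track's scale: 0 non-relevant, 1 related, 2 relevant."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            topic_id, _iteration, doc_id, grade = line.split()
            qrels[topic_id][doc_id] = int(grade)
    return dict(qrels)
```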
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Model-based Annotations</head><p>For automatic relevance estimation, we employ pretrained VLMs as our relevance estimator, denoted as ℱ(𝑞, 𝑑 | 𝒫). Our relevance estimator produces relevance scores given a pair of 𝑞 and 𝑑, conditioned on 𝒫, where 𝒫 represents the prompt template we use to instruct the models. Prompt engineering is a commonly adopted technique for enhancing or guiding VLMs and LLMs in various tasks <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b11">12]</ref>. It is important to note that our current focus is on pointwise estimation, leaving more advanced ranking methods (such as pairwise or listwise) that consider multiple 𝑞 and 𝑑 for future exploration <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b26">27]</ref>.</p></div>
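To make the conditioning on 𝒫 concrete, the sketch below assembles a pointwise prompt from the three template components described in the next subsection. The wording of our actual template appears in Table 1; the strings and field names here are illustrative placeholders only:

```python
def build_prompt(page_title, page_context, section_title, section_context):
    # Context: the textual elements of the query q (Section 3.1)
    context = (
        f"PAGE TITLE: {page_title}\nPAGE CONTEXT: {page_context}\n"
        f"SECTION TITLE: {section_title}\nSECTION CONTEXT: {section_context}\n"
    )
    # Relevance Instruction: task-specific guidance for the VLM
    relevance_instruction = (
        "Think carefully about which images best illustrate "
        "the SECTION subject matter.\n"
    )
    # Output Instruction: constrain the output type and format for parsing
    output_instruction = (
        "Relevance: Rate the image's overall relevance (integer, scale: 1-100) "
        'in terms of matching the text. Output format should be: "Relevance: <score>"'
    )
    return context + relevance_instruction + output_instruction
```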
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompt Template Design</head><p>In line with our approach to relevance score annotation, we have created a prompt template designed to guide models in generating relevance scores. The prompt template, presented in Table <ref type="table" target="#tab_0">1</ref>, reflects our heuristics rather than an exhaustive search of all possible templates. Pretrained VLMs are expected to take both 𝑞 and 𝑑 to produce a relevance score following the instructions defined in the prompt template 𝒫. We anticipate that VLMs will independently process textual and visual information, and our prompt template is only applied to textual inputs. Our template comprises three essential components:</p><p>• Context: This section processes the textual information from 𝑞.<ref type="foot" target="#foot_1">2</ref> </p><p>• Relevance Instruction: It incorporates task-specific information designed to provide VLMs with an understanding of the task.</p><p>• Output Instruction: This component offers instructions concerning the expected output, e.g., output types and format.</p><p>From Scores to Relevance Judgments. We utilize parsing scripts to process the relevance scores generated by the models and convert them into relevance judgments. <ref type="foot" target="#foot_2">3</ref> Considering potential score variations across different models, we apply an additional heuristic rule to map these scores into graded relevance levels: 0 (non-relevant), 1 (related), and 2 (relevant). Specifically, scores falling below the median value are categorized as 0; scores within the 50-75th quantile range are designated as 1; and scores exceeding the 75th quantile are assigned a relevance level of 2. </p></div>
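The score-to-grade mapping rule above can be sketched in a few lines of numpy; handling of ties exactly at the quantile boundaries is an implementation choice the text does not pin down, so the convention below is an assumption:

```python
import numpy as np

def scores_to_grades(scores):
    """Map raw model scores to graded relevance per the heuristic rule:
    below the median -> 0, between the 50th and 75th quantiles -> 1,
    above the 75th quantile -> 2 (boundary ties are a design choice)."""
    s = np.asarray(scores, dtype=float)
    q50, q75 = np.quantile(s, [0.50, 0.75])
    grades = np.zeros(len(s), dtype=int)
    grades[s >= q50] = 1  # at or above the median
    grades[s > q75] = 2   # strictly above the 75th quantile
    return grades
```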
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>We have undertaken an empirical comparison between human assessors and vision-language models to offer an initial evaluation of their current capabilities in estimating relevance judgments. This comparative analysis encompasses one embedding-based model (CLIP) and two LLMs trained by visual instruction tuning (LLaVA and GPT-4V). The experiments were carried out in January 2024.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Setups</head><p>Test Collection. Our study focuses on the image suggestion task in TREC-AToMiC 2023. In this task, queries are sections from Wikipedia pages, and the corpus contains images from Wikipedia. We assess VLMs' ability to assign relevance labels to 9,818 image-text pairs across 74 test topics. We predict relevance scores, generate qrels for 16 retrieval runs, and compare them with NIST human-assigned qrels. Note that the test topics consist of Wikipedia text sections (level-3 vital articles) without accompanying images, and NIST qrels are not publicly accessible during the training of VLMs we study in this work.</p><p>Vision-Language Models. Our experiments feature three models: CLIP <ref type="bibr" target="#b11">[12]</ref>, LLaVA <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>, and GPT-4V <ref type="bibr" target="#b14">[15]</ref>. CLIP serves as a versatile baseline model, offering similarity scores for image-text pairs. We use CLIPScore <ref type="bibr" target="#b6">[7]</ref> (referred to as CLIP-S) for calculating relevance with CLIP. However, CLIP has limitations due to its text encoder's token limit (77 tokens), making it less adaptable for complex tasks with lengthy contexts. In contrast, LLMs like LLaVA and GPT-4V, fine-tuned for visual instruction understanding, possess larger text encoders capable of handling extended context. These models excel in various vision-language tasks, making them more versatile compared to CLIP.</p></div>
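As a point of reference, CLIP-S follows the CLIPScore definition of Hessel et al. [7]: a clipped, rescaled cosine similarity between CLIP image and text embeddings. The sketch below operates on precomputed embeddings; obtaining the embeddings themselves requires a CLIP checkpoint and is omitted here:

```python
import numpy as np

def clip_s(image_emb, text_emb, w=2.5):
    """CLIPScore: w * max(cos(E_I, E_T), 0), where w = 2.5 rescales
    typical cosine values roughly into [0, 1]."""
    i = np.asarray(image_emb, dtype=float)
    t = np.asarray(text_emb, dtype=float)
    i /= np.linalg.norm(i)
    t /= np.linalg.norm(t)
    return w * max(float(i @ t), 0.0)
```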
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Correlation Study</head><p>In this subsection, our primary aim is to investigate the extrinsic properties of relevance judgments generated by various approaches, where we base our analysis on retrieval runs and ranking metrics. While various techniques exist to enhance the capabilities of vision-language models, including prompt engineering, few-shot instructions, and instruction tuning, our current focus centers on examining their zero-shot capabilities. We defer the exploration of other methods to future research endeavors. Following the work of Voorhees <ref type="bibr" target="#b27">[28]</ref>, we undertake an investigation into the system ranking correlation and the agreement between the relevance labels estimated by the model and those provided by NIST annotators. We evaluate the ranking correlations concerning the primary metrics utilized in the AToMiC track: NDCG@10 and MAP, and calculate Kendall's 𝜏, Spearman's 𝜌𝑠, and Pearson's 𝜌𝑝. In our agreement study, we compute Cohen's 𝜅 using NIST's qrels as references.</p><p>Overall. The primary results are showcased in Table <ref type="table" target="#tab_1">2</ref>, where rows correspond to the backbone model used for relevance judgment generation. Notably, models leveraging LLMs such as LLaVA and GPT-4V outperform the CLIP-S baseline concerning ranking correlation. Specifically, they achieve Kendall's 𝜏 values of approximately 0.4 for NDCG@10 and around 0.5 for MAP. For comparison, previous research reported a 𝜏 of 0.9 for MAP when comparing two types of human judgments <ref type="bibr" target="#b27">[28]</ref>. While there is still room for further improvement, our observations already demonstrate improvement over the CLIP-S baseline, which achieves 0.200 (0.333) for NDCG@10 (MAP). Moreover, other correlation coefficients, including Spearman and Pearson, corroborate the trends identified by Kendall's 𝜏. 
Additionally, we notice a rising trend in agreement levels when transitioning from CLIP-S (-0.096) to GPT-4V (0.080), as evidenced by Cohen's 𝜅 values. The agreements achieved by the two largest models (LLaVA-13b and GPT-4V) are categorized as 'slight,' which represents an improvement over the smaller LLaVA-7b model and the baseline.</p><p>Evaluation Bias. Model-based evaluations can introduce bias, often favoring models that are closely related to the assessor model <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30]</ref>. We term this phenomenon evaluation bias. This is distinct from source bias, which indicates that neural retrievers might prefer contents generated by generative models <ref type="bibr" target="#b30">[31]</ref>. To address this potential concern, we conducted an initial analysis using the scatter plot presented in Fig. <ref type="figure" target="#fig_0">1</ref>. In this analysis, we compared the NDCG@10 scores of the 16 submissions made by participants employing different sets of qrels. Each data point on the plot corresponds to a specific run, with distinct markers representing variations in results based on relevance estimation models. Upon closer examination of the plot, we identified a positive correlation between model-based and human-based qrels. Notably, the effectiveness of submitted systems appeared slightly higher under model-based qrels than under human-based qrels.</p><p>To gain deeper insights, we have visually highlighted CLIP-based submissions in red for a thorough investigation. This visual distinction underscores the preference of model-based qrels for CLIP-based systems, especially evident with CLIP-S qrels. We quantitatively assess this bias using a metric adapted from the work of Dai et al. 
<ref type="bibr" target="#b30">[31]</ref>:</p><formula xml:id="formula_0">Relative Δ = 2 × (Metric_CLIP-based − Metric_Others) / (Metric_CLIP-based + Metric_Others) × 100%,</formula><p>where Metric stands for a measure, e.g., NDCG@𝑘, averaged across systems. Observing Table <ref type="table" target="#tab_2">3</ref>, CLIP-S exhibits a strong bias, with Relative Δ = 114.7 for NDCG@10 and 120.5 for MAP. LLM-based approaches also display a slight bias towards CLIP-based systems, possibly because both LLaVA and GPT-4V rely on CLIP embeddings for image representations. In contrast, human-based qrels show the lowest bias: -11.7 for NDCG@10 and -19.5 for MAP.</p></div>
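The Relative Δ metric and the rank correlations used throughout this section can be computed as follows; the effectiveness values in the example are fabricated for illustration and are not the paper's numbers:

```python
import numpy as np
from scipy.stats import kendalltau

def relative_delta(clip_based_metrics, other_metrics):
    """Relative Δ = 2 × (mean_CLIP − mean_others) / (mean_CLIP + mean_others) × 100%."""
    a = float(np.mean(clip_based_metrics))
    b = float(np.mean(other_metrics))
    return 2 * (a - b) / (a + b) * 100

# System-ranking correlation between two qrels variants
# (illustrative NDCG@10 values for four hypothetical runs):
human_based = [0.30, 0.25, 0.40, 0.10]
model_based = [0.35, 0.20, 0.45, 0.15]
tau, _p = kendalltau(human_based, model_based)
```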
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Estimated Relevance Analysis</head><p>In this subsection, we aim to explore the intrinsic properties of relevance judgments generated by various systems. We began our analysis by examining score distributions, visualized in Figures <ref type="figure">2</ref> and <ref type="figure" target="#fig_3">3</ref>, to gain insights into model-based scores.</p><p>Figure <ref type="figure">2</ref> presents a Cumulative Distribution Function (CDF) plot of scores before post-processing into relevance levels (0, 1, and 2). We included NIST qrels (human) results for reference. Notably, GPT-4V's score distribution closely aligns with the human CDF, while CLIP-S exhibits a smoother S-shaped distribution with limited representation of low-relevance data. LLaVA produces tightly concentrated scores, adding complexity to post-processing, particularly when compared to GPT-4V.</p><p>Figure <ref type="figure" target="#fig_3">3</ref> illustrates confusion matrices, highlighting LLaVA's tendency to generate more 1 (related) judgments and fewer 2 (relevant) and 0 (non-relevant) judgments compared to GPT-4V. We anticipate that future models will strive to produce score distributions that better match human annotations, thereby addressing these challenges and limitations. Further studies <ref type="bibr" target="#b31">[32]</ref> on harnessing LLMs' relevance prediction capability are necessary.</p></div>
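The agreement numbers reported in this study rely on Cohen's 𝜅; a compact reference implementation over the three graded levels is given below (the label arrays in the test are hypothetical, not our annotation data):

```python
import numpy as np

def cohens_kappa(labels_a, labels_b, n_classes=3):
    """Cohen's kappa between two annotators over labels in {0, ..., n_classes-1}."""
    cm = np.zeros((n_classes, n_classes))
    for x, y in zip(labels_a, labels_b):
        cm[x, y] += 1  # confusion matrix of paired judgments
    n = cm.sum()
    p_o = np.trace(cm) / n                 # observed agreement
    p_e = (cm.sum(0) @ cm.sum(1)) / n**2   # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)
```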
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This study delves into the capabilities of VLMs such as CLIP, LLaVA, and GPT-4V for automating relevance judgments in image-text retrieval evaluation. Our findings reveal that visual-instruction-tuned LLMs outperform traditional metrics like CLIPScore in aligning with human judgments, with GPT-4V showing particular promise due to its closer alignment with human judgment distributions.</p><p>Despite these advancements and the low cost of model-based relevance annotation, challenges such as evaluation bias and the complexity of mimicking human judgments remain. These issues underscore the need for ongoing model refinement and exploration of new techniques to enhance the reliability and scalability of automated relevance judgments.</p><p>In conclusion, our research highlights the potential of VLMs in streamlining multimedia content creation while also pointing to the critical areas requiring further investigation. The path toward fully automated relevance judgment is complex, necessitating continued collaborative efforts in the research community to harness the full potential of VLMs in this domain.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Scatter plots of effectiveness (NDCG@10) for TREC-AToMiC 2023 runs using human-based and model-based qrels. Each data point represents the mean effectiveness of a single run evaluated with different qrels. CLIP-based runs are highlighted in red. Best viewed in color.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Cumulative distribution function (CDF) plot of relevance scores from various models. Human stands for relevance annotations of NIST qrels.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3:</head><label>3</label><figDesc>Figure 3: Confusion matrices of model-based relevance judgments compared with NIST qrels.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Prompt template for relevance estimation. The VLMs are expected to take text 𝑞 and image 𝑑 independently. The prompts are only applied to the textual input 𝑞, while the VLMs process the pixel values of image 𝑑 directly.</figDesc><table><row><cell>Text Input:</cell></row></table><note>Relevance Instruction: Think carefully about which images best illustrate the SECTION subject matter. Given the text and the image, please answer the following questions given the criteria listed as follows: * Images must be significant and relevant in the topic's context, not primarily decorative. They are often an important illustrative aid to understanding. * Images should look like what they are meant to illustrate, whether or not they are provably authentic. * Textual information should almost always be entered as text rather than as an image. Output Instruction: Relevance: Rate the image's overall relevance (integer, scale: 1-100) in terms of matching the text. Output format should be: "Relevance: &lt;score&gt;"</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Ranking correlation and judgment agreement analysis. Correlations are reported in terms of Kendall's 𝜏 , Spearman's 𝜌 𝑠 , and Pearson's 𝜌 𝑝 , whereas judgment agreement is reported in terms of Cohen's 𝜅 when comparing to NIST qrels.</figDesc><table><row><cell>NDCG@10</cell><cell>MAP</cell><cell>Agreement</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Evaluation bias assessment using Relative Δ in terms of NDCG@10 and MAP. A positive Δ favors CLIP-based systems, while a negative Δ favors other types of systems.</figDesc><table><row><cell>Model</cell><cell cols="2">Δ(NDCG@10) Δ(MAP)</cell></row><row><cell>CLIP-S</cell><cell>114.7</cell><cell>120.5</cell></row><row><cell>LLaVA-7b</cell><cell>58.5</cell><cell>86.6</cell></row><row><cell>LLaVA-13b</cell><cell>55.8</cell><cell>83.1</cell></row><row><cell>GPT-4V</cell><cell>64.0</cell><cell>91.3</cell></row><row><cell>Human</cell><cell>-11.7</cell><cell>-19.5</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://trec-atomic.github.io/trec-2023-guidelines</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">For VLMs with limited context windows, e.g., CLIP, we only take the texts in the context part and ignore all the remaining instructions.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">For CLIP, relevance scores are computed using text and image embeddings directly.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The Aslib Cranfield research project on the comparative efficiency of indexing systems</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">W</forename><surname>Cleverdon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Aslib Proceedings</title>
				<imprint>
			<date type="published" when="1960">1960</date>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="421" to="431" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Towards multi-modal text-image retrieval to improve human reading</title>
		<author>
			<persName><forename type="first">F</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ö</forename><surname>Alaçam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Context matters for image descriptions for accessibility: Challenges for referenceless evaluation metrics</title>
		<author>
			<persName><forename type="first">E</forename><surname>Kreiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hooshmand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zelikman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ringel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Morris</surname></persName>
		</author>
		<author>
			<persName><surname>Potts</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2022 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="4685" to="4697" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Enhancing textbooks with visuals from the web for improved learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zouhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sachan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="11931" to="11944" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">AToMiC: An image/text retrieval test collection to support multimedia content creation</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lassance</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sampaio De Rezende</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Srinivasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Redi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Clinchant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2975" to="2984" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Perspectives on large language models for relevance judgment</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Demartini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval</title>
				<meeting>the 2023 ACM SIGIR International Conference on Theory of Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="39" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">CLIPScore: A reference-free evaluation metric for image captioning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hessel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Forbes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Le</forename><surname>Bras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="7514" to="7528" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Benchmark for compositional text-to-image synthesis</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Azadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rohrbach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Mutual information divergence: A unified metric for multimodal generative models</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Yoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="35072" to="35086" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Jampani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pritch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rubinstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Aberman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="22500" to="22510" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Kreiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zelikman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Haber</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.11710</idno>
		<title level="m">ContextRef: Evaluating referenceless metrics for image description generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Improved baselines with visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<title level="m">GPT-4V(ision) system card</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">CLAIR: Evaluating image captions with large language models</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Petryk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Canny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="13638" to="13646" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kasai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ostendorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF International Conference on Computer Vision</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="20349" to="20360" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">IC3: Image captioning by committee consensus</title>
		<author>
			<persName><forename type="first">D</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Myers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vijayanarasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Canny</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="8975" to="9003" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">When and why vision-language models behave like bags-of-words, and what to do about it?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yuksekgonul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bianchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kalluri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.11116</idno>
		<title level="m">LLMScore: Unveiling the power of large language models in text-to-image synthesis evaluation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Let&apos;s ViCE! Mimicking human cognitive behavior in image generation evaluation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Betti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Staiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Baraldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cucchiara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sebe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st ACM International Conference on Multimedia</title>
				<meeting>the 31st ACM International Conference on Multimedia</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="9306" to="9312" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Socratic models: Composing zero-shot multimodal reasoning with language</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Attarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Choromanski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Welker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tombari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Purohit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Ryoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sindhwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Florence</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Humans optional? Automatic large-scale test collections for entity, passage, and entity-passage retrieval</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dalton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Datenbank-Spektrum</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="17" to="28" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">One-shot labeling for automatic relevance estimation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Macavaney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Soldaini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2230" to="2235" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Is ChatGPT good at search? investigating large language models as re-ranking agents</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="14918" to="14937" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jagerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.17563</idno>
		<title level="m">Large language models are effective text rankers with pairwise ranking prompting</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Variations in relevance judgments and the measurement of retrieval effectiveness</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="315" to="323" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Iter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.16634</idno>
		<title level="m">GPTEval: NLG evaluation using GPT-4 with better human alignment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Pangakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wolken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fasching</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.00176</idno>
		<title level="m">Automated annotation with generative AI requires validation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.20501</idno>
		<title level="m">LLMs may dominate information access: Neural retrievers are biased towards LLM-generated texts</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bendersky</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.14122</idno>
		<title level="m">Beyond yes and no: Improving zero-shot LLM rankers via scoring fine-grained relevance labels</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
