<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jheng-Hong Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jimmy Lin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Vision-Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion. Preliminary experiments reveal the following: (1) Both LLaVA and GPT-4V, encompassing open-source and closed-source visual-instruction-tuned Large Language Models (LLMs), achieve notable Kendall's τ ∼ 0.4 when compared to human relevance judgments, surpassing the CLIPScore metric. (2) While CLIPScore strongly prefers CLIP-based retrieval systems, LLM-based judgments are less biased towards such systems. (3) GPT-4V's score distribution aligns more closely with human judgments than other models, achieving a Cohen's κ value of around 0.08, which outperforms CLIPScore at approximately -0.096. These findings underscore the potential of LLM-powered VLMs in enhancing relevance judgments.</p>
      </abstract>
      <kwd-group>
        <kwd>Relevance Assessments</kwd>
        <kwd>Image-Text Retrieval</kwd>
        <kwd>Vision-Language Model</kwd>
        <kwd>Large Language Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Cranfield-style test collections, consisting of a document corpus, a set of queries, and manually
assessed relevance judgments, have long served as the foundation of information retrieval
research [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, evaluating every document for every query in a substantial corpus
often proves cost-prohibitive. To tackle this challenge, a subset of documents is selected for
assessment through a pooling process. While this method is cost-effective compared to user
studies, it has limitations due to its simplifications and struggles to adapt to complex search
scenarios and large document collections.
      </p>
      <p>
        In this study, we explore the adaptability of model-based relevance judgments for image–
text retrieval evaluation. Leveraging model-based retrieval judgments presents an appealing
option. Not only does it provide valuable insights before undertaking the laborious processes
of document curation, query creation, and costly annotation, but it also has the potential to
extend and scale up to complex search scenarios and large document collections. To explore
opportunities and meet the demands for large-scale, fine-grained, and long-form text enrichment
scenarios in image-text retrieval evaluation [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ], our objective is to extend the
human–machine collaborative framework proposed by Faggioli et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to the context of image-text
retrieval evaluation, alongside widely adopted model-based image-text evaluation metrics
[7, 8, 9, 10, 11].
      </p>
      <p>
        Our primary focus is on a fully automatic evaluation paradigm, where we harness the
capabilities of Vision–Language Models (VLMs), including CLIP [12], as well as visual instruction-tuned
Large Language Models (LLMs) like LLaVA [13, 14] and GPT-4V [15]. To evaluate this approach,
we conducted a pilot study using the TREC-AToMiC 2023 test collection, which is designed
for multimedia content creation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], based on our instruction prompt template for VLMs (cf.
Table 1 and Section 3.2).
      </p>
      <p>We observe that model-based relevance judgments generated by visual instruction-tuned
LLMs outperform the widely adopted CLIPScore [7] in terms of ranking correlations and
agreements when compared to human annotations. While this discovery holds promise, we
also uncover the potential evaluation bias when using model-based relevance judgments. Our
analysis reveals a bias in favor of CLIP-based retrieval systems in the rankings when employing
model-based relevance judgments, resulting in higher overall effectiveness assessments for
these systems. In summary, our contributions can be distilled as follows:
• We demonstrate and explore the feasibility of incorporating VLMs for fully automatic image–
text retrieval evaluation.
• We shed light on the evaluation bias when utilizing model-based relevance judgments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Evaluation Metrics for Image–Text Relevance. Nowadays, model-based evaluation metrics
are widely utilized in various vision–language tasks, including image captioning [7, 16] and
text-to-image synthesis [8, 17]. Among model-based approaches, CLIP-based methods [8, 9,
18, 10, 11], such as CLIPScore [7], are particularly prevalent. However, while these metrics
are capable of measuring coarse text-image similarity, they may fall short in capturing
fine-grained image–text correspondence [
        <xref ref-type="bibr" rid="ref3">3, 19</xref>
        ]. Recent research has highlighted the effectiveness
of enhancing model-based evaluation metrics by leveraging LLMs to harness their reasoning
capabilities [16, 20, 21]. There exists significant potential for incorporating LLMs into
model-based approaches, as LLM outputs are not limited to mere scores but can also provide free-form
texts, e.g., reasons, for further analysis and many downstream tasks [22].
      </p>
      <p>
        Model-based Relevance Judgments. Traditionally, relevance judgments in retrieval tasks
have adhered to the Cranfield evaluation paradigm due to its cost-effectiveness, reproducibility,
and reliability when compared to conducting user studies. However, this approach often relies
on simplified assumptions and encounters scalability challenges. Researchers have recently
explored model-based automatic relevance estimation as a promising alternative. This approach
aims to optimize human-machine collaboration to obtain ideal relevance judgments. Notably,
studies of Dietz and Dalton [23] and Faggioli et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have revealed high rank correlations
between model-based and human-based judgments. Additionally, MacAvaney and Soldaini [24]
have delved into the task of filling gaps in relevance judgments using model-based annotations.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, we investigate techniques for estimating image-text relevance scores, denoted
as ℱ(t, i) ∈ ℝ, where t represents the text (query) and i represents the image (document).
Our primary focus is on utilizing VLMs to generate relevance scores, akin to empirical values
annotated by human assessors, denoted as ℱ̂(t, i). The main objective is to assess the proximity
between model-based ℱ and human-based ℱ̂ in image–text retrieval evaluation. We begin with
a discussion of the setting for human-based annotations, followed by the process for generating
model-based annotations.</p>
      <sec id="sec-3-1">
        <title>3.1. Human-based Annotations</title>
        <p>Our primary focus revolves around a critical aspect of multimedia content creation, specifically,
the image suggestion task, an ad hoc image retrieval task as part of the AToMiC track in the TREC
conference 2023 (TREC-AToMiC 2023; task guidelines: https://trec-atomic.github.io/trec-2023-guidelines). The image suggestion task aims to identify relevant
images from a predefined collection, given a specific section of an article. Its overarching goal
is to enrich textual content by selecting images that aid readers in better comprehending the
material.</p>
        <p>Relevance scores for this task are meticulously annotated by NIST assessors, adhering to the
TREC-style top-k pooling relevance annotation process. A total of sixteen valid participant
runs, generated by diverse image–text retrieval systems, are considered, encompassing
(CLIP-based) dense retrievers, learned sparse retrievers, caption-based retrievers, hybrid systems, and
multi-stage retrieval systems. The pooling depth is set to 25 for eight baseline systems and 30
for the remaining participant runs.</p>
          <p>NIST assessors classify candidate results into three graded relevance levels to capture nuances
in suitability, guided by the content of the test query. The test query comprises textual elements
such as the section title, section context description, page title, and page context description.
Assessors base their relevance judgments on the following criteria:
• 0 (Non-relevant): Candidates deemed irrelevant.
• 1 (Related): Candidates that are related but not relevant to the section context are categorized
as related. They contain pertinent information but do not align with the section’s context.
• 2 (Relevant): These candidates are considered relevant to the section context and effectively
illustrate it.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model-based Annotations</title>
        <p>For automatic relevance estimation, we employ pretrained VLMs as our relevance estimator,
denoted as ℱ(t, i | p). Our relevance estimator produces relevance scores given a pair of t and
i, which is conditioned on p, where p represents the prompt template we used to instruct the
models. Prompt engineering is a commonly adopted technique for enhancing or guiding VLMs
and LLMs in various tasks [25, 12]. It’s important to note that our current focus is on pointwise
estimation, leaving more advanced ranking methods (such as pairwise or listwise) that consider
multiple t and i for future exploration [26, 27].</p>
        <p>Prompt Template Design. In line with our approach to relevance score annotation, we have
created a prompt template designed to guide models in generating relevance scores. The prompt
template, presented in Table 1, has been constructed based on our heuristics and is not an
exhaustive search of all possible templates. Pretrained VLMs are expected to take both t and i
to produce a relevance score following the instructions defined in the prompt template p. We
anticipate that VLMs will independently process textual and visual information, and our prompt
template is only applied to textual inputs. Our template comprises three essential components
(a hypothetical sketch follows the list):
• Context: This section processes the textual information from t. (For VLMs with limited
context windows, e.g., CLIP, we only take the texts in the context part and ignore all the
remaining instructions.)
• Relevance Instruction: It incorporates task-specific information designed to provide VLMs
with an understanding of the task.
• Output Instruction: This component offers instructions concerning the expected output, e.g.,
output types and format.</p>
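        <p>To make this concrete, the following Python sketch shows one way the three-part template might be assembled. The build_prompt helper and its wording are hypothetical illustrations, not the verbatim template of Table 1.</p>
        <preformat>
# A minimal sketch of the three-part prompt template p (illustrative
# placeholder wording; the verbatim template is given in Table 1).
def build_prompt(page_title: str, section_title: str, context: str) -> str:
    # Context: textual information from the test query t.
    context_part = (
        f"Page title: {page_title}\n"
        f"Section title: {section_title}\n"
        f"Section context: {context}\n"
    )
    # Relevance Instruction: task-specific guidance for the VLM.
    relevance_instruction = (
        "You are judging whether the attached image should illustrate the "
        "Wikipedia section above. Assess the image's relevance to the "
        "section context.\n"
    )
    # Output Instruction: expected output type and format.
    output_instruction = (
        "Answer with a single integer: 0 (non-relevant), 1 (related), "
        "or 2 (relevant)."
    )
    return context_part + relevance_instruction + output_instruction
</preformat>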
        <p>From Scores to Relevance Judgments. We utilize parsing scripts to process the relevance
scores generated by the models and convert them into relevance judgments (for CLIP, relevance
scores are computed using text and image embeddings directly). Considering
potential score variations across different models, we apply an additional heuristic rule to
map these scores into graded relevance levels: 0 (non-relevant), 1 (related), and 2 (relevant).
Specifically, scores falling below the median value are categorized as 0; scores within the
50–75th quantile range are designated as 1; and scores exceeding the 75th quantile are assigned a
relevance level of 2. Ranking correlation is reported in terms of Kendall's τ, Spearman's ρ, and
Pearson's r, whereas judgment agreement is reported in terms of Cohen's κ
when comparing to NIST qrels.</p>
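        <p>A minimal sketch of this mapping rule, assuming raw scores are collected in a NumPy array (the helper name is ours):</p>
        <preformat>
import numpy as np

def scores_to_grades(scores: np.ndarray) -> np.ndarray:
    """Map raw model scores to graded relevance via the quantile heuristic
    described above: below median -> 0, 50th-75th quantile -> 1, above
    75th quantile -> 2."""
    q50, q75 = np.quantile(scores, [0.50, 0.75])
    grades = np.zeros(len(scores), dtype=int)  # default: non-relevant
    grades[scores >= q50] = 1                  # 50th-75th quantile: related
    grades[scores > q75] = 2                   # above 75th: relevant
    return grades
</preformat>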
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Setups</title>
        <p>We have undertaken an empirical comparison between human assessors and vision-language
models to offer an initial evaluation of their current capabilities in estimating relevance
judgments. This comparative analysis encompasses one embedding-based model (CLIP) and two
LLMs trained by visual instruction tuning (LLaVA and GPT-4V). The experiments were carried
out in January 2024.</p>
        <sec id="sec-4-1-1">
          <title>Test Collection.</title>
          <p>Our study focuses on the image suggestion task in TREC-AToMiC 2023. In this task,
queries are sections from Wikipedia pages, and the corpus contains images from
Wikipedia. We assess VLMs’ ability to assign relevance labels to 9,818 image–text pairs across 74
test topics. We predict relevance scores, generate qrels for 16 retrieval runs, and compare them
with NIST human-assigned qrels. Note that the test topics consist of Wikipedia text sections
(level-3 vital articles) without accompanying images, and NIST qrels are not publicly accessible
during the training of VLMs we study in this work.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Vision–Language Models.</title>
          <p>Our experiments feature three models: CLIP [12], LLaVA [13, 14], and GPT-4V [15].
CLIP serves as a versatile baseline model, offering similarity scores for image–
text pairs. We use CLIPScore [7] (referred to as CLIP-S) for calculating relevance with CLIP.
However, CLIP has limitations due to its text encoder’s token limit (77 tokens), making it less
adaptable for complex tasks with lengthy contexts. In contrast, LLMs like LLaVA and GPT-4V,
fine-tuned for visual instruction understanding, possess larger text encoders capable of handling
extended context. These models excel in various vision-language tasks, making them more
versatile compared to CLIP.</p>
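          <p>For reference, the following sketch computes CLIP-S for a single pair with a Hugging Face CLIP checkpoint. It assumes the transformers CLIP API and the 2.5 · max(cos, 0) rescaling of Hessel et al. [7]; our runs may differ in checkpoint and preprocessing details.</p>
          <preformat>
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(text: str, image: Image.Image) -> float:
    """CLIP-S = 2.5 * max(cos(text_emb, image_emb), 0) [7]. Note the
    77-token limit: long section contexts are truncated."""
    text_inputs = processor.tokenizer([text], return_tensors="pt",
                                      padding=True, truncation=True)
    image_inputs = processor.image_processor(images=image,
                                             return_tensors="pt")
    with torch.no_grad():
        t = model.get_text_features(**text_inputs)
        v = model.get_image_features(**image_inputs)
    cos = torch.nn.functional.cosine_similarity(t, v).item()
    return 2.5 * max(cos, 0.0)
</preformat>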
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Correlation Study</title>
        <p>In this subsection, our primary aim is to investigate the extrinsic properties of relevance
judgments generated by various approaches, where we base our analysis on retrieval runs
and ranking metrics. While various techniques exist to enhance the capabilities of
vision-language models, including prompt engineering, few-shot instructions, and instruction tuning,
our current focus centers on examining their zero-shot capabilities. We defer the exploration of
other methods to future research endeavors. Following the work of Voorhees [28], we undertake
an investigation into the system ranking correlation and the agreement between the relevance
labels estimated by the model and those provided by NIST annotators. We evaluate the ranking
correlations concerning the primary metrics utilized in the AToMiC track: NDCG@10 and
MAP, and calculate Kendall's τ, Spearman's ρ, and Pearson's r. In our agreement study, we
compute Cohen's κ using NIST's qrels as references.</p>
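        <p>As a sketch of these measurements, assuming per-run metric scores aligned by run id and per-pair graded labels (SciPy and scikit-learn provide the statistics):</p>
        <preformat>
from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

def ranking_correlations(human_metric, model_metric):
    """Per-system NDCG@10 (or MAP) under human vs. model qrels,
    aligned by run id."""
    tau, _ = kendalltau(human_metric, model_metric)
    rho, _ = spearmanr(human_metric, model_metric)
    r, _ = pearsonr(human_metric, model_metric)
    return tau, rho, r

def label_agreement(human_labels, model_labels):
    """Cohen's kappa over graded labels (0/1/2) for the same pairs."""
    return cohen_kappa_score(human_labels, model_labels)
</preformat>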
        <p>Overall. The primary results are showcased in Table 2, where rows correspond to the
backbone model used for relevance judgment generation. Notably, models leveraging LLMs such as
LLaVA and GPT-4V outperform the CLIP-S baseline concerning ranking correlation. Specifically,
they achieve Kendall's τ values of approximately 0.4 for NDCG@10 and around 0.5 for MAP.
For comparison, previous research reported a τ of 0.9 for MAP when comparing two types of
human judgments [28]. While there is still room for further improvement, our observations
already demonstrate enhancement compared to the CLIP-S baseline: 0.200 (0.333) for NDCG@10
(MAP). Moreover, other correlation coefficients, including Spearman and Pearson, corroborate
the trends identified by Kendall's τ. Additionally, we notice a rising trend in agreement levels
when transitioning from CLIP-S (-0.096) to GPT-4V (0.080), as evidenced by Cohen's κ values.
The agreements achieved by the two largest models (LLaVA-13b and GPT-4V) are categorized
as ’slight,’ which represents an improvement over the smaller LLaVA-7b model and the baseline.</p>
        <p>Evaluation Bias. Model-based evaluations can introduce bias, often favoring models that
are closely related to the assessor model [29, 30]. We term this phenomenon evaluation bias.</p>
        <p>This is distinct from source bias, which indicates that neural retrievers might prefer content
generated by generative models [31]. To address this potential concern, we conducted an initial
analysis using the scatter plot presented in Fig. 1. In this analysis, we compared the NDCG@10
scores of the 16 submissions made by participants employing different sets of qrels. Each data
point on the plot corresponds to a specific run, with distinct markers representing variations in
results based on relevance estimation models. Upon closer examination of the plot, we identified
a positive correlation between model-based and human-based qrels. Notably, the effectiveness
of submitted systems appeared slightly higher under model-based qrels than under human-based
qrels.</p>
        <p>To gain deeper insights, we’ve visually highlighted CLIP-based submissions in red for a
thorough investigation. This visual distinction underscores the preference for model-based
qrels for CLIP-based systems, especially evident with CLIP-S qrels. We quantitatively assess
this bias using a metric adapted from the work of Dai et al. [31]:</p>
        <p>Relative Δ = 2 × (Metric_CLIP-based − Metric_Others) / (Metric_CLIP-based + Metric_Others) × 100%,</p>
        <p>where Metric stands for a measure, e.g., NDCG@k, averaged across systems. Observing Table 3,
CLIP-S exhibits a strong bias, with Relative Δ = 114.7 for NDCG@10 and 120.5 for MAP.
LLM-based approaches also display a slight bias towards CLIP-based systems, possibly because
both LLaVA and GPT-4V rely on CLIP embeddings for image representations. In contrast,
human-based qrels show the lowest bias: -11.7 for NDCG@10 and -19.5 for MAP.</p>
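        <p>This quantity is a direct transcription of the formula above; a small sketch, assuming mean metric values are computed over the two groups of runs:</p>
        <preformat>
from statistics import mean

def relative_delta(clip_based_runs, other_runs):
    """Relative Δ adapted from Dai et al. [31]: percentage gap between the
    mean metric (e.g., NDCG@10) of CLIP-based runs and all other runs."""
    m_clip, m_other = mean(clip_based_runs), mean(other_runs)
    return 2 * (m_clip - m_other) / (m_clip + m_other) * 100.0
</preformat>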
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Estimated Relevance Analysis</title>
        <p>In this subsection, we aim to explore the intrinsic properties of relevance judgments generated
by various systems. We began our analysis by examining score distributions, visualized in
Figures 2 and 3, to gain insights into model-based scores.</p>
        <p>Figure 2 presents a Cumulative Distribution Function (CDF) plot of scores before
post-processing into relevance levels (0, 1, and 2). We included NIST qrels (human) results for
reference. Notably, GPT-4V’s score distribution closely aligns with the human CDF, while
CLIP-S exhibits a smoother S-shaped distribution with limited representation of low-relevance data.
LLaVA produces tightly concentrated scores, adding complexity to post-processing, particularly
when compared to GPT-4V.</p>
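        <p>A small sketch of how such a CDF plot can be produced from raw scores (min-max normalization and empirical CDF; plotting details are illustrative):</p>
        <preformat>
import numpy as np
import matplotlib.pyplot as plt

def plot_score_cdf(scores_by_model: dict) -> None:
    """Empirical CDF of min-max normalized scores, one curve per model,
    mirroring the layout of Figure 2."""
    for name, s in scores_by_model.items():
        s = np.sort(np.asarray(s, dtype=float))
        s = (s - s.min()) / (s.max() - s.min())   # min-max normalization
        cdf = np.arange(1, len(s) + 1) / len(s)   # empirical CDF
        plt.plot(s, cdf, label=name)
    plt.xlabel("Score (Min-Max Normalized)")
    plt.ylabel("CDF")
    plt.legend()
    plt.show()
</preformat>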
        <p>Figure 3 illustrates confusion matrices, highlighting LLaVA’s tendency to generate more 1
(related) judgments and fewer 2 (relevant) and 0 (non-relevant) judgments than GPT-4V.
We anticipate that future models will strive to produce score distributions that better match
human annotations, thereby addressing these challenges and limitations. Further studies [32]
on harnessing LLMs’ relevance prediction capability are necessary.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study delves into the capabilities of VLMs such as CLIP, LLaVA, and GPT-4V for
automating relevance judgments in image–text retrieval evaluation. Our findings reveal that
visual-instruction-tuned LLMs outperform traditional metrics like CLIPScore in aligning with
human judgments, with GPT-4V showing particular promise due to its closer alignment with
human judgment distributions.</p>
      <p>Despite these advancements and the low cost of model-based relevance annotation, challenges
such as evaluation bias and the complexity of mimicking human judgments remain. These
issues underscore the need for ongoing model refinement and exploration of new techniques to
enhance the reliability and scalability of automated relevance judgments.</p>
      <p>In conclusion, our research highlights the potential of VLMs in streamlining multimedia
content creation while also pointing to the critical areas requiring further investigation. The path
toward fully automated relevance judgment is complex, necessitating continued collaborative
efforts in the research community to harness the full potential of VLMs in this domain.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This research was supported in part by the Canada First Research Excellence Fund and the
Natural Sciences and Engineering Research Council (NSERC) of Canada.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>
[7] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, Y. Choi, CLIPScore: A reference-free
evaluation metric for image captioning, in: Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing, 2021, pp. 7514–7528.
[8] D. H. Park, S. Azadi, X. Liu, T. Darrell, A. Rohrbach, Benchmark for compositional
text-to-image synthesis, in: Thirty-fifth Conference on Neural Information Processing Systems
Datasets and Benchmarks Track (Round 1), 2021.
[9] J.-H. Kim, Y. Kim, J. Lee, K. M. Yoo, S.-W. Lee, Mutual information divergence: A unified
metric for multimodal generative models, in: Advances in Neural Information Processing
Systems, volume 35, 2022, pp. 35072–35086.
[10] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, DreamBooth: Fine-tuning
text-to-image diffusion models for subject-driven generation, in: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp.
22500–22510.
[11] E. Kreiss*, E. Zelikman*, C. Potts, N. Haber, ContextRef: Evaluating referenceless metrics
for image description generation, arXiv preprint arXiv:2309.11710 (2023).
[12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language
supervision, in: International Conference on Machine Learning, 2021, pp. 8748–8763.
[13] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: Advances in Neural Information
Processing Systems, 2023.
[14] H. Liu, C. Li, Y. Li, Y. J. Lee, Improved baselines with visual instruction tuning, in: NeurIPS
2023 Workshop on Instruction Tuning and Instruction Following, 2023.
[15] OpenAI, GPT-4V(ision) system card, 2023.
[16] D. M. Chan, S. Petryk, J. E. Gonzalez, T. Darrell, J. Canny, CLAIR: Evaluating image
captions with large language models, in: Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, 2023, pp. 13638–13646.
[17] Y. Hu, B. Liu, J. Kasai, Y. Wang, M. Ostendorf, R. Krishna, N. A. Smith, TIFA: Accurate
and interpretable text-to-image faithfulness evaluation with question answering, in: 2023
IEEE/CVF International Conference on Computer Vision, 2023, pp. 20349–20360.
[18] D. Chan, A. Myers, S. Vijayanarasimhan, D. Ross, J. Canny, IC3: Image captioning by
committee consensus, in: Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing, 2023, pp. 8975–9003.
[19] M. Yuksekgonul, F. Bianchi, P. Kalluri, D. Jurafsky, J. Zou, When and why vision-language
models behave like bags-of-words, and what to do about it?, in: The Eleventh International
Conference on Learning Representations, 2023.
[20] Y. Lu, X. Yang, X. Li, X. E. Wang, W. Y. Wang, LLMScore: Unveiling the power of large
language models in text-to-image synthesis evaluation, arXiv preprint arXiv:2305.11116
(2023).
[21] F. Betti, J. Staiano, L. Baraldi, R. Cucchiara, N. Sebe, Let’s ViCE! Mimicking human cognitive
behavior in image generation evaluation, in: Proceedings of the 31st ACM International
Conference on Multimedia, 2023, p. 9306–9312.
[22] A. Zeng, M. Attarian, brian ichter, K. M. Choromanski, A. Wong, S. Welker, F. Tombari,
A. Purohit, M. S. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, P. Florence, Socratic models:
Composing zero-shot multimodal reasoning with language, in: The Eleventh International
Conference on Learning Representations, 2023.
[23] L. Dietz, J. Dalton, Humans optional? automatic large-scale test collections for entity,
passage, and entity-passage retrieval, Datenbank-Spektrum 20 (2020) 17–28.
[24] S. MacAvaney, L. Soldaini, One-shot labeling for automatic relevance estimation, in:
Proceedings of the 46th International ACM SIGIR Conference on Research and Development
in Information Retrieval, 2023, p. 2230–2235.
[25] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances
in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.
[26] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Is ChatGPT good at search?
investigating large language models as re-ranking agents, in: Proceedings of the 2023
Conference on Empirical Methods in Natural Language Processing, 2023, pp. 14918–14937.
[27] Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, J. Shen, T. Liu, J. Liu, D. Metzler, X. Wang,
et al., Large language models are effective text rankers with pairwise ranking prompting,
arXiv preprint arXiv:2306.17563 (2023).
[28] E. M. Voorhees, Variations in relevance judgments and the measurement of retrieval
effectiveness, in: Proceedings of the 21st annual international ACM SIGIR conference on
Research and development in information retrieval, 1998, pp. 315–323.
[29] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, GPTEval: NLG evaluation using GPT-4 with
better human alignment, arXiv preprint arXiv:2303.16634 (2023).
[30] N. Pangakis, S. Wolken, N. Fasching, Automated annotation with generative AI requires
validation, arXiv preprint arXiv:2306.00176 (2023).
[31] S. Dai, Y. Zhou, L. Pang, W. Liu, X. Hu, Y. Liu, X. Zhang, J. Xu, LLMs may dominate
information access: Neural retrievers are biased towards LLM-generated texts, arXiv
preprint arXiv:2310.20501 (2023).
[32] H. Zhuang, Z. Qin, K. Hui, J. Wu, L. Yan, X. Wang, M. Bendersky, Beyond yes and no:
Improving zero-shot LLM rankers via scoring fine-grained relevance labels, arXiv preprint
arXiv:2310.14122 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Cleverdon</surname>
          </string-name>
          ,
          <article-title>The Aslib Cranfield research project on the comparative efficiency of indexing systems</article-title>
          ,
          <source>in: Aslib Proceedings</source>
          , volume
          <volume>12</volume>
          ,
          <year>1960</year>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          , Ö. Alaçam,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <article-title>Towards multi-modal text-image retrieval to improve human reading</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kreiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooshmand</surname>
          </string-name>
          , E. Zelikman,
          <string-name>
            <given-names>M. Ringel</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <article-title>Context matters for image descriptions for accessibility: Challenges for referenceless evaluation metrics</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4685</fpage>
          -
          <lpage>4697</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zouhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <article-title>Enhancing textbooks with visuals from the web for improved learning</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>11931</fpage>
          -
          <lpage>11944</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassance</surname>
          </string-name>
          , R. Sampaio De Rezende,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Redi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>AToMiC: An image/text retrieval test collection to support multimedia content creation</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR conference on research and development in information retrieval</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>2975</fpage>
          -
          <lpage>2984</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , G. Demartini,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          , E. Kanoulas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , et al.,
          <article-title>Perspectives on large language models for relevance judgment</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>