<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings (MMSR’24)</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvia Terragni</string-name>
          <email>silvia@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang Cuong</string-name>
          <email>hoang@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Daiber</string-name>
          <email>jo@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pallavi Gudipati</string-name>
          <email>pallavi@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo N. Mendes</string-name>
          <email>pablo@objective.inc</email>
        </contrib>
        <aff>Objective, Inc., San Francisco</aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Search</kwd>
        <kwd>Relevance Judgments</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Multimodal Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has shown that Large Language Models (LLMs) and Multimodal Language Models
(MLLMs) are viable alternatives for producing relevance judgments. LLMs-as-judges are enticing since
they can unlock higher relevance judgment throughput at a fraction of the cost. As a result, they
offer the potential of widespread relevance improvement in search systems through more accessible and
extensive evaluations, as well as training data generation. However, progress is hampered by a number
of under-explored questions about how to best employ LLMs-as-judges.
      </p>
      <p>In this paper, we evaluate a number of LLMs and MLLMs in terms of their alignment with human
judgments and ask the following research questions:
1. Is LLM performance use-case dependent? In other words, would the same LLM perform well in
one use case but not in another?
2. Is there a clear winner? In other words, is there a model that consistently outperforms all the
others across all use cases?
3. Is multimodal support necessary for search relevance judgment in multimodal search?
4. What models offer the optimal cost-accuracy trade-offs?</p>
      <p>In the next section we summarize related work. We then present our experimental setting and discuss
results. Finally, we present concluding remarks and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Large Language Models (LLMs) have shown exceptional abilities in a wide variety of tasks, and using
them for evaluating Information Retrieval systems is receiving considerable attention [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recent studies
have explored different methods for generating relevance judgments. For example, Prometheus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a
13-billion-parameter LLM designed to evaluate long texts using customized scoring rubrics provided
by users. JudgeLM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses fine-tuned LLMs as scalable judges to evaluate other LLMs effectively in
open-ended tasks. They find that JudgeLM has a high agreement with expert judges, over 90%, and
works well in evaluating single answers, multimodal models, multiple answers, and multi-turn dialogues.
Thomas et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] developed an LLM prompt based on feedback from search engine users. They show
accuracy similar to that of human judges, along with the ability to identify difficult queries, best results, and effective groupings.
They also find that both changes to prompts and simple paraphrases can improve accuracy.
      </p>
      <p>
        In the context of Multimodal LLMs (MLLMs), Chen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] assess these models as judges through a
new benchmark. They examine their performance in tasks such as Scoring Evaluation, Pair Comparison,
and Batch Ranking. The study points out that MLLMs need more improvements and research before
they can be fully trusted, as they can have biases like ego-centric bias, position bias, length bias, and
hallucinations. Additionally, Yang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] investigate the relevance estimation of Vision-Language
Models (VLMs), including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc zero-shot retrieval
task aimed at multimedia content creation.
      </p>
      <p>To the best of our knowledge, we are the first to compare the cost-accuracy trade-offs of several
generally available LLMs of different sizes.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This study employs a relevance evaluation process to assess the performance of LLMs and MLLMs
(collectively referred to as “models”) for search relevance judgments. We assess these models based on
two critical dimensions: accuracy and costs. Our evaluation pipeline consists of three stages:
• Data Collection: We obtained search results from three datasets across different domains using a
list of predefined queries.
• Human Annotation: Two trained human annotators assigned relevance grades to each (query,
result) pair following established relevance criteria.
• Model Evaluation: We applied a range of LLMs and MLLMs to generate relevance judgments for
the same sets of search results, comparing their performance against human annotations.
Each stage is discussed in detail in the following subsections, covering the datasets, retrieval system,
grading strategy, and the models used.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>We conducted our experiments on three datasets: Fashion, Hotel Supplies, and Design. The Fashion
dataset is a subset of the publicly available dataset H&amp;M Personalized Fashion Recommendations1. The
Hotel Supplies and Design datasets are proprietary and represent domains in the e-commerce search
for hotel supply products, and social media search for design assets, respectively. Each dataset includes
multiple textual fields per product, along with one or more associated images. Table 3.1 summarizes
the characteristics of each dataset, detailing the average number of fields per search result, the average
number of empty fields, and the average word count per result. These factors can impact the difficulty
of generating relevance judgments.</p>
        <table-wrap id="tab-3-1">
          <label>Table 3.1</label>
          <caption>
            <p>Characteristics of each dataset (Fashion, Hotel Supplies, and Design): total number of search results, average number of textual fields, average number of empty textual fields, and average number of words per result.</p>
          </caption>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval System and Evaluation</title>
        <p>
          To obtain relevant search results, we utilized a baseline retrieval system that combines BM25 with BGE
M3 embeddings [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which is one of the top-ranked text embedding models in the MTEB Leaderboard
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] as of June 2024. We created indexes for each dataset to enable eficient retrieval of results based on
a predefined list of queries. These queries were either derived from real trafic data or carefully crafted
by human experts to ensure they represented a wide range of search scenarios. Our aim was to include
queries and results that included hits and misses generated by both lexical and semantic retrievers.
        </p>
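        <p>As an illustration of the hybrid retrieval described above, the sketch below fuses per-document lexical (BM25) and semantic (embedding) scores. The fusion rule shown here, min-max normalization followed by an equally weighted sum, is a simplifying assumption for illustration and not necessarily the exact combination used in our system.</p>
        <preformat>
```python
# Sketch of hybrid lexical+semantic score fusion. The min-max normalization
# and the 0.5/0.5 weights are illustrative assumptions, not the paper's
# actual fusion rule.

def minmax_normalize(scores):
    """Scale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(bm25_scores, embedding_scores, w_lexical=0.5, w_semantic=0.5):
    """Rank documents by a weighted sum of normalized BM25 and embedding scores."""
    bm25 = minmax_normalize(bm25_scores)
    emb = minmax_normalize(embedding_scores)
    docs = set(bm25) | set(emb)
    fused = {d: w_lexical * bm25.get(d, 0.0) + w_semantic * emb.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Toy example: "a" wins on lexical overlap, "b" on semantic similarity.
bm25_scores = {"a": 12.3, "b": 4.1, "c": 0.5}
embedding_scores = {"a": 0.41, "b": 0.93, "c": 0.40}
print(hybrid_rank(bm25_scores, embedding_scores))  # -> ['b', 'a', 'c']
```
        </preformat>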
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Relevance Judgment Strategy</title>
        <p>Each dataset was structured as a collection of query-result pairs. Two expert human annotators assessed
the relevance of each pair on a 0-2 rating scale:
• 2: Highly relevant, a perfect match for the query
• 1: Somewhat relevant, a result that partially matches the query’s intent
• 0: Not relevant, a poor result that should not be shown
The human annotators were provided with detailed guidelines to ensure consistency in their relevance
judgments. Table 3.3 provides examples of different relevance judgment categories for the query “v-neck
white tee”. In the first row, the result is highly relevant, as both the image and the text describe a white
v-neck t-shirt. Therefore, the human relevance judgment for this pair is a 2. The second row shows a
partial match: while the text mentions a white t-shirt, the image depicts a white v-neck t-shirt with
black stripes, resulting in a relevance judgment of 1. The third row illustrates an irrelevant result (0),
where the product shown is a strap top, unrelated to the query.</p>
        <p>In the Appendix, we describe in detail our internal H&amp;M grading guidelines that human annotators
follow to assign relevance grades to query-result pairs. We note that our grading guidelines are based
on principles similar to those used for the Hotel Supplies and Design datasets. However, the guidelines
have also been adapted to suit the specific characteristics of each dataset.
1. https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations. For access to our human
annotations for this dataset, please reach out to the corresponding author.</p>
        <table-wrap id="tab-3-3">
          <label>Table 3.3</label>
          <caption>
            <p>Examples of relevance judgment categories for the query “v-neck white tee” (product images omitted).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Search result (textual fields)</th><th>Human judgment</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>prod_name: Premium ELKE vneck tee; index_name: Ladieswear; detail_desc: V-neck T-shirt in airy slub lin[...]; department_name: Jersey/Knitwear Premium; index_group_name: Ladieswear; colour_group_name: White; product_type_name: T-shirt; graphical_appearance_name: Solid; perceived_colour_value_name: Light; perceived_colour_master_name: White</td>
                <td>2</td>
              </tr>
              <tr>
                <td>prod_name: ED Lizzie tee; index_name: Ladieswear; detail_desc: Short-sleeved top in lightwei[...]; department_name: Jersey; index_group_name: Ladieswear; colour_group_name: White; product_type_name: T-shirt; graphical_appearance_name: Stripe; perceived_colour_value_name: Light; perceived_colour_master_name: White</td>
                <td>1</td>
              </tr>
              <tr>
                <td>prod_name: V-neck Strap Top; index_name: Ladieswear; detail_desc: V-neck top in soft organic [...]; department_name: Jersey Basic; index_group_name: Ladieswear; colour_group_name: White; product_type_name: Vest top; graphical_appearance_name: Solid; perceived_colour_value_name: Light; perceived_colour_master_name: White</td>
                <td>0</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Inter-Annotator Agreement</title>
        <p>To assess the reliability of the relevance judgments, whether human- or LLM-generated, we followed
common practice and calculated Cohen’s kappa coefficient. Cohen’s kappa is a robust statistical measure
commonly employed to quantify inter-annotator agreement for categorical data. Cohen’s kappa values
range from -1 to 1, where 1 indicates strong agreement, while values closer to 0 suggest agreement no
better than chance. To interpret the kappa values, we use the guidelines reported in Table 3.</p>
        <p>In our evaluation, we compute Cohen’s kappa to measure the agreement between human annotators
and LLM-generated annotations, as well as between pairs of human annotators. The degree of agreement
between human annotators also provides insight into the difficulty of evaluating certain datasets.</p>
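        <p>The agreement computation described above can be sketched as follows. The inputs are two equal-length lists of 0-2 relevance grades from any pair of annotators (human or model); this is a minimal reference implementation of Cohen’s kappa, not the exact code used in our pipeline.</p>
        <preformat>
```python
# Minimal sketch of Cohen's kappa between two annotators over 0/1/2
# relevance grades (human-human or human-LLM).
from collections import Counter

def cohens_kappa(grades_a, grades_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    assert len(grades_a) == len(grades_b)
    n = len(grades_a)
    # Observed agreement: fraction of items where the two annotators agree.
    p_o = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    count_a, count_b = Counter(grades_a), Counter(grades_b)
    labels = set(grades_a) | set(grades_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / n ** 2
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators always emit one label
    return (p_o - p_e) / (1 - p_e)

# Two annotators grading six (query, result) pairs on the 0-2 scale.
human = [2, 1, 0, 2, 0, 1]
model = [2, 1, 0, 2, 1, 1]
print(round(cohens_kappa(human, model), 3))  # -> 0.75
```
        </preformat>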
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Interpretation guidelines for Cohen’s kappa values.</p>
          </caption>
        </table-wrap>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Models</title>
        <p>Our evaluation included a range of LLMs and MLLMs to reflect varying levels of performance and cost.
We considered both large-scale proprietary models and more cost-efficient alternatives:
• OpenAI Models2: GPT-4V (gpt-4-vision-preview), GPT-4o (gpt-4o-2024-05-13), GPT-4o-mini
(gpt-4o-mini-2024-07-18)
• Anthropic Models3: Claude 3.5 Sonnet, Claude 3 Haiku</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Prompts</title>
        <p>To design the prompts for the models under consideration, we created a template aimed at guiding the
models to generate accurate relevance judgments.</p>
        <p>In the multimodal setup, where an image is provided, the prompt will reference and include the
image. Additionally, we require the model to provide an explanation for its relevance judgment. This
element could be useful for interpreting the model’s decisions.</p>
        <p>Below, we present the prompt used for the Claude family in the text-only scenario. For the complete
set of prompts, please refer to the Appendix. In the template, {{document}} and {{query}} are placeholders
for the search result and query, respectively.</p>
        <p>Haiku’s Prompt Template (Text-only Setup)
You are an assistant responsible for rating how the retrieved result is relevant to
the query. Output a token: "2", "1", or "0" followed by a full explanation.
Guidelines:
"2" - The result matches exactly with what the user's query is looking for.
"1" - The result is not exactly with what the user's query is looking for. But it's
pretty similar. As our aim is to be strict on exact matches, this grade is less
likely to be used.
"0" - The result is not related to the query at all.</p>
        <p>Result: {{document}}
Query: {{query}}
Output:</p>
        <p>It is important to note that these prompt templates are not the result of an extensive exploration of
all possible templates. In Section 4.2, we provide an analysis of the prompt engineering process that led
to the best-performing prompts for Haiku.
2. https://platform.openai.com/docs/models
3. https://docs.anthropic.com/en/docs/about-claude/models</p>
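        <p>A minimal sketch of how such a template can be instantiated, and how the model’s output can be parsed, is shown below. The template text is abbreviated, and the parsing rules are illustrative assumptions: the prompt only requires a leading "2", "1", or "0" token followed by an explanation.</p>
        <preformat>
```python
# Sketch of template instantiation and output parsing for the relevance
# prompt above. The template is abbreviated, and the parse rules (leading
# grade token, optional separators) are our assumptions for illustration.

PROMPT_TEMPLATE = """You are an assistant responsible for rating how the retrieved result is relevant to the query. Output a token: "2", "1", or "0" followed by a full explanation.
Result: {{document}}
Query: {{query}}
Output:"""

def build_prompt(document, query):
    """Fill the {{document}} and {{query}} placeholders."""
    return PROMPT_TEMPLATE.replace("{{document}}", document).replace("{{query}}", query)

def parse_judgment(model_output):
    """Extract (grade, explanation); returns None for malformed outputs."""
    text = model_output.strip().strip('"')
    if not text or text[0] not in "012":
        return None
    grade = int(text[0])
    explanation = text[1:].lstrip(' -:')
    return grade, explanation

prompt = build_prompt("prod_name: V-neck Strap Top", "v-neck white tee")
print(parse_judgment('0 The result is a strap top, unrelated to the query.'))
```
        </preformat>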
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Multimodal vs Single-modality Evaluation</title>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Agreement (Cohen’s kappa) between model-generated and human relevance judgments across the three datasets, comparing the multimodal (MM) and text-only (Text) variants of each model.</p>
          </caption>
        </table-wrap>
        <p>Use-case Dependency of LLM Performance The analysis reveals that LLM performance is
dependent on the use case. The models show varying levels of correlation with the human relevance
judgments across the different domains. For example, GPT-4V shows higher performance in the Hotel
Supplies use case but performs relatively worse in the other areas. We can observe a similar trend across
the other models. This variability in model performance is also connected to the inherent difficulty of
the tasks, which is likewise reflected in the varying levels of agreement among the human annotators for the
different use cases.</p>
        <p>One Model to Rule Them All? The multimodal version of GPT-4o generally performs better than
the other models in two out of three cases, achieving the highest average Cohen’s kappa coefficient
(0.548). It stands out in the Hotel Supplies and Fashion domains, where it shows substantial
agreement with human annotations. However, it is outperformed by Sonnet in the Hotel Supplies domain,
suggesting that no single model outperforms all the others across every use case.</p>
        <p>Tailor Model Prompt to a Specific Dataset We observe that tailoring a model’s prompt to a specific
domain can greatly improve its grading performance on the corresponding dataset. A notable example
is the text-only Haiku LLM. Despite it being among the least powerful models in our experiments, we
achieved the highest Cohen’s kappa coefficient for that dataset (0.6403 compared to 0.560) by refining
the prompt for the Hotel Supplies dataset. Nevertheless, we also note that such a dataset-adjusted prompt
can overfit to its target domain and transfer poorly to other datasets. For example, when using
the same prompt for the Design dataset, Cohen’s kappa coefficient decreases to 0.333 (instead of 0.431).</p>
        <p>Necessity of Multimodal Support In the table, we compare each Multimodal (MM) model with
its text-only (Text) counterpart. It is worth noting that the benefits of multimodal support are not
uniform across all the models and use cases. For models like GPT-4o, the vision component significantly
enhances the performance, increasing the correlation from 0.506 (Text) to 0.548 (MM). This leads to the
highest average performance, remarkably close to the human-human correlation (i.e. 0.589).
GPT-4V and Sonnet also benefit from the visual component. However, for smaller models, such as
Haiku, the vision component appears to have a detrimental effect, decreasing the correlation from 0.433
(Text) to 0.368 (MM). To further investigate the impact of the visual component in Haiku, we performed
an ablation study by excluding the textual component and relying solely on the image. Under this
configuration, the highest correlation achieved was 0.1 for the Design case, which is significantly lower
than the text-only correlation of 0.309. This indicates that for smaller models like Haiku, the visual
component may not be sufficiently robust to provide effective multimodal support.</p>
        <p>Error Analysis To investigate the poor performance of the smaller multimodal models, we conducted
an error analysis on a sample of relevance judgments from Haiku that disagreed with both human
annotators. We examined 31 instances of disagreement and identified three distinct error categories.
Notably, Haiku generates an explanation for its relevance judgments, which allows us to categorize the
errors effectively.</p>
        <p>The most frequent issue (17 cases) involved the model’s failure to correctly identify the product
type. For example, when given the query “pure cotton dressing gown”, the model misclassifies a linen
dressing gown, justifying its choice with the explanation: “The product is [...] made of linen, which is
a natural fiber similar to cotton”. In half of the remaining cases, Haiku’s errors originate from wrong
assumptions. For instance, the model confused the brand name for bras with the word “band” as in
“hairband”, leading to incorrect judgments. Lastly, 7 of the cases were related to the model’s vision
capabilities, where it failed to recognize specific patterns or prints on products, resulting in inaccurate
relevance assessments. Table 5 provides an example of this type of error, including the image, query,
and explanation generated by Haiku.</p>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption>
            <p>Example of a vision-related error by Haiku (product image omitted).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Query</th><th>Explanation</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>h&amp;m kids unicorn printed t-shirt</td>
                <td>The provided result does not match the user’s query for a “h&amp;m kids unicorn printed t-shirt”. The result is for a “Mia l/s top” which is a longsleeved top in soft, printed cotton jersey, but it does not appear to have a unicorn print. The image also does not show a unicorn print. Therefore, the result is not relevant to the user’s query, and I would rate it as a “0”.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Cost-Accuracy Trade-off Considering the previous results comparing multimodal versus text-only
performance, we can make important cost-accuracy trade-off considerations when choosing a model
to adopt for relevance judgment. The costs reported in Table 6 reflect the providers’ pricing as of
August 16, 2024. For image processing, calculations are based on handling 1M low-resolution images.
Specifically, OpenAI’s GPT-4V and GPT-4o allow users to submit a low-resolution version of the
image (512x512 pixels), which is represented with a budget of 85 tokens. This results in a cost of $0.000425 per image.4
4. For GPT-4o-mini, this limit is set at 2,833 tokens (instead of 85 tokens) per image, which leads to the same per-image cost.
For fairness to the Claude models, we thus also report their prices for images resized to 512x512 pixels.</p>
        <p>In this setup, the costs for image processing are fixed per search result, while the input and output
tokens are variable, depending on the length of the search result being evaluated. This means that,
when evaluating 1M images, we have a fixed cost of $425 for the GPT family, approximately $1048
for Sonnet, and $87 for Haiku. To these fixed costs, we must add variable expenses, depending on the
prompt, the search result, and output lengths. As Table 3.1 shows, our datasets contain search results
with varying word counts. Additionally, different models use different prompts and different tokenizers,
leading to differences in the number of tokens.</p>
        <p>Along with its strong performance on both text and multimodal tasks, GPT-4V is the most expensive
model, with high costs for tokens and image processing. With an average of 867 input tokens per result,
the cost for processing 1M multimodal search results with GPT-4V is $425
(image cost) + 867 tokens ⋅ $10.00 per 1M tokens ⋅ 1M results (input token cost) = $9,095.</p>
        <p>In contrast, GPT-4o offers higher performance at a lower cost. With an average of 889 input tokens
per result, the cost for processing 1M multimodal results is approximately $425 (image cost) + 889 ⋅ $5
(input token cost) = $4,870. This significantly lower cost (roughly half the cost of using GPT-4V) makes it
the current best choice when high precision is required.</p>
        <p>For Sonnet, with an average of 784 input tokens, the cost for processing 1M multimodal results would
be $1048.58 (image cost) + 784 ⋅ $3.00 (input token cost) = $3400.58. Given Sonnet’s performance as
the third-best model in terms of correlation with human evaluations, it represents a suitable choice for
scenarios where a moderate budget is available but maintaining high-quality results is still important.</p>
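        <p>The back-of-envelope estimates above can be reproduced with a small helper. Prices are per 1M input tokens as of August 16, 2024, and output-token costs are omitted, as in the in-text calculations.</p>
        <preformat>
```python
# Cost model for grading 1M multimodal search results, following the
# arithmetic in the text: fixed per-image cost plus variable input-token
# cost. Output-token costs are omitted, as in the in-text estimates.

def judgment_cost(n_results, avg_input_tokens, price_per_1m_tokens, image_cost_per_result):
    """Total cost in USD for n_results (query, result) pairs."""
    token_cost = n_results * avg_input_tokens * price_per_1m_tokens / 1_000_000
    return n_results * image_cost_per_result + token_cost

N = 1_000_000
# Arguments: (avg input tokens, $ per 1M input tokens, $ per image), from the text.
print(round(judgment_cost(N, 867, 10.00, 0.000425), 2))    # GPT-4V -> 9095.0
print(round(judgment_cost(N, 889, 5.00, 0.000425), 2))     # GPT-4o -> 4870.0
print(round(judgment_cost(N, 784, 3.00, 0.00104858), 2))   # Sonnet -> 3400.58
```
        </preformat>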
        <p>For smaller models like GPT-4o-mini and Haiku, the cost differences become significant, though
at the expense of performance. In a text-only setting, GPT-4o-mini offers the lowest cost per result,
making it an attractive option for large-scale applications where lower accuracy can be tolerated. While
Haiku’s cost per result is slightly higher than GPT-4o-mini’s, its performance is also superior. However,
our experiments indicate that the visual component did not significantly enhance the performance of
these smaller models for relevance judgment; in fact, it can even be detrimental. Therefore, the multimodal
capabilities of GPT-4o-mini and Haiku should be employed
with caution, especially considering the high costs associated with image processing—particularly for
GPT-4o-mini.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Prompt Engineering</title>
        <p>We made the following observations:
Strictness Guidelines Many of the initial disagreements with humans stemmed from the models
being more lenient about the 1 (OK) grade. Results improved after we appended instructions to prefer
grades 2 (GREAT) and 0 (BAD) – e.g. “As our aim is to be strict on exact matches, this grade is less
likely to be used.”</p>
      </sec>
      <sec id="sec-4-3">
        <title>Smaller Models Are More Sensitive to Prompt Complexity</title>
        <p>We found that smaller models, such as Haiku, are highly sensitive to prompt complexity, whereas larger models like GPT-4V manage these
complexities more effectively. For example, when using a prompt with comprehensive and somewhat
lengthy grading instructions, we observed a significantly higher Cohen’s kappa coefficient for the Hotel
Supplies dataset with GPT-4V (0.54) compared to Haiku (0.32).</p>
        <p>Note that this does not imply that Haiku is not suitable for the grading task; rather, it suggests that
the model performs better when prompts are simpler. We verified this hypothesis by making
prompts progressively more concise while still retaining the essential instructions. Ultimately, we were
able to refine the prompt for the Hotel Supplies dataset so that the text-only Haiku model achieves
the highest Cohen’s kappa coefficient of 0.64.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Different Models May Work with Different Prompts</title>
        <p>Although Haiku achieves the highest
Cohen’s kappa coefficient of 0.64 on the Hotel Supplies domain with the refined prompt we developed,
we did not observe the same improvement with GPT-4V. When using the same prompt, GPT-4V
maintained a similar Cohen’s kappa coefficient of approximately 0.54. This indicates that prompt
engineering can be highly model-specific, and a prompt that works exceptionally well for one LLM
may not perform as well for others. In fact, we could not find a systematic way to reliably
optimize model accuracy across the board. As a result, the process of prompt engineering feels more
like art than science, and this motivates further work to develop systematic ways to discover the upper limits
of accuracy for each model size.</p>
        <p>Asking for explanations In our experiments, we observed that asking LLMs to provide explanations
for grading outputs is beneficial for several important reasons:
• Relevance is subjective, and asking an LLM to explain its grading output can be helpful in verifying
the correctness of the assessment. In our experiments, it was not uncommon for humans to
initially disagree with the grading outputs from LLMs; however, they often reached a consensus
after reviewing the detailed explanations provided.
• Having an explanation also helps us to understand how to iterate via prompt engineering to
make the instructions less ambiguous for the model.</p>
        <p>We also note that asking LLMs to provide explanations may help the models perform better at grading.
However, prompts need to be carefully crafted, as we also observed that performance may
regress if this is not done well.</p>
        <p>In the end, we were able to meaningfully improve the accuracy of Haiku through prompt engineering
that asks the LLM for explanations (from 0.36 to 0.40). Given that this accuracy is not far behind, and Haiku
is 20-40 times cheaper than the GPT family, this makes it a very appealing option for application at
a large scale. For instance, smaller models can be used to generate larger label sets to explore recall
issues, while more expensive models focus on smaller sets to evaluate precision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we have presented a new analysis of MLLMs-as-a-Judge to assess the cost-accuracy
trade-offs of the relevance judgment capabilities of MLLMs across three multimodal search use cases: Hotel
Supplies, Design, and Fashion. Various LLMs have shown potential, but no single LLM showed an optimal
cost-accuracy trade-off across all use cases evaluated.</p>
      <p>We have found that, for any practitioner looking to choose the best LLM judge for their use case,
a comprehensive evaluation of all available models is time-intensive, financially demanding, and
energy-intensive, which can have a significant effect on the environment. This
motivates future work in the following directions: 1) improving the abilities of general MLLMs across
use cases, 2) improving the cost and computational efficiency of large MLLMs, and 3) creating small MLLMs
that are optimized for judging relevance in cost-optimal ways for more specialized applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Thanks to the entire Objective team for building many pieces of the puzzle that made this work possible.
Special thanks to Lance Hasson, Brian Porter, George Gkotsis, Kuei-da Liao, and Faizaan Merchant. We
would also like to thank Yev Rotar, John Gulley, and Brian Ip for their valuable relevance judgment
inputs.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Prompt Templates</title>
      <p>In this section, we report the prompts used for the considered models. In the templates, {{document}},
{{query}}, and {{image}} are placeholders for the search result, query, and image, respectively. For
OpenAI’s models, the image corresponds to an image URL, while for Anthropic’s models, it
corresponds to a base64-encoded image.</p>
      <p>Haiku and Sonnet’s Prompt Template (Multimodal Setup)</p>
      <p>You are an assistant responsible for rating how the retrieved result is relevant to the query. If an image
is available, use it to determine the relevance to the query. Output a token: "2", "1", or "0" followed by
a full explanation.</p>
      <p>Guidelines:
"2" - The result matches exactly with what the user's query is looking for.
"1" - The result is not exactly with what the user's query is looking for. But it's pretty similar. As our
aim is to be strict on exact matches, this grade is less likely to be used.
"0" - The result is not related to the query at all.</p>
      <p>Result: {{document}}
Query: {{query}}
### Query Analysis
Before you start grading, it's essential to understand user's intent by breaking apart the query. Keep in
mind, some queries may be more explicit than others. For instance, if user is searching for a clothing
product, then "Red checkered jacket" is more specific compared to "Red jacket". Another example, if user is
searching for a venue, then "Rock concert in San Francisco this weekend" is more specific compared to "Rock
concert in San Francisco". Therefore, adapt your grading contextually.</p>
      <p>Consider all the information from all fields.</p>
      <p>Note: All fields should be taken with equal importance. You should adhere strictly to these guidelines
while grading and ensure a holistic evaluation of
the search results based on all considered fields.</p>
      <p>You are given a search query and a search result in json format. If an image is available, use it to
determine the relevance to the query. You must indicate with a score whether the result is relevant or not.</p>
      <p>Follow the following format.</p>
      <p>User Role: User
Query: {{query}}
Result: {{result}}
Score: 0 or 1 or 2
--
User Role: User
Query: {{query}}
Result: {{document}}</p>
    </sec>
    <sec id="sec-8">
      <title>B. H&amp;M grading guidelines</title>
      <p>In this section, we introduce the internal H&amp;M grading guidelines that our human annotators follow to
assign relevance grades to each pair of &lt;query, search result&gt;.</p>
      <sec id="sec-8-1">
        <title>B.1. Grading rules</title>
        <p>• Is the query asking for a specific category of product, either narrow or broad?
– If the answer is yes, then the user is asking for results that are restricted to that category.</p>
        <p>Only results matching that product category will be marked as GREAT. If the results don’t
match the category, mark them as BAD.
– It is the grader’s job to identify the category as introduced within the query and classify
products accordingly.</p>
        <p>∗ For instance, examples of queries-&gt;categories can be “ofice clothes” -&gt;“clothes” or “music
t-shirt”-&gt;“t-shirt”. Note that t-shirt can be categorized as a sub-category of clothes and
clothes is a super-category of t-shirts. As such, “ofice clothes” includes more products
since it implies a broader categorization, whereas “music t-shirt” is restricting products
under “t-shirts”.
• Is the query mentioning a feature, for instance: color, size, utility (e.g. windproof, maternity)?
– If the query mentions a feature, only results matching that feature will be marked as GREAT.
– If the results don’t match the feature, mark it as BAD.
– If the results match the feature close enough but not exactly, mark it as OK. For example, if
the query is “yellow jacket” and a search result is a light orange jacket, or some jacket that
contains some clear patches of yellow but is otherwise not yellow, then this result is OK.
• If a query mentions both a category and a feature, only results matching both the category and the
feature will be marked as GREAT. Results matching the feature but not the category (or vice-versa)
are BAD.</p>
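        <p>How the per-aspect judgments combine into a grade is deterministic, even though the judgments themselves come from the human grader. A minimal sketch of that combination step, assuming a hypothetical helper (not part of the guidelines), is:</p>
        <preformat>
```python
# Hypothetical sketch of the deterministic core of the grading rules above.
# The grader supplies the per-aspect judgments; the rules only fix how those
# judgments combine into GREAT / OK / BAD.
def combine_grades(category_match=None, feature_match=None):
    """Combine per-aspect judgments into a grade.

    Each argument is "yes", "no", "close", or None (aspect absent from query).
    """
    aspects = [m for m in (category_match, feature_match) if m is not None]
    if not aspects:
        # The guidelines above do not define queries naming neither aspect.
        raise ValueError("query names neither a category nor a feature")
    if "no" in aspects:      # any missed aspect is disqualifying
        return "BAD"
    if "close" in aspects:   # e.g. a light orange jacket for "yellow jacket"
        return "OK"
    return "GREAT"           # all named aspects match exactly
```
        </preformat>
        <p>For example, a light orange jacket for the query “yellow jacket” matches the category but only approximately matches the feature, so it combines to OK.</p>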
      </sec>
      <sec id="sec-8-2">
        <title>B.2. Grading examples</title>
        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>Query</th>
                <th>BAD</th>
                <th>GREAT</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>yellow raincoat</td>
                <td>black raincoat, yellow cardigan, yellow jacket</td>
                <td>raincoats that are yellow</td>
              </tr>
              <tr>
                <td>rainy weather</td>
                <td>anything else/products that do not help during rainy weather</td>
                <td>any product that is rainproof</td>
              </tr>
              <tr>
                <td>sports utility wear</td>
                <td>anything that cannot be used during a sport activity or some kind of manual labour task</td>
                <td>products that visually and textually support comfortable movement and gym workouts</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Siro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <source>Report on the 1st workshop on large language models for evaluation in information retrieval (LLM4Eval 2024) at SIGIR 2024</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.05388. arXiv:2408.05388.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.04788. arXiv:2402.04788.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <article-title>Prometheus: Inducing fine-grained evaluation capability in language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.08491. arXiv:2310.08491.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Judgelm: Fine-tuned large language models are scalable judges</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.17631. arXiv:2310.17631.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Spielman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Large language models can accurately predict searcher preferences</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2309.10621. arXiv:2309.10621.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Toward automatic relevance judgment using vision-language models for image-text retrieval evaluation</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.01363. arXiv:2408.01363.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation</article-title>
          ,
          <source>arXiv preprint arXiv:2402.03216</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <article-title>Mteb: Massive text embedding benchmark</article-title>
          ,
          <source>arXiv preprint arXiv:2210.07316</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2210.07316. doi:10.48550/arXiv.2210.07316.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>