<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Silvia</forename><surname>Terragni</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Objective, Inc. San Francisco</orgName>
								<address>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hoang</forename><surname>Cuong</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Objective, Inc. San Francisco</orgName>
								<address>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Joachim</forename><surname>Daiber</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Objective, Inc. San Francisco</orgName>
								<address>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pallavi</forename><surname>Gudipati</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Objective, Inc. San Francisco</orgName>
								<address>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pablo</forename><forename type="middle">N</forename><surname>Mendes</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Objective, Inc. San Francisco</orgName>
								<address>
									<region>CA</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">54174453F33DBC86B14E1A16C97A6433</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Multimodal Search</term>
					<term>Relevance Judgments</term>
					<term>Large Language Models</term>
					<term>Multimodal Large Language Models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Search relevance evaluation is the process of assessing how effectively an information retrieval system returns results that are relevant to a user's search query. The process typically involves multiple human judges, tasked with stating whether each search result is relevant to a search query. The resulting relevance judgments are then aggregated through evaluation metrics to quantify relevance. These, in turn, enable researchers and practitioners to compare different retrieval systems in order to select the best option for a given application.</p><p>Multimodal Search presents additional challenges in search relevance evaluations due to the complexity of interpreting and integrating information from various attributes across different modalities. For instance, in e-commerce search, assessing relevance requires understanding the intent behind the search query and comparing it with a judge's interpretation of product relevance based on multiple features including the title, description, and images, as well as other attributes such as category, color, and price.</p><p>The task is further complicated by different characteristics across use cases. For instance, when searching for highly visual items (e.g. design assets), images play a much more central role than in use cases where product category and other attributes are more important (e.g. searching for hotel supplies). Data quality also varies significantly by use case. In applications with user-generated content, data may be missing or low quality; for example, 
product descriptions often conflict with the information that can be gleaned from images.</p><p>While human annotators remain the most reliable source for obtaining relevance judgments, the process is costly and time-consuming.</p><p>Recent work <ref type="bibr" target="#b1">[1]</ref><ref type="bibr" target="#b2">[2]</ref> has shown that Large Language Models (LLMs) and Multimodal Language Models (MLLMs) are viable alternatives for producing relevance judgments. LLMs-as-judges are enticing since they can unlock higher relevance judgment throughput at a fraction of the cost. As a result, they offer the potential of widespread relevance improvements in search systems thanks to more accessible and extensive evaluations, as well as training data generation. However, progress is hampered by a number of under-explored questions about how best to employ LLMs-as-judges.</p><p>In this paper, we evaluate a number of LLMs and MLLMs in terms of their alignment with human judgments and ask the following research questions:</p><p>1. Is LLM performance use-case dependent? In other words, would the same LLM perform well in one use case but not in another? 2. Is there a clear winner? In other words, is there a model that consistently outperforms all the others across all use cases? 3. Is multimodal support necessary for search relevance judgment in multimodal search? 4. What models offer the optimal cost-accuracy trade-offs?</p><p>In the next section we summarize related work. We then present our experimental setting and discuss results. Finally, we present concluding remarks and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Large Language Models (LLMs) have shown exceptional abilities in a wide variety of tasks, and using them for evaluating Information Retrieval systems is receiving considerable attention <ref type="bibr" target="#b1">[1]</ref>. Recent studies have explored different methods for generating relevance judgments. For example, Prometheus <ref type="bibr" target="#b3">[3]</ref> is a 13-billion parameter LLM designed to evaluate long texts using customized scoring rubrics provided by users. JudgeLM <ref type="bibr">[4]</ref> uses fine-tuned LLMs as scalable judges to evaluate other LLMs effectively in open-ended tasks. They find that JudgeLM has a high agreement with expert judges, over 90%, and works well in evaluating single answers, multimodal models, multiple answers, and multi-turn dialogues. Thomas et al. <ref type="bibr" target="#b5">[5]</ref> developed an LLM prompt based on feedback from search engine users. They show accuracy similar to human judges and can identify difficult queries, best results, and effective groupings. They also find that both changes to prompts and simple paraphrases can improve accuracy.</p><p>In the context of Multimodal LLMs (MLLMs), Chen et al. <ref type="bibr" target="#b2">[2]</ref> assess these models as judges through a new benchmark. They examine their performance in tasks such as Scoring Evaluation, Pair Comparison, and Batch Ranking. The study points out that MLLMs need more improvements and research before they can be fully trusted, as they can have biases like ego-centric bias, position bias, length bias, and hallucinations. Additionally, Yang et al. 
<ref type="bibr" target="#b6">[6]</ref> investigate the relevance estimation of Vision-Language Models (VLMs), including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc zero-shot retrieval task aimed at multimedia content creation.</p><p>To the best of our knowledge, we are the first to compare the cost-accuracy trade-offs of several generally available LLMs of different sizes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>This study employs a relevance evaluation process to assess the performance of LLMs and MLLMs (collectively referred to as "models") for search relevance judgments. We assess these models along two critical dimensions: accuracy and cost. Our evaluation pipeline consists of three stages:</p><p>• Data Collection: We obtained search results from three datasets across different domains using a list of predefined queries. • Human Annotation: Two trained human annotators assigned relevance grades to each (query, result) pair following established relevance criteria. • Model Evaluation: We applied a range of LLMs and MLLMs to generate relevance judgments for the same sets of search results, comparing their performance against human annotations.</p><p>Each stage is discussed in detail in the following subsections, covering the datasets, retrieval system, grading strategy, and the models used.</p></div>
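As a minimal illustration of the comparison step in this pipeline, the judged data can be modeled as pairs carrying both a human and a model grade. The Python sketch below is ours, not from the paper; names such as `JudgedPair` and `agreement_rate` are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class JudgedPair:
    query: str
    result_id: str
    human_grade: int   # 0, 1, or 2, assigned by a trained annotator
    model_grade: int   # 0, 1, or 2, produced by an LLM/MLLM judge

def agreement_rate(pairs: list[JudgedPair]) -> float:
    """Fraction of pairs where the model grade matches the human grade."""
    if not pairs:
        return 0.0
    return sum(p.human_grade == p.model_grade for p in pairs) / len(pairs)

pairs = [
    JudgedPair("v-neck white tee", "p1", human_grade=2, model_grade=2),
    JudgedPair("v-neck white tee", "p2", human_grade=1, model_grade=0),
]
print(agreement_rate(pairs))  # 0.5
```

In practice, raw agreement like this is only a first check; the paper's analysis relies on the chance-corrected Cohen's kappa described in Section 3.4.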
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Datasets</head><p>We conducted our experiments on three datasets: Fashion, Hotel Supplies, and Design. The Fashion dataset is a subset of the publicly available dataset H&amp;M Personalized Fashion Recommendations<ref type="foot" target="#foot_0">1</ref> . The Hotel Supplies and Design datasets are proprietary and represent domains in e-commerce search for hotel supply products and social media search for design assets, respectively. Each dataset includes multiple textual fields per product, along with one or more associated images. Table <ref type="table" target="#tab_0">1</ref> summarizes the statistics of the datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Retrieval System and Evaluation</head><p>To obtain relevant search results, we utilized a baseline retrieval system that combines BM25 with BGE M3 embeddings <ref type="bibr" target="#b7">[7]</ref>, which is one of the top-ranked text embedding models in the MTEB Leaderboard <ref type="bibr" target="#b8">[8]</ref> as of June 2024. We created indexes for each dataset to enable efficient retrieval of results based on a predefined list of queries. These queries were either derived from real traffic data or carefully crafted by human experts to ensure they represented a wide range of search scenarios. Our aim was to include queries and results that included hits and misses generated by both lexical and semantic retrievers.</p></div>
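A hybrid lexical-plus-dense retriever of this kind is typically built by fusing the two retrievers' score lists. The Python sketch below shows one common fusion recipe (min-max normalization with equal weights); the normalization, the 0.5/0.5 weights, and the precomputed scores are illustrative assumptions, not the exact configuration of our baseline system:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so lexical and dense scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_rank(bm25: dict[str, float], dense: dict[str, float], w: float = 0.5) -> list[str]:
    """Rank documents by a weighted sum of normalized BM25 and embedding scores."""
    b, d = minmax(bm25), minmax(dense)
    fused = {doc: w * b.get(doc, 0.0) + (1 - w) * d.get(doc, 0.0)
             for doc in set(b) | set(d)}
    return sorted(fused, key=fused.get, reverse=True)

bm25 = {"docA": 12.3, "docB": 4.1, "docC": 0.5}    # lexical (BM25-style) scores
dense = {"docA": 0.61, "docB": 0.78, "docC": 0.40}  # embedding cosine similarities
print(hybrid_rank(bm25, dense))  # ['docA', 'docB', 'docC']
```

Fusion like this is what lets the result pool contain both lexical and semantic hits and misses, as described above.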
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Relevance Judgment Strategy</head><p>Each dataset was structured as a collection of query-result pairs. Two expert human annotators assessed the relevance of each pair on a 0-2 rating scale:</p><p>• 2: Highly relevant, a perfect match for the query • 1: Somewhat relevant, a result that partially matches the query's intent • 0: Not relevant, a poor result that should not be shown</p><p>The human annotators were provided with detailed guidelines to ensure consistency in their relevance judgments. Table <ref type="table" target="#tab_1">2</ref> provides examples of the different relevance judgment categories for the query "v-neck white tee", accompanied by the corresponding search results; the descriptions of the search results have been shortened for brevity. In the first row, the result is highly relevant, as both the image and the text describe a white v-neck t-shirt. Therefore, the human relevance judgment for this pair is a 2. The second row shows a partial match: while the text mentions a white t-shirt, the image depicts a white v-neck t-shirt with black stripes, resulting in a relevance judgment of 1. The third row illustrates an irrelevant result (0), where the product shown is a strap top, unrelated to the query.</p><p>In the Appendix, we describe in detail our internal H&amp;M grading guidelines that human annotators follow to assign relevance grades to query-result pairs. Our grading guidelines for the Hotel Supplies and Design datasets are based on similar principles, but have also been adapted to suit the specific characteristics of those datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Inter-Annotator Agreement</head><p>To assess the reliability of the relevance judgments, whether human or LLM-generated, we followed common practice and calculated Cohen's kappa coefficient. Cohen's kappa is a robust statistical measure commonly employed to quantify inter-annotator agreement for categorical data. Its values range from -1 to 1, where 1 indicates strong agreement, while values closer to 0 suggest agreement no better than chance. To interpret the kappa values, we use the guidelines reported in Table <ref type="table">3</ref>.</p><p>In our evaluation, we compute Cohen's kappa to measure the agreement between human annotators and LLM-generated annotations, as well as between pairs of human annotators. The degree of agreement between human annotators also provides insight into the difficulty of evaluating certain datasets.</p></div>
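Cohen's kappa as used here can be computed directly from two annotators' grade lists. The self-contained Python sketch below implements the standard formula (equivalent to what libraries such as scikit-learn provide); the example grade lists are illustrative, not data from the paper:

```python
from collections import Counter

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two annotators over the same items (0-2 scale here)."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)  # expected (chance) agreement
    return (p_o - p_e) / (1 - p_e)

human = [2, 1, 0, 2, 0, 1, 2, 0]
model = [2, 1, 0, 1, 0, 1, 2, 2]
print(round(cohens_kappa(human, model), 3))  # 0.628
```

Chance correction is what distinguishes kappa from raw agreement: two graders who both assign mostly "2" would agree often by luck alone, and kappa discounts exactly that.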
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Guidelines for interpreting Cohen's kappa values.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Models</head><p>Our evaluation included a range of LLMs and MLLMs to reflect varying levels of performance and cost.</p><p>We considered both large-scale proprietary models and more cost-efficient alternatives:</p><p>• OpenAI Models<ref type="foot" target="#foot_1">2</ref> : GPT-4V (gpt-4-vision-preview), GPT-4o (gpt-4o-2024-05-13), GPT-4o-mini (gpt-4o-mini-2024-07-18) • Anthropic Models<ref type="foot" target="#foot_2">3</ref> : Claude 3.5 Sonnet, Claude 3 Haiku</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">Prompts</head><p>To design the prompts for the models under consideration, we created a template aimed at guiding the models to generate accurate relevance judgments.</p><p>In the multimodal setup, where an image is provided, the prompt will reference and include the image. Additionally, we require the model to provide an explanation for its relevance judgment. This element could be useful for interpreting the model's decisions.</p><p>Below, we present the prompt used for the Claude family in the text-only scenario. For the complete set of prompts, please refer to the Appendix. In the template, {{document}} and {{query}} are placeholders for the search result and query, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Haiku's Prompt Template (Text-only Setup)</head><p>You are an assistant responsible for rating how the retrieved result is relevant to the query. Output a token: "2", "1", or "0" followed by a full explanation. Guidelines: "2" -The result matches exactly with what the user's query is looking for. "1" -The result is not exactly with what the user's query is looking for. But it's pretty similar. As our aim is to be strict on exact matches, this grade is less likely to be used. "0" -The result is not related to the query at all. Result: {{document}} Query: {{query}} Output:</p><p>It is important to note that these prompt templates are not the result of an extensive exploration of all possible templates. In Section 4.2, we provide an analysis of the prompt engineering process that led to the best-performing prompts for Haiku.</p></div>
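A template like the one above can be filled mechanically, and the model's "grade followed by explanation" output parsed back into a structured judgment. The Python sketch below is our own illustration: the template text is abridged, the parsing regex is an assumption about the output shape, and no real API client is shown:

```python
import re

# Abridged stand-in for the full Haiku prompt template shown above.
TEMPLATE = (
    "You are an assistant responsible for rating how the retrieved result "
    'is relevant to the query. Output a token: "2", "1", or "0" '
    "followed by a full explanation.\n"
    "Result: {document}\nQuery: {query}\nOutput:"
)

def build_prompt(document: str, query: str) -> str:
    """Substitute the search result and query into the template."""
    return TEMPLATE.format(document=document, query=query)

def parse_judgment(output: str) -> tuple[int, str]:
    """Extract the leading 0/1/2 grade and the trailing explanation."""
    m = re.match(r'\s*"?([012])"?\s*[-:.]?\s*(.*)', output, flags=re.S)
    if not m:
        raise ValueError(f"unparseable judgment: {output!r}")
    return int(m.group(1)), m.group(2).strip()

grade, why = parse_judgment('"2" - Both image and text describe a white v-neck tee.')
print(grade)  # 2
```

Keeping the explanation alongside the grade is what later enables the error analysis and the prompt-iteration loop described in Section 4.2.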
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Cohen's kappa coefficients between one of the human annotators and the considered Multimodal (MM) models and their text-only (Text) counterparts. The last column shows the inter-annotator agreement between the two human annotators.</p><p>The results presented in Table <ref type="table">4</ref> offer several insights into the performance of the considered Large Language Models across different domains and modalities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Use-case Dependency of LLM Performance</head><p>The analysis reveals that LLM performance is use-case dependent. The models show varying levels of correlation with human relevance judgments across the different domains. For example, GPT-4V shows higher performance in the Hotel Supplies use case but performs relatively worse in the other areas. We observe a similar trend across the other models. This variability in model performance is also connected to the inherent difficulty of the tasks, as reflected in the varying levels of agreement among the human annotators for the different use cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>One Model to Rule Them All?</head><p>The multimodal version of GPT-4o performs better than the other models in two out of three cases, achieving the highest average Cohen's kappa coefficient (0.548). It stands out in the Hotel Supplies and Fashion domains, where it shows substantial agreement with human annotations. However, it is outperformed by Sonnet in the Design domain, suggesting that no single model outperforms all the others across every use case.</p><p>Tailor Model Prompt to a Specific Dataset We observe that tailoring a model's prompt to a specific domain can greatly improve its grading performance on the corresponding dataset. A notable example is the text-only Haiku LLM. Despite being among the least powerful models in our experiments, it achieved the highest Cohen's kappa coefficient on the Hotel Supplies dataset (0.640 compared to 0.560) once we refined the prompt for that dataset. Nevertheless, the Hotel Supplies-adjusted prompt overfits that domain: when the same prompt is used for the Design dataset, Cohen's kappa decreases to 0.333 (instead of 0.431).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Necessity of Multimodal Support</head><p>In the table, we compare each Multimodal (MM) model with its text-only (Text) counterpart. It is worth noting that the benefits of multimodal support are not uniform across models and use cases. For models like GPT-4o, the vision component significantly enhances performance, increasing the correlation from 0.506 (Text) to 0.548 (MM). This yields the highest average performance, remarkably close to the human inter-annotator agreement (0.589). GPT-4V and Sonnet also benefit from the visual component. However, for smaller models such as Haiku, the vision component appears to have a detrimental effect, decreasing the correlation from 0.433 (Text) to 0.368 (MM). To further investigate the impact of the visual component in Haiku, we performed an ablation study by excluding the textual component and relying solely on the image. Under this configuration, the highest correlation achieved was 0.1 for the Design case, significantly lower than the text-only correlation of 0.309. This indicates that for smaller models like Haiku, the visual component may not be sufficiently robust to provide effective multimodal support.</p><p>Error Analysis To investigate the poor performance of the smaller multimodal models, we conducted an error analysis on a sample of relevance judgments from Haiku that disagreed with both human annotators. We examined 31 instances of disagreement and identified three distinct error categories. Notably, Haiku generates an explanation for its relevance judgments, which allows us to categorize the errors effectively.</p><p>The most frequent issue (17 cases) involved the model's failure to correctly identify the product type. For example, when given the query "pure cotton dressing gown", the model misclassifies a linen dressing gown, justifying its choice with the explanation: "The product is [...] made of linen, which is a natural fiber similar to cotton." In half of the remaining cases, Haiku's errors originate from wrong assumptions. For instance, the model confused the brand name for bras with the word "band" as in "hairband", leading to incorrect judgments. Lastly, 7 cases were related to the model's vision capabilities, where it failed to recognize specific patterns or prints on products, resulting in inaccurate relevance assessments. Table <ref type="table">5</ref> provides an example of this type of error, including the image, query, and explanation generated by Haiku.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Query: h&amp;m kids unicorn printed t-shirt</head><p>The provided result does not match the user's query for a "h&amp;m kids unicorn printed t-shirt". The result is for a "Mia l/s top" which is a long-sleeved top in soft, printed cotton jersey, but it does not appear to have a unicorn print. The image also does not show a unicorn print. Therefore, the result is not relevant to the user's query, and I would rate it as a "0".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Example of an error in Haiku's relevance judgment due to its inability to understand the provided image.</p><p>Pricing (GPT-4V / GPT-4o / GPT-4o-mini / Sonnet / Haiku):<lb/>$/1M input tokens: 10.00 / 5.00 / 0.15 / 3.00 / 0.25<lb/>$/1M output tokens: 30.00 / 15.00 / 0.60 / 15.00 / 1.25<lb/>$/1M images (low resolution): 425.00 / 425.00 / 425.00 / 1048.58 / 87.38</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>Cost comparison across different models for input tokens, output tokens, and image processing. The costs are reported per million units, with image processing based on low-resolution images. Prices reflect the providers' rates as of August 16, 2024.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cost-Accuracy Trade-off</head><p>Considering the previous results comparing multimodal versus text-only performance, we can weigh important cost-accuracy trade-offs when choosing a model for relevance judgment. The costs reported in Table <ref type="table">6</ref> reflect the providers' pricing as of August 16, 2024. For image processing, calculations are based on handling 1M low-resolution images. Specifically, OpenAI's GPT-4V and GPT-4o allow users to submit a low-resolution, 512x512-pixel version of the image, which is represented with a flat budget of 85 tokens. This results in a cost of $0.000425 per image. <ref type="foot" target="#foot_3">4</ref>For fairness to the Claude models, we thus also report their prices for images resized to 512x512 pixels. In this setup, the cost for image processing is fixed per search result, while the input and output token costs are variable, depending on the length of the search result being evaluated. This means that, when evaluating 1M images, we have a fixed cost of $425 for the GPT family, approximately $1048 for Sonnet, and $87 for Haiku. To these fixed costs, we must add variable expenses that depend on the prompt, the search result, and the output lengths. As Table <ref type="table" target="#tab_0">1</ref> shows, our datasets contain search results with varying word counts. Additionally, different models use different prompts and tokenizers, leading to differences in token counts.</p><p>Despite its strong performance on both text and multimodal tasks, GPT-4V is the most expensive model, with high costs for both tokens and image processing. With an average of 867 input tokens per result, the cost for processing 1M multimodal search results with GPT-4V is $425 (image cost) + 867 ⋅ $10.00 (input token cost) = $9,095.</p><p>In contrast, GPT-4o offers higher performance at a lower cost. 
With an average of 889 input tokens per result, the cost for processing 1M multimodal results is approximately $425 (image cost) + 889 ⋅ $5 (input token cost) = $4,865. This significantly lower cost (roughly half that of GPT-4V) makes it the current best choice when high precision is required.</p><p>For Sonnet, with an average of 784 input tokens, the cost for processing 1M multimodal results would be $1048.58 (image cost) + 784 ⋅ $3.00 (input token cost) = $3400.58. Given Sonnet's performance as the third-best model in terms of correlation with human evaluations, it represents a suitable choice for scenarios where a moderate budget is available but maintaining high-quality results is still important.</p><p>For smaller models like GPT-4o-mini and Haiku, the cost savings become significant, though at the expense of performance. In a text-only setting, GPT-4o-mini offers the lowest cost per result, making it an attractive option for large-scale applications where lower accuracy can be tolerated. While Haiku's cost per result is slightly higher than GPT-4o-mini's, its performance is also superior. However, our experiments indicate that the visual component did not significantly enhance the performance of these smaller models for relevance judgment; in fact, it can even be detrimental. Therefore, the multimodal capabilities of GPT-4o-mini and Haiku should be employed with caution, especially considering the high costs associated with image processing, particularly for GPT-4o-mini.</p></div>
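The fixed-plus-variable cost arithmetic above is easy to reproduce. The Python sketch below encodes the Table 6 rates for three of the models and recomputes the per-million-result totals; output-token costs are omitted here, as in the worked examples, so these are lower bounds on the true bill:

```python
# Rates from Table 6: ($ per 1M input tokens, $ per 1M low-resolution images).
RATES = {
    "gpt-4v": (10.00, 425.00),
    "gpt-4o": (5.00, 425.00),
    "sonnet": (3.00, 1048.58),
}

def cost_per_million_results(model: str, avg_input_tokens: float) -> float:
    """Fixed image cost plus variable input-token cost for judging 1M results.

    avg_input_tokens per result over 1M results equals avg_input_tokens
    million tokens total, so it multiplies the $/1M-token rate directly.
    """
    per_1m_tokens, per_1m_images = RATES[model]
    return per_1m_images + avg_input_tokens * per_1m_tokens

print(cost_per_million_results("gpt-4v", 867))  # 9095.0
print(cost_per_million_results("sonnet", 784))
```

The same function applied to GPT-4o with its 889-token average reproduces the roughly $4.9k figure quoted above.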
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Prompt Engineering</head><p>We made the following observations:</p><p>Strictness Guidelines Many of the initial disagreements with humans stemmed from the models being too lenient with the 1 (OK) grade. Results improved after we appended instructions to prefer grades 2 (GREAT) and 0 (BAD), e.g. "As our aim is to be strict on exact matches, this grade is less likely to be used."</p><p>Smaller Models Are More Sensitive to Prompt Complexity We found that smaller models, such as Haiku, are highly sensitive to prompt complexity, whereas larger models like GPT-4V handle complex prompts more effectively. For example, when using a prompt with comprehensive and somewhat lengthy grading instructions, we observed a significantly higher Cohen's kappa coefficient for the Hotel Supplies dataset with GPT-4V (0.54) than with Haiku (0.32).</p><p>Note that this does not imply that Haiku is unsuitable for the grading task; rather, it suggests that the model performs better when prompts are simpler. We verified this hypothesis by making prompts progressively more concise while retaining the essential instructions. Ultimately, we refined a prompt for the Hotel Supplies dataset that helps the text-only Haiku model achieve the highest Cohen's kappa coefficient of 0.64.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Different Models May Work with Different Prompts</head><p>Although Haiku achieves the highest Cohen's kappa coefficient of 0.64 on the Hotel Supplies domain with the refined prompt we developed, we did not observe the same improvement with GPT-4V. When using the same prompt, GPT-4V maintained a similar Cohen's kappa coefficient of approximately 0.54. This indicates that prompt engineering can be highly model-specific, and a prompt that works exceptionally well for one model may not perform as well for others. In fact, we could not find a systematic way to reliably optimize model accuracy across the board. As a result, prompt engineering feels more like art than science, motivating further work on systematic ways to discover the upper limits of accuracy for each model size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Asking for explanations</head><p>In our experiments, we observed that asking LLMs to provide explanations for grading outputs is beneficial for several important reasons:</p><p>• Relevance is subjective, and asking an LLM to explain its grading output can be helpful in verifying the correctness of the assessment. In our experiments, it was not uncommon for humans to initially disagree with the grading outputs from LLMs; however, they often reached a consensus after reviewing the detailed explanations provided. • Having an explanation also helps us understand how to iterate via prompt engineering to make the instructions less ambiguous for the model.</p><p>Asking LLMs to provide explanations may also help the model grade more accurately. However, prompts need to be carefully crafted: we observed that performance may regress when the request for explanations is phrased poorly.</p><p>In the end, we were able to meaningfully improve the accuracy of Haiku through prompt engineering that requests explanations (from 0.36 to 0.40). Given that this accuracy is not far behind the larger models, and Haiku is 20-40 times cheaper than the GPT family, it is a very appealing option for application at a large scale. For instance, smaller models can be used to generate larger label sets to explore recall issues, while more expensive models focus on smaller sets to evaluate precision.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this paper, we have presented a new analysis of MLLMs-as-a-Judge to assess the cost-accuracy trade-offs of the relevance judgment capabilities of MLLMs across three multimodal search use cases: Hotel Supplies, Design, and Fashion. Various LLMs have shown potential, but no single LLM showed an optimal cost-accuracy trade-off across all use cases evaluated.</p><p>We have found that, for a practitioner looking to choose the best LLM judge for their use case, a comprehensive evaluation of all available models is time-intensive, financially demanding, and energy-hungry, which can have a significant environmental impact. This motivates future work in the following directions: 1) improving the abilities of general MLLMs across use cases, 2) improving the cost and computational efficiency of large MLLMs, and 3) creating small MLLMs that are optimized for judging relevance in cost-optimal ways for more specialized applications.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Summary statistics of the used datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc></figDesc><table><row><cell>Image</cell><cell>Search Result</cell><cell>Relevance Judgment</cell></row><row><cell/><cell>prod_name: Premium ELKE vneck tee, index_name: Ladieswear, detail_desc: V-neck T-shirt in airy slub lin[...], department_name: Jersey/Knitwear Premium, index_group_name: Ladieswear, colour_group_name: White, product_type_name: T-shirt, graphical_appearance_name: Solid, perceived_colour_value_name: Light, perceived_colour_master_name: White</cell><cell>2</cell></row><row><cell/><cell>prod_name: ED Lizzie tee, index_name: Ladieswear, detail_desc: Short-sleeved top in lightwei[...], department_name: Jersey, index_group_name: Ladieswear, colour_group_name: White, product_type_name: T-shirt, graphical_appearance_name: Stripe, perceived_colour_value_name: Light, perceived_colour_master_name: White</cell><cell>1</cell></row><row><cell/><cell>prod_name: V-neck Strap Top, index_name: Ladieswear, detail_desc: V-neck top in soft organic [...], department_name: Jersey Basic, index_group_name: Ladieswear, colour_group_name: White, product_type_name: Vest top, graphical_appearance_name: Solid, perceived_colour_value_name: Light, perceived_colour_master_name: White</cell><cell>0</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations. For access to our human annotations for this dataset, please reach out to the corresponding author.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://platform.openai.com/docs/models</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://docs.anthropic.com/en/docs/about-claude/models</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">For GPT-4o-mini, this limit is set at 2,833 tokens (instead of 85 tokens) per image and this leads to the same per-image cost.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Thanks to the entire Objective team for building many pieces of the puzzle that made this work possible. Special thanks to Lance Hasson, Brian Porter, George Gkotsis, Kuei-da Liao, and Faizaan Merchant. We would also like to thank Yev Rotar, John Gulley, and Brian Ip for their valuable relevance judgment inputs.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>GPT4's Prompt Template (Multimodal Setup) User Role: System</head><p>You are a helpful assistant designed to output JSON. You are RateGPT, an intelligent assistant that can score search results based on their relevance to a query and the user's intent behind the query. You should return JSON with two required fields 'reasoning' and 'score'. In the 'reasoning' field, you can explain your observations of relevance. When producing a score, use the following grading criteria:</p><p>- 0 (BAD) - Use this grade for a search result if it is not related to the user's query at all.</p><p>- 1 (OK) - This grade is for a search result that is not exactly what the user's query is looking for, but it's pretty similar. As our aim is to be strict on exact matches, this grade is less likely to be used.</p><p>- 2 (GOOD) - The product matches exactly with the user's intent and query. Use this score if the search result aligns perfectly with the user's query. ### Query Analysis Before you start grading, it's essential to understand the user's intent by breaking apart the query. Keep in mind, some queries may be more explicit than others. For instance, if the user is searching for a clothing product, then "Red checkered jacket" is more specific compared to "Red jacket". Another example: if the user is searching for a venue, then "Rock concert in San Francisco this weekend" is more specific compared to "Rock concert in San Francisco". Therefore, adapt your grading contextually. Consider all the information from all fields.</p><p>Note: All fields should be taken with equal importance. You should adhere strictly to these guidelines while grading and ensure a holistic evaluation of the search results based on all considered fields.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>User Role: User</head></div>
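The template above instructs the model to return JSON with required 'reasoning' and 'score' fields on a 0/1/2 scale. The following is a minimal sketch of validating such a response; `parse_judgment` and the sample string are illustrative assumptions, not part of our system.

```python
import json

# Hypothetical validator for a judge response following the template
# above: JSON with required 'reasoning' and 'score' fields, where the
# score must be one of 0 (BAD), 1 (OK), or 2 (GOOD).

def parse_judgment(raw):
    """Parse a raw JSON judge response and validate the score range.
    Raises KeyError if a required field is missing, ValueError if the
    score is outside {0, 1, 2}."""
    data = json.loads(raw)
    reasoning, score = data["reasoning"], int(data["score"])
    if score not in (0, 1, 2):
        raise ValueError("score out of range: %r" % score)
    return reasoning, score

# Illustrative response string, not actual model output.
reasoning, score = parse_judgment(
    '{"reasoning": "Exact match on category and colour.", "score": 2}')
```

Asking for the 'reasoning' field first, as the template does, also gives human reviewers the explanation they need to audit disagreements (Section "Asking for explanations").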
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. H&amp;M grading guidelines</head><p>In this section, we introduce the internal H&amp;M grading guidelines that our human annotators follow to assign relevance grades to each pair of &lt;query, search result&gt;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. Grading rules</head><p>• Is the query asking for a specific category of product, either narrow or broad?</p><p>-If the answer is yes, then the user is asking for results that are restricted to that category.</p><p>Only results matching that product category will be marked as GREAT. If the results don't match the category, mark them as BAD.</p><p>-It is the grader's job to identify the category as introduced within the query and classify products accordingly. * For instance, examples of queries-&gt;categories can be "office clothes"-&gt;"clothes" or "music t-shirt"-&gt;"t-shirt". Note that t-shirt can be categorized as a sub-category of clothes and clothes is a super-category of t-shirts. As such, "office clothes" includes more products since it implies a broader categorization, whereas "music t-shirt" restricts products to "t-shirts".</p><p>• Is the query mentioning a feature, for instance: color, size, utility (e.g. windproof, maternity)?</p><p>-If the query mentions a feature, only results matching that feature will be marked as GREAT.</p><p>-If the results don't match the feature, mark them as BAD.</p><p>-If the results match the feature closely enough but not exactly, mark them as OK. For example, if the query is "yellow jacket" and a search result is a light orange jacket, or some jacket that contains some clear patches of yellow but is otherwise not yellow, then this result is OK.</p><p>• If a query mentions both a category and a feature, only results matching both the category and the feature will be marked as GREAT. Results matching the feature but not the category (or vice-versa) are BAD. </p></div>
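The B.1 rules can be read as a decision procedure: check the category constraint first, then the feature constraint. Below is a minimal sketch under that reading; `grade` and its arguments are hypothetical simplifications of the grader's judgment, not an actual annotation tool.

```python
# Hypothetical encoding of the B.1 grading rules. GREAT/OK/BAD mirror
# the guideline labels. `feature_match` stands in for the grader's
# subjective call on the feature: "exact", "close", or "none".

def grade(result, query_category=None, query_feature=None,
          feature_match="none"):
    """result: dict with a 'category' key identified by the grader."""
    # Category rule: a result outside the queried category is always BAD,
    # even if the feature matches (last bullet of B.1).
    if query_category and result.get("category") != query_category:
        return "BAD"
    # Feature rule: exact -> GREAT, close-but-not-exact -> OK, else BAD.
    if query_feature:
        if feature_match == "exact":
            return "GREAT"
        if feature_match == "close":
            return "OK"
        return "BAD"
    # Category-only query whose category matches.
    return "GREAT"

# Query "yellow jacket", result: a light orange jacket (right category,
# close-but-not-exact colour).
label = grade({"category": "jacket"}, query_category="jacket",
              query_feature="yellow", feature_match="close")
```

The ordering matters: because the category check comes first, a yellow vest for the query "yellow jacket" is BAD regardless of the colour match, exactly as the last bullet requires.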
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Grading examples</head></div>			</div>
			<div type="references">

				<listBibl>


<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Report on the 1st workshop on large language models for evaluation in information retrieval (LLM4Eval 2024) at SIGIR 2024</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">A</forename><surname>Rahmani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Siro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Aliannejadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yilmaz</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2408.05388" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sun</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2402.04788" />
		<title level="m">MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Prometheus: Inducing fine-grained evaluation capability in language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Longpre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seo</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2310.08491" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Judgelm: Fine-tuned large language models are scalable judges</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2310.17631" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Large language models can accurately predict searcher preferences</title>
		<author>
			<persName><forename type="first">P</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Spielman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2309.10621" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Toward automatic relevance judgment using vision-language models for imagetext retrieval evaluation</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2408.01363" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.03216</idno>
		<title level="m">BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Magne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2210.07316</idno>
		<idno type="arXiv">arXiv:2210.07316</idno>
		<ptr target="https://arxiv.org/abs/2210.07316" />
		<title level="m">MTEB: Massive text embedding benchmark</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>Prompt Templates. In this section, we report the prompts used for the considered models. In the templates, {{document}}, {{query}}, and {{image}} are placeholders for the search result, query, and image respectively. For OpenAI&apos;s models, the image corresponds to the image URL, while for Anthropic&apos;s models, it corresponds to a base64-encoded image. Haiku and Sonnet&apos;s Prompt Template (Multimodal Setup): You are an assistant responsible for rating how relevant the retrieved result is to the query. If an image is available, use it to determine the relevance to the query. Output a token: &quot;2&quot;, &quot;1&quot;, or &quot;0&quot; followed by a full explanation. Guidelines: &quot;2&quot; - The result matches exactly what the user&apos;s query is looking for. &quot;1&quot; - The result is not exactly what the user&apos;s query is looking for, but it&apos;s pretty similar. As our aim is to be strict on exact matches, this grade is less likely to be used. &quot;0&quot; - The result is not related to the query at all. Result: {{document}} Query: {{query}} {{image}} Token</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
