<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings (MMSR’24)</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvia Terragni</string-name>
          <email>silvia@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang Cuong</string-name>
          <email>hoang@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Daiber</string-name>
          <email>jo@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pallavi Gudipati</string-name>
          <email>pallavi@objective.inc</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo N. Mendes</string-name>
          <email>pablo@objective.inc</email>
        </contrib>
        <aff>Objective, Inc., San Francisco</aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) have demonstrated potential as effective search relevance evaluators. However, there is a lack of comprehensive guidance on which models consistently perform optimally across various contexts or within specific use cases. In this paper, we assess several LLMs and Multimodal Language Models (MLLMs) in terms of their alignment with human judgments across multiple multimodal search scenarios. Our analysis investigates the trade-offs between cost and accuracy, highlighting that model performance varies significantly depending on the context. Interestingly, in smaller models, the inclusion of a visual component may hinder performance rather than enhance it. These findings highlight the complexities involved in selecting the most appropriate model for practical applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Search</kwd>
        <kwd>Relevance Judgments</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Multimodal Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has shown that Large Language Models (LLMs) and Multimodal Language Models
(MLLMs) are viable alternatives for producing relevance judgments. LLMs-as-judges are enticing since
they can unlock higher relevance judgment throughput at a fraction of the cost. As a result, they
offer the potential of widespread relevance improvement in search systems through more accessible and
extensive evaluations, as well as training data generation. However, progress is hampered by a number
of under-explored questions about how to best employ LLMs-as-judges.
      </p>
      <p>In this paper, we evaluate a number of LLMs and MLLMs in terms of their alignment with human
judgments and ask the following research questions:
1. Is LLM performance use-case dependent? In other words, would the same LLM perform well in
one use case but not in another?
2. Is there a clear winner? In other words, is there a model that consistently outperforms all the
others across all use cases?
3. Is multimodal support necessary for search relevance judgment in multimodal search?
4. What models offer the optimal cost-accuracy trade-offs?</p>
      <p>In the next section we summarize related work. We then present our experimental setting and discuss
results. Finally, we present concluding remarks and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Large Language Models (LLMs) have shown exceptional abilities in a wide variety of tasks, and using
them for evaluating Information Retrieval systems is receiving considerable attention [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Recent studies
have explored different methods for generating relevance judgments. For example, Prometheus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a
13-billion-parameter LLM designed to evaluate long texts using customized scoring rubrics provided
by users. JudgeLM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses fine-tuned LLMs as scalable judges to evaluate other LLMs effectively in
open-ended tasks. They find that JudgeLM has a high agreement with expert judges, over 90%, and
works well in evaluating single answers, multimodal models, multiple answers, and multi-turn dialogues.
Thomas et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] developed an LLM prompt based on feedback from search engine users. They show
accuracy similar to that of human judges, along with the ability to identify difficult queries, best results, and effective groupings.
They also find that both changes to prompts and simple paraphrases can improve accuracy.
      </p>
      <p>
        In the context of Multimodal LLMs (MLLMs), Chen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] assess these models as judges through a
new benchmark. They examine their performance in tasks such as Scoring Evaluation, Pair Comparison,
and Batch Ranking. The study points out that MLLMs need more improvements and research before
they can be fully trusted, as they can have biases like ego-centric bias, position bias, length bias, and
hallucinations. Additionally, Yang et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] investigate the relevance estimation of Vision-Language
Models (VLMs), including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc zero-shot retrieval
task aimed at multimedia content creation.
      </p>
      <p>To the best of our knowledge, we are the first to compare the cost-accuracy trade-offs of several
generally available LLMs of different sizes.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This study employs a relevance evaluation process to assess the performance of LLMs and MLLMs
(collectively referred to as “models”) for search relevance judgments. We assess these models based on
two critical dimensions: accuracy and costs. Our evaluation pipeline consists of three stages:
• Data Collection: We obtained search results from three datasets across different domains using a
list of predefined queries.
• Human Annotation: Two trained human annotators assigned relevance grades to each (query,
result) pair following established relevance criteria.
• Model Evaluation: We applied a range of LLMs and MLLMs to generate relevance judgments for
the same sets of search results, comparing their performance against human annotations.
Each stage is discussed in detail in the following subsections, covering the datasets, retrieval system,
grading strategy, and the models used.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>We conducted our experiments on three datasets: Fashion, Hotel Supplies, and Design. The Fashion
dataset is a subset of the publicly available dataset H&amp;M Personalized Fashion Recommendations1. The
Hotel Supplies and Design datasets are proprietary and represent domains in the e-commerce search
for hotel supply products, and social media search for design assets, respectively. Each dataset includes
multiple textual fields per product, along with one or more associated images. Table 3.1 summarizes
the characteristics of each dataset, detailing the average number of fields per search result, the average
number of empty fields, and the average word count per result. These factors can impact the difficulty
of generating relevance judgments.</p>
        <table-wrap id="tab-3-1">
          <label>Table 3.1</label>
          <caption>
            <p>Characteristics of each dataset (Fashion, Hotel Supplies, and Design): total number of search results, average number of textual fields, average number of empty textual fields, and average number of words per result.</p>
          </caption>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval System and Evaluation</title>
        <p>
          To obtain relevant search results, we utilized a baseline retrieval system that combines BM25 with BGE
M3 embeddings [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which is one of the top-ranked text embedding models in the MTEB Leaderboard
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] as of June 2024. We created indexes for each dataset to enable eficient retrieval of results based on
a predefined list of queries. These queries were either derived from real trafic data or carefully crafted
by human experts to ensure they represented a wide range of search scenarios. Our aim was to include
queries and results that included hits and misses generated by both lexical and semantic retrievers.
        </p>
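        <p>As an illustration of the hybrid retrieval described above, the sketch below fuses per-document lexical (BM25) and semantic (embedding) scores. The fusion rule shown here, min-max normalization followed by an equally weighted sum, is a simplifying assumption for illustration and not necessarily the exact combination used in our system.</p>
        <preformat>
```python
# Sketch of hybrid lexical+semantic score fusion. The min-max normalization
# and the 0.5/0.5 weights are illustrative assumptions, not the paper's
# actual fusion rule.

def minmax_normalize(scores):
    """Scale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(bm25_scores, embedding_scores, w_lexical=0.5, w_semantic=0.5):
    """Rank documents by a weighted sum of normalized BM25 and embedding scores."""
    bm25 = minmax_normalize(bm25_scores)
    emb = minmax_normalize(embedding_scores)
    docs = set(bm25) | set(emb)
    fused = {d: w_lexical * bm25.get(d, 0.0) + w_semantic * emb.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Toy example: "a" wins on lexical overlap, "b" on semantic similarity.
bm25_scores = {"a": 12.3, "b": 4.1, "c": 0.5}
embedding_scores = {"a": 0.41, "b": 0.93, "c": 0.40}
print(hybrid_rank(bm25_scores, embedding_scores))  # -> ['b', 'a', 'c']
```
        </preformat>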
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Relevance Judgment Strategy</title>
        <p>Each dataset was structured as a collection of query-result pairs. Two expert human annotators assessed
the relevance of each pair on a 0-2 rating scale:
• 2: Highly relevant, a perfect match for the query
• 1: Somewhat relevant, a result that partially matches the query’s intent
• 0: Not relevant, a poor result that should not be shown
The human annotators were provided with detailed guidelines to ensure consistency in their relevance
judgments. Table 3.3 provides examples of different relevance judgment categories for the query “v-neck
white tee”. In the first row, the result is highly relevant, as both the image and the text describe a white
v-neck t-shirt. Therefore, the human relevance judgment for this pair is a 2. The second row shows a
partial match: while the text mentions a white t-shirt, the image depicts a white v-neck t-shirt with
black stripes, resulting in a relevance judgment of 1. The third row illustrates an irrelevant result (0),
where the product shown is a strap top, unrelated to the query.</p>
        <p>In the Appendix, we describe in detail our internal H&amp;M grading guidelines that human annotators
follow to assign relevance grades to query-result pairs. We note that our grading guidelines are based
on principles similar to those used for the Hotel Supplies and Design datasets. However, the guidelines
have also been adapted to suit the specific characteristics of each dataset.
1. https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations. For access to our human
annotations for this dataset, please reach out to the corresponding author.</p>
        <table-wrap id="tab-3-3">
          <label>Table 3.3</label>
          <caption>
            <p>Examples of relevance judgment categories for the query “v-neck white tee” (product images omitted).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Search result (textual fields)</th><th>Human judgment</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>prod_name: Premium ELKE vneck tee; index_name: Ladieswear; detail_desc: V-neck T-shirt in airy slub lin[...]; department_name: Jersey/Knitwear Premium; index_group_name: Ladieswear; colour_group_name: White; product_type_name: T-shirt; graphical_appearance_name: Solid; perceived_colour_value_name: Light; perceived_colour_master_name: White</td>
                <td>2</td>
              </tr>
              <tr>
                <td>prod_name: ED Lizzie tee; index_name: Ladieswear; detail_desc: Short-sleeved top in lightwei[...]; department_name: Jersey; index_group_name: Ladieswear; colour_group_name: White; product_type_name: T-shirt; graphical_appearance_name: Stripe; perceived_colour_value_name: Light; perceived_colour_master_name: White</td>
                <td>1</td>
              </tr>
              <tr>
                <td>prod_name: V-neck Strap Top; index_name: Ladieswear; detail_desc: V-neck top in soft organic [...]; department_name: Jersey Basic; index_group_name: Ladieswear; colour_group_name: White; product_type_name: Vest top; graphical_appearance_name: Solid; perceived_colour_value_name: Light; perceived_colour_master_name: White</td>
                <td>0</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Inter-Annotator Agreement</title>
        <p>To assess the reliability of the relevance judgments, whether human- or LLM-generated, we followed
common practice and calculated Cohen’s kappa coefficient. Cohen’s kappa is a robust statistical measure
commonly employed to quantify inter-annotator agreement for categorical data. Cohen’s kappa values
range from -1 to 1, where 1 indicates strong agreement, while values closer to 0 suggest agreement no
better than chance. To interpret the kappa values, we use the guidelines reported in Table 3.</p>
        <p>In our evaluation, we compute Cohen’s kappa to measure the agreement between human annotators
and LLM-generated annotations, as well as between pairs of human annotators. The degree of agreement
between human annotators also provides insight into the difficulty of evaluating certain datasets.</p>
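        <p>The agreement computation described above can be sketched as follows. The inputs are two equal-length lists of 0-2 relevance grades from any pair of annotators (human or model); this is a minimal reference implementation of Cohen’s kappa, not the exact code used in our pipeline.</p>
        <preformat>
```python
# Minimal sketch of Cohen's kappa between two annotators over 0/1/2
# relevance grades (human-human or human-LLM).
from collections import Counter

def cohens_kappa(grades_a, grades_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    assert len(grades_a) == len(grades_b)
    n = len(grades_a)
    # Observed agreement: fraction of items where the two annotators agree.
    p_o = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    count_a, count_b = Counter(grades_a), Counter(grades_b)
    labels = set(grades_a) | set(grades_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / n ** 2
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators always emit one label
    return (p_o - p_e) / (1 - p_e)

# Two annotators grading six (query, result) pairs on the 0-2 scale.
human = [2, 1, 0, 2, 0, 1]
model = [2, 1, 0, 2, 1, 1]
print(round(cohens_kappa(human, model), 3))  # -> 0.75
```
        </preformat>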
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Interpretation guidelines for Cohen’s kappa values.</p>
          </caption>
        </table-wrap>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Models</title>
        <p>Our evaluation included a range of LLMs and MLLMs to reflect varying levels of performance and cost.
We considered both large-scale proprietary models and more cost-efficient alternatives:
• OpenAI Models2: GPT-4V (gpt-4-vision-preview), GPT-4o (gpt-4o-2024-05-13), GPT-4o-mini
(gpt-4o-mini-2024-07-18)
• Anthropic Models3: Claude 3.5 Sonnet, Claude 3 Haiku</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Prompts</title>
        <p>To design the prompts for the models under consideration, we created a template aimed at guiding the
models to generate accurate relevance judgments.</p>
        <p>In the multimodal setup, where an image is provided, the prompt will reference and include the
image. Additionally, we require the model to provide an explanation for its relevance judgment. This
element could be useful for interpreting the model’s decisions.</p>
        <p>Below, we present the prompt used for the Claude family in the text-only scenario. For the complete
set of prompts, please refer to the Appendix. In the template, {{document}} and {{query}} are placeholders
for the search result and query, respectively.</p>
        <p>Haiku’s Prompt Template (Text-only Setup)
You are an assistant responsible for rating how the retrieved result is relevant to
the query. Output a token: "2", "1", or "0" followed by a full explanation.
Guidelines:
"2" - The result matches exactly with what the user's query is looking for.
"1" - The result is not exactly with what the user's query is looking for. But it's
pretty similar. As our aim is to be strict on exact matches, this grade is less
likely to be used.
"0" - The result is not related to the query at all.</p>
        <p>Result: {{document}}
Query: {{query}}
Output:</p>
        <p>It is important to note that these prompt templates are not the result of an extensive exploration of
all possible templates. In Section 4.2, we provide an analysis of the prompt engineering process that led
to the best-performing prompts for Haiku.
2. https://platform.openai.com/docs/models
3. https://docs.anthropic.com/en/docs/about-claude/models</p>
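        <p>A minimal sketch of how such a template can be instantiated, and how the model’s output can be parsed, is shown below. The template text is abbreviated, and the parsing rules are illustrative assumptions: the prompt only requires a leading "2", "1", or "0" token followed by an explanation.</p>
        <preformat>
```python
# Sketch of template instantiation and output parsing for the relevance
# prompt above. The template is abbreviated, and the parse rules (leading
# grade token, optional separators) are our assumptions for illustration.

PROMPT_TEMPLATE = """You are an assistant responsible for rating how the retrieved result is relevant to the query. Output a token: "2", "1", or "0" followed by a full explanation.
Result: {{document}}
Query: {{query}}
Output:"""

def build_prompt(document, query):
    """Fill the {{document}} and {{query}} placeholders."""
    return PROMPT_TEMPLATE.replace("{{document}}", document).replace("{{query}}", query)

def parse_judgment(model_output):
    """Extract (grade, explanation); returns None for malformed outputs."""
    text = model_output.strip().strip('"')
    if not text or text[0] not in "012":
        return None
    grade = int(text[0])
    explanation = text[1:].lstrip(' -:')
    return grade, explanation

prompt = build_prompt("prod_name: V-neck Strap Top", "v-neck white tee")
print(parse_judgment('0 The result is a strap top, unrelated to the query.'))
```
        </preformat>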
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Multimodal vs Single-modality Evaluation</title>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Agreement (Cohen’s kappa) between model-generated and human relevance judgments across the three datasets, comparing the multimodal (MM) and text-only (Text) variants of each model.</p>
          </caption>
        </table-wrap>
        <p>Use-case Dependency of LLM Performance The analysis reveals that LLM performance is
dependent on the use case. The models show varying levels of correlation with the human relevance
judgments across the different domains. For example, GPT-4V shows higher performance in the Hotel
Supplies use case but performs relatively worse in the other areas. We can observe a similar trend across
the other models. This variability in model performance is also connected to the inherent difficulty of
the tasks, which is likewise reflected in the varying levels of agreement among the human annotators for the
different use cases.</p>
        <p>One Model to Rule Them All? The multimodal version of GPT-4o generally performs better than
the other models in two out of three cases, achieving the highest average Cohen’s kappa coefficient
(0.548). It stands out in the Hotel Supplies and Fashion domains, where it shows substantial
agreement with human annotations. However, it is outperformed by Sonnet in the Hotel Supplies domain,
suggesting that no single model outperforms all the others across every use case.</p>
        <p>Tailor Model Prompt to a Specific Dataset We observe that tailoring a model’s prompt to a specific
domain can greatly improve its grading performance on the corresponding dataset. A notable example
is the text-only Haiku LLM. Despite it being among the least powerful models in our experiments, we
achieved the highest Cohen’s kappa coefficient for that dataset (0.6403 compared to 0.560) by refining
the prompt for the Hotel Supplies dataset. Nevertheless, we also note that such a dataset-adjusted prompt
can overfit to its target domain and transfer poorly to other datasets. For example, when using
the same prompt for the Design dataset, Cohen’s kappa coefficient decreases to 0.333 (instead of 0.431).</p>
        <p>Necessity of Multimodal Support In the table, we compare each Multimodal (MM) model with
its text-only (Text) counterpart. It is worth noting that the benefits of multimodal support are not
uniform across all the models and use cases. For models like GPT-4o, the vision component significantly
enhances the performance, increasing the correlation from 0.506 (Text) to 0.548 (MM). This leads to the
highest average performance, remarkably close to the human-human correlation (i.e. 0.589).
GPT-4V and Sonnet also benefit from the visual component. However, for smaller models, such as
Haiku, the vision component appears to have a detrimental effect, decreasing the correlation from 0.433
(Text) to 0.368 (MM). To further investigate the impact of the visual component in Haiku, we performed
an ablation study by excluding the textual component and relying solely on the image. Under this
configuration, the highest correlation achieved was 0.1 for the Design case, which is significantly lower
than the text-only correlation of 0.309. This indicates that for smaller models like Haiku, the visual
component may not be sufficiently robust to provide effective multimodal support.</p>
        <p>Error Analysis To investigate the poor performance of the smaller multimodal models, we conducted
an error analysis on a sample of relevance judgments from Haiku that disagreed with both human
annotators. We examined 31 instances of disagreement and identified three distinct error categories.
Notably, Haiku generates an explanation for its relevance judgments, which allows us to categorize the
errors effectively.</p>
        <p>The most frequent issue (17 cases) involved the model’s failure to correctly identify the product
type. For example, when given the query “pure cotton dressing gown”, the model misclassifies a linen
dressing gown, justifying its choice with the explanation: “The product is [...] made of linen, which is
a natural fiber similar to cotton”. In half of the remaining cases, Haiku’s errors originate from wrong
assumptions. For instance, the model confused the brand name for bras with the word “band” as in
“hairband”, leading to incorrect judgments. Lastly, 7 of the cases were related to the model’s vision
capabilities, where it failed to recognize specific patterns or prints on products, resulting in inaccurate
relevance assessments. Table 5 provides an example of this type of error, including the image, query,
and explanation generated by Haiku.</p>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption>
            <p>Example of a vision-related error by Haiku (product image omitted).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Query</th><th>Explanation</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>h&amp;m kids unicorn printed t-shirt</td>
                <td>The provided result does not match the user’s query for a “h&amp;m kids unicorn printed t-shirt”. The result is for a “Mia l/s top” which is a longsleeved top in soft, printed cotton jersey, but it does not appear to have a unicorn print. The image also does not show a unicorn print. Therefore, the result is not relevant to the user’s query, and I would rate it as a “0”.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Cost-Accuracy Trade-off Considering the previous results comparing multimodal versus text-only
performance, we can make important cost-accuracy trade-off considerations when choosing a model
to adopt for relevance judgment. The costs reported in Table 6 reflect the providers’ pricing as of
August 16, 2024. For image processing, calculations are based on handling 1M low-resolution images.
Specifically, OpenAI’s GPT-4V and GPT-4o allow users to submit a low-resolution version of the
image (512x512 pixels), which is represented with a budget of 85 tokens. This results in a cost of $0.000425 per image.4
4. For GPT-4o-mini, this limit is set at 2,833 tokens (instead of 85 tokens) per image, which leads to the same per-image cost.
For fairness to the Claude models, we thus also report their prices for images resized to 512x512 pixels.</p>
        <p>In this setup, the costs for image processing are fixed per search result, while the input and output
tokens are variable, depending on the length of the search result being evaluated. This means that,
when evaluating 1M images, we have a fixed cost of $425 for the GPT family, approximately $1048
for Sonnet, and $87 for Haiku. To these fixed costs, we must add variable expenses, depending on the
prompt, the search result, and output lengths. As Table 3.1 shows, our datasets contain search results
with varying word counts. Additionally, different models use different prompts and different tokenizers,
leading to differences in the number of tokens.</p>
        <p>Along with its strong performance on both text and multimodal tasks, GPT-4V is the most expensive
model, with high costs for tokens and image processing. With an average of 867 input tokens per result,
the cost for processing 1M multimodal search results with GPT-4V is $425
(image cost) + 867 tokens ⋅ $10.00 per 1M tokens ⋅ 1M results (input token cost) = $9,095.</p>
        <p>In contrast, GPT-4o offers higher performance at a lower cost. With an average of 889 input tokens
per result, the cost for processing 1M multimodal results is approximately $425 (image cost) + 889 ⋅ $5
(input token cost) = $4,870. This significantly lower cost (roughly half the cost of using GPT-4V) makes it
the current best choice when high precision is required.</p>
        <p>For Sonnet, with an average of 784 input tokens, the cost for processing 1M multimodal results would
be $1048.58 (image cost) + 784 ⋅ $3.00 (input token cost) = $3400.58. Given Sonnet’s performance as
the third-best model in terms of correlation with human evaluations, it represents a suitable choice for
scenarios where a moderate budget is available but maintaining high-quality results is still important.</p>
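        <p>The back-of-envelope estimates above can be reproduced with a small helper. Prices are per 1M input tokens as of August 16, 2024, and output-token costs are omitted, as in the in-text calculations.</p>
        <preformat>
```python
# Cost model for grading 1M multimodal search results, following the
# arithmetic in the text: fixed per-image cost plus variable input-token
# cost. Output-token costs are omitted, as in the in-text estimates.

def judgment_cost(n_results, avg_input_tokens, price_per_1m_tokens, image_cost_per_result):
    """Total cost in USD for n_results (query, result) pairs."""
    token_cost = n_results * avg_input_tokens * price_per_1m_tokens / 1_000_000
    return n_results * image_cost_per_result + token_cost

N = 1_000_000
# Arguments: (avg input tokens, $ per 1M input tokens, $ per image), from the text.
print(round(judgment_cost(N, 867, 10.00, 0.000425), 2))    # GPT-4V -> 9095.0
print(round(judgment_cost(N, 889, 5.00, 0.000425), 2))     # GPT-4o -> 4870.0
print(round(judgment_cost(N, 784, 3.00, 0.00104858), 2))   # Sonnet -> 3400.58
```
        </preformat>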
        <p>For smaller models like GPT-4o-mini and Haiku, the cost differences become significant, though
at the expense of performance. In a text-only setting, GPT-4o-mini offers the lowest cost per result,
making it an attractive option for large-scale applications where lower accuracy can be tolerated. While
Haiku’s cost per result is slightly higher than GPT-4o-mini’s, its performance is also superior. However,
our experiments indicate that the visual component did not significantly enhance the performance of
these smaller models for relevance judgment; in fact, it can even be detrimental. Therefore, the multimodal
capabilities of GPT-4o-mini and Haiku should be employed
with caution, especially considering the high costs associated with image processing—particularly for
GPT-4o-mini.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Prompt Engineering</title>
        <p>We made the following observations:
Strictness Guidelines Many of the initial disagreements with humans stemmed from the models
being more lenient about the 1 (OK) grade. Results improved after we appended instructions to prefer
grades 2 (GREAT) and 0 (BAD) – e.g. “As our aim is to be strict on exact matches, this grade is less
likely to be used.”</p>
      </sec>
      <sec id="sec-4-3">
        <title>Smaller Models Are More Sensitive to Prompt Complexity</title>
        <p>We found that smaller models, such as Haiku, are highly sensitive to prompt complexity, whereas larger models like GPT-4V manage these
complexities more effectively. For example, when using a prompt with comprehensive and somewhat
lengthy grading instructions, we observed a significantly higher Cohen’s kappa coefficient for the Hotel
Supplies dataset with GPT-4V (0.54) compared to Haiku (0.32).</p>
        <p>Note that this does not imply that Haiku is not suitable for the grading task; rather, it suggests that
the model performs better when prompts are simpler. We verified this hypothesis by making
prompts progressively more concise while still retaining the essential instructions. Ultimately, we were
able to refine the prompt for the Hotel Supplies dataset so that the text-only Haiku model achieves
the highest Cohen’s kappa coefficient of 0.64.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Different Models May Work with Different Prompts</title>
        <p>Although Haiku achieves the highest
Cohen’s kappa coefficient of 0.64 on the Hotel Supplies domain with the refined prompt we developed,
we did not observe the same improvement with GPT-4V. When using the same prompt, GPT-4V
maintained a similar Cohen’s kappa coefficient of approximately 0.54. This indicates that prompt
engineering can be highly model-specific, and a prompt that works exceptionally well for one LLM
may not perform as well for others. In fact, we could not find a systematic way to reliably
optimize model accuracy across the board. As a result, the process of prompt engineering feels more
like art than science, and this motivates further work to develop systematic ways to discover the upper limits
of accuracy for each model size.</p>
        <p>Asking for explanations In our experiments, we observed that asking LLMs to provide explanations
for grading outputs is beneficial for several important reasons:
• Relevance is subjective, and asking an LLM to explain its grading output can be helpful in verifying
the correctness of the assessment. In our experiments, it was not uncommon for humans to
initially disagree with the grading outputs from LLMs; however, they often reached a consensus
after reviewing the detailed explanations provided.
• Having an explanation also helps us to understand how to iterate via prompt engineering to
make the instructions less ambiguous for the model.</p>
        <p>We also note that asking LLMs to provide explanations may help the models perform better at grading.
However, prompts need to be carefully crafted, as we also observed that performance may
regress if this is not done well.</p>
        <p>In the end, we were able to meaningfully improve the accuracy of Haiku through prompt engineering
that asks the LLM for explanations (from 0.36 to 0.40). Given that this accuracy is not far behind, and Haiku
is 20-40 times cheaper than the GPT family, this makes it a very appealing option for application at
a large scale. For instance, smaller models can be used to generate larger label sets to explore recall
issues, while more expensive models focus on smaller sets to evaluate precision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we have presented a new analysis of MLLMs-as-a-Judge to assess the cost-accuracy
trade-offs of the relevance judgment capabilities of MLLMs across three multimodal search use cases: Hotel
Supplies, Design, and Fashion. Various LLMs have shown potential, but no single LLM showed an optimal
cost-accuracy trade-off across all use cases evaluated.</p>
      <p>We have found that, for any practitioner looking to choose the best LLM judge for their use case,
a comprehensive evaluation of all available models is time-intensive, financially demanding, and
energy-intensive, which can have a significant effect on the environment. This
motivates future work in the following directions: 1) improving the abilities of general MLLMs across
use cases, 2) improving the cost and computational efficiency of large MLLMs, and 3) creating small MLLMs
that are optimized for judging relevance in cost-optimal ways for more specialized applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Thanks to the entire Objective team for building many pieces of the puzzle that made this work possible.
Special thanks to Lance Hasson, Brian Porter, George Gkotsis, Kuei-da Liao, and Faizaan Merchant. We
would also like to thank Yev Rotar, John Gulley, and Brian Ip for their valuable relevance judgment
inputs.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Prompt Templates</title>
      <p>In this section, we report the prompts used for the considered models. In the templates, {{document}},
{{query}}, and {{image}} are placeholders for the search result, query, and image, respectively. For
OpenAI’s models, the image corresponds to an image URL, while for Anthropic’s models, it
corresponds to a base64-encoded image.</p>
      <p>Haiku and Sonnet’s Prompt Template (Multimodal Setup)</p>
      <p>You are an assistant responsible for rating how the retrieved result is relevant to the query. If an image
is available, use it to determine the relevance to the query. Output a token: "2", "1", or "0" followed by
a full explanation.</p>
      <p>Guidelines:
"2" - The result matches exactly with what the user's query is looking for.
"1" - The result is not exactly with what the user's query is looking for. But it's pretty similar. As our
aim is to be strict on exact matches, this grade is less likely to be used.
"0" - The result is not related to the query at all.</p>
      <p>Result: {{document}}
Query: {{query}}
### Query Analysis
Before you start grading, it's essential to understand user's intent by breaking apart the query. Keep in
mind, some queries may be more explicit than others. For instance, if user is searching for a clothing
product, then "Red checkered jacket" is more specific compared to "Red jacket". Another example, if user is
searching for a venue, then "Rock concert in San Francisco this weekend" is more specific compared to "Rock
concert in San Francisco". Therefore, adapt your grading contextually.</p>
      <p>Consider all the information from all fields.</p>
      <p>Note: All fields should be taken with equal importance. You should adhere strictly to these guidelines
while grading and ensure a holistic evaluation of
the search results based on all considered fields.</p>
      <p>You are given a search query and a search result in json format. If an image is available, use it to
determine the relevance to the query. You must indicate with a score whether the result is relevant or not.</p>
      <p>Follow the following format.</p>
      <p>User Role: User
Query: {{query}}
Result: {{result}}
Score: 0 or 1 or 2
--
User Role: User
Query: {{query}}
Result: {{document}}</p>
    </sec>
    <sec id="sec-8">
      <title>B. H&amp;M grading guidelines</title>
      <p>In this section, we introduce the internal H&amp;M grading guidelines that our human annotators follow to
assign relevance grades to each pair of &lt;query, search result&gt;.</p>
      <sec id="sec-8-1">
        <title>B.1. Grading rules</title>
        <p>• Is the query asking for a specific category of product, either narrow or broad?
– If the answer is yes, then the user is asking for results that are restricted to that category.</p>
        <p>Only results matching that product category will be marked as GREAT. If the results don’t
match the category, mark them as BAD.
– It is the grader’s job to identify the category as introduced within the query and classify
products accordingly.</p>
        <p>∗ For instance, examples of queries-&gt;categories can be “ofice clothes” -&gt;“clothes” or “music
t-shirt”-&gt;“t-shirt”. Note that t-shirt can be categorized as a sub-category of clothes and
clothes is a super-category of t-shirts. As such, “ofice clothes” includes more products
since it implies a broader categorization, whereas “music t-shirt” is restricting products
under “t-shirts”.
• Is the query mentioning a feature, for instance: color, size, utility (e.g. windproof, maternity)?
– If the query mentions a feature, only results matching that feature will be marked as GREAT.
– If the results don’t match the feature, mark it as BAD.
– If the results match the feature close enough but not exactly, mark it as OK. For example, if
the query is “yellow jacket” and a search result is a light orange jacket, or some jacket that
contains some clear patches of yellow but is otherwise not yellow, then this result is OK.
• If a query mentions both a category and a feature, only results matching both the category and the
feature will be marked as GREAT. Results matching the feature but not the category (or vice-versa)
are BAD.</p>
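        <p>How the per-aspect judgments combine into a grade is deterministic, even though the judgments themselves come from the human grader. A minimal sketch of that combination step, assuming a hypothetical helper (not part of the guidelines), is:</p>
        <preformat>
```python
# Hypothetical sketch of the deterministic core of the grading rules above.
# The grader supplies the per-aspect judgments; the rules only fix how those
# judgments combine into GREAT / OK / BAD.
def combine_grades(category_match=None, feature_match=None):
    """Combine per-aspect judgments into a grade.

    Each argument is "yes", "no", "close", or None (aspect absent from query).
    """
    aspects = [m for m in (category_match, feature_match) if m is not None]
    if not aspects:
        # The guidelines above do not define queries naming neither aspect.
        raise ValueError("query names neither a category nor a feature")
    if "no" in aspects:      # any missed aspect is disqualifying
        return "BAD"
    if "close" in aspects:   # e.g. a light orange jacket for "yellow jacket"
        return "OK"
    return "GREAT"           # all named aspects match exactly
```
        </preformat>
        <p>For example, a light orange jacket for the query “yellow jacket” matches the category but only approximately matches the feature, so it combines to OK.</p>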
      </sec>
      <sec id="sec-8-2">
        <title>B.2. Grading examples</title>
        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>Query</th>
                <th>BAD</th>
                <th>GREAT</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>yellow raincoat</td>
                <td>black raincoat, yellow cardigan, yellow jacket</td>
                <td>raincoats that are yellow</td>
              </tr>
              <tr>
                <td>rainy weather</td>
                <td>anything else/products that do not help during rainy weather</td>
                <td>any product that is rainproof</td>
              </tr>
              <tr>
                <td>sports utility wear</td>
                <td>anything that cannot be used during a sport activity or some kind of manual labour task</td>
                <td>products that visually and textually support comfortable movement and gym workouts</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Siro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <source>Report on the 1st workshop on large language models for evaluation in information retrieval (LLM4Eval 2024) at SIGIR 2024</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.05388. arXiv:2408.05388.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.04788. arXiv:2402.04788.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <article-title>Prometheus: Inducing fine-grained evaluation capability in language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.08491. arXiv:2310.08491.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Judgelm: Fine-tuned large language models are scalable judges</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.17631. arXiv:2310.17631.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Spielman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Large language models can accurately predict searcher preferences</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2309.10621. arXiv:2309.10621.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Toward automatic relevance judgment using vision-language models for image-text retrieval evaluation</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.01363. arXiv:2408.01363.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation</article-title>
          ,
          <source>arXiv preprint arXiv:2402.03216</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <article-title>Mteb: Massive text embedding benchmark</article-title>
          ,
          <source>arXiv preprint arXiv:2210.07316</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2210.07316. doi:10.48550/arXiv.2210.07316.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>