                                LLMJudge: LLMs for Relevance Judgments
                         Hossein A. Rahmani1 , Emine Yilmaz1 , Nick Craswell2 , Bhaskar Mitra3 , Paul Thomas4 ,
                         Charles L. A. Clarke5 , Mohammad Aliannejadi6 , Clemencia Siro6 and Guglielmo Faggioli7
                         1 University College London, London, UK
                         2 Microsoft, Seattle, US
                         3 Microsoft, Montréal, Canada
                         4 Microsoft, Adelaide, Australia
                         5 University of Waterloo, Ontario, Canada
                         6 University of Amsterdam, Amsterdam, The Netherlands
                         7 University of Padua, Padua, Italy




                         1. Introduction
                         The LLMJudge challenge1 is organized as part of the LLM4Eval2 workshop [1] at SIGIR 2024. Test
                         collections are essential for evaluating information retrieval (IR) systems. The evaluation and tuning
                         of a search system is largely based on relevance labels, which indicate whether a document is useful
                         for a specific search and user. However, collecting relevance judgments on a large scale is costly and
                         resource-intensive. Consequently, typical experiments rely on third-party labelers who may not always
                         produce accurate annotations. The LLMJudge challenge aims to explore an alternative approach by
                         using LLMs to generate relevance judgments. Recent studies have shown that LLMs can generate
                         reliable relevance judgments for search systems. However, it remains unclear which LLMs can match
                         the accuracy of human labelers, which prompts are most effective, how fine-tuned open-source LLMs
                         compare to closed-source LLMs like GPT-4, whether there are biases in synthetically generated data,
                         and if data leakage affects the quality of generated labels. This challenge will investigate these questions,
                         and the collected data will be released as a package to support automatic relevance judgment research
                         in information retrieval and search.


                         2. Related Work
Automatic relevance judgment has recently received significant attention in the Information Retrieval
(IR) community. In earlier studies, Faggioli et al. [2] examined different levels of human-LLM
collaboration for automatic relevance judgment, arguing that humans still need to support and
collaborate with LLMs in a human-machine collaborative judgment process. Thomas et al. [3] leveraged
LLM capabilities for judgment at scale in Microsoft Bing. They used real searcher feedback to select an
LLM and a prompt that match a small sample of searcher preferences. Their experiments show that
LLMs can be as good as human annotators at identifying the best systems. They also comprehensively
investigated various prompts and prompt features for the task and revealed that LLM judgment
performance can vary with simple paraphrases of a prompt. Recently, Rahmani et al. [4] studied fully
synthetic test collections built with LLMs. In their study, they generated not only synthetic queries
but also synthetic judgments to build a fully synthetic test collection for retrieval evaluation. They
have shown that LLMs are able to generate a synthetic test collection that yields system ordering
results similar to those obtained using the real test collection.

                          LLM4Eval: The First Workshop on Large Language Models for Evaluation in Information Retrieval, 18 July 2024, Washington DC,
                          United States
                                   © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                         1
                           https://llm4eval.github.io/challenge/
                         2
                           https://llm4eval.github.io/

Table 1
Statistics of LLMJudge Dataset
                                                                   Dev     Test
                                        # queries                    25      25
                                        # passages                 7,224   4,414
                                        # qrels                    7,263   4,423
                                        # irrelevant (0)           4,538   2,005
                                        # related (1)              1,403   1,233
                                        # highly relevant (2)       625     808
                                        # perfectly relevant (3)    697     377


3. LLMJudge Task Design
The challenge task is: given a query and a document as input, determine how relevant the document is
to the query. We use four-point scale judgments to assess the relevance of a document to a query, as
follows:

       • [3] Perfectly relevant: The passage is dedicated to the query and contains the exact answer.
       • [2] Highly relevant: The passage has some answers for the query, but the answer may be a bit
         unclear, or hidden amongst extraneous information.
       • [1] Related: The passage seems related to the query but does not answer it.
       • [0] Irrelevant: The passage has nothing to do with the query.

  The task is as follows: given the provided datasets of queries, documents, and query-document pair
files, participants use LLMs to generate a score in {0, 1, 2, 3} indicating the relevance of each
document to its query.
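As a concrete illustration, the grading step can be sketched in Python. This is only a minimal sketch, not the challenge's official sample prompt (that is in the repository linked below); `build_prompt` and `parse_score` are hypothetical helper names, and the call to an actual LLM API is deliberately left out.

```python
def build_prompt(query: str, passage: str) -> str:
    """Build a zero-shot grading prompt using the four-point LLMJudge scale."""
    return (
        "Judge how relevant the passage is to the query on a 0-3 scale:\n"
        "3 = Perfectly relevant: dedicated to the query, contains the exact answer.\n"
        "2 = Highly relevant: has some answer, but unclear or amid extraneous text.\n"
        "1 = Related: seems related to the query but does not answer it.\n"
        "0 = Irrelevant: nothing to do with the query.\n\n"
        f"Query: {query}\nPassage: {passage}\n"
        "Respond with a single digit."
    )


def parse_score(response: str) -> int:
    """Extract the first grade in {0, 1, 2, 3} from a raw model response."""
    for ch in response:
        if ch in "0123":
            return int(ch)
    raise ValueError(f"no relevance grade found in: {response!r}")
```

A labeler would send `build_prompt(...)` to the model of its choice and pass the raw completion through `parse_score` to obtain the 0-3 label.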


4. LLMJudge Data
The LLMJudge challenge dataset is built upon the passage retrieval task dataset of the TREC 2023 Deep
Learning track3 (TREC-DL 2023) [5]. Table 1 shows the statistics of the LLMJudge challenge datasets.
We divide the data into development and test sets. The test set is used for the generation of judgment
by participants, while the development set could be used for few-shot or fine-tuning purposes. The
datasets, sample prompt, and the quick starter for automatic judgment can be found at the following
repository: https://github.com/llm4eval/LLMJudge
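For orientation, the query-document (qrels) files can be loaded with a few lines of Python. This sketch assumes the standard four-column TREC qrels layout (query ID, iteration, document ID, grade); the exact file layout in the repository may differ, and the function names here are illustrative only.

```python
from collections import Counter
from typing import Iterable


def read_qrels(lines: Iterable[str]) -> dict[tuple[str, str], int]:
    """Parse TREC-style qrels lines of the form 'qid iteration docid grade'."""
    qrels = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _, docid, grade = line.split()
        qrels[(qid, docid)] = int(grade)
    return qrels


def grade_distribution(qrels: dict[tuple[str, str], int]) -> Counter:
    """Count how many query-document pairs fall in each grade (cf. Table 1)."""
    return Counter(qrels.values())
```

Applied to the development set, `grade_distribution` should reproduce the per-grade counts in Table 1.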


5. Evaluation
Participants’ results are evaluated using two methods after submission:
       • automated evaluation metrics computed against the human labels of the test set, which are
         hidden from the participants;
       • system ordering evaluation: ranking multiple search systems by human judgments and by
         LLM-based judgments, and comparing the two orderings.


6. Submissions and Results
In order to evaluate the quality of the generated labels, we used Cohen’s 𝜅 to measure each labeler’s
agreement with the LLMJudge test data at the query-document level, and Kendall’s 𝜏 to check each
labeler’s agreement with the LLMJudge test data on system ordering, i.e., over the runs submitted to
TREC DL 2023. In total, we had 39 submissions (i.e., 39 labelers) from 7 groups: the National Institute
of Standards and Technology (NIST), RMIT University, The University of Melbourne, University of
New Hampshire, University of Waterloo, Included Health, and University of Amsterdam.
3
    https://microsoft.github.io/msmarco/TREC-Deep-Learning.html

Figure 1: Scatter plot of Cohen’s 𝜅 (x-axis, kappa [ 01|23 ]) and Kendall’s 𝜏 (y-axis) for submitted labelers,
including TREMA-4prompts, NISTRetrieval-instruct0, RMITIR-llama70B, h2oloo-fewself, prophet, and
willia-umbrela1
   Figure 1 shows the performance of submitted labelers on the LLMJudge test set. The x-axis represents
Cohen’s 𝜅, and the y-axis shows the labelers’ agreement on system ordering. Labelers exhibit low
variability in Kendall’s 𝜏 but greater variability in Cohen’s 𝜅. Most labelers cluster within a narrow
range of 𝜏 values, indicating consistent system rankings but more variation in inter-rater reliability,
as measured by Cohen’s 𝜅. This suggests that while labelers generally agree on rankings, their exact
labels are less consistent, leading to the observed variability in 𝜅.
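Both agreement measures are straightforward to compute. Below is a self-contained sketch of plain unweighted Cohen’s 𝜅 and Kendall’s 𝜏-a (no tie correction); libraries such as scikit-learn and SciPy provide production implementations, and the [ 01|23 ] axis label in Figure 1 suggests grades were binarized before computing 𝜅, a step this sketch does not perform.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Unweighted Cohen's kappa between two label sequences of equal length."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected chance agreement from the two labelers' marginal distributions.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    pairs = list(combinations(range(len(x)), 2))
    score = 0
    for i, j in pairs:
        sign = (x[i] - x[j]) * (y[i] - y[j])
        score += 1 if sign > 0 else -1 if sign < 0 else 0
    return score / len(pairs)
```

For system ordering, `kendall_tau` would be applied to per-system effectiveness scores computed once with human qrels and once with a labeler’s LLM-generated qrels.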


Acknowledgment
The challenge is organized as a joint effort by University College London, Microsoft, the University
of Amsterdam, the University of Waterloo, and the University of Padua. The views expressed in the
content are solely those of the authors and do not necessarily reflect the views or endorsements of
their employers and/or sponsors. This work is supported by the Engineering and Physical Sciences
Research Council [EP/S021566/1], the EPSRC Fellowship titled “Task Based Information Retrieval”
[EP/P024289/1], CAMEO, PRIN 2022 n. 2022ZLL7MW.


References
[1] H. A. Rahmani, C. Siro, M. Aliannejadi, N. Craswell, C. L. A. Clarke, G. Faggioli, B. Mitra, P. Thomas,
    E. Yilmaz, LLM4Eval: Large language model for evaluation in IR, in: Proceedings of the 47th
    International ACM SIGIR Conference on Research and Development in Information Retrieval,
    SIGIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 3040–3043. URL:
    https://doi.org/10.1145/3626772.3657992. doi:10.1145/3626772.3657992.
[2] G. Faggioli, L. Dietz, C. L. Clarke, G. Demartini, M. Hagen, C. Hauff, N. Kando, E. Kanoulas,
    M. Potthast, B. Stein, et al., Perspectives on large language models for relevance judgment, in:
    Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval,
    2023, pp. 39–50.
[3] P. Thomas, S. Spielman, N. Craswell, B. Mitra, Large language models can accurately predict
    searcher preferences, arXiv preprint arXiv:2309.10621 (2023).
[4] H. A. Rahmani, N. Craswell, E. Yilmaz, B. Mitra, D. Campos, Synthetic test collections for retrieval
    evaluation, arXiv preprint arXiv:2405.07767 (2024).
[5] N. Craswell, B. Mitra, E. Yilmaz, H. A. Rahmani, D. Campos, J. Lin, E. M. Voorhees, I. Soboroff,
    Overview of the TREC 2023 deep learning track, in: Text REtrieval Conference (TREC), NIST,
    2024. URL: https://www.microsoft.com/en-us/research/publication/
    overview-of-the-trec-2023-deep-learning-track/.