<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLMJudge: LLMs for Relevance Judgments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hossein A. Rahmani</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emine Yilmaz</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nick Craswell</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bhaskar Mitra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Thomas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles L. A. Clarke</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Aliannejadi</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clemencia Siro</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Microsoft</institution>
          ,
          <addr-line>Adelaide</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft</institution>
          ,
          <addr-line>Montréal</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Microsoft</institution>
          ,
          <addr-line>Seattle</addr-line>
          ,
          <country country="US">US</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The LLMJudge challenge is organized as part of the LLM4Eval workshop [1] at SIGIR 2024. Test collections are essential for evaluating information retrieval (IR) systems. The evaluation and tuning of a search system is largely based on relevance labels, which indicate whether a document is useful for a specific search and user. However, collecting relevance judgments on a large scale is costly and resource-intensive. Consequently, typical experiments rely on third-party labelers who may not always produce accurate annotations. The LLMJudge challenge aims to explore an alternative approach by using LLMs to generate relevance judgments. Recent studies have shown that LLMs can generate reliable relevance judgments for search systems. However, it remains unclear which LLMs can match the accuracy of human labelers, which prompts are most effective, how fine-tuned open-source LLMs compare to closed-source LLMs like GPT-4, whether there are biases in synthetically generated data, and if data leakage affects the quality of generated labels. This challenge will investigate these questions, and the collected data will be released as a package to support automatic relevance judgment research in information retrieval and search.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Automatic relevance judgment has recently received significant attention in the Information Retrieval
(IR) community. In an earlier study, Faggioli et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] examined different levels of human and LLM
collaboration for automatic relevance judgment. They argued that humans need to support and
collaborate with LLMs in a human-machine collaborative judgment process. Thomas et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] leverage LLM
capabilities for judgment at scale at Microsoft Bing. They used real searcher feedback to choose an
LLM and prompt that match a small sample of searcher preferences. Their experiments
show that LLMs can be as good as human annotators at identifying the best systems. They also
comprehensively investigated various prompts and prompt features for the task and revealed that LLM
judgment performance can vary with simple paraphrases of a prompt. Recently, Rahmani et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
studied fully synthetic test collections built with LLMs. In their study, they generated not only synthetic
queries but also synthetic judgments to build a fully synthetic test collection for retrieval evaluation. They
showed that LLMs are able to generate a synthetic test collection that yields a system ordering
similar to the one obtained using the real test collection.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. LLMJudge Task Design</title>
      <p>
        The challenge is, given a query and a document as input, to decide how relevant the document is to the
query. Here, we use four-point scale judgments to assess the relevance of a document to a query, as follows:
• [3] Perfectly relevant: The passage is dedicated to the query and contains the exact answer.
• [2] Highly relevant: The passage has some answer for the query, but the answer may be a bit
unclear, or hidden amongst extraneous information.
• [1] Related: The passage seems related to the query but does not answer it.
      </p>
      <p>• [0] Irrelevant: The passage has nothing to do with the query.</p>
      <p>
        The task is, given datasets that include queries, documents, and query-document files provided to
participants, to use LLMs to generate a score in [0, 1, 2, 3] indicating the relevance of each document to its
query.
      </p>
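      <p>
        As an illustration only (this is not the official LLMJudge prompt or starter code), the following minimal Python sketch shows how a participant might prompt an LLM for a single grade on this four-point scale and parse the reply; call_llm is a hypothetical stand-in for whichever model API is used.
      </p>
      <preformat>
# Minimal sketch (not the official LLMJudge prompt): build a grading prompt
# for one query-passage pair and parse a 0-3 relevance score from the reply.
import re

GRADES = "3 = Perfectly relevant, 2 = Highly relevant, 1 = Related, 0 = Irrelevant"

def build_prompt(query: str, passage: str) -> str:
    # Instruct the model to answer with a single digit on the four-point scale.
    return (
        "Judge the relevance of the passage to the query on a four-point scale:\n"
        f"{GRADES}.\n"
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Answer with a single digit (0, 1, 2, or 3)."
    )

def parse_score(reply: str) -> int:
    # Take the first digit 0-3 in the reply; default to 0 (Irrelevant) if none is found.
    match = re.search(r"[0-3]", reply)
    return int(match.group()) if match else 0

# call_llm is a hypothetical callable: prompt string in, model reply string out.
def judge(query: str, passage: str, call_llm) -> int:
    return parse_score(call_llm(build_prompt(query, passage)))
      </preformat>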
    </sec>
    <sec id="sec-4">
      <title>4. LLMJudge Data</title>
      <p>
        The LLMJudge challenge dataset is built upon the passage retrieval task dataset of the TREC 2023 Deep
Learning track (TREC-DL 2023) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (https://microsoft.github.io/msmarco/TREC-Deep-Learning.html). Table 1 shows the statistics of the LLMJudge challenge datasets.
We divide the data into development and test sets. The test set is used by participants to generate judgments,
while the development set can be used for few-shot prompting or fine-tuning. The
datasets, a sample prompt, and a quick-start script for automatic judgment can be found in the following
repository: https://github.com/llm4eval/LLMJudge
      </p>
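      <p>
        As a minimal sketch, assuming the development judgments use the common TREC qrels format of one "qid 0 docid grade" entry per line, the snippet below loads them into a nested dictionary; the file name is illustrative, and the authoritative file names and formats are those in the repository above.
      </p>
      <preformat>
# Minimal sketch for reading a development qrels file, assuming the common
# TREC format "qid 0 docid grade" per line. The file name is illustrative;
# see https://github.com/llm4eval/LLMJudge for the actual files and formats.
from collections import defaultdict

def read_qrels(path: str) -> dict:
    # Map each query id to {doc id: relevance grade on the 0-3 scale}.
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, docid, grade = line.split()
            qrels[qid][docid] = int(grade)
    return qrels

dev_qrels = read_qrels("llm4eval_dev_qrel_2024.txt")  # illustrative file name
      </preformat>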
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>Participants’ results will be evaluated using two methods after submission:
• automated evaluation metrics against human labels in the test set, which are hidden from the participants;
• system-ordering agreement across multiple search systems between human judgments and LLM-based
judgments.</p>
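      <p>
        The sketch below illustrates these two views using standard implementations of Cohen’s kappa (scikit-learn) and Kendall’s tau (SciPy); the numbers are toy values, and this is not the organizers’ exact evaluation script.
      </p>
      <preformat>
# Sketch of the two evaluation views (not the organizers' exact script),
# using standard implementations of Cohen's kappa and Kendall's tau.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# Label agreement: human vs. LLM grades for the same query-document pairs.
human_labels = [3, 2, 0, 1, 2]           # toy values
llm_labels   = [3, 1, 0, 1, 2]           # toy values
kappa = cohen_kappa_score(human_labels, llm_labels)

# System ordering: per-system effectiveness (e.g., NDCG@10) computed once
# with human judgments and once with LLM judgments, aligned by system.
human_scores = [0.61, 0.58, 0.47, 0.52]  # toy values
llm_scores   = [0.63, 0.55, 0.45, 0.50]  # toy values
tau, _ = kendalltau(human_scores, llm_scores)

print(f"Cohen's kappa = {kappa:.3f}, Kendall's tau = {tau:.3f}")
      </preformat>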
    </sec>
    <sec id="sec-6">
      <title>6. Submissions and Results</title>
      <p>In order to evaluate the quality of the generated labels, we used Cohen’s κ to measure each labeler’s agreement
with the LLMJudge test data at the query-document level and Kendall’s τ to check each labeler’s agreement
with the LLMJudge test data on system ordering, i.e., over the runs submitted to TREC DL 2023. In total,
we received 39 submissions (i.e., 39 labelers) from 7 groups: the National Institute of Standards and
Technology (NIST), RMIT University, The University of Melbourne, University of New Hampshire,
University of Waterloo, Included Health, and University of Amsterdam.</p>
      <p>[Figure 1: Kendall’s τ (system-ordering agreement, y-axis, roughly 0.8–1.0) versus Cohen’s κ (query-document label agreement, x-axis, roughly 0–0.5) for the submitted labelers; labeled points include NISTRetrieval-instruct0, prophet, and RMITIR-llama70B.]</p>
      <p>Figure 1 shows the performance of submitted labelers on the LLMJudge test set. The x-axis represents
Cohen’s κ, and the y-axis shows the labelers’ agreement on system ordering. Labelers exhibit low
variability in Kendall’s τ but greater variability in Cohen’s κ. Most labelers cluster within a narrow
range of τ values, indicating consistent system rankings but more variation in inter-rater reliability,
as measured by Cohen’s κ. This suggests that while labelers generally agree on rankings, their exact
labels are less consistent, leading to the observed variability in κ.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>The challenge is organized as a joint effort by University College London, Microsoft, the University
of Amsterdam, the University of Waterloo, and the University of Padua. The views expressed in the
content are solely those of the authors and do not necessarily reflect the views or endorsements of
their employers and/or sponsors. This work is supported by the Engineering and Physical Sciences
Research Council [EP/S021566/1], the EPSRC Fellowship titled “Task Based Information Retrieval”
[EP/P024289/1], and the CAMEO project (PRIN 2022 n. 2022ZLL7MW).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Siro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <article-title>Llm4eval: Large language model for evaluation in ir</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '24,
          Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>3040</fpage>
          -
          <lpage>3043</lpage>
          . URL: https://doi.org/10.1145/3626772.3657992. doi:10.1145/3626772.3657992.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , et al.,
          <article-title>Perspectives on large language models for relevance judgment</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Spielman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Large language models can accurately predict searcher preferences</article-title>
          ,
          <source>arXiv preprint arXiv:2309.10621</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <article-title>Synthetic test collections for retrieval evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2405.07767</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>Overview of the trec 2023 deep learning track</article-title>
          ,
          <source>in: Text REtrieval Conference (TREC), NIST</source>
          , TREC,
          <year>2024</year>
          . URL: https://www.microsoft.com/en-us/research/publication/overview-of-the-trec-2023-deep-learning-track/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>