<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <abstract>
        <p>Investigating LLM-based relevance estimators for potential systemic biases; end-to-end evaluation of retrieval-augmented generation systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Preface</title>
      <p>This volume contains the proceedings of the First Workshop on Large
Language Models (LLMs) for Evaluation in Information Retrieval (LLM4Eval 2024),
held on July 18th, 2024 in Washington D.C., USA, and co-located with the 47th
International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR 2024, July 14-18, 2024, Washington D.C., USA).</p>
      <p>Large language models (LLMs) have demonstrated increasing task-solving
abilities not present in smaller models. Utilizing the capabilities of LLMs for
automated evaluation (LLM4Eval) has recently attracted considerable attention in
multiple research communities. For instance, LLM4Eval models have been studied
in the context of automated judgments, natural language generation, and
retrieval-augmented generation systems. We believe that the information
retrieval community can significantly contribute to this growing research area
by designing, implementing, analyzing, and evaluating various aspects of LLMs
with applications to LLM4Eval tasks. The main goal of the LLM4Eval workshop was
to bring together researchers from industry and academia to discuss various
aspects of LLMs for evaluation in information retrieval, including automated
judgments, retrieval-augmented generation pipeline evaluation, augmenting human
evaluation, and the robustness and trustworthiness of LLMs for evaluation, in
addition to their impact on real-world applications.</p>
      <p>The contributions to LLM4Eval 2024 mainly address the following relevant
topics:
• LLM-based evaluation metrics for traditional IR and generative IR.
• Agreement between human and LLM labels.
• Effectiveness and/or efficiency of LLMs to produce robust relevance labels.
• Automated evaluation of text generation systems.
• Trustworthiness in the world of LLM evaluation.
• Prompt engineering in LLM evaluation.
• Effectiveness and/or efficiency of LLMs as ranking models.
• LLMs in specific IR tasks such as personalized search, conversational
search, and multimodal retrieval.
• Challenges and future directions in LLM-based IR evaluation.</p>
      <p>We received 21 submissions of original papers presenting new research results
and 5 submissions of already published results. The program committee comprised
24 researchers, highly diversified in background and geographical region. Each
submission was reviewed by three program committee members, who assessed
originality, technical depth, style of presentation, and impact. Finally, the
committee accepted 18 original papers and all the previously published works
for presentation at the workshop. Of these, 7 papers were further published
in the proceedings.</p>
      <p>The workshop program included a booster session in which the authors of
the accepted papers presented their work, followed by a poster session that
allowed a more detailed discussion between presenters and workshop participants.
Furthermore, the workshop included a panel with the following invited
panellists: Charles L. A. Clarke (University of Waterloo), Laura Dietz (University
of New Hampshire), Michael D. Ekstrand (Drexel University), and Ian
Soboroff (National Institute of Standards and Technology (NIST)). We also had
two keynotes: the first by Ian Soboroff (NIST), titled “A Brief History of
Automatic Evaluation in IR”, and the second by Donald Metzler (Google DeepMind),
titled “LLMs as Rankers, Raters, and Rewarders”.</p>
      <p>The success of LLM4Eval 2024 would not have been possible without the
considerable effort of several people, including the Program Committee and the
participants, who contributed their time and effort.</p>
      <p>Thank you all very much!
July, 2024</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>