Preface


                                    This volume contains the proceedings of the First Workshop on Large
                                Language Models (LLMs) for Evaluation in Information Retrieval (LLM4Eval
                                2024), held on July 18, 2024 in Washington, D.C., USA, and co-located with
                                the 47th International ACM SIGIR Conference on Research and Development in
                                Information Retrieval (SIGIR 2024, July 14-18, 2024, Washington, D.C., USA).
                                    Large language models (LLMs) have demonstrated increasing task-solving
                                abilities not present in smaller models. Utilizing the capabilities and
                                responsibilities of LLMs for automated evaluation (LLM4Eval) has recently
                                attracted considerable attention in multiple research communities. For
                                instance, LLM4Eval models have been studied in the context of automated
                                judgments, natural language generation, and retrieval-augmented generation
                                systems. We believe that the information retrieval community can
                                significantly contribute to this growing research area by designing,
                                implementing, analyzing, and evaluating various aspects of LLMs with
                                applications to LLM4Eval tasks. The main goal of the LLM4Eval workshop was
                                to bring together researchers from industry and academia to discuss various
                                aspects of LLMs for evaluation in information retrieval, including automated
                                judgments, retrieval-augmented generation pipeline evaluation, altering
                                human evaluation, robustness, and trustworthiness of LLMs for evaluation, in
                                addition to their impact on real-world applications.
                                    The contributions to LLM4Eval 2024 mainly address the following relevant
                                topics:
                                   • LLM-based evaluation metrics for traditional IR and generative IR.
                                   • Agreement between human and LLM labels.
                                   • Effectiveness and/or efficiency of LLMs to produce robust relevance labels.
                                   • Investigating LLM-based relevance estimators for potential systemic
                                     biases.
                                   • Automated evaluation of text generation systems.
                                   • End-to-end evaluation of retrieval-augmented generation systems.
                                   • Trustworthiness in the world of LLM evaluation.
                                   • Prompt engineering in LLM evaluation.
                                   • Effectiveness and/or efficiency of LLMs as ranking models.
                                   • LLMs in specific IR tasks such as personalized search, conversational
                                     search, and multimodal retrieval.
                                   • Challenges and future directions in LLM-based IR evaluation.

CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

     We received 21 submissions of original papers presenting new research results
and 5 submissions of already published results. The program committee comprised
24 researchers, highly diversified in background and geographical region. Three
program committee members reviewed each submission. The reviewers assessed
originality, technical depth, style of presentation, and impact. Finally, the
committee accepted 18 original papers and all the previously published works
for presentation at the workshop. Out of these, 7 papers were further published
in the proceedings.
     The workshop program included a booster session where the authors of
the accepted papers presented their work, followed by a poster session to
allow a more detailed discussion between presenters and workshop participants.
Furthermore, the workshop included a panel with the following invited
panellists: Charlie L. A. Clarke (University of Waterloo), Laura Dietz
(University of New Hampshire), Michael D. Ekstrand (Drexel University), and
Ian Soboroff (National Institute of Standards and Technology (NIST)). We also
had two keynotes. The first was by Ian Soboroff (National Institute of
Standards and Technology (NIST)), titled “A Brief History of Automatic
Evaluation in IR”; the second was by Donald Metzler (Google DeepMind), titled
“LLMs as Rankers, Raters, and Rewarders”.
     The success of LLM4Eval 2024 would not have been possible without the
considerable effort of several people, including the Program Committee and the
participants who contributed their time and effort.
Thank you all very much!


July 2024


                                                          Hossein A. Rahmani
                                                               Clemencia Siro
                                                        Mohammad Aliannejadi
                                                                Nick Craswell
                                                          Charles L. A. Clarke
                                                            Guglielmo Faggioli
                                                                Bhaskar Mitra
                                                                 Paul Thomas
                                                                Emine Yilmaz


Program Committee
 • Zahra Abbasiantaeb, University of Amsterdam
 • Mofetoluwa Adeyemi, University of Waterloo
 • Marwah Alaofi, RMIT University
 • Negar Arabzadeh, University of Waterloo
 • Shivangi Bithel, IIT Delhi
 • Francesco Luigi De Faveri, University of Padua
 • Yashar Deldjoo, Polytechnic University of Bari
 • Gianluca Demartini, The University of Queensland
 • Laura Dietz, University of New Hampshire
 • Yue Feng, UCL
 • Claudia Hauff, Spotify
 • Bhawesh Kumar, Verily Life Sciences
 • Yiqun Liu, Tsinghua University
 • Sean MacAvaney, University of Glasgow
 • James Mayfield, Johns Hopkins University
 • Chuan Meng, University of Amsterdam
 • Ipsita Mohanty, Carnegie Mellon University
 • Mohammadmehdi Naghiaei, University of Southern California
 • Pranoy Panda, Fujitsu Research
 • Orion Weller, Johns Hopkins University
 • Lu Wang, Microsoft
 • Xi Wang, University of Sheffield
 • Jheng-Hong Yang, University of Waterloo
 • Oleg Zendel, RMIT University

