The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches

Bhashithe Abeysinghe babeysinghe@air.org American Institutes for Research

Arlington VA

Ruhan Circi rcirci@air.org American Institutes for Research

Arlington VA

The First Workshop on Large Language Models for Evaluation in Information Retrieval

18 July 2024 Washington DC United States

ISSN 1613-0073

Keywords: LLM, Human Evaluation, Evaluation Challenges, factor based evaluation, LLM Evaluation

Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based generative AI methods, building chatbots has become trivial. Chatbots targeted at specific domains, for example medicine and psychology, are implemented rapidly. This, however, should not distract from the need to evaluate chatbot responses, especially because the natural language generation community does not entirely agree on how to evaluate such applications effectively. In this work we discuss the issue further, with a focus on the increasingly popular LLM-based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be used in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme on one of our chatbot implementations, which consumes educational reports, and subsequently compare automated evaluation, traditional human evaluation, factored human evaluation, and factored LLM evaluation. Results show that factor-based evaluation produces better insights into which aspects of LLM applications need improvement, and further strengthen the argument for using human evaluation in critical spaces where the main functionality is not direct retrieval.

Introduction

The landscape of chatbot development is rapidly evolving, propelled by advancements in Large Language Model (LLM) APIs. While the pace of development is exciting, there is a gap between building an LLM-powered application and building a reliable system with LLMs. Closing this gap requires carefully considering whether the final product satisfies all requirements and evaluating it to test its alignment with performance and ethical standards. As highlighted by [1], this evaluation process should encompass both a technical assessment and a trust-oriented framework. It is essential to ensure a balance between operational efficiency and responsible usage.

This process is further complicated by common pitfalls of LLMs: several authors [2,3,4,5] identify areas where LLMs can make mistakes, such as hallucination, tone, and output formatting. Effective evaluation helps to establish validity and maintain consistency, and so to avoid these pitfalls. The development of an effective evaluation system is timely for researchers and developers alike, given the proliferation of LLM-based generative applications such as chatbots.

The development cycle of a generic LLM-based application typically covers three phases: a) selection of the LLM, b) iterative development of the application, and c) operational deployment of the app. The evaluation of LLMs themselves, as discussed in various papers [6,7], is beyond the scope of this paper. However, it is essential to note that the quality of the base LLM is a fundamental component in leveraging its capabilities effectively and minimizing risk in the resulting application. For applications, developers may follow different development approaches (e.g., fine-tuning, chaining, prompting, Retrieval Augmented Generation (RAG), LLM search combined with knowledge graphs, etc.), and each approach demands tailored evaluation steps, e.g., the quality of the data used in fine-tuning, prompting styles [8], or chunk size and quantity in RAG [9]. This paper explores three fundamental approaches for evaluating the final response (i.e., output) generated by LLM-based chatbots, namely automated metrics, human evaluation, and LLM-based evaluation. With respect to human evaluation, we investigate preferential evaluation and factored evaluation methods.

Background

Chatbots interact with users in order to resolve their queries. Some chatbots are domain specific [10] while others are general purpose [11]. Evaluating a chatbot largely hinges on its intended use and specialization. In reviewing 16 papers on this topic, we summarized several key components that require attention for the evaluation; among these, a clear definition of the chatbot's intended purpose (i.e., the use case, which specifies the business goal or client expectations and the user interaction with the app) is critical. Such clarity allows a focused evaluation of whether the chatbot attains its designated purpose.

The components described in Table 1 suggest that chatbots can be evaluated on different criteria (also known as factors or dimensions), such as their ability to answer the users' queries completely, their linguistic effectiveness, and their ability to recall information (either through information retrieval or memory). Additional metrics may include the system's response time, usability, and intuitiveness.

Currently, there are no common methods or agreed-upon best practices robust enough to evaluate LLM-based applications. As pointed out in almost all prior work on this topic, a notable challenge is the lack of consensus on appropriate evaluation criteria and metrics. Therefore, researchers and developers bear the responsibility of choosing the evaluation methods most appropriate for their application. This responsibility may not only lengthen development timelines but may also lead to underpowered statistical evaluations [12,13]. A recurring issue with automated metrics is that their results are inconsistent and may not always correlate with human evaluation. Many still prefer them because they are readily available and easily repeatable [14,15,16,17,18]. This is not the case with human evaluation, which is expensive and not repeatable in the same context even with the same humans [19,13,20]. We must also acknowledge work in which generative AI models are used at the evaluation step, such as ChatEval, GPTScore, and ARES [21,22,23], which are novel applications of LLMs. [24] discusses "bot-play", where an already evaluated LLM is used to evaluate a new, unevaluated LLM. When considering LLM-based evaluators, one must make sure that the evaluator LLM produces decisions that are accurate to within a given threshold.

Human evaluation remains the most widely accepted form of evaluation in research studies despite frequent reports of underpowered results [25,13]. There have been several calls for the standardization of human evaluation methods [26,20], but the cost of human evaluation often leads researchers to report on systems with insufficient statistical power. Additionally, human evaluators are sensitive to the framing of questions (negative or positive), which is reported to influence outcomes [27]. For conversational or dialogue systems, the common standard of human evaluation is rating Quality on Likert scales. Quality can vary across tasks, and it encompasses multiple factors such as correctness, relevance, informativeness, consistency, understanding, etc. [19]. [13] suggest using a minimum of 100 questions rated on 5- or 7-point Likert scales to evaluate multiple dimensions. This is a difficult goal to achieve given the expense of human evaluation.

The variability in expert opinions has led to multiple recommendations for refining human evaluation approaches. Engaging at least four experts is recommended, but more is preferable for robust results [20]. However, using expert evaluations may not always be productive, particularly if the system is not designed for expert use [25]. In cases where the number of available experts is limited, a comparative (also known as preferential) evaluation approach is often preferred. Additionally, it is advisable to involve about 10 to 60 non-expert users (the intended end-users of the system) in the evaluation process and to report the Inter Annotator Agreement (IAA) for reliability (refer to Table 3 in [13] for best practices). It is also imperative that the conversation be judged by external evaluators who have not taken part in it [19]. [28] discusses the complexities in explaining human evaluations, noting that individuals with varying levels of expertise can provide divergent assessments of the same response. This again shows the importance of employing many humans with varying expertise to evaluate such a system completely.

In summarizing insights from the reviewed research articles, it is evident that human evaluation remains a common and indispensable element in the evaluation pipeline of chatbot systems, albeit implemented at different stages. Additionally, a diverse selection of metrics is frequently employed to assess various aspects of chatbot responses. Using evaluator LLMs is a promising approach that warrants exploration because of its potential to offer efficient and scalable evaluation. While the current focus is on the evaluation itself, a potentially critical and often overlooked factor is the nature of the data used for testing and evaluation: many papers lack specificity regarding the types of questions posed to chatbots. We propose that incorporating a range of question types, informed by cognitive psychology frameworks such as Bloom's Taxonomy, could significantly enhance the systematic evaluation of chatbot responses and the insights drawn from such an evaluation.

To experiment with the evaluation procedures, we first implement a chatbot (Figure 2). The implementation follows common industry practices such as Retrieval Augmented Generation (RAG) and vector databases. The chatbot, EdTalk, aims to assist users in navigating and comprehending lengthy reports by harnessing the power of LLMs, with the goals of minimal hallucination and strict adherence to factual information from its knowledge base. The purpose of this chatbot is to make educational reports such as the Condition of Education accessible to a wide range of readers. Hence, the chatbot's knowledge base is built from these reports. By evaluating EdTalk, we investigate whether the chatbot aligns with its initial goals. At the same time, we examine whether the chatbot consistently meets these goals across the different question types in Bloom's Taxonomy. Finally, we compare the results from the various evaluation procedures, including automated, human, and LLM-based, to find which is more informative for the development of this chatbot.

Evaluation procedures

We understand that chatbots, like any software, are implemented iteratively, with developers updating the components that make up the chatbot. Each of these components, and the full system, needs to be evaluated for reliability and performance. In this section we describe the evaluation procedures we conducted and briefly explain how they were implemented. We focus only on utterance-based evaluation, meaning that we investigate only procedures built to examine the responses of the chatbot. The performance of other components, such as the semantic search used for retrieval in RAG, is not in scope for this investigation.

To conduct the evaluation we enlist five humans. Initially, one of them, having access to the content to be evaluated, generated 40 questions based on Bloom's Taxonomy [29]. The purpose of adopting Bloom's Taxonomy was to determine the efficacy of the chatbot in responding to different types of questions. This approach adds another dimension to the evaluation process, enabling us to evaluate the quality of the chatbot's responses against different types of questions. It should be noted that the specific questions used in the evaluation depend on the use case of the chatbot implementation and are not disclosed in this article.

Then a pair of humans, hereafter known as annotators, write their own responses to the above questions. Later another pair, hereafter known as evaluators, determine the quality of the responses. Both pairs consist of an expert and a novice. An expert is someone who has been working with these reports for more than 2 years, and a novice is new to the area but has some experience with the content.

Automated evaluation

Selecting an automated evaluation model is a crucial step. We do not select n-gram based methods because of the issues that the literature points out; instead, we use embedding-based methods. In that regard we believe BLEURT [17] to be the best of the available options. These methods can still produce inconsistent results, but because the evaluation is repeatable, it can be used at the rapid development stage to test parameters such as chunk sizes and overlap sizes. BLEURT requires a reference text and a generated text to compute similarity, and [17] suggests using a specific checkpoint to achieve the best comparison (see footnote 1). An example of the reference text (expert response) and the generated text (generated response) is given in Table 2. Evaluating whether the chatbot responses are similar to those of the annotators is straightforward with BLEURT.

Expert response (Table 2): Schools experienced many other challenges, besides staffing-related ones, during their recovery from the coronavirus pandemic. Some of these other issues facing public schools included reported increases in student and teacher absenteeism, student socioemotional and behavioral development, and an increase in the percentage of students seeking mental health services from school, as compared with before the coronavirus pandemic.

Generated response (Table 2): Schools faced many challenges during the pandemic besides staffing-related issues. Some of these challenges included reported increases in student and teacher absenteeism, student socioemotional and behavioral development, and an increase in the percentage of students seeking mental health services from school, as compared with before the coronavirus pandemic. In particular, in April 2022, some 70 percent of public schools reported that the percentage of students who had sought mental health services from school had increased since the start of the coronavirus pandemic. Overall, only 12 percent of schools strongly agreed and 44 percent moderately agreed that their school was able to effectively provide mental health services to all students in need. So, schools faced challenges related to student mental health, absenteeism, and socioemotional and behavioral development during the pandemic.
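The per-question-type averaging behind a table such as Table 4 can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `token_overlap` scorer is a hypothetical stand-in that keeps the example runnable without model weights, and in practice a BLEURT scorer taking a reference and a candidate would be passed in its place. The record layout and the sample texts are likewise invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, Tuple

# One record per question: (bloom_type, reference_text, generated_text).
Record = Tuple[str, str, str]


def token_overlap(reference: str, candidate: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens.

    This is NOT BLEURT; it only keeps the sketch runnable offline.
    """
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)


def mean_score_by_type(
    records: Iterable[Record],
    scorer: Callable[[str, str], float] = token_overlap,
) -> Dict[str, float]:
    """Average the similarity score within each Bloom's Taxonomy type."""
    buckets = defaultdict(list)
    for bloom_type, reference, generated in records:
        buckets[bloom_type].append(scorer(reference, generated))
    return {bloom_type: mean(scores) for bloom_type, scores in buckets.items()}


# Invented sample data, for illustration only.
sample = [
    ("Remember", "schools reported more absenteeism", "schools reported more absenteeism"),
    ("Remember", "mental health requests increased", "requests increased"),
    ("Evaluate", "the policy was effective", "staffing shortages persisted"),
]
scores = mean_score_by_type(sample)
```

Because the scorer is a parameter, the same loop can be rerun unchanged after altering application parameters such as chunk size, which is the repeatability advantage described above.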

Human evaluation

Human evaluation, on the other hand, is more complex. Traditional human evaluation is typically a preferential rating of which response a human prefers. While this is an acceptable measure [13], it may still miss insights from the results. We conduct this traditional preferential evaluation first. The humans do not need to be experts in the domain to conduct this type of evaluation [25].

Then we enlist evaluators to rate the responses of the chatbot to the previously created questions. Ratings are given on several factors [22,13]. We select these factors carefully so that we can effectively evaluate many aspects of the chatbot; many of the selected factors were inspired by [13]. We develop a 5-point Likert scale-based questionnaire with which we collect ratings of the chatbot responses.

Instructions on how to perform the ratings were given to the evaluators beforehand. Table 3 shows the questions an evaluator should ask before rating a response on a criterion. The criteria are set up so that a response containing all the accurate and relevant information, without unnecessary information, in the clearest and most concise manner is rated high. We also take hallucinations into account; together, these cover most quality criteria that a generative AI application should consider. Evaluators were free to refer to the text on which the questions are based, but we did not make the annotator responses available to them. We gave example ratings for a few questions and responses that were not part of the 40 selected above; these included examples for ratings 1, 3, and 5. Evaluators were free to determine how to assign the intermediate ratings.

LLM-based evaluation

LLM-based evaluation is a relatively new procedure, and there is currently limited literature supporting its reliability compared to human evaluation. This study contributes to that literature by comparing human evaluation with LLM-based evaluation. We used the same instructions that were given to the human evaluators to prompt the LLM. In addition, examples for each Likert scale value were provided to ensure that the LLM was aligned with the evaluation criteria; this is the only difference from the human instructions, as the humans did not receive examples for all Likert scale values. The evaluation prompt included the question, the facts retrieved from the content, and the response generated by the chatbot, following the methodology proposed by [23]. The responses were evaluated for one factor at a time, and the generated evaluation responses were processed to extract Likert ratings comparable to those of the human evaluators. The LLM evaluator did not have access to the annotator responses used in the automated evaluation step, but it did have access to the content of the document. This allowed us to compare the LLM-based evaluation with the human evaluation on a similar footing.
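The two mechanical steps described above, assembling a single-factor prompt and extracting a Likert value from the evaluator's reply, might look like the sketch below. The template wording, the `Rating: <number>` reply convention, and both function names are assumptions made for illustration; the prompt actually used for the Correctness criterion is reproduced in Appendix A.2.

```python
import re
from typing import List, Optional


def build_evaluation_prompt(question: str, facts: List[str], answer: str,
                            criterion: str, description: str) -> str:
    """Assemble a prompt for one criterion (hypothetical template)."""
    return (
        f"Facts: {'; '.join(facts)}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Criterion: {criterion}: {description}\n"
        "Rate the answer on a scale of 1 to 5. Reply as 'Rating: <number>'."
    )


def extract_likert(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the integer following 'Rating:' if it lies in [low, high].

    Returns None when no rating is found or the value is out of range,
    so that malformed replies can be retried or discarded by the caller.
    """
    match = re.search(r"rating\s*:\s*(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    value = int(match.group(1))
    return value if low <= value <= high else None


prompt = build_evaluation_prompt(
    "What challenges did schools face during the pandemic?",
    ["absenteeism increased", "more students sought mental health services"],
    "Schools faced absenteeism and mental health challenges.",
    "Correctness",
    "Does the answer provide accurate information based on the facts?",
)
rating = extract_likert("Rating: 4. The answer matches the facts.")
```

Returning `None` rather than a default score keeps unparseable replies from silently entering the medians.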

Results

In this section, the results of all evaluation procedures are compared and contrasted. The purpose is to understand what was learned from each experiment and to identify the advantages and disadvantages of each method. Bloom's Taxonomy is used to make comparisons, but the specific types within the taxonomy are not explained in this work.

Table 4 presents the results of the automated evaluation experiment. As explained in the previous sections, we use BLEURT [17] as the metric to compute the similarity of the generated response to a human-written answer. This evaluation can be conducted rapidly if the human-written responses are readily available: a human needs to write each response only once, and the evaluation can then be rerun whenever the parameters of the application are altered. It is not clear how to compare two BLEURT scores for a similar task when multiple reference texts are used. Upon inspection and comparison of the BLEURT values, we noted that for some question types the expert and novice scores fell into similar ranges. Against both humans, the generated response has low similarity for Evaluate questions. For Apply questions, the similarity to the expert is 0.44 while the similarity to the novice is 0.24. The highest similarities for both humans were reported for Understand questions.

We first conducted traditional human evaluation through preferential rating. This type of evaluation does not require domain experts and is much faster than the other human evaluation methods. Here we find that the chatbot's answers are preferred only 47% of the time on average; Table 5 presents the results broken down by Bloom's Taxonomy type. This measure does not reveal which areas need improvement in order for the chatbot to perform better, which is typically why the community prefers factored human evaluation.
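A preference rate of this kind can be computed per question type and overall as in the sketch below. The judgment format (a question type paired with a flag saying whether the generated response was preferred) and the sample values are assumptions for illustration, not the study's data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def preference_rates(
    judgments: Iterable[Tuple[str, bool]],
) -> Tuple[Dict[str, float], float]:
    """Return (percent preferred per type, percent preferred overall).

    Each judgment is (bloom_type, generated_preferred).
    """
    total: Counter = Counter()
    wins: Counter = Counter()
    for bloom_type, generated_preferred in judgments:
        total[bloom_type] += 1
        wins[bloom_type] += int(generated_preferred)
    if not total:
        raise ValueError("at least one judgment is required")
    per_type = {t: 100.0 * wins[t] / total[t] for t in total}
    overall = 100.0 * sum(wins.values()) / sum(total.values())
    return per_type, overall


# Invented judgments, for illustration only.
per_type, overall = preference_rates([
    ("Remember", True),
    ("Remember", False),
    ("Remember", False),
    ("Understand", True),
])
```

Note that the overall rate pools all questions, so when question types contain different numbers of questions it need not equal the unweighted mean of the per-type percentages.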

Table 7 reports the results of the factored evaluation for both the human and LLM procedures. Since we used Likert scales to capture ratings, we report the results as medians for each factor and question type. The results are visualized in Figure 1, which highlights the notable differences between the novice and the expert in their approaches to response analysis. The graph underscores the importance of recognizing individual variations in cognitive processing and interpretation of information.
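The median aggregation used for a table such as Table 7 can be sketched as follows. The tuple layout and sample ratings are assumptions for illustration; only the use of the median per factor and question type comes from the text.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

# One rating: (bloom_type, factor, Likert score from 1 to 5).
Rating = Tuple[str, str, int]


def median_ratings(ratings: Iterable[Rating]) -> Dict[Tuple[str, str], float]:
    """Median Likert score for every (question type, factor) cell."""
    cells = defaultdict(list)
    for bloom_type, factor, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 Likert range")
        cells[(bloom_type, factor)].append(score)
    return {cell: median(scores) for cell, scores in cells.items()}


# Invented ratings, for illustration only.
medians = median_ratings([
    ("Apply", "Correctness", 3),
    ("Apply", "Correctness", 4),
    ("Remember", "Clarity", 2),
    ("Remember", "Clarity", 2),
    ("Remember", "Clarity", 5),
])
```

With an even number of ratings in a cell, the median is the midpoint of the two middle values, which is how half-point entries such as 3.5 can arise from integer Likert ratings.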

Using the factored human evaluation procedure, we were able to establish experimentally facts about the generative application that had previously been elusive. With the simpler automated and preferential human evaluations, if we do not break the questions down by Bloom's Taxonomy, we obtain only a single measure of whether the chatbot works within the parameters of an acceptable application. This is usually not enough to understand the complex underlying issues of LLMs and whether they are present in the LLM-powered application. RAG systems are built to retrieve information that is available in context. This means that they should perform well when posed with Remember questions, but as the results from the expert show, EdTalk does not perform well on Remember questions (Table 7 and Figure 1). The results also show that the chatbot responses are not consistent enough to draw conclusions about the other question types. This reveals that, while RAG chatbots should be good at answering retrieval-based questions, from a human perspective they sometimes do not work as intended. We also note that the automated evaluation with BLEURT showed similar patterns across question types, but when we take the novice into account, the similarity is no longer present. One advantage of this type of evaluation is that we can check the inter-rater reliability, which we show in Table 6. Here we observe the major issue pointed out by much prior work: humans do not agree in their ratings. By categorizing the ratings by factor, we also notice that human agreement is moderate for Clarity but low for all other factors. One disadvantage is the limited repeatability of the evaluation effort: the same humans may rate these responses differently if we change the order or the framing of the questions in the questionnaire [13,25].
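The text does not state which agreement statistic underlies Table 6. As one common option for two raters, the sketch below computes unweighted Cohen's kappa; this choice is an assumption, and a weighted variant may suit ordinal Likert data better. The sample ratings are invented.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty item list")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented ratings, for illustration only.
expert = [1, 1, 2, 2]
novice = [1, 2, 1, 2]
kappa = cohen_kappa(expert, novice)
```

Computing the statistic separately per factor, as the text describes, only requires calling the function once on each factor's pair of rating lists.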

Discussion

The goal of this work is to illustrate how challenging it is to evaluate an LLM-based application, and in particular a chatbot, with current methodologies including automated, human, and LLM-based procedures. We first demonstrate that all three approaches have advantages and disadvantages. We also note the differences in the results obtained from the three evaluation procedures: there is very little correlation between them, and it would be difficult to recommend any single one. We observed that the expert's ratings are somewhat stricter and generally resulted in lower scores for many factors. The novice viewed the chatbot in a more favorable light, and we notice slightly elevated scores. Using an LLM to evaluate the chatbot responses appears unreliable, as the LLM scores its own responses highly. In our experiment, we used the same LLM (GPT-3.5) both to generate the responses and as the evaluator. This is not the ideal setting: as [24] points out, an LLM that has not been evaluated should be evaluated using an already evaluated LLM or a higher-order LLM. The uncertainty of the evaluations from every procedure should not distract from the need to evaluate. To improve the reliability of evaluation, we suggest increasing the number of humans used in the factored human evaluation. Enlisting humans with a wide range of expertise would also smooth the results; however, this would increase the cost of the evaluation. As [13] suggests, enlisting a larger number of intended users of a chatbot would still not be ideal, as these users may also introduce confusion about what is correct and what is not. Allowing untrained humans to make judgments on the factors will not yield the most accurate results, similar to what we see with the LLM results in Figure 1.
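One way to quantify how well two evaluation procedures agree is a rank correlation over their per-cell scores. The sketch below implements Spearman's coefficient with average ranks for ties; the choice of Spearman is an assumption for illustration (the text does not name a correlation measure), and the sample values are invented.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented scores from two hypothetical evaluators, for illustration only.
rho = spearman([1, 2, 3, 4], [2, 4, 6, 8])
```

Average ranks matter here because Likert medians contain many ties; an evaluator that assigns the same score to every cell yields a constant vector, for which the function raises rather than returning a misleading value.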

Deciding factors include repeatability and the funds available for evaluating a chatbot. In this regard we note that while automated procedures are repeatable, the low reliability of these metrics makes a case against them. Human evaluation is considered the gold standard; while that may be true, research indicates that human disagreement is a significant issue, and we also see this in Table 6. LLM evaluators are a novel application of LLMs, and their greatest weakness at present is the lack of research supporting their reliability. We observe that in some cases LLM evaluators give ratings similar to those of human evaluators. This is not always the case, however: in most instances LLM evaluators tend to be overly confident that the response is correct. We cannot dismiss the promise of LLM evaluators, as we can assign them various personas and rapidly obtain multiple versions of an evaluation [21], but it must also be explored whether a person with the corresponding expertise would rate the same response in a similar way. Further research is needed to understand how LLMs can help us evaluate LLMs.

A. Prompts

This section lists the prompts used in this work. We first give the prompt used in the RAG process of the chatbot, for clarity, and then a sample prompt that was used with the LLM evaluator.

A.1. RAG Prompt

The user asks the question <question>. Here are some facts that could be used to support the question, <facts delimited by semicolons>.

You must first investigate if it is possible to support an answer with the available facts If you do not have facts to support an answer, step by step explaining your reasoning behind each action you must come up with a answer by processing, applying and evaluating facts as needed. Otherwise you must only respond with "I dont know" and do not output anything else.
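Filling the two placeholders in the first sentence of the RAG prompt can be sketched as below. The function name, the template constant, and the clean-up of stray semicolons are assumptions for illustration; only the placeholder structure and the semicolon delimiter come from the prompt above, and the instruction paragraph that follows it is omitted here for brevity.

```python
from typing import Sequence

# First sentence pair of the RAG prompt, with the two placeholders
# expressed as named format fields (hypothetical representation).
RAG_TEMPLATE = (
    "The user asks the question {question}. "
    "Here are some facts that could be used to support the question, {facts}."
)


def build_rag_prompt(question: str, facts: Sequence[str]) -> str:
    """Insert the question and the semicolon-delimited facts."""
    # Drop empty facts and trailing semicolons so the delimiter stays unambiguous.
    cleaned = [fact.strip().rstrip(";") for fact in facts if fact.strip()]
    return RAG_TEMPLATE.format(question=question.strip(), facts="; ".join(cleaned))


rag_prompt = build_rag_prompt("What changed?", ["A rose", " B fell; ", "  "])
```

In a RAG pipeline the `facts` argument would be the chunks returned by the retrieval step for the user's question.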

A.2. LLM Evaluator Prompt

Here we give only the prompt used with the "Correctness" criterion; similar prompts can be written for the other criteria.

You are an expert education researcher. You are given a set of facts, a question that relates to the text of these facts and an answer for the given question. Your task is to evaluate if the answer is a good answer to the given question based off of a criterion and also considering the facts.
Evaluation steps:
1. Read the facts: Start by carefully reading the facts provided. Understand the context, main points, and any relevant details.
2. Analyze the Question: Examine the question that relates to the facts. Ensure you have a clear understanding of what the question is asking for.
3. Review the Answer: Carefully read the answer provided and assess it based only on the following criterion: Correctness: Does the answer provide accurate information based on the paragraph text?

Figure 1: Median of Likert scale ratings of each evaluator. Each spoke shows how an evaluator rated a response based on the question type from Bloom's Taxonomy.

Figure 2: Screen capture of the EdTalk chatbot answering a question.

Table 2: Scenario from the Condition of Education report 2023: the example question, the annotator expert response, and the generated response. Similar response pairs are used in the BLEURT evaluation.
Question: What challenges did schools face during the pandemic?

Table 3: Criteria for the Likert scale questionnaire
Criterion         Description
Relevance         Are the facts presented required by the question?
Informativeness   Are all the facts called for by the question presented by the response?
Correctness       How correct is the generated response?
Clarity           Does the question call for a certain formatting of the answer, or is the response brief or verbose?
Hallucination     Is the answer a hallucinated reference, information, etc.?

Table 4: Automated evaluation results; each generated answer is compared against a human (Expert or Novice) response and the BLEURT score is reported.
Type         Expert   Novice
Remember     0.45     0.40
Understand   0.61     0.55
Apply        0.44     0.24
Analyze      0.47     0.41
Evaluate     0.22     0.31

Table 5: Percentage of preference for the generated response in the preferential rating evaluation
Type         Generated response preference
Remember     31%
Understand   100%
Apply        0%
Analyze      57%
Evaluate     33%

Table 7: Factored evaluation results; median across question type. Higher is better.
Evaluator   Type         Correctness   Informativeness   Relevance   Clarity   Hallucinations
Expert      Remember     2             2                 3           2         3
Expert      Understand   5             4                 4           2         3
Expert      Apply        3.5           3.5               3           3         2
Expert      Analyze      4             4                 4           4         5
Expert      Evaluate     2             3                 3           4         1
Novice      Remember     5             4                 3           4         3
Novice      Understand   3             3                 2           2         2
Novice      Apply        4             2.5               3.5         2.5       2
Novice      Analyze      4             4                 5           4         5
Novice      Evaluate     4             4                 4           4         4
LLM         Remember     4             3                 4           5         5
LLM         Understand   4             2                 4           5         5
LLM         Apply        5             5                 4.5         5         4
LLM         Analyze      5             5                 5           5         5
LLM         Evaluate     4             3                 4           5         5

1 https://github.com/google-research/bleurt?tab=readme-ov-file#checkpoints

Acknowledgments

We thank Abhinav Cheruvu for helping with the implementation of the chatbot, and Tabitha Tezil, Erika Kessler and Jijun Zhang for helping with the human evaluation.

References

[1] B. Srivastava, K. Lakkaraju, T. Koppel, V. Narayanan, A. Kundu, S. Joshi, Evaluating Chatbots to Promote Users' Trust - Practices and Open Problems, arXiv, 2023.
[2] I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, N. K. Ahmed, Bias and Fairness in Large Language Models: A Survey, arXiv:2309.00770, 2023. doi:10.48550/arXiv.2309.00770.
[3] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, arXiv:2311.05232, 2023. doi:10.48550/arXiv.2311.05232.
[4] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of Hallucination in Natural Language Generation, ACM Computing Surveys 55 (2023). doi:10.1145/3571730.
[5] J. Kaddour, J. Harris, M. Mozes, H. Bradley, R. Raileanu, R. McHardy, Challenges and Applications of Large Language Models, arXiv:2307.10169, 2023. doi:10.48550/arXiv.2307.10169.
[6] Z. Guo, R. Jin, C. Liu, Y. Huang, D. Shi, Supryadi, L. Yu, Y. Liu, J. Li, B. Xiong, D. Xiong, Evaluating Large Language Models: A Comprehensive Survey, arXiv:2310.19736, 2023. doi:10.48550/arXiv.2310.19736.
[7] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, Y. Koreeda, Holistic Evaluation of Language Models, arXiv:2211.09110, 2023. doi:10.48550/arXiv.2211.09110.
[8] H. Nori, Y. T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y. Li, W. Liu, R. Luo, S. M. McKinney, R. O. Ness, H. Poon, T. Qin, N. Usuyama, C. White, E. Horvitz, Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine, arXiv:2311.16452, 2023. doi:10.48550/arXiv.2311.16452.
[9] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, Retrieval-Augmented Generation for Large Language Models: A Survey, arXiv:2312.10997, 2023. doi:10.48550/arXiv.2312.10997.
[10] A. Abd-Alrazaq, Z. Safi, M. Alajlani, J. Warren, M. Househ, K. Denecke, Technical Metrics Used to Evaluate Health Care Chatbots: Scoping Review, Journal of Medical Internet Research 22 (2020) e18301. doi:10.2196/18301.
[11] Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality, LMSYS Org, 2023.
[12] D. Card, P. Henderson, U. Khandelwal, R. Jia, K. Mahowald, D. Jurafsky, With Little Power Comes Great Responsibility, arXiv, 2020.
[13] C. van der Lee, A. Gatt, E. van Miltenburg, S. Wubben, E. Krahmer, Best practices for the human evaluation of automatically generated text, in: Proceedings of the 12th International Conference on Natural Language Generation, 2019.
[14] S. Banerjee, A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005.
[15] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of Summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004.
[16] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a Method for Automatic Evaluation of Machine Translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002. doi:10.3115/1073083.1073135.
[17] T. Sellam, D. Das, A. P. Parikh, BLEURT: Learning Robust Metrics for Text Generation, arXiv:2004.04696, 2020. doi:10.48550/arXiv.2004.04696.
[18] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating Text Generation with BERT, arXiv:1904.09675, 2020. doi:10.48550/arXiv.1904.09675.
[19] S. E. Finch, J. D. Finch, J. D. Choi, Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems, arXiv, 2023.
[20] C. van der Lee, A. Gatt, E. van Miltenburg, E. Krahmer, Human evaluation of automatically generated text: Current trends and best practice guidelines, Computer Speech & Language 67 (2021) 101151. doi:10.1016/j.csl.2020.101151.
[21] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, Z. Liu, ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate, arXiv, 2023.
[22] J. Fu, S.-K. Ng, Z. Jiang, P. Liu, GPTScore: Evaluate as You Desire, arXiv:2302.04166, 2023.
[23] J. Saad-Falcon, O. Khattab, C. Potts, M. Zaharia, ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems, arXiv:2311.09476, 2024.
[24] E. Svikhnushina, P. Pu, Approximating Online Human Evaluation of Social Chatbots with Prompting, in: S. Stoyanchev, S. Joty, D. Schlangen, O. Dusek, C. Kennington, M. Alikhani (Eds.), Proceedings of the 24th Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, Prague, Czechia, 2023.
[25] E. Clark, T. August, S. Serrano, N. Haduong, S. Gururangan, N. A. Smith, All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text, arXiv, 2021.
[26] D. M. Howcroft, V. Rieser, What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online, 2021. doi:10.18653/v1/2021.emnlp-main.703.

Punta Cana, Dominican Republic

2021 This is a Problem, Don't You Agree?" Framing and Bias in Human Evaluation for Natural Language Generation SSchoch DYang YJi Proceedings of the 1st Workshop on Evaluating NLG Evaluation, Association for Computational Linguistics SAgarwal ODušek SGehrmann DGkatzia IKonstas EVan Miltenburg SSanthanam the 1st Workshop on Evaluating NLG Evaluation, Association for Computational Linguistics

Dublin, Ireland

2020 Algorithm Inspection for Chatbot Performance Evaluation VVijayaraghavan JBCooper RL J 10.1016/j.procs.2020.04.245 Procedia Computer Science 171 2020 Bloom's Taxonomy PArmstrong 2010